From: Hans Reiser

This is the main reiserfs4 filesystem.

Q&A wrt this patch:

- A really short guide to how to get up and running with this filesystem.

  Reiser4 is a file system based on dancing tree algorithms, and is described at http://www.namesys.com. One should be able to get it up and running just like any of the other filesystems supported by Linux. Configure it to be compiled either builtin or as a module. Create a reiser4 filesystem with mkfs.reiser4, mount it and use it. More detailed info can be found at http://thebsh.namesys.com/snapshots/LATEST/READ.ME.

- The direct URL which people use to obtain the mkfs tool for this filesystem. Also fsck and anything else.

  Reiser4 userland tools can be obtained at ftp://ftp.namesys.com/pub/reiser4progs. ftp://ftp.namesys.com/pub/reiser4progs/README contains detailed instructions on how to compile and install these tools. Also, all reiser4 progs have man pages.

- Any known shortcomings, caveats, etc.

  Reiser4 has so far been tested only on i386.

  Quota support is not ready yet. It should be ready soon.

  Reiser4 was tested extensively, and we got to where the mailing list was not able to hit any bugs, but then we told people that, got an order of magnitude increase in users, and they are able to hit bugs that we are working on now. Reiser's Law of Software Engineering: Each order of magnitude increase in users finds more bugs, in a quantity equal to the previous order of magnitude increase in users. Success for software developers is measured by how long the frustration lasts.

  Only the very core functionality is working. Exotic plugins, an API for multiple operation transactions and accessing multiple small files in one syscall, compression, and inheritance have all been postponed until after the core functionality is shipped. The compression plugin needs a code review before anyone should use it.

- A statement on compatibility with reiserfs3 filesystems.

  To upgrade from reiserfs V3 to V4, use tar, or sponsor us to write a convertfs.

- Bear in mind that people will immediately benchmark this filesystem, and first impressions count. Now is your chance to communicate any tuning guidelines, mount options or whatever which you'd like people to understand BEFORE they start publishing benchmark info.

  Reiser4 is not tuned for fsync/sync/O_SYNC performance yet.

  If you see results that are much different from those at www.namesys.com/benchmarks.html, let us know. If you see performance characteristics that don't quite make sense, email reiserfs-list@namesys.com; such things are always of interest.

  Reiser4 is not tuned for mmapping and dirtying more than physical ram, as IOzone does. This is quite different in its code path from writing and dirtying more than physical ram. There are those who think that what IOzone does is rarely done by real programs, and therefore we should not bother to optimize what it does. All I know is, this month we are not optimized for it.

  Please also consider its space savings when you benchmark it.

Signed-off-by: Andrew Morton
---
fs/reiser4/Kconfig | 90 fs/reiser4/Makefile | 96 fs/reiser4/README | 125 fs/reiser4/as_ops.c | 656 ++++ fs/reiser4/block_alloc.c | 1128 +++++++ fs/reiser4/block_alloc.h | 175 + fs/reiser4/blocknrset.c | 365 ++ fs/reiser4/carry.c | 1429 ++++++++ fs/reiser4/carry.h | 418 ++ fs/reiser4/carry_ops.c | 2107 +++++++++++++ fs/reiser4/carry_ops.h | 41 fs/reiser4/cluster.c | 71 fs/reiser4/cluster.h | 289 + fs/reiser4/context.c | 303 + fs/reiser4/context.h | 282 + fs/reiser4/coord.c | 959 +++++ fs/reiser4/coord.h | 337 ++ fs/reiser4/crypt.c | 92 fs/reiser4/debug.c | 447 ++ fs/reiser4/debug.h | 353 ++ fs/reiser4/dformat.h | 164 + fs/reiser4/dscale.c | 173 + fs/reiser4/dscale.h | 27 fs/reiser4/emergency_flush.c | 913 +++++ fs/reiser4/emergency_flush.h | 75 fs/reiser4/entd.c | 375 ++ fs/reiser4/entd.h | 83 fs/reiser4/eottl.c | 373 ++ fs/reiser4/estimate.c | 101 fs/reiser4/file_ops.c | 421 ++ fs/reiser4/flush.c | 3499 +++++++++++++++++++++ fs/reiser4/flush.h | 283 + 
fs/reiser4/flush_queue.c | 753 ++++ fs/reiser4/forward.h | 258 + fs/reiser4/init_super.c | 526 +++ fs/reiser4/init_super.h | 4 fs/reiser4/inode.c | 771 ++++ fs/reiser4/inode.h | 424 ++ fs/reiser4/inode_ops.c | 612 +++ fs/reiser4/ioctl.h | 41 fs/reiser4/jnode.c | 2035 ++++++++++++ fs/reiser4/jnode.h | 772 ++++ fs/reiser4/kassign.c | 738 ++++ fs/reiser4/kassign.h | 97 fs/reiser4/kcond.c | 283 + fs/reiser4/kcond.h | 56 fs/reiser4/key.c | 168 + fs/reiser4/key.h | 389 ++ fs/reiser4/ktxnmgrd.c | 274 + fs/reiser4/ktxnmgrd.h | 63 fs/reiser4/lib.h | 75 fs/reiser4/lock.c | 1402 ++++++++ fs/reiser4/lock.h | 251 + fs/reiser4/oid.c | 166 + fs/reiser4/page_cache.c | 779 ++++ fs/reiser4/page_cache.h | 62 fs/reiser4/plugin/compress/compress.c | 429 ++ fs/reiser4/plugin/compress/compress.h | 36 fs/reiser4/plugin/compress/lzoconf.h | 433 ++ fs/reiser4/plugin/compress/minilzo.c | 2388 ++++++++++++++ fs/reiser4/plugin/compress/minilzo.h | 100 fs/reiser4/plugin/cryptcompress.c | 3459 +++++++++++++++++++++ fs/reiser4/plugin/cryptcompress.h | 518 +++ fs/reiser4/plugin/digest.c | 32 fs/reiser4/plugin/dir/dir.c | 1885 +++++++++++ fs/reiser4/plugin/dir/dir.h | 106 fs/reiser4/plugin/dir/hashed_dir.c | 1459 +++++++++ fs/reiser4/plugin/dir/hashed_dir.h | 46 fs/reiser4/plugin/dir/pseudo_dir.c | 97 fs/reiser4/plugin/dir/pseudo_dir.h | 29 fs/reiser4/plugin/disk_format/disk_format.c | 38 fs/reiser4/plugin/disk_format/disk_format.h | 41 fs/reiser4/plugin/disk_format/disk_format40.c | 556 +++ fs/reiser4/plugin/disk_format/disk_format40.h | 100 fs/reiser4/plugin/fibration.c | 173 + fs/reiser4/plugin/fibration.h | 37 fs/reiser4/plugin/file/file.c | 2740 +++++++++++++++++ fs/reiser4/plugin/file/file.h | 152 fs/reiser4/plugin/file/funcs.h | 25 fs/reiser4/plugin/file/invert.c | 511 +++ fs/reiser4/plugin/file/pseudo.c | 180 + fs/reiser4/plugin/file/pseudo.h | 39 fs/reiser4/plugin/file/symfile.c | 98 fs/reiser4/plugin/file/tail_conversion.c | 720 ++++ fs/reiser4/plugin/hash.c | 346 ++ 
fs/reiser4/plugin/item/acl.h | 64 fs/reiser4/plugin/item/blackbox.c | 142 fs/reiser4/plugin/item/blackbox.h | 33 fs/reiser4/plugin/item/cde.c | 1070 ++++++ fs/reiser4/plugin/item/cde.h | 78 fs/reiser4/plugin/item/ctail.c | 1627 ++++++++++ fs/reiser4/plugin/item/ctail.h | 82 fs/reiser4/plugin/item/extent.c | 181 + fs/reiser4/plugin/item/extent.h | 173 + fs/reiser4/plugin/item/extent_file_ops.c | 1542 +++++++++ fs/reiser4/plugin/item/extent_flush_ops.c | 1003 ++++++ fs/reiser4/plugin/item/extent_item_ops.c | 791 ++++ fs/reiser4/plugin/item/internal.c | 398 ++ fs/reiser4/plugin/item/internal.h | 51 fs/reiser4/plugin/item/item.c | 760 ++++ fs/reiser4/plugin/item/item.h | 387 ++ fs/reiser4/plugin/item/sde.c | 216 + fs/reiser4/plugin/item/sde.h | 64 fs/reiser4/plugin/item/static_stat.c | 1319 ++++++++ fs/reiser4/plugin/item/static_stat.h | 220 + fs/reiser4/plugin/item/tail.c | 682 ++++ fs/reiser4/plugin/item/tail.h | 54 fs/reiser4/plugin/node/node.c | 127 fs/reiser4/plugin/node/node.h | 266 + fs/reiser4/plugin/node/node40.c | 2783 +++++++++++++++++ fs/reiser4/plugin/node/node40.h | 117 fs/reiser4/plugin/object.c | 1640 ++++++++++ fs/reiser4/plugin/object.h | 42 fs/reiser4/plugin/plugin.c | 623 +++ fs/reiser4/plugin/plugin.h | 832 +++++ fs/reiser4/plugin/plugin_header.h | 136 fs/reiser4/plugin/plugin_set.c | 347 ++ fs/reiser4/plugin/plugin_set.h | 77 fs/reiser4/plugin/pseudo/pseudo.c | 1801 +++++++++++ fs/reiser4/plugin/pseudo/pseudo.h | 176 + fs/reiser4/plugin/security/perm.c | 91 fs/reiser4/plugin/security/perm.h | 88 fs/reiser4/plugin/space/bitmap.c | 1646 ++++++++++ fs/reiser4/plugin/space/bitmap.h | 41 fs/reiser4/plugin/space/space_allocator.h | 80 fs/reiser4/plugin/symlink.c | 85 fs/reiser4/plugin/symlink.h | 24 fs/reiser4/plugin/tail_policy.c | 109 fs/reiser4/pool.c | 226 + fs/reiser4/pool.h | 69 fs/reiser4/readahead.c | 378 ++ fs/reiser4/readahead.h | 50 fs/reiser4/reiser4.h | 277 + fs/reiser4/safe_link.c | 350 ++ fs/reiser4/safe_link.h | 29 fs/reiser4/seal.c | 
238 + fs/reiser4/seal.h | 51 fs/reiser4/search.c | 1633 ++++++++++ fs/reiser4/spin_macros.h | 477 ++ fs/reiser4/status_flags.c | 195 + fs/reiser4/status_flags.h | 43 fs/reiser4/super.c | 480 ++ fs/reiser4/super.h | 538 +++ fs/reiser4/tap.c | 389 ++ fs/reiser4/tap.h | 73 fs/reiser4/tree.c | 1825 +++++++++++ fs/reiser4/tree.h | 551 +++ fs/reiser4/tree_mod.c | 364 ++ fs/reiser4/tree_mod.h | 29 fs/reiser4/tree_walk.c | 1232 +++++++ fs/reiser4/tree_walk.h | 117 fs/reiser4/txnmgr.c | 4172 ++++++++++++++++++++++++++ fs/reiser4/txnmgr.h | 646 ++++ fs/reiser4/type_safe_hash.h | 320 + fs/reiser4/type_safe_list.h | 436 ++ fs/reiser4/vfs_ops.c | 1640 ++++++++++ fs/reiser4/vfs_ops.h | 139 fs/reiser4/wander.c | 2185 +++++++++++++ fs/reiser4/wander.h | 135 fs/reiser4/writeout.h | 21 fs/reiser4/znode.c | 1141 +++++++ fs/reiser4/znode.h | 451 ++ 162 files changed, 86990 insertions(+) diff -puN /dev/null fs/reiser4/as_ops.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/as_ops.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,656 @@ +/* Copyright 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Interface to VFS. Reiser4 address_space_operations are defined here. 
*/ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "coord.h" +#include "plugin/item/item.h" +#include "plugin/file/file.h" +#include "plugin/security/perm.h" +#include "plugin/disk_format/disk_format.h" +#include "plugin/plugin.h" +#include "plugin/plugin_set.h" +#include "plugin/object.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree.h" +#include "vfs_ops.h" +#include "inode.h" +#include "page_cache.h" +#include "ktxnmgrd.h" +#include "super.h" +#include "reiser4.h" +#include "entd.h" +#include "emergency_flush.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* address space operations */ + +static int reiser4_readpage(struct file *, struct page *); + +static int reiser4_prepare_write(struct file *, + struct page *, unsigned, unsigned); + +static int reiser4_commit_write(struct file *, + struct page *, unsigned, unsigned); + +static int reiser4_set_page_dirty (struct page *); +static sector_t reiser4_bmap(struct address_space *, sector_t); +/* static int reiser4_direct_IO(int, struct inode *, + struct kiobuf *, unsigned long, int); */ + +/* address space operations */ + +/* clear PAGECACHE_TAG_DIRTY tag of a page. This is used in uncapture_page. This resembles test_clear_page_dirty. 
The + only difference is that page's mapping exists and REISER4_MOVED tag is checked */ +reiser4_internal void +reiser4_clear_page_dirty(struct page *page) +{ + struct address_space *mapping; + unsigned long flags; + + mapping = page->mapping; + BUG_ON(mapping == NULL); + + read_lock_irqsave(&mapping->tree_lock, flags); + if (TestClearPageDirty(page)) { + read_unlock_irqrestore(&mapping->tree_lock, flags); + if (mapping_cap_account_dirty(mapping)) + dec_page_state(nr_dirty); + return; + } + read_unlock_irqrestore(&mapping->tree_lock, flags); +} + +/* as_ops->set_page_dirty() VFS method in reiser4_address_space_operations. + + It is used by others (except reiser4) to set reiser4 pages dirty. Reiser4 + itself uses set_page_dirty_internal(). + + The difference is that reiser4_set_page_dirty sets MOVED tag on the page and clears DIRTY tag. Pages tagged as MOVED + get processed by reiser4_writepages() to do reiser4 specific work over dirty pages (allocation jnode, capturing, atom + creation) which cannot be done in the contexts where reiser4_set_page_dirty is called. 
+ set_page_dirty_internal sets DIRTY tag and clear MOVED +*/ +static int reiser4_set_page_dirty(struct page *page /* page to mark dirty */) +{ + /* this page can be unformatted only */ + assert("vs-1734", (page->mapping && + page->mapping->host && + get_super_fake(page->mapping->host->i_sb) != page->mapping->host && + get_cc_fake(page->mapping->host->i_sb) != page->mapping->host && + get_super_private(page->mapping->host->i_sb)->bitmap != page->mapping->host)); + + if (!TestSetPageDirty(page)) { + struct address_space *mapping = page->mapping; + + if (mapping) { + read_lock_irq(&mapping->tree_lock); + /* check for race with truncate */ + if (page->mapping) { + assert("vs-1652", page->mapping == mapping); + if (mapping_cap_account_dirty(mapping)) + inc_page_state(nr_dirty); + radix_tree_tag_set(&mapping->page_tree, + page->index, PAGECACHE_TAG_REISER4_MOVED); + } + read_unlock_irq(&mapping->tree_lock); + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + } + } + return 0; +} + +/* ->readpage() VFS method in reiser4 address_space_operations + method serving file mmapping +*/ +static int +reiser4_readpage(struct file *f /* file to read from */ , + struct page *page /* page where to read data + * into */ ) +{ + struct inode *inode; + file_plugin *fplug; + int result; + reiser4_context ctx; + + /* + * basically calls ->readpage method of object plugin and handles + * errors. 
+ */ + + assert("umka-078", f != NULL); + assert("umka-079", page != NULL); + assert("nikita-2280", PageLocked(page)); + assert("vs-976", !PageUptodate(page)); + + assert("vs-318", page->mapping && page->mapping->host); + assert("nikita-1352", (f == NULL) || (f->f_dentry->d_inode == page->mapping->host)); + + /* ->readpage can be called from page fault service routine */ + assert("nikita-3174", schedulable()); + + inode = page->mapping->host; + init_context(&ctx, inode->i_sb); + fplug = inode_file_plugin(inode); + if (fplug->readpage != NULL) + result = fplug->readpage(f, page); + else + result = RETERR(-EINVAL); + if (result != 0) { + SetPageError(page); + unlock_page(page); + } + + reiser4_exit_context(&ctx); + return 0; +} + +static int filler(void *vp, struct page *page) +{ + return reiser4_readpage(vp, page); +} + +/* ->readpages() VFS method in reiser4 address_space_operations + method serving page cache readahead + + if readpages hook is set in file data - it is called + otherwise read_cache_pages is used +*/ +static int +reiser4_readpages(struct file *file, struct address_space *mapping, + struct list_head *pages, unsigned nr_pages) +{ + reiser4_context ctx; + reiser4_file_fsdata *fsdata; + + init_context(&ctx, mapping->host->i_sb); + fsdata = reiser4_get_file_fsdata(file); + if (IS_ERR(fsdata)) { + reiser4_exit_context(&ctx); + return PTR_ERR(fsdata); + } + + if (fsdata->ra2.readpages) + fsdata->ra2.readpages(mapping, pages, fsdata->ra2.data); + else { + assert("vs-1738", lock_stack_isclean(get_current_lock_stack())); + read_cache_pages(mapping, pages, filler, file); + } + reiser4_exit_context(&ctx); + return 0; +} + +/* prepares @page to be written. This means that if we want to modify only some + part of page, page should be read first and then modified. Actually this function + is almost the same as reiser4_readpage(). The only difference is that it does not + unlock the page in the case of error. 
This is needed because loop back device + driver expects it locked. */ +static int reiser4_prepare_write(struct file *file, struct page *page, + unsigned from, unsigned to) +{ + int result; + file_plugin * fplug; + struct inode * inode; + reiser4_context ctx; + + inode = page->mapping->host; + init_context(&ctx, inode->i_sb); + fplug = inode_file_plugin(inode); + + if (fplug->prepare_write != NULL) + result = fplug->prepare_write(file, page, from, to); + else + result = RETERR(-EINVAL); + + /* don't commit transaction under inode semaphore */ + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); + + return result; +} + +/* captures jnode of @page to current atom. */ +static int reiser4_commit_write(struct file *file, struct page *page, + unsigned from, unsigned to) +{ + int result; + file_plugin *fplug; + struct inode *inode; + reiser4_context ctx; + + assert("umka-3101", file != NULL); + assert("umka-3102", page != NULL); + assert("umka-3093", PageLocked(page)); + + SetPageUptodate(page); + + inode = page->mapping->host; + init_context(&ctx, inode->i_sb); + fplug = inode_file_plugin(inode); + + if (fplug->capturepage) + result = fplug->capturepage(page); + else + result = RETERR(-EINVAL); + + /* here page is return locked. 
*/ + assert("umka-3103", PageLocked(page)); + + /* don't commit transaction under inode semaphore */ + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); + return result; +} + +/* ->writepages() + ->vm_writeback() + ->set_page_dirty() + ->prepare_write() + ->commit_write() +*/ + +/* ->bmap() VFS method in reiser4 address_space_operations */ +reiser4_internal int +reiser4_lblock_to_blocknr(struct address_space *mapping, + sector_t lblock, reiser4_block_nr *blocknr) +{ + file_plugin *fplug; + int result; + reiser4_context ctx; + + init_context(&ctx, mapping->host->i_sb); + + fplug = inode_file_plugin(mapping->host); + if (fplug && fplug->get_block) { + *blocknr = generic_block_bmap(mapping, lblock, fplug->get_block); + result = 0; + } else + result = RETERR(-EINVAL); + reiser4_exit_context(&ctx); + return result; +} + +/* ->bmap() VFS method in reiser4 address_space_operations */ +static sector_t +reiser4_bmap(struct address_space *mapping, sector_t lblock) +{ + reiser4_block_nr blocknr; + int result; + + result = reiser4_lblock_to_blocknr(mapping, lblock, &blocknr); + if (result == 0) + if (sizeof blocknr == sizeof(sector_t) || + !blocknr_is_fake(&blocknr)) + return blocknr; + else + return 0; + else + return result; +} + +/* ->invalidatepage method for reiser4 */ + +/* + * this is called for each truncated page from + * truncate_inode_pages()->truncate_{complete,partial}_page(). + * + * At the moment of call, page is under lock, and outstanding io (if any) has + * completed. + */ + +reiser4_internal int +reiser4_invalidatepage(struct page *page /* page to invalidate */, + unsigned long offset /* starting offset for partial + * invalidation */) +{ + int ret = 0; + reiser4_context ctx; + struct inode *inode; + + /* + * This is called to truncate file's page. + * + * Originally, reiser4 implemented truncate in a standard way + * (vmtruncate() calls ->invalidatepage() on all truncated pages + * first, then file system ->truncate() call-back is invoked). 
+ * + * This lead to the problem when ->invalidatepage() was called on a + * page with jnode that was captured into atom in ASTAGE_PRE_COMMIT + * process. That is, truncate was bypassing transactions. To avoid + * this, try_capture_page_to_invalidate() call was added here. + * + * After many troubles with vmtruncate() based truncate (including + * races with flush, tail conversion, etc.) it was re-written in the + * top-to-bottom style: items are killed in cut_tree_object() and + * pages belonging to extent are invalidated in kill_hook_extent(). So + * probably now additional call to capture is not needed here. + * + */ + + assert("nikita-3137", PageLocked(page)); + assert("nikita-3138", !PageWriteback(page)); + inode = page->mapping->host; + + /* + * ->invalidatepage() should only be called for the unformatted + * jnodes. Destruction of all other types of jnodes is performed + * separately. But, during some corner cases (like handling errors + * during mount) it is simpler to let ->invalidatepage to be called on + * them. Check for this, and do nothing. + */ + if (get_super_fake(inode->i_sb) == inode) + return 0; + if (get_cc_fake(inode->i_sb) == inode) + return 0; + if (get_super_private(inode->i_sb)->bitmap == inode) + return 0; + + assert("vs-1426", PagePrivate(page)); + assert("vs-1427", page->mapping == jnode_get_mapping(jnode_by_page(page))); + + init_context(&ctx, inode->i_sb); + /* capture page being truncated. */ + ret = try_capture_page_to_invalidate(page); + if (ret != 0) { + warning("nikita-3141", "Cannot capture: %i", ret); + print_page("page", page); + } + + if (offset == 0) { + jnode *node; + + /* remove jnode from transaction and detach it from page. 
*/ + node = jnode_by_page(page); + if (node != NULL) { + assert("vs-1435", !JF_ISSET(node, JNODE_CC)); + jref(node); + JF_SET(node, JNODE_HEARD_BANSHEE); + /* page cannot be detached from jnode concurrently, + * because it is locked */ + uncapture_page(page); + + /* this detaches page from jnode, so that jdelete will not try to lock page which is already locked */ + UNDER_SPIN_VOID(jnode, + node, + page_clear_jnode(page, node)); + unhash_unformatted_jnode(node); + + jput(node); + } + } + reiser4_exit_context(&ctx); + return ret; +} + +#define INC_STAT(page, node, counter) \ + reiser4_stat_inc_at(page->mapping->host->i_sb, \ + level[jnode_get_level(node)].counter); + +#define INC_NSTAT(node, counter) INC_STAT(jnode_page(node), node, counter) + +int is_cced(const jnode *node); + +/* helper function called from reiser4_releasepage(). It returns true if jnode + * can be detached from its page and page released. */ +static int +releasable(const jnode *node /* node to check */) +{ + assert("nikita-2781", node != NULL); + assert("nikita-2783", spin_jnode_is_locked(node)); + + /* if some thread is currently using jnode page, the latter cannot be + * detached */ + if (atomic_read(&node->d_count) != 0) { + return 0; + } + + assert("vs-1214", !jnode_is_loaded(node)); + + /* this jnode is just a copy. Its page cannot be released, because + * otherwise next jload() would load obsolete data from disk + * (up-to-date version may still be in memory). */ + if (is_cced(node)) { + return 0; + } + + /* emergency flushed page can be released. This is what emergency + * flush is all about after all. */ + if (JF_ISSET(node, JNODE_EFLUSH)) { + return 1; /* yeah! */ + } + + /* can only release page if real block number is assigned to + it. Simple check for ->atom wouldn't do, because it is possible for + node to be clean, not in atom yet, and still having fake block + number. For example, node just created in jinit_new(). 
*/ + if (blocknr_is_fake(jnode_get_block(node))) { + return 0; + } + /* dirty jnode cannot be released. It can however be submitted to disk + * as part of early flushing, but only after getting flush-prepped. */ + if (jnode_is_dirty(node)) { + return 0; + } + /* overwrite set is only written by log writer. */ + if (JF_ISSET(node, JNODE_OVRWR)) { + return 0; + } + /* jnode is already under writeback */ + if (JF_ISSET(node, JNODE_WRITEBACK)) { + return 0; + } +#if 0 + /* page was modified through mmap, but its jnode is not yet + * captured. Don't discard modified data. */ + if (jnode_is_unformatted(node) && JF_ISSET(node, JNODE_KEEPME)) { + return 0; + } +#endif + BUG_ON(JF_ISSET(node, JNODE_KEEPME)); + /* don't flush bitmaps or journal records */ + if (!jnode_is_znode(node) && !jnode_is_unformatted(node)) { + return 0; + } + return 1; +} + +#if REISER4_DEBUG +int jnode_is_releasable(jnode *node) +{ + return UNDER_SPIN(jload, node, releasable(node)); +} +#endif + +/* + * ->releasepage method for reiser4 + * + * This is called by VM scanner when it comes across clean page. What we have + * to do here is to check whether page can really be released (freed that is) + * and if so, detach jnode from it and remove page from the page cache. + * + * Check for releasability is done by releasable() function. + */ +reiser4_internal int +reiser4_releasepage(struct page *page, int gfp UNUSED_ARG) +{ + jnode *node; + void *oid; + + assert("nikita-2257", PagePrivate(page)); + assert("nikita-2259", PageLocked(page)); + assert("nikita-2892", !PageWriteback(page)); + assert("nikita-3019", schedulable()); + + /* NOTE-NIKITA: this can be called in the context of reiser4 call. It + is not clear what to do in this case. A lot of deadlocks seems be + possible. 
*/ + + node = jnode_by_page(page); + assert("nikita-2258", node != NULL); + assert("reiser4-4", page->mapping != NULL); + assert("reiser4-5", page->mapping->host != NULL); + + oid = (void *)(unsigned long)get_inode_oid(page->mapping->host); + + /* is_page_cache_freeable() check + (mapping + private + page_cache_get() by shrink_cache()) */ + if (page_count(page) > 3) + return 0; + + if (PageDirty(page)) + return 0; + + /* releasable() needs jnode lock, because it looks at the jnode fields + * and we need jload_lock here to avoid races with jload(). */ + LOCK_JNODE(node); + LOCK_JLOAD(node); + if (releasable(node)) { + struct address_space *mapping; + + mapping = page->mapping; + jref(node); + /* there is no need to synchronize against + * jnode_extent_write() here, because pages seen by + * jnode_extent_write() are !releasable(). */ + page_clear_jnode(page, node); + UNLOCK_JLOAD(node); + UNLOCK_JNODE(node); + + /* we are under memory pressure so release jnode also. */ + jput(node); + + write_lock_irq(&mapping->tree_lock); + /* shrink_list() + radix-tree */ + if (page_count(page) == 2) { + __remove_from_page_cache(page); + __put_page(page); + } + write_unlock_irq(&mapping->tree_lock); + + return 1; + } else { + UNLOCK_JLOAD(node); + UNLOCK_JNODE(node); + assert("nikita-3020", schedulable()); + return 0; + } +} + +#undef INC_NSTAT +#undef INC_STAT + +reiser4_internal void +move_inode_out_from_sync_inodes_loop(struct address_space * mapping) +{ + /* work around infinite loop in pdflush->sync_sb_inodes. */ + /* Problem: ->writepages() is supposed to submit io for the pages from + * ->io_pages list and to clean this list. */ + mapping->host->dirtied_when = jiffies; + spin_lock(&inode_lock); + list_move(&mapping->host->i_list, &mapping->host->i_sb->s_dirty); + spin_unlock(&inode_lock); + +} + +/* reiser4 writepages() address space operation this captures anonymous pages + and anonymous jnodes. Anonymous pages are pages which are dirtied via + mmapping. 
Anonymous jnodes are ones which were created by reiser4_writepage + */ +reiser4_internal int +reiser4_writepages(struct address_space *mapping, + struct writeback_control *wbc) +{ + int ret = 0; + struct inode *inode; + file_plugin *fplug; + + inode = mapping->host; + fplug = inode_file_plugin(inode); + if (fplug != NULL && fplug->capture != NULL) + /* call file plugin method to capture anonymous pages and + anonymous jnodes */ + ret = fplug->capture(inode, wbc); + + move_inode_out_from_sync_inodes_loop(mapping); + return ret; +} + +/* start actual IO on @page */ +reiser4_internal int reiser4_start_up_io(struct page *page) +{ + block_sync_page(page); + return 0; +} + +/* + * reiser4 methods for VM + */ +struct address_space_operations reiser4_as_operations = { + /* called during memory pressure by kswapd */ + .writepage = reiser4_writepage, + /* called to read page from the storage when page is added into page + cache. This is done by page-fault handler. */ + .readpage = reiser4_readpage, + /* Start IO on page. This is called from wait_on_page_bit() and + lock_page() and its purpose is to actually start io by jabbing + device drivers. */ + .sync_page = reiser4_start_up_io, + /* called from + * reiser4_sync_inodes()->generic_sync_sb_inodes()->...->do_writepages() + * + * captures anonymous pages for given inode + */ + .writepages = reiser4_writepages, + /* marks page dirty. Note that this is never called by reiser4 + * directly. Reiser4 uses set_page_dirty_internal(). Reiser4 set page + * dirty is called for pages dirtied though mmap and moves dirty page + * to the special ->moved_list in its mapping. 
*/ + .set_page_dirty = reiser4_set_page_dirty, + /* called during read-ahead */ + .readpages = reiser4_readpages, + .prepare_write = reiser4_prepare_write, /* loop back device driver and generic_file_write() call-back */ + .commit_write = reiser4_commit_write, /* loop back device driver and generic_file_write() call-back */ + /* map logical block number to disk block number. Used by FIBMAP ioctl + * and ..bmap pseudo file. */ + .bmap = reiser4_bmap, + /* called just before page is taken out from address space (on + truncate, umount, or similar). */ + .invalidatepage = reiser4_invalidatepage, + /* called when VM is about to take page from address space (due to + memory pressure). */ + .releasepage = reiser4_releasepage, + /* not yet implemented */ + .direct_IO = NULL +}; + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/block_alloc.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/block_alloc.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1128 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "debug.h" +#include "dformat.h" +#include "plugin/plugin.h" +#include "txnmgr.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree.h" +#include "super.h" +#include "lib.h" + +#include /* for __u?? */ +#include /* for struct super_block */ +#include + +/* THE REISER4 DISK SPACE RESERVATION SCHEME. */ + +/* We need to be able to reserve enough disk space to ensure that an atomic + operation will have enough disk space to flush (see flush.c and + http://namesys.com/v4/v4.html) and commit it once it is started. + + In our design a call for reserving disk space may fail but not an actual + block allocation. + + All free blocks, already allocated blocks, and all kinds of reserved blocks + are counted in different per-fs block counters. 
+ + A reiser4 super block's set of block counters currently is: + + free -- free blocks, + used -- already allocated blocks, + + grabbed -- initially reserved for performing an fs operation, those blocks + are taken from free blocks, then grabbed disk space leaks from grabbed + blocks counter to other counters like "fake allocated", "flush + reserved", "used", the rest of not used grabbed space is returned to + free space at the end of fs operation; + + fake allocated -- counts all nodes without real disk block numbers assigned, + we have separate accounting for formatted and unformatted + nodes (for easier debugging); + + flush reserved -- disk space needed for flushing and committing an atom. + Each dirty already allocated block could be written as a + part of atom's overwrite set or as a part of atom's + relocate set. In both case one additional block is needed, + it is used as a wandered block if we do overwrite or as a + new location for a relocated block. + + In addition, blocks in some states are counted on per-thread and per-atom + basis. A reiser4 context has a counter of blocks grabbed by this transaction + and the sb's grabbed blocks counter is a sum of grabbed blocks counter values + of each reiser4 context. Each reiser4 atom has a counter of "flush reserved" + blocks, which are reserved for flush processing and atom commit. */ + +/* AN EXAMPLE: suppose we insert new item to the reiser4 tree. We estimate + number of blocks to grab for most expensive case of balancing when the leaf + node we insert new item to gets split and new leaf node is allocated. + + So, we need to grab blocks for + + 1) one block for possible dirtying the node we insert an item to. That block + would be used for node relocation at flush time or for allocating of a + wandered one, it depends what will be a result (what set, relocate or + overwrite the node gets assigned to) of the node processing by the flush + algorithm. 
+ + 2) one block for either allocating a new node, or dirtying of right or left + clean neighbor, only one case may happen. + + VS-FIXME-HANS: why can only one case happen? I would expect to see dirtying of left neighbor, right neighbor, current + node, and creation of new node. have I forgotten something? email me. + + These grabbed blocks are counted in both reiser4 context "grabbed blocks" + counter and in the fs-wide one (both ctx->grabbed_blocks and + sbinfo->blocks_grabbed get incremented by 2), sb's free blocks counter is + decremented by 2. + + Suppose both two blocks were spent for dirtying of an already allocated clean + node (one block went from "grabbed" to "flush reserved") and for new block + allocating (one block went from "grabbed" to "fake allocated formatted"). + + Inserting of a child pointer to the parent node caused parent node to be + split, the balancing code takes care about this grabbing necessary space + immediately by calling reiser4_grab with BA_RESERVED flag set which means + "can use the 5% reserved disk space". + + At this moment insertion completes and grabbed blocks (if they were not used) + should be returned to the free space counter. + + However the atom life-cycle is not completed. The atom had one "flush + reserved" block added by our insertion and the new fake allocated node is + counted as a "fake allocated formatted" one. The atom has to be fully + processed by flush before commit. Suppose that the flush moved the first, + already allocated node to the atom's overwrite list, the new fake allocated + node, obviously, went into the atom relocate set. The reiser4 flush + allocates the new node using one unit from "fake allocated formatted" + counter, the log writer uses one from "flush reserved" for wandered block + allocation. + + And, it is not the end. When the wandered block is deallocated after the + atom gets fully played (see wander.c for term description), the disk space + occupied for it is returned to free blocks. 
*/ + +/* BLOCK NUMBERS */ + +/* Any reiser4 node has a block number assigned to it. We use these numbers for + indexing in hash tables, so if a block has not yet been assigned a location + on disk we need to give it a temporary fake block number. + + The current implementation of reiser4 uses 64-bit integers for block numbers. We + use the highest bit of the 64-bit block number to distinguish fake and real block + numbers. So, only 63 bits may be used for addressing real device + blocks. The "fake" block number space is divided into subspaces of fake + block numbers for data blocks and for shadow (working) bitmap blocks. + + Fake block numbers for data blocks are generated by a cyclic counter, which + gets incremented each time a fake block number is assigned. We assume that it is + impossible to overflow this counter during one transaction life. */ + +/* Initialize a blocknr hint. */ +reiser4_internal void +blocknr_hint_init(reiser4_blocknr_hint * hint) +{ + memset(hint, 0, sizeof (reiser4_blocknr_hint)); +} + +/* Release any resources of a blocknr hint. */ +reiser4_internal void +blocknr_hint_done(reiser4_blocknr_hint * hint UNUSED_ARG) +{ + /* No resources should be freed in current blocknr_hint implementation.*/ +} + +/* see above for explanation of fake block number. */ +/* Audited by: green(2002.06.11) */ +reiser4_internal int +blocknr_is_fake(const reiser4_block_nr * da) +{ + /* The reason for not simply returning result of '&' operation is that + while return value is (possibly 32bit) int, the reiser4_block_nr is + at least 64 bits long, and high bit (which is the only possible + non zero bit after the masking) would be stripped off */ + return (*da & REISER4_FAKE_BLOCKNR_BIT_MASK) ? 1 : 0; +} + +/* Static functions for / block counters + arithmetic. Mostly, they are isolated to avoid coding the same assertions in + several places.
*/ +static void +sub_from_ctx_grabbed(reiser4_context *ctx, __u64 count) +{ + BUG_ON(ctx->grabbed_blocks < count); + assert("zam-527", ctx->grabbed_blocks >= count); + ctx->grabbed_blocks -= count; +} + +static void +add_to_ctx_grabbed(reiser4_context *ctx, __u64 count) +{ +#if REISER4_DEBUG +#ifdef CONFIG_FRAME_POINTER + ctx->grabbed_at[0] = __builtin_return_address(0); + ctx->grabbed_at[1] = __builtin_return_address(1); + ctx->grabbed_at[2] = __builtin_return_address(2); + ctx->grabbed_at[3] = __builtin_return_address(3); +#endif +#endif + ctx->grabbed_blocks += count; +} + +static void +sub_from_sb_grabbed(reiser4_super_info_data *sbinfo, __u64 count) +{ + assert("zam-525", sbinfo->blocks_grabbed >= count); + sbinfo->blocks_grabbed -= count; +} + +/* Decrease the counter of block reserved for flush in super block. */ +static void +sub_from_sb_flush_reserved (reiser4_super_info_data *sbinfo, __u64 count) +{ + assert ("vpf-291", sbinfo->blocks_flush_reserved >= count); + sbinfo->blocks_flush_reserved -= count; +} + +static void +sub_from_sb_fake_allocated(reiser4_super_info_data *sbinfo, __u64 count, reiser4_ba_flags_t flags) +{ + if (flags & BA_FORMATTED) { + assert("zam-806", sbinfo->blocks_fake_allocated >= count); + sbinfo->blocks_fake_allocated -= count; + } else { + assert("zam-528", sbinfo->blocks_fake_allocated_unformatted >= count); + sbinfo->blocks_fake_allocated_unformatted -= count; + } +} + +static void +sub_from_sb_used(reiser4_super_info_data *sbinfo, __u64 count) +{ + assert("zam-530", sbinfo->blocks_used >= count + sbinfo->min_blocks_used); + sbinfo->blocks_used -= count; +} + +static void +sub_from_cluster_reserved(reiser4_super_info_data *sbinfo, __u64 count) +{ + assert("edward-501", sbinfo->blocks_clustered >= count); + sbinfo->blocks_clustered -= count; +} + +/* Increase the counter of block reserved for flush in atom. 
*/ +static void +add_to_atom_flush_reserved_nolock (txn_atom * atom, __u32 count) +{ + assert ("zam-772", atom != NULL); + assert ("zam-773", spin_atom_is_locked (atom)); + atom->flush_reserved += count; +} + +/* Decrease the counter of blocks reserved for flush in atom. */ +static void +sub_from_atom_flush_reserved_nolock (txn_atom * atom, __u32 count) +{ + assert ("zam-774", atom != NULL); + assert ("zam-775", spin_atom_is_locked (atom)); + assert ("nikita-2790", atom->flush_reserved >= count); + atom->flush_reserved -= count; +} + +/* super block has 7 counters: free, used, grabbed, fake allocated + (formatted and unformatted), flush reserved and cluster reserved. Their sum + must equal the number of blocks on the device. This function checks this */ +reiser4_internal int +check_block_counters(const struct super_block *super) +{ + __u64 sum; + + sum = reiser4_grabbed_blocks(super) + reiser4_free_blocks(super) + + reiser4_data_blocks(super) + reiser4_fake_allocated(super) + + reiser4_fake_allocated_unformatted(super) + flush_reserved(super) + + reiser4_clustered_blocks(super); + if (reiser4_block_count(super) != sum) { + printk("super block counters: " + "used %llu, free %llu, " + "grabbed %llu, fake allocated (formatted %llu, unformatted %llu), " + "reserved %llu, clustered %llu, sum %llu, must be (block count) %llu\n", + (unsigned long long)reiser4_data_blocks(super), + (unsigned long long)reiser4_free_blocks(super), + (unsigned long long)reiser4_grabbed_blocks(super), + (unsigned long long)reiser4_fake_allocated(super), + (unsigned long long)reiser4_fake_allocated_unformatted(super), + (unsigned long long)flush_reserved(super), + (unsigned long long)reiser4_clustered_blocks(super), + (unsigned long long)sum, + (unsigned long long)reiser4_block_count(super)); + return 0; + } + return 1; +} + +/* Adjust "working" free blocks counter for number of blocks we are going to + allocate. Record number of grabbed blocks in fs-wide and per-thread + counters.
This function should be called before bitmap scanning or + allocating fake block numbers + + @ctx -- current reiser4 context; + @count -- number of blocks we reserve; + + @return -- 0 on success, -ENOSPC if too few free blocks remain (the + reserved area is usable only when BA_RESERVED is set in @flags). +*/ + +static int +reiser4_grab(reiser4_context *ctx, __u64 count, reiser4_ba_flags_t flags) +{ + __u64 free_blocks; + int ret = 0, use_reserved = flags & BA_RESERVED; + reiser4_super_info_data *sbinfo; + + assert("vs-1276", ctx == get_current_context()); + + /* Do not grab anything on ro-mounted fs. */ + if (rofs_super(ctx->super)) { + ctx->grab_enabled = 0; + return 0; + } + + sbinfo = get_super_private(ctx->super); + + reiser4_spin_lock_sb(sbinfo); + + free_blocks = sbinfo->blocks_free; + + if ((use_reserved && free_blocks < count) || + (!use_reserved && free_blocks < count + sbinfo->blocks_reserved)) { + ret = RETERR(-ENOSPC); + goto unlock_and_ret; + } + + add_to_ctx_grabbed(ctx, count); + + sbinfo->blocks_grabbed += count; + sbinfo->blocks_free -= count; + +#if REISER4_DEBUG + ctx->grabbed_initially = count; +#endif + + assert("nikita-2986", check_block_counters(ctx->super)); + + /* disable grab space in current context */ + ctx->grab_enabled = 0; + +unlock_and_ret: + reiser4_spin_unlock_sb(sbinfo); + + return ret; +} + +reiser4_internal int +reiser4_grab_space(__u64 count, reiser4_ba_flags_t flags) +{ + int ret; + reiser4_context *ctx; + + assert("nikita-2964", ergo(flags & BA_CAN_COMMIT, + lock_stack_isclean(get_current_lock_stack()))); + ctx = get_current_context(); + if (!(flags & BA_FORCE) && !is_grab_enabled(ctx)) { + return 0; + } + + ret = reiser4_grab(ctx, count, flags); + if (ret == -ENOSPC) { + + /* Try to commit all transactions if the BA_CAN_COMMIT flag is present */ + if (flags & BA_CAN_COMMIT) { + txnmgr_force_commit_all(ctx->super, 0); + ctx->grab_enabled = 1; + ret = reiser4_grab(ctx, count, flags); + } + } + /* + * allocation from reserved pool cannot fail.
This is a severe error. + */ + assert("nikita-3005", ergo(flags & BA_RESERVED, ret == 0)); + return ret; +} + +/* + * SPACE RESERVED FOR UNLINK/TRUNCATE + * + * Unlink and truncate require space in transaction (to update stat data, at + * least). But we don't want rm(1) to fail with "No space left on device" error. + * + * The solution is to reserve 5% of disk space for truncates and + * unlinks. Specifically, normal space grabbing requests don't grab space from + * the reserved area. Only requests with BA_RESERVED bit in flags are allowed to + * drain it. The per-super-block delete_sema semaphore is used to allow only one + * thread at a time to grab from the reserved area. + * + * Grabbing from the reserved area should always be performed with BA_CAN_COMMIT + * flag. + * + */ + +reiser4_internal int reiser4_grab_reserved(struct super_block *super, + __u64 count, reiser4_ba_flags_t flags) +{ + reiser4_super_info_data *sbinfo = get_super_private(super); + + assert("nikita-3175", flags & BA_CAN_COMMIT); + + /* Check whether the delete semaphore is already taken by us; we assume that + * reading of a machine word is atomic.
*/ + if (sbinfo->delete_sema_owner == current) { + if (reiser4_grab_space(count, (flags | BA_RESERVED) & ~BA_CAN_COMMIT)) { + warning("zam-1003", "nested call of grab_reserved fails count=(%llu)", + (unsigned long long)count); + reiser4_release_reserved(super); + return RETERR(-ENOSPC); + } + return 0; + } + + if (reiser4_grab_space(count, flags)) { + down(&sbinfo->delete_sema); + assert("nikita-2929", sbinfo->delete_sema_owner == NULL); + sbinfo->delete_sema_owner = current; + + if (reiser4_grab_space(count, flags | BA_RESERVED)) { + warning("zam-833", + "reserved space is not enough (%llu)", (unsigned long long)count); + reiser4_release_reserved(super); + return RETERR(-ENOSPC); + } + } + return 0; +} + +reiser4_internal void +reiser4_release_reserved(struct super_block *super) +{ + reiser4_super_info_data *info; + + info = get_super_private(super); + if (info->delete_sema_owner == current) { + info->delete_sema_owner = NULL; + up(&info->delete_sema); + } +} + +static reiser4_super_info_data * +grabbed2fake_allocated_head(void) +{ + reiser4_context *ctx; + reiser4_super_info_data *sbinfo; + + ctx = get_current_context(); + sub_from_ctx_grabbed(ctx, 1); + + sbinfo = get_super_private(ctx->super); + reiser4_spin_lock_sb(sbinfo); + + sub_from_sb_grabbed(sbinfo, 1); + /* return sbinfo locked */ + return sbinfo; +} + +/* is called after @count fake block numbers are allocated and pointer to + those blocks are inserted into tree. 
*/ +static void +grabbed2fake_allocated_formatted(void) +{ + reiser4_super_info_data *sbinfo; + + sbinfo = grabbed2fake_allocated_head(); + sbinfo->blocks_fake_allocated ++; + + assert("vs-922", check_block_counters(reiser4_get_current_sb())); + + reiser4_spin_unlock_sb(sbinfo); +} + +static void +grabbed2fake_allocated_unformatted(void) +{ + reiser4_super_info_data *sbinfo; + + sbinfo = grabbed2fake_allocated_head(); + sbinfo->blocks_fake_allocated_unformatted ++; + + assert("vs-9221", check_block_counters(reiser4_get_current_sb())); + + reiser4_spin_unlock_sb(sbinfo); +} + +reiser4_internal void +grabbed2cluster_reserved(int count) +{ + reiser4_context *ctx; + reiser4_super_info_data *sbinfo; + + ctx = get_current_context(); + sub_from_ctx_grabbed(ctx, count); + + sbinfo = get_super_private(ctx->super); + reiser4_spin_lock_sb(sbinfo); + + sub_from_sb_grabbed(sbinfo, count); + sbinfo->blocks_clustered += count; + + assert("edward-504", check_block_counters(ctx->super)); + + reiser4_spin_unlock_sb(sbinfo); +} + +reiser4_internal void +cluster_reserved2grabbed(int count) +{ + reiser4_context *ctx; + reiser4_super_info_data *sbinfo; + + ctx = get_current_context(); + + sbinfo = get_super_private(ctx->super); + reiser4_spin_lock_sb(sbinfo); + + sub_from_cluster_reserved(sbinfo, count); + sbinfo->blocks_grabbed += count; + + assert("edward-505", check_block_counters(ctx->super)); + + reiser4_spin_unlock_sb(sbinfo); + add_to_ctx_grabbed(ctx, count); +} + +reiser4_internal void +cluster_reserved2free(int count) +{ + reiser4_context *ctx; + reiser4_super_info_data *sbinfo; + + assert("edward-503", get_current_context()->grabbed_blocks == 0); + + ctx = get_current_context(); + sbinfo = get_super_private(ctx->super); + reiser4_spin_lock_sb(sbinfo); + + sub_from_cluster_reserved(sbinfo, count); + sbinfo->blocks_free += count; + + assert("edward-502", check_block_counters(ctx->super)); + + reiser4_spin_unlock_sb(sbinfo); +} + +static spinlock_t fake_lock = SPIN_LOCK_UNLOCKED; 
+static reiser4_block_nr fake_gen = 0; + +/* obtain a block number for new formatted node which will be used to refer + to this newly allocated node until real allocation is done */ +static inline void assign_fake_blocknr(reiser4_block_nr *blocknr) +{ + spin_lock(&fake_lock); + *blocknr = fake_gen++; + spin_unlock(&fake_lock); + + *blocknr &= ~REISER4_BLOCKNR_STATUS_BIT_MASK; + *blocknr |= REISER4_UNALLOCATED_STATUS_VALUE; + assert("zam-394", zlook(current_tree, blocknr) == NULL); +} + +reiser4_internal int +assign_fake_blocknr_formatted(reiser4_block_nr *blocknr) +{ + assign_fake_blocknr(blocknr); + grabbed2fake_allocated_formatted(); + + return 0; +} + +/* return fake blocknr which will be used for unformatted nodes */ +reiser4_internal reiser4_block_nr +fake_blocknr_unformatted(void) +{ + reiser4_block_nr blocknr; + + assign_fake_blocknr(&blocknr); + grabbed2fake_allocated_unformatted(); + + /*XXXXX*/inc_unalloc_unfm_ptr(); + return blocknr; +} + + +/* adjust sb block counters, if real (on-disk) block allocation immediately + follows grabbing of free disk space. 
*/ +static void +grabbed2used(reiser4_context *ctx, reiser4_super_info_data *sbinfo, __u64 count) +{ + sub_from_ctx_grabbed(ctx, count); + + reiser4_spin_lock_sb(sbinfo); + + sub_from_sb_grabbed(sbinfo, count); + sbinfo->blocks_used += count; + + assert("nikita-2679", check_block_counters(ctx->super)); + + reiser4_spin_unlock_sb(sbinfo); +} + +/* adjust sb block counters when @count unallocated blocks get mapped to disk */ +static void +fake_allocated2used(reiser4_super_info_data *sbinfo, __u64 count, reiser4_ba_flags_t flags) +{ + reiser4_spin_lock_sb(sbinfo); + + sub_from_sb_fake_allocated(sbinfo, count, flags); + sbinfo->blocks_used += count; + + assert("nikita-2680", check_block_counters(reiser4_get_current_sb())); + + reiser4_spin_unlock_sb(sbinfo); +} + +static void +flush_reserved2used(txn_atom * atom, __u64 count) +{ + reiser4_super_info_data *sbinfo; + + assert("zam-787", atom != NULL); + assert("zam-788", spin_atom_is_locked(atom)); + + sub_from_atom_flush_reserved_nolock(atom, (__u32)count); + + sbinfo = get_current_super_private(); + reiser4_spin_lock_sb(sbinfo); + + sub_from_sb_flush_reserved(sbinfo, count); + sbinfo->blocks_used += count; + + assert ("zam-789", check_block_counters(reiser4_get_current_sb())); + + reiser4_spin_unlock_sb(sbinfo); +} + +/* update the per fs blocknr hint default value. */ +reiser4_internal void +update_blocknr_hint_default (const struct super_block *s, const reiser4_block_nr * block) +{ + reiser4_super_info_data *sbinfo = get_super_private(s); + + assert("nikita-3342", !blocknr_is_fake(block)); + + reiser4_spin_lock_sb(sbinfo); + if (*block < sbinfo->block_count) { + sbinfo->blocknr_hint_default = *block; + } else { + warning("zam-676", + "block number %llu is too large to be used in a blocknr hint\n", (unsigned long long) *block); + dump_stack(); + DEBUGON(1); + } + reiser4_spin_unlock_sb(sbinfo); +} + +/* get current value of the default blocknr hint. 
*/ +reiser4_internal void get_blocknr_hint_default(reiser4_block_nr * result) +{ + reiser4_super_info_data * sbinfo = get_current_super_private(); + + reiser4_spin_lock_sb(sbinfo); + *result = sbinfo->blocknr_hint_default; + assert("zam-677", *result < sbinfo->block_count); + reiser4_spin_unlock_sb(sbinfo); +} + +/* Allocate "real" disk blocks by calling a proper space allocation plugin + * method. Blocks are allocated in one contiguous disk region. The plugin + * independent part accounts blocks by subtracting allocated amount from grabbed + * or fake block counter and add the same amount to the counter of allocated + * blocks. + * + * @hint -- a reiser4 blocknr hint object which contains further block + * allocation hints and parameters (search start, a stage of block + * which will be mapped to disk, etc.), + * @blk -- an out parameter for the beginning of the allocated region, + * @len -- in/out parameter, it should contain the maximum number of allocated + * blocks, after block allocation completes, it contains the length of + * allocated disk region. + * @flags -- see reiser4_ba_flags_t description. + * + * @return -- 0 if success, error code otherwise. + */ +reiser4_internal int +reiser4_alloc_blocks(reiser4_blocknr_hint * hint, reiser4_block_nr * blk, + reiser4_block_nr * len, reiser4_ba_flags_t flags) +{ + __u64 needed = *len; + reiser4_context *ctx; + reiser4_super_info_data *sbinfo; + int ret; + + assert ("zam-986", hint != NULL); + + ctx = get_current_context(); + sbinfo = get_super_private(ctx->super); + + /* For write-optimized data we use default search start value, which is + * close to last write location. */ + if (flags & BA_USE_DEFAULT_SEARCH_START) { + get_blocknr_hint_default(&hint->blk); + } + + /* VITALY: allocator should grab this for internal/tx-lists/similar only. */ +/* VS-FIXME-HANS: why is this comment above addressed to vitaly (from vitaly)? 
*/ + if (hint->block_stage == BLOCK_NOT_COUNTED) { + ret = reiser4_grab_space_force(*len, flags); + if (ret != 0) + return ret; + } + + ret = sa_alloc_blocks(get_space_allocator(ctx->super), hint, (int) needed, blk, len); + + if (!ret) { + assert("zam-680", *blk < reiser4_block_count(ctx->super)); + assert("zam-681", *blk + *len <= reiser4_block_count(ctx->super)); + + if (flags & BA_PERMANENT) { + /* we assume that current atom exists at this moment */ + txn_atom * atom = get_current_atom_locked (); + atom -> nr_blocks_allocated += *len; + UNLOCK_ATOM (atom); + } + + switch (hint->block_stage) { + case BLOCK_NOT_COUNTED: + case BLOCK_GRABBED: + grabbed2used(ctx, sbinfo, *len); + break; + case BLOCK_UNALLOCATED: + fake_allocated2used(sbinfo, *len, flags); + break; + case BLOCK_FLUSH_RESERVED: + { + txn_atom *atom = get_current_atom_locked (); + flush_reserved2used(atom, *len); + UNLOCK_ATOM (atom); + } + break; + default: + impossible("zam-531", "wrong block stage"); + } + } else { + assert ("zam-821", ergo(hint->max_dist == 0 && !hint->backward, ret != -ENOSPC)); + if (hint->block_stage == BLOCK_NOT_COUNTED) + grabbed2free(ctx, sbinfo, needed); + } + + return ret; +} + +/* used -> fake_allocated -> grabbed -> free */ + +/* adjust sb block counters when @count unallocated blocks get unmapped from + disk */ +static void +used2fake_allocated(reiser4_super_info_data *sbinfo, __u64 count, int formatted) +{ + reiser4_spin_lock_sb(sbinfo); + + if (formatted) + sbinfo->blocks_fake_allocated += count; + else + sbinfo->blocks_fake_allocated_unformatted += count; + + sub_from_sb_used(sbinfo, count); + + assert("nikita-2681", check_block_counters(reiser4_get_current_sb())); + + reiser4_spin_unlock_sb(sbinfo); +} + +static void +used2flush_reserved(reiser4_super_info_data *sbinfo, txn_atom * atom, __u64 count, + reiser4_ba_flags_t flags UNUSED_ARG) +{ + assert("nikita-2791", atom != NULL); + assert("nikita-2792", spin_atom_is_locked(atom)); + + 
add_to_atom_flush_reserved_nolock(atom, (__u32)count); + + reiser4_spin_lock_sb(sbinfo); + + sbinfo->blocks_flush_reserved += count; + /*add_to_sb_flush_reserved(sbinfo, count);*/ + sub_from_sb_used(sbinfo, count); + + assert("nikita-2681", check_block_counters(reiser4_get_current_sb())); + + reiser4_spin_unlock_sb(sbinfo); +} + +/* disk space, virtually used by fake block numbers is counted as "grabbed" again. */ +static void +fake_allocated2grabbed(reiser4_context *ctx, reiser4_super_info_data *sbinfo, __u64 count, reiser4_ba_flags_t flags) +{ + add_to_ctx_grabbed(ctx, count); + + reiser4_spin_lock_sb(sbinfo); + + assert("nikita-2682", check_block_counters(ctx->super)); + + sbinfo->blocks_grabbed += count; + sub_from_sb_fake_allocated(sbinfo, count, flags & BA_FORMATTED); + + assert("nikita-2683", check_block_counters(ctx->super)); + + reiser4_spin_unlock_sb(sbinfo); +} + +reiser4_internal void +fake_allocated2free(__u64 count, reiser4_ba_flags_t flags) +{ + reiser4_context *ctx; + reiser4_super_info_data *sbinfo; + + ctx = get_current_context(); + sbinfo = get_super_private(ctx->super); + + fake_allocated2grabbed(ctx, sbinfo, count, flags); + grabbed2free(ctx, sbinfo, count); +} + +reiser4_internal void +grabbed2free_mark(__u64 mark) +{ + reiser4_context *ctx; + reiser4_super_info_data *sbinfo; + + ctx = get_current_context(); + sbinfo = get_super_private(ctx->super); + + assert("nikita-3007", (__s64)mark >= 0); + assert("nikita-3006", + ctx->grabbed_blocks >= mark); + grabbed2free(ctx, sbinfo, ctx->grabbed_blocks - mark); +} + +/* Adjust free blocks count for blocks which were reserved but were not used. 
*/ +reiser4_internal void +grabbed2free(reiser4_context *ctx, reiser4_super_info_data *sbinfo, + __u64 count) +{ + sub_from_ctx_grabbed(ctx, count); + + + reiser4_spin_lock_sb(sbinfo); + + sub_from_sb_grabbed(sbinfo, count); + sbinfo->blocks_free += count; + assert("nikita-2684", check_block_counters(ctx->super)); + + reiser4_spin_unlock_sb(sbinfo); +} + +reiser4_internal void +grabbed2flush_reserved_nolock(txn_atom * atom, __u64 count) +{ + reiser4_context *ctx; + reiser4_super_info_data *sbinfo; + + assert("vs-1095", atom); + + ctx = get_current_context(); + sbinfo = get_super_private(ctx->super); + + sub_from_ctx_grabbed(ctx, count); + + add_to_atom_flush_reserved_nolock(atom, count); + + reiser4_spin_lock_sb(sbinfo); + + sbinfo->blocks_flush_reserved += count; + sub_from_sb_grabbed(sbinfo, count); + + assert ("vpf-292", check_block_counters(ctx->super)); + + reiser4_spin_unlock_sb(sbinfo); +} + +reiser4_internal void +grabbed2flush_reserved(__u64 count) +{ + txn_atom * atom = get_current_atom_locked (); + + grabbed2flush_reserved_nolock (atom, count); + + UNLOCK_ATOM (atom); +} + +reiser4_internal void flush_reserved2grabbed(txn_atom * atom, __u64 count) +{ + reiser4_context *ctx; + reiser4_super_info_data *sbinfo; + + assert("nikita-2788", atom != NULL); + assert("nikita-2789", spin_atom_is_locked(atom)); + + ctx = get_current_context(); + sbinfo = get_super_private(ctx->super); + + add_to_ctx_grabbed(ctx, count); + + sub_from_atom_flush_reserved_nolock(atom, (__u32)count); + + reiser4_spin_lock_sb(sbinfo); + + sbinfo->blocks_grabbed += count; + sub_from_sb_flush_reserved(sbinfo, count); + + assert ("vpf-292", check_block_counters (ctx->super)); + + reiser4_spin_unlock_sb (sbinfo); +} + +/* release all blocks grabbed in context which were not used.
*/ +reiser4_internal void +all_grabbed2free(void) +{ + reiser4_context *ctx = get_current_context(); + + grabbed2free(ctx, get_super_private(ctx->super), ctx->grabbed_blocks); +} + +/* adjust sb block counters if real (on-disk) blocks do not become unallocated + after freeing, @count blocks become "grabbed". */ +static void +used2grabbed(reiser4_context *ctx, reiser4_super_info_data *sbinfo, __u64 count) +{ + add_to_ctx_grabbed(ctx, count); + + reiser4_spin_lock_sb(sbinfo); + + sbinfo->blocks_grabbed += count; + sub_from_sb_used(sbinfo, count); + + assert("nikita-2685", check_block_counters(ctx->super)); + + reiser4_spin_unlock_sb(sbinfo); +} + +/* this used to be done through used2grabbed and grabbed2free*/ +static void +used2free(reiser4_super_info_data *sbinfo, __u64 count) +{ + reiser4_spin_lock_sb(sbinfo); + + sbinfo->blocks_free += count; + sub_from_sb_used(sbinfo, count); + + assert("nikita-2685", check_block_counters(reiser4_get_current_sb())); + + reiser4_spin_unlock_sb(sbinfo); +} + +#if REISER4_DEBUG + +/* check "allocated" state of given block range */ +static void +reiser4_check_blocks(const reiser4_block_nr * start, const reiser4_block_nr * len, int desired) +{ + sa_check_blocks(start, len, desired); +} + +/* check "allocated" state of given block */ +void +reiser4_check_block(const reiser4_block_nr * block, int desired) +{ + const reiser4_block_nr one = 1; + + reiser4_check_blocks(block, &one, desired); +} + +#endif + +/* The block deallocation function may do an actual deallocation through the space + allocation plugin or store deleted block numbers in the atom's delete_set data + structure, depending on the BA_DEFER bit in @flags. */ + +/* if BA_DEFER bit is not turned on, @target_stage means the stage of blocks which + will be deleted from WORKING bitmap.
They might be just unmapped from disk, or + freed but disk space is still grabbed by current thread, or these blocks must + not be counted in any reiser4 sb block counters, see block_stage_t comment */ + +/* BA_FORMATTED bit is only used when BA_DEFER is not present: it is used to + distinguish blocks allocated for unformatted and formatted nodes */ + +reiser4_internal int +reiser4_dealloc_blocks(const reiser4_block_nr * start, + const reiser4_block_nr * len, + block_stage_t target_stage, reiser4_ba_flags_t flags) +{ + txn_atom *atom = NULL; + int ret; + reiser4_context *ctx; + reiser4_super_info_data *sbinfo; + + ctx = get_current_context(); + sbinfo = get_super_private(ctx->super); + + if (REISER4_DEBUG) { + assert("zam-431", *len != 0); + assert("zam-432", *start != 0); + assert("zam-558", !blocknr_is_fake(start)); + + reiser4_spin_lock_sb(sbinfo); + assert("zam-562", *start < sbinfo->block_count); + reiser4_spin_unlock_sb(sbinfo); + } + + if (flags & BA_DEFER) { + blocknr_set_entry *bsep = NULL; + + /* storing deleted block numbers in a blocknr set + data structure for further actual deletion */ + do { + atom = get_current_atom_locked(); + assert("zam-430", atom != NULL); + + ret = blocknr_set_add_extent(atom, &atom->delete_set, &bsep, start, len); + + if (ret == -ENOMEM) + return ret; + + /* This loop might spin at most two times */ + } while (ret == -E_REPEAT); + + assert("zam-477", ret == 0); + assert("zam-433", atom != NULL); + + UNLOCK_ATOM(atom); + + } else { + assert("zam-425", get_current_super_private() != NULL); + sa_dealloc_blocks(get_space_allocator(ctx->super), *start, *len); + + if (flags & BA_PERMANENT) { + /* These blocks were counted as allocated, we have to revert it + * back if allocation is discarded.
*/ + txn_atom * atom = get_current_atom_locked (); + atom->nr_blocks_allocated -= *len; + UNLOCK_ATOM (atom); + } + + switch (target_stage) { + case BLOCK_NOT_COUNTED: + assert("vs-960", flags & BA_FORMATTED); + /* VITALY: This is what was grabbed for internal/tx-lists/similar only */ + used2free(sbinfo, *len); + break; + + case BLOCK_GRABBED: + used2grabbed(ctx, sbinfo, *len); + break; + + case BLOCK_UNALLOCATED: + used2fake_allocated(sbinfo, *len, flags & BA_FORMATTED); + break; + + case BLOCK_FLUSH_RESERVED: { + txn_atom *atom; + + atom = get_current_atom_locked(); + used2flush_reserved(sbinfo, atom, *len, flags & BA_FORMATTED); + UNLOCK_ATOM(atom); + break; + } + default: + impossible("zam-532", "wrong block stage"); + } + } + + return 0; +} + +/* wrappers for block allocator plugin methods */ +reiser4_internal int +pre_commit_hook(void) +{ + assert("zam-502", get_current_super_private() != NULL); + sa_pre_commit_hook(); + return 0; +} + +/* an actor which applies delete set to block allocator data */ +static int +apply_dset(txn_atom * atom UNUSED_ARG, const reiser4_block_nr * a, const reiser4_block_nr * b, void *data UNUSED_ARG) +{ + reiser4_context *ctx; + reiser4_super_info_data *sbinfo; + + __u64 len = 1; + + ctx = get_current_context(); + sbinfo = get_super_private(ctx->super); + + assert("zam-877", atom->stage >= ASTAGE_PRE_COMMIT); + assert("zam-552", sbinfo != NULL); + + if (b != NULL) + len = *b; + + if (REISER4_DEBUG) { + reiser4_spin_lock_sb(sbinfo); + + assert("zam-554", *a < reiser4_block_count(ctx->super)); + assert("zam-555", *a + len <= reiser4_block_count(ctx->super)); + + reiser4_spin_unlock_sb(sbinfo); + } + + sa_dealloc_blocks(&sbinfo->space_allocator, *a, len); + /* adjust sb block counters */ + used2free(sbinfo, len); + return 0; +} + +reiser4_internal void +post_commit_hook(void) +{ + txn_atom *atom; + + atom = get_current_atom_locked(); + assert("zam-452", atom->stage == ASTAGE_POST_COMMIT); + UNLOCK_ATOM(atom); + + /* do the block 
deallocation which was deferred + until commit is done */ + blocknr_set_iterator(atom, &atom->delete_set, apply_dset, NULL, 1); + + assert("zam-504", get_current_super_private() != NULL); + sa_post_commit_hook(); +} + +reiser4_internal void +post_write_back_hook(void) +{ + assert("zam-504", get_current_super_private() != NULL); + + sa_post_commit_hook(); +} + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/block_alloc.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/block_alloc.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,175 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#if !defined (__FS_REISER4_BLOCK_ALLOC_H__) +#define __FS_REISER4_BLOCK_ALLOC_H__ + +#include "dformat.h" +#include "forward.h" + +#include /* for __u?? */ +#include + +/* Mask which, when applied to a given block number, shows whether that block number is a fake one */ +#define REISER4_FAKE_BLOCKNR_BIT_MASK 0x8000000000000000ULL +/* Mask which isolates the type of object this fake block number was assigned to */ +#define REISER4_BLOCKNR_STATUS_BIT_MASK 0xC000000000000000ULL + +/* the result after applying REISER4_BLOCKNR_STATUS_BIT_MASK should be compared + against these two values to tell whether the object is unallocated or a bitmap + shadow object (WORKING BITMAP block, see plugin/space/bitmap.c) */ +#define REISER4_UNALLOCATED_STATUS_VALUE 0xC000000000000000ULL +#define REISER4_BITMAP_BLOCKS_STATUS_VALUE 0x8000000000000000ULL + +/* specification of how block allocation was counted in sb block counters */ +typedef enum { + BLOCK_NOT_COUNTED = 0, /* reiser4 has no info about this block yet */ + BLOCK_GRABBED = 1, /* free space grabbed for further allocation + of this block */ + BLOCK_FLUSH_RESERVED = 2, /* block is reserved for flush needs.
*/ + BLOCK_UNALLOCATED = 3, /* block is used for existing in-memory object + ( unallocated formatted or unformatted + node) */ + BLOCK_ALLOCATED = 4 /* block is mapped to disk, real on-disk block + number assigned */ +} block_stage_t; + +/* a hint for block allocator */ +struct reiser4_blocknr_hint { + /* FIXME: I think we want to add a longterm lock on the bitmap block here. This + is to prevent jnode_flush() calls from interleaving allocations on the same + bitmap, once a hint is established. */ + + /* search start hint */ + reiser4_block_nr blk; + /* if not zero, it is the size of the region we search for free blocks in */ + reiser4_block_nr max_dist; + /* level for allocation, may be useful to have branch-level and higher + write-optimized. */ + tree_level level; + /* block allocator assumes that blocks, which will be mapped to disk, + are in this specified block_stage */ + block_stage_t block_stage; + /* If backward = 1, allocate blocks in backward direction from the end + * of disk to the beginning of disk. */ + int backward:1; + +}; + +/* These flags control block allocation/deallocation behavior */ +enum reiser4_ba_flags { + /* do allocations from reserved (5%) area */ + BA_RESERVED = (1 << 0), + + /* block allocator can do commit trying to recover free space */ + BA_CAN_COMMIT = (1 << 1), + + /* if operation will be applied to formatted block */ + BA_FORMATTED = (1 << 2), + + /* defer actual block freeing until transaction commit */ + BA_DEFER = (1 << 3), + + /* allocate blocks for permanent fs objects (formatted or unformatted), not + wandered or log blocks */ + BA_PERMANENT = (1 << 4), + + /* grab space even if it was disabled */ + BA_FORCE = (1 << 5), + + /* use default start value for free blocks search.
*/ + BA_USE_DEFAULT_SEARCH_START = (1 << 6) +}; + +typedef enum reiser4_ba_flags reiser4_ba_flags_t; + +extern void blocknr_hint_init(reiser4_blocknr_hint * hint); +extern void blocknr_hint_done(reiser4_blocknr_hint * hint); +extern void update_blocknr_hint_default(const struct super_block *, const reiser4_block_nr *); +extern void get_blocknr_hint_default(reiser4_block_nr *); + +extern reiser4_block_nr reiser4_fs_reserved_space(struct super_block * super); + +int assign_fake_blocknr_formatted(reiser4_block_nr *); +reiser4_block_nr fake_blocknr_unformatted(void); + + +/* free -> grabbed -> fake_allocated -> used */ + + +int reiser4_grab_space (__u64 count, reiser4_ba_flags_t flags); +void all_grabbed2free (void); +void grabbed2free (reiser4_context *, + reiser4_super_info_data *, __u64 count); +void fake_allocated2free (__u64 count, reiser4_ba_flags_t flags); +void grabbed2flush_reserved_nolock(txn_atom * atom, __u64 count); +void grabbed2flush_reserved (__u64 count); +int reiser4_alloc_blocks (reiser4_blocknr_hint * hint, + reiser4_block_nr * start, + reiser4_block_nr * len, + reiser4_ba_flags_t flags); +int reiser4_dealloc_blocks (const reiser4_block_nr *, + const reiser4_block_nr *, + block_stage_t, reiser4_ba_flags_t flags); + +static inline int reiser4_alloc_block (reiser4_blocknr_hint * hint, reiser4_block_nr * start, + reiser4_ba_flags_t flags) +{ + reiser4_block_nr one = 1; + return reiser4_alloc_blocks(hint, start, &one, flags); +} + +static inline int reiser4_dealloc_block (const reiser4_block_nr * block, block_stage_t stage, reiser4_ba_flags_t flags) +{ + const reiser4_block_nr one = 1; + return reiser4_dealloc_blocks(block, &one, stage, flags); +} + +#define reiser4_grab_space_force(count, flags) \ + reiser4_grab_space(count, flags | BA_FORCE) + +extern void grabbed2free_mark(__u64 mark); +extern int reiser4_grab_reserved(struct super_block *, + __u64, reiser4_ba_flags_t); +extern void reiser4_release_reserved(struct super_block *super); + +/* grabbed 
-> fake_allocated */ + +/* fake_allocated -> used */ + +/* used -> fake_allocated -> grabbed -> free */ + +extern void flush_reserved2grabbed(txn_atom * atom, __u64 count); + +extern int blocknr_is_fake(const reiser4_block_nr * da); + +extern void grabbed2cluster_reserved(int count); +extern void cluster_reserved2grabbed(int count); +extern void cluster_reserved2free(int count); + +extern int check_block_counters(const struct super_block *); + +#if REISER4_DEBUG + +extern void reiser4_check_block(const reiser4_block_nr *, int); + +#else + +# define reiser4_check_block(beg, val) noop + +#endif + +extern int pre_commit_hook(void); +extern void post_commit_hook(void); +extern void post_write_back_hook(void); + +#endif /* __FS_REISER4_BLOCK_ALLOC_H__ */ + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/blocknrset.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/blocknrset.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,365 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* This file contains code for various block number sets used by the atom to + track the deleted set and wandered block mappings. */ + +#include "debug.h" +#include "dformat.h" +#include "type_safe_list.h" +#include "txnmgr.h" + +#include + +/* The proposed data structure for storing unordered block number sets is a + list of elements, each of which contains an array of block numbers and/or an + array of block number pairs. Each element, called blocknr_set_entry, stores + single block numbers from the beginning and extents from the end of + its data area (the ->entries[] array). The ->nr_singles and ->nr_pairs fields + count the numbers of single blocks and extents. + + +------------------ blocknr_set_entry->entries ----------------+ + |block1|block2| ... ...
|pair3|pair2|pair1| + +------------------------------------------------------------+ + + When current blocknr_set_entry is full, allocate a new one. */ + +/* Usage examples: blocknr sets are used in reiser4 for storing atom's delete + * set (single blocks and block extents), in that case blocknr pair represent an + * extent; atom's wandered map is also stored as a blocknr set, blocknr pairs + * there represent a (real block) -> (wandered block) mapping. */ + +typedef struct blocknr_pair blocknr_pair; + +/* The total size of a blocknr_set_entry. */ +#define BLOCKNR_SET_ENTRY_SIZE 128 + +/* The number of blocks that can fit the blocknr data area. */ +#define BLOCKNR_SET_ENTRIES_NUMBER \ + ((BLOCKNR_SET_ENTRY_SIZE - \ + 2 * sizeof (unsigned) - \ + sizeof (blocknr_set_list_link)) / \ + sizeof (reiser4_block_nr)) + +/* An entry of the blocknr_set */ +struct blocknr_set_entry { + unsigned nr_singles; + unsigned nr_pairs; + blocknr_set_list_link link; + reiser4_block_nr entries[BLOCKNR_SET_ENTRIES_NUMBER]; +}; + +/* A pair of blocks as recorded in the blocknr_set_entry data. */ +struct blocknr_pair { + reiser4_block_nr a; + reiser4_block_nr b; +}; + +/* The list definition. */ +TYPE_SAFE_LIST_DEFINE(blocknr_set, blocknr_set_entry, link); + +/* Return the number of blocknr slots available in a blocknr_set_entry. */ +/* Audited by: green(2002.06.11) */ +static unsigned +bse_avail(blocknr_set_entry * bse) +{ + unsigned used = bse->nr_singles + 2 * bse->nr_pairs; + + assert("jmacd-5088", BLOCKNR_SET_ENTRIES_NUMBER >= used); + cassert(sizeof (blocknr_set_entry) == BLOCKNR_SET_ENTRY_SIZE); + + return BLOCKNR_SET_ENTRIES_NUMBER - used; +} + +/* Initialize a blocknr_set_entry. */ +/* Audited by: green(2002.06.11) */ +static void +bse_init(blocknr_set_entry * bse) +{ + bse->nr_singles = 0; + bse->nr_pairs = 0; + blocknr_set_list_clean(bse); +} + +/* Allocate and initialize a blocknr_set_entry. 
*/ +/* Audited by: green(2002.06.11) */ +static blocknr_set_entry * +bse_alloc(void) +{ + blocknr_set_entry *e; + + if ((e = (blocknr_set_entry *) kmalloc(sizeof (blocknr_set_entry), GFP_KERNEL)) == NULL) { + return NULL; + } + + bse_init(e); + + return e; +} + +/* Free a blocknr_set_entry. */ +/* Audited by: green(2002.06.11) */ +static void +bse_free(blocknr_set_entry * bse) +{ + kfree(bse); +} + +/* Add a block number to a blocknr_set_entry */ +/* Audited by: green(2002.06.11) */ +static void +bse_put_single(blocknr_set_entry * bse, const reiser4_block_nr * block) +{ + assert("jmacd-5099", bse_avail(bse) >= 1); + + bse->entries[bse->nr_singles++] = *block; +} + +/* Get a pair of block numbers */ +/* Audited by: green(2002.06.11) */ +static inline blocknr_pair * +bse_get_pair(blocknr_set_entry * bse, unsigned pno) +{ + assert("green-1", BLOCKNR_SET_ENTRIES_NUMBER >= 2 * (pno + 1)); + + return (blocknr_pair *) (bse->entries + BLOCKNR_SET_ENTRIES_NUMBER - 2 * (pno + 1)); +} + +/* Add a pair of block numbers to a blocknr_set_entry */ +/* Audited by: green(2002.06.11) */ +static void +bse_put_pair(blocknr_set_entry * bse, const reiser4_block_nr * a, const reiser4_block_nr * b) +{ + blocknr_pair *pair; + + assert("jmacd-5100", bse_avail(bse) >= 2 && a != NULL && b != NULL); + + pair = bse_get_pair(bse, bse->nr_pairs++); + + pair->a = *a; + pair->b = *b; +} + +/* Add either a block or pair of blocks to the block number set. The first + blocknr (@a) must be non-NULL. If @b is NULL a single blocknr is added, if + @b is non-NULL a pair is added. The block number set belongs to atom, and + the call is made with the atom lock held. There may not be enough space in + the current blocknr_set_entry. If new_bsep points to a non-NULL + blocknr_set_entry then it will be added to the blocknr_set and new_bsep + will be set to NULL. If new_bsep contains NULL then the atom lock will be + released and a new bse will be allocated in new_bsep. 
E_REPEAT will be + returned with the atom unlocked for the operation to be tried again. If + the operation succeeds, 0 is returned. If new_bsep is non-NULL and not + used during the call, it will be freed automatically. */ +/* Audited by: green(2002.06.11) */ +static int +blocknr_set_add(txn_atom * atom, + blocknr_set * bset, + blocknr_set_entry ** new_bsep, const reiser4_block_nr * a, const reiser4_block_nr * b) +{ + blocknr_set_entry *bse; + unsigned entries_needed; + + assert("jmacd-5101", a != NULL); + + entries_needed = (b == NULL) ? 1 : 2; + if (blocknr_set_list_empty(&bset->entries) || bse_avail(blocknr_set_list_front(&bset->entries)) + < entries_needed) { + /* See if a bse was previously allocated. */ + if (*new_bsep == NULL) { + UNLOCK_ATOM(atom); + *new_bsep = bse_alloc(); + return (*new_bsep != NULL) ? -E_REPEAT : RETERR(-ENOMEM); + } + + /* Put it on the head of the list. */ + blocknr_set_list_push_front(&bset->entries, *new_bsep); + + *new_bsep = NULL; + } + + /* Add the single or pair. */ + bse = blocknr_set_list_front(&bset->entries); + if (b == NULL) { + bse_put_single(bse, a); + } else { + bse_put_pair(bse, a, b); + } + + /* If new_bsep is non-NULL then there was an allocation race, free this copy. */ + if (*new_bsep != NULL) { + bse_free(*new_bsep); + *new_bsep = NULL; + } + + return 0; +} + +/* Add an extent to the block set. If the length is 1, it is treated as a + single block (e.g., reiser4_set_add_block). */ +/* Audited by: green(2002.06.11) */ +/* Auditor note: Entire call chain cannot hold any spinlocks, because + kmalloc might schedule. The only exception is atom spinlock, which is + properly freed. */ +reiser4_internal int +blocknr_set_add_extent(txn_atom * atom, + blocknr_set * bset, + blocknr_set_entry ** new_bsep, const reiser4_block_nr * start, const reiser4_block_nr * len) +{ + assert("jmacd-5102", start != NULL && len != NULL && *len > 0); + return blocknr_set_add(atom, bset, new_bsep, start, *len == 1 ? 
NULL : len); +} + +/* Add a block pair to the block set. It adds exactly a pair, which is checked + * by an assertion that both arguments are not null.*/ +/* Audited by: green(2002.06.11) */ +/* Auditor note: Entire call chain cannot hold any spinlocks, because + kmalloc might schedule. The only exception is atom spinlock, which is + properly freed. */ +reiser4_internal int +blocknr_set_add_pair(txn_atom * atom, + blocknr_set * bset, + blocknr_set_entry ** new_bsep, const reiser4_block_nr * a, const reiser4_block_nr * b) +{ + assert("jmacd-5103", a != NULL && b != NULL); + return blocknr_set_add(atom, bset, new_bsep, a, b); +} + +/* Initialize a blocknr_set. */ +/* Audited by: green(2002.06.11) */ +reiser4_internal void +blocknr_set_init(blocknr_set * bset) +{ + blocknr_set_list_init(&bset->entries); +} + +/* Release the entries of a blocknr_set. */ +/* Audited by: green(2002.06.11) */ +reiser4_internal void +blocknr_set_destroy(blocknr_set * bset) +{ + while (!blocknr_set_list_empty(&bset->entries)) { + bse_free(blocknr_set_list_pop_front(&bset->entries)); + } +} + +/* Merge blocknr_set entries out of @from into @into. */ +/* Audited by: green(2002.06.11) */ +/* Auditor comments: This merge does not know if merged sets contain + blocks pairs (As for wandered sets) or extents, so it cannot really merge + overlapping ranges if there is some. So I believe it may lead to + some blocks being presented several times in one blocknr_set. To help + debugging such problems it might help to check for duplicate entries on + actual processing of this set. Testing this kind of stuff right here is + also complicated by the fact that these sets are not sorted and going + through whole set on each element addition is going to be CPU-heavy task */ +reiser4_internal void +blocknr_set_merge(blocknr_set * from, blocknr_set * into) +{ + blocknr_set_entry *bse_into = NULL; + + /* If @from is empty, no work to perform. 
*/ + if (blocknr_set_list_empty(&from->entries)) { + return; + } + + /* If @into is not empty, try merging partial-entries. */ + if (!blocknr_set_list_empty(&into->entries)) { + + /* Neither set is empty, pop the front two members and try to combine them. */ + blocknr_set_entry *bse_from; + unsigned into_avail; + + bse_into = blocknr_set_list_pop_front(&into->entries); + bse_from = blocknr_set_list_pop_front(&from->entries); + + /* Combine singles. */ + for (into_avail = bse_avail(bse_into); into_avail != 0 && bse_from->nr_singles != 0; into_avail -= 1) { + bse_put_single(bse_into, &bse_from->entries[--bse_from->nr_singles]); + } + + /* Combine pairs. */ + for (; into_avail > 1 && bse_from->nr_pairs != 0; into_avail -= 2) { + blocknr_pair *pair = bse_get_pair(bse_from, --bse_from->nr_pairs); + bse_put_pair(bse_into, &pair->a, &pair->b); + } + + /* If bse_from is empty, delete it now. */ + if (bse_avail(bse_from) == BLOCKNR_SET_ENTRIES_NUMBER) { + bse_free(bse_from); + } else { + /* Otherwise, bse_into is full or nearly full (e.g., + it could have one slot avail and bse_from has one + pair left). Push it back onto the list. bse_from + becomes bse_into, which will be the new partial. */ + blocknr_set_list_push_front(&into->entries, bse_into); + bse_into = bse_from; + } + } + + /* Splice lists together. */ + blocknr_set_list_splice(&into->entries, &from->entries); + + /* Add the partial entry back to the head of the list. */ + if (bse_into != NULL) { + blocknr_set_list_push_front(&into->entries, bse_into); + } +} + +/* Iterate over all blocknr set elements.
*/ +reiser4_internal int +blocknr_set_iterator(txn_atom * atom, blocknr_set * bset, blocknr_set_actor_f actor, void *data, int delete) +{ + + blocknr_set_entry *entry; + + assert("zam-429", atom != NULL); + assert("zam-430", atom_is_protected(atom)); + assert("zam-431", bset != 0); + assert("zam-432", actor != NULL); + + entry = blocknr_set_list_front(&bset->entries); + while (!blocknr_set_list_end(&bset->entries, entry)) { + blocknr_set_entry *tmp = blocknr_set_list_next(entry); + unsigned int i; + int ret; + + for (i = 0; i < entry->nr_singles; i++) { + ret = actor(atom, &entry->entries[i], NULL, data); + + /* We can't break a loop if delete flag is set. */ + if (ret != 0 && !delete) + return ret; + } + + for (i = 0; i < entry->nr_pairs; i++) { + struct blocknr_pair *ab; + + ab = bse_get_pair(entry, i); + + ret = actor(atom, &ab->a, &ab->b, data); + + if (ret != 0 && !delete) + return ret; + } + + if (delete) { + blocknr_set_list_remove(entry); + bse_free(entry); + } + + entry = tmp; + } + + return 0; +} + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/carry.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/carry.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1429 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ +/* Functions to "carry" tree modification(s) upward. */ +/* Tree is modified one level at a time. As we modify a level we accumulate a + set of changes that need to be propagated to the next level. We manage + node locking such that any searches that collide with carrying are + restarted, from the root if necessary. + + Insertion of a new item may result in items being moved among nodes and + this requires the delimiting key to be updated at the least common parent + of the nodes modified to preserve search tree invariants. 
Also, insertion + may require allocation of a new node. A pointer to the new node has to be + inserted into some node on the parent level, etc. + + Tree carrying is meant to be analogous to arithmetic carrying. + + A carry operation is always associated with some node (&carry_node). + + Carry process starts with some initial set of operations to be performed + and an initial set of already locked nodes. Operations are performed one + by one. Performing each single operation has following possible effects: + + - content of carry node associated with operation is modified + - new carry nodes are locked and involved into carry process on this level + - new carry operations are posted to the next level + + After all carry operations on this level are done, process is repeated for + the accumulated sequence on carry operations for the next level. This + starts by trying to lock (in left to right order) all carry nodes + associated with carry operations on the parent level. After this, we decide + whether more nodes are required on the left of already locked set. If so, + all locks taken on the parent level are released, new carry nodes are + added, and locking process repeats. + + It may happen that balancing process fails owing to unrecoverable error on + some of upper levels of a tree (possible causes are io error, failure to + allocate new node, etc.). In this case we should unmount the filesystem, + rebooting if it is the root, and possibly advise the use of fsck. + + USAGE: + + + int some_tree_operation( znode *node, ... ) + { + // Allocate on a stack pool of carry objects: operations and nodes. + // Most carry processes will only take objects from here, without + // dynamic allocation. + +I feel uneasy about this pool. It adds to code complexity, I understand why it exists, but.... 
-Hans + + carry_pool pool; + carry_level lowest_level; + carry_op *op; + + init_carry_pool( &pool ); + init_carry_level( &lowest_level, &pool ); + + // operation may be one of: + // COP_INSERT --- insert new item into node + // COP_CUT --- remove part of or whole node + // COP_PASTE --- increase size of item + // COP_DELETE --- delete pointer from parent node + // COP_UPDATE --- update delimiting key in least + // common ancestor of two + + op = post_carry( &lowest_level, operation, node, 0 ); + if( IS_ERR( op ) || ( op == NULL ) ) { + handle error + } else { + // fill in remaining fields in @op, according to carry.h:carry_op + result = carry( &lowest_level, NULL ); + } + done_carry_pool( &pool ); + } + + When you are implementing node plugin method that participates in carry + (shifting, insertion, deletion, etc.), do the following: + + int foo_node_method( znode *node, ..., carry_level *todo ) + { + carry_op *op; + + .... + + // note that the last argument to node_post_carry() is non-zero + // here, because @op is to be applied to the parent of @node, rather + // than to the @node itself as in the previous case. + + op = node_post_carry( todo, operation, node, 1 ); + // fill in remaining fields in @op, according to carry.h:carry_op + + .... + + } + + BATCHING: + + One of the main advantages of level-by-level balancing implemented here is + the ability to batch updates on a parent level and to perform them more + efficiently as a result. + + Description To Be Done (TBD). + + DIFFICULTIES AND SUBTLE POINTS: + + 1. complex plumbing is required, because: + + a. effective allocation through pools is needed + + b. target of operation is not exactly known when operation is + posted. This is worked around through bitfields in &carry_node and + logic in lock_carry_node() + + c. of interaction with locking code: node should be added into sibling + list when pointer to it is inserted into its parent, which is some time + after node was created.
Between these moments, node is somewhat in + suspended state and is only registered in the carry lists + + 2. whole balancing logic is implemented here, in particular, insertion + logic is coded in make_space(). + + 3. special cases like insertion (add_tree_root()) or deletion + (kill_tree_root()) of tree root and morphing of paste into insert + (insert_paste()) have to be handled. + + 4. there is non-trivial interdependency between allocation of new nodes + and almost everything else. This is mainly due to the (1.c) above. I shall + write about this later. + +*/ + +#include "forward.h" +#include "debug.h" +#include "key.h" +#include "coord.h" +#include "plugin/item/item.h" +#include "plugin/item/extent.h" +#include "plugin/node/node.h" +#include "jnode.h" +#include "znode.h" +#include "tree_mod.h" +#include "tree_walk.h" +#include "block_alloc.h" +#include "pool.h" +#include "tree.h" +#include "carry.h" +#include "carry_ops.h" +#include "super.h" +#include "reiser4.h" + +#include + +/* level locking/unlocking */ +static int lock_carry_level(carry_level * level); +static void unlock_carry_level(carry_level * level, int failure); +static void done_carry_level(carry_level * level); +static void unlock_carry_node(carry_level * level, carry_node * node, int fail); + +int lock_carry_node(carry_level * level, carry_node * node); +int lock_carry_node_tail(carry_node * node); + +/* carry processing proper */ +static int carry_on_level(carry_level * doing, carry_level * todo); + +static carry_op *add_op(carry_level * level, pool_ordering order, carry_op * reference); + +/* handlers for carry operations. 
*/ + +static void fatal_carry_error(carry_level * doing, int ecode); +static int add_new_root(carry_level * level, carry_node * node, znode * fake); + +static int carry_estimate_reserve(carry_level * level); + +static void print_level(const char *prefix, carry_level * level); + +#if REISER4_DEBUG +typedef enum { + CARRY_TODO, + CARRY_DOING +} carry_queue_state; +static int carry_level_invariant(carry_level * level, carry_queue_state state); +#endif + +static int +perthread_pages_reserve(int nrpages, int gfp) +{ + return 0; +} + +static void +perthread_pages_release(int nrpages) +{ +} + +static int +perthread_pages_count(void) +{ + return 0; +} + +/* main entry point for tree balancing. + + Tree carry performs operations from @doing and while doing so accumulates + information about operations to be performed on the next level ("carried" + to the parent level). Carried operations are performed, causing possibly + more operations to be carried upward etc. carry() takes care about + locking and pinning znodes while operating on them. + + For usage, see comment at the top of fs/reiser4/carry.c + +*/ +reiser4_internal int +carry(carry_level * doing /* set of carry operations to be performed */ , + carry_level * done /* set of nodes, already performed at the + * previous level. 
NULL in most cases */ ) +{ + int result = 0; + carry_level done_area; + carry_level todo_area; + /* queue of new requests */ + carry_level *todo; + int wasreserved; + int reserve; + ON_DEBUG(STORE_COUNTERS;) + + assert("nikita-888", doing != NULL); + + todo = &todo_area; + init_carry_level(todo, doing->pool); + if (done == NULL) { + /* queue of requests performed on the previous level */ + done = &done_area; + init_carry_level(done, doing->pool); + } + + wasreserved = perthread_pages_count(); + reserve = carry_estimate_reserve(doing); + result = perthread_pages_reserve(reserve, GFP_KERNEL); + if (result != 0) + return result; + + /* iterate until there is nothing more to do */ + while (result == 0 && doing->ops_num > 0) { + carry_level *tmp; + + /* at this point @done is locked. */ + /* repeat lock/do/unlock while + + (1) lock_carry_level() fails due to deadlock avoidance, or + + (2) carry_on_level() decides that more nodes have to + be involved. + + (3) some unexpected error occurred while balancing on the + upper levels. In this case all changes are rolled back. + + */ + while (1) { + result = lock_carry_level(doing); + if (result == 0) { + /* perform operations from @doing and + accumulate new requests in @todo */ + result = carry_on_level(doing, todo); + if (result == 0) + break; + else if (result != -E_REPEAT || + !doing->restartable) { + warning("nikita-1043", + "Fatal error during carry: %i", + result); + print_level("done", done); + print_level("doing", doing); + print_level("todo", todo); + /* do some rough stuff like aborting + all pending transcrashes and thus + pushing tree back to the consistent + state. Alternatively, just panic.
+ */ + fatal_carry_error(doing, result); + return result; + } + } else if (result != -E_REPEAT) { + fatal_carry_error(doing, result); + return result; + } + unlock_carry_level(doing, 1); + } + /* at this point @done can be safely unlocked */ + done_carry_level(done); + + /* cyclically shift queues */ + tmp = done; + done = doing; + doing = todo; + todo = tmp; + init_carry_level(todo, doing->pool); + + /* give other threads a chance to run */ + preempt_point(); + } + done_carry_level(done); + + assert("nikita-3460", perthread_pages_count() - wasreserved >= 0); + perthread_pages_release(perthread_pages_count() - wasreserved); + + /* all counters, but x_refs should remain the same. x_refs can change + owing to transaction manager */ + ON_DEBUG(CHECK_COUNTERS;) + return result; +} + +/* perform carry operations on given level. + + Optimizations proposed by pooh: + + (1) don't lock all nodes from queue at the same time. Lock nodes lazily as + required; + + (2) unlock node if there are no more operations to be performed upon it and + node didn't add any operation to @todo. This can be implemented by + attaching to each node two counters: counter of operations working on this + node and counter of operations carried upward from this node. + +*/ +static int +carry_on_level(carry_level * doing /* queue of carry operations to + * do on this level */ , + carry_level * todo /* queue where new carry + * operations to be performed on + * the parent level are + * accumulated during @doing + * processing. */ ) +{ + int result; + int (*f) (carry_op *, carry_level *, carry_level *); + carry_op *op; + carry_op *tmp_op; + + assert("nikita-1034", doing != NULL); + assert("nikita-1035", todo != NULL); + + /* @doing->nodes are locked. */ + + /* This function can be split into two phases: analysis and modification. + + Analysis calculates precisely what items should be moved between + nodes. This information is gathered in some structures attached to + each carry_node in a @doing queue.
Analysis also determines whether + new nodes are to be allocated etc. + + After analysis is completed, actual modification is performed. Here + we can take advantage of "batch modification": if there are several + operations acting on the same node, modifications can be performed + more efficiently when batched together. + + Above is an optimization left for the future. + */ + /* Important, but delayed optimization: it's possible to batch + operations together and perform them more efficiently as a + result. For example, deletion of several neighboring items from a + node can be converted to a single ->cut() operation. + + Before processing queue, it should be scanned and "mergeable" + operations merged. + */ + result = 0; + for_all_ops(doing, op, tmp_op) { + carry_opcode opcode; + + assert("nikita-1041", op != NULL); + opcode = op->op; + assert("nikita-1042", op->op < COP_LAST_OP); + f = op_dispatch_table[op->op].handler; + result = f(op, doing, todo); + /* locking can fail with -E_REPEAT. Any different error is fatal + and will be handled by fatal_carry_error() sledgehammer. + */ + if (result != 0) + break; + } + if (result == 0) { + carry_plugin_info info; + carry_node *scan; + carry_node *tmp_scan; + + info.doing = doing; + info.todo = todo; + + assert("nikita-3002", carry_level_invariant(doing, CARRY_DOING)); + for_all_nodes(doing, scan, tmp_scan) { + znode *node; + + node = carry_real(scan); + assert("nikita-2547", node != NULL); + if (node_is_empty(node)) { + result = node_plugin_by_node(node)->prepare_removal(node, &info); + if (result != 0) + break; + } + } + } + return result; +} + +/* post carry operation + + This is main function used by external carry clients: node layout plugins + and tree operations to create new carry operation to be performed on some + level. + + New operation will be included in the @level queue. To actually perform it, + call carry( level, ... ). This function takes write lock on @node. 
Carry + manages all its locks by itself, don't worry about this. + + This function adds operation and node at the end of the queue. It is up to + the caller to guarantee proper ordering of node queue. + +*/ +reiser4_internal carry_op * +post_carry(carry_level * level /* queue where new operation is to + * be posted at */ , + carry_opcode op /* opcode of operation */ , + znode * node /* node on which this operation + * will operate */ , + int apply_to_parent_p /* whether operation will operate + * directly on @node or on its + * parent. */ ) +{ + carry_op *result; + carry_node *child; + + assert("nikita-1046", level != NULL); + assert("nikita-1788", znode_is_write_locked(node)); + + result = add_op(level, POOLO_LAST, NULL); + if (IS_ERR(result)) + return result; + child = add_carry(level, POOLO_LAST, NULL); + if (IS_ERR(child)) { + reiser4_pool_free(&level->pool->op_pool, &result->header); + return (carry_op *) child; + } + result->node = child; + result->op = op; + child->parent = apply_to_parent_p; + if (ZF_ISSET(node, JNODE_ORPHAN)) + child->left_before = 1; + child->node = node; + return result; +} + +/* initialise carry queue */ +reiser4_internal void +init_carry_level(carry_level * level /* level to initialise */ , + carry_pool * pool /* pool @level will allocate objects + * from */ ) +{ + assert("nikita-1045", level != NULL); + assert("nikita-967", pool != NULL); + + memset(level, 0, sizeof *level); + level->pool = pool; + + pool_level_list_init(&level->nodes); + pool_level_list_init(&level->ops); +} + +/* allocate carry pool and initialise pools within queue */ +reiser4_internal carry_pool * +init_carry_pool(void) +{ + carry_pool * pool; + + pool = kmalloc(sizeof(carry_pool), GFP_KERNEL); + if (pool == NULL) + return ERR_PTR(RETERR(-ENOMEM)); + + reiser4_init_pool(&pool->op_pool, sizeof (carry_op), CARRIES_POOL_SIZE, (char *) pool->op); + reiser4_init_pool(&pool->node_pool, sizeof (carry_node), NODES_LOCKED_POOL_SIZE, (char *) pool->node); + return pool; +} + +/*
finish with queue pools */ +reiser4_internal void +done_carry_pool(carry_pool * pool /* pool to destroy */ ) +{ + reiser4_done_pool(&pool->op_pool); + reiser4_done_pool(&pool->node_pool); + kfree(pool); +} + +/* add new carry node to the @level. + + Returns pointer to the new carry node allocated from pool. It's up to + callers to maintain proper order in the @level. Assumption is that if carry + nodes on one level are already sorted and modifications are performed from + left to right, carry nodes added on the parent level will be ordered + automatically. To control ordering use @order and @reference parameters. + +*/ +reiser4_internal carry_node * +add_carry_skip(carry_level * level /* &carry_level to add node + * to */ , + pool_ordering order /* where to insert: at the + * beginning of @level, + * before @reference, after + * @reference, at the end + * of @level */ , + carry_node * reference /* reference node for + * insertion */ ) +{ + ON_DEBUG(carry_node * orig_ref = reference); + + if (order == POOLO_BEFORE) { + reference = find_left_carry(reference, level); + if (reference == NULL) + reference = carry_node_front(level); + else + reference = carry_node_next(reference); + } else if (order == POOLO_AFTER) { + reference = find_right_carry(reference, level); + if (reference == NULL) + reference = carry_node_back(level); + else + reference = carry_node_prev(reference); + } + assert("nikita-2209", + ergo(orig_ref != NULL, + carry_real(reference) == carry_real(orig_ref))); + return add_carry(level, order, reference); +} + +reiser4_internal carry_node * +add_carry(carry_level * level /* &carry_level to add node + * to */ , + pool_ordering order /* where to insert: at the + * beginning of @level, before + * @reference, after @reference, + * at the end of @level */ , + carry_node * reference /* reference node for + * insertion */ ) +{ + carry_node *result; + + result = (carry_node *) add_obj(&level->pool->node_pool, &level->nodes, order, &reference->header); + if
(!IS_ERR(result) && (result != NULL)) + ++level->nodes_num; + return result; +} + +/* add new carry operation to the @level. + + Returns pointer to the new carry operations allocated from pool. It's up to + callers to maintain proper order in the @level. To control ordering use + @order and @reference parameters. + +*/ +static carry_op * +add_op(carry_level * level /* &carry_level to add node to */ , + pool_ordering order /* where to insert: at the beginning of + * @level, before @reference, after + * @reference, at the end of @level */ , + carry_op * reference /* reference node for insertion */ ) +{ + carry_op *result; + + result = (carry_op *) add_obj(&level->pool->op_pool, &level->ops, order, &reference->header); + if (!IS_ERR(result) && (result != NULL)) + ++level->ops_num; + return result; +} + +/* Return node on the right of which @node was created. + + Each node is created on the right of some existing node (or it is new root, + which is special case not handled here). + + @node is new node created on some level, but not yet inserted into its + parent, it has corresponding bit (JNODE_ORPHAN) set in zstate. 
+ +*/ +static carry_node * +find_begetting_brother(carry_node * node /* node to start search + * from */ , + carry_level * kin UNUSED_ARG /* level to + * scan */ ) +{ + carry_node *scan; + + assert("nikita-1614", node != NULL); + assert("nikita-1615", kin != NULL); + assert("nikita-1616", LOCK_CNT_GTZ(rw_locked_tree)); + assert("nikita-1619", ergo(carry_real(node) != NULL, + ZF_ISSET(carry_real(node), JNODE_ORPHAN))); + + for (scan = node;; scan = carry_node_prev(scan)) { + assert("nikita-1617", !carry_node_end(kin, scan)); + if ((scan->node != node->node) && !ZF_ISSET(scan->node, JNODE_ORPHAN)) { + assert("nikita-1618", carry_real(scan) != NULL); + break; + } + } + return scan; +} + +static cmp_t +carry_node_cmp(carry_level * level, carry_node * n1, carry_node * n2) +{ + assert("nikita-2199", n1 != NULL); + assert("nikita-2200", n2 != NULL); + + if (n1 == n2) + return EQUAL_TO; + while (1) { + n1 = carry_node_next(n1); + if (carry_node_end(level, n1)) + return GREATER_THAN; + if (n1 == n2) + return LESS_THAN; + } + impossible("nikita-2201", "End of level reached"); +} + +reiser4_internal carry_node * +find_carry_node(carry_level * level, const znode * node) +{ + carry_node *scan; + carry_node *tmp_scan; + + assert("nikita-2202", level != NULL); + assert("nikita-2203", node != NULL); + + for_all_nodes(level, scan, tmp_scan) { + if (carry_real(scan) == node) + return scan; + } + return NULL; +} + +reiser4_internal znode * +carry_real(const carry_node * node) +{ + assert("nikita-3061", node != NULL); + + return node->lock_handle.node; +} + +reiser4_internal carry_node * +insert_carry_node(carry_level * doing, carry_level * todo, const znode * node) +{ + carry_node *base; + carry_node *scan; + carry_node *tmp_scan; + carry_node *proj; + + base = find_carry_node(doing, node); + assert("nikita-2204", base != NULL); + + for_all_nodes(todo, scan, tmp_scan) { + proj = find_carry_node(doing, scan->node); + assert("nikita-2205", proj != NULL); + if (carry_node_cmp(doing, 
proj, base) != LESS_THAN) + break; + } + return scan; +} + +static carry_node * +add_carry_atplace(carry_level *doing, carry_level *todo, znode *node) +{ + carry_node *reference; + + assert("nikita-2994", doing != NULL); + assert("nikita-2995", todo != NULL); + assert("nikita-2996", node != NULL); + + reference = insert_carry_node(doing, todo, node); + assert("nikita-2997", reference != NULL); + + return add_carry(todo, POOLO_BEFORE, reference); +} + +/* like post_carry(), but designed to be called from node plugin methods. + This function is different from post_carry() in that it finds proper place + to insert node in the queue. */ +reiser4_internal carry_op * +node_post_carry(carry_plugin_info * info /* carry parameters + * passed down to node + * plugin */ , + carry_opcode op /* opcode of operation */ , + znode * node /* node on which this + * operation will operate */ , + int apply_to_parent_p /* whether operation will + * operate directly on @node + * or on its parent. */ ) +{ + carry_op *result; + carry_node *child; + + assert("nikita-2207", info != NULL); + assert("nikita-2208", info->todo != NULL); + + if (info->doing == NULL) + return post_carry(info->todo, op, node, apply_to_parent_p); + + result = add_op(info->todo, POOLO_LAST, NULL); + if (IS_ERR(result)) + return result; + child = add_carry_atplace(info->doing, info->todo, node); + if (IS_ERR(child)) { + reiser4_pool_free(&info->todo->pool->op_pool, &result->header); + return (carry_op *) child; + } + result->node = child; + result->op = op; + child->parent = apply_to_parent_p; + if (ZF_ISSET(node, JNODE_ORPHAN)) + child->left_before = 1; + child->node = node; + return result; +} + +/* lock all carry nodes in @level */ +static int +lock_carry_level(carry_level * level /* level to lock */ ) +{ + int result; + carry_node *node; + carry_node *tmp_node; + + assert("nikita-881", level != NULL); + assert("nikita-2229", carry_level_invariant(level, CARRY_TODO)); + + /* lock nodes from left to right */ + result 
= 0; + for_all_nodes(level, node, tmp_node) { + result = lock_carry_node(level, node); + if (result != 0) + break; + } + return result; +} + +/* Synchronize delimiting keys between @node and its left neighbor. + + To reduce contention on dk key and simplify carry code, we synchronize + delimiting keys only when carry ultimately leaves tree level (carrying + changes upward) and unlocks nodes at this level. + + This function first finds left neighbor of @node and then updates left + neighbor's right delimiting key to coincide with least key in @node. + +*/ + +ON_DEBUG(extern atomic_t delim_key_version;) + +static void +sync_dkeys(znode *spot /* node to update */) +{ + reiser4_key pivot; + reiser4_tree *tree; + + assert("nikita-1610", spot != NULL); + assert("nikita-1612", LOCK_CNT_NIL(rw_locked_dk)); + + tree = znode_get_tree(spot); + RLOCK_TREE(tree); + WLOCK_DK(tree); + + assert("nikita-2192", znode_is_loaded(spot)); + + /* sync left delimiting key of @spot with key in its leftmost item */ + if (node_is_empty(spot)) + pivot = *znode_get_rd_key(spot); + else + leftmost_key_in_node(spot, &pivot); + + znode_set_ld_key(spot, &pivot); + + /* there can be sequence of empty nodes pending removal on the left of + @spot. Scan them and update their left and right delimiting keys to + match left delimiting key of @spot. Also, update right delimiting + key of first non-empty left neighbor. + */ + while (1) { + if (!ZF_ISSET(spot, JNODE_LEFT_CONNECTED)) + break; + + spot = spot->left; + if (spot == NULL) + break; + +#if 0 + /* on the leaf level we can only increase right delimiting key + * of a node on which we don't hold a long term lock. 
*/ + assert("nikita-2930", + ergo(!znode_is_write_locked(spot) && + znode_get_level(spot) == LEAF_LEVEL, + keyge(&pivot, znode_get_rd_key(spot)))); +#endif + + znode_set_rd_key(spot, &pivot); + /* don't sink into the domain of another balancing */ + if (!znode_is_write_locked(spot)) + break; + if (ZF_ISSET(spot, JNODE_HEARD_BANSHEE)) + znode_set_ld_key(spot, &pivot); + else + break; + } + + WUNLOCK_DK(tree); + RUNLOCK_TREE(tree); +} + +ON_DEBUG(void check_dkeys(const znode *);) + +/* unlock all carry nodes in @level */ +static void +unlock_carry_level(carry_level * level /* level to unlock */ , + int failure /* true if unlocking owing to + * failure */ ) +{ + carry_node *node; + carry_node *tmp_node; + + assert("nikita-889", level != NULL); + + if (!failure) { + znode *spot; + + spot = NULL; + /* update delimiting keys */ + for_all_nodes(level, node, tmp_node) { + if (carry_real(node) != spot) { + spot = carry_real(node); + sync_dkeys(spot); + } + } + } + + /* nodes can be unlocked in arbitrary order. In preemptible + environment it's better to unlock in reverse order of locking, + though. + */ + for_all_nodes_back(level, node, tmp_node) { + /* all allocated nodes should be already linked to their + parents at this moment. 
*/ + assert("nikita-1631", ergo(!failure, !ZF_ISSET(carry_real(node), + JNODE_ORPHAN))); + ON_DEBUG(check_dkeys(carry_real(node))); + unlock_carry_node(level, node, failure); + } + level->new_root = NULL; +} + +/* finish with @level + + Unlock nodes and release all allocated resources */ +static void +done_carry_level(carry_level * level /* level to finish */ ) +{ + carry_node *node; + carry_node *tmp_node; + carry_op *op; + carry_op *tmp_op; + + assert("nikita-1076", level != NULL); + + unlock_carry_level(level, 0); + for_all_nodes(level, node, tmp_node) { + assert("nikita-2113", locks_list_is_clean(&node->lock_handle)); + assert("nikita-2114", owners_list_is_clean(&node->lock_handle)); + reiser4_pool_free(&level->pool->node_pool, &node->header); + } + for_all_ops(level, op, tmp_op) + reiser4_pool_free(&level->pool->op_pool, &op->header); +} + +/* helper function to complete locking of carry node + + Finish locking of carry node. There are several ways in which new carry + node can be added into carry level and locked. Normal is through + lock_carry_node(), but also from find_{left|right}_neighbor(). This + function factors out common final part of all locking scenarios. It + supposes that @node -> lock_handle is lock handle for lock just taken and + fills ->real_node from this lock handle. + +*/ +reiser4_internal int +lock_carry_node_tail(carry_node * node /* node to complete locking of */ ) +{ + assert("nikita-1052", node != NULL); + assert("nikita-1187", carry_real(node) != NULL); + assert("nikita-1188", !node->unlock); + + node->unlock = 1; + /* Load node content into memory and install node plugin by + looking at the node header. + + Most of the time this call is cheap because the node is + already in memory. + + Corresponding zrelse() is in unlock_carry_node() + */ + return zload(carry_real(node)); +} + +/* lock carry node + + "Resolve" node to real znode, lock it and mark as locked. + This requires recursive locking of znodes. 
+ + When operation is posted to the parent level, node it will be applied to is + not yet known. For example, when shifting data between two nodes, + delimiting key has to be updated in parent or parents of nodes involved. But + their parents are not yet locked and, moreover, said nodes can be reparented + by concurrent balancing. + + To work around this, carry operation is applied to special "carry node" + rather than to the znode itself. Carry node consists of some "base" or + "reference" znode and flags indicating how to get to the target of carry + operation (->real_node field of carry_node) from base. + +*/ +reiser4_internal int +lock_carry_node(carry_level * level /* level @node is in */ , + carry_node * node /* node to lock */ ) +{ + int result; + znode *reference_point; + lock_handle lh; + lock_handle tmp_lh; + + assert("nikita-887", level != NULL); + assert("nikita-882", node != NULL); + + result = 0; + reference_point = node->node; + init_lh(&lh); + init_lh(&tmp_lh); + if (node->left_before) { + /* handling of new nodes, allocated on the previous level: + + some carry ops were probably posted from the new node, but + this node neither has parent pointer set, nor is + connected. This will be done in ->create_hook() for + internal item. + + Nonetheless, parent of new node has to be locked. To do + this, first go to the "left" in the carry order. This + depends on the decision to always allocate new node on the + right of existing one. + + Loop handles case when multiple nodes, all orphans, were + inserted. + + Strictly speaking, taking tree lock is not necessary here, + because all nodes scanned by loop in + find_begetting_brother() are write-locked by this thread, + and thus, their sibling linkage cannot change. 
+ + */ + reference_point = UNDER_RW + (tree, znode_get_tree(reference_point), read, + find_begetting_brother(node, level)->node); + assert("nikita-1186", reference_point != NULL); + } + if (node->parent && (result == 0)) { + result = reiser4_get_parent(&tmp_lh, reference_point, ZNODE_WRITE_LOCK, 0); + if (result != 0) { + ; /* nothing */ + } else if (znode_get_level(tmp_lh.node) == 0) { + assert("nikita-1347", znode_above_root(tmp_lh.node)); + result = add_new_root(level, node, tmp_lh.node); + if (result == 0) { + reference_point = level->new_root; + move_lh(&lh, &node->lock_handle); + } + } else if ((level->new_root != NULL) && (level->new_root != znode_parent_nolock(reference_point))) { + /* parent of node exists, but this level already + created different new root, so */ + warning("nikita-1109", + /* it should be "radicis", but tradition is + tradition. do banshees read latin? */ + "hodie natus est radici frater"); + result = -EIO; + } else { + move_lh(&lh, &tmp_lh); + reference_point = lh.node; + } + } + if (node->left && (result == 0)) { + assert("nikita-1183", node->parent); + assert("nikita-883", reference_point != NULL); + result = reiser4_get_left_neighbor( + &tmp_lh, reference_point, ZNODE_WRITE_LOCK, GN_CAN_USE_UPPER_LEVELS); + if (result == 0) { + done_lh(&lh); + move_lh(&lh, &tmp_lh); + reference_point = lh.node; + } + } + if (!node->parent && !node->left && !node->left_before) { + result = longterm_lock_znode(&lh, reference_point, ZNODE_WRITE_LOCK, ZNODE_LOCK_HIPRI); + } + if (result == 0) { + move_lh(&node->lock_handle, &lh); + result = lock_carry_node_tail(node); + } + done_lh(&tmp_lh); + done_lh(&lh); + return result; +} + +/* release a lock on &carry_node. + + Release if necessary lock on @node. This operation is pair of + lock_carry_node() and is idempotent: you can call it more than once on the + same node. 
+ +*/ +static void +unlock_carry_node(carry_level * level, + carry_node * node /* node to be released */ , + int failure /* non-zero if node is unlocked due + * to some error */ ) +{ + znode *real_node; + + assert("nikita-884", node != NULL); + + real_node = carry_real(node); + /* pair to zload() in lock_carry_node_tail() */ + zrelse(real_node); + if (node->unlock && (real_node != NULL)) { + assert("nikita-899", real_node == node->lock_handle.node); + longterm_unlock_znode(&node->lock_handle); + } + if (failure) { + if (node->deallocate && (real_node != NULL)) { + /* free node in bitmap + + Prepare node for removal. Last zput() will finish + with it. + */ + ZF_SET(real_node, JNODE_HEARD_BANSHEE); + } + if (node->free) { + assert("nikita-2177", locks_list_is_clean(&node->lock_handle)); + assert("nikita-2112", owners_list_is_clean(&node->lock_handle)); + reiser4_pool_free(&level->pool->node_pool, &node->header); + } + } +} + +/* fatal_carry_error() - all-catching error handling function + + It is possible that carry faces unrecoverable error, like inability to + insert pointer at the internal level. Our simple solution is just panic in + this situation. More sophisticated things like attempt to remount + file-system as read-only can be implemented without much difficulties. + + It is believed, that: + + 1. instead of panicking, all current transactions can be aborted rolling + system back to the consistent state. + +Umm, if you simply panic without doing anything more at all, then all current +transactions are aborted and the system is rolled back to a consistent state, +by virtue of the design of the transactional mechanism. Well, wait, let's be +precise. If an internal node is corrupted on disk due to hardware failure, +then there may be no consistent state that can be rolled back to, so instead +we should say that it will roll back the transactions, which barring other +factors means rolling back to a consistent state. 
+ +# Nikita: there is a subtle difference between panic and aborting +# transactions: machine doesn't reboot. Processes aren't killed. Processes +# not using reiser4 (not that we care about such processes), or using other +# reiser4 mounts (about them we do care) will simply continue to run. With +# some luck, even application using aborted file system can survive: it will +# get some error, like EBADF, from each file descriptor on failed file system, +# but applications that do care about tolerance will cope with this (squid +# will). + +It would be a nice feature though to support rollback without rebooting +followed by remount, but this can wait for later versions. + + + 2. once isolated transactions are implemented it will be possible to + roll back offending transaction. + +2. is additional code complexity of inconsistent value (it implies that a broken tree should be kept in operation), so we must think about +it more before deciding if it should be done. -Hans + +*/ +static void +fatal_carry_error(carry_level * doing UNUSED_ARG /* carry level + * where + * unrecoverable + * error + * occurred */ , + int ecode /* error code */ ) +{ + assert("nikita-1230", doing != NULL); + assert("nikita-1231", ecode < 0); + + reiser4_panic("nikita-1232", "Carry failed: %i", ecode); +} + +/* add new root to the tree + + This function itself only manages changes in carry structures and delegates + all hard work (allocation of znode for new root, changes of parent and + sibling pointers) to add_tree_root(). + + Locking: old tree root is locked by carry at this point. Fake znode is also + locked. 
+ +*/ +static int +add_new_root(carry_level * level /* carry level in context of which + * operation is performed */ , + carry_node * node /* carry node for existing root */ , + znode * fake /* "fake" znode already locked by + * us */ ) +{ + int result; + + assert("nikita-1104", level != NULL); + assert("nikita-1105", node != NULL); + + assert("nikita-1403", znode_is_write_locked(node->node)); + assert("nikita-1404", znode_is_write_locked(fake)); + + /* trying to create new root. */ + /* @node is root and it's already locked by us. This + means that nobody else can be trying to add/remove + tree root right now. + */ + if (level->new_root == NULL) + level->new_root = add_tree_root(node->node, fake); + if (!IS_ERR(level->new_root)) { + assert("nikita-1210", znode_is_root(level->new_root)); + node->deallocate = 1; + result = longterm_lock_znode(&node->lock_handle, level->new_root, ZNODE_WRITE_LOCK, ZNODE_LOCK_LOPRI); + if (result == 0) + zput(level->new_root); + } else { + result = PTR_ERR(level->new_root); + level->new_root = NULL; + } + return result; +} + +/* allocate new znode and add the operation that inserts the + pointer to it into the parent node into the todo level + + Allocate new znode, add it into carry queue and post into @todo queue + request to add pointer to new node into its parent. + + This is carry related routine that calls new_node() to allocate new + node. +*/ +reiser4_internal carry_node * +add_new_znode(znode * brother /* existing left neighbor of new + * node */ , + carry_node * ref /* carry node after which new + * carry node is to be inserted + * into queue. This affects + * locking. 
*/ , + carry_level * doing /* carry queue where new node is + * to be added */ , + carry_level * todo /* carry queue where COP_INSERT + * operation to add pointer to + * new node will be added */ ) +{ + carry_node *fresh; + znode *new_znode; + carry_op *add_pointer; + carry_plugin_info info; + + assert("nikita-1048", brother != NULL); + assert("nikita-1049", todo != NULL); + + /* There are many possible variations here: to what parent + new node will be attached and where. For simplicity, always + do the following: + + (1) new node and @brother will have the same parent. + + (2) new node is added on the right of @brother + + */ + + fresh = add_carry_skip(doing, ref ? POOLO_AFTER : POOLO_LAST, ref); + if (IS_ERR(fresh)) + return fresh; + + fresh->deallocate = 1; + fresh->free = 1; + + new_znode = new_node(brother, znode_get_level(brother)); + if (IS_ERR(new_znode)) + /* @fresh will be deallocated automatically by error + handling code in the caller. */ + return (carry_node *) new_znode; + + /* new_node() returned znode with x_count 1. Caller has to decrease + it. make_space() does. */ + + ZF_SET(new_znode, JNODE_ORPHAN); + fresh->node = new_znode; + + while (ZF_ISSET(carry_real(ref), JNODE_ORPHAN)) { + ref = carry_node_prev(ref); + assert("nikita-1606", !carry_node_end(doing, ref)); + } + + info.todo = todo; + info.doing = doing; + add_pointer = node_post_carry(&info, COP_INSERT, carry_real(ref), 1); + if (IS_ERR(add_pointer)) { + /* no need to deallocate @new_znode here: it will be + deallocated during carry error handling. 
*/ + return (carry_node *) add_pointer; + } + + add_pointer->u.insert.type = COPT_CHILD; + add_pointer->u.insert.child = fresh; + add_pointer->u.insert.brother = brother; + /* initially new node spans empty key range */ + WLOCK_DK(znode_get_tree(brother)); + znode_set_ld_key(new_znode, + znode_set_rd_key(new_znode, znode_get_rd_key(brother))); + WUNLOCK_DK(znode_get_tree(brother)); + return fresh; +} + +/* + * Estimate how many pages of memory have to be reserved to complete execution + * of @level. + */ +static int carry_estimate_reserve(carry_level * level) +{ + carry_op *op; + carry_op *tmp_op; + int result; + + result = 0; + for_all_ops(level, op, tmp_op) + result += op_dispatch_table[op->op].estimate(op, level); + return result; +} + +/* DEBUGGING FUNCTIONS. + + Probably we also should leave them on even when + debugging is turned off to print dumps at errors. +*/ +#if REISER4_DEBUG +static int +carry_level_invariant(carry_level * level, carry_queue_state state) +{ + carry_node *node; + carry_node *tmp_node; + + if (level == NULL) + return 0; + + if (level->track_type != 0 && + level->track_type != CARRY_TRACK_NODE && + level->track_type != CARRY_TRACK_CHANGE) + return 0; + + /* check that nodes are in ascending order */ + for_all_nodes(level, node, tmp_node) { + znode *left; + znode *right; + + reiser4_key lkey; + reiser4_key rkey; + + if (node != carry_node_front(level)) { + if (state == CARRY_TODO) { + right = node->node; + left = carry_node_prev(node)->node; + } else { + right = carry_real(node); + left = carry_real(carry_node_prev(node)); + } + if (right == NULL || left == NULL) + continue; + if (node_is_empty(right) || node_is_empty(left)) + continue; + if (!keyle(leftmost_key_in_node(left, &lkey), + leftmost_key_in_node(right, &rkey))) { + print_znode("left", left); + print_znode("right", right); + return 0; + } + } + } + return 1; +} +#endif + +/* get symbolic name for boolean */ +static const char * +tf(int boolean /* truth value */ ) +{ + return 
boolean ? "t" : "f"; +} + +/* symbolic name for carry operation */ +static const char * +carry_op_name(carry_opcode op /* carry opcode */ ) +{ + switch (op) { + case COP_INSERT: + return "COP_INSERT"; + case COP_DELETE: + return "COP_DELETE"; + case COP_CUT: + return "COP_CUT"; + case COP_PASTE: + return "COP_PASTE"; + case COP_UPDATE: + return "COP_UPDATE"; + case COP_EXTENT: + return "COP_EXTENT"; + case COP_INSERT_FLOW: + return "COP_INSERT_FLOW"; + default:{ + /* not mt safe, but who cares? */ + static char buf[20]; + + sprintf(buf, "unknown op: %x", op); + return buf; + } + } +} + +/* dump information about carry node */ +static void +print_carry(const char *prefix /* prefix to print */ , + carry_node * node /* node to print */ ) +{ + if (node == NULL) { + printk("%s: null\n", prefix); + return; + } + printk("%s: %p parent: %s, left: %s, unlock: %s, free: %s, dealloc: %s\n", + prefix, node, tf(node->parent), tf(node->left), tf(node->unlock), tf(node->free), tf(node->deallocate)); + print_znode("\tnode", node->node); + print_znode("\treal_node", carry_real(node)); +} + +/* dump information about carry operation */ +static void +print_op(const char *prefix /* prefix to print */ , + carry_op * op /* operation to print */ ) +{ + if (op == NULL) { + printk("%s: null\n", prefix); + return; + } + printk("%s: %p carry_opcode: %s\n", prefix, op, carry_op_name(op->op)); + print_carry("\tnode", op->node); + switch (op->op) { + case COP_INSERT: + case COP_PASTE: + print_coord("\tcoord", op->u.insert.d ? op->u.insert.d->coord : NULL, 0); + print_key("\tkey", op->u.insert.d ? 
op->u.insert.d->key : NULL); + print_carry("\tchild", op->u.insert.child); + break; + case COP_DELETE: + print_carry("\tchild", op->u.delete.child); + break; + case COP_CUT: + if (op->u.cut_or_kill.is_cut) { + print_coord("\tfrom", op->u.cut_or_kill.u.kill->params.from, 0); + print_coord("\tto", op->u.cut_or_kill.u.kill->params.to, 0); + } else { + print_coord("\tfrom", op->u.cut_or_kill.u.cut->params.from, 0); + print_coord("\tto", op->u.cut_or_kill.u.cut->params.to, 0); + } + break; + case COP_UPDATE: + print_carry("\tleft", op->u.update.left); + break; + default: + /* do nothing */ + break; + } +} + +/* dump information about all nodes and operations in a @level */ +static void +print_level(const char *prefix /* prefix to print */ , + carry_level * level /* level to print */ ) +{ + carry_node *node; + carry_node *tmp_node; + carry_op *op; + carry_op *tmp_op; + + if (level == NULL) { + printk("%s: null\n", prefix); + return; + } + printk("%s: %p, restartable: %s\n", + prefix, level, tf(level->restartable)); + + for_all_nodes(level, node, tmp_node) + print_carry("\tcarry node", node); + for_all_ops(level, op, tmp_op) + print_op("\tcarry op", op); +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/carry.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/carry.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,418 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Functions and data types to "carry" tree modification(s) upward. + See fs/reiser4/carry.c for details. */ + +#if !defined( __FS_REISER4_CARRY_H__ ) +#define __FS_REISER4_CARRY_H__ + +#include "forward.h" +#include "debug.h" +#include "pool.h" +#include "znode.h" + +#include + +/* &carry_node - "location" of carry node. + + "location" of node that is involved or going to be involved into + carry process. 
Node where operation will be carried to on the + parent level cannot be recorded explicitly. Operation will be carried + usually to the parent of some node (where changes are performed at + the current level) or, to the left neighbor of its parent. But while + modifications are performed at the current level, parent may + change. So, we have to allow some indirection (or, positively, + flexibility) in locating carry nodes. + +*/ +typedef struct carry_node { + /* pool linkage */ + reiser4_pool_header header; + + /* base node from which real_node is calculated. See + fs/reiser4/carry.c:lock_carry_node(). */ + znode *node; + + /* how to get ->real_node */ + /* to get ->real_node obtain parent of ->node*/ + __u32 parent:1; + /* to get ->real_node obtain left neighbor of parent of + ->node*/ + __u32 left:1; + __u32 left_before:1; + + /* locking */ + + /* this node was locked by carry process and should be + unlocked when carry leaves a level */ + __u32 unlock:1; + + /* disk block for this node was allocated by carry process and + should be deallocated when carry leaves a level */ + __u32 deallocate:1; + /* this carry node was allocated by carry process and should be + freed when carry leaves a level */ + __u32 free:1; + + /* type of lock we want to take on this node */ + lock_handle lock_handle; +} carry_node; + +/* &carry_opcode - elementary operations that can be carried upward + + Operations that carry() can handle. This list is supposed to be + expanded. + + Each carry operation (cop) is handled by appropriate function defined + in fs/reiser4/carry.c. For example COP_INSERT is handled by + fs/reiser4/carry.c:carry_insert() etc. These functions in turn + call plugins of nodes affected by operation to modify nodes' content + and to gather operations to be performed on the next level. + +*/ +typedef enum { + /* insert new item into node. */ + COP_INSERT, + /* delete pointer from parent node */ + COP_DELETE, + /* remove part of or whole node. 
*/ + COP_CUT, + /* increase size of item. */ + COP_PASTE, + /* insert extent (that is sequence of unformatted nodes). */ + COP_EXTENT, + /* update delimiting key in least common ancestor of two + nodes. This is performed when items are moved between two + nodes. + */ + COP_UPDATE, + /* insert flow */ + COP_INSERT_FLOW, + COP_LAST_OP, +} carry_opcode; + +#define CARRY_FLOW_NEW_NODES_LIMIT 20 + +/* mode (or subtype) of COP_{INSERT|PASTE} operation. Specifies how target + item is determined. */ +typedef enum { + /* target item is one containing pointer to the ->child node */ + COPT_CHILD, + /* target item is given explicitly by @coord */ + COPT_ITEM_DATA, + /* target item is given by key */ + COPT_KEY, + /* see insert_paste_common() for more comments on this. */ + COPT_PASTE_RESTARTED, +} cop_insert_pos_type; + +/* flags to cut and delete */ +typedef enum { + /* don't kill node even if it became completely empty as a result of + * cut. This is needed for eottl handling. See carry_extent() for + * details. */ + DELETE_RETAIN_EMPTY = (1 << 0) +} cop_delete_flag; + +/* + * carry() implements "lock handle tracking" feature. + * + * Callers supply carry with node where to perform initial operation and lock + * handle on this node. Trying to optimize node utilization carry may actually + * move insertion point to different node. Callers expect that lock handle + * will be transferred to the new node also. + * + */ +typedef enum { + /* transfer lock handle along with insertion point */ + CARRY_TRACK_CHANGE = 1, + /* acquire new lock handle to the node where insertion point is. This + * is used when carry() client doesn't initially possess lock handle + * on the insertion point node, for example, by extent insertion + * code. See carry_extent(). 
*/ + CARRY_TRACK_NODE = 2 +} carry_track_type; + +/* data supplied to COP_{INSERT|PASTE} by callers */ +typedef struct carry_insert_data { + /* position where new item is to be inserted */ + coord_t *coord; + /* new item description */ + reiser4_item_data *data; + /* key of new item */ + const reiser4_key *key; +} carry_insert_data; + +/* cut and kill are similar, so carry_cut_data and carry_kill_data share the below structure of parameters */ +struct cut_kill_params { + /* coord where cut starts (inclusive) */ + coord_t *from; + /* coord where cut stops (inclusive, this item/unit will also be + * cut) */ + coord_t *to; + /* starting key. This is necessary when item and unit pos don't + * uniquely identify what portion of tree to remove. For example, this + * indicates what portion of extent unit will be affected. */ + const reiser4_key *from_key; + /* exclusive stop key */ + const reiser4_key *to_key; + /* if this is not NULL, smallest actually removed key is stored + * here. */ + reiser4_key *smallest_removed; + /* kill_node_content() is called for file truncate */ + int truncate; +}; + +struct carry_cut_data { + struct cut_kill_params params; +}; + +struct carry_kill_data { + struct cut_kill_params params; + /* parameter to be passed to the ->kill_hook() method of item + * plugin */ + /*void *iplug_params;*/ /* FIXME: unused currently */ + /* if not NULL---inode whose items are being removed. This is needed + * for ->kill_hook() of extent item to update VM structures when + * removing pages. */ + struct inode *inode; + /* sibling list maintenance is complicated by existence of eottl. When + * eottl whose left and right neighbors are formatted leaves is + * removed, one has to connect said leaves in the sibling list. This + * cannot be done when extent removal is just started as locking rules + * require sibling list update to happen atomically with removal of + * extent item. Therefore: 1. 
pointers to left and right neighbors + * have to be passed down to the ->kill_hook() of extent item, and + * 2. said neighbors have to be locked. */ + lock_handle *left; + lock_handle *right; + /* flags modifying behavior of kill. Currently, it may have DELETE_RETAIN_EMPTY set. */ + unsigned flags; +}; + +/* &carry_tree_op - operation to "carry" upward. + + Description of an operation we want to "carry" to the upper level of + a tree: e.g., when we insert something and there is not enough space + we allocate a new node and "carry" the operation of inserting a + pointer to the new node to the upper level; on removal of empty node, + we carry up operation of removing appropriate entry from parent. + + There are two types of carry ops: when adding or deleting a node, the + node at the parent level where appropriate modification has to be + performed is known in advance. When shifting items between nodes + (split, merge), delimiting key should be changed in the least common + parent of the nodes involved that is not known in advance. + + For the operations of the first type we store in &carry_op pointer to + the &carry_node at the parent level. For the operation of the second + type we store &carry_node of parents of the left and right nodes + modified and keep track of them upward until they coincide. + +*/ +typedef struct carry_op { + /* pool linkage */ + reiser4_pool_header header; + carry_opcode op; + /* node on which operation is to be performed: + + for insert, paste: node where new item is to be inserted + + for delete: node where pointer is to be deleted + + for cut: node to cut from + + for update: node where delimiting key is to be modified + + for modify: parent of modified node + + */ + carry_node *node; + union { + struct { + /* (sub-)type of insertion/paste. Taken from + cop_insert_pos_type. */ + __u8 type; + /* various operation flags. Taken from + cop_insert_flag. 
*/ + __u8 flags; + carry_insert_data *d; + carry_node *child; + znode *brother; + } insert, paste, extent; + + struct { + int is_cut; + union { + carry_kill_data *kill; + carry_cut_data *cut; + } u; + } cut_or_kill; + + struct { + carry_node *left; + } update; + struct { + /* changed child */ + carry_node *child; + /* bitmask of changes. See &cop_modify_flag */ + __u32 flag; + } modify; + struct { + /* flags to deletion operation. Are taken from + cop_delete_flag */ + __u32 flags; + /* child to delete from parent. If this is + NULL, delete op->node. */ + carry_node *child; + } delete; + struct { + /* various operation flags. Taken from + cop_insert_flag. */ + __u32 flags; + flow_t *flow; + coord_t *insert_point; + reiser4_item_data *data; + /* flow insertion is limited by number of new blocks + added in that operation which do not get any data + but part of flow. This limit is set by macro + CARRY_FLOW_NEW_NODES_LIMIT. This field stores number + of nodes added already during one carry_flow */ + int new_nodes; + } insert_flow; + } u; +} carry_op; + +/* &carry_op_pool - preallocated pool of carry operations, and nodes */ +typedef struct carry_pool { + carry_op op[CARRIES_POOL_SIZE]; + reiser4_pool op_pool; + carry_node node[NODES_LOCKED_POOL_SIZE]; + reiser4_pool node_pool; +} carry_pool; + +/* &carry_tree_level - carry process on given level + + Description of balancing process on the given level. + + No need for locking here, as carry_tree_level is essentially per + thread thing (for now). + +*/ +struct carry_level { + /* this level may be restarted */ + __u32 restartable:1; + /* list of carry nodes on this level, ordered by key order */ + pool_level_list_head nodes; + pool_level_list_head ops; + /* pool where new objects are allocated from */ + carry_pool *pool; + int ops_num; + int nodes_num; + /* new root created on this level, if any */ + znode *new_root; + /* This is set by caller (insert_by_key(), resize_item(), etc.) 
when + they want ->tracked to automagically wander to the node where + insertion point moved after insert or paste. + */ + carry_track_type track_type; + /* lock handle supplied by user that we are tracking. See + above. */ + lock_handle *tracked; +}; + +/* information carry passes to plugin methods that may add new operations to + the @todo queue */ +struct carry_plugin_info { + carry_level *doing; + carry_level *todo; +}; + +int carry(carry_level * doing, carry_level * done); + +carry_node *add_carry(carry_level * level, pool_ordering order, carry_node * reference); +carry_node *add_carry_skip(carry_level * level, pool_ordering order, carry_node * reference); + +extern carry_node *insert_carry_node(carry_level * doing, + carry_level * todo, const znode * node); + +extern carry_pool *init_carry_pool(void); +extern void done_carry_pool(carry_pool * pool); + +extern void init_carry_level(carry_level * level, carry_pool * pool); + +extern carry_op *post_carry(carry_level * level, carry_opcode op, znode * node, int apply_to_parent); +extern carry_op *node_post_carry(carry_plugin_info * info, carry_opcode op, znode * node, int apply_to_parent_p); + +carry_node *add_new_znode(znode * brother, carry_node * reference, carry_level * doing, carry_level * todo); + +carry_node *find_carry_node(carry_level * level, const znode * node); + +extern znode *carry_real(const carry_node * node); + +/* helper macros to iterate over carry queues */ + +#define carry_node_next( node ) \ + ( ( carry_node * ) pool_level_list_next( &( node ) -> header ) ) + +#define carry_node_prev( node ) \ + ( ( carry_node * ) pool_level_list_prev( &( node ) -> header ) ) + +#define carry_node_front( level ) \ + ( ( carry_node * ) pool_level_list_front( &( level ) -> nodes ) ) + +#define carry_node_back( level ) \ + ( ( carry_node * ) pool_level_list_back( &( level ) -> nodes ) ) + +#define carry_node_end( level, node ) \ + ( pool_level_list_end( &( level ) -> nodes, &( node ) -> header ) ) + +/* macro to 
iterate over all operations in a @level */ +#define for_all_ops( level /* carry level (of type carry_level *) */, \ + op /* pointer to carry operation, modified by loop (of \ + * type carry_op *) */, \ + tmp /* pointer to carry operation (of type carry_op *), \ + * used to make iterator stable in the face of \ + * deletions from the level */ ) \ +for( op = ( carry_op * ) pool_level_list_front( &level -> ops ), \ + tmp = ( carry_op * ) pool_level_list_next( &op -> header ) ; \ + ! pool_level_list_end( &level -> ops, &op -> header ) ; \ + op = tmp, tmp = ( carry_op * ) pool_level_list_next( &op -> header ) ) + +/* macro to iterate over all nodes in a @level */ +#define for_all_nodes( level /* carry level (of type carry_level *) */, \ + node /* pointer to carry node, modified by loop (of \ + * type carry_node *) */, \ + tmp /* pointer to carry node (of type carry_node *), \ + * used to make iterator stable in the face of * \ + * deletions from the level */ ) \ +for( node = carry_node_front( level ), \ + tmp = carry_node_next( node ) ; ! carry_node_end( level, node ) ; \ + node = tmp, tmp = carry_node_next( node ) ) + +/* macro to iterate over all nodes in a @level in reverse order + + This is used, because nodes are unlocked in reversed order of locking */ +#define for_all_nodes_back( level /* carry level (of type carry_level *) */, \ + node /* pointer to carry node, modified by loop \ + * (of type carry_node *) */, \ + tmp /* pointer to carry node (of type carry_node \ + * *), used to make iterator stable in the \ + * face of deletions from the level */ ) \ +for( node = carry_node_back( level ), \ + tmp = carry_node_prev( node ) ; ! carry_node_end( level, node ) ; \ + node = tmp, tmp = carry_node_prev( node ) ) + +/* __FS_REISER4_CARRY_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/carry_ops.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/carry_ops.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,2107 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* implementation of carry operations */ + +#include "forward.h" +#include "debug.h" +#include "key.h" +#include "coord.h" +#include "plugin/item/item.h" +#include "plugin/node/node.h" +#include "jnode.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree_walk.h" +#include "pool.h" +#include "tree_mod.h" +#include "carry.h" +#include "carry_ops.h" +#include "tree.h" +#include "super.h" +#include "reiser4.h" + +#include +#include + +static int carry_shift_data(sideof side, coord_t * insert_coord, znode * node, + carry_level * doing, carry_level * todo, unsigned int including_insert_coord_p); + +extern int lock_carry_node(carry_level * level, carry_node * node); +extern int lock_carry_node_tail(carry_node * node); + +/* find left neighbor of a carry node + + Look for left neighbor of @node and add it to the @doing queue. See + comments in the body. + +*/ +static carry_node * +find_left_neighbor(carry_op * op /* node to find left + * neighbor of */ , + carry_level * doing /* level to scan */ ) +{ + int result; + carry_node *node; + carry_node *left; + int flags; + reiser4_tree *tree; + + node = op->node; + + tree = current_tree; + RLOCK_TREE(tree); + /* first, check whether left neighbor is already in a @doing queue */ + if (carry_real(node)->left != NULL) { + /* NOTE: there is locking subtlety here. 
Look into + * find_right_neighbor() for more info */ + if (find_carry_node(doing, carry_real(node)->left) != NULL) { + RUNLOCK_TREE(tree); + left = node; + do { + left = carry_node_prev(left); + assert("nikita-3408", !carry_node_end(doing, + left)); + } while (carry_real(left) == carry_real(node)); + return left; + } + } + RUNLOCK_TREE(tree); + + left = add_carry_skip(doing, POOLO_BEFORE, node); + if (IS_ERR(left)) + return left; + + left->node = node->node; + left->free = 1; + + flags = GN_TRY_LOCK; + if (!(op->u.insert.flags & COPI_LOAD_LEFT)) + flags |= GN_NO_ALLOC; + + /* then, feeling lucky, peek left neighbor in the cache. */ + result = reiser4_get_left_neighbor(&left->lock_handle, carry_real(node), + ZNODE_WRITE_LOCK, flags); + if (result == 0) { + /* ok, node found and locked. */ + result = lock_carry_node_tail(left); + if (result != 0) + left = ERR_PTR(result); + } else if (result == -E_NO_NEIGHBOR || result == -ENOENT) { + /* node is leftmost node in a tree, or neighbor wasn't in + cache, or there is an extent on the left. */ + reiser4_pool_free(&doing->pool->node_pool, &left->header); + left = NULL; + } else if (doing->restartable) { + /* if left neighbor is locked, and level is restartable, add + new node to @doing and restart. */ + assert("nikita-913", node->parent != 0); + assert("nikita-914", node->node != NULL); + left->left = 1; + left->free = 0; + left = ERR_PTR(-E_REPEAT); + } else { + /* left neighbor is locked, level cannot be restarted. Just + ignore left neighbor. */ + reiser4_pool_free(&doing->pool->node_pool, &left->header); + left = NULL; + } + return left; +} + +/* find right neighbor of a carry node + + Look for right neighbor of @node and add it to the @doing queue. See + comments in the body. 
+ +*/ +static carry_node * +find_right_neighbor(carry_op * op /* node to find right + * neighbor of */ , + carry_level * doing /* level to scan */ ) +{ + int result; + carry_node *node; + carry_node *right; + lock_handle lh; + int flags; + reiser4_tree *tree; + + init_lh(&lh); + + node = op->node; + + tree = current_tree; + RLOCK_TREE(tree); + /* first, check whether right neighbor is already in a @doing queue */ + if (carry_real(node)->right != NULL) { + /* + * Tree lock is taken here anyway, because, even if _outcome_ + * of (find_carry_node() != NULL) doesn't depends on + * concurrent updates to ->right, find_carry_node() cannot + * work with second argument NULL. Hence, following comment is + * of historic importance only. + * + * Subtle: + * + * Q: why don't we need tree lock here, looking for the right + * neighbor? + * + * A: even if value of node->real_node->right were changed + * during find_carry_node() execution, outcome of execution + * wouldn't change, because (in short) other thread cannot add + * elements to the @doing, and if node->real_node->right + * already was in @doing, value of node->real_node->right + * couldn't change, because node cannot be inserted between + * locked neighbors. + */ + if (find_carry_node(doing, carry_real(node)->right) != NULL) { + RUNLOCK_TREE(tree); + /* + * What we are doing here (this is also applicable to + * the find_left_neighbor()). + * + * tree_walk.c code requires that insertion of a + * pointer to a child, modification of parent pointer + * in the child, and insertion of the child into + * sibling list are atomic (see + * plugin/item/internal.c:create_hook_internal()). + * + * carry allocates new node long before pointer to it + * is inserted into parent and, actually, long before + * parent is even known. Such allocated-but-orphaned + * nodes are only trackable through carry level lists. 
+ * + * The situation handled here is the following: @node + * has valid ->right pointer, but there is + * allocated-but-orphaned node in the carry queue that + * is logically between @node and @node->right. Here + * we are searching for it. Critical point is that + * this is only possible if @node->right is also in + * the carry queue (this is checked above), because + * this is the only way new orphaned node could be + * inserted between them (before inserting new node, + * make_space() first tries to shift to the right, so, + * right neighbor will be locked and queued). + * + */ + right = node; + do { + right = carry_node_next(right); + assert("nikita-3408", !carry_node_end(doing, + right)); + } while (carry_real(right) == carry_real(node)); + return right; + } + } + RUNLOCK_TREE(tree); + + flags = GN_CAN_USE_UPPER_LEVELS; + if (!(op->u.insert.flags & COPI_LOAD_RIGHT)) + flags = GN_NO_ALLOC; + + /* then, try to lock right neighbor */ + init_lh(&lh); + result = reiser4_get_right_neighbor(&lh, carry_real(node), + ZNODE_WRITE_LOCK, flags); + if (result == 0) { + /* ok, node found and locked. */ + right = add_carry_skip(doing, POOLO_AFTER, node); + if (!IS_ERR(right)) { + right->node = lh.node; + move_lh(&right->lock_handle, &lh); + right->free = 1; + result = lock_carry_node_tail(right); + if (result != 0) + right = ERR_PTR(result); + } + } else if ((result == -E_NO_NEIGHBOR) || (result == -ENOENT)) { + /* node is rightmost node in a tree, or neighbor wasn't in + cache, or there is an extent on the right. */ + right = NULL; + } else + right = ERR_PTR(result); + done_lh(&lh); + return right; +} + +/* how much free space in a @node is needed for @op + + How much space in @node is required for completion of @op, where @op is + insert or paste operation. 
+*/ +static unsigned int +space_needed_for_op(znode * node /* znode data are + * inserted or + * pasted in */ , + carry_op * op /* carry + operation */ ) +{ + assert("nikita-919", op != NULL); + + switch (op->op) { + default: + impossible("nikita-1701", "Wrong opcode"); + case COP_INSERT: + return space_needed(node, NULL, op->u.insert.d->data, 1); + case COP_PASTE: + return space_needed(node, op->u.insert.d->coord, op->u.insert.d->data, 0); + } +} + +/* how much space in @node is required to insert or paste @data at + @coord. */ +reiser4_internal unsigned int +space_needed(const znode * node /* node data are inserted or + * pasted in */ , + const coord_t * coord /* coord where data are + * inserted or pasted + * at */ , + const reiser4_item_data * data /* data to insert or + * paste */ , + int insertion /* non-0 is inserting, 0---paste */ ) +{ + int result; + item_plugin *iplug; + + assert("nikita-917", node != NULL); + assert("nikita-918", node_plugin_by_node(node) != NULL); + assert("vs-230", !insertion || (coord == NULL)); + + result = 0; + iplug = data->iplug; + if (iplug->b.estimate != NULL) { + /* ask item plugin how much space is needed to insert this + item */ + result += iplug->b.estimate(insertion ? NULL : coord, data); + } else { + /* reasonable default */ + result += data->length; + } + if (insertion) { + node_plugin *nplug; + + nplug = node->nplug; + /* and add node overhead */ + if (nplug->item_overhead != NULL) { + result += nplug->item_overhead(node, 0); + } + } + return result; +} + +/* find &coord in parent where pointer to new child is to be stored. 
*/ +static int +find_new_child_coord(carry_op * op /* COP_INSERT carry operation to + * insert pointer to new + * child */ ) +{ + int result; + znode *node; + znode *child; + + assert("nikita-941", op != NULL); + assert("nikita-942", op->op == COP_INSERT); + + node = carry_real(op->node); + assert("nikita-943", node != NULL); + assert("nikita-944", node_plugin_by_node(node) != NULL); + + child = carry_real(op->u.insert.child); + result = find_new_child_ptr(node, child, op->u.insert.brother, op->u.insert.d->coord); + + build_child_ptr_data(child, op->u.insert.d->data); + return result; +} + +/* additional amount of free space in @node required to complete @op */ +static int +free_space_shortage(znode * node /* node to check */ , + carry_op * op /* operation being performed */ ) +{ + assert("nikita-1061", node != NULL); + assert("nikita-1062", op != NULL); + + switch (op->op) { + default: + impossible("nikita-1702", "Wrong opcode"); + case COP_INSERT: + case COP_PASTE: + return space_needed_for_op(node, op) - znode_free_space(node); + case COP_EXTENT: + /* when inserting extent shift data around until insertion + point is utmost in the node. */ + if (coord_wrt(op->u.insert.d->coord) == COORD_INSIDE) + return +1; + else + return -1; + } +} + +/* helper function: update node pointer in operation after insertion + point was probably shifted into @target. */ +static znode * +sync_op(carry_op * op, carry_node * target) +{ + znode *insertion_node; + + /* reget node from coord: shift might move insertion coord to + the neighbor */ + insertion_node = op->u.insert.d->coord->node; + /* if insertion point was actually moved into new node, + update carry node pointer in operation. 
*/ + if (insertion_node != carry_real(op->node)) { + op->node = target; + assert("nikita-2540", carry_real(target) == insertion_node); + } + assert("nikita-2541", + carry_real(op->node) == op->u.insert.d->coord->node); + return insertion_node; +} + +/* + * complete make_space() call: update tracked lock handle if necessary. See + * comments for fs/reiser4/carry.h:carry_track_type + */ +static int +make_space_tail(carry_op * op, carry_level * doing, znode * orig_node) +{ + int result; + carry_track_type tracking; + znode *node; + + tracking = doing->track_type; + node = op->u.insert.d->coord->node; + + if (tracking == CARRY_TRACK_NODE || + (tracking == CARRY_TRACK_CHANGE && node != orig_node)) { + /* inserting or pasting into node different from + original. Update lock handle supplied by caller. */ + assert("nikita-1417", doing->tracked != NULL); + done_lh(doing->tracked); + init_lh(doing->tracked); + result = longterm_lock_znode(doing->tracked, node, + ZNODE_WRITE_LOCK, ZNODE_LOCK_HIPRI); + } else + result = 0; + return result; +} + +/* This is insertion policy function. It shifts data to the left and right + neighbors of insertion coord and allocates new nodes until there is enough + free space to complete @op. + + See comments in the body. + + Assumes that the node format favors insertions at the right end of the node + as node40 does. 
+ + See carry_flow() on detail about flow insertion +*/ +static int +make_space(carry_op * op /* carry operation, insert or paste */ , + carry_level * doing /* current carry queue */ , + carry_level * todo /* carry queue on the parent level */ ) +{ + znode *node; + int result; + int not_enough_space; + int blk_alloc; + znode *orig_node; + __u32 flags; + + coord_t *coord; + + assert("nikita-890", op != NULL); + assert("nikita-891", todo != NULL); + assert("nikita-892", + op->op == COP_INSERT || + op->op == COP_PASTE || op->op == COP_EXTENT); + assert("nikita-1607", + carry_real(op->node) == op->u.insert.d->coord->node); + + flags = op->u.insert.flags; + + /* NOTE check that new node can only be allocated after checking left + * and right neighbors. This is necessary for proper work of + * find_{left,right}_neighbor(). */ + assert("nikita-3410", ergo(flags & COPI_DONT_ALLOCATE, + flags & COPI_DONT_SHIFT_LEFT)); + assert("nikita-3411", ergo(flags & COPI_DONT_ALLOCATE, + flags & COPI_DONT_SHIFT_RIGHT)); + + coord = op->u.insert.d->coord; + orig_node = node = coord->node; + + assert("nikita-908", node != NULL); + assert("nikita-909", node_plugin_by_node(node) != NULL); + + result = 0; + /* If there is not enough space in a node, try to shift something to + the left neighbor. This is a bit tricky, as locking to the left is + low priority. This is handled by restart logic in carry(). + */ + not_enough_space = free_space_shortage(node, op); + if (not_enough_space <= 0) + /* it is possible that carry was called when there actually + was enough space in the node. For example, when inserting + leftmost item so that delimiting keys have to be updated. 
+ */ + return make_space_tail(op, doing, orig_node); + if (!(flags & COPI_DONT_SHIFT_LEFT)) { + carry_node *left; + /* make note in statistics of an attempt to move + something into the left neighbor */ + left = find_left_neighbor(op, doing); + if (unlikely(IS_ERR(left))) { + if (PTR_ERR(left) == -E_REPEAT) + return -E_REPEAT; + else { + /* some error other than restart request + occurred. This shouldn't happen. Issue a + warning and continue as if the left neighbor + did not exist. + */ + warning("nikita-924", + "Error accessing left neighbor: %li", + PTR_ERR(left)); + print_znode("node", node); + } + } else if (left != NULL) { + + /* shift everything possible on the left of and + including insertion coord into the left neighbor */ + result = carry_shift_data(LEFT_SIDE, coord, + carry_real(left), doing, todo, + flags & COPI_GO_LEFT); + + /* reget node from coord: shift_left() might move + insertion coord to the left neighbor */ + node = sync_op(op, left); + + not_enough_space = free_space_shortage(node, op); + /* There is not enough free space in @node, but + maybe there is enough free space in + @left. Various balancing decisions are valid here. + The same applies to shifting to the right. + */ + } + } + /* If there still is not enough space, shift to the right */ + if (not_enough_space > 0 && !(flags & COPI_DONT_SHIFT_RIGHT)) { + carry_node *right; + + right = find_right_neighbor(op, doing); + if (IS_ERR(right)) { + warning("nikita-1065", + "Error accessing right neighbor: %li", + PTR_ERR(right)); + print_znode("node", node); + } else if (right != NULL) { + /* node containing insertion point, and its right + neighbor node are write locked by now. 
+ + shift everything possible on the right of but + excluding insertion coord into the right neighbor + */ + result = carry_shift_data(RIGHT_SIDE, coord, + carry_real(right), + doing, todo, + flags & COPI_GO_RIGHT); + /* reget node from coord: shift_right() might move + insertion coord to the right neighbor */ + node = sync_op(op, right); + not_enough_space = free_space_shortage(node, op); + } + } + /* If there is still not enough space, allocate new node(s). + + We try to allocate new blocks if COPI_DONT_ALLOCATE is not set in + the carry operation flags (currently this is needed during flush + only). + */ + for (blk_alloc = 0; + not_enough_space > 0 && result == 0 && blk_alloc < 2 && + !(flags & COPI_DONT_ALLOCATE); ++blk_alloc) { + carry_node *fresh; /* new node we are allocating */ + coord_t coord_shadow; /* remembered insertion point before + * shifting data into new node */ + carry_node *node_shadow; /* remembered insertion node before + * shifting */ + unsigned int gointo; /* whether insertion point should move + * into newly allocated node */ + + /* allocate new node on the right of @node. Znode and disk + fake block number for new node are allocated. + + add_new_znode() posts carry operation COP_INSERT with + COPT_CHILD option to the parent level to add + pointer to newly created node to its parent. + + Subtle point: if several new nodes are required to complete + insertion operation at this level, they will be inserted + into their parents in the order of creation, which means + that @node will be valid "cookie" at the time of insertion. + + */ + fresh = add_new_znode(node, op->node, doing, todo); + if (IS_ERR(fresh)) + return PTR_ERR(fresh); + + /* Try to shift into new node. 
*/ + result = lock_carry_node(doing, fresh); + zput(carry_real(fresh)); + if (result != 0) { + warning("nikita-947", + "Cannot lock new node: %i", result); + print_znode("new", carry_real(fresh)); + print_znode("node", node); + return result; + } + + /* both nodes are write locked by now. + + shift everything possible on the right of and + including insertion coord into the right neighbor. + */ + coord_dup(&coord_shadow, op->u.insert.d->coord); + node_shadow = op->node; + /* move insertion point into newly created node if: + + . insertion point is rightmost in the source node, or + . this is not the first node we are allocating in a row. + */ + gointo = + (blk_alloc > 0) || + coord_is_after_rightmost(op->u.insert.d->coord); + + result = carry_shift_data(RIGHT_SIDE, coord, carry_real(fresh), + doing, todo, gointo); + /* if insertion point was actually moved into new node, + update carry node pointer in operation. */ + node = sync_op(op, fresh); + not_enough_space = free_space_shortage(node, op); + if ((not_enough_space > 0) && (node != coord_shadow.node)) { + /* there is not enough free space in the new node. Shift + insertion point back to @node_shadow so that + next new node would be inserted between + @node_shadow and @fresh. + */ + coord_normalize(&coord_shadow); + coord_dup(coord, &coord_shadow); + node = coord->node; + op->node = node_shadow; + if (1 || (flags & COPI_STEP_BACK)) { + /* still not enough space?! Maybe there is + enough space in the source node (i.e., the node + data were moved from) now. 
+ */ + not_enough_space = free_space_shortage(node, op); + } + } + } + if (not_enough_space > 0) { + if (!(flags & COPI_DONT_ALLOCATE)) + warning("nikita-948", "Cannot insert new item"); + result = -E_NODE_FULL; + } + assert("nikita-1622", ergo(result == 0, + carry_real(op->node) == coord->node)); + assert("nikita-2616", coord == op->u.insert.d->coord); + if (result == 0) + result = make_space_tail(op, doing, orig_node); + return result; +} + +/* insert_paste_common() - common part of insert and paste operations + + This function performs common part of COP_INSERT and COP_PASTE. + + There are three ways in which insertion/paste can be requested: + + . by directly supplying reiser4_item_data. In this case, op -> + u.insert.type is set to COPT_ITEM_DATA. + + . by supplying a child, a pointer to which is to be inserted into the parent. In this + case op -> u.insert.type == COPT_CHILD. + + . by supplying key of new item/unit. In this case op -> u.insert.type == + COPT_KEY. This is currently only used during extent insertion. + + This is required, because when new node is allocated we don't know at what + position pointer to it is to be stored in the parent. Actually, we don't + even know what its parent will be, because parent can be re-balanced + concurrently and new node re-parented, and because parent can be full and + pointer to the new node will go into some other node. + + insert_paste_common() resolves pointer to child node into position in the + parent by calling find_new_child_coord(), which fills + reiser4_item_data. After this, insertion/paste proceeds uniformly. + + Another complication is with finding free space during pasting. It may + happen that while shifting items to the neighbors and newly allocated + nodes, insertion coord can no longer be in the item we wanted to paste + into. At this point, paste becomes (morphs) into insert. Moreover free + space analysis has to be repeated, because amount of space required for + insertion is different from that of paste (item header overhead, etc.). 
+ + This function "unifies" different insertion modes (by resolving child + pointer or key into insertion coord), and then calls make_space() to free + enough space in the node by shifting data to the left and right and by + allocating new nodes if necessary. Carry operation knows amount of space + required for its completion. After enough free space is obtained, caller of + this function (carry_{insert,paste,etc.}) performs actual insertion/paste + by calling item plugin method. + +*/ +static int +insert_paste_common(carry_op * op /* carry operation being + * performed */ , + carry_level * doing /* current carry level */ , + carry_level * todo /* next carry level */ , + carry_insert_data * cdata /* pointer to + * cdata */ , + coord_t * coord /* insertion/paste coord */ , + reiser4_item_data * data /* data to be + * inserted/pasted */ ) +{ + assert("nikita-981", op != NULL); + assert("nikita-980", todo != NULL); + assert("nikita-979", (op->op == COP_INSERT) || (op->op == COP_PASTE) || (op->op == COP_EXTENT)); + + if (op->u.insert.type == COPT_PASTE_RESTARTED) { + /* nothing to do. Fall through to make_space(). */ + ; + } else if (op->u.insert.type == COPT_KEY) { + node_search_result intra_node; + znode *node; + /* Problem with doing batching at the lowest level, is that + operations here are given by coords where modification is + to be performed, and one modification can invalidate coords + of all following operations. + + So, we are implementing yet another type for operation that + will use (the only) "locator" stable across shifting of + data between nodes, etc.: key (COPT_KEY). + + This clause resolves key to the coord in the node. + + But node can change also. Probably some pieces have to be + added to the lock_carry_node(), to lock node by its key. + + */ + /* NOTE-NIKITA Lookup bias is fixed to FIND_EXACT. Complain + if you need something else. 
*/ + op->u.insert.d->coord = coord; + node = carry_real(op->node); + intra_node = node_plugin_by_node(node)->lookup + (node, op->u.insert.d->key, FIND_EXACT, op->u.insert.d->coord); + if ((intra_node != NS_FOUND) && (intra_node != NS_NOT_FOUND)) { + warning("nikita-1715", "Intra node lookup failure: %i", intra_node); + print_znode("node", node); + return intra_node; + } + } else if (op->u.insert.type == COPT_CHILD) { + /* if we are asked to insert pointer to the child into + internal node, first convert pointer to the child into + coord within parent node. + */ + znode *child; + int result; + + op->u.insert.d = cdata; + op->u.insert.d->coord = coord; + op->u.insert.d->data = data; + op->u.insert.d->coord->node = carry_real(op->node); + result = find_new_child_coord(op); + child = carry_real(op->u.insert.child); + if (result != NS_NOT_FOUND) { + warning("nikita-993", "Cannot find a place for child pointer: %i", result); + print_znode("child", child); + print_znode("parent", carry_real(op->node)); + return result; + } + /* This only happens when we did multiple insertions at + the previous level, trying to insert single item and + it so happened, that insertion of pointers to all new + nodes before this one already caused parent node to + split (may be several times). + + I am going to come up with better solution. + + You are not expected to understand this. + -- v6root/usr/sys/ken/slp.c + + Basically, what happens here is the following: carry came + to the parent level and is about to insert internal item + pointing to the child node that it just inserted in the + level below. Position where internal item is to be inserted + was found by find_new_child_coord() above, but node of the + current carry operation (that is, parent node of child + inserted on the previous level), was determined earlier in + the lock_carry_level/lock_carry_node. 
It could happen + that other carry operations performed on the parent + level have already split the parent node, so that insertion point + moved into another node. Handle this by creating new carry + node for insertion point if necessary. + */ + if (carry_real(op->node) != op->u.insert.d->coord->node) { + pool_ordering direction; + znode *z1; + znode *z2; + reiser4_key k1; + reiser4_key k2; + + /* + * determine in what direction insertion point + * moved. Do this by comparing delimiting keys. + */ + z1 = op->u.insert.d->coord->node; + z2 = carry_real(op->node); + if (keyle(leftmost_key_in_node(z1, &k1), + leftmost_key_in_node(z2, &k2))) + /* insertion point moved to the left */ + direction = POOLO_BEFORE; + else + /* insertion point moved to the right */ + direction = POOLO_AFTER; + + op->node = add_carry_skip(doing, direction, op->node); + if (IS_ERR(op->node)) + return PTR_ERR(op->node); + op->node->node = op->u.insert.d->coord->node; + op->node->free = 1; + result = lock_carry_node(doing, op->node); + if (result != 0) + return result; + } + + /* + * set up key of an item being inserted: we are inserting + * internal item and its key (by the very definition of + * search tree) is leftmost key in the child node. + */ + op->u.insert.d->key = UNDER_RW(dk, znode_get_tree(child), read, + leftmost_key_in_node(child, znode_get_ld_key(child))); + op->u.insert.d->data->arg = op->u.insert.brother; + } else { + assert("vs-243", op->u.insert.d->coord != NULL); + op->u.insert.d->coord->node = carry_real(op->node); + } + + /* find free space. */ + return make_space(op, doing, todo); +} + +/* handle carry COP_INSERT operation. + + Insert new item into node. New item can be given in one of two ways: + + - by passing &tree_coord and &reiser4_item_data as part of @op. This is + only applicable at the leaf/twig level. + + - by passing a child node, a pointer to which is to be inserted by this + operation. 
+ +*/ +static int +carry_insert(carry_op * op /* operation to perform */ , + carry_level * doing /* queue of operations @op + * is part of */ , + carry_level * todo /* queue where new operations + * are accumulated */ ) +{ + znode *node; + carry_insert_data cdata; + coord_t coord; + reiser4_item_data data; + carry_plugin_info info; + int result; + + assert("nikita-1036", op != NULL); + assert("nikita-1037", todo != NULL); + assert("nikita-1038", op->op == COP_INSERT); + + coord_init_zero(&coord); + + /* perform common functionality of insert and paste. */ + result = insert_paste_common(op, doing, todo, &cdata, &coord, &data); + if (result != 0) + return result; + + node = op->u.insert.d->coord->node; + assert("nikita-1039", node != NULL); + assert("nikita-1040", node_plugin_by_node(node) != NULL); + + assert("nikita-949", space_needed_for_op(node, op) <= znode_free_space(node)); + + /* ask node layout to create new item. */ + info.doing = doing; + info.todo = todo; + result = node_plugin_by_node(node)->create_item + (op->u.insert.d->coord, op->u.insert.d->key, op->u.insert.d->data, &info); + doing->restartable = 0; + znode_make_dirty(node); + + return result; +} + +/* + * Flow insertion code. COP_INSERT_FLOW is special tree operation that is + * supplied with a "flow" (that is, a stream of data) and inserts it into tree + * by slicing into multiple items. + */ + +#define flow_insert_point(op) ( ( op ) -> u.insert_flow.insert_point ) +#define flow_insert_flow(op) ( ( op ) -> u.insert_flow.flow ) +#define flow_insert_data(op) ( ( op ) -> u.insert_flow.data ) + +static size_t +item_data_overhead(carry_op * op) +{ + if (flow_insert_data(op)->iplug->b.estimate == NULL) + return 0; + return (flow_insert_data(op)->iplug->b.estimate(NULL /* estimate insertion */, flow_insert_data(op)) - + flow_insert_data(op)->length); +} + +/* FIXME-VS: this is called several times during one make_flow_for_insertion + and it will always return the same result. 
Some optimization could be made + by calculating this value once at the beginning and passing it around. That + would reduce some flexibility in future changes. +*/ +static int can_paste(coord_t *, const reiser4_key *, const reiser4_item_data *); +static size_t +flow_insertion_overhead(carry_op * op) +{ + znode *node; + size_t insertion_overhead; + + node = flow_insert_point(op)->node; + insertion_overhead = 0; + if (node->nplug->item_overhead && + !can_paste(flow_insert_point(op), &flow_insert_flow(op)->key, flow_insert_data(op))) + insertion_overhead = node->nplug->item_overhead(node, 0) + item_data_overhead(op); + return insertion_overhead; +} + +/* how many bytes of flow fit into the node */ +static int +what_can_fit_into_node(carry_op * op) +{ + size_t free, overhead; + + overhead = flow_insertion_overhead(op); + free = znode_free_space(flow_insert_point(op)->node); + if (free <= overhead) + return 0; + free -= overhead; + /* FIXME: flow->length is loff_t only to not get overflowed in case of expanding truncate */ + if (free < op->u.insert_flow.flow->length) + return free; + return (int)op->u.insert_flow.flow->length; +} + +/* in make_space_for_flow_insertion we need to check either whether whole flow + fits into a node or whether minimal fraction of flow fits into a node */ +static int +enough_space_for_whole_flow(carry_op * op) +{ + return (unsigned) what_can_fit_into_node(op) == op->u.insert_flow.flow->length; +} + +#define MIN_FLOW_FRACTION 1 +static int +enough_space_for_min_flow_fraction(carry_op * op) +{ + assert("vs-902", coord_is_after_rightmost(flow_insert_point(op))); + + return what_can_fit_into_node(op) >= MIN_FLOW_FRACTION; +} + +/* this returns 0 if left neighbor was obtained successfully and everything + up to and including the insertion point was shifted and left neighbor still has + some free space to put minimal fraction of flow into it */ +static int +make_space_by_shift_left(carry_op * op, carry_level * doing, carry_level * todo) +{ + 
carry_node *left; + znode *orig; + + left = find_left_neighbor(op, doing); + if (unlikely(IS_ERR(left))) { + warning("vs-899", "make_space_by_shift_left: " "error accessing left neighbor: %li", PTR_ERR(left)); + return 1; + } + if (left == NULL) + /* left neighbor either does not exist or is unformatted + node */ + return 1; + + orig = flow_insert_point(op)->node; + /* try to shift content of node @orig from its head upto insert point + including insertion point into the left neighbor */ + carry_shift_data(LEFT_SIDE, flow_insert_point(op), + carry_real(left), doing, todo, 1 /* including insert + * point */); + if (carry_real(left) != flow_insert_point(op)->node) { + /* insertion point did not move */ + return 1; + } + + /* insertion point is set after last item in the node */ + assert("vs-900", coord_is_after_rightmost(flow_insert_point(op))); + + if (!enough_space_for_min_flow_fraction(op)) { + /* insertion point node does not have enough free space to put + even minimal portion of flow into it, therefore, move + insertion point back to orig node (before first item) */ + coord_init_before_first_item(flow_insert_point(op), orig); + return 1; + } + + /* part of flow is to be written to the end of node */ + op->node = left; + return 0; +} + +/* this returns 0 if right neighbor was obtained successfully and everything to + the right of insertion point was shifted to it and node got enough free + space to put minimal fraction of flow into it */ +static int +make_space_by_shift_right(carry_op * op, carry_level * doing, carry_level * todo) +{ + carry_node *right; + + right = find_right_neighbor(op, doing); + if (unlikely(IS_ERR(right))) { + warning("nikita-1065", "shift_right_excluding_insert_point: " + "error accessing right neighbor: %li", PTR_ERR(right)); + return 1; + } + if (right) { + /* shift everything possible on the right of but excluding + insertion coord into the right neighbor */ + carry_shift_data(RIGHT_SIDE, flow_insert_point(op), + carry_real(right), 
doing, todo, 0 /* not + * including + * insert + * point */); + } else { + /* right neighbor either does not exist or is unformatted + node */ + ; + } + if (coord_is_after_rightmost(flow_insert_point(op))) { + if (enough_space_for_min_flow_fraction(op)) { + /* part of flow is to be written to the end of node */ + return 0; + } + } + + /* new node is to be added if insert point node did not get enough + space for whole flow */ + return 1; +} + +/* this returns 0 when insert coord is set at the node end and fraction of flow + fits into that node */ +static int +make_space_by_new_nodes(carry_op * op, carry_level * doing, carry_level * todo) +{ + int result; + znode *node; + carry_node *new; + + node = flow_insert_point(op)->node; + + if (op->u.insert_flow.new_nodes == CARRY_FLOW_NEW_NODES_LIMIT) + return RETERR(-E_NODE_FULL); + /* add new node after insert point node */ + new = add_new_znode(node, op->node, doing, todo); + if (unlikely(IS_ERR(new))) { + return PTR_ERR(new); + } + result = lock_carry_node(doing, new); + zput(carry_real(new)); + if (unlikely(result)) { + return result; + } + op->u.insert_flow.new_nodes++; + if (!coord_is_after_rightmost(flow_insert_point(op))) { + carry_shift_data(RIGHT_SIDE, flow_insert_point(op), + carry_real(new), doing, todo, 0 /* not + * including + * insert + * point */); + + assert("vs-901", coord_is_after_rightmost(flow_insert_point(op))); + + if (enough_space_for_min_flow_fraction(op)) { + return 0; + } + if (op->u.insert_flow.new_nodes == CARRY_FLOW_NEW_NODES_LIMIT) + return RETERR(-E_NODE_FULL); + + /* add one more new node */ + new = add_new_znode(node, op->node, doing, todo); + if (unlikely(IS_ERR(new))) { + return PTR_ERR(new); + } + result = lock_carry_node(doing, new); + zput(carry_real(new)); + if (unlikely(result)) { + return result; + } + op->u.insert_flow.new_nodes++; + } + + /* move insertion point to new node */ + coord_init_before_first_item(flow_insert_point(op), carry_real(new)); + op->node = new; + return 0; +} 
+ +static int +make_space_for_flow_insertion(carry_op * op, carry_level * doing, carry_level * todo) +{ + __u32 flags = op->u.insert_flow.flags; + + if (enough_space_for_whole_flow(op)) { + /* whole flow fits into insert point node */ + return 0; + } + + if (!(flags & COPI_DONT_SHIFT_LEFT) && (make_space_by_shift_left(op, doing, todo) == 0)) { + /* insert point is shifted to left neighbor of original insert + point node and is set after last unit in that node. It has + enough space to fit at least minimal fraction of flow. */ + return 0; + } + + if (enough_space_for_whole_flow(op)) { + /* whole flow fits into insert point node */ + return 0; + } + + if (!(flags & COPI_DONT_SHIFT_RIGHT) && (make_space_by_shift_right(op, doing, todo) == 0)) { + /* insert point is still set to the same node, but there is + nothing to the right of insert point. */ + return 0; + } + + if (enough_space_for_whole_flow(op)) { + /* whole flow fits into insert point node */ + return 0; + } + + return make_space_by_new_nodes(op, doing, todo); +} + +/* implements COP_INSERT_FLOW operation */ +static int +carry_insert_flow(carry_op * op, carry_level * doing, carry_level * todo) +{ + int result; + flow_t *f; + coord_t *insert_point; + node_plugin *nplug; + carry_plugin_info info; + znode *orig_node; + lock_handle *orig_lh; + + f = op->u.insert_flow.flow; + result = 0; + + /* carry system needs this to work */ + info.doing = doing; + info.todo = todo; + + orig_node = flow_insert_point(op)->node; + orig_lh = doing->tracked; + + while (f->length) { + result = make_space_for_flow_insertion(op, doing, todo); + if (result) + break; + + insert_point = flow_insert_point(op); + nplug = node_plugin_by_node(insert_point->node); + + /* compose item data for insertion/pasting */ + flow_insert_data(op)->data = f->data; + flow_insert_data(op)->length = what_can_fit_into_node(op); + + if (can_paste(insert_point, &f->key, flow_insert_data(op))) { + /* insert point is set to item of file we are writing to and we 
have to append to it */ + assert("vs-903", insert_point->between == AFTER_UNIT); + nplug->change_item_size(insert_point, flow_insert_data(op)->length); + flow_insert_data(op)->iplug->b.paste(insert_point, flow_insert_data(op), &info); + } else { + /* new item must be inserted */ + pos_in_node_t new_pos; + flow_insert_data(op)->length += item_data_overhead(op); + + /* FIXME-VS: this is because node40_create_item changes + insert_point for obscure reasons */ + switch (insert_point->between) { + case AFTER_ITEM: + new_pos = insert_point->item_pos + 1; + break; + case EMPTY_NODE: + new_pos = 0; + break; + case BEFORE_ITEM: + assert("vs-905", insert_point->item_pos == 0); + new_pos = 0; + break; + default: + impossible("vs-906", "carry_insert_flow: invalid coord"); + new_pos = 0; + break; + } + + nplug->create_item(insert_point, &f->key, flow_insert_data(op), &info); + coord_set_item_pos(insert_point, new_pos); + } + coord_init_after_item_end(insert_point); + doing->restartable = 0; + znode_make_dirty(insert_point->node); + + move_flow_forward(f, (unsigned) flow_insert_data(op)->length); + } + + if (orig_node != flow_insert_point(op)->node) { + /* move lock to new insert point */ + done_lh(orig_lh); + init_lh(orig_lh); + result = longterm_lock_znode(orig_lh, flow_insert_point(op)->node, ZNODE_WRITE_LOCK, ZNODE_LOCK_HIPRI); + } + + return result; +} + +/* implements COP_DELETE operation + + Remove pointer to @op -> u.delete.child from its parent. + + This function also handles killing of a tree root if the last pointer from it + was removed. This is complicated by our handling of "twig" level: root on + twig level is never killed. 
+ +*/ +static int +carry_delete(carry_op * op /* operation to be performed */ , + carry_level * doing UNUSED_ARG /* current carry + * level */ , + carry_level * todo /* next carry level */ ) +{ + int result; + coord_t coord; + coord_t coord2; + znode *parent; + znode *child; + carry_plugin_info info; + reiser4_tree *tree; + + /* + * This operation is called to delete internal item pointing to the + * child node that was removed by carry from the tree on the previous + * tree level. + */ + + assert("nikita-893", op != NULL); + assert("nikita-894", todo != NULL); + assert("nikita-895", op->op == COP_DELETE); + + coord_init_zero(&coord); + coord_init_zero(&coord2); + + parent = carry_real(op->node); + child = op->u.delete.child ? + carry_real(op->u.delete.child) : op->node->node; + tree = znode_get_tree(child); + RLOCK_TREE(tree); + + /* + * @parent was determined when carry entered parent level + * (lock_carry_level/lock_carry_node). Since then, actual parent of + * @child node could change due to other carry operations performed on + * the parent level. Check for this. + */ + + if (znode_parent(child) != parent) { + /* NOTE-NIKITA add stat counter for this. */ + parent = znode_parent(child); + assert("nikita-2581", find_carry_node(doing, parent)); + } + RUNLOCK_TREE(tree); + + assert("nikita-1213", znode_get_level(parent) > LEAF_LEVEL); + + /* Twig level horrors: tree should be of height at least 2. So, last + pointer from the root at twig level is preserved even if child is + empty. This is ugly, but so it was architectured. + */ + + if (znode_is_root(parent) && + znode_get_level(parent) <= REISER4_MIN_TREE_HEIGHT && + node_num_items(parent) == 1) { + /* Delimiting key manipulations. */ + WLOCK_DK(tree); + znode_set_ld_key(child, znode_set_ld_key(parent, min_key())); + znode_set_rd_key(child, znode_set_rd_key(parent, max_key())); + ZF_SET(child, JNODE_DKSET); + WUNLOCK_DK(tree); + + /* @child escaped imminent death! 
*/ + ZF_CLR(child, JNODE_HEARD_BANSHEE); + return 0; + } + + /* convert child pointer to the coord_t */ + result = find_child_ptr(parent, child, &coord); + if (result != NS_FOUND) { + warning("nikita-994", "Cannot find child pointer: %i", result); + print_znode("child", child); + print_znode("parent", parent); + print_coord_content("coord", &coord); + return result; + } + + coord_dup(&coord2, &coord); + info.doing = doing; + info.todo = todo; + { + /* + * Actually kill internal item: prepare structure with + * arguments for ->cut_and_kill() method... + */ + + struct carry_kill_data kdata; + kdata.params.from = &coord; + kdata.params.to = &coord2; + kdata.params.from_key = NULL; + kdata.params.to_key = NULL; + kdata.params.smallest_removed = NULL; + kdata.params.truncate = 1; + kdata.flags = op->u.delete.flags; + kdata.inode = 0; + kdata.left = 0; + kdata.right = 0; + /* ... and call it. */ + result = node_plugin_by_node(parent)->cut_and_kill(&kdata, + &info); + } + doing->restartable = 0; + + /* check whether root should be killed violently */ + if (znode_is_root(parent) && + /* don't kill roots at and lower than twig level */ + znode_get_level(parent) > REISER4_MIN_TREE_HEIGHT && + node_num_items(parent) == 1) { + result = kill_tree_root(coord.node); + } + + return result < 0 ? : 0; +} + +/* implements COP_CUT operation + + Cuts part or whole content of node. 
+ +*/ +static int +carry_cut(carry_op * op /* operation to be performed */ , + carry_level * doing /* current carry level */ , + carry_level * todo /* next carry level */ ) +{ + int result; + carry_plugin_info info; + node_plugin *nplug; + + assert("nikita-896", op != NULL); + assert("nikita-897", todo != NULL); + assert("nikita-898", op->op == COP_CUT); + + info.doing = doing; + info.todo = todo; + + nplug = node_plugin_by_node(carry_real(op->node)); + if (op->u.cut_or_kill.is_cut) + result = nplug->cut(op->u.cut_or_kill.u.cut, &info); + else + result = nplug->cut_and_kill(op->u.cut_or_kill.u.kill, &info); + + doing->restartable = 0; + return result < 0 ? : 0; +} + +/* helper function for carry_paste(): returns true if @op can be continued as + paste */ +static int +can_paste(coord_t * icoord, const reiser4_key * key, const reiser4_item_data * data) +{ + coord_t circa; + item_plugin *new_iplug; + item_plugin *old_iplug; + int result = 0; /* to keep gcc shut */ + + assert("", icoord->between != AT_UNIT); + + /* obviously, one cannot paste when node is empty---there is nothing + to paste into. 
*/ + if (node_is_empty(icoord->node)) + return 0; + /* if insertion point is at the middle of the item, then paste */ + if (!coord_is_between_items(icoord)) + return 1; + coord_dup(&circa, icoord); + circa.between = AT_UNIT; + + old_iplug = item_plugin_by_coord(&circa); + new_iplug = data->iplug; + + /* check whether we can paste to the item @icoord is "at" when we + ignore ->between field */ + if (old_iplug == new_iplug && item_can_contain_key(&circa, key, data)) { + result = 1; + } else if (icoord->between == BEFORE_UNIT || icoord->between == BEFORE_ITEM) { + /* otherwise, try to glue to the item at the left, if any */ + coord_dup(&circa, icoord); + if (coord_set_to_left(&circa)) { + result = 0; + coord_init_before_item(icoord); + } else { + old_iplug = item_plugin_by_coord(&circa); + result = (old_iplug == new_iplug) && item_can_contain_key(icoord, key, data); + if (result) { + coord_dup(icoord, &circa); + icoord->between = AFTER_UNIT; + } + } + } else if (icoord->between == AFTER_UNIT || icoord->between == AFTER_ITEM) { + coord_dup(&circa, icoord); + /* otherwise, try to glue to the item at the right, if any */ + if (coord_set_to_right(&circa)) { + result = 0; + coord_init_after_item(icoord); + } else { + int (*cck) (const coord_t *, const reiser4_key *, const reiser4_item_data *); + + old_iplug = item_plugin_by_coord(&circa); + + cck = old_iplug->b.can_contain_key; + if (cck == NULL) + /* item doesn't define ->can_contain_key + method? So it is not expandable. 
*/ + result = 0; + else { + result = (old_iplug == new_iplug) && cck(&circa /*icoord */ , key, data); + if (result) { + coord_dup(icoord, &circa); + icoord->between = BEFORE_UNIT; + } + } + } + } else + impossible("nikita-2513", "Nothing works"); + if (result) { + if (icoord->between == BEFORE_ITEM) { + assert("vs-912", icoord->unit_pos == 0); + icoord->between = BEFORE_UNIT; + } else if (icoord->between == AFTER_ITEM) { + coord_init_after_item_end(icoord); + } + } + return result; +} + +/* implements COP_PASTE operation + + Paste data into existing item. This is complicated by the fact that after + we shifted something to the left or right neighbors trying to free some + space, item we were supposed to paste into can be in different node than + insertion coord. If so, we are no longer doing paste, but insert. See + comments in insert_paste_common(). + +*/ +static int +carry_paste(carry_op * op /* operation to be performed */ , + carry_level * doing UNUSED_ARG /* current carry + * level */ , + carry_level * todo /* next carry level */ ) +{ + znode *node; + carry_insert_data cdata; + coord_t dcoord; + reiser4_item_data data; + int result; + int real_size; + item_plugin *iplug; + carry_plugin_info info; + coord_t *coord; + + assert("nikita-982", op != NULL); + assert("nikita-983", todo != NULL); + assert("nikita-984", op->op == COP_PASTE); + + coord_init_zero(&dcoord); + + result = insert_paste_common(op, doing, todo, &cdata, &dcoord, &data); + if (result != 0) + return result; + + coord = op->u.insert.d->coord; + + /* handle case when op -> u.insert.coord doesn't point to the item + of required type. restart as insert. 
*/ + if (!can_paste(coord, op->u.insert.d->key, op->u.insert.d->data)) { + op->op = COP_INSERT; + op->u.insert.type = COPT_PASTE_RESTARTED; + result = op_dispatch_table[COP_INSERT].handler(op, doing, todo); + + return result; + } + + node = coord->node; + iplug = item_plugin_by_coord(coord); + assert("nikita-992", iplug != NULL); + + assert("nikita-985", node != NULL); + assert("nikita-986", node_plugin_by_node(node) != NULL); + + assert("nikita-987", space_needed_for_op(node, op) <= znode_free_space(node)); + + assert("nikita-1286", coord_is_existing_item(coord)); + + /* + * if item is expanded as a result of this operation, we should first + * change item size, then call ->b.paste item method. If item is + * shrunk, it should be done the other way around: first call ->b.paste + * method, then reduce item size. + */ + + real_size = space_needed_for_op(node, op); + if (real_size > 0) + node->nplug->change_item_size(coord, real_size); + + doing->restartable = 0; + info.doing = doing; + info.todo = todo; + + result = iplug->b.paste(coord, op->u.insert.d->data, &info); + + if (real_size < 0) + node->nplug->change_item_size(coord, real_size); + + /* if we pasted at the beginning of the item, update item's key. */ + if (coord->unit_pos == 0 && coord->between != AFTER_UNIT) + node->nplug->update_item_key(coord, op->u.insert.d->key, &info); + + znode_make_dirty(node); + return result; +} + +/* handle carry COP_EXTENT operation. 
*/ +static int +carry_extent(carry_op * op /* operation to perform */ , + carry_level * doing /* queue of operations @op + * is part of */ , + carry_level * todo /* queue where new operations + * are accumulated */ ) +{ + znode *node; + carry_insert_data cdata; + coord_t coord; + reiser4_item_data data; + carry_op *delete_dummy; + carry_op *insert_extent; + int result; + carry_plugin_info info; + + assert("nikita-1751", op != NULL); + assert("nikita-1752", todo != NULL); + assert("nikita-1753", op->op == COP_EXTENT); + + /* extent insertion overview: + + extents live on the TWIG LEVEL, which is level one above the leaf + one. This complicates extent insertion logic somewhat: it may + happen (and going to happen all the time) that in logical key + ordering extent has to be placed between items I1 and I2, located + at the leaf level, but I1 and I2 are in the same formatted leaf + node N1. To insert extent one has to + + (1) reach node N1 and shift data between N1, its neighbors and + possibly newly allocated nodes until I1 and I2 fall into different + nodes. Since I1 and I2 are still neighboring items in logical key + order, they will be necessary utmost items in their respective + nodes. + + (2) After this new extent item is inserted into node on the twig + level. + + Fortunately this process can reuse almost all code from standard + insertion procedure (viz. make_space() and insert_paste_common()), + due to the following observation: make_space() only shifts data up + to and excluding or including insertion point. It never + "over-moves" through insertion point. Thus, one can use + make_space() to perform step (1). All required for this is just to + instruct free_space_shortage() to keep make_space() shifting data + until insertion point is at the node border. + + */ + + /* perform common functionality of insert and paste. 
*/ + result = insert_paste_common(op, doing, todo, &cdata, &coord, &data); + if (result != 0) + return result; + + node = op->u.extent.d->coord->node; + assert("nikita-1754", node != NULL); + assert("nikita-1755", node_plugin_by_node(node) != NULL); + assert("nikita-1700", coord_wrt(op->u.extent.d->coord) != COORD_INSIDE); + + /* NOTE-NIKITA add some checks here. Not assertions, -EIO. Check that + extent fits between items. */ + + info.doing = doing; + info.todo = todo; + + /* there is another complication due to placement of extents on the + twig level: extents are "rigid" in the sense that key-range + occupied by extent cannot grow indefinitely to the right as it is + for the formatted leaf nodes. Because of this when search finds two + adjacent extents on the twig level, it has to "drill" to the leaf + level, creating new node. Here we are removing this node. + */ + if (node_is_empty(node)) { + delete_dummy = node_post_carry(&info, COP_DELETE, node, 1); + if (IS_ERR(delete_dummy)) + return PTR_ERR(delete_dummy); + delete_dummy->u.delete.child = NULL; + delete_dummy->u.delete.flags = DELETE_RETAIN_EMPTY; + ZF_SET(node, JNODE_HEARD_BANSHEE); + } + + /* proceed with inserting extent item into parent. We are definitely + inserting rather than pasting if we get that far. */ + insert_extent = node_post_carry(&info, COP_INSERT, node, 1); + if (IS_ERR(insert_extent)) + /* @delete_dummy will be automatically destroyed on the level + exiting */ + return PTR_ERR(insert_extent); + /* NOTE-NIKITA insertion by key is simplest option here. Another + possibility is to insert on the left or right of already existing + item. 
+ */ + insert_extent->u.insert.type = COPT_KEY; + insert_extent->u.insert.d = op->u.extent.d; + assert("nikita-1719", op->u.extent.d->key != NULL); + insert_extent->u.insert.d->data->arg = op->u.extent.d->coord; + insert_extent->u.insert.flags = znode_get_tree(node)->carry.new_extent_flags; + + /* + * if carry was asked to track lock handle we should actually track + * lock handle on the twig node rather than on the leaf where + * operation was started from. Transfer tracked lock handle. + */ + if (doing->track_type) { + assert("nikita-3242", doing->tracked != NULL); + assert("nikita-3244", todo->tracked == NULL); + todo->tracked = doing->tracked; + todo->track_type = CARRY_TRACK_NODE; + doing->tracked = NULL; + doing->track_type = 0; + } + + return 0; +} + +/* update key in @parent between pointers to @left and @right. + + Find coords of @left and @right and update delimiting key between them. + This is helper function called by carry_update(). Finds position of + internal item involved. Updates item key. Updates delimiting keys of child + nodes involved. 
+*/ +static int +update_delimiting_key(znode * parent /* node key is updated + * in */ , + znode * left /* child of @parent */ , + znode * right /* child of @parent */ , + carry_level * doing /* current carry + * level */ , + carry_level * todo /* parent carry + * level */ , + const char **error_msg /* place to + * store error + * message */ ) +{ + coord_t left_pos; + coord_t right_pos; + int result; + reiser4_key ldkey; + carry_plugin_info info; + + assert("nikita-1177", right != NULL); + /* find position of the right child in a parent */ + result = find_child_ptr(parent, right, &right_pos); + if (result != NS_FOUND) { + *error_msg = "Cannot find position of right child"; + return result; + } + + if ((left != NULL) && !coord_is_leftmost_unit(&right_pos)) { + /* find position of the left child in a parent */ + result = find_child_ptr(parent, left, &left_pos); + if (result != NS_FOUND) { + *error_msg = "Cannot find position of left child"; + return result; + } + assert("nikita-1355", left_pos.node != NULL); + } else + left_pos.node = NULL; + + /* check that they are separated by exactly one key and are basically + sane */ + if (REISER4_DEBUG) { + if ((left_pos.node != NULL) + && !coord_is_existing_unit(&left_pos)) { + *error_msg = "Left child is bastard"; + return RETERR(-EIO); + } + if (!coord_is_existing_unit(&right_pos)) { + *error_msg = "Right child is bastard"; + return RETERR(-EIO); + } + if (left_pos.node != NULL && + !coord_are_neighbors(&left_pos, &right_pos)) { + *error_msg = "Children are not direct siblings"; + return RETERR(-EIO); + } + } + *error_msg = NULL; + + info.doing = doing; + info.todo = todo; + + /* + * If child node is not empty, new key of internal item is a key of + * leftmost item in the child node. If the child is empty, take its + * right delimiting key as a new key of the internal item. 
Precise key + * in the latter case is not important per se, because the child (and + * the internal item) are going to be killed shortly anyway, but we + * have to preserve correct order of keys in the parent node. + */ + + if (!ZF_ISSET(right, JNODE_HEARD_BANSHEE)) + leftmost_key_in_node(right, &ldkey); + else + UNDER_RW_VOID(dk, znode_get_tree(parent), read, + ldkey = *znode_get_rd_key(right)); + node_plugin_by_node(parent)->update_item_key(&right_pos, &ldkey, &info); + doing->restartable = 0; + znode_make_dirty(parent); + return 0; +} + +/* implements COP_UPDATE operation + + Update delimiting keys. + +*/ +static int +carry_update(carry_op * op /* operation to be performed */ , + carry_level * doing /* current carry level */ , + carry_level * todo /* next carry level */ ) +{ + int result; + carry_node *missing UNUSED_ARG; + znode *left; + znode *right; + carry_node *lchild; + carry_node *rchild; + const char *error_msg; + reiser4_tree *tree; + + /* + * This operation is called to update key of internal item. This is + * necessary when carry shifted or cut data on the child + * level. Arguments of this operation are: + * + * @right --- child node. Operation should update key of internal + * item pointing to @right. + * + * @left --- left neighbor of @right. This parameter is optional. + */ + + assert("nikita-902", op != NULL); + assert("nikita-903", todo != NULL); + assert("nikita-904", op->op == COP_UPDATE); + + lchild = op->u.update.left; + rchild = op->node; + + if (lchild != NULL) { + assert("nikita-1001", lchild->parent); + assert("nikita-1003", !lchild->left); + left = carry_real(lchild); + } else + left = NULL; + + tree = znode_get_tree(rchild->node); + RLOCK_TREE(tree); + right = znode_parent(rchild->node); + RUNLOCK_TREE(tree); + + if (right != NULL) { + result = update_delimiting_key(right, + lchild ? 
lchild->node : NULL, + rchild->node, + doing, todo, &error_msg); + } else { + error_msg = "Cannot find node to update key in"; + result = RETERR(-EIO); + } + /* operation will be reposted to the next level by the + ->update_item_key() method of node plugin, if necessary. */ + + if (result != 0) { + warning("nikita-999", "Error updating delimiting key: %s (%i)", error_msg ? : "", result); + print_znode("left", left); + print_znode("right", right); + print_znode("lchild", lchild ? lchild->node : NULL); + print_znode("rchild", rchild->node); + } + return result; +} + +/* move items from @node during carry */ +static int +carry_shift_data(sideof side /* in what direction to move data */ , + coord_t * insert_coord /* coord where new item + * is to be inserted */ , + znode * node /* node which data are moved from */ , + carry_level * doing /* active carry queue */ , + carry_level * todo /* carry queue where new + * operations are to be put + * in */ , + unsigned int including_insert_coord_p /* true if + * @insertion_coord + * can be moved */ ) +{ + int result; + znode *source; + carry_plugin_info info; + node_plugin *nplug; + + source = insert_coord->node; + + info.doing = doing; + info.todo = todo; + + nplug = node_plugin_by_node(node); + result = nplug->shift(insert_coord, node, + (side == LEFT_SIDE) ? SHIFT_LEFT : SHIFT_RIGHT, 0, + (int) including_insert_coord_p, &info); + /* the only error ->shift() method of node plugin can return is + -ENOMEM due to carry node/operation allocation. */ + assert("nikita-915", result >= 0 || result == -ENOMEM); + if (result > 0) { + /* + * if some number of bytes was actually shifted, mark nodes + * dirty, and carry level as non-restartable. 
+ */ + doing->restartable = 0; + znode_make_dirty(source); + znode_make_dirty(node); + } + + assert("nikita-2077", coord_check(insert_coord)); + return 0; +} + +typedef carry_node *(*carry_iterator) (carry_node * node); +static carry_node *find_dir_carry(carry_node * node, carry_level * level, carry_iterator iterator); + +/* look for the left neighbor of given carry node in a carry queue. + + This is used by find_left_neighbor(), but I am not sure that this + really gives any advantage. More statistics required. + +*/ +reiser4_internal carry_node * +find_left_carry(carry_node * node /* node to find left neighbor + * of */ , + carry_level * level /* level to scan */ ) +{ + return find_dir_carry(node, level, (carry_iterator) pool_level_list_prev); +} + +/* look for the right neighbor of given carry node in a + carry queue. + + This is used by find_right_neighbor(), but I am not sure that this + really gives any advantage. More statistics required. + +*/ +reiser4_internal carry_node * +find_right_carry(carry_node * node /* node to find right neighbor + * of */ , + carry_level * level /* level to scan */ ) +{ + return find_dir_carry(node, level, (carry_iterator) pool_level_list_next); +} + +/* look for the left or right neighbor of given carry node in a carry + queue. + + Helper function used by find_{left|right}_carry(). +*/ +static carry_node * +find_dir_carry(carry_node * node /* node to start scanning + * from */ , + carry_level * level /* level to scan */ , + carry_iterator iterator /* operation to + * move to the next + * node */ ) +{ + carry_node *neighbor; + + assert("nikita-1059", node != NULL); + assert("nikita-1060", level != NULL); + + /* scan list of carry nodes on this list dir-ward, skipping all + carry nodes referencing the same znode. 
*/ + neighbor = node; + while (1) { + neighbor = iterator(neighbor); + if (pool_level_list_end(&level->nodes, &neighbor->header)) + return NULL; + if (carry_real(neighbor) != carry_real(node)) + return neighbor; + } +} + +/* + * Memory reservation estimation. + * + * Carry process proceeds through tree levels upwards. Carry assumes that it + * takes tree in consistent state (e.g., that search tree invariants hold), + * and leaves tree consistent after it finishes. This means that when some + * error occurs carry cannot simply return if there are pending carry + * operations. Generic solution for this problem is carry-undo either as + * transaction manager feature (requiring checkpoints and isolation), or + * through some carry specific mechanism. + * + * Our current approach is to panic if carry hits an error while tree is + * inconsistent. Unfortunately -ENOMEM can easily be triggered. To work around + * this "memory reservation" mechanism was added. + * + * Memory reservation is implemented by perthread-pages.diff patch from + * core-patches. Its API is defined in + * + * int perthread_pages_reserve(int nrpages, int gfp); + * void perthread_pages_release(int nrpages); + * int perthread_pages_count(void); + * + * carry estimates its worst case memory requirements at the entry, reserves + * enough memory, and releases unused pages before returning. + * + * Code below estimates worst case memory requirements for a given carry + * queue. This is done by summing worst case memory requirements for each + * operation in the queue. + * + */ + +/* + * Memory requirements of many operations depend on the tree + * height. For example, item insertion requires new node to be inserted at + * each tree level in the worst case. What tree height should be used for + * estimation? Current tree height is wrong, because tree height can change + * between the time when estimation was done and the time when operation is + * actually performed. 
Maximal possible tree height (REISER4_MAX_ZTREE_HEIGHT) + * is also not desirable, because it would lead to a huge over-estimation + * all the time. Plausible solution is "capped tree height": if current tree + * height is less than some TREE_HEIGHT_CAP constant, capped tree height is + * TREE_HEIGHT_CAP, otherwise it's current tree height. Idea behind this is + * that if tree height is TREE_HEIGHT_CAP or larger, it's extremely unlikely + * to be increased even more during short interval of time. + */ +#define TREE_HEIGHT_CAP (5) + +/* return capped tree height for the @tree. See comment above. */ +static int +cap_tree_height(reiser4_tree * tree) +{ + return max_t(int, tree->height, TREE_HEIGHT_CAP); +} + +/* return capped tree height for the current tree. */ +static int capped_height(void) +{ + return cap_tree_height(current_tree); +} + +/* return number of pages required to store given number of bytes */ +static int bytes_to_pages(int bytes) +{ + return (bytes + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; +} + +/* how many pages are required to allocate znodes during item insertion. */ +static int +carry_estimate_znodes(void) +{ + /* + * Note that we have a problem here: there is no way to + * reserve pages specifically for the given slab. This means that + * these pages can be hijacked for some other end. + */ + + /* in the worst case we need 3 new znodes on each tree level */ + return bytes_to_pages(capped_height() * sizeof(znode) * 3); +} + +/* + * how many pages are required to load bitmaps. One bitmap per level. + */ +static int +carry_estimate_bitmaps(void) +{ + if (reiser4_is_set(reiser4_get_current_sb(), REISER4_DONT_LOAD_BITMAP)) { + int bytes; + + bytes = capped_height() * + (0 + /* bnode should be added, but it is private to + * bitmap.c, skip for now. 
*/ + 2 * sizeof(jnode)); /* working and commit jnodes */ + return bytes_to_pages(bytes) + 2; /* and their contents */ + } else + /* bitmaps were pre-loaded during mount */ + return 0; +} + +/* worst case item insertion memory requirements */ +static int +carry_estimate_insert(carry_op * op, carry_level * level) +{ + return + carry_estimate_bitmaps() + + carry_estimate_znodes() + + 1 + /* new atom */ + capped_height() + /* new block on each level */ + 1 + /* and possibly extra new block at the leaf level */ + 3; /* loading of leaves into memory */ +} + +/* worst case item deletion memory requirements */ +static int +carry_estimate_delete(carry_op * op, carry_level * level) +{ + return + carry_estimate_bitmaps() + + carry_estimate_znodes() + + 1 + /* new atom */ + 3; /* loading of leaves into memory */ +} + +/* worst case tree cut memory requirements */ +static int +carry_estimate_cut(carry_op * op, carry_level * level) +{ + return + carry_estimate_bitmaps() + + carry_estimate_znodes() + + 1 + /* new atom */ + 3; /* loading of leaves into memory */ +} + +/* worst case memory requirements of pasting into item */ +static int +carry_estimate_paste(carry_op * op, carry_level * level) +{ + return + carry_estimate_bitmaps() + + carry_estimate_znodes() + + 1 + /* new atom */ + capped_height() + /* new block on each level */ + 1 + /* and possibly extra new block at the leaf level */ + 3; /* loading of leaves into memory */ +} + +/* worst case memory requirements of extent insertion */ +static int +carry_estimate_extent(carry_op * op, carry_level * level) +{ + return + carry_estimate_insert(op, level) + /* insert extent */ + carry_estimate_delete(op, level); /* kill leaf */ +} + +/* worst case memory requirements of key update */ +static int +carry_estimate_update(carry_op * op, carry_level * level) +{ + return 0; +} + +/* worst case memory requirements of flow insertion */ +static int +carry_estimate_insert_flow(carry_op * op, carry_level * level) +{ + int newnodes; + + 
newnodes = min(bytes_to_pages(op->u.insert_flow.flow->length), + CARRY_FLOW_NEW_NODES_LIMIT); + /* + * roughly estimate insert_flow as a sequence of insertions. + */ + return newnodes * carry_estimate_insert(op, level); +} + +/* This is dispatch table for carry operations. It can be trivially + abstracted into useful plugin: tunable balancing policy is a good + thing. */ +reiser4_internal carry_op_handler op_dispatch_table[COP_LAST_OP] = { + [COP_INSERT] = { + .handler = carry_insert, + .estimate = carry_estimate_insert + }, + [COP_DELETE] = { + .handler = carry_delete, + .estimate = carry_estimate_delete + }, + [COP_CUT] = { + .handler = carry_cut, + .estimate = carry_estimate_cut + }, + [COP_PASTE] = { + .handler = carry_paste, + .estimate = carry_estimate_paste + }, + [COP_EXTENT] = { + .handler = carry_extent, + .estimate = carry_estimate_extent + }, + [COP_UPDATE] = { + .handler = carry_update, + .estimate = carry_estimate_update + }, + [COP_INSERT_FLOW] = { + .handler = carry_insert_flow, + .estimate = carry_estimate_insert_flow + } +}; + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/carry_ops.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/carry_ops.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,41 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* implementation of carry operations. See carry_ops.c for details. 
*/ + +#if !defined( __CARRY_OPS_H__ ) +#define __CARRY_OPS_H__ + +#include "forward.h" +#include "znode.h" +#include "carry.h" + +/* carry operation handlers */ +typedef struct carry_op_handler { + /* perform operation */ + int (*handler) (carry_op * op, carry_level * doing, carry_level * todo); + /* estimate memory requirements for @op */ + int (*estimate) (carry_op * op, carry_level * level); +} carry_op_handler; + +/* This is dispatch table for carry operations. It can be trivially + abstracted into useful plugin: tunable balancing policy is a good + thing. */ +extern carry_op_handler op_dispatch_table[COP_LAST_OP]; + +unsigned int space_needed(const znode * node, const coord_t * coord, const reiser4_item_data * data, int inserting); +extern carry_node *find_left_carry(carry_node * node, carry_level * level); +extern carry_node *find_right_carry(carry_node * node, carry_level * level); + +/* __CARRY_OPS_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/cluster.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/cluster.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,71 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Contains cluster operations for cryptcompress object plugin (see + http://www.namesys.com/cryptcompress_design.txt for details). */ + +/* Concepts of clustering. Definition of cluster size. + Data clusters, page clusters, disk clusters. + + + In order to compress plain text we first should split it into chunks. 
+ Then we process each chunk independently by the following function: + + void alg(char *input_ptr, int input_length, char *output_ptr, int *output_length); + + where: + input_ptr is a pointer to the first byte of input chunk (that contains plain text), + input_length is the length of input chunk, + output_ptr is a pointer to the first byte of output chunk (that contains processed text), + *output_length is the length of output chunk. + + The length of output chunk depends both on input_length and on the content of + input chunk. input_length (which can be assigned an arbitrary value) affects the + compression quality (the larger input_length, the better the compression quality). + For each cryptcompress file we assign a special attribute - cluster size: + + Cluster size is a file attribute, which determines the maximal size + of input chunk that we use for compression. + + So if we want to compress a 10K-file with a cluster size of 4K, we split this file + into three chunks (first and second - 4K, third - 2K). Those chunks are + clusters in the space of file offsets (data clusters). + + Cluster sizes are represented as (PAGE_CACHE_SIZE << shift), where + shift = 0, 1, 2, ... You'll note that this representation + affects the allowed values for cluster size. This is stored in + disk stat-data (CLUSTER_STAT; the layout is in reiser4_cluster_stat, see + plugin/item/static_stat.h for details). + Note that working with + cluster_size > PAGE_SIZE (when cluster_shift > 0, and cluster contains more + than one page) is suboptimal because before compression we should assemble + all cluster pages into one flow (this means superfluous memcpy during + read/write). So the better way to increase cluster size (and therefore + compression quality) is making PAGE_SIZE larger (for instance by page + clustering stuff of William Lee). But if you need PAGE_SIZE < cluster_size, + then use the page clustering offered by reiser4. + + The inode mapping of a cryptcompress file contains pages filled by plain text.
+ Cluster size also defines clustering in address space. For example, + 101K-file with cluster size 16K (cluster shift = 2), which can be mapped + into 26 pages, has 7 "page clusters": the first six clusters contain 4 pages each + and the last cluster contains 2 pages (for the file tail). + + We split each output (compressed) chunk into special items to provide + tight packing of data on disk (currently only ctails hold compressed data). + This set of items we call a "disk cluster". + + Each cluster is defined (like pages are) by its index (i.e. an offset + whose unit is cluster size instead of PAGE_SIZE). Key offset of + the first unit of the first item of each disk cluster (we call this a + "key of disk cluster") is a multiple of the cluster index. + + All read/write/truncate operations are performed upon clusters. + For example, if we want to read 40K of a cryptcompress file with cluster size 16K + from offset = 20K, we first need to read three clusters (of indexes 1, 2, 3), + because the range 20K..60K touches each of them. This + means that all main methods of cryptcompress object plugin call appropriate + cluster operation. + + For the same index we use one structure (type reiser4_cluster_t) to + represent all data/page/disk clusters. (EDWARD-FIXME-HANS: are you + sure that is good style? and where is the code that goes with this comment....;-) ) +*/ diff -puN /dev/null fs/reiser4/cluster.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/cluster.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,289 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* This file contains page/cluster index translators and offset modulators + See http://www.namesys.com/cryptcompress_design.html for details */ + +#if !defined( __FS_REISER4_CLUSTER_H__ ) +#define __FS_REISER4_CLUSTER_H__ + +static inline loff_t min_count(loff_t a, loff_t b) +{ + return (a < b ? a : b); +} + +static inline loff_t max_count(loff_t a, loff_t b) +{ + return (a > b ?
a : b); +} + +static inline int inode_cluster_shift (struct inode * inode) +{ + assert("edward-92", inode != NULL); + assert("edward-93", reiser4_inode_data(inode) != NULL); + assert("edward-94", inode_get_flag(inode, REISER4_CLUSTER_KNOWN)); + + return reiser4_inode_data(inode)->cluster_shift; +} + +static inline unsigned +page_cluster_shift(struct inode * inode) +{ + return inode_cluster_shift(inode) + PAGE_CACHE_SHIFT; +} + +/* cluster size in page units */ +static inline unsigned cluster_nrpages (struct inode * inode) +{ + return (1U << inode_cluster_shift(inode)); +} + +static inline size_t inode_cluster_size (struct inode * inode) +{ + assert("edward-96", inode != NULL); + + return (PAGE_CACHE_SIZE << inode_cluster_shift(inode)); +} + +static inline unsigned long +pg_to_clust(unsigned long idx, struct inode * inode) +{ + return idx >> inode_cluster_shift(inode); +} + +static inline unsigned long +clust_to_pg(unsigned long idx, struct inode * inode) +{ + return idx << inode_cluster_shift(inode); +} + +static inline unsigned long +pg_to_clust_to_pg(unsigned long idx, struct inode * inode) +{ + return clust_to_pg(pg_to_clust(idx, inode), inode); +} + +static inline unsigned long +off_to_pg(loff_t off) +{ + return (off >> PAGE_CACHE_SHIFT); +} + +static inline loff_t +pg_to_off(unsigned long idx) +{ + return ((loff_t)(idx) << PAGE_CACHE_SHIFT); +} + +static inline unsigned long +off_to_clust(loff_t off, struct inode * inode) +{ + return pg_to_clust(off_to_pg(off), inode); +} + +static inline loff_t +clust_to_off(unsigned long idx, struct inode * inode) +{ + return pg_to_off(clust_to_pg(idx, inode)); +} + +static inline unsigned long +count_to_nr(loff_t count, unsigned shift) +{ + return (count + (1UL << shift) - 1) >> shift; +} + +/* number of pages occupied by @count bytes */ +static inline unsigned long +count_to_nrpages(loff_t count) +{ + return count_to_nr(count, PAGE_CACHE_SHIFT); +} + +/* number of clusters occupied by @count bytes */ +static inline 
unsigned long +count_to_nrclust(loff_t count, struct inode * inode) +{ + return count_to_nr(count, page_cluster_shift(inode)); +} + +/* number of clusters occupied by @count pages */ +static inline cloff_t +pgcount_to_nrclust(pgoff_t count, struct inode * inode) +{ + return count_to_nr(count, inode_cluster_shift(inode)); +} + +static inline loff_t +off_to_clust_to_off(loff_t off, struct inode * inode) +{ + return clust_to_off(off_to_clust(off, inode), inode); +} + +static inline unsigned long +off_to_clust_to_pg(loff_t off, struct inode * inode) +{ + return clust_to_pg(off_to_clust(off, inode), inode); +} + +static inline unsigned +off_to_pgoff(loff_t off) +{ + return off & (PAGE_CACHE_SIZE - 1); +} + +static inline unsigned +off_to_cloff(loff_t off, struct inode * inode) +{ + return off & ((loff_t)(inode_cluster_size(inode)) - 1); +} + +static inline unsigned +pg_to_off_to_cloff(unsigned long idx, struct inode * inode) +{ + return off_to_cloff(pg_to_off(idx), inode); +} + +/* if @size != 0, returns index of the page + which contains the last byte of the file */ +static inline pgoff_t +size_to_pg(loff_t size) +{ + return (size ? off_to_pg(size - 1) : 0); +} + +/* minimal index of the page which doesn't contain + file data */ +static inline pgoff_t +size_to_next_pg(loff_t size) +{ + return (size ? 
off_to_pg(size - 1) + 1 : 0); +} + +static inline unsigned +off_to_pgcount(loff_t off, unsigned long idx) +{ + if (idx > off_to_pg(off)) + return 0; + if (idx < off_to_pg(off)) + return PAGE_CACHE_SIZE; + return off_to_pgoff(off); +} + +static inline unsigned +off_to_count(loff_t off, unsigned long idx, struct inode * inode) +{ + if (idx > off_to_clust(off, inode)) + return 0; + if (idx < off_to_clust(off, inode)) + return inode_cluster_size(inode); + return off_to_cloff(off, inode); +} + +static inline unsigned +fsize_to_count(reiser4_cluster_t * clust, struct inode * inode) +{ + assert("edward-288", clust != NULL); + assert("edward-289", inode != NULL); + + return off_to_count(inode->i_size, clust->index, inode); +} + +static inline void +reiser4_slide_init (reiser4_slide_t * win){ + assert("edward-1084", win != NULL); + memset(win, 0, sizeof *win); +} + +static inline void +reiser4_cluster_init (reiser4_cluster_t * clust, reiser4_slide_t * window){ + assert("edward-84", clust != NULL); + memset(clust, 0, sizeof *clust); + clust->dstat = INVAL_DISK_CLUSTER; + clust->win = window; +} + +static inline int +dclust_get_extension(hint_t * hint) +{ + return hint->ext_coord.extension.ctail.shift; +} + +static inline void +dclust_set_extension(hint_t * hint) +{ + assert("edward-1270", item_id_by_coord(&hint->ext_coord.coord) == CTAIL_ID); + hint->ext_coord.extension.ctail.shift = cluster_shift_by_coord(&hint->ext_coord.coord); +} + +static inline int +hint_is_unprepped_dclust(hint_t * hint) +{ + return dclust_get_extension(hint) == (int)UCTAIL_SHIFT; +} + +static inline void +coord_set_between_clusters(coord_t * coord) +{ +#if REISER4_DEBUG + int result; + result = zload(coord->node); + assert("edward-1296", !result); +#endif + if (!coord_is_between_items(coord)) { + coord->between = AFTER_ITEM; + coord->unit_pos = 0; + } +#if REISER4_DEBUG + zrelse(coord->node); +#endif +} + +int inflate_cluster(reiser4_cluster_t *, struct inode *); +int find_cluster(reiser4_cluster_t 
*, struct inode *, int read, int write); +void forget_cluster_pages(struct page ** page, int nrpages); +int flush_cluster_pages(reiser4_cluster_t *, jnode *, struct inode *); +int deflate_cluster(reiser4_cluster_t *, struct inode *); +void truncate_page_cluster(struct inode * inode, cloff_t start); +void set_hint_cluster(struct inode * inode, hint_t * hint, unsigned long index, znode_lock_mode mode); +void invalidate_hint_cluster(reiser4_cluster_t * clust); +int get_disk_cluster_locked(reiser4_cluster_t * clust, struct inode * inode, znode_lock_mode lock_mode); +void reset_cluster_params(reiser4_cluster_t * clust); +int prepare_page_cluster(struct inode *inode, reiser4_cluster_t *clust, int capture); +void release_cluster_pages_nocapture(reiser4_cluster_t *); +void put_cluster_handle(reiser4_cluster_t * clust, tfm_action act); +int grab_tfm_stream(struct inode * inode, tfm_cluster_t * tc, tfm_action act, tfm_stream_id id); +int tfm_cluster_is_uptodate (tfm_cluster_t * tc); +void tfm_cluster_set_uptodate (tfm_cluster_t * tc); +void tfm_cluster_clr_uptodate (tfm_cluster_t * tc); +unsigned long clust_by_coord(const coord_t * coord, struct inode * inode); + +static inline int +alloc_clust_pages(reiser4_cluster_t * clust, struct inode * inode ) +{ + assert("edward-791", clust != NULL); + assert("edward-792", inode != NULL); + clust->pages = reiser4_kmalloc(sizeof(*clust->pages) << inode_cluster_shift(inode), GFP_KERNEL); + if (!clust->pages) + return -ENOMEM; + return 0; +} + +static inline void +free_clust_pages(reiser4_cluster_t * clust) +{ + reiser4_kfree(clust->pages); +} + +#endif /* __FS_REISER4_CLUSTER_H__ */ + + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/context.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/context.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,303 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Manipulation of reiser4_context */ + +/* + * global context used during system call. Variable of this type is allocated + * on the stack at the beginning of the reiser4 part of the system call and + * pointer to it is stored in the current->fs_context. This allows us to avoid + * passing pointer to current transaction and current lockstack (both in + * one-to-one mapping with threads) all over the call chain. + * + * It's kind of like those global variables the prof used to tell you not to + * use in CS1, except thread specific.;-) Nikita, this was a good idea. + * + * In some situations it is desirable to have ability to enter reiser4_context + * more than once for the same thread (nested contexts). For example, there + * are some functions that can be called either directly from VFS/VM or from + * already active reiser4 context (->writepage, for example). + * + * In such situations "child" context acts like dummy: all activity is + * actually performed in the top level context, and get_current_context() + * always returns top level context. Of course, init_context()/done_context() + * have to be properly nested any way. + * + * Note that there is an important difference between reiser4 uses + * ->fs_context and the way other file systems use it. Other file systems + * (ext3 and reiserfs) use ->fs_context only for the duration of _transaction_ + * (this is why ->fs_context was initially called ->journal_info). 
This means + * that when ext3 or reiserfs finds that ->fs_context is not NULL on the entry + * to the file system, they assume that some transaction is already underway, + * and usually bail out, because starting a nested transaction would most likely + * lead to a deadlock. This gives false positives with reiser4, because we + * set ->fs_context before starting a transaction. + */ + +#include "debug.h" +#include "super.h" +#include "context.h" + +#include /* balance_dirty_pages() */ +#include + +#if REISER4_DEBUG + +/* List of all currently active contexts, used for debugging purposes. */ +static context_list_head active_contexts; +/* lock protecting access to active_contexts. */ +spinlock_t active_contexts_lock; + +#endif /* REISER4_DEBUG */ + +/* initialise context and bind it to the current thread + + This function should be called at the beginning of reiser4 part of + syscall. +*/ +reiser4_internal int +init_context(reiser4_context * context /* pointer to the reiser4 context + * being initialised */ , + struct super_block *super /* super block we are going to + * work with */) +{ + assert("nikita-2662", !in_interrupt() && !in_irq()); + assert("nikita-3356", context != NULL); + assert("nikita-3357", super != NULL); + assert("nikita-3358", super->s_op == NULL || is_reiser4_super(super)); + + memset(context, 0, sizeof *context); + + if (is_in_reiser4_context()) { + reiser4_context *parent; + + parent = (reiser4_context *) current->journal_info; + /* NOTE-NIKITA this is dubious */ + if (parent->super == super) { + context->parent = parent; +#if (REISER4_DEBUG) + ++context->parent->nr_children; +#endif + return 0; + } + } + + context->super = super; + context->magic = context_magic; + context->outer = current->journal_info; + current->journal_info = (void *) context; + + init_lock_stack(&context->stack); + + txn_begin(context); + + context->parent = context; + tap_list_init(&context->taps); +#if REISER4_DEBUG + context_list_clean(context); /* to satisfy assertion */
+ spin_lock(&active_contexts_lock); + context_list_check(&active_contexts); + context_list_push_front(&active_contexts, context); + spin_unlock(&active_contexts_lock); + context->task = current; +#endif + grab_space_enable(); + return 0; +} + +/* cast lock stack embedded into reiser4 context up to its container */ +reiser4_internal reiser4_context * +get_context_by_lock_stack(lock_stack * owner) +{ + return container_of(owner, reiser4_context, stack); +} + +/* true if there is already _any_ reiser4 context for the current thread */ +reiser4_internal int +is_in_reiser4_context(void) +{ + reiser4_context *ctx; + + ctx = current->journal_info; + return + ctx != NULL && + ((unsigned long) ctx->magic) == context_magic; +} + +/* + * call balance dirty pages for the current context. + * + * File system is expected to call balance_dirty_pages_ratelimited() whenever + * it dirties a page. reiser4 does this for unformatted nodes (that is, during + * write---this covers vast majority of all dirty traffic), but we cannot do + * this immediately when formatted node is dirtied, because long term lock is + * usually held at that time. To work around this, dirtying of formatted node + * simply increases ->nr_marked_dirty counter in the current reiser4 + * context. When we are about to leave this context, + * balance_dirty_pages_ratelimited() is called, if necessary. + * + * This introduces another problem: sometimes we do not want to run + * balance_dirty_pages_ratelimited() when leaving a context, for example + * because some important lock (like ->i_sem on the parent directory) is + * held. To achieve this, ->nobalance flag can be set in the current context. + */ +static void +balance_dirty_pages_at(reiser4_context * context) +{ + reiser4_super_info_data * sbinfo = get_super_private(context->super); + + /* + * call balance_dirty_pages_ratelimited() to process formatted nodes + * dirtied during this system call. 
+ */ + if (context->nr_marked_dirty != 0 && /* were any nodes dirtied? */ + /* aren't we called early during mount? */ + sbinfo->fake && + /* don't call balance dirty pages from ->writepage(): it's + * deadlock prone */ + !(current->flags & PF_MEMALLOC) && + /* and don't stall pdflush */ + !current_is_pdflush()) + balance_dirty_pages_ratelimited(sbinfo->fake->i_mapping); +} + +/* + * exit reiser4 context. Call balance_dirty_pages_at() if necessary. Close + * transaction. Call done_context() to do context related book-keeping. + */ +reiser4_internal void reiser4_exit_context(reiser4_context * context) +{ + assert("nikita-3021", schedulable()); + + if (context == context->parent) { + if (!context->nobalance) { + txn_restart(context); + balance_dirty_pages_at(context); + } + txn_end(context); + } + done_context(context); +} + +/* release resources associated with context. + + This function should be called at the end of "session" with reiser4, + typically just before leaving reiser4 driver back to VFS. + + This is a good place to put debugging consistency checks, e.g. that the + thread has released all locks and closed its transcrash. + +*/ +reiser4_internal void +done_context(reiser4_context * context /* context being released */) +{ + reiser4_context *parent; + assert("nikita-860", context != NULL); + + parent = context->parent; + assert("nikita-2174", parent != NULL); + assert("nikita-2093", parent == parent->parent); + assert("nikita-859", parent->magic == context_magic); + assert("vs-646", (reiser4_context *) current->journal_info == parent); + assert("zam-686", !in_interrupt() && !in_irq()); + + /* only do anything when leaving top-level reiser4 context. All nested + * contexts are just dummies.
*/ + if (parent == context) { + assert("jmacd-673", parent->trans == NULL); + assert("jmacd-1002", lock_stack_isclean(&parent->stack)); + assert("nikita-1936", no_counters_are_held()); + assert("nikita-3403", !delayed_inode_updates(context->dirty)); + assert("nikita-2626", tap_list_empty(taps_list())); + assert("zam-1004", get_super_private(context->super)->delete_sema_owner != current); + + /* release all grabbed but as yet unused blocks */ + if (context->grabbed_blocks != 0) + all_grabbed2free(); + + /* + * synchronize against longterm_unlock_znode(): + * wake_up_requestor() wakes up requestors without holding + * zlock (otherwise they will immediately bump into that lock + * after wake up on another CPU). To work around the (rare) + * situation where a requestor has been woken up asynchronously + * and managed to run until completion (and destroy its + * context and lock stack) before wake_up_requestor() called + * wake_up() on it, wake_up_requestor() synchronizes on the lock + * stack spin lock. It has actually been observed that the spin + * lock _was_ locked at this point, because + * wake_up_requestor() took an interrupt.
+ */ + spin_lock_stack(&context->stack); + spin_unlock_stack(&context->stack); + +#if REISER4_DEBUG + /* remove from active contexts */ + spin_lock(&active_contexts_lock); + context_list_remove(parent); + spin_unlock(&active_contexts_lock); +#endif + assert("zam-684", context->nr_children == 0); + /* restore original ->fs_context value */ + current->journal_info = context->outer; + } else { +#if REISER4_DEBUG + parent->nr_children--; + assert("zam-685", parent->nr_children >= 0); +#endif + } +} + +/* Initialize list of all contexts */ +reiser4_internal int +init_context_mgr(void) +{ +#if REISER4_DEBUG + spin_lock_init(&active_contexts_lock); + context_list_init(&active_contexts); +#endif + return 0; +} + +#if REISER4_DEBUG +/* debugging function: output reiser4 context contents in human readable + * form */ +static void +print_context(const char *prefix, reiser4_context * context) +{ + if (context == NULL) { + printk("%s: null context\n", prefix); + return; + } + print_lock_counters("\tlocks", &context->locks); + printk("pid: %i, comm: %s\n", context->task->pid, context->task->comm); + print_lock_stack("\tlock stack", &context->stack); + info_atom("\tatom", context->trans_in_ctx.atom); +} + +/* debugging: dump contents of all active contexts */ +void +print_contexts(void) +{ + reiser4_context *context; + + spin_lock(&active_contexts_lock); + + for_all_type_safe_list(context, &active_contexts, context) { + print_context("context", context); + } + + spin_unlock(&active_contexts_lock); +} + +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/context.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/context.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,282 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Reiser4 context. See context.c for details.
*/ + +#if !defined( __REISER4_CONTEXT_H__ ) +#define __REISER4_CONTEXT_H__ + +#include "forward.h" +#include "debug.h" +#include "spin_macros.h" +#include "dformat.h" +#include "type_safe_list.h" +#include "tap.h" +#include "lock.h" + +#include /* for __u?? */ +#include /* for struct super_block */ +#include +#include /* for struct task_struct */ + +/* list of active lock stacks */ +#if REISER4_DEBUG +TYPE_SAFE_LIST_DECLARE(context); +#endif + +ON_DEBUG(TYPE_SAFE_LIST_DECLARE(flushers);) + +#if REISER4_DEBUG + +/* + * Stat-data update tracking. + * + * Some reiser4 functions (reiser4_{del,add}_nlink()) take an additional + * parameter indicating whether stat-data update should be performed. This is + * because sometimes fields of the same inode are modified several times + * during a single system call, and updating stat-data (which implies tree lookup and, + * sometimes, tree balancing) on each inode modification is too expensive. To + * avoid unnecessary stat-data updates, we pass a flag to not update it during + * inode field updates, and update it manually at the end of the system call. + * + * This introduces a possibility of "missed stat data update" when final + * stat-data update is not performed in some code path. To detect and track + * down such situations the following code was developed. + * + * dirty_inode_info is an array of slots. Each slot keeps information about + * "delayed stat data update", that is about a call to a function modifying + * inode field that was instructed to not update stat data. Direct call to + * reiser4_update_sd() clears corresponding slot. On leaving reiser4 context + * all slots are scanned and information about still not forced updates is + * printed.
+ */ + +/* how many delayed stat data update slots to remember */ +#define TRACKED_DELAYED_UPDATE (0) + +typedef struct { + ino_t ino; /* inode number of object with delayed stat data + * update */ + int delayed; /* 1 if update is delayed, 0 if update was forced */ + void *stack[4]; /* stack back-trace of the call chain where update was + * delayed */ +} dirty_inode_info[TRACKED_DELAYED_UPDATE]; + +extern void mark_inode_update(struct inode *object, int immediate); +extern int delayed_inode_updates(dirty_inode_info info); + +#else + +typedef struct {} dirty_inode_info; + +#define mark_inode_update(object, immediate) noop +#define delayed_inode_updates(info) noop + +#endif + +/* reiser4 per-thread context */ +struct reiser4_context { + /* magic constant. For identification of reiser4 contexts. */ + __u32 magic; + + /* current lock stack. See lock.[ch]. This is where list of all + locks taken by current thread is kept. This is also used in + deadlock detection. */ + lock_stack stack; + + /* current transcrash. */ + txn_handle *trans; + /* transaction handle embedded into reiser4_context. ->trans points + * here by default. */ + txn_handle trans_in_ctx; + + /* super block we are working with. To get the current tree + use &get_super_private (reiser4_get_current_sb ())->tree. */ + struct super_block *super; + + /* parent fs activation */ + struct fs_activation *outer; + + /* per-thread grabbed (for further allocation) blocks counter */ + reiser4_block_nr grabbed_blocks; + + /* parent context */ + reiser4_context *parent; + + /* list of taps currently monitored. See tap.c */ + tap_list_head taps; + + /* grabbing space is enabled */ + int grab_enabled :1; + /* should be set when we are writing dirty nodes to disk in jnode_flush or + * reiser4_write_logs() */ + int writeout_mode :1; + /* true, if current thread is an ent thread */ + int entd :1; + /* true, if balance_dirty_pages() should not be run when leaving this + * context.
This is used to avoid lengthy balance_dirty_pages() + * operation when holding some important resource, like directory + * ->i_sem */ + int nobalance :1; + + /* count non-trivial jnode_set_dirty() calls */ + unsigned long nr_marked_dirty; +#if REISER4_DEBUG + /* Link in the list of all active contexts. */ + context_list_link contexts_link; + /* debugging information about reiser4 locks held by the current + * thread */ + lock_counters_info locks; + int nr_children; /* number of child contexts */ + struct task_struct *task; /* so we can easily find owner of the stack */ + + /* + * disk space grabbing debugging support + */ + /* how many disk blocks were grabbed by the first call to + * reiser4_grab_space() in this context */ + reiser4_block_nr grabbed_initially; + + /* list of all threads doing flush currently */ + flushers_list_link flushers_link; + /* information about last error encountered by reiser4 */ + err_site err; + /* information about delayed stat data updates. See above. */ + dirty_inode_info dirty; + +#ifdef CONFIG_FRAME_POINTER + void *grabbed_at[4]; +#endif +#endif +}; + +#if REISER4_DEBUG +TYPE_SAFE_LIST_DEFINE(context, reiser4_context, contexts_link); +TYPE_SAFE_LIST_DEFINE(flushers, reiser4_context, flushers_link); +#endif + +extern reiser4_context *get_context_by_lock_stack(lock_stack *); + +/* Debugging helpers. */ +extern int init_context_mgr(void); +#if REISER4_DEBUG +extern void print_contexts(void); +#endif + +#define current_tree (&(get_super_private(reiser4_get_current_sb())->tree)) +#define current_blocksize reiser4_get_current_sb()->s_blocksize +#define current_blocksize_bits reiser4_get_current_sb()->s_blocksize_bits + +extern int init_context(reiser4_context * context, struct super_block *super); +extern void done_context(reiser4_context * context); + +/* magic constant we store in reiser4_context allocated on the stack. Used to + catch accesses to stale or uninitialized contexts.
*/ +#define context_magic ((__u32) 0x4b1b5d0b) + +extern int is_in_reiser4_context(void); + +/* + * return reiser4_context for the thread @tsk + */ +static inline reiser4_context * +get_context(const struct task_struct *tsk) +{ + assert("vs-1682", ((reiser4_context *) tsk->journal_info)->magic == context_magic); + return (reiser4_context *) tsk->journal_info; +} + +/* + * return reiser4 context of the current thread, or NULL if there is none. + */ +static inline reiser4_context * +get_current_context_check(void) +{ + if (is_in_reiser4_context()) + return get_context(current); + else + return NULL; +} + +static inline reiser4_context * get_current_context(void);/* __attribute__((const));*/ + +/* return context associated with current thread */ +static inline reiser4_context * +get_current_context(void) +{ + return get_context(current); +} + +/* + * true if current thread is in the write-out mode. Thread enters write-out + * mode during jnode_flush and reiser4_write_logs(). + */ +static inline int is_writeout_mode(void) +{ + return get_current_context()->writeout_mode; +} + +/* + * enter write-out mode + */ +static inline void writeout_mode_enable(void) +{ + assert("zam-941", !get_current_context()->writeout_mode); + get_current_context()->writeout_mode = 1; +} + +/* + * leave write-out mode + */ +static inline void writeout_mode_disable(void) +{ + assert("zam-942", get_current_context()->writeout_mode); + get_current_context()->writeout_mode = 0; +} + +static inline void grab_space_enable(void) +{ + get_current_context()->grab_enabled = 1; +} + +static inline void grab_space_disable(void) +{ + get_current_context()->grab_enabled = 0; +} + +static inline void grab_space_set_enabled (int enabled) +{ + get_current_context()->grab_enabled = enabled; +} + +static inline int is_grab_enabled(reiser4_context *ctx) +{ + return ctx->grab_enabled; +} + +/* mark transaction handle in @ctx as TXNH_DONT_COMMIT, so that no commit or + * flush would be performed when it is closed. 
This is necessary when handle + * has to be closed under some coarse semaphore, like i_sem of + * directory. Commit will be performed by ktxnmgrd. */ +static inline void context_set_commit_async(reiser4_context * context) +{ + context = context->parent; + context->nobalance = 1; + context->trans->flags |= TXNH_DONT_COMMIT; +} + +extern void reiser4_exit_context(reiser4_context * context); + +/* __REISER4_CONTEXT_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/coord.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/coord.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,959 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "tree.h" +#include "plugin/item/item.h" +#include "znode.h" +#include "coord.h" + +/* Internal constructor. */ +static inline void +coord_init_values(coord_t *coord, const znode *node, pos_in_node_t item_pos, + pos_in_node_t unit_pos, between_enum between) +{ + coord->node = (znode *) node; + coord_set_item_pos(coord, item_pos); + coord->unit_pos = unit_pos; + coord->between = between; + ON_DEBUG(coord->plug_v = 0); + ON_DEBUG(coord->body_v = 0); + + /*ON_TRACE (TRACE_COORDS, "init coord %p node %p: %u %u %s\n", coord, node, item_pos, unit_pos, coord_tween_tostring (between)); */ +} + +/* after shifting of node content, coord previously set properly may become + invalid, try to "normalize" it. 
*/ +reiser4_internal void +coord_normalize(coord_t *coord) +{ + znode *node; + + node = coord->node; + assert("vs-683", node); + + coord_clear_iplug(coord); + + if (node_is_empty(node)) { + coord_init_first_unit(coord, node); + } else if ((coord->between == AFTER_ITEM) || (coord->between == AFTER_UNIT)) { + return; + } else if (coord->item_pos == coord_num_items(coord) && coord->between == BEFORE_ITEM) { + coord_dec_item_pos(coord); + coord->between = AFTER_ITEM; + } else if (coord->unit_pos == coord_num_units(coord) && coord->between == BEFORE_UNIT) { + coord->unit_pos--; + coord->between = AFTER_UNIT; + } else if (coord->item_pos == coord_num_items(coord) && coord->unit_pos == 0 && coord->between == BEFORE_UNIT) { + coord_dec_item_pos(coord); + coord->unit_pos = 0; + coord->between = AFTER_ITEM; + } +} + +/* Copy a coordinate. */ +reiser4_internal void +coord_dup(coord_t * coord, const coord_t * old_coord) +{ + assert("jmacd-9800", coord_check(old_coord)); + coord_dup_nocheck(coord, old_coord); +} + +/* Copy a coordinate without check. Useful when old_coord->node is not + loaded. As in cbk_tree_lookup -> connect_znode -> connect_one_side */ +reiser4_internal void +coord_dup_nocheck(coord_t * coord, const coord_t * old_coord) +{ + coord->node = old_coord->node; + coord_set_item_pos(coord, old_coord->item_pos); + coord->unit_pos = old_coord->unit_pos; + coord->between = old_coord->between; + coord->iplugid = old_coord->iplugid; + ON_DEBUG(coord->plug_v = old_coord->plug_v); + ON_DEBUG(coord->body_v = old_coord->body_v); +} + +/* Initialize an invalid coordinate. */ +reiser4_internal void +coord_init_invalid(coord_t * coord, const znode * node) +{ + coord_init_values(coord, node, 0, 0, INVALID_COORD); +} + +reiser4_internal void +coord_init_first_unit_nocheck(coord_t * coord, const znode * node) +{ + coord_init_values(coord, node, 0, 0, AT_UNIT); +} + +/* Initialize a coordinate to point at the first unit of the first item. 
If the node is + empty, it is positioned at the EMPTY_NODE. */ +reiser4_internal void +coord_init_first_unit(coord_t * coord, const znode * node) +{ + int is_empty = node_is_empty(node); + + coord_init_values(coord, node, 0, 0, (is_empty ? EMPTY_NODE : AT_UNIT)); + + assert("jmacd-9801", coord_check(coord)); +} + +/* Initialize a coordinate to point at the last unit of the last item. If the node is + empty, it is positioned at the EMPTY_NODE. */ +reiser4_internal void +coord_init_last_unit(coord_t * coord, const znode * node) +{ + int is_empty = node_is_empty(node); + + coord_init_values(coord, node, (is_empty ? 0 : node_num_items(node) - 1), 0, (is_empty ? EMPTY_NODE : AT_UNIT)); + if (!is_empty) + coord->unit_pos = coord_last_unit_pos(coord); + assert("jmacd-9802", coord_check(coord)); +} + +/* Initialize a coordinate to before the first item. If the node is empty, it is + positioned at the EMPTY_NODE. */ +reiser4_internal void +coord_init_before_first_item(coord_t * coord, const znode * node) +{ + int is_empty = node_is_empty(node); + + coord_init_values(coord, node, 0, 0, (is_empty ? EMPTY_NODE : BEFORE_UNIT)); + + assert("jmacd-9803", coord_check(coord)); +} + +/* Initialize a coordinate to after the last item. If the node is empty, it is positioned + at the EMPTY_NODE. */ +reiser4_internal void +coord_init_after_last_item(coord_t * coord, const znode * node) +{ + int is_empty = node_is_empty(node); + + coord_init_values(coord, node, + (is_empty ? 0 : node_num_items(node) - 1), 0, (is_empty ? EMPTY_NODE : AFTER_ITEM)); + + assert("jmacd-9804", coord_check(coord)); +} + +/* Initialize a coordinate to after last unit in the item. Coord must be set + already to existing item */ +reiser4_internal void +coord_init_after_item_end(coord_t * coord) +{ + coord->between = AFTER_UNIT; + coord->unit_pos = coord_last_unit_pos(coord); +} + +/* Initialize a coordinate to before the item. 
Coord must be set already to existing item */ +reiser4_internal void +coord_init_before_item(coord_t * coord) +{ + coord->unit_pos = 0; + coord->between = BEFORE_ITEM; +} + +/* Initialize a coordinate to after the item. Coord must be set already to existing item */ +reiser4_internal void +coord_init_after_item(coord_t * coord) +{ + coord->unit_pos = 0; + coord->between = AFTER_ITEM; +} + +/* Initialize a coordinate by 0s. Used in places where init_coord was used and + it was not clear how actually */ +reiser4_internal void +coord_init_zero(coord_t * coord) +{ + memset(coord, 0, sizeof (*coord)); +} + +/* Return the number of units at the present item. Asserts coord_is_existing_item(). */ +reiser4_internal unsigned +coord_num_units(const coord_t * coord) +{ + assert("jmacd-9806", coord_is_existing_item(coord)); + + return item_plugin_by_coord(coord)->b.nr_units(coord); +} + +/* Returns true if the coord was initialized by coord_init_invalid (). */ +/* Audited by: green(2002.06.15) */ +reiser4_internal int +coord_is_invalid(const coord_t * coord) +{ + return coord->between == INVALID_COORD; +} + +/* Returns true if the coordinate is positioned at an existing item, not before or after + an item. It may be placed at, before, or after any unit within the item, whether + existing or not. */ +reiser4_internal int +coord_is_existing_item(const coord_t * coord) +{ + switch (coord->between) { + case EMPTY_NODE: + case BEFORE_ITEM: + case AFTER_ITEM: + case INVALID_COORD: + return 0; + + case BEFORE_UNIT: + case AT_UNIT: + case AFTER_UNIT: + return coord->item_pos < coord_num_items(coord); + } + + impossible("jmacd-9900", "unreachable coord: %p", coord); + return 0; +} + +/* Returns true if the coordinate is positioned at an existing unit, not before or after a + unit.
*/ +/* Audited by: green(2002.06.15) */ +reiser4_internal int +coord_is_existing_unit(const coord_t * coord) +{ + switch (coord->between) { + case EMPTY_NODE: + case BEFORE_UNIT: + case AFTER_UNIT: + case BEFORE_ITEM: + case AFTER_ITEM: + case INVALID_COORD: + return 0; + + case AT_UNIT: + return (coord->item_pos < coord_num_items(coord) && coord->unit_pos < coord_num_units(coord)); + } + + impossible("jmacd-9902", "unreachable"); + return 0; +} + +/* Returns true if the coordinate is positioned at the first unit of the first item. Not + true for empty nodes nor coordinates positioned before the first item. */ +/* Audited by: green(2002.06.15) */ +reiser4_internal int +coord_is_leftmost_unit(const coord_t * coord) +{ + return (coord->between == AT_UNIT && coord->item_pos == 0 && coord->unit_pos == 0); +} + +#if REISER4_DEBUG +/* For assertions only, checks for a valid coordinate. */ +int +coord_check(const coord_t * coord) +{ + if (coord->node == NULL) { + return 0; + } + if (znode_above_root(coord->node)) + return 1; + + switch (coord->between) { + default: + case INVALID_COORD: + return 0; + case EMPTY_NODE: + if (!node_is_empty(coord->node)) { + return 0; + } + return coord->item_pos == 0 && coord->unit_pos == 0; + + case BEFORE_UNIT: + case AFTER_UNIT: + if (node_is_empty(coord->node) && (coord->item_pos == 0) && (coord->unit_pos == 0)) + return 1; + case AT_UNIT: + break; + case AFTER_ITEM: + case BEFORE_ITEM: + /* before/after item should not set unit_pos. */ + if (coord->unit_pos != 0) { + return 0; + } + break; + } + + if (coord->item_pos >= node_num_items(coord->node)) { + return 0; + } + + /* FIXME-VS: we are going to check unit_pos. 
This makes no sense when + between is set to either AFTER_ITEM or BEFORE_ITEM */ + if (coord->between == AFTER_ITEM || coord->between == BEFORE_ITEM) + return 1; + + if (coord_is_iplug_set(coord) && + coord->unit_pos > item_plugin_by_coord(coord)->b.nr_units(coord) - 1) { + return 0; + } + return 1; +} +#endif + +/* Adjust coordinate boundaries based on the number of items prior to coord_next/prev. + Returns 1 if the new position does not exist. */ +static int +coord_adjust_items(coord_t * coord, unsigned items, int is_next) +{ + /* If the coord is invalid, leave it. */ + if (coord->between == INVALID_COORD) { + return 1; + } + + /* If the node is empty, set it appropriately. */ + if (items == 0) { + coord->between = EMPTY_NODE; + coord_set_item_pos(coord, 0); + coord->unit_pos = 0; + return 1; + } + + /* If it was empty and it no longer is, set to BEFORE/AFTER_ITEM. */ + if (coord->between == EMPTY_NODE) { + coord->between = (is_next ? BEFORE_ITEM : AFTER_ITEM); + coord_set_item_pos(coord, 0); + coord->unit_pos = 0; + return 0; + } + + /* If the item_pos is out-of-range, set it appropriately. */ + if (coord->item_pos >= items) { + coord->between = AFTER_ITEM; + coord_set_item_pos(coord, items - 1); + coord->unit_pos = 0; + /* If is_next, return 1 (can't go any further). */ + return is_next; + } + + return 0; +} + +/* Advances the coordinate by one unit to the right. If empty, no change. If + coord_is_rightmost_unit, advances to AFTER THE LAST ITEM. Returns 0 if new position is an + existing unit. */ +reiser4_internal int +coord_next_unit(coord_t * coord) +{ + unsigned items = coord_num_items(coord); + + if (coord_adjust_items(coord, items, 1) == 1) { + return 1; + } + + switch (coord->between) { + case BEFORE_UNIT: + /* Now it is positioned at the same unit. */ + coord->between = AT_UNIT; + return 0; + + case AFTER_UNIT: + case AT_UNIT: + /* If it was at or after a unit and there are more units in this item, + advance to the next one.
*/ + if (coord->unit_pos < coord_last_unit_pos(coord)) { + coord->unit_pos += 1; + coord->between = AT_UNIT; + return 0; + } + + /* Otherwise, it is crossing an item boundary and treated as if it was + after the current item. */ + coord->between = AFTER_ITEM; + coord->unit_pos = 0; + /* FALLTHROUGH */ + + case AFTER_ITEM: + /* Check for end-of-node. */ + if (coord->item_pos == items - 1) { + return 1; + } + + coord_inc_item_pos(coord); + coord->unit_pos = 0; + coord->between = AT_UNIT; + return 0; + + case BEFORE_ITEM: + /* The adjust_items checks ensure that we are valid here. */ + coord->unit_pos = 0; + coord->between = AT_UNIT; + return 0; + + case INVALID_COORD: + case EMPTY_NODE: + /* Handled in coord_adjust_items(). */ + break; + } + + impossible("jmacd-9902", "unreachable"); + return 0; +} + +/* Advances the coordinate by one item to the right. If empty, no change. If + coord_is_rightmost_unit, advances to AFTER THE LAST ITEM. Returns 0 if new position is + an existing item. */ +reiser4_internal int +coord_next_item(coord_t * coord) +{ + unsigned items = coord_num_items(coord); + + if (coord_adjust_items(coord, items, 1) == 1) { + return 1; + } + + switch (coord->between) { + case AFTER_UNIT: + case AT_UNIT: + case BEFORE_UNIT: + case AFTER_ITEM: + /* Check for end-of-node. */ + if (coord->item_pos == items - 1) { + coord->between = AFTER_ITEM; + coord->unit_pos = 0; + coord_clear_iplug(coord); + return 1; + } + + /* Anywhere in an item, go to the next one. */ + coord->between = AT_UNIT; + coord_inc_item_pos(coord); + coord->unit_pos = 0; + return 0; + + case BEFORE_ITEM: + /* The out-of-range check ensures that we are valid here. */ + coord->unit_pos = 0; + coord->between = AT_UNIT; + return 0; + case INVALID_COORD: + case EMPTY_NODE: + /* Handled in coord_adjust_items(). */ + break; + } + + impossible("jmacd-9903", "unreachable"); + return 0; +} + +/* Advances the coordinate by one unit to the left. If empty, no change. 
If + coord_is_leftmost_unit, advances to BEFORE THE FIRST ITEM. Returns 0 if new position + is an existing unit. */ +reiser4_internal int +coord_prev_unit(coord_t * coord) +{ + unsigned items = coord_num_items(coord); + + if (coord_adjust_items(coord, items, 0) == 1) { + return 1; + } + + switch (coord->between) { + case AT_UNIT: + case BEFORE_UNIT: + if (coord->unit_pos > 0) { + coord->unit_pos -= 1; + coord->between = AT_UNIT; + return 0; + } + + if (coord->item_pos == 0) { + coord->between = BEFORE_ITEM; + return 1; + } + + coord_dec_item_pos(coord); + coord->unit_pos = coord_last_unit_pos(coord); + coord->between = AT_UNIT; + return 0; + + case AFTER_UNIT: + /* What if unit_pos is out-of-range? */ + assert("jmacd-5442", coord->unit_pos <= coord_last_unit_pos(coord)); + coord->between = AT_UNIT; + return 0; + + case BEFORE_ITEM: + if (coord->item_pos == 0) { + return 1; + } + + coord_dec_item_pos(coord); + /* FALLTHROUGH */ + + case AFTER_ITEM: + coord->between = AT_UNIT; + coord->unit_pos = coord_last_unit_pos(coord); + return 0; + + case INVALID_COORD: + case EMPTY_NODE: + break; + } + + impossible("jmacd-9904", "unreachable"); + return 0; +} + +/* Advances the coordinate by one item to the left. If empty, no change. If + coord_is_leftmost_unit, advances to BEFORE THE FIRST ITEM. Returns 0 if new position + is an existing item. 
*/ +reiser4_internal int +coord_prev_item(coord_t * coord) +{ + unsigned items = coord_num_items(coord); + + if (coord_adjust_items(coord, items, 0) == 1) { + return 1; + } + + switch (coord->between) { + case AT_UNIT: + case AFTER_UNIT: + case BEFORE_UNIT: + case BEFORE_ITEM: + + if (coord->item_pos == 0) { + coord->between = BEFORE_ITEM; + coord->unit_pos = 0; + return 1; + } + + coord_dec_item_pos(coord); + coord->unit_pos = 0; + coord->between = AT_UNIT; + return 0; + + case AFTER_ITEM: + coord->between = AT_UNIT; + coord->unit_pos = 0; + return 0; + + case INVALID_COORD: + case EMPTY_NODE: + break; + } + + impossible("jmacd-9905", "unreachable"); + return 0; +} + +/* Calls either coord_init_first_unit or coord_init_last_unit depending on sideof argument. */ +reiser4_internal void +coord_init_sideof_unit(coord_t * coord, const znode * node, sideof dir) +{ + assert("jmacd-9821", dir == LEFT_SIDE || dir == RIGHT_SIDE); + if (dir == LEFT_SIDE) { + coord_init_first_unit(coord, node); + } else { + coord_init_last_unit(coord, node); + } +} + +/* Calls either coord_is_before_leftmost or coord_is_after_rightmost depending on sideof + argument. */ +/* Audited by: green(2002.06.15) */ +reiser4_internal int +coord_is_after_sideof_unit(coord_t * coord, sideof dir) +{ + assert("jmacd-9822", dir == LEFT_SIDE || dir == RIGHT_SIDE); + if (dir == LEFT_SIDE) { + return coord_is_before_leftmost(coord); + } else { + return coord_is_after_rightmost(coord); + } +} + +/* Calls either coord_next_unit or coord_prev_unit depending on sideof argument. 
*/ +/* Audited by: green(2002.06.15) */ +reiser4_internal int +coord_sideof_unit(coord_t * coord, sideof dir) +{ + assert("jmacd-9823", dir == LEFT_SIDE || dir == RIGHT_SIDE); + if (dir == LEFT_SIDE) { + return coord_prev_unit(coord); + } else { + return coord_next_unit(coord); + } +} + +#if REISER4_DEBUG +#define DEBUG_COORD_FIELDS (sizeof(c1->plug_v) + sizeof(c1->body_v)) +#else +#define DEBUG_COORD_FIELDS (0) +#endif + +reiser4_internal int +coords_equal(const coord_t * c1, const coord_t * c2) +{ + assert("nikita-2840", c1 != NULL); + assert("nikita-2841", c2 != NULL); + +#if 0 + /* assertion to track changes in coord_t */ + cassert(sizeof(*c1) == sizeof(c1->node) + + sizeof(c1->item_pos) + + sizeof(c1->unit_pos) + + sizeof(c1->iplugid) + + sizeof(c1->between) + + sizeof(c1->pad) + + sizeof(c1->offset) + + DEBUG_COORD_FIELDS); +#endif + return + c1->node == c2->node && + c1->item_pos == c2->item_pos && + c1->unit_pos == c2->unit_pos && + c1->between == c2->between; +} + +/* If coord_is_before_leftmost return COORD_ON_THE_LEFT, if coord_is_after_rightmost + return COORD_ON_THE_RIGHT, otherwise return COORD_INSIDE. */ +/* Audited by: green(2002.06.15) */ +reiser4_internal coord_wrt_node coord_wrt(const coord_t * coord) +{ + if (coord_is_before_leftmost(coord)) { + return COORD_ON_THE_LEFT; + } + + if (coord_is_after_rightmost(coord)) { + return COORD_ON_THE_RIGHT; + } + + return COORD_INSIDE; +} + +/* Returns true if the coordinate is positioned after the last item or after the last unit + of the last item or it is an empty node.
*/ +/* Audited by: green(2002.06.15) */ +reiser4_internal int +coord_is_after_rightmost(const coord_t * coord) +{ + assert("jmacd-7313", coord_check(coord)); + + switch (coord->between) { + case INVALID_COORD: + case AT_UNIT: + case BEFORE_UNIT: + case BEFORE_ITEM: + return 0; + + case EMPTY_NODE: + return 1; + + case AFTER_ITEM: + return (coord->item_pos == node_num_items(coord->node) - 1); + + case AFTER_UNIT: + return ((coord->item_pos == node_num_items(coord->node) - 1) && + coord->unit_pos == coord_last_unit_pos(coord)); + } + + impossible("jmacd-9908", "unreachable"); + return 0; +} + +/* Returns true if the coordinate is positioned before the first item or it is an empty + node. */ +reiser4_internal int +coord_is_before_leftmost(const coord_t * coord) +{ + /* FIXME-VS: coord_check requires node to be loaded whereas it is not + necessary to check if coord is set before leftmost + assert ("jmacd-7313", coord_check (coord)); */ + switch (coord->between) { + case INVALID_COORD: + case AT_UNIT: + case AFTER_ITEM: + case AFTER_UNIT: + return 0; + + case EMPTY_NODE: + return 1; + + case BEFORE_ITEM: + case BEFORE_UNIT: + return (coord->item_pos == 0) && (coord->unit_pos == 0); + } + + impossible("jmacd-9908", "unreachable"); + return 0; +} + +/* Returns true if the coordinate is positioned after an item, before an item, after the + last unit of an item, before the first unit of an item, or at an empty node.
*/ +/* Audited by: green(2002.06.15) */ +reiser4_internal int +coord_is_between_items(const coord_t * coord) +{ + assert("jmacd-7313", coord_check(coord)); + + switch (coord->between) { + case INVALID_COORD: + case AT_UNIT: + return 0; + + case AFTER_ITEM: + case BEFORE_ITEM: + case EMPTY_NODE: + return 1; + + case BEFORE_UNIT: + return coord->unit_pos == 0; + + case AFTER_UNIT: + return coord->unit_pos == coord_last_unit_pos(coord); + } + + impossible("jmacd-9908", "unreachable"); + return 0; +} + +/* Returns true if the coordinates are positioned at adjacent units, regardless of + before-after or item boundaries. */ +reiser4_internal int +coord_are_neighbors(coord_t * c1, coord_t * c2) +{ + coord_t *left; + coord_t *right; + + assert("nikita-1241", c1 != NULL); + assert("nikita-1242", c2 != NULL); + assert("nikita-1243", c1->node == c2->node); + assert("nikita-1244", coord_is_existing_unit(c1)); + assert("nikita-1245", coord_is_existing_unit(c2)); + + left = right = 0; + switch (coord_compare(c1, c2)) { + case COORD_CMP_ON_LEFT: + left = c1; + right = c2; + break; + case COORD_CMP_ON_RIGHT: + left = c2; + right = c1; + break; + case COORD_CMP_SAME: + return 0; + default: + wrong_return_value("nikita-1246", "compare_coords()"); + } + assert("vs-731", left && right); + if (left->item_pos == right->item_pos) { + return left->unit_pos + 1 == right->unit_pos; + } else if (left->item_pos + 1 == right->item_pos) { + return (left->unit_pos == coord_last_unit_pos(left)) && (right->unit_pos == 0); + } else { + return 0; + } +} + +/* Assuming two coordinates are positioned in the same node, return COORD_CMP_ON_RIGHT, + COORD_CMP_ON_LEFT, or COORD_CMP_SAME depending on c1's position relative to c2. 
*/ +/* Audited by: green(2002.06.15) */ +reiser4_internal coord_cmp coord_compare(coord_t * c1, coord_t * c2) +{ + assert("vs-209", c1->node == c2->node); + assert("vs-194", coord_is_existing_unit(c1) + && coord_is_existing_unit(c2)); + + if (c1->item_pos > c2->item_pos) + return COORD_CMP_ON_RIGHT; + if (c1->item_pos < c2->item_pos) + return COORD_CMP_ON_LEFT; + if (c1->unit_pos > c2->unit_pos) + return COORD_CMP_ON_RIGHT; + if (c1->unit_pos < c2->unit_pos) + return COORD_CMP_ON_LEFT; + return COORD_CMP_SAME; +} + +/* If the coordinate is between items, shifts it to the right. Returns 0 on success and + non-zero if there is no position to the right. */ +reiser4_internal int +coord_set_to_right(coord_t * coord) +{ + unsigned items = coord_num_items(coord); + + if (coord_adjust_items(coord, items, 1) == 1) { + return 1; + } + + switch (coord->between) { + case AT_UNIT: + return 0; + + case BEFORE_ITEM: + case BEFORE_UNIT: + coord->between = AT_UNIT; + return 0; + + case AFTER_UNIT: + if (coord->unit_pos < coord_last_unit_pos(coord)) { + coord->unit_pos += 1; + coord->between = AT_UNIT; + return 0; + } else { + + coord->unit_pos = 0; + + if (coord->item_pos == items - 1) { + coord->between = AFTER_ITEM; + return 1; + } + + coord_inc_item_pos(coord); + coord->between = AT_UNIT; + return 0; + } + + case AFTER_ITEM: + if (coord->item_pos == items - 1) { + return 1; + } + + coord_inc_item_pos(coord); + coord->unit_pos = 0; + coord->between = AT_UNIT; + return 0; + + case EMPTY_NODE: + return 1; + + case INVALID_COORD: + break; + } + + impossible("jmacd-9920", "unreachable"); + return 0; +} + +/* If the coordinate is between items, shifts it to the left. Returns 0 on success and + non-zero if there is no position to the left. 
*/ +reiser4_internal int +coord_set_to_left(coord_t * coord) +{ + unsigned items = coord_num_items(coord); + + if (coord_adjust_items(coord, items, 0) == 1) { + return 1; + } + + switch (coord->between) { + case AT_UNIT: + return 0; + + case AFTER_UNIT: + coord->between = AT_UNIT; + return 0; + + case AFTER_ITEM: + coord->between = AT_UNIT; + coord->unit_pos = coord_last_unit_pos(coord); + return 0; + + case BEFORE_UNIT: + if (coord->unit_pos > 0) { + coord->unit_pos -= 1; + coord->between = AT_UNIT; + return 0; + } else { + + if (coord->item_pos == 0) { + coord->between = BEFORE_ITEM; + return 1; + } + + coord->unit_pos = coord_last_unit_pos(coord); + coord_dec_item_pos(coord); + coord->between = AT_UNIT; + return 0; + } + + case BEFORE_ITEM: + if (coord->item_pos == 0) { + return 1; + } + + coord_dec_item_pos(coord); + coord->unit_pos = coord_last_unit_pos(coord); + coord->between = AT_UNIT; + return 0; + + case EMPTY_NODE: + return 1; + + case INVALID_COORD: + break; + } + + impossible("jmacd-9920", "unreachable"); + return 0; +} + +reiser4_internal const char * +coord_tween_tostring(between_enum n) +{ + switch (n) { + case BEFORE_UNIT: + return "before unit"; + case BEFORE_ITEM: + return "before item"; + case AT_UNIT: + return "at unit"; + case AFTER_UNIT: + return "after unit"; + case AFTER_ITEM: + return "after item"; + case EMPTY_NODE: + return "empty node"; + case INVALID_COORD: + return "invalid"; + default:{ + static char buf[30]; + + sprintf(buf, "unknown: %i", n); + return buf; + } + } +} + +reiser4_internal void +print_coord(const char *mes, const coord_t * coord, int node) +{ + if (coord == NULL) { + printk("%s: null\n", mes); + return; + } + printk("%s: item_pos = %d, unit_pos %d, tween=%s, iplug=%d\n", + mes, coord->item_pos, coord->unit_pos, coord_tween_tostring(coord->between), coord->iplugid); + if (node) + print_znode("\tnode", coord->node); +} + +reiser4_internal int +item_utmost_child_real_block(const coord_t * coord, sideof side, 
reiser4_block_nr * blk) +{ + return item_plugin_by_coord(coord)->f.utmost_child_real_block(coord, side, blk); +} + +reiser4_internal int +item_utmost_child(const coord_t * coord, sideof side, jnode ** child) +{ + return item_plugin_by_coord(coord)->f.utmost_child(coord, side, child); +} + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/coord.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/coord.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,337 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Coords */ + +#if !defined( __REISER4_COORD_H__ ) +#define __REISER4_COORD_H__ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" + +/* insertions happen between coords in the tree, so we need some means + of specifying the sense of betweenness. */ +typedef enum { + BEFORE_UNIT, /* Note: we/init_coord depends on this value being zero. */ + AT_UNIT, + AFTER_UNIT, + BEFORE_ITEM, + AFTER_ITEM, + INVALID_COORD, + EMPTY_NODE, +} between_enum; + +/* location of coord w.r.t. its node */ +typedef enum { + COORD_ON_THE_LEFT = -1, + COORD_ON_THE_RIGHT = +1, + COORD_INSIDE = 0 +} coord_wrt_node; + +typedef enum { + COORD_CMP_SAME = 0, COORD_CMP_ON_LEFT = -1, COORD_CMP_ON_RIGHT = +1 +} coord_cmp; + +struct coord { + /* node in a tree */ + /* 0 */ znode *node; + + /* position of item within node */ + /* 4 */ pos_in_node_t item_pos; + /* position of unit within item */ + /* 6 */ pos_in_node_t unit_pos; + /* optimization: plugin of item is stored in coord_t. Until this was + implemented, item_plugin_by_coord() was major CPU consumer. ->iplugid + is invalidated (set to 0xff) on each modification of ->item_pos, + and all such modifications are funneled through coord_*_item_pos() + functions below. + */ + /* 8 */ char iplugid; + /* position of coord w.r.t. to neighboring items and/or units. 
+ Values are taken from &between_enum above. + */ + /* 9 */ char between; + /* padding. It will be added by the compiler anyway to conform to the + * C language alignment requirements. We keep it here to be on the + * safe side and to have a clear picture of the memory layout of this + * structure. */ + /* 10 */ __u16 pad; + /* 12 */ int offset; +#if REISER4_DEBUG + unsigned long plug_v; + unsigned long body_v; +#endif +}; + +#define INVALID_PLUGID ((char)((1 << 8) - 1)) +#define INVALID_OFFSET -1 + +static inline void +coord_clear_iplug(coord_t * coord) +{ + assert("nikita-2835", coord != NULL); + coord->iplugid = INVALID_PLUGID; + coord->offset = INVALID_OFFSET; +} + +static inline int +coord_is_iplug_set(const coord_t * coord) +{ + assert("nikita-2836", coord != NULL); + return coord->iplugid != INVALID_PLUGID; +} + +static inline void +coord_set_item_pos(coord_t * coord, pos_in_node_t pos) +{ + assert("nikita-2478", coord != NULL); + coord->item_pos = pos; + coord_clear_iplug(coord); +} + +static inline void +coord_dec_item_pos(coord_t * coord) +{ + assert("nikita-2480", coord != NULL); + --coord->item_pos; + coord_clear_iplug(coord); +} + +static inline void +coord_inc_item_pos(coord_t * coord) +{ + assert("nikita-2481", coord != NULL); + ++coord->item_pos; + coord_clear_iplug(coord); +} + +static inline void +coord_add_item_pos(coord_t * coord, int delta) +{ + assert("nikita-2482", coord != NULL); + coord->item_pos += delta; + coord_clear_iplug(coord); +} + +static inline void +coord_invalid_item_pos(coord_t * coord) +{ + assert("nikita-2832", coord != NULL); + coord->item_pos = (unsigned short)~0; + coord_clear_iplug(coord); +} + +/* Reverse a direction. */ +static inline sideof +sideof_reverse(sideof side) +{ + return side == LEFT_SIDE ? 
RIGHT_SIDE : LEFT_SIDE; +} + +/* NOTE: There is a somewhat odd mixture of the following opposed terms: + + "first" and "last" + "next" and "prev" + "before" and "after" + "leftmost" and "rightmost" + + But I think the chosen names are decent the way they are. +*/ + +/* COORD INITIALIZERS */ + +/* Initialize an invalid coordinate. */ +extern void coord_init_invalid(coord_t * coord, const znode * node); + +extern void coord_init_first_unit_nocheck(coord_t * coord, const znode * node); + +/* Initialize a coordinate to point at the first unit of the first item. If the node is + empty, it is positioned at the EMPTY_NODE. */ +extern void coord_init_first_unit(coord_t * coord, const znode * node); + +/* Initialize a coordinate to point at the last unit of the last item. If the node is + empty, it is positioned at the EMPTY_NODE. */ +extern void coord_init_last_unit(coord_t * coord, const znode * node); + +/* Initialize a coordinate to before the first item. If the node is empty, it is + positioned at the EMPTY_NODE. */ +extern void coord_init_before_first_item(coord_t * coord, const znode * node); + +/* Initialize a coordinate to after the last item. If the node is empty, it is positioned + at the EMPTY_NODE. */ +extern void coord_init_after_last_item(coord_t * coord, const znode * node); + +/* Initialize a coordinate to after last unit in the item. Coord must be set + already to existing item */ +void coord_init_after_item_end(coord_t * coord); + +/* Initialize a coordinate to before the item. Coord must be set already to existing item */ +void coord_init_before_item(coord_t *); +/* Initialize a coordinate to after the item. Coord must be set already to existing item */ +void coord_init_after_item(coord_t *); + +/* Calls either coord_init_first_unit or coord_init_last_unit depending on sideof argument. */ +extern void coord_init_sideof_unit(coord_t * coord, const znode * node, sideof dir); + +/* Initialize a coordinate by 0s. 
Used in places where init_coord was used and + it was not clear how actually + FIXME-VS: added by vs (2002, june, 8) */ +extern void coord_init_zero(coord_t * coord); + +/* COORD METHODS */ + +/* after shifting of node content, coord previously set properly may become + invalid, try to "normalize" it. */ +void coord_normalize(coord_t * coord); + +/* Copy a coordinate. */ +extern void coord_dup(coord_t * coord, const coord_t * old_coord); + +/* Copy a coordinate without check. */ +void coord_dup_nocheck(coord_t * coord, const coord_t * old_coord); + +unsigned coord_num_units(const coord_t * coord); + +/* Return the last valid unit number at the present item (i.e., + coord_num_units() - 1). */ +static inline unsigned +coord_last_unit_pos(const coord_t * coord) +{ + return coord_num_units(coord) - 1; +} + +#if REISER4_DEBUG +/* For assertions only, checks for a valid coordinate. */ +extern int coord_check(const coord_t * coord); + +extern unsigned long znode_times_locked(const znode *z); + +static inline void +coord_update_v(coord_t * coord) +{ + coord->plug_v = coord->body_v = znode_times_locked(coord->node); +} +#endif + +extern int coords_equal(const coord_t * c1, const coord_t * c2); + +extern void print_coord(const char *mes, const coord_t * coord, int print_node); + +/* If coord_is_before_leftmost return COORD_ON_THE_LEFT, if coord_is_after_rightmost + return COORD_ON_THE_RIGHT, otherwise return COORD_INSIDE. */ +extern coord_wrt_node coord_wrt(const coord_t * coord); + +/* Returns true if the coordinates are positioned at adjacent units, regardless of + before-after or item boundaries. */ +extern int coord_are_neighbors(coord_t * c1, coord_t * c2); + +/* Assuming two coordinates are positioned in the same node, return COORD_CMP_ON_RIGHT, + COORD_CMP_ON_LEFT, or COORD_CMP_SAME depending on c1's position relative to c2.
*/ +extern coord_cmp coord_compare(coord_t * c1, coord_t * c2); + +/* COORD PREDICATES */ + +/* Returns true if the coord was initialized by coord_init_invalid(). */ +extern int coord_is_invalid(const coord_t * coord); + +/* Returns true if the coordinate is positioned at an existing item, not before or after + an item. It may be placed at, before, or after any unit within the item, whether + existing or not. If this is true you can call methods of the item plugin. */ +extern int coord_is_existing_item(const coord_t * coord); + +/* Returns true if the coordinate is positioned after an item, before an item, after the + last unit of an item, before the first unit of an item, or at an empty node. */ +extern int coord_is_between_items(const coord_t * coord); + +/* Returns true if the coordinate is positioned at an existing unit, not before or after a + unit. */ +extern int coord_is_existing_unit(const coord_t * coord); + +/* Returns true if the coordinate is positioned at an empty node. */ +extern int coord_is_empty(const coord_t * coord); + +/* Returns true if the coordinate is positioned at the first unit of the first item. Not + true for empty nodes nor coordinates positioned before the first item. */ +extern int coord_is_leftmost_unit(const coord_t * coord); + +/* Returns true if the coordinate is positioned after the last item or after the last unit + of the last item or it is an empty node. */ +extern int coord_is_after_rightmost(const coord_t * coord); + +/* Returns true if the coordinate is positioned before the first item or it is an empty + node. */ +extern int coord_is_before_leftmost(const coord_t * coord); + +/* Calls either coord_is_before_leftmost or coord_is_after_rightmost depending on sideof + argument. */ +extern int coord_is_after_sideof_unit(coord_t * coord, sideof dir); + +/* COORD MODIFIERS */ + +/* Advances the coordinate by one unit to the right. If empty, no change. If + coord_is_rightmost_unit, advances to AFTER THE LAST ITEM. 
Returns 0 if new position is + an existing unit. */ +extern int coord_next_unit(coord_t * coord); + +/* Advances the coordinate by one item to the right. If empty, no change. If + coord_is_rightmost_unit, advances to AFTER THE LAST ITEM. Returns 0 if new position is + an existing item. */ +extern int coord_next_item(coord_t * coord); + +/* Advances the coordinate by one unit to the left. If empty, no change. If + coord_is_leftmost_unit, advances to BEFORE THE FIRST ITEM. Returns 0 if new position + is an existing unit. */ +extern int coord_prev_unit(coord_t * coord); + +/* Advances the coordinate by one item to the left. If empty, no change. If + coord_is_leftmost_unit, advances to BEFORE THE FIRST ITEM. Returns 0 if new position + is an existing item. */ +extern int coord_prev_item(coord_t * coord); + +/* If the coordinate is between items, shifts it to the right. Returns 0 on success and + non-zero if there is no position to the right. */ +extern int coord_set_to_right(coord_t * coord); + +/* If the coordinate is between items, shifts it to the left. Returns 0 on success and + non-zero if there is no position to the left. */ +extern int coord_set_to_left(coord_t * coord); + +/* If the coordinate is at an existing unit, set to after that unit. Returns 0 on success + and non-zero if the unit did not exist. */ +extern int coord_set_after_unit(coord_t * coord); + +/* Calls either coord_next_unit or coord_prev_unit depending on sideof argument. 
*/ +extern int coord_sideof_unit(coord_t * coord, sideof dir); + +/* iterate over all units in @node */ +#define for_all_units( coord, node ) \ + for( coord_init_before_first_item( ( coord ), ( node ) ) ; \ + coord_next_unit( coord ) == 0 ; ) + +/* iterate over all items in @node */ +#define for_all_items( coord, node ) \ + for( coord_init_before_first_item( ( coord ), ( node ) ) ; \ + coord_next_item( coord ) == 0 ; ) + +#if REISER4_DEBUG +extern const char *coord_tween_tostring(between_enum n); +#endif + +/* COORD/ITEM METHODS */ + +extern int item_utmost_child_real_block(const coord_t * coord, sideof side, reiser4_block_nr * blk); +extern int item_utmost_child(const coord_t * coord, sideof side, jnode ** child); + +/* __REISER4_COORD_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/crypt.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/crypt.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,92 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ +/* Crypto-plugins for reiser4 cryptcompress objects */ + +#include "debug.h" +#include "plugin/plugin.h" +#include "plugin/cryptcompress.h" +#include +#include + +#define MAX_CRYPTO_BLOCKSIZE 128 +#define NONE_EXPKEY_WORDS 8 +#define NONE_BLOCKSIZE 8 + +/* + Default align() method of the crypto-plugin (look for description of this method + in plugin/plugin.h) + +1) creates the aligning armored format of the input flow before encryption. + "armored" means that padding is filled by private data (for example, + pseudo-random sequence of bytes is not private data). 
+2) returns length of appended padding + + [ flow | aligning_padding ] + ^ + | + @pad +*/ +UNUSED_ARG static int +align_cluster_common(__u8 *pad /* pointer to the first byte of aligning format */, + int flow_size /* size of non-aligned flow */, + int blocksize /* crypto-block size */) +{ + int pad_size; + + assert("edward-01", pad != NULL); + assert("edward-02", flow_size != 0); + assert("edward-03", blocksize != 0 && blocksize <= MAX_CRYPTO_BLOCKSIZE); + + pad_size = blocksize - (flow_size % blocksize); + get_random_bytes (pad, pad_size); + return pad_size; +} + +/* common scale method (look for description of this method in plugin/plugin.h) + for all symmetric algorithms which don't scale anything +*/ +static loff_t scale_common(struct inode * inode UNUSED_ARG, + size_t blocksize UNUSED_ARG /* crypto block size, which is returned + by blocksize method of crypto plugin */, + loff_t src_off /* offset to scale */) +{ + return src_off; +} + +REGISTER_NONE_ALG(crypt, CRYPTO) + +/* EDWARD-FIXME-HANS: why is this not in the plugin directory? */ + +/* crypto plugins */ +crypto_plugin crypto_plugins[LAST_CRYPTO_ID] = { + [NONE_CRYPTO_ID] = { + .h = { + .type_id = REISER4_CRYPTO_PLUGIN_TYPE, + .id = NONE_CRYPTO_ID, + .pops = NULL, + /* If you want your files not to be crypto + transformed, specify this crypto plugin */ + .label = "none", + .desc = "absence of crypto transform", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .alloc = alloc_none_crypt, + .free = free_none_crypt, + .nr_keywords = NONE_EXPKEY_WORDS, + .scale = scale_common, + .align_cluster = NULL, + .setkey = NULL, + .encrypt = NULL, + .decrypt = NULL + } +}; + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/debug.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/debug.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,447 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Debugging facilities. */ + +/* + * This file contains generic debugging functions used by reiser4. Roughly + * following: + * + * panicking: reiser4_do_panic(), reiser4_print_prefix(). + * + * locking: schedulable(), lock_counters(), print_lock_counters(), + * no_counters_are_held(), commit_check_locks() + * + * {debug,trace,log}_flags: reiser4_are_all_debugged(), + * reiser4_is_debugged(), get_current_trace_flags(), + * get_current_log_flags(). + * + * kmalloc/kfree leak detection: reiser4_kmalloc(), reiser4_kfree(), + * reiser4_kfree_in_sb(). + * + * error code monitoring (see comment before RETERR macro): return_err(), + * report_err(). + * + * stack back-tracing: fill_backtrace() + * + * miscellaneous: preempt_point(), call_on_each_assert(), debugtrap(). + * + */ + +#include "reiser4.h" +#include "context.h" +#include "super.h" +#include "txnmgr.h" +#include "znode.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#if REISER4_DEBUG +static void report_err(void); +#else +#define report_err() noop +#endif + +/* + * global buffer where message given to reiser4_panic is formatted. + */ +static char panic_buf[REISER4_PANIC_MSG_BUFFER_SIZE]; + +/* + * lock protecting consistency of panic_buf under concurrent panics + */ +static spinlock_t panic_guard = SPIN_LOCK_UNLOCKED; + +#if REISER4_DEBUG +static int +reiser4_is_debugged(struct super_block *super, __u32 flag); +#endif + +/* Your best friend. Call it on each occasion. This is called by + fs/reiser4/debug.h:reiser4_panic(). 
*/ +reiser4_internal void +reiser4_do_panic(const char *format /* format string */ , ... /* rest */) +{ + static int in_panic = 0; + va_list args; + + /* + * check for recursive panic. + */ + if (in_panic == 0) { + in_panic = 1; + + spin_lock(&panic_guard); + va_start(args, format); + vsnprintf(panic_buf, sizeof(panic_buf), format, args); + va_end(args); + printk(KERN_EMERG "reiser4 panicked cowardly: %s", panic_buf); + spin_unlock(&panic_guard); + + /* + * if kernel debugger is configured---drop in. Early dropping + * into kgdb is not always convenient, because panic message + * is not yet printed most of the time. But: + * + * (1) message can be extracted from printk_buf[] + * (declared static inside of printk()), and + * + * (2) sometimes serial/kgdb combo dies while printing + * long panic message, so it's more prudent to break into + * debugger earlier. + * + */ + DEBUGON(1); + +#if REISER4_DEBUG + if (get_current_context_check() != NULL) { + struct super_block *super; + reiser4_context *ctx; + + /* + * if we are within reiser4 context, print its contents: + */ + + /* lock counters... */ + ON_DEBUG(print_lock_counters("pins held", lock_counters())); + /* other active contexts... */ + ON_DEBUG(print_contexts()); + ctx = get_current_context(); + super = ctx->super; + if (get_super_private(super) != NULL && + reiser4_is_debugged(super, REISER4_VERBOSE_PANIC)) + /* znodes... */ + print_znodes("znodes", current_tree); + { + extern spinlock_t active_contexts_lock; + + /* + * remove context from the list of active + * contexts. This is a precautionary measure: + * current is going to die, and leaving + * context on the list would leave the list + * corrupted. 
+ */ + spin_lock(&active_contexts_lock); + context_list_remove(ctx->parent); + spin_unlock(&active_contexts_lock); + } + } +#endif + } + BUG(); + /* to make gcc happy about noreturn attribute */ + panic("%s", panic_buf); +} + +reiser4_internal void +reiser4_print_prefix(const char *level, int reperr, const char *mid, + const char *function, const char *file, int lineno) +{ + const char *comm; + int pid; + + if (unlikely(in_interrupt() || in_irq())) { + comm = "interrupt"; + pid = 0; + } else { + comm = current->comm; + pid = current->pid; + } + printk("%sreiser4[%.16s(%i)]: %s (%s:%i)[%s]:\n", + level, comm, pid, function, file, lineno, mid); + if (reperr) + report_err(); +} + +/* Preemption point: this should be called periodically during long running + operations (carry, allocate, and squeeze are best examples) */ +reiser4_internal int +preempt_point(void) +{ + assert("nikita-3008", schedulable()); + cond_resched(); + return signal_pending(current); +} + +#if REISER4_DEBUG + +/* check that no spinlocks are held */ +int schedulable(void) +{ + if (get_current_context_check() != NULL) { + if (!LOCK_CNT_NIL(spin_locked)) { + print_lock_counters("in atomic", lock_counters()); + return 0; + } + } + might_sleep(); + return 1; +} +#endif + +#if REISER4_DEBUG +/* Debugging aid: return struct where information about locks taken by current + thread is accumulated. This can be used to formulate lock ordering + constraints and various assertions. + +*/ +lock_counters_info * +lock_counters(void) +{ + reiser4_context *ctx = get_current_context(); + assert("jmacd-1123", ctx != NULL); + return &ctx->locks; +} + +/* + * print human readable information about locks held by the reiser4 context. 
+ */ +void +print_lock_counters(const char *prefix, const lock_counters_info * info) +{ + printk("%s: jnode: %i, tree: %i (r:%i,w:%i), dk: %i (r:%i,w:%i)\n" + "jload: %i, " + "txnh: %i, atom: %i, stack: %i, txnmgr: %i, " + "ktxnmgrd: %i, fq: %i, reiser4_sb: %i\n" + "inode: %i, " + "cbk_cache: %i (r:%i,w:%i), " + "epoch: %i, eflush: %i, " + "zlock: %i (r:%i, w:%i)\n" + "spin: %i, long: %i, inode_sem: (r:%i,w:%i)\n" + "d: %i, x: %i, t: %i\n", prefix, + info->spin_locked_jnode, + info->rw_locked_tree, info->read_locked_tree, + info->write_locked_tree, + + info->rw_locked_dk, info->read_locked_dk, info->write_locked_dk, + + info->spin_locked_jload, + info->spin_locked_txnh, + info->spin_locked_atom, info->spin_locked_stack, + info->spin_locked_txnmgr, info->spin_locked_ktxnmgrd, + info->spin_locked_fq, info->spin_locked_super, + info->spin_locked_inode_object, + + info->rw_locked_cbk_cache, + info->read_locked_cbk_cache, + info->write_locked_cbk_cache, + + info->spin_locked_epoch, + info->spin_locked_super_eflush, + + info->rw_locked_zlock, + info->read_locked_zlock, + info->write_locked_zlock, + + info->spin_locked, + info->long_term_locked_znode, + info->inode_sem_r, info->inode_sem_w, + info->d_refs, info->x_refs, info->t_refs); +} + +/* + * return true, iff no locks are held. 
+ */ +int +no_counters_are_held(void) +{ + lock_counters_info *counters; + + counters = lock_counters(); + return + (counters->rw_locked_zlock == 0) && + (counters->read_locked_zlock == 0) && + (counters->write_locked_zlock == 0) && + (counters->spin_locked_jnode == 0) && + (counters->rw_locked_tree == 0) && + (counters->read_locked_tree == 0) && + (counters->write_locked_tree == 0) && + (counters->rw_locked_dk == 0) && + (counters->read_locked_dk == 0) && + (counters->write_locked_dk == 0) && + (counters->spin_locked_txnh == 0) && + (counters->spin_locked_atom == 0) && + (counters->spin_locked_stack == 0) && + (counters->spin_locked_txnmgr == 0) && + (counters->spin_locked_inode_object == 0) && + (counters->spin_locked == 0) && + (counters->long_term_locked_znode == 0) && + (counters->inode_sem_r == 0) && + (counters->inode_sem_w == 0); +} + +/* + * return true, iff transaction commit can be done under locks held by the + * current thread. + */ +int +commit_check_locks(void) +{ + lock_counters_info *counters; + int inode_sem_r; + int inode_sem_w; + int result; + + /* + * inode's read/write semaphore is the only reiser4 lock that can be + * held during commit. + */ + + counters = lock_counters(); + inode_sem_r = counters->inode_sem_r; + inode_sem_w = counters->inode_sem_w; + + counters->inode_sem_r = counters->inode_sem_w = 0; + result = no_counters_are_held(); + counters->inode_sem_r = inode_sem_r; + counters->inode_sem_w = inode_sem_w; + return result; +} + +/* + * check that some bits specified by @flag are set in ->debug_flags of the + * super block. + */ +static int +reiser4_is_debugged(struct super_block *super, __u32 flag) +{ + return get_super_private(super)->debug_flags & flag; +} + +/* REISER4_DEBUG */ +#endif + +/* allocate memory. This calls kmalloc(), performs some additional checks, and + keeps track of how many allocations were made on behalf of current super + block. 
*/ +reiser4_internal void * +reiser4_kmalloc(size_t size /* number of bytes to allocate */ , + int gfp_flag /* allocation flag */ ) +{ + void *result; + + assert("nikita-3009", ergo(gfp_flag & __GFP_WAIT, schedulable())); + + result = kmalloc(size, gfp_flag); +#if REISER4_DEBUG + if (result != NULL) { + reiser4_super_info_data *sbinfo; + + sbinfo = get_current_super_private(); + assert("nikita-1407", sbinfo != NULL); + reiser4_spin_lock_sb(sbinfo); + sbinfo->kmallocs ++; + reiser4_spin_unlock_sb(sbinfo); + } +#endif + return result; +} + +/* release memory allocated by reiser4_kmalloc() and update counter. */ +reiser4_internal void +reiser4_kfree(void *area /* memory to free */) +{ + assert("nikita-1410", area != NULL); + reiser4_kfree_in_sb(area, reiser4_get_current_sb()); +} + +/* release memory allocated by reiser4_kmalloc() for the specified + * super-block. This is useful when memory is released outside of reiser4 + * context */ +reiser4_internal void +reiser4_kfree_in_sb(void *area /* memory to free */, struct super_block *sb) +{ + assert("nikita-2729", area != NULL); +#if REISER4_DEBUG + { + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(sb); + reiser4_spin_lock_sb(sbinfo); + assert("nikita-2730", sbinfo->kmallocs > 0); + sbinfo->kmallocs --; + reiser4_spin_unlock_sb(sbinfo); + } +#endif + kfree(area); +} + +#if REISER4_DEBUG + +/* + * fill "error site" in the current reiser4 context. See comment before RETERR + * macro for more details. 
+ */ +void +return_err(int code, const char *file, int line) +{ + if (code < 0 && is_in_reiser4_context()) { + reiser4_context *ctx = get_current_context(); + + if (ctx != NULL) { + ctx->err.code = code; + ctx->err.file = file; + ctx->err.line = line; +#ifdef CONFIG_FRAME_POINTER + ctx->err.bt[0] =__builtin_return_address(0); + ctx->err.bt[1] =__builtin_return_address(1); + ctx->err.bt[2] =__builtin_return_address(2); + ctx->err.bt[3] =__builtin_return_address(3); + ctx->err.bt[4] =__builtin_return_address(4); +#endif + } + } +} + +/* + * report error information recorded by return_err(). + */ +static void +report_err(void) +{ + reiser4_context *ctx = get_current_context_check(); + + if (ctx != NULL) { + if (ctx->err.code != 0) { + printk("code: %i at %s:%i\n", + ctx->err.code, ctx->err.file, ctx->err.line); + } + } +} + +#endif /* REISER4_DEBUG */ + +#if KERNEL_DEBUGGER +/* + * this function just drops into kernel debugger. It is a convenient place to + * put a breakpoint. + */ +void debugtrap(void) +{ + /* do nothing. Put break point here. */ +#if defined(CONFIG_KGDB) && !defined(CONFIG_REISER4_FS_MODULE) + extern void breakpoint(void); + breakpoint(); +#endif +} +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/debug.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/debug.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,353 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Declarations of debug macros. */ + +#if !defined( __FS_REISER4_DEBUG_H__ ) +#define __FS_REISER4_DEBUG_H__ + +#include "forward.h" +#include "reiser4.h" + + +/* generic function to produce formatted output, decorating it with + whatever standard prefixes/postfixes we want. "Fun" is a function + that will be actually called, can be printk, panic etc. 
+ This is for use by other debugging macros, not by users. */ +#define DCALL(lev, fun, reperr, label, format, ...) \ +({ \ + reiser4_print_prefix(lev, reperr, label, \ + __FUNCTION__, __FILE__, __LINE__); \ + fun(lev format "\n" , ## __VA_ARGS__); \ +}) + +/* + * cause kernel to crash + */ +#define reiser4_panic(mid, format, ...) \ + DCALL("", reiser4_do_panic, 1, mid, format , ## __VA_ARGS__) + +/* print message with indication of current process, file, line and + function */ +#define reiser4_log(label, format, ...) \ + DCALL(KERN_DEBUG, printk, 0, label, format , ## __VA_ARGS__) + +/* Assertion checked during compilation. + If "cond" is false (0) we get duplicate case label in switch. + Use this to check something like famous + cassert (sizeof(struct reiserfs_journal_commit) == 4096) ; + in 3.x journal.c. If cassertion fails you get compiler error, + so no "maintainer-id". +*/ +#define cassert(cond) ({ switch(-1) { case (cond): case 0: break; } }) + +#define noop do {;} while(0) + +#if REISER4_DEBUG +/* version of info that only actually prints anything when _d_ebugging + is on */ +#define dinfo(format, ...) printk(format , ## __VA_ARGS__) +/* macro to catch logical errors. Put it into `default' clause of + switch() statement. */ +#define impossible(label, format, ...) \ + reiser4_panic(label, "impossible: " format , ## __VA_ARGS__) +/* assert assures that @cond is true. If it is not, reiser4_panic() is + called. Use this for checking logical consistency and _never_ call + this to check correctness of external data: disk blocks and user-input . */ +#define assert(label, cond) \ +({ \ + /* call_on_each_assert(); */ \ + if (cond) { \ + /* put negated check to avoid using !(cond) that would lose \ + * warnings for things like assert(a = b); */ \ + ; \ + } else { \ + DEBUGON(1); \ + reiser4_panic(label, "assertion failed: %s", #cond); \ + } \ +}) + +/* like assertion, but @expr is evaluated even if REISER4_DEBUG is off. 
*/ +#define check_me( label, expr ) assert( label, ( expr ) ) + +#define ON_DEBUG( exp ) exp + +extern int schedulable(void); +extern void call_on_each_assert(void); + +#else + +#define dinfo( format, args... ) noop +#define impossible( label, format, args... ) noop +#define assert( label, cond ) noop +#define check_me( label, expr ) ( ( void ) ( expr ) ) +#define ON_DEBUG( exp ) +#define schedulable() might_sleep() + +/* REISER4_DEBUG */ +#endif + +#if REISER4_DEBUG +/* per-thread information about lock acquired by this thread. Used by lock + * ordering checking in spin_macros.h */ +typedef struct lock_counters_info { + int rw_locked_tree; + int read_locked_tree; + int write_locked_tree; + + int rw_locked_dk; + int read_locked_dk; + int write_locked_dk; + + int rw_locked_cbk_cache; + int read_locked_cbk_cache; + int write_locked_cbk_cache; + + int rw_locked_zlock; + int read_locked_zlock; + int write_locked_zlock; + + int spin_locked_jnode; + int spin_locked_jload; + int spin_locked_txnh; + int spin_locked_atom; + int spin_locked_stack; + int spin_locked_txnmgr; + int spin_locked_ktxnmgrd; + int spin_locked_fq; + int spin_locked_super; + int spin_locked_inode_object; + int spin_locked_epoch; + int spin_locked_super_eflush; + int spin_locked; + int long_term_locked_znode; + + int inode_sem_r; + int inode_sem_w; + + int d_refs; + int x_refs; + int t_refs; +} lock_counters_info; + +extern lock_counters_info *lock_counters(void); +#define IN_CONTEXT(a, b) (is_in_reiser4_context() ? (a) : (b)) + +/* increment lock-counter @counter, if present */ +#define LOCK_CNT_INC(counter) IN_CONTEXT(++(lock_counters()->counter), 0) + +/* decrement lock-counter @counter, if present */ +#define LOCK_CNT_DEC(counter) IN_CONTEXT(--(lock_counters()->counter), 0) + +/* check that lock-counter is zero. This is for use in assertions */ +#define LOCK_CNT_NIL(counter) IN_CONTEXT(lock_counters()->counter == 0, 1) + +/* check that lock-counter is greater than zero. 
This is for use in + * assertions */ +#define LOCK_CNT_GTZ(counter) IN_CONTEXT(lock_counters()->counter > 0, 1) + +#else /* REISER4_DEBUG */ + +/* no-op versions on the above */ + +typedef struct lock_counters_info { +} lock_counters_info; + +#define lock_counters() ((lock_counters_info *)NULL) +#define LOCK_CNT_INC(counter) noop +#define LOCK_CNT_DEC(counter) noop +#define LOCK_CNT_NIL(counter) (1) +#define LOCK_CNT_GTZ(counter) (1) + +#endif /* REISER4_DEBUG */ + + +/* flags controlling debugging behavior. Are set through debug_flags=N mount + option. */ +typedef enum { + /* print a lot of information during panic. When this is on all jnodes + * are listed. This can be *very* large output. Usually you don't want + * this. Especially over serial line. */ + REISER4_VERBOSE_PANIC = 0x00000001, + /* print a lot of information during umount */ + REISER4_VERBOSE_UMOUNT = 0x00000002, + /* print gathered statistics on umount */ + REISER4_STATS_ON_UMOUNT = 0x00000004, + /* check node consistency */ + REISER4_CHECK_NODE = 0x00000008 +} reiser4_debug_flags; + +extern int is_in_reiser4_context(void); + +/* + * evaluate expression @e only if with reiser4 context + */ +#define ON_CONTEXT(e) do { \ + if(is_in_reiser4_context()) { \ + e; \ + } } while(0) + +/* + * evaluate expression @e only when within reiser4_context and debugging is + * on. + */ +#define ON_DEBUG_CONTEXT( e ) ON_DEBUG( ON_CONTEXT( e ) ) + +/* + * complain about unexpected function result and crash. Used in "default" + * branches of switch statements and alike to assert that invalid results are + * not silently ignored. + */ +#define wrong_return_value( label, function ) \ + impossible( label, "wrong return value from " function ) + +/* Issue warning message to the console */ +#define warning( label, format, ... ) \ + DCALL( KERN_WARNING, \ + printk, 1, label, "WARNING: " format , ## __VA_ARGS__ ) + +/* mark not yet implemented functionality */ +#define not_yet( label, format, ... 
) \ + reiser4_panic( label, "NOT YET IMPLEMENTED: " format , ## __VA_ARGS__ ) + +extern void reiser4_do_panic(const char *format, ...) +__attribute__ ((noreturn, format(printf, 1, 2))); + +extern void reiser4_print_prefix(const char *level, int reperr, const char *mid, + const char *function, + const char *file, int lineno); + +extern int preempt_point(void); +extern void reiser4_print_stats(void); + +extern void *reiser4_kmalloc(size_t size, int gfp_flag); +extern void reiser4_kfree(void *area); +extern void reiser4_kfree_in_sb(void *area, struct super_block *sb); + +#if REISER4_DEBUG +extern void print_lock_counters(const char *prefix, + const lock_counters_info * info); +extern int no_counters_are_held(void); +extern int commit_check_locks(void); +#else +#define no_counters_are_held() (1) +#define commit_check_locks() (1) +#endif + + +/* true if @i is power-of-two. Useful for rate-limited warnings, etc. */ +#define IS_POW(i) \ +({ \ + typeof(i) __i; \ + \ + __i = (i); \ + !(__i & (__i - 1)); \ +}) + +#define KERNEL_DEBUGGER (1) + +#if KERNEL_DEBUGGER +/* + * Check condition @cond and drop into kernel debugger (kgdb) if it's true. If + * kgdb is not compiled in, do nothing. + */ +#define DEBUGON(cond) \ +({ \ + extern void debugtrap(void); \ + \ + if (unlikely(cond)) \ + debugtrap(); \ +}) +#else +#define DEBUGON(cond) noop +#endif + +/* + * Error code tracing facility. (Idea is borrowed from XFS code.) + * + * Suppose some strange and/or unexpected code is returned from some function + * (for example, write(2) returns -EEXIST). It is possible to place a + * breakpoint in the reiser4_write(), but it is too late here. How to find out + * in what particular place -EEXIST was generated first? 
+ * + * In reiser4 all places where actual error codes are produced (that is, + * statements of the form + * + * return -EFOO; // (1), or + * + * result = -EFOO; // (2) + * + * are replaced with + * + * return RETERR(-EFOO); // (1a), and + * + * result = RETERR(-EFOO); // (2a) respectively + * + * RETERR() macro fills a backtrace in reiser4_context. This back-trace is + * printed in error and warning messages. Moreover, it's possible to put a + * conditional breakpoint in return_err (low-level function called by RETERR() + * to do the actual work) to break into debugger immediately when particular + * error happens. + * + */ + +#if REISER4_DEBUG + +/* + * data-type to store information about where error happened ("error site"). + */ +typedef struct err_site { + int code; /* error code */ + const char *file; /* source file, filled by __FILE__ */ + int line; /* source file line, filled by __LINE__ */ +#ifdef CONFIG_FRAME_POINTER + void *bt[5]; +#endif +} err_site; + +extern void return_err(int code, const char *file, int line); + +/* + * fill &get_current_context()->err_site with error information. + */ +#define RETERR(code) \ +({ \ + typeof(code) __code; \ + \ + __code = (code); \ + return_err(__code, __FILE__, __LINE__); \ + __code; \ +}) + +#else + +/* + * no-op versions of the above + */ + +typedef struct err_site {} err_site; +#define RETERR(code) code +#endif + +#if REISER4_LARGE_KEY +/* + * conditionally compile arguments only if REISER4_LARGE_KEY is on. + */ +#define ON_LARGE_KEY(...) __VA_ARGS__ +#else +#define ON_LARGE_KEY(...) +#endif + +#define reiser4_internal + +/* __FS_REISER4_DEBUG_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/dformat.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/dformat.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,164 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Formats of on-disk data and conversion functions. */ + +/* put all item formats in the files describing the particular items, + our model is, everything you need to do to add an item to reiser4, + (excepting the changes to the plugin that uses the item which go + into the file defining that plugin), you put into one file. */ +/* Data on disk are stored in little-endian format. + To declare fields of on-disk structures, use d8, d16, d32 and d64. + d??tocpu() and cputod??() to convert. */ + +#if !defined( __FS_REISER4_DFORMAT_H__ ) +#define __FS_REISER4_DFORMAT_H__ + + +#include +#include +#include + +/* our default disk byteorder is little endian */ + +#if defined( __LITTLE_ENDIAN ) +#define CPU_IN_DISK_ORDER (1) +#else +#define CPU_IN_DISK_ORDER (0) +#endif + +/* code on-disk data-types as structs with a single field + to rely on compiler type-checking. 
Like include/asm-i386/page.h */ +typedef struct d8 { + __u8 datum; +} d8 __attribute__ ((aligned(1))); +typedef struct d16 { + __u16 datum; +} d16 __attribute__ ((aligned(2))); +typedef struct d32 { + __u32 datum; +} d32 __attribute__ ((aligned(4))); +typedef struct d64 { + __u64 datum; +} d64 __attribute__ ((aligned(8))); + +#define PACKED __attribute__((packed)) + +static inline __u8 +d8tocpu(const d8 * ondisk /* on-disk value to convert */ ) +{ + return ondisk->datum; +} + +static inline __u16 +d16tocpu(const d16 * ondisk /* on-disk value to convert */ ) +{ + return __le16_to_cpu(get_unaligned(&ondisk->datum)); +} + +static inline __u32 +d32tocpu(const d32 * ondisk /* on-disk value to convert */ ) +{ + return __le32_to_cpu(get_unaligned(&ondisk->datum)); +} + +static inline __u64 +d64tocpu(const d64 * ondisk /* on-disk value to convert */ ) +{ + return __le64_to_cpu(get_unaligned(&ondisk->datum)); +} + +static inline d8 * +cputod8(unsigned int oncpu /* CPU value to convert */ , + d8 * ondisk /* result */ ) +{ + assert("nikita-1264", oncpu < 0x100); + put_unaligned(oncpu, &ondisk->datum); + return ondisk; +} + +static inline d16 * +cputod16(unsigned int oncpu /* CPU value to convert */ , + d16 * ondisk /* result */ ) +{ + assert("nikita-1265", oncpu < 0x10000); + put_unaligned(__cpu_to_le16(oncpu), &ondisk->datum); + return ondisk; +} + +static inline d32 * +cputod32(__u32 oncpu /* CPU value to convert */ , + d32 * ondisk /* result */ ) +{ + put_unaligned(__cpu_to_le32(oncpu), &ondisk->datum); + return ondisk; +} + +static inline d64 * +cputod64(__u64 oncpu /* CPU value to convert */ , + d64 * ondisk /* result */ ) +{ + put_unaligned(__cpu_to_le64(oncpu), &ondisk->datum); + return ondisk; +} + +/* data-type for block number on disk: these types enable changing the block + size to other sizes, but they are only a start. Suppose we wanted to + support 48bit block numbers. The dblock_nr blk would be changed to "short + blk[3]". 
The block_nr type should remain an integral type greater or equal + to the dblock_nr type in size so that CPU arithmetic operations work. */ +typedef __u64 reiser4_block_nr; + +/* data-type for block number on disk, disk format */ +union reiser4_dblock_nr { + d64 blk; +}; + +static inline reiser4_block_nr +dblock_to_cpu(const reiser4_dblock_nr * dblock) +{ + return d64tocpu(&dblock->blk); +} + +static inline void +cpu_to_dblock(reiser4_block_nr block, reiser4_dblock_nr * dblock) +{ + cputod64(block, &dblock->blk); +} + +/* true if disk addresses are the same */ +static inline int +disk_addr_eq(const reiser4_block_nr * b1 /* first block + * number to + * compare */ , + const reiser4_block_nr * b2 /* second block + * number to + * compare */ ) +{ + assert("nikita-1033", b1 != NULL); + assert("nikita-1266", b2 != NULL); + + return !memcmp(b1, b2, sizeof *b1); +} + +/* structure of master reiser4 super block */ +typedef struct reiser4_master_sb { + char magic[16]; /* "ReIsEr4" */ + d16 disk_plugin_id; /* id of disk layout plugin */ + d16 blocksize; + char uuid[16]; /* unique id */ + char label[16]; /* filesystem label */ + d64 diskmap; /* location of the diskmap. 0 if not present */ +} reiser4_master_sb; + +/* __FS_REISER4_DFORMAT_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/dscale.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/dscale.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,173 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Scalable on-disk integers */ + +/* + * Various on-disk structures contain integer-like structures. Stat-data + * contain [yes, "data" is plural, check the dictionary] file size, link + * count; extent unit contains extent width etc. To accommodate for general + * case enough space is reserved to keep largest possible value. 
64 bits in + * all cases above. But in overwhelming majority of cases numbers actually + * stored in these fields will be comparatively small and reserving 8 bytes is + * a waste of precious disk bandwidth. + * + * Scalable integers are one way to solve this problem. dscale_write() + * function stores __u64 value in the given area consuming from 1 to 9 bytes, + * depending on the magnitude of the value supplied. dscale_read() reads value + * previously stored by dscale_write(). + * + * dscale_write() produces format not completely unlike of UTF: two highest + * bits of the first byte are used to store "tag". One of 4 possible tag + * values is chosen depending on the number being encoded: + * + * 0 ... 0x3f => 0 [table 1] + * 0x40 ... 0x3fff => 1 + * 0x4000 ... 0x3fffffff => 2 + * 0x40000000 ... 0xffffffffffffffff => 3 + * + * (see dscale_range() function) + * + * Values in the range 0x40000000 ... 0xffffffffffffffff require 8 full bytes + * to be stored, so in this case there is no place in the first byte to store + * tag. For such values tag is stored in an extra 9th byte. + * + * As _highest_ bits are used for the test (which is natural) scaled integers + * are stored in BIG-ENDIAN format in contrast with the rest of reiser4 which + * uses LITTLE-ENDIAN. + * + */ + +#include "debug.h" +#include "dscale.h" + +/* return tag of scaled integer stored at @address */ +static int gettag(const unsigned char *address) +{ + /* tag is stored in two highest bits */ + return (*address) >> 6; +} + +/* clear tag from value. Clear tag embedded into @value. */ +static void cleartag(__u64 *value, int tag) +{ + /* + * W-w-what ?! + * + * Actually, this is rather simple: @value passed here was read by + * dscale_read(), converted from BIG-ENDIAN, and padded to __u64 by + * zeroes. Tag is still stored in the highest (arithmetically) + * non-zero bits of @value, but relative position of tag within __u64 + * depends on @tag. 
+ * + * For example if @tag is 0, it's stored in the 2 highest bits of lowest + * byte, and its offset (counting from lowest bit) is 8 - 2 == 6 bits. + * + * If tag is 1, it's stored in two highest bits of 2nd lowest byte, + * and its offset is (2 * 8) - 2 == 14 bits. + * + * See table 1 above for details. + * + * All these cases are captured by the formula: + */ + *value &= ~(3 << (((1 << tag) << 3) - 2)); + /* + * That is, clear two (3 == binary 11) bits at the offset + * + * 8 * (2 ^ tag) - 2, + * + * that is, two highest bits of (2 ^ tag)-th byte of @value. + */ +} + +/* return tag for @value. See table 1 above for details. */ +static int dscale_range(__u64 value) +{ + if (value > 0x3fffffff) + return 3; + if (value > 0x3fff) + return 2; + if (value > 0x3f) + return 1; + return 0; +} + +/* restore value stored at @address by dscale_write() and return number of + * bytes consumed */ +reiser4_internal int dscale_read(unsigned char *address, __u64 *value) +{ + int tag; + + /* read tag */ + tag = gettag(address); + switch (tag) { + case 3: + /* In this case tag is stored in an extra byte, skip this byte + * and decode value stored in the next 8 bytes.*/ + *value = __be64_to_cpu(get_unaligned((__u64 *)(address + 1))); + /* worst case: 8 bytes for value itself plus one byte for + * tag. 
*/ + return 9; + case 0: + *value = get_unaligned(address); + break; + case 1: + *value = __be16_to_cpu(get_unaligned((__u16 *)address)); + break; + case 2: + *value = __be32_to_cpu(get_unaligned((__u32 *)address)); + break; + default: + return RETERR(-EIO); + } + /* clear tag embedded into @value */ + cleartag(value, tag); + /* number of bytes consumed is (2 ^ tag)---see table 1.*/ + return 1 << tag; +} + +/* store @value at @address and return number of bytes consumed */ +reiser4_internal int dscale_write(unsigned char *address, __u64 value) +{ + int tag; + int shift; + unsigned char *valarr; + + tag = dscale_range(value); + value = __cpu_to_be64(value); + valarr = (unsigned char *)&value; + shift = (tag == 3) ? 1 : 0; + memcpy(address + shift, valarr + sizeof value - (1 << tag), 1 << tag); + *address |= (tag << 6); + return shift + (1 << tag); +} + +/* number of bytes required to store @value */ +reiser4_internal int dscale_bytes(__u64 value) +{ + int bytes; + + bytes = 1 << dscale_range(value); + if (bytes == 8) + ++ bytes; + return bytes; +} + +/* returns true if @value and @other require the same number of bytes to be + * stored. Used to detect when data structure (like stat-data) has to be + * expanded or contracted. */ +reiser4_internal int dscale_fit(__u64 value, __u64 other) +{ + return dscale_range(value) == dscale_range(other); +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/dscale.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/dscale.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,27 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Scalable on-disk integers. See dscale.c for details. 
*/ + +#if !defined( __FS_REISER4_DSCALE_H__ ) +#define __FS_REISER4_DSCALE_H__ + +#include "dformat.h" + +extern int dscale_read (unsigned char *address, __u64 *value); +extern int dscale_write(unsigned char *address, __u64 value); +extern int dscale_bytes(__u64 value); +extern int dscale_fit (__u64 value, __u64 other); + +/* __FS_REISER4_DSCALE_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/emergency_flush.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/emergency_flush.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,913 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* This file exists only until VM gets fixed to reserve pages properly, which + * might or might not be very political. */ + +/* Implementation of emergency flush. */ + +/* OVERVIEW: + + Before writing a node to the disk, some complex process (flush.[ch]) is + to be performed. Flush is the main necessary preliminary step before + writing pages back to the disk, but it has some characteristics that make + it completely different from traditional ->writepage(): + + 1 It operates on a large number of nodes, possibly far away from the + starting node, both in tree and disk order. + + 2 it can involve reading of nodes from the disk (during extent + allocation, for example). + + 3 it can allocate memory (during insertion of allocated extents). + + 4 it participates in the locking protocol which reiser4 uses to + implement concurrent tree modifications. + + 5 it is CPU consuming and long + + As a result, flush reorganizes some part of reiser4 tree and produces + large queue of nodes ready to be submitted for io. + + Items (3) and (4) alone make flush unsuitable for being called directly + from reiser4 ->writepage() callback, because of OOM and deadlocks + against threads waiting for memory. 
+ + So, flush is performed from within balance_dirty_pages() path when dirty + pages are generated. If balance_dirty_pages() fails to throttle writers + and page replacement finds dirty page on the inactive list, we resort to + "emergency flush" in our ->vm_writeback(). + + Emergency flush is relatively dumb algorithm, implemented in this file, + that tries to write tree nodes to the disk without taking locks and without + thoroughly optimizing tree layout. We only want to call emergency flush in + desperate situations, because it is going to produce sub-optimal disk + layouts. + + DETAILED DESCRIPTION + + Emergency flush (eflush) is designed to work as low level mechanism with + no or little impact on the rest of (already too complex) code. + + eflush is initiated from ->writepage() method called by VM on memory + pressure. It is supposed that ->writepage() is rare call path, because + balance_dirty_pages() throttles writes and tries to keep memory in + balance. + + eflush main entry point (emergency_flush()) checks whether jnode is + eligible for emergency flushing. Check is performed by flushable() + function; see it for details. After successful check, new block number + ("emergency block") is allocated and io is initiated to write jnode + content to that block. + + After io is finished, jnode will be cleaned and VM will be able to free + page through call to ->releasepage(). + + emergency_flush() also contains special case invoked when it is possible + to avoid allocation of new node. + + Node selected for eflush is marked (by JNODE_EFLUSH bit in ->flags field) + and added to the special hash table of all eflushed nodes. This table + doesn't have linkage within each jnode, as this would waste memory in + assumption that eflush is rare. Instead new small memory object + (eflush_node_t) is allocated that contains pointer to jnode, emergency + block number, and is inserted into hash table. Per super block counter of + eflushed nodes is incremented. 
See section [INODE HANDLING] below for + more on this. + + It should be noted that emergency flush may allocate memory and wait for + io completion (bitmap read). + + Basically eflushed node has following distinctive characteristics: + + (1) JNODE_EFLUSH bit is set + + (2) no page + + (3) there is an element in hash table, for this node + + (4) node content is stored on disk in block whose number is stored + in the hash table element + + UNFLUSH + + Unflush is reverse of eflush, that is process bringing page of eflushed + node back into memory. + + In accordance with the policy that eflush is low level and low impact + mechanism, transparent to the rest of the code, unflushing is performed + deeply within jload_gfp() which is main function used to load and pin + jnode page into memory. + + Specifically, if jload_gfp() determines that it is called on eflushed + node it gets emergency block number to start io against from the hash + table rather than from jnode itself. This is done in + jnode_get_io_block() function. After io completes, hash table element + for this node is removed and JNODE_EFLUSH bit is cleared. + + LOCKING + + The page lock is used to avoid eflush/e-unflush/jnode_get_io_block races. + emergency_flush() and jnode_get_io_block are called under the page lock. + The eflush_del() function (emergency unflush) may be called for a node w/o + page attached. In that case eflush_del() allocates a page and locks it. + + PROBLEMS + + 1. INODE HANDLING + + Usually (i.e., without eflush), jnode has a page attached to it. This + page pins corresponding struct address_space, and, hence, inode in + memory. Once jnode has been eflushed, its page is gone and inode can be + wiped out of memory by the memory pressure (prune_icache()). This leads + to a number of complications: + + (1) jload_gfp() has to attach jnode to the address space's radix + tree. This requires existence of inode. 
+ + (2) normal flush needs jnode's inode to start slum collection from + unformatted jnode. + + (1) is really a problem, because it is too late to load inode (which + would lead to loading of stat data, etc.) within jload_gfp(). + + We, therefore, need some way to protect inode from being recycled while + having accessible eflushed nodes. + + I'll describe old solution here so it can be compared with new one. + + Original solution pinned inode by __iget() when its first node was + eflushed and released (through iput()) when last was unflushed. This + required maintenance of inode->eflushed counter in inode. + + A problem arises if the last name of inode is unlinked when it has eflushed + nodes. In this case, last iput() that leads to the removal of file is + iput() made by unflushing from within jload_gfp(). Obviously, calling + truncate and tree traversals from jload_gfp() is not a good idea. + + New solution is to pin inode in memory by adding I_EFLUSH bit to its + ->i_state field. This protects inode from being evicted by + prune_icache(). + + DISK SPACE ALLOCATION + + This section will describe how emergency block is allocated and how + block counters (allocated, grabbed, etc.) are manipulated. To be done. + + *****HISTORICAL SECTION**************************************************** + + DELAYED PARENT UPDATE + + Important point of emergency flush is that update of parent is sometimes + delayed: we don't update parent immediately if: + + 1 Child was just allocated, but parent is locked. Waiting for parent + lock in emergency flush is impossible (deadlockable). + + 2 Part of extent was allocated, but parent has not enough space to + insert allocated extent unit. Balancing in emergency flush is + impossible, because it will possibly wait on locks. + + When we delay update of parent node, we mark it as such (and possibly + also mark children to simplify delayed update later). Question: when + should the parent really be updated? + + WHERE TO WRITE PAGE INTO? 
+ + + So, it was decided that flush has to be performed from a separate + thread. Reiser4 has a thread used to periodically commit old transactions, + and this thread can be used for the flushing. That is, flushing thread + does flush and accumulates nodes prepared for the IO on the special + queue. reiser4_vm_writeback() submits nodes from this queue, if queue is + empty, it only wakes up flushing thread and immediately returns. + + Still there are some problems with integrating this stuff into VM + scanning: + + 1 As ->vm_writeback() returns immediately without actually submitting + pages for IO, throttling on PG_writeback in shrink_list() will not + work. This opens a possibility (on a fast CPU), of try_to_free_pages() + completing scanning and calling out_of_memory() before flushing thread + managed to add anything to the queue. + + 2 It is possible, however unlikely, that flushing thread will be + unable to flush anything, because there is not enough memory. In this + case reiser4 resorts to the "emergency flush": some dumb algorithm, + implemented in this file, that tries to write tree nodes to the disk + without taking locks and without thoroughly optimizing tree layout. We + only want to call emergency flush in desperate situations, because it + is going to produce sub-optimal disk layouts. + + 3 Nodes prepared for IO can be from the active list, this means that + they will not be met/freed by shrink_list() after IO completion. New + blk_congestion_wait() should help with throttling but not + freeing. This is not fatal though, because inactive list refilling + will ultimately get to these pages and reclaim them. + + REQUIREMENTS + + To make this work we need at least some hook inside VM scanning which + gets triggered after scanning (or scanning with particular priority) + failed to free pages. This is already present in the + mm/vmscan.c:set_shrinker() interface. 
+ + Another useful thing that we would like to have is passing scanning + priority down to the ->vm_writeback() that will allow file system to + switch to the emergency flush more gracefully. + + POSSIBLE ALGORITHMS + + 1 Start emergency flush from ->vm_writeback after reaching some priority. + This allows to implement simple page based algorithm: look at the page VM + supplied us with and decide what to do. + + 2 Start emergency flush from shrinker after reaching some priority. + This delays emergency flush as far as possible. + + *****END OF HISTORICAL SECTION********************************************** + +*/ + +#include "forward.h" +#include "debug.h" +#include "page_cache.h" +#include "tree.h" +#include "jnode.h" +#include "znode.h" +#include "inode.h" +#include "super.h" +#include "block_alloc.h" +#include "emergency_flush.h" + +#include +#include +#include +#include +#include + +#if REISER4_USE_EFLUSH + +static int flushable(const jnode * node, struct page *page, int); +static int needs_allocation(const jnode * node); +static eflush_node_t *ef_alloc(int flags); +static reiser4_ba_flags_t ef_block_flags(const jnode *node); +static int ef_free_block(jnode *node, const reiser4_block_nr *blk, block_stage_t stage, eflush_node_t *ef); +static int ef_prepare(jnode *node, reiser4_block_nr *blk, eflush_node_t **enode, reiser4_blocknr_hint *hint); +static int eflush_add(jnode *node, reiser4_block_nr *blocknr, eflush_node_t *ef); + +/* slab for eflush_node_t's */ +static kmem_cache_t *eflush_slab; + +#define EFLUSH_START_BLOCK ((reiser4_block_nr)0) + +#define INC_STAT(node, counter) \ + reiser4_stat_inc_at_level(jnode_get_level(node), counter); + +/* this function exists only until VM gets fixed to reserve pages properly, + * which might or might not be very political. */ +/* try to flush @page to the disk + * + * Return 0 if page was successfully paged out. 1 if it is busy, error + * otherwise. 
+ */ +reiser4_internal int +emergency_flush(struct page *page) +{ + struct super_block *sb; + jnode *node; + int result; + assert("nikita-2721", page != NULL); + assert("nikita-2724", PageLocked(page)); + + // warning("nikita-3112", "Emergency flush. Notify Reiser@Namesys.COM"); + + /* + * Page is locked, hence page<->jnode mapping cannot change. + */ + + sb = page->mapping->host->i_sb; + node = jprivate(page); + + assert("vs-1452", node != NULL); + + jref(node); + + result = 0; + LOCK_JNODE(node); + /* + * page was dirty and under eflush. This is (only?) possible if page + * was re-dirtied through mmap(2) after eflush IO was submitted, but + * before ->releasepage() freed page. + */ + eflush_del(node, 1); + + LOCK_JLOAD(node); + if (flushable(node, page, 1)) { + if (needs_allocation(node)) { + reiser4_block_nr blk; + eflush_node_t *efnode; + reiser4_blocknr_hint hint; + + blk = 0ull; + efnode = NULL; + + /* Set JNODE_EFLUSH bit _before_ allocating a block, + * that prevents flush reserved block from using here + * and by a reiser4 flush process */ + JF_SET(node, JNODE_EFLUSH); + + blocknr_hint_init(&hint); + + result = ef_prepare(node, &blk, &efnode, &hint); + if (flushable(node, page, 0) && result == 0) { + assert("nikita-2759", efnode != NULL); + eflush_add(node, &blk, efnode); + + result = page_io(page, node, WRITE, + GFP_NOFS | __GFP_HIGH); + } else { + JF_CLR(node, JNODE_EFLUSH); + UNLOCK_JLOAD(node); + UNLOCK_JNODE(node); + if (blk != 0ull) { + ef_free_block(node, &blk, + hint.block_stage, efnode); + kmem_cache_free(eflush_slab, efnode); + } + result = 1; + } + + blocknr_hint_done(&hint); + } else { + /* eflush without allocation temporary location for a node */ + txn_atom *atom; + flush_queue_t *fq; + + /* get flush queue for this node */ + result = fq_by_jnode_gfp(node, &fq, GFP_ATOMIC); + + if (result) + return result; + + atom = node->atom; + + if (!flushable(node, page, 1) || needs_allocation(node) || !jnode_is_dirty(node)) { + UNLOCK_JLOAD(node); + 
UNLOCK_JNODE(node); + UNLOCK_ATOM(atom); + fq_put(fq); + return 1; + } + + /* ok, now we can flush it */ + unlock_page(page); + + queue_jnode(fq, node); + + UNLOCK_JLOAD(node); + UNLOCK_JNODE(node); + UNLOCK_ATOM(atom); + + result = write_fq(fq, NULL, 0); + if (result != 0) + lock_page(page); + + /* Even if we wrote nothing, We unlocked the page, so let know to the caller that page should + not be unlocked again */ + fq_put(fq); + } + + } else { + UNLOCK_JLOAD(node); + UNLOCK_JNODE(node); + result = 1; + } + + jput(node); + return result; +} + +static int +flushable(const jnode * node, struct page *page, int check_eflush) +{ + assert("nikita-2725", node != NULL); + assert("nikita-2726", spin_jnode_is_locked(node)); + assert("nikita-3388", spin_jload_is_locked(node)); + + if (jnode_is_loaded(node)) { /* loaded */ + return 0; + } + if (JF_ISSET(node, JNODE_FLUSH_QUEUED)) { /* already pending io */ + return 0; + } + if (JF_ISSET(node, JNODE_EPROTECTED)) { /* protected from e-flush */ + return 0; + } + if (JF_ISSET(node, JNODE_HEARD_BANSHEE)) { + return 0; + } + if (page == NULL) { /* nothing to flush */ + return 0; + } + if (PageWriteback(page)) { /* already under io */ + return 0; + } + /* don't flush bitmaps or journal records */ + if (!jnode_is_znode(node) && !jnode_is_unformatted(node)) { + return 0; + } + /* don't flush cluster pages */ + if (jnode_is_cluster_page(node)) { + return 0; + } + if (check_eflush && JF_ISSET(node, JNODE_EFLUSH)) { /* already flushed */ + return 0; + } + return 1; +} + +#undef INC_STAT + +/* does node need allocation for eflushing? 
*/ +static int +needs_allocation(const jnode * node) +{ + return !(JF_ISSET(node, JNODE_RELOC) && !blocknr_is_fake(jnode_get_block(node))); +} + + +static inline int +jnode_eq(jnode * const * j1, jnode * const * j2) +{ + assert("nikita-2733", j1 != NULL); + assert("nikita-2734", j2 != NULL); + + return *j1 == *j2; +} + +static ef_hash_table * +get_jnode_enhash(const jnode *node) +{ + struct super_block *super; + + assert("nikita-2739", node != NULL); + + super = jnode_get_tree(node)->super; + return &get_super_private(super)->efhash_table; +} + +static inline __u32 +jnode_hfn(ef_hash_table *table, jnode * const * j) +{ + __u32 val; + + assert("nikita-2735", j != NULL); + assert("nikita-3346", IS_POW(table->_buckets)); + + val = (unsigned long)*j; + val /= sizeof(**j); + return val & (table->_buckets - 1); +} + + +/* The hash table definition */ +#define KMALLOC(size) vmalloc(size) +#define KFREE(ptr, size) vfree(ptr) +TYPE_SAFE_HASH_DEFINE(ef, eflush_node_t, jnode *, node, linkage, jnode_hfn, jnode_eq); +#undef KFREE +#undef KMALLOC + +reiser4_internal int +eflush_init(void) +{ + eflush_slab = kmem_cache_create("eflush", sizeof (eflush_node_t), + 0, SLAB_HWCACHE_ALIGN, NULL, NULL); + if (eflush_slab == NULL) + return RETERR(-ENOMEM); + else + return 0; +} + +reiser4_internal int +eflush_done(void) +{ + return kmem_cache_destroy(eflush_slab); +} + +reiser4_internal int +eflush_init_at(struct super_block *super) +{ + return ef_hash_init(&get_super_private(super)->efhash_table, + 8192); +} + +reiser4_internal void +eflush_done_at(struct super_block *super) +{ + ef_hash_done(&get_super_private(super)->efhash_table); +} + +static eflush_node_t * +ef_alloc(int flags) +{ + return kmem_cache_alloc(eflush_slab, flags); +} + +#define EFLUSH_MAGIC 4335203 + +static int +eflush_add(jnode *node, reiser4_block_nr *blocknr, eflush_node_t *ef) +{ + reiser4_tree *tree; + + assert("nikita-2737", node != NULL); + assert("nikita-2738", JF_ISSET(node, JNODE_EFLUSH)); + 
assert("nikita-3382", !JF_ISSET(node, JNODE_EPROTECTED)); + assert("nikita-2765", spin_jnode_is_locked(node)); + assert("nikita-3381", spin_jload_is_locked(node)); + + tree = jnode_get_tree(node); + + ef->node = node; + ef->blocknr = *blocknr; + ef->hadatom = (node->atom != NULL); + ef->incatom = 0; + jref(node); + spin_lock_eflush(tree->super); + ef_hash_insert(get_jnode_enhash(node), ef); + ON_DEBUG(++ get_super_private(tree->super)->eflushed); + spin_unlock_eflush(tree->super); + + if (jnode_is_unformatted(node)) { + struct inode *inode; + reiser4_inode *info; + + WLOCK_TREE(tree); + + inode = mapping_jnode(node)->host; + info = reiser4_inode_data(inode); + + if (!ef->hadatom) { + radix_tree_tag_set(jnode_tree_by_reiser4_inode(info), + index_jnode(node), EFLUSH_TAG_ANONYMOUS); + ON_DEBUG(info->anonymous_eflushed ++); + } else { + ON_DEBUG(info->captured_eflushed ++); + } + WUNLOCK_TREE(tree); + /*XXXX*/ + inc_unfm_ef(); + } + + /* FIXME: do we need it here, if eflush add/del are protected by page lock? */ + UNLOCK_JLOAD(node); + + /* + * jnode_get_atom() can possible release jnode spin lock. This + * means it can only be called _after_ JNODE_EFLUSH is set, because + * otherwise we would have to re-check flushable() once more. No + * thanks. + */ + + if (ef->hadatom) { + txn_atom *atom; + + atom = jnode_get_atom(node); + if (atom != NULL) { + ++ atom->flushed; + ef->incatom = 1; + UNLOCK_ATOM(atom); + } + } + + UNLOCK_JNODE(node); + return 0; +} + +/* Arrghh... cast to keep hash table code happy. 
*/ +#define C(node) ((jnode *const *)&(node)) + +reiser4_internal reiser4_block_nr * +eflush_get(const jnode *node) +{ + eflush_node_t *ef; + reiser4_tree *tree; + + assert("nikita-2740", node != NULL); + assert("nikita-2741", JF_ISSET(node, JNODE_EFLUSH)); + assert("nikita-2767", spin_jnode_is_locked(node)); + + + tree = jnode_get_tree(node); + spin_lock_eflush(tree->super); + ef = ef_hash_find(get_jnode_enhash(node), C(node)); + spin_unlock_eflush(tree->super); + + assert("nikita-2742", ef != NULL); + return &ef->blocknr; +} + +/* free resources taken for emergency flushing of the node */ +reiser4_internal void +eflush_free (jnode * node) +{ + eflush_node_t *ef; + ef_hash_table *table; + reiser4_tree *tree; + txn_atom *atom; + struct inode *inode = NULL; + reiser4_block_nr blk; + + assert ("zam-1026", spin_jnode_is_locked(node)); + + table = get_jnode_enhash(node); + tree = jnode_get_tree(node); + + spin_lock_eflush(tree->super); + ef = ef_hash_find(table, C(node)); + BUG_ON(ef == NULL); + assert("nikita-2745", ef != NULL); + blk = ef->blocknr; + ef_hash_remove(table, ef); + ON_DEBUG(-- get_super_private(tree->super)->eflushed); + spin_unlock_eflush(tree->super); + + if (ef->incatom) { + atom = jnode_get_atom(node); + assert("nikita-3311", atom != NULL); + -- atom->flushed; + UNLOCK_ATOM(atom); + } + + assert("vs-1215", JF_ISSET(node, JNODE_EFLUSH)); + + if (jnode_is_unformatted(node)) { + reiser4_inode *info; + + WLOCK_TREE(tree); + + inode = mapping_jnode(node)->host; + info = reiser4_inode_data(inode); + + /* clear e-flush specific tags from node's radix tree slot */ + if (!ef->hadatom) { + radix_tree_tag_clear( + jnode_tree_by_reiser4_inode(info), index_jnode(node), + EFLUSH_TAG_ANONYMOUS); + ON_DEBUG(info->anonymous_eflushed --); + } else + ON_DEBUG(info->captured_eflushed --); + + assert("nikita-3355", jnode_tree_by_reiser4_inode(info)->rnode != NULL); + + WUNLOCK_TREE(tree); + + /*XXXX*/ + dec_unfm_ef(); + + } + UNLOCK_JNODE(node); + +#if REISER4_DEBUG + 
if (blocknr_is_fake(jnode_get_block(node))) + assert ("zam-817", ef->initial_stage == BLOCK_UNALLOCATED); + else + assert ("zam-818", ef->initial_stage == BLOCK_GRABBED); +#endif + + jput(node); + + ef_free_block(node, &blk, + blocknr_is_fake(jnode_get_block(node)) ? + BLOCK_UNALLOCATED : BLOCK_GRABBED, ef); + + kmem_cache_free(eflush_slab, ef); + + LOCK_JNODE(node); +} + +reiser4_internal void +eflush_del (jnode * node, int page_locked) +{ + struct page * page; + + assert("nikita-2743", node != NULL); + assert("nikita-2770", spin_jnode_is_locked(node)); + + if (!JF_ISSET(node, JNODE_EFLUSH)) + return; + + if (page_locked) { + page = jnode_page(node); + assert("nikita-2806", page != NULL); + assert("nikita-2807", PageLocked(page)); + } else { + UNLOCK_JNODE(node); + page = jnode_get_page_locked(node, GFP_NOFS); + LOCK_JNODE(node); + if (page == NULL) { + warning ("zam-1025", "eflush_del failed to get page back\n"); + return; + } + if (unlikely(!JF_ISSET(node, JNODE_EFLUSH))) + /* race: some other thread unflushed jnode. */ + goto out; + } + + if (PageWriteback(page)) { + UNLOCK_JNODE(node); + page_cache_get(page); + reiser4_wait_page_writeback(page); + page_cache_release(page); + LOCK_JNODE(node); + if (unlikely(!JF_ISSET(node, JNODE_EFLUSH))) + /* race: some other thread unflushed jnode. */ + goto out; + } + + /* we have to make page dirty again. Note that we do not have to do here + anything specific to reiser4 but usual dirty page accounting. If */ + if (!TestSetPageDirty(page)) { + BUG_ON(jnode_get_mapping(node) != page->mapping); + if (mapping_cap_account_dirty(page->mapping)) + inc_page_state(nr_dirty); + } + +#if 0 + if (JF_ISSET(node, JNODE_KEEPME)) + /* jnode is already tagged in reiser4_inode's tree of jnodes */ + reiser4_set_page_dirty2(page); + else + /* jnode had atom when */ + /* + * either jnode was dirty or page was dirtied through mmap. Page's dirty + * bit was cleared before io was submitted. 
If page is left clean, we + * would have dirty jnode with clean page. Neither ->writepage() nor + * ->releasepage() can free it. Re-dirty page, so ->writepage() will be + * called again if necessary. + */ + set_page_dirty_internal(page, 0); +#endif + + assert("nikita-2766", atomic_read(&node->x_count) > 1); + /* release allocated disk block and in-memory structures */ + eflush_free(node); + assert("vs-1736", PageLocked(page)); + JF_CLR(node, JNODE_EFLUSH); + ON_DEBUG(JF_SET(node, JNODE_UNEFLUSHED)); + out: + if (!page_locked) + unlock_page(page); +} + +reiser4_internal int +emergency_unflush(jnode *node) +{ + int result; + + assert("nikita-2778", node != NULL); + assert("nikita-3046", schedulable()); + + if (JF_ISSET(node, JNODE_EFLUSH)) { + result = jload(node); + if (result == 0) { + struct page *page; + + assert("nikita-2777", !JF_ISSET(node, JNODE_EFLUSH)); + page = jnode_page(node); + assert("nikita-2779", page != NULL); + wait_on_page_writeback(page); + + jrelse(node); + } + } else + result = 0; + return result; +} + +static reiser4_ba_flags_t +ef_block_flags(const jnode *node) +{ + return jnode_is_znode(node) ? BA_FORMATTED : 0; +} + +static int ef_free_block(jnode *node, + const reiser4_block_nr *blk, + block_stage_t stage, eflush_node_t *ef) +{ + int result = 0; + + /* We cannot just ask block allocator to return block into flush + * reserved space, because there is no current atom at this point. */ + result = reiser4_dealloc_block(blk, stage, ef_block_flags(node)); + if (result == 0 && stage == BLOCK_GRABBED) { + txn_atom *atom; + + if (ef->reserve) { + /* further, transfer block from grabbed into flush + * reserved space. 
*/ + LOCK_JNODE(node); + atom = jnode_get_atom(node); + assert("nikita-2785", atom != NULL); + grabbed2flush_reserved_nolock(atom, 1); + UNLOCK_ATOM(atom); + JF_SET(node, JNODE_FLUSH_RESERVED); + UNLOCK_JNODE(node); + } else { + reiser4_context * ctx = get_current_context(); + grabbed2free(ctx, get_super_private(ctx->super), + (__u64)1); + } + } + return result; +} + +static int +ef_prepare(jnode *node, reiser4_block_nr *blk, eflush_node_t **efnode, reiser4_blocknr_hint * hint) +{ + int result; + int usedreserve; + + assert("nikita-2760", node != NULL); + assert("nikita-2761", blk != NULL); + assert("nikita-2762", efnode != NULL); + assert("nikita-2763", spin_jnode_is_locked(node)); + assert("nikita-3387", spin_jload_is_locked(node)); + + hint->blk = EFLUSH_START_BLOCK; + hint->max_dist = 0; + hint->level = jnode_get_level(node); + usedreserve = 0; + if (blocknr_is_fake(jnode_get_block(node))) + hint->block_stage = BLOCK_UNALLOCATED; + else { + txn_atom *atom; + switch (jnode_is_leaf(node)) { + default: + /* We cannot just ask block allocator to take block from + * flush reserved space, because there is no current + * atom at this point. */ + atom = jnode_get_atom(node); + if (atom != NULL) { + if (JF_ISSET(node, JNODE_FLUSH_RESERVED)) { + usedreserve = 1; + flush_reserved2grabbed(atom, 1); + JF_CLR(node, JNODE_FLUSH_RESERVED); + UNLOCK_ATOM(atom); + break; + } else + UNLOCK_ATOM(atom); + } + /* fall through */ + /* node->atom == NULL if page was dirtied through + * mmap */ + case 0: + result = reiser4_grab_space_force((__u64)1, BA_RESERVED); + grab_space_enable(); + if (result) { + warning("nikita-3323", + "Cannot allocate eflush block"); + return result; + } + } + + hint->block_stage = BLOCK_GRABBED; + } + + /* XXX protect @node from being concurrently eflushed. 
Otherwise, + * there is a danger of underflowing block space */ + UNLOCK_JLOAD(node); + UNLOCK_JNODE(node); + + *efnode = ef_alloc(GFP_NOFS | __GFP_HIGH); + if (*efnode == NULL) { + result = RETERR(-ENOMEM); + goto out; + } + +#if REISER4_DEBUG + (*efnode)->initial_stage = hint->block_stage; +#endif + (*efnode)->reserve = usedreserve; + + result = reiser4_alloc_block(hint, blk, ef_block_flags(node)); + if (result) + kmem_cache_free(eflush_slab, *efnode); + out: + LOCK_JNODE(node); + LOCK_JLOAD(node); + return result; +} + +#endif /* REISER4_USE_EFLUSH */ + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 80 + LocalWords: " unflush eflushed LocalWords eflush writepage VM releasepage unflushing io " + End: +*/ diff -puN /dev/null fs/reiser4/emergency_flush.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/emergency_flush.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,75 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Emergency flush */ + +#ifndef __EMERGENCY_FLUSH_H__ +#define __EMERGENCY_FLUSH_H__ + +#if REISER4_USE_EFLUSH + +#include "block_alloc.h" + +struct eflush_node; +typedef struct eflush_node eflush_node_t; + +TYPE_SAFE_HASH_DECLARE(ef, eflush_node_t); + +struct eflush_node { + jnode *node; + reiser4_block_nr blocknr; + ef_hash_link linkage; + struct list_head inode_link; /* for per inode list of eflush nodes */ + struct list_head inode_anon_link; + int hadatom :1; + int incatom :1; + int reserve :1; +#if REISER4_DEBUG + block_stage_t initial_stage; +#endif +}; + +int eflush_init(void); +int eflush_done(void); + +extern int eflush_init_at(struct super_block *super); +extern void eflush_done_at(struct super_block *super); + +extern reiser4_block_nr *eflush_get(const jnode *node); +extern void eflush_del(jnode *node, int page_locked); +extern void eflush_free(jnode *); + +extern int emergency_flush(struct page *page); 
+extern int emergency_unflush(jnode *node); + +/* tag to tag eflushed anonymous jnodes in reiser4_inode's radix tree of jnodes */ +#define EFLUSH_TAG_ANONYMOUS PAGECACHE_TAG_DIRTY + +#else /* REISER4_USE_EFLUSH */ + +#define eflush_init() (0) +#define eflush_done() (0) + +#define eflush_init_at(super) (0) +#define eflush_done_at(super) (0) + +#define eflush_get(node) NULL +#define eflush_del(node, flag) do{}while(0) +#define eflush_free(node) do{}while(0) + +#define emergency_unflush(node) (0) +#define emergency_flush(page) (1) + +#endif /* REISER4_USE_EFLUSH */ + +/* __EMERGENCY_FLUSH_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/entd.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/entd.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,375 @@ +/* Copyright 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Ent daemon. 
*/ + +#include "debug.h" +#include "kcond.h" +#include "txnmgr.h" +#include "tree.h" +#include "entd.h" +#include "super.h" +#include "context.h" +#include "reiser4.h" +#include "vfs_ops.h" +#include "page_cache.h" + +#include /* struct task_struct */ +#include +#include +#include +#include /* INITIAL_JIFFIES */ +#include /* bdi_write_congested */ + +TYPE_SAFE_LIST_DEFINE(wbq, struct wbq, link); + +#define DEF_PRIORITY 12 +#define MAX_ENTD_ITERS 10 +#define ENTD_ASYNC_REQUESTS_LIMIT 32 + +static void entd_flush(struct super_block *super); +static int entd(void *arg); + +/* + * set ->comm field of ent thread to make its state visible to the user level + */ +#define entd_set_comm(state) \ + snprintf(current->comm, sizeof(current->comm), \ + "ent:%s%s", super->s_id, (state)) + +/* get ent context for the @super */ +static inline entd_context * +get_entd_context(struct super_block *super) +{ + return &get_super_private(super)->entd; +} + +/* initialize ent thread context */ +reiser4_internal void +init_entd_context(struct super_block *super) +{ + entd_context * ctx; + + assert("nikita-3104", super != NULL); + + ctx = get_entd_context(super); + + memset(ctx, 0, sizeof *ctx); + kcond_init(&ctx->startup); + kcond_init(&ctx->wait); + init_completion(&ctx->finish); + spin_lock_init(&ctx->guard); + + /* start ent thread..
*/ + kernel_thread(entd, super, CLONE_VM | CLONE_FS | CLONE_FILES); + + spin_lock(&ctx->guard); + /* and wait for its initialization to finish */ + while (ctx->tsk == NULL) + kcond_wait(&ctx->startup, &ctx->guard, 0); + spin_unlock(&ctx->guard); +#if REISER4_DEBUG + flushers_list_init(&ctx->flushers_list); +#endif + wbq_list_init(&ctx->wbq_list); +} + +static void wakeup_wbq (entd_context * ent, struct wbq * rq) +{ + wbq_list_remove(rq); + ent->nr_synchronous_requests --; + rq->wbc->nr_to_write --; + up(&rq->sem); +} + +static void wakeup_all_wbq (entd_context * ent) +{ + struct wbq * rq; + + spin_lock(&ent->guard); + while (!wbq_list_empty(&ent->wbq_list)) { + rq = wbq_list_front(&ent->wbq_list); + wakeup_wbq(ent, rq); + } + spin_unlock(&ent->guard); +} + +/* ent thread function */ +static int +entd(void *arg) +{ + struct super_block *super; + struct task_struct *me; + entd_context *ent; + + super = arg; + /* standard kernel thread prologue */ + me = current; + /* reparent_to_init() is done by daemonize() */ + daemonize("ent:%s", super->s_id); + + /* block all signals */ + spin_lock_irq(&me->sighand->siglock); + siginitsetinv(&me->blocked, 0); + recalc_sigpending(); + spin_unlock_irq(&me->sighand->siglock); + + /* do_fork() just copies task_struct into the new + thread. ->fs_context shouldn't be copied of course. This shouldn't + be a problem for the rest of the code though. 
+ */ + me->journal_info = NULL; + + ent = get_entd_context(super); + + spin_lock(&ent->guard); + ent->tsk = me; + /* signal waiters that initialization is completed */ + kcond_broadcast(&ent->startup); + spin_unlock(&ent->guard); + while (1) { + int result = 0; + + if (me->flags & PF_FREEZE) + refrigerator(PF_FREEZE); + + spin_lock(&ent->guard); + + while (ent->nr_all_requests != 0) { + assert("zam-1043", ent->nr_all_requests >= ent->nr_synchronous_requests); + if (ent->nr_synchronous_requests != 0) { + struct wbq * rq = wbq_list_front(&ent->wbq_list); + + if (++ rq->nr_entd_iters > MAX_ENTD_ITERS) { + ent->nr_all_requests --; + wakeup_wbq(ent, rq); + continue; + } + } else { + /* endless loop avoidance. */ + ent->nr_all_requests --; + } + + spin_unlock(&ent->guard); + entd_set_comm("!"); + entd_flush(super); + spin_lock(&ent->guard); + } + + entd_set_comm("."); + + /* wait for work */ + result = kcond_wait(&ent->wait, &ent->guard, 1); + if (result != -EINTR && result != 0) + /* some other error */ + warning("nikita-3099", "Error: %i", result); + + /* we are asked to exit */ + if (ent->done) { + spin_unlock(&ent->guard); + break; + } + + spin_unlock(&ent->guard); + } + wakeup_all_wbq(ent); + complete_and_exit(&ent->finish, 0); + /* not reached. 
*/ + return 0; +} + +/* called by umount */ +reiser4_internal void +done_entd_context(struct super_block *super) +{ + entd_context * ent; + + assert("nikita-3103", super != NULL); + + ent = get_entd_context(super); + + spin_lock(&ent->guard); + ent->done = 1; + kcond_signal(&ent->wait); + spin_unlock(&ent->guard); + + /* wait until daemon finishes */ + wait_for_completion(&ent->finish); +} + +/* called at the beginning of jnode_flush to register flusher thread with ent + * daemon */ +reiser4_internal void enter_flush (struct super_block * super) +{ + entd_context * ent; + + assert ("zam-1029", super != NULL); + ent = get_entd_context(super); + + assert ("zam-1030", ent != NULL); + + spin_lock(&ent->guard); + ent->flushers ++; +#if REISER4_DEBUG + flushers_list_push_front(&ent->flushers_list, get_current_context()); +#endif + spin_unlock(&ent->guard); +} + +/* called at the end of jnode_flush */ +reiser4_internal void leave_flush (struct super_block * super) +{ + entd_context * ent; + + assert ("zam-1027", super != NULL); + ent = get_entd_context(super); + + assert ("zam-1028", ent != NULL); + + spin_lock(&ent->guard); + ent->flushers --; + if (ent->flushers == 0 && ent->nr_synchronous_requests != 0) + kcond_signal(&ent->wait); +#if REISER4_DEBUG + flushers_list_remove_clean(get_current_context()); +#endif + spin_unlock(&ent->guard); +} + +/* signal to ent thread that it has more work to do */ +static void kick_entd(entd_context * ent) +{ + kcond_signal(&ent->wait); +} + +static void entd_capture_anonymous_pages( + struct super_block * super, struct writeback_control * wbc) +{ + spin_lock(&inode_lock); + generic_sync_sb_inodes(super, wbc); + spin_unlock(&inode_lock); +} + +static void entd_flush(struct super_block *super) +{ + long nr_submitted = 0; + int result; + reiser4_context ctx; + struct writeback_control wbc = { + .bdi = NULL, + .sync_mode = WB_SYNC_NONE, + .older_than_this = NULL, + .nr_to_write = 32, + .nonblocking = 0, + }; + + init_context(&ctx, super); 
+ + ctx.entd = 1; + + entd_capture_anonymous_pages(super, &wbc); + result = flush_some_atom(&nr_submitted, &wbc, JNODE_FLUSH_WRITE_BLOCKS); + if (result != 0) + warning("nikita-3100", "Flush failed: %i", result); + + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); +} + +void write_page_by_ent (struct page * page, struct writeback_control * wbc) +{ + struct super_block * sb; + entd_context * ent; + struct wbq rq; + int phantom; + + sb = page->mapping->host->i_sb; + ent = get_entd_context(sb); + + phantom = jprivate(page) == NULL || !jnode_check_dirty(jprivate(page)); + /* re-dirty page */ + set_page_dirty_internal(page, phantom); + /* unlock it to avoid deadlocks with the thread which will do actual i/o */ + unlock_page(page); + + /* entd is not running. */ + if (ent == NULL || ent->done) + return; + + /* init wbq */ + wbq_list_clean(&rq); + rq.nr_entd_iters = 0; + rq.page = page; + rq.wbc = wbc; + + spin_lock(&ent->guard); + if (ent->flushers == 0) + kick_entd(ent); + ent->nr_all_requests ++; + if (ent->nr_all_requests <= ent->nr_synchronous_requests + ENTD_ASYNC_REQUESTS_LIMIT) { + spin_unlock(&ent->guard); + return; + } + sema_init(&rq.sem, 0); + wbq_list_push_back(&ent->wbq_list, &rq); + ent->nr_synchronous_requests ++; + spin_unlock(&ent->guard); + down(&rq.sem); + + /* don't release rq until wakeup_wbq stops using it. */ + spin_lock(&ent->guard); + spin_unlock(&ent->guard); + /* wbq dequeued by the ent thread (by a thread other than the current one).
*/ +} + +/* ent should be locked */ +static struct wbq * get_wbq (entd_context * ent) +{ + if (wbq_list_empty(&ent->wbq_list)) { + spin_unlock(&ent->guard); + return NULL; + } + return wbq_list_front(&ent->wbq_list); +} + + +void ent_writes_page (struct super_block * sb, struct page * page) +{ + entd_context * ent = get_entd_context(sb); + struct wbq * rq; + + assert("zam-1041", ent != NULL); + + if (PageActive(page) || ent->nr_all_requests == 0) + return; + + SetPageReclaim(page); + + spin_lock(&ent->guard); + if (ent->nr_all_requests > 0) { + ent->nr_all_requests --; + rq = get_wbq(ent); + if (rq == NULL) + /* get_wbq() releases entd->guard spinlock if NULL is + * returned. */ + return; + wakeup_wbq(ent, rq); + } + spin_unlock(&ent->guard); +} + +int wbq_available (void) { + struct super_block * sb = reiser4_get_current_sb(); + entd_context * ent = get_entd_context(sb); + return ent->nr_all_requests; +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 80 + End: +*/ diff -puN /dev/null fs/reiser4/entd.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/entd.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,83 @@ +/* Copyright 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Ent daemon. */ + +#ifndef __ENTD_H__ +#define __ENTD_H__ + +#include "kcond.h" +#include "context.h" + +#include +#include +#include +#include /* for struct task_struct */ +#include "type_safe_list.h" + +TYPE_SAFE_LIST_DECLARE(wbq); + +/* write-back request. */ +struct wbq { + wbq_list_link link; + struct writeback_control * wbc; + struct page * page; + struct semaphore sem; + int nr_entd_iters; +}; + +/* ent-thread context. This is used to synchronize starting/stopping ent + * threads. */ +typedef struct entd_context { + /* + * condition variable that is signaled by ent thread after it + * successfully started up. 
+ */ + kcond_t startup; + /* + * completion that is signaled by ent thread just before it + * terminates. + */ + struct completion finish; + /* + * condition variable that ent thread waits on for more work. It's + * signaled by write_page_by_ent(). + */ + kcond_t wait; + /* spinlock protecting other fields */ + spinlock_t guard; + /* ent thread */ + struct task_struct *tsk; + /* set to indicate that ent thread should leave. */ + int done; + /* counter of active flushers */ + int flushers; +#if REISER4_DEBUG + /* list of all active flushers */ + flushers_list_head flushers_list; +#endif + int nr_all_requests; + int nr_synchronous_requests; + wbq_list_head wbq_list; +} entd_context; + +extern void init_entd_context(struct super_block *super); +extern void done_entd_context(struct super_block *super); + +extern void enter_flush(struct super_block *super); +extern void leave_flush(struct super_block *super); + +extern void write_page_by_ent(struct page *, struct writeback_control *); +extern int wbq_available (void); +extern void ent_writes_page (struct super_block *, struct page *); +/* __ENTD_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/eottl.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/eottl.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,373 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "forward.h" +#include "debug.h" +#include "key.h" +#include "coord.h" +#include "plugin/item/item.h" +#include "plugin/node/node.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree_walk.h" +#include "tree_mod.h" +#include "carry.h" +#include "tree.h" +#include "super.h" + +#include /* for __u?? */ + +/* Extents on the twig level (EOTTL) handling. + + EOTTL poses some problems to the tree traversal, that are better + explained by example. 
+ + Suppose we have block B1 on the twig level with the following items: + + 0. internal item I0 with key (0:0:0:0) (locality, key-type, object-id, offset) + 1. extent item E1 with key (1:4:100:0), having 10 blocks of 4k each + 2. internal item I2 with key (10:0:0:0) + + We are trying to insert item with key (5:0:0:0). Lookup finds node + B1, and then intra-node lookup is done. This lookup finishes on the + E1, because the key we are looking for is larger than the key of E1 + and is smaller than the key of I2. + + Here search is stuck. + + After some thought it is clear what is wrong here: extents on the + twig level break some basic property of the *search* tree (on the + pretext, that they restore property of balanced tree). + + Said property is the following: if in the internal node of the search + tree we have [ ... Key1 Pointer Key2 ... ] then, all data that are or + will be keyed in the tree with the Key such that Key1 <= Key < Key2 + are accessible through the Pointer. + + This is not true, when Pointer is Extent-Pointer, simply because + extent cannot expand indefinitely to the right to include any item + with + + Key1 <= Key <= Key2. + + For example, our E1 extent is only responsible for the data with keys + + (1:4:100:0) <= key <= (1:4:100:0xffffffffffffffff), and + + so, key range + + ( (1:4:100:0xffffffffffffffff), (10:0:0:0) ) + + is orphaned: there is no way to get there from the tree root. + + In other words, extent pointers are different than normal child + pointers as far as search tree is concerned, and this creates such + problems. + + Possible solution for this problem is to insert our item into node + pointed to by I2. There are some problems though: + + (1) I2 can be in a different node. + (2) E1 can be immediately followed by another extent E2. + + (1) is solved by calling reiser4_get_right_neighbor() and accounting + for locks/coords as necessary. + + (2) is more complex.
Solution here is to insert new empty leaf node + and insert internal item between E1 and E2 pointing to said leaf + node. This is further complicated by possibility that E2 is in a + different node, etc. + + Problems: + + (1) if there was internal item I2 immediately on the right of an + extent E1 and we decided to insert new item S1 into node N2 + pointed to by I2, then key of S1 will be less than smallest key in + the N2. Normally, search checks that key we are looking for is in + the range of keys covered by the node key is being looked in. To work + around this situation, while preserving useful consistency check, + new flag CBK_TRUST_DK was added to the cbk flags bitmask. This flag + is automatically set on entrance to the coord_by_key() and is only + cleared when we are about to enter situation described above. + + (2) If extent E1 is immediately followed by another extent E2 and we + are searching for the key that is between E1 and E2 we only have to + insert new empty leaf node when coord_by_key was called for + insertion, rather than just for lookup. To distinguish these cases, + new flag CBK_FOR_INSERT was added to the cbk flags bitmask. This flag + is automatically set by coord_by_key calls performed by + insert_by_key() and friends. + + (3) Insertion of new empty leaf node (possibly) requires + balancing. In any case it requires modification of node content which + is only possible under write lock. It may well happen that we only + have read lock on the node where new internal pointer is to be + inserted (common case: lookup of non-existent stat-data that falls + between two extents). If only read lock is held, tree traversal is + restarted with lock_level modified so that next time we hit this + problem, write lock will be held. Once we have write lock, balancing + will be performed. + + + + + + +*/ + +/* look to the right of @coord. If it is an item of internal type - 1 is + returned.
If that item is in right neighbor and it is internal - @coord and + @lh are switched to that node: move lock handle, zload right neighbor and + zrelse znode coord was set to at the beginning +*/ +/* Audited by: green(2002.06.15) */ +static int +is_next_item_internal(coord_t * coord) +{ + if (coord->item_pos != node_num_items(coord->node) - 1) { + /* next item is in the same node */ + coord_t right; + + coord_dup(&right, coord); + check_me("vs-742", coord_next_item(&right) == 0); + if (item_is_internal(&right)) { + coord_dup(coord, &right); + return 1; + } + } + return 0; +} + +/* inserting empty leaf after (or between) item of not internal type we have to + know which right delimiting key corresponding znode has to be inserted with */ +static reiser4_key * +rd_key(coord_t * coord, reiser4_key * key) +{ + coord_t dup; + + assert("nikita-2281", coord_is_between_items(coord)); + coord_dup(&dup, coord); + + RLOCK_DK(current_tree); + + if (coord_set_to_right(&dup) == 0) + /* get right delimiting key from an item to the right of @coord */ + unit_key_by_coord(&dup, key); + else + /* use right delimiting key of parent znode */ + *key = *znode_get_rd_key(coord->node); + + RUNLOCK_DK(current_tree); + return key; +} + + +ON_DEBUG(void check_dkeys(const znode *);) + +/* this is used to insert empty node into leaf level if tree lookup can not go + further down because it stopped between items of not internal type */ +static int +add_empty_leaf(coord_t * insert_coord, lock_handle * lh, const reiser4_key * key, const reiser4_key * rdkey) +{ + int result; + carry_pool *pool; + carry_level todo; + carry_op *op; + znode *node; + reiser4_item_data item; + carry_insert_data cdata; + reiser4_tree *tree; + + pool = init_carry_pool(); + if (IS_ERR(pool)) + return PTR_ERR(pool); + init_carry_level(&todo, pool); + assert("vs-49827", znode_contains_key_lock(insert_coord->node, key)); + + tree = znode_get_tree(insert_coord->node); + node = new_node(insert_coord->node, LEAF_LEVEL); + if 
(IS_ERR(node)) { + done_carry_pool(pool); + return PTR_ERR(node); + } + + /* setup delimiting keys for node being inserted */ + WLOCK_DK(tree); + znode_set_ld_key(node, key); + znode_set_rd_key(node, rdkey); + ON_DEBUG(node->creator = current); + ON_DEBUG(node->first_key = *key); + WUNLOCK_DK(tree); + + ZF_SET(node, JNODE_ORPHAN); + op = post_carry(&todo, COP_INSERT, insert_coord->node, 0); + if (!IS_ERR(op)) { + cdata.coord = insert_coord; + cdata.key = key; + cdata.data = &item; + op->u.insert.d = &cdata; + op->u.insert.type = COPT_ITEM_DATA; + build_child_ptr_data(node, &item); + item.arg = NULL; + /* have @insert_coord to be set at inserted item after + insertion is done */ + todo.track_type = CARRY_TRACK_CHANGE; + todo.tracked = lh; + + result = carry(&todo, 0); + if (result == 0) { + /* + * pin node in memory. This is necessary for + * znode_make_dirty() below. + */ + result = zload(node); + if (result == 0) { + lock_handle local_lh; + + /* + * if we inserted new child into tree we have + * to mark it dirty so that flush will be able + * to process it. + */ + init_lh(&local_lh); + result = longterm_lock_znode(&local_lh, node, + ZNODE_WRITE_LOCK, + ZNODE_LOCK_LOPRI); + if (result == 0) { + znode_make_dirty(node); + + /* when internal item pointing to @node + was inserted into twig node + create_hook_internal did not connect + it properly because its right + neighbor was not known. 
Do it + here */ + WLOCK_TREE(tree); + assert("nikita-3312", znode_is_right_connected(node)); + assert("nikita-2984", node->right == NULL); + ZF_CLR(node, JNODE_RIGHT_CONNECTED); + WUNLOCK_TREE(tree); + result = connect_znode(insert_coord, node); + if (result == 0) + ON_DEBUG(check_dkeys(node)); + + done_lh(lh); + move_lh(lh, &local_lh); + assert("vs-1676", node_is_empty(node)); + coord_init_first_unit(insert_coord, node); + } else { + warning("nikita-3136", + "Cannot lock child"); + print_znode("child", node); + } + done_lh(&local_lh); + zrelse(node); + } + } + } else + result = PTR_ERR(op); + zput(node); + done_carry_pool(pool); + return result; +} + +/* handle extent-on-the-twig-level cases in tree traversal */ +reiser4_internal int +handle_eottl(cbk_handle * h /* cbk handle */ , + int *outcome /* how traversal should proceed */ ) +{ + int result; + reiser4_key key; + coord_t *coord; + + coord = h->coord; + + if (h->level != TWIG_LEVEL || (coord_is_existing_item(coord) && item_is_internal(coord))) { + /* Continue to traverse tree downward. */ + return 0; + } + /* strange item type found on non-stop level?! Twig + horrors? */ + assert("vs-356", h->level == TWIG_LEVEL); + assert("vs-357", ( { + coord_t lcoord; + coord_dup(&lcoord, coord); + check_me("vs-733", coord_set_to_left(&lcoord) == 0); + item_is_extent(&lcoord);} + )); + + if (*outcome == NS_FOUND) { + /* we have found desired key on twig level in extent item */ + h->result = CBK_COORD_FOUND; + *outcome = LOOKUP_DONE; + return 1; + } + + if (!(h->flags & CBK_FOR_INSERT)) { + /* tree traversal is not for insertion. Just return + CBK_COORD_NOTFOUND. */ + h->result = CBK_COORD_NOTFOUND; + *outcome = LOOKUP_DONE; + return 1; + } + + /* take a look at the item to the right of h -> coord */ + result = is_next_item_internal(coord); + if (result == 0) { + /* item to the right is also an extent one. Allocate a new node + and insert pointer to it after item h -> coord. 
+ + This is a result of extents being located at the twig + level. For explanation, see comment just above + is_next_item_internal(). + */ + if (cbk_lock_mode(h->level, h) != ZNODE_WRITE_LOCK) { + /* we got node read locked, restart coord_by_key to + have write lock on twig level */ + h->lock_level = TWIG_LEVEL; + h->lock_mode = ZNODE_WRITE_LOCK; + *outcome = LOOKUP_REST; + return 1; + } + + result = add_empty_leaf(coord, h->active_lh, h->key, rd_key(coord, &key)); + if (result) { + h->error = "could not add empty leaf"; + h->result = result; + *outcome = LOOKUP_DONE; + return 1; + } + /* added empty leaf is locked, its parent node is unlocked, + coord is set as EMPTY */ + *outcome = LOOKUP_DONE; + h->result = CBK_COORD_NOTFOUND; + return 1; + /*assert("vs-358", keyeq(h->key, item_key_by_coord(coord, &key)));*/ + } else { + /* this is special case mentioned in the comment on + tree.h:cbk_flags. We have found internal item immediately + on the right of extent, and we are going to insert new item + there. Key of item we are going to insert is smaller than + leftmost key in the node pointed to by said internal item + (otherwise search wouldn't come to the extent in the first + place). + + This is a result of extents being located at the twig + level. For explanation, see comment just above + is_next_item_internal(). + */ + h->flags &= ~CBK_TRUST_DK; + } + assert("vs-362", item_is_internal(coord)); + return 0; +} + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/estimate.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/estimate.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,101 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "debug.h" +#include "dformat.h" +#include "tree.h" +#include "carry.h" +#include "inode.h" +#include "cluster.h" +#include "plugin/item/ctail.h" + +/* this returns how many nodes might get dirty and added nodes if @children nodes are dirtied + + Amount of internals which will get dirty or get allocated we estimate as 10% of the children + 1 balancing. 1 balancing + is 2 neighbours, 2 new blocks and the current block on the leaf level, 2 neighbour nodes + the current (or 1 + neighbour and 1 new and the current) on twig level, 2 neighbour nodes on upper levels and 1 for a new root. So 5 for + leaf level, 3 for twig level, 2 on upper + 1 for root. + + Do not calculate the current node of the lowest level here - this is overhead only. + + children is almost always 1 here. Exception is flow insertion +*/ +static reiser4_block_nr +max_balance_overhead(reiser4_block_nr childen, tree_level tree_height) +{ + reiser4_block_nr ten_percent; + + ten_percent = ((103 * childen) >> 10); + + /* If we have too many balancings at the time, tree height can rise by more + than 1. Assume that if tree_height is 5, it can rise by 1 only. */ + return ((tree_height < 5 ?
5 : tree_height) * 2 + (4 + ten_percent)); +} + +/* this returns maximal possible number of nodes which can be modified plus number of new nodes which can be required to + perform insertion of one item into the tree */ +/* it is only called when tree height changes, or gets initialized */ +reiser4_internal reiser4_block_nr +calc_estimate_one_insert(tree_level height) +{ + return 1 + max_balance_overhead(1, height); +} + +reiser4_internal reiser4_block_nr +estimate_one_insert_item(reiser4_tree *tree) +{ + return tree->estimate_one_insert; +} + +/* this returns maximal possible number of nodes which can be modified plus number of new nodes which can be required to + perform insertion of one unit into an item in the tree */ +reiser4_internal reiser4_block_nr +estimate_one_insert_into_item(reiser4_tree *tree) +{ + /* estimate insert into item just like item insertion */ + return tree->estimate_one_insert; +} + +reiser4_internal reiser4_block_nr +estimate_one_item_removal(reiser4_tree *tree) +{ + /* on item removal reiser4 does not try to pack nodes more compactly, so, only one node may be dirtied on leaf + level */ + return tree->estimate_one_insert; +} + +/* on leaf level insert_flow may add CARRY_FLOW_NEW_NODES_LIMIT new nodes and dirty 3 existing nodes (insert point and + both its neighbors).
Max_balance_overhead should estimate number of blocks which may change/get added on internal + levels */ +reiser4_internal reiser4_block_nr +estimate_insert_flow(tree_level height) +{ + return 3 + CARRY_FLOW_NEW_NODES_LIMIT + max_balance_overhead(3 + CARRY_FLOW_NEW_NODES_LIMIT, height); +} + +/* returns max number of nodes that can be occupied by disk cluster */ +reiser4_internal reiser4_block_nr +estimate_disk_cluster(struct inode * inode) +{ + return 2 + cluster_nrpages(inode); +} + +/* how many nodes might get dirty and added nodes during insertion of a disk cluster */ +reiser4_internal reiser4_block_nr +estimate_insert_cluster(struct inode * inode, int unprepped) +{ + int per_cluster; + per_cluster = (unprepped ? 1 : cluster_nrpages(inode)); + + return 3 + per_cluster + max_balance_overhead(3 + per_cluster, REISER4_MAX_ZTREE_HEIGHT); +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/file_ops.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/file_ops.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,421 @@ +/* Copyright 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* + * Interface to VFS. Reiser4 file_operations are defined here. + * + * This file contains definitions of functions that are installed into ->i_fop + * field of reiser4 inodes. + * + * For the most part these functions simply find object plugin of inode + * involved, and call appropriate plugin method to do the actual work.
+ */ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "coord.h" +#include "plugin/item/item.h" +#include "plugin/file/file.h" +#include "plugin/security/perm.h" +#include "plugin/disk_format/disk_format.h" +#include "plugin/plugin.h" +#include "plugin/plugin_set.h" +#include "plugin/object.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree.h" +#include "vfs_ops.h" +#include "inode.h" +#include "page_cache.h" +#include "ktxnmgrd.h" +#include "super.h" +#include "reiser4.h" +#include "entd.h" +#include "emergency_flush.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + + +/* file operations */ + +static loff_t reiser4_llseek(struct file *, loff_t, int); +static ssize_t reiser4_read(struct file *, char *, size_t, loff_t *); +static ssize_t reiser4_write(struct file *, const char *, size_t, loff_t *); +static int reiser4_readdir(struct file *, void *, filldir_t); +static int reiser4_ioctl(struct inode *, struct file *, unsigned int cmd, unsigned long arg); +static int reiser4_mmap(struct file *, struct vm_area_struct *); +static int reiser4_release(struct inode *, struct file *); +static int reiser4_fsync(struct file *, struct dentry *, int datasync); +static int reiser4_open(struct inode *, struct file *); +static ssize_t reiser4_sendfile(struct file *, loff_t *, size_t, read_actor_t, void __user *); + +#if 0 +static unsigned int reiser4_poll(struct file *, struct poll_table_struct *); +static int reiser4_flush(struct file *); +static int reiser4_fasync(int, struct file *, int); +static int reiser4_lock(struct file *, int, struct file_lock *); +static ssize_t reiser4_readv(struct file *, const struct iovec *, unsigned long, loff_t *); +static ssize_t reiser4_writev(struct file *, const struct iovec *, unsigned long, loff_t *); +static ssize_t 
reiser4_sendpage(struct file *, struct page *, int, size_t, loff_t *, int); +static unsigned long reiser4_get_unmapped_area(struct file *, unsigned long, + unsigned long, unsigned long, unsigned long); +#endif + +/* + * ->llseek() file operation for reiser4. Calls ->seek() method of object + * plugin. + */ +static loff_t +reiser4_llseek(struct file *file, loff_t off, int origin) +{ + loff_t result; + file_plugin *fplug; + struct inode *inode = file->f_dentry->d_inode; + loff_t(*seek_fn) (struct file *, loff_t, int); + reiser4_context ctx; + + init_context(&ctx, inode->i_sb); + + fplug = inode_file_plugin(inode); + assert("nikita-2291", fplug != NULL); + seek_fn = fplug->seek ? : generic_file_llseek; + result = seek_fn(file, off, origin); + reiser4_exit_context(&ctx); + return result; +} + +/* reiser4_readdir() - our readdir() method. + + readdir(2)/getdents(2) interface is based on implicit assumption that + readdir can be restarted from any particular point by supplying file + system with off_t-full of data. That is, file system fills ->d_off + field in struct dirent and later user passes ->d_off to the + seekdir(3), which is, actually, implemented by glibc as lseek(2) on + directory. + + Reiser4 cannot restart readdir from 64 bits of data, because two last + components of the key of directory entry are unknown, which gives 128 + bits: locality and type fields in the key of directory entry are + always known, to start readdir() from given point objectid and offset + fields have to be filled. + + See plugin/dir/dir.c:readdir_common() for the details of our solution.
+*/ +static int +reiser4_readdir(struct file *f /* directory file being read */ , + void *dirent /* opaque data passed to us by VFS */ , + filldir_t filldir /* filler function passed to us + * by VFS */ ) +{ + dir_plugin *dplug; + int result; + struct inode *inode; + reiser4_context ctx; + + inode = f->f_dentry->d_inode; + init_context(&ctx, inode->i_sb); + + dplug = inode_dir_plugin(inode); + if ((dplug != NULL) && (dplug->readdir != NULL)) + result = dplug->readdir(f, dirent, filldir); + else + result = RETERR(-ENOTDIR); + + /* + * directory st_atime is updated by callers (if necessary). + */ + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); + return result; +} + +/* + reiser4_ioctl - handler for ioctl for inode supported commands: +*/ +static int +reiser4_ioctl(struct inode *inode, struct file *filp, unsigned int cmd, unsigned long arg) +{ + int result; + reiser4_context ctx; + + init_context(&ctx, inode->i_sb); + + if (inode_file_plugin(inode)->ioctl == NULL) + result = -ENOSYS; + else + result = inode_file_plugin(inode)->ioctl(inode, filp, cmd, arg); + + reiser4_exit_context(&ctx); + return result; +} + +/* ->mmap() VFS method in reiser4 file_operations */ +static int +reiser4_mmap(struct file *file, struct vm_area_struct *vma) +{ + struct inode *inode; + int result; + reiser4_context ctx; + + init_context(&ctx, file->f_dentry->d_inode->i_sb); + + inode = file->f_dentry->d_inode; + assert("nikita-2936", inode_file_plugin(inode)->mmap != NULL); + result = inode_file_plugin(inode)->mmap(file, vma); + reiser4_exit_context(&ctx); + return result; +} + +/* reiser4 implementation of ->read() VFS method, member of reiser4 struct file_operations + + reads some part of a file from the filesystem into the user space buffer + + gets the plugin for the file and calls its read method which does everything except some initialization + +*/ +static ssize_t +reiser4_read(struct file *file /* file to read from */ , + char *buf /* user-space buffer to put data read 
+ * from the file */ , + size_t count /* bytes to read */ , + loff_t * off /* current position within the file, which needs to be increased by the act of reading. Reads + * start from here. */ ) +{ + ssize_t result; + struct inode *inode; + reiser4_context ctx; + + assert("umka-072", file != NULL); + assert("umka-074", off != NULL); + + inode = file->f_dentry->d_inode; + init_context(&ctx, inode->i_sb); + + result = perm_chk(inode, read, file, buf, count, off); + if (likely(result == 0)) { + file_plugin *fplug; + + fplug = inode_file_plugin(inode); + assert("nikita-417", fplug != NULL); + assert("nikita-2935", fplug->read != NULL); + + /* unix_file_read is one method that might be invoked below */ + result = fplug->read(file, buf, count, off); + } + reiser4_exit_context(&ctx); + return result; +} + +/* ->write() VFS method in reiser4 file_operations */ +static ssize_t +reiser4_write(struct file *file /* file to write on */ , + const char *buf /* user-space buffer to get data + * to write into the file */ , + size_t size /* bytes to write */ , + loff_t * off /* offset to start writing + * from. This is updated to indicate + * actual number of bytes written */ ) +{ + struct inode *inode; + ssize_t result; + reiser4_context ctx; + + assert("nikita-1421", file != NULL); + assert("nikita-1424", off != NULL); + + inode = file->f_dentry->d_inode; + init_context(&ctx, inode->i_sb); + + result = perm_chk(inode, write, file, buf, size, off); + if (likely(result == 0)) { + file_plugin *fplug; + + fplug = inode_file_plugin(inode); + assert("nikita-2934", fplug->write != NULL); + + result = fplug->write(file, buf, size, off); + } + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); + return result; +} + +/* Release reiser4 file. This is f_op->release() method. 
Called when the last + holder closes a file */ +static int +reiser4_release(struct inode *i /* inode released */ , + struct file *f /* file released */ ) +{ + file_plugin *fplug; + int result; + reiser4_context ctx; + + assert("umka-081", i != NULL); + assert("nikita-1447", f != NULL); + + init_context(&ctx, i->i_sb); + fplug = inode_file_plugin(i); + assert("umka-082", fplug != NULL); + + + if (fplug->release != NULL && get_current_context() == &ctx) + result = fplug->release(i, f); + else + /* + no ->release method defined, or we are within reiser4 + context already. How is the latter possible? Simple: + + (gdb) bt + #0 get_exclusive_access () + #2 0xc01e56d3 in release_unix_file () + #3 0xc01c3643 in reiser4_release () + #4 0xc014cae0 in __fput () + #5 0xc013ffc3 in remove_vm_struct () + #6 0xc0141786 in exit_mmap () + #7 0xc0118480 in mmput () + #8 0xc0133205 in oom_kill () + #9 0xc01332d1 in out_of_memory () + #10 0xc013bc1d in try_to_free_pages () + #11 0xc013427b in __alloc_pages () + #12 0xc013f058 in do_anonymous_page () + #13 0xc013f19d in do_no_page () + #14 0xc013f60e in handle_mm_fault () + #15 0xc01131e5 in do_page_fault () + #16 0xc0104935 in error_code () + #17 0xc025c0c6 in __copy_to_user_ll () + #18 0xc01d496f in read_tail () + #19 0xc01e4def in read_unix_file () + #20 0xc01c3504 in reiser4_read () + #21 0xc014bd4f in vfs_read () + #22 0xc014bf66 in sys_read () + */ + result = 0; + + reiser4_free_file_fsdata(f); + + reiser4_exit_context(&ctx); + return result; +} + +/* + * ->open file operation for reiser4. This is an optional method. It's only + * present for mounts that support pseudo files. When the "nopseudo" mount option + * is used, this method is zeroed, which speeds up the open(2) system call a bit. 
+ */ +static int +reiser4_open(struct inode * inode, struct file * file) +{ + int result; + + reiser4_context ctx; + file_plugin *fplug; + + init_context(&ctx, inode->i_sb); + fplug = inode_file_plugin(inode); + + if (fplug->open != NULL) + result = fplug->open(inode, file); + else + result = 0; + + reiser4_exit_context(&ctx); + return result; +} + +/* ->fsync file operation for reiser4. */ +static int +reiser4_fsync(struct file *file, struct dentry *dentry, int datasync) +{ + int result; + reiser4_context ctx; + file_plugin *fplug; + struct inode *inode; + + inode = dentry->d_inode; + init_context(&ctx, inode->i_sb); + fplug = inode_file_plugin(inode); + if (fplug->sync != NULL) + result = fplug->sync(inode, datasync); + else + result = 0; + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); + return result; +} + +/* Reads @count bytes from @file and calls @actor for every read page. This is + needed for loop back devices support. */ +static ssize_t reiser4_sendfile(struct file *file, loff_t *ppos, + size_t count, read_actor_t actor, + void __user *target) +{ + int result; + file_plugin *fplug; + reiser4_context ctx; + struct inode *inode; + + inode = file->f_dentry->d_inode; + init_context(&ctx, inode->i_sb); + + fplug = inode_file_plugin(inode); + + if (fplug->sendfile != NULL) + result = fplug->sendfile(file, ppos, count, actor, target); + else + result = RETERR(-EINVAL); + + reiser4_exit_context(&ctx); + return result; +} + + +struct file_operations reiser4_file_operations = { + .llseek = reiser4_llseek, /* d */ + .read = reiser4_read, /* d */ + .write = reiser4_write, /* d */ + .readdir = reiser4_readdir, /* d */ +/* .poll = reiser4_poll, */ + .ioctl = reiser4_ioctl, + .mmap = reiser4_mmap, /* d */ + .open = reiser4_open, +/* .flush = reiser4_flush, */ + .release = reiser4_release, /* d */ + .fsync = reiser4_fsync /* d */, + .sendfile = reiser4_sendfile, +/* .fasync = reiser4_fasync, */ +/* .lock = reiser4_lock, */ +/* .readv = reiser4_readv, */ 
+/* .writev = reiser4_writev, */ +/* .sendpage = reiser4_sendpage, */ +/* .get_unmapped_area = reiser4_get_unmapped_area */ +}; + + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/flush.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/flush.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,3499 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* The design document for this file is at http://www.namesys.com/v4/v4.html. */ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "key.h" +#include "coord.h" +#include "type_safe_list.h" +#include "plugin/item/item.h" +#include "plugin/plugin.h" +#include "plugin/object.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree_walk.h" +#include "carry.h" +#include "tree.h" +#include "vfs_ops.h" +#include "inode.h" +#include "page_cache.h" +#include "wander.h" +#include "super.h" +#include "entd.h" +#include "reiser4.h" +#include "flush.h" +#include "writeout.h" + +#include +#include /* for struct super_block */ +#include /* for struct page */ +#include /* for struct bio */ +#include +#include + +/* IMPLEMENTATION NOTES */ + +/* PARENT-FIRST: Some terminology: A parent-first traversal is a way of assigning a total + order to the nodes of the tree in which the parent is placed before its children, which + are ordered (recursively) in left-to-right order. When we speak of a "parent-first preceder", it + describes the node that "came before in forward parent-first order". When we speak of a + "parent-first follower", it describes the node that "comes next in parent-first + order" (alternatively the node that "came before in reverse parent-first order"). 
+ + The following pseudo-code prints the nodes of a tree in forward parent-first order: + + void parent_first (node) + { + print_node (node); + if (node->level > leaf) { + for (i = 0; i < num_children; i += 1) { + parent_first (node->child[i]); + } + } + } +*/ + +/* JUST WHAT ARE WE TRYING TO OPTIMIZE, HERE? The idea is to optimize block allocation so + that a left-to-right scan of the tree's data (i.e., the leaves in left-to-right order) + can be accomplished with sequential reads, which results in reading nodes in their + parent-first order. This is a read-optimization aspect of the flush algorithm, and + there is also a write-optimization aspect, which is that we wish to make large + sequential writes to the disk by allocating or reallocating blocks so that they can be + written in sequence. Sometimes the read-optimization and write-optimization goals + conflict with each other, as we discuss in more detail below. +*/ + +/* STATE BITS: The flush code revolves around the state of the jnodes it covers. Here are + the relevant jnode->state bits and their relevance to flush: + + JNODE_DIRTY: If a node is dirty, it must be flushed. But in order to be written it + must be allocated first. In order to be considered allocated, the jnode must have + exactly one of { JNODE_OVRWR, JNODE_RELOC } set. These two bits are exclusive, and + all dirtied jnodes eventually have one of these bits set during each transaction. + + JNODE_CREATED: The node was freshly created in its transaction and has no previous + block address, so it is unconditionally assigned to be relocated, although this is + mainly for code-convenience. It is not being 'relocated' from anything, but in + almost every regard it is treated as part of the relocate set. 
The JNODE_CREATED bit + remains set even after JNODE_RELOC is set, so the actual relocate set can be + distinguished from the created-and-allocated set easily: relocate-set members + (belonging to the preserve-set) have (JNODE_RELOC) set and created-set members which + have no previous location to preserve have (JNODE_RELOC | JNODE_CREATED) set. + + JNODE_OVRWR: The node belongs to the atom's overwrite set. The flush algorithm made the + decision to maintain the pre-existing location for this node and it will be written + to the wandered-log. + + JNODE_RELOC: The flush algorithm made the decision to relocate this block (if it was + not created, see note above). A block with JNODE_RELOC set is eligible for + early-flushing and may be submitted during flush_empty_queues. When the JNODE_RELOC + bit is set on a znode, the parent node's internal item is modified and the znode is + rehashed. + + JNODE_SQUEEZABLE: Before shifting everything left, the flush algorithm scans the node + and calls the plugin->f.squeeze() method for its items. This is how we update disk + clusters of cryptcompress objects. Also, if the leftmost point that was found by flush scan + has this flag (races with write(), rare case) the flush algorithm makes the decision + to pass it to squalloc() in spite of its flushprepped status for squeezing, not for + repeated allocation. + + JNODE_FLUSH_QUEUED: This bit is set when a call to flush enters the jnode into its + flush queue. This means the jnode is not on any clean or dirty list; instead it is + moved to the private list of one of the flush queue objects (see flush_queue.h). This + prevents multiple concurrent flushes from attempting to start flushing from the + same node. + + (DEAD STATE BIT) JNODE_FLUSH_BUSY: This bit was set during the bottom-up + squeeze-and-allocate on a node while its children are actively being squeezed and + allocated. 
This flag was created to avoid submitting a write request for a node + while its children are still being allocated and squeezed. Then the flush queue was + re-implemented to allow an unlimited number of nodes to be queued. Support for this flag was + commented out in source code because we decided that there was no reason to submit + queued nodes before jnode_flush() finishes. However, current code calls write_fq() + during a slum traversal and may submit "busy nodes" to disk. Probably we can + re-enable the JNODE_FLUSH_BUSY bit support in future. + + With these state bits, we describe a test used frequently in the code below, + jnode_is_flushprepped() (and the spin-lock-taking jnode_check_flushprepped()). The + test for "flushprepped" returns true if any of the following are true: + + - The node is not dirty + - The node has JNODE_RELOC set + - The node has JNODE_OVRWR set + + If either the node is not dirty or it has already been processed by flush (and assigned + JNODE_OVRWR or JNODE_RELOC), then it is prepped. If jnode_is_flushprepped() returns + false then flush has work to do on that node. +*/ + +/* FLUSH_PREP_ONCE_PER_TRANSACTION: Within a single transaction a node is never + flushprepped twice (unless an explicit call to flush_unprep is made as described in + detail below). For example a node is dirtied, allocated, and then early-flushed to + disk and set clean. Before the transaction commits, the page is dirtied again and, due + to memory pressure, the node is flushed again. The flush algorithm will not relocate + the node to a new disk location, it will simply write it to the same, previously + relocated position again. +*/ + +/* THE BOTTOM-UP VS. TOP-DOWN ISSUE: This code implements a bottom-up algorithm where we + start at a leaf node and allocate in parent-first order by iterating to the right. At + each step of the iteration, we check for the right neighbor. 
Before advancing to the + right neighbor, we check if the current position and the right neighbor share the same + parent. If they do not share the same parent, the parent is allocated before the right + neighbor. + + This process goes recursively up the tree and squeezes nodes level by level as long as + the right neighbor and the current position have different parents, then it allocates + the right-neighbors-with-different-parents on the way back down. This process is + described in more detail in flush_squalloc_changed_ancestor and the recursive function + squalloc_one_changed_ancestor. But the purpose here is not so much to discuss the + specifics of the bottom-up approach as to contrast the bottom-up and top-down + approaches. + + The top-down algorithm was implemented earlier (April-May 2002). In the top-down + approach, we find a starting point by scanning left along each level past dirty nodes, + then going up and repeating the process until the left node and the parent node are + clean. We then perform a parent-first traversal from the starting point, which makes + allocating in parent-first order trivial. After one subtree has been allocated in this + manner, we move to the right, try moving upward, then repeat the parent-first + traversal. + + Both approaches have problems that need to be addressed. Both are approximately the + same amount of code, but the bottom-up approach has advantages in the order it acquires + locks which, at the very least, make it the better approach. At first glance each one + makes the other one look simpler, so it is important to remember a few of the problems + with each one. + + Main problem with the top-down approach: When you encounter a clean child during the + parent-first traversal, what do you do? You would like to avoid searching through a + large tree of nodes just to find a few dirty leaves at the bottom, and there is not an + obvious solution. 
One of the advantages of the top-down approach is that during the + parent-first traversal you check every child of a parent to see if it is dirty. In + this way, the top-down approach easily handles the main problem of the bottom-up + approach: unallocated children. + + The unallocated children problem is that before writing a node to disk we must make + sure that all of its children are allocated. Otherwise, writing the node means + extra I/O because the node will have to be written again when the child is finally + allocated. + + WE HAVE NOT YET ELIMINATED THE UNALLOCATED CHILDREN PROBLEM. Except for bugs, this + should not cause any file system corruption; it only degrades I/O performance because a + node may be written when it is sure to be written at least one more time in the same + transaction when the remaining children are allocated. What follows is a description + of how we will solve the problem. +*/ + +/* HANDLING UNALLOCATED CHILDREN: During flush we may allocate a parent node then, + proceeding in parent-first order, allocate some of its left-children, then encounter a + clean child in the middle of the parent. We do not allocate the clean child, but there + may remain unallocated (dirty) children to the right of the clean child. If we were to + stop flushing at this moment and write everything to disk, the parent might still + contain unallocated children. + + We could try to allocate all the descendants of every node that we allocate, but this + is not necessary. Doing so could result in allocating the entire tree: if the root + node is allocated then every unallocated node would have to be allocated before + flushing. Actually, we do not have to write a node just because we allocate it. It is + possible to allocate but not write a node during flush, when it still has unallocated + children. However, this approach is probably not optimal for the following reason. 
+ + The flush algorithm is designed to allocate nodes in parent-first order in an attempt + to optimize reads that occur in the same order. Thus we are read-optimizing for a + left-to-right scan through all the leaves in the system, and we are hoping to + write-optimize at the same time because those nodes will be written together in a batch. + What happens, however, if we assign a block number to a node in its read-optimized + order but then avoid writing it because it has unallocated children? In that + situation, we lose out on the write-optimization aspect because a node will have to be + written again to its location on the device, later, which likely means seeking back + to that location. + + So there are tradeoffs. We can choose one of: + + A. Allocate all unallocated children to preserve both write-optimization and + read-optimization, but this is not always desirable because it may mean having to + allocate and flush very many nodes at once. + + B. Defer writing nodes with unallocated children, keep their read-optimized locations, + but sacrifice write-optimization because those nodes will be written again. + + C. Defer writing nodes with unallocated children, but do not keep their read-optimized + locations. Instead, choose to write-optimize them later, when they are written. To + facilitate this, we "undo" the read-optimized allocation that was given to the node so + that later it can be write-optimized, thus "unpreparing" the flush decision. This is a + case where we disturb the FLUSH_PREP_ONCE_PER_TRANSACTION rule described above. By a + call to flush_unprep() we will: if the node was wandered, unset the JNODE_OVRWR bit; + if the node was relocated, unset the JNODE_RELOC bit, non-deferred-deallocate its block + location, and set the JNODE_CREATED bit, effectively setting the node back to an + unallocated state. + + We will take the following approach in v4.0: for twig nodes we will always finish + allocating unallocated children (A). 
For nodes with (level > TWIG) we will defer + writing and choose write-optimization (C). + + To summarize, there are several parts to a solution that avoids the problem with + unallocated children: + + FIXME-ZAM: None of these approaches has been implemented yet to eliminate the "UNALLOCATED CHILDREN" + problem, because an experiment showed that we have 1-2 nodes + with unallocated children for thousands of written nodes. The experiment was simple, + like copying / deletion of linux kernel sources. However, the problem can arise in more + complex tests. I think we can use jnode_io_hook to insert a check for unallocated + children and see what kind of problem we have. + + 1. When flush reaches a stopping point (e.g., a clean node), it should continue calling + squeeze-and-allocate on any remaining unallocated children. FIXME: Difficulty to + implement: should be simple -- amounts to adding a while loop to jnode_flush, see + comments in that function. + + 2. When flush reaches flush_empty_queue(), some of the (level > TWIG) nodes may still + have unallocated children. If the twig level has unallocated children it is an + assertion failure. If a higher-level node has unallocated children, then it should be + explicitly de-allocated by a call to flush_unprep(). FIXME: Difficulty to implement: + should be simple. + + 3. (CPU-Optimization) Checking whether a node has unallocated children may consume more + CPU cycles than we would like, and it is possible (but medium complexity) to optimize + this somewhat in the case where large sub-trees are flushed. The following observation + helps: if both the left- and right-neighbor of a node are processed by the flush + algorithm then the node itself is guaranteed to have all of its children allocated. + However, the cost of this check may not be so expensive after all: it is not needed for + leaves and flush can guarantee this property for twigs. 
That leaves only (level > + TWIG) nodes that have to be checked, so this optimization only helps if at least three + (level > TWIG) nodes are flushed in one pass, and the savings will be very small unless + there are many more (level > TWIG) nodes. But if there are many (level > TWIG) nodes + then the number of blocks being written will be very large, so the savings may be + insignificant. That said, the idea is to maintain both the left and right edges of + nodes that are processed in flush. When flush_empty_queue() is called, a relatively + simple test will tell whether the (level > TWIG) node is on the edge. If it is on the + edge, the slow check is necessary, but if it is in the interior then it can be assumed + to have all of its children allocated. FIXME: medium complexity to implement, but + simple to verify given that we must have a slow check anyway. + + 4. (Optional) This part is optional, not for v4.0--flush should work independently of + whether this option is used or not. Called RAPID_SCAN, the idea is to amend the + left-scan operation to take unallocated children into account. Normally, the left-scan + operation goes left as long as adjacent nodes are dirty up until some large maximum + value (FLUSH_SCAN_MAXNODES) at which point it stops and begins flushing. But scan-left + may stop at a position where there are unallocated children to the left with the same + parent. When RAPID_SCAN is enabled, the ordinary scan-left operation stops after + FLUSH_RELOCATE_THRESHOLD, which is much smaller than FLUSH_SCAN_MAXNODES, then proceeds + with a rapid scan. The rapid scan skips all the interior children of a node--if the + leftmost child of a twig is dirty, check its left neighbor (the rightmost child of the + twig to the left). If the left neighbor of the leftmost child is also dirty, then + continue the scan at the left twig and repeat. 
This option will cause flush to + allocate more twigs in a single pass, but it also has the potential to write many more + nodes than would be written without the RAPID_SCAN option. RAPID_SCAN + was partially implemented, code removed August 12, 2002 by JMACD. +*/ + +/* FLUSH CALLED ON NON-LEAF LEVEL. Most of our design considerations assume that the + starting point for flush is a leaf node, but actually the flush code cares very little + about whether or not this is true. It is possible that all the leaf nodes are flushed + and dirty parent nodes still remain, in which case jnode_flush() is called on a + non-leaf argument. Flush doesn't care--it treats the argument node as if it were a + leaf, even when it is not. This is a simple approach, and there may be a more optimal + policy but until a problem with this approach is discovered, simplest is probably best. + + NOTE: In this case, the ordering produced by flush is parent-first only if you ignore + the leaves. This is done as a matter of simplicity and there is only one (shaky) + justification. When an atom commits, it flushes all leaf level nodes first, followed + by twigs, and so on. With flushing done in this order, if flush is eventually called + on a non-leaf node it means that (somehow) we reached a point where all leaves are + clean and only internal nodes need to be flushed. If that is the case, then it means + there were no leaves that were the parent-first preceder/follower of the parent. This + is expected to be a rare case, which is why we do nothing special about it. However, + memory pressure may pass an internal node to flush when there are still dirty leaf + nodes that need to be flushed, which could prove our original assumptions + "inoperative". If this needs to be fixed, then scan_left/right should have + special checks for the non-leaf levels. 
For example, instead of passing from a node to + the left neighbor, it should pass from the node to the left neighbor's rightmost + descendant (if dirty). + +*/ + +/* UNIMPLEMENTED AS YET: REPACKING AND RESIZING. We walk the tree in 4MB-16MB chunks, dirtying everything and putting + it into a transaction. We tell the allocator to allocate the blocks as far as possible towards one end of the + logical device--the left (starting) end of the device if we are walking from left to right, the right end of the + device if we are walking from right to left. We then make passes in alternating directions, and as we do this the + device becomes sorted such that tree order and block number order fully correlate. + + Resizing is done by shifting everything either all the way to the left or all the way + to the right, and then reporting the last block. +*/ + +/* RELOCATE DECISIONS: The code makes a decision to relocate in several places. This + describes the policy from the highest level: + + The FLUSH_RELOCATE_THRESHOLD parameter: If we count this many consecutive nodes on the + leaf level during flush-scan (right, left), then we unconditionally decide to relocate + leaf nodes. + + Otherwise, there are two contexts in which we make a decision to relocate: + + 1. The REVERSE PARENT-FIRST context: Implemented in reverse_relocate_test(). + During the initial stages of flush, after scan-right completes, we want to ask the + question: should we relocate this leaf node and thus dirty the parent node? Then if + the node is a leftmost child its parent is its own parent-first preceder, thus we repeat + the question at the next level up, and so on. In these cases we are moving in the + reverse parent-first direction. + + There is another case which is considered the reverse direction, which comes at the end + of a twig in reverse_relocate_end_of_twig(). As we finish processing a twig we may + reach a point where there is a clean twig to the right with a dirty leftmost child. 
In + this case, we may wish to relocate the child by testing if it should be relocated + relative to its parent. + + 2. The FORWARD PARENT-FIRST context: Testing for forward relocation is done in + allocate_znode. What distinguishes the forward parent-first case from the + reverse parent-first case is that the preceder has already been allocated in the + forward case, whereas in the reverse case we don't know what the preceder is until we + finish "going in reverse". That simplifies the forward case considerably, and there we + actually use the block allocator to determine whether, e.g., a block closer to the + preceder is available. +*/ + +/* SQUEEZE_LEFT_EDGE: Unimplemented idea for future consideration. The idea is, once we + finish scan-left and find a starting point, if the parent's left neighbor is dirty then + squeeze the parent's left neighbor and the parent. This may change the + flush-starting-node's parent. Repeat until the child's parent is stable. If the child + is a leftmost child, repeat this left-edge squeezing operation at the next level up. + Note that we cannot allocate extents during this or they will be out of parent-first + order. There are also some difficult coordinate maintenance issues. We can't do a tree + search to find coordinates again (because we hold locks), we have to determine them + from the two nodes being squeezed. Looks difficult, but has potential to increase + space utilization. */ + +/* Flush-scan helper functions. */ +static void scan_init(flush_scan * scan); +static void scan_done(flush_scan * scan); + +/* Flush-scan algorithm. 
*/ +static int scan_left(flush_scan * scan, flush_scan * right, jnode * node, unsigned limit); +static int scan_right(flush_scan * scan, jnode * node, unsigned limit); +static int scan_common(flush_scan * scan, flush_scan * other); +static int scan_formatted(flush_scan * scan); +static int scan_unformatted(flush_scan * scan, flush_scan * other); +static int scan_by_coord(flush_scan * scan); + +/* Initial flush-point ancestor allocation. */ +static int alloc_pos_and_ancestors(flush_pos_t * pos); +static int alloc_one_ancestor(const coord_t * coord, flush_pos_t * pos); +static int set_preceder(const coord_t * coord_in, flush_pos_t * pos); + +/* Main flush algorithm. Note on abbreviation: "squeeze and allocate" == "squalloc". */ +static int squalloc(flush_pos_t * pos); + +/* Flush squeeze implementation. */ +static int squeeze_right_non_twig(znode * left, znode * right); +static int shift_one_internal_unit(znode * left, znode * right); + +/* Flush reverse parent-first relocation routines. */ +static int reverse_relocate_if_close_enough(const reiser4_block_nr * pblk, const reiser4_block_nr * nblk); +static int reverse_relocate_test(jnode * node, const coord_t * parent_coord, flush_pos_t * pos); +static int reverse_relocate_check_dirty_parent(jnode * node, const coord_t * parent_coord, flush_pos_t * pos); + +/* Flush allocate write-queueing functions: */ +static int allocate_znode(znode * node, const coord_t * parent_coord, flush_pos_t * pos); +static int allocate_znode_update(znode * node, const coord_t * parent_coord, flush_pos_t * pos); +static int lock_parent_and_allocate_znode (znode *, flush_pos_t *); + +/* Flush helper functions: */ +static int jnode_lock_parent_coord(jnode * node, + coord_t * coord, + lock_handle * parent_lh, + load_count * parent_zh, + znode_lock_mode mode, int try); +static int neighbor_in_slum(znode * node, lock_handle * right_lock, sideof side, znode_lock_mode mode); +static int znode_same_parents(znode * a, znode * b); + +static int 
+znode_check_flushprepped(znode * node) +{ + return jnode_check_flushprepped(ZJNODE(node)); +} + +/* Flush position functions */ +static void pos_init(flush_pos_t * pos); +static int pos_valid(flush_pos_t * pos); +static void pos_done(flush_pos_t * pos); +static int pos_stop(flush_pos_t * pos); + +/* check that @org is first jnode extent unit, if extent is unallocated, + * because all jnodes of unallocated extent are dirty and of the same atom. */ +#define checkchild(scan) \ +assert("nikita-3435", \ + ergo(scan->direction == LEFT_SIDE && \ + (scan->parent_coord.node->level == TWIG_LEVEL) && \ + jnode_is_unformatted(scan->node) && \ + extent_is_unallocated(&scan->parent_coord), \ + extent_unit_index(&scan->parent_coord) == index_jnode(scan->node))) + + +/* This flush_cnt variable is used to track the number of concurrent flush operations, + useful for debugging. It is initialized in txnmgr.c out of laziness (because flush has + no static initializer function...) */ +ON_DEBUG(atomic_t flush_cnt;) + + +/* FIXME: remove me */#define FLUSH_CHECKS_CONGESTION 1 + +#if defined (FLUSH_CHECKS_CONGESTION) +/* check fs backing device for write congestion */ +static int check_write_congestion (void) +{ + struct super_block *sb; + struct backing_dev_info * bdi; + + sb = reiser4_get_current_sb(); + bdi = get_super_fake(sb)->i_mapping->backing_dev_info; + return bdi_write_congested(bdi); +} +#endif /* FLUSH_CHECKS_CONGESTION */ + +/* conditionally write flush queue */ +static int write_prepped_nodes (flush_pos_t * pos, int check_congestion) +{ + int ret; + + assert("zam-831", pos); + assert("zam-832", pos->fq); + + if (!(pos->flags & JNODE_FLUSH_WRITE_BLOCKS)) + return 0; + +#if defined (FLUSH_CHECKS_CONGESTION) + if (check_congestion && check_write_congestion()) + return 0; +#endif /* FLUSH_CHECKS_CONGESTION */ + + ret = write_fq(pos->fq, pos->nr_written, + WRITEOUT_SINGLE_STREAM | WRITEOUT_FOR_PAGE_RECLAIM); + return ret; +} + +/* Proper release all flush pos. 
resources then move flush position to new + locked node */ +static void move_flush_pos (flush_pos_t * pos, lock_handle * new_lock, + load_count * new_load, const coord_t * new_coord) +{ + assert ("zam-857", new_lock->node == new_load->node); + + if (new_coord) { + assert ("zam-858", new_coord->node == new_lock->node); + coord_dup(&pos->coord, new_coord); + } else { + coord_init_first_unit(&pos->coord, new_lock->node); + } + + if (pos->child) { + jput(pos->child); + pos->child = NULL; + } + + move_load_count(&pos->load, new_load); + done_lh(&pos->lock); + move_lh(&pos->lock, new_lock); +} + +/* Delete an empty node whose link from the parent still exists. */ +static int delete_empty_node (znode * node) +{ + reiser4_key smallest_removed; + + assert("zam-1019", node != NULL); + assert("zam-1020", node_is_empty(node)); + assert("zam-1023", znode_is_wlocked(node)); + + return delete_node(node, &smallest_removed, NULL, 1); +} + +/* Prepare flush position for alloc_pos_and_ancestors() and squalloc() */ +static int prepare_flush_pos(flush_pos_t *pos, jnode * org) +{ + int ret; + load_count load; + lock_handle lock; + + init_lh(&lock); + init_load_count(&load); + + if (jnode_is_znode(org)) { + ret = longterm_lock_znode(&lock, JZNODE(org), + ZNODE_WRITE_LOCK, ZNODE_LOCK_HIPRI); + if (ret) + return ret; + + ret = incr_load_count_znode(&load, JZNODE(org)); + if (ret) + goto done; /* release the long-term lock taken above */ + + pos->state = (jnode_get_level(org) == LEAF_LEVEL) ? POS_ON_LEAF : POS_ON_INTERNAL; + move_flush_pos(pos, &lock, &load, NULL); + } else { + coord_t parent_coord; + ret = jnode_lock_parent_coord(org, &parent_coord, &lock, + &load, ZNODE_WRITE_LOCK, 0); + if (ret) + goto done; + + pos->state = POS_ON_EPOINT; + move_flush_pos(pos, &lock, &load, &parent_coord); + pos->child = jref(org); + if (extent_is_unallocated(&parent_coord) && extent_unit_index(&parent_coord) != index_jnode(org)) { + /* @org is not the first child of its parent unit. 
This may happen + because longterm lock of its parent node was released between + scan_left and scan_right. For now, work around this by having flush repeat. */ + ret = -EAGAIN; + } + } + + done: + done_load_count(&load); + done_lh(&lock); + return ret; +} + +/* TODO LIST (no particular order): */ +/* I have labelled most of the legitimate FIXME comments in this file with letters to + indicate which issue they relate to. There are a few miscellaneous FIXMEs with + specific names mentioned instead that need to be inspected/resolved. */ +/* B. There is an issue described in reverse_relocate_test having to do with an + imprecise is_preceder? check having to do with partially-dirty extents. The code that + sets preceder hints and computes the preceder is basically untested. Careful testing + needs to be done that preceder calculations are done correctly, since if it doesn't + affect correctness we will not catch this stuff during regular testing. */ +/* C. EINVAL, E_DEADLOCK, E_NO_NEIGHBOR, ENOENT handling. It is unclear which of these are + considered expected but unlikely conditions. Flush currently returns 0 (i.e., success + but no progress, i.e., restart) whenever it receives any of these in jnode_flush(). + Many of the calls that may produce one of these return values (e.g., + longterm_lock_znode, reiser4_get_parent, reiser4_get_neighbor, ...) check some of these + values themselves and, for instance, stop flushing instead of resulting in a restart. + If any of these results are true error conditions then flush will go into a busy-loop, + as we noticed during testing when a corrupt tree caused find_child_ptr to return + ENOENT. It needs careful thought and testing of corner conditions. +*/ +/* D. Atomicity of flush_prep against deletion and flush concurrency. Suppose a created + block is assigned a block number then early-flushed to disk. It is dirtied again and + flush is called again.
Concurrently, that block is deleted, and the de-allocation of + its block number does not need to be deferred, since it is not part of the preserve set + (i.e., it didn't exist before the transaction). I think there may be a race condition + where flush writes the dirty, created block after the non-deferred deallocated block + number is re-allocated, making it possible to write deleted data on top of non-deleted + data. It's just a theory, but it needs to be thought out. */ +/* F. bio_alloc() failure is not handled gracefully. */ +/* G. Unallocated children. */ +/* H. Add a WANDERED_LIST to the atom to clarify the placement of wandered blocks. */ +/* I. Rename flush-scan to scan-point, (flush-pos to flush-point?) */ + +/* JNODE_FLUSH: MAIN ENTRY POINT */ +/* This is the main entry point for flushing a jnode and its dirty neighborhood (the dirty + neighborhood is named a "slum"). jnode_flush() is called when reiser4 has to write dirty + blocks to disk: either when the Linux VM decides to reduce the number of dirty pages, or as + part of a transaction commit. + + Our objective here is to prep and flush the slum the jnode belongs to. We want to + squish the slum together, and allocate the nodes in it as we squish because allocation + of children affects squishing of parents. + + The "argument" @node tells flush where to start. From there, flush finds the left edge + of the slum, and calls squalloc (in which nodes are squeezed and allocated). To find a + "better place" to start squalloc, we first perform a flush_scan. + + Flush-scanning may be performed in both left and right directions, but for different + purposes. When scanning to the left, we are searching for a node that precedes a + sequence of parent-first-ordered nodes which we will then flush in parent-first order. + During flush-scanning, we also take the opportunity to count the number of consecutive + leaf nodes.
If this number is past some threshold (FLUSH_RELOCATE_THRESHOLD), then we + make a decision to reallocate leaf nodes (thus favoring write-optimization). + + Since the flush argument node can be anywhere in a sequence of dirty leaves, there may + also be dirty nodes to the right of the argument. If the scan-left operation does not + count at least FLUSH_RELOCATE_THRESHOLD nodes then we follow it with a right-scan + operation to see whether there are, in fact, enough nodes to meet the relocate + threshold. Each right- and left-scan operation uses a single flush_scan object. + + After left-scan and possibly right-scan, we prepare a flush_position object with the + starting flush point or parent coordinate, which was determined using scan-left. + + Next we call the main flush routine, squalloc, which iterates along the + leaf level, squeezing and allocating nodes (and placing them into the flush queue). + + After squalloc returns we take extra steps to ensure that all the children + of the final twig node are allocated--this involves repeating squalloc + until we finish at a twig with no unallocated children. + + Finally, we call flush_empty_queue to submit write-requests to disk. If we encounter + any above-twig nodes during flush_empty_queue that still have unallocated children, we + flush_unprep them. + + Flush treats several "failure" cases as non-failures, essentially causing them to start + over. E_DEADLOCK is one example. FIXME:(C) EINVAL, E_NO_NEIGHBOR, ENOENT: these should + probably be handled properly rather than restarting, but there are a bunch of cases to + audit.
+*/ + +static int jnode_flush(jnode * node, long *nr_to_flush, long * nr_written, flush_queue_t * fq, int flags) +{ + long ret = 0; + flush_scan right_scan; + flush_scan left_scan; + flush_pos_t flush_pos; + int todo; + struct super_block *sb; + reiser4_super_info_data *sbinfo; + jnode * leftmost_in_slum = NULL; + + assert("jmacd-76619", lock_stack_isclean(get_current_lock_stack())); + assert("nikita-3022", schedulable()); + + /* lock ordering: delete_sema and flush_sema are unordered */ + assert("nikita-3185", + get_current_super_private()->delete_sema_owner != current); + + sb = reiser4_get_current_sb(); + sbinfo = get_super_private(sb); + if (!reiser4_is_set(sb, REISER4_MTFLUSH)) { + down(&sbinfo->flush_sema); + } + + /* Flush-concurrency debug code */ +#if REISER4_DEBUG + atomic_inc(&flush_cnt); +#endif + + enter_flush(sb); + + /* Initialize a flush position. */ + pos_init(&flush_pos); + + flush_pos.nr_to_flush = nr_to_flush; + flush_pos.nr_written = nr_written; + flush_pos.fq = fq; + flush_pos.flags = flags; + + scan_init(&right_scan); + scan_init(&left_scan); + + /*IF_TRACE (TRACE_FLUSH_VERB, print_tree_rec ("parent_first", current_tree, REISER4_TREE_BRIEF)); */ + /*IF_TRACE (TRACE_FLUSH_VERB, print_tree_rec ("parent_first", current_tree, REISER4_TREE_CHECK)); */ + + /* First scan left and remember the leftmost scan position. If the leftmost + position is unformatted we remember its parent_coord. We scan until counting + FLUSH_SCAN_MAXNODES. + + If starting @node is unformatted, at the beginning of left scan its + parent (twig level node, containing extent item) will be long term + locked and lock handle will be stored in the + @right_scan->parent_lock. This lock is used to start the rightward + scan without redoing the tree traversal (necessary to find parent) + and, hence, is kept during leftward scan. As a result, we have to + use try-lock when taking long term locks during the leftward scan. 
+ */ + ret = scan_left(&left_scan, &right_scan, + node, sbinfo->flush.scan_maxnodes); + if (ret != 0) + goto failed; + + leftmost_in_slum = jref(left_scan.node); + scan_done(&left_scan); + + /* Then possibly go right to decide if we will use a policy of relocating leaves. + This is only done if we did not scan past (and count) enough nodes during the + leftward scan. If we do scan right, we only care to go far enough to establish + that at least FLUSH_RELOCATE_THRESHOLD number of nodes are being flushed. The + scan limit is the difference between left_scan.count and the threshold. */ + + todo = sbinfo->flush.relocate_threshold - left_scan.count; + /* scan right is inherently deadlock prone, because we are + * (potentially) holding a lock on the twig node at this moment. + * FIXME: this is incorrect comment: lock is not held */ + if (todo > 0) { + ret = scan_right(&right_scan, node, (unsigned)todo); + if (ret != 0) + goto failed; + } + + /* Only the right-scan count is needed, release any rightward locks right away. */ + scan_done(&right_scan); + + /* ... and the answer is: we should relocate leaf nodes if at least + FLUSH_RELOCATE_THRESHOLD nodes were found. */ + flush_pos.leaf_relocate = JF_ISSET(node, JNODE_REPACK) || + (left_scan.count + right_scan.count >= sbinfo->flush.relocate_threshold); + + /* Funny business here. We set the 'point' in the flush_position at prior to + starting squalloc regardless of whether the first point is + formatted or unformatted. Without this there would be an invariant, in the + rest of the code, that if the flush_position is unformatted then + flush_position->point is NULL and flush_position->parent_{lock,coord} is set, + and if the flush_position is formatted then flush_position->point is non-NULL + and no parent info is set. + + This seems lazy, but it makes the initial calls to reverse_relocate_test + (which ask "is it the pos->point the leftmost child of its parent") much easier + because we know the first child already. 
Nothing is broken by this, but the + reasoning is subtle. Holding an extra reference on a jnode during flush can + cause us to see nodes with HEARD_BANSHEE during squalloc, because nodes are not + removed from sibling lists until they have zero reference count. Flush would + never observe a HEARD_BANSHEE node on the left-edge of flush, nodes are only + deleted to the right. So if nothing is broken, why fix it? + + NOTE-NIKITA actually, flush can meet HEARD_BANSHEE node at any + point and in any moment, because of the concurrent file system + activity (for example, truncate). */ + + /* Check jnode state after flush_scan completed. Having a lock on this + node or its parent (in case of unformatted) helps us in case of + concurrent flushing. */ + if (jnode_check_flushprepped(leftmost_in_slum) && !jnode_convertible(leftmost_in_slum)) { + ret = 0; + goto failed; + } + + /* Now setup flush_pos using scan_left's endpoint. */ + ret = prepare_flush_pos(&flush_pos, leftmost_in_slum); + if (ret) + goto failed; + + if (znode_get_level(flush_pos.coord.node) == LEAF_LEVEL + && node_is_empty(flush_pos.coord.node)) { + znode * empty = flush_pos.coord.node; + + assert ("zam-1022", !ZF_ISSET(empty, JNODE_HEARD_BANSHEE)); + ret = delete_empty_node(empty); + goto failed; + } + + if (jnode_check_flushprepped(leftmost_in_slum) && !jnode_convertible(leftmost_in_slum)) { + ret = 0; + goto failed; + } + + /* Set pos->preceder and (re)allocate pos and its ancestors if it is needed */ + ret = alloc_pos_and_ancestors(&flush_pos); + if (ret) + goto failed; + + /* Do the main rightward-bottom-up squeeze and allocate loop. */ + ret = squalloc(&flush_pos); + pos_stop(&flush_pos); + if (ret) + goto failed; + + /* FIXME_NFQUCMPD: Here, handle the twig-special case for unallocated children. 
+ First, the pos_stop() and pos_valid() routines should be modified + so that pos_stop() sets a flush_position->stop flag to 1 without + releasing the current position immediately--instead release it in + pos_done(). This is a better implementation than the current one anyway. + + It is not clear that all fields of the flush_position should not be released, + but at the very least the parent_lock, parent_coord, and parent_load should + remain held because they hold the last twig when pos_stop() is + called. + + When we reach this point in the code, if the parent_coord is set to after the + last item then we know that flush reached the end of a twig (and according to + the new flush queueing design, we will return now). If parent_coord is not + past the last item, we should check if the current twig has any unallocated + children to the right (we are not concerned with unallocated children to the + left--in that case the twig itself should not have been allocated). If the + twig has unallocated children to the right, set the parent_coord to that + position and then repeat the call to squalloc. + + Testing for unallocated children may be defined in two ways: if any internal + item has a fake block number, it is unallocated; if any extent item is + unallocated then all of its children are unallocated. But there is a more + aggressive approach: if there are any dirty children of the twig to the right + of the current position, we may wish to relocate those nodes now. Checking for + potential relocation is more expensive as it requires knowing whether there are + any dirty children that are not unallocated. The extent_needs_allocation + should be used after setting the correct preceder. + + When we reach the end of a twig at this point in the code, if the flush can + continue (when the queue is ready) it will need some information on the future + starting point. That should be stored away in the flush_handle using a seal, I + believe.
Holding a jref() on the future starting point may break other code + that deletes that node. + */ + + /* FIXME_NFQUCMPD: Also, we don't want to do any flushing when flush is called + above the twig level. If the VM calls flush above the twig level, do nothing + and return (but figure out why this happens). The txnmgr should be modified to + only flush its leaf-level dirty list. This will do all the necessary squeeze + and allocate steps but leave unallocated branches and possibly unallocated + twigs (when the twig's leftmost child is not dirty). After flushing the leaf + level, the remaining unallocated nodes should be given write-optimized + locations. (Possibly, the remaining unallocated twigs should be allocated just + before their leftmost child.) + */ + + /* Any failure reaches this point. */ +failed: + + if (nr_to_flush != NULL) { + if (ret >= 0) { + (*nr_to_flush) = flush_pos.prep_or_free_cnt; + } else { + (*nr_to_flush) = 0; + } + } + + switch (ret) { + case -E_REPEAT: + case -EINVAL: + case -E_DEADLOCK: + case -E_NO_NEIGHBOR: + case -ENOENT: + /* FIXME(C): Except for E_DEADLOCK, these should probably be handled properly + in each case. They already are handled in many cases. */ + /* Something bad happened, but difficult to avoid... Try again! */ + ret = 0; + } + + if (leftmost_in_slum) + jput(leftmost_in_slum); + + pos_done(&flush_pos); + scan_done(&left_scan); + scan_done(&right_scan); + + ON_DEBUG(atomic_dec(&flush_cnt)); + + leave_flush(sb); + + if (!reiser4_is_set(sb, REISER4_MTFLUSH)) + up(&sbinfo->flush_sema); + + return ret; +} + +/* The reiser4 flush subsystem can be turned into "rapid flush mode" means that + * flusher should submit all prepped nodes immediately without keeping them in + * flush queues for long time. The reason for rapid flush mode is to free + * memory as fast as possible. */ + +#if REISER4_USE_RAPID_FLUSH + +/** + * submit all prepped nodes if rapid flush mode is set, + * turn rapid flush mode off. 
+ */ + +static int rapid_flush (flush_pos_t * pos) +{ + if (!wbq_available()) + return 0; + + return write_prepped_nodes(pos, 1); +} + +#else + +#define rapid_flush(pos) (0) + +#endif /* REISER4_USE_RAPID_FLUSH */ + +/* Flush some nodes of current atom, usually slum, return -E_REPEAT if there are more nodes + * to flush, return 0 if atom's dirty lists empty and keep current atom locked, return + * other errors as they are. */ +reiser4_internal int +flush_current_atom (int flags, long *nr_submitted, txn_atom ** atom) +{ + reiser4_super_info_data * sinfo = get_current_super_private(); + flush_queue_t *fq = NULL; + jnode * node; + int nr_queued; + int ret; + + assert ("zam-889", atom != NULL && *atom != NULL); + assert ("zam-890", spin_atom_is_locked(*atom)); + assert ("zam-892", get_current_context()->trans->atom == *atom); + + while(1) { + ret = fq_by_atom(*atom, &fq); + if (ret != -E_REPEAT) + break; + *atom = get_current_atom_locked(); + } + if (ret) + return ret; + + assert ("zam-891", spin_atom_is_locked(*atom)); + + /* parallel flushers limit */ + if (sinfo->tmgr.atom_max_flushers != 0) { + while ((*atom)->nr_flushers >= sinfo->tmgr.atom_max_flushers) { + /* An atom_send_event() call is inside fq_put_nolock() which is + called when flush is finished and nr_flushers is + decremented. */ + atom_wait_event(*atom); + *atom = get_current_atom_locked(); + } + } + + /* count ourself as a flusher */ + (*atom)->nr_flushers++; + + writeout_mode_enable(); + + nr_queued = 0; + + /* In this loop we process all already prepped (RELOC or OVRWR) and dirtied again + * nodes. The atom spin lock is not released until all dirty nodes processed or + * not prepped node found in the atom dirty lists. 
*/ + while ((node = find_first_dirty_jnode(*atom, flags))) { + LOCK_JNODE(node); + + assert ("zam-881", jnode_is_dirty(node)); + assert ("zam-898", !JF_ISSET(node, JNODE_OVRWR)); + + if (JF_ISSET(node, JNODE_WRITEBACK)) { + capture_list_remove_clean(node); + capture_list_push_back(ATOM_WB_LIST(*atom), node); + /*XXXX*/ON_DEBUG(count_jnode(*atom, node, DIRTY_LIST, WB_LIST, 1)); + + } else if (jnode_is_znode(node) && znode_above_root(JZNODE(node))) { + /* A special case for znode-above-root. The above-root (fake) + znode is captured and dirtied when the tree height changes or + when the root node is relocated. This causes atoms to fuse so + that changes at the root are serialized. However, this node is + never flushed. This special case used to be in lock.c to + prevent the above-root node from ever being captured, but now + that it is captured we simply prevent it from flushing. The + log-writer code relies on this to properly log superblock + modifications of the tree height. */ + jnode_make_wander_nolock(node); + } else if (JF_ISSET(node, JNODE_RELOC)) { + queue_jnode(fq, node); + ++ nr_queued; + } else + break; + + UNLOCK_JNODE(node); + } + + if (node == NULL) { + if (nr_queued == 0) { + writeout_mode_disable(); + (*atom)->nr_flushers --; + atom_send_event(*atom); + fq_put_nolock(fq); + /* current atom remains locked */ + return 0; + } + UNLOCK_ATOM(*atom); + } else { + jref(node); + UNLOCK_ATOM(*atom); + UNLOCK_JNODE(node); + ret = jnode_flush(node, NULL, nr_submitted, fq, flags); + jput(node); + } + + ret = write_fq(fq, nr_submitted, WRITEOUT_SINGLE_STREAM | WRITEOUT_FOR_PAGE_RECLAIM); + + *atom = get_current_atom_locked(); + (*atom)->nr_flushers --; + fq_put_nolock(fq); + atom_send_event(*atom); + UNLOCK_ATOM(*atom); + + writeout_mode_disable(); + + if (ret == 0) + ret = -E_REPEAT; + + return ret; +} + +/* REVERSE PARENT-FIRST RELOCATION POLICIES */ + +/* This implements the is-it-close-enough-to-its-preceder? 
test for relocation in the + reverse parent-first relocate context. Here all we know is the preceder and the block + number. Since we are going in reverse, the preceder may still be relocated as well, so + we can't ask the block allocator "is there a closer block available to relocate?" here. + In the _forward_ parent-first relocate context (not here) we actually call the block + allocator to try and find a closer location. */ +static int +reverse_relocate_if_close_enough(const reiser4_block_nr * pblk, const reiser4_block_nr * nblk) +{ + reiser4_block_nr dist; + + assert("jmacd-7710", *pblk != 0 && *nblk != 0); + assert("jmacd-7711", !blocknr_is_fake(pblk)); + assert("jmacd-7712", !blocknr_is_fake(nblk)); + + /* Distance is the absolute value. */ + dist = (*pblk > *nblk) ? (*pblk - *nblk) : (*nblk - *pblk); + + /* If the block is less than FLUSH_RELOCATE_DISTANCE blocks away from its preceder + block, do not relocate. */ + if (dist <= get_current_super_private()->flush.relocate_distance) { + return 0; + } + + return 1; +} + +/* This function is a predicate that tests for relocation. Always called in the + reverse-parent-first context, when we are asking whether the current node should be + relocated in order to expand the flush by dirtying the parent level (and thus + proceeding to flush that level). When traversing in the forward parent-first direction + (not here), relocation decisions are handled in two places: allocate_znode() and + extent_needs_allocation(). */ +static int +reverse_relocate_test(jnode * node, const coord_t * parent_coord, flush_pos_t * pos) +{ + reiser4_block_nr pblk = 0; + reiser4_block_nr nblk = 0; + + assert("jmacd-8989", !jnode_is_root(node)); + + /* + * This function is called only from the + * reverse_relocate_check_dirty_parent() and only if the parent + * node is clean. This implies that the parent has the real (i.e., not + * fake) block number, and, so does the child, because otherwise the + * parent would be dirty. 
+ */ + + /* New nodes are treated as if they are being relocated. */ + if (jnode_created(node) + || (pos->leaf_relocate && jnode_get_level(node) == LEAF_LEVEL)) { + return 1; + } + + /* Find the preceder. FIXME(B): When the child is an unformatted, previously + existing node, the coord may be leftmost even though the child is not the + parent-first preceder of the parent. If the first dirty node appears somewhere + in the middle of the first extent unit, this preceder calculation is wrong. + Needs more logic in here. */ + if (coord_is_leftmost_unit(parent_coord)) { + pblk = *znode_get_block(parent_coord->node); + } else { + pblk = pos->preceder.blk; + } + check_preceder(pblk); + + /* If (pblk == 0) then the preceder isn't allocated or isn't known: relocate. */ + if (pblk == 0) { + return 1; + } + + nblk = *jnode_get_block(node); + + if (blocknr_is_fake(&nblk)) + /* child is unallocated, mark parent dirty */ + return 1; + + return reverse_relocate_if_close_enough(&pblk, &nblk); +} + +/* This function calls reverse_relocate_test to make a reverse-parent-first + relocation decision and then, if yes, it marks the parent dirty. */ +static int +reverse_relocate_check_dirty_parent(jnode * node, const coord_t * parent_coord, flush_pos_t * pos) +{ + int ret; + + if (!znode_check_dirty(parent_coord->node)) { + + ret = reverse_relocate_test(node, parent_coord, pos); + if (ret < 0) { + return ret; + } + + /* FIXME-ZAM + if parent is already relocated - we do not want to grab space, right? 
*/ + if (ret == 1) { + int grabbed; + + grabbed = get_current_context()->grabbed_blocks; + if (reiser4_grab_space_force((__u64)1, BA_RESERVED) != 0) + reiser4_panic("umka-1250", + "No space left during flush."); + + assert("jmacd-18923", znode_is_write_locked(parent_coord->node)); + znode_make_dirty(parent_coord->node); + grabbed2free_mark(grabbed); + } + } + + return 0; +} + +/* INITIAL ALLOCATE ANCESTORS STEP (REVERSE PARENT-FIRST ALLOCATION BEFORE FORWARD + PARENT-FIRST LOOP BEGINS) */ + +/* Get the leftmost child for given coord. */ +static int get_leftmost_child_of_unit (const coord_t * coord, jnode ** child) +{ + int ret; + + ret = item_utmost_child(coord, LEFT_SIDE, child); + + if (ret) + return ret; + + if (IS_ERR(*child)) + return PTR_ERR(*child); + + return 0; +} + +/* This step occurs after the left- and right-scans are completed, before starting the + forward parent-first traversal. Here we attempt to allocate ancestors of the starting + flush point, which means continuing in the reverse parent-first direction to the + parent, grandparent, and so on (as long as the child is a leftmost child). This + routine calls a recursive process, alloc_one_ancestor, which does the real work, + except there is special-case handling here for the first ancestor, which may be a twig. + At each level (here and alloc_one_ancestor), we check for relocation and then, if + the child is a leftmost child, repeat at the next level. On the way back down (the + recursion), we allocate the ancestors in parent-first order. */ +static int alloc_pos_and_ancestors(flush_pos_t * pos) +{ + int ret = 0; + lock_handle plock; + load_count pload; + coord_t pcoord; + + if (znode_check_flushprepped(pos->lock.node)) + return 0; + + coord_init_invalid(&pcoord, NULL); + init_lh(&plock); + init_load_count(&pload); + + if (pos->state == POS_ON_EPOINT) { + /* a special case for pos on twig level, where we already have + a lock on parent node. 
*/ + /* The parent may not be dirty, in which case we should decide + whether to relocate the child now. If decision is made to + relocate the child, the parent is marked dirty. */ + ret = reverse_relocate_check_dirty_parent(pos->child, &pos->coord, pos); + if (ret) + goto exit; + + /* FIXME_NFQUCMPD: We only need to allocate the twig (if child + is leftmost) and the leaf/child, so recursion is not needed. + Levels above the twig will be allocated for + write-optimization before the transaction commits. */ + + /* Do the recursive step, allocating zero or more of our + * ancestors. */ + ret = alloc_one_ancestor(&pos->coord, pos); + + } else { + if (!znode_is_root(pos->lock.node)) { + /* all formatted nodes except tree root */ + ret = reiser4_get_parent(&plock, pos->lock.node, ZNODE_WRITE_LOCK, 0); + if (ret) + goto exit; + + ret = incr_load_count_znode(&pload, plock.node); + if (ret) + goto exit; + + ret = find_child_ptr(plock.node, pos->lock.node, &pcoord); + if (ret) + goto exit; + + ret = reverse_relocate_check_dirty_parent(ZJNODE(pos->lock.node), &pcoord, pos); + if (ret) + goto exit; + + ret = alloc_one_ancestor(&pcoord, pos); + if (ret) + goto exit; + } + + ret = allocate_znode(pos->lock.node, &pcoord, pos); + } +exit: + done_load_count(&pload); + done_lh(&plock); + return ret; +} + +/* This is the recursive step described in alloc_pos_and_ancestors, above. Ignoring the + call to set_preceder, which is the next function described, this checks if the + child is a leftmost child and returns if it is not. If the child is a leftmost child + it checks for relocation, possibly dirtying the parent. Then it performs the recursive + step. */ +static int alloc_one_ancestor(const coord_t * coord, flush_pos_t * pos) +{ + int ret = 0; + lock_handle alock; + load_count aload; + coord_t acoord; + + /* As we ascend at the left-edge of the region to flush, take this opportunity at + the twig level to find our parent-first preceder unless we have already set + it. 
*/ + if (pos->preceder.blk == 0) { + ret = set_preceder(coord, pos); + if (ret != 0) + return ret; + } + + /* If the ancestor is clean or already allocated, or if the child is not a + leftmost child, stop going up, even leaving coord->node not flushprepped. */ + if (znode_check_flushprepped(coord->node) || !coord_is_leftmost_unit(coord)) + return 0; + + init_lh(&alock); + init_load_count(&aload); + coord_init_invalid(&acoord, NULL); + + /* Only ascend to the next level if it is a leftmost child, but write-lock the + parent in case we will relocate the child. */ + if (!znode_is_root(coord->node)) { + + ret = jnode_lock_parent_coord( + ZJNODE(coord->node), &acoord, &alock, &aload, ZNODE_WRITE_LOCK, 0); + if (ret != 0) { + /* FIXME(C): check EINVAL, E_DEADLOCK */ + goto exit; + } + + ret = reverse_relocate_check_dirty_parent(ZJNODE(coord->node), &acoord, pos); + if (ret != 0) { + goto exit; + } + + /* Recursive call. */ + if (!znode_check_flushprepped(acoord.node)) { + ret = alloc_one_ancestor(&acoord, pos); + if (ret) + goto exit; + } + } + + /* Note: we call allocate with the parent write-locked (except at the root) in + case we relocate the child, in which case it will modify the parent during this + call. */ + ret = allocate_znode(coord->node, &acoord, pos); + +exit: + done_load_count(&aload); + done_lh(&alock); + return ret; +} + +/* During the reverse parent-first alloc_pos_and_ancestors process described above there is + a call to this function at the twig level. During alloc_pos_and_ancestors we may ask: + should this node be relocated (in reverse parent-first context)? We repeat this + process as long as the child is the leftmost child, eventually reaching an ancestor of + the flush point that is not a leftmost child. The preceder of that ancestor, which is + not a leftmost child, is actually on the leaf level. The preceder of that block is the + left-neighbor of the flush point. The preceder of that block is the rightmost child of + the twig on the left.
So, when alloc_pos_and_ancestors passes upward through the twig + level, it stops momentarily to remember the block of the rightmost child of the twig on + the left and sets it to the flush_position's preceder_hint. + + There is one other place where we may set the flush_position's preceder hint, which is + during scan-left. +*/ +static int +set_preceder(const coord_t * coord_in, flush_pos_t * pos) +{ + int ret; + coord_t coord; + lock_handle left_lock; + load_count left_load; + +#if 0 + /* do not trust to allocation of nodes above twigs, use the block number of last + * write (write optimized approach). */ + if (znode_get_level(coord_in->node) > TWIG_LEVEL + 1) { + get_blocknr_hint_default(&pos->preceder.blk); + reiser4_stat_inc(block_alloc.nohint); + return 0; + } +#endif + + coord_dup(&coord, coord_in); + + init_lh(&left_lock); + init_load_count(&left_load); + + /* FIXME(B): Same FIXME as in "Find the preceder" in reverse_relocate_test. + coord_is_leftmost_unit is not the right test if the unformatted child is in the + middle of the first extent unit. */ + if (!coord_is_leftmost_unit(&coord)) { + coord_prev_unit(&coord); + } else { + ret = reiser4_get_left_neighbor(&left_lock, coord.node, ZNODE_READ_LOCK, GN_SAME_ATOM); + if (ret) { + /* If we fail for any reason it doesn't matter because the + preceder is only a hint. We are low-priority at this point, so + this must be the case. 
*/ + if (ret == -E_REPEAT || ret == -E_NO_NEIGHBOR || + ret == -ENOENT || ret == -EINVAL || ret == -E_DEADLOCK) + { + ret = 0; + } + goto exit; + } + + ret = incr_load_count_znode(&left_load, left_lock.node); + if (ret) + goto exit; + + coord_init_last_unit(&coord, left_lock.node); + } + + ret = item_utmost_child_real_block(&coord, RIGHT_SIDE, &pos->preceder.blk); +exit: + check_preceder(pos->preceder.blk); + done_load_count(&left_load); + done_lh(&left_lock); + return ret; +} + +/* MAIN SQUEEZE AND ALLOCATE LOOP (THREE BIG FUNCTIONS) */ + +/* This procedure implements the outer loop of the flush algorithm. To put this in + context, here is the general list of steps taken by the flush routine as a whole: + + 1. Scan-left + 2. Scan-right (maybe) + 3. Allocate initial flush position and its ancestors + 4. Handle extents at the current flush position + 5. Squeeze and allocate changed ancestors, then move the position to the right + 6. Repeat from step 4 + + This procedure implements the loop in steps 4 through 6 in the above listing. + + Step 4: if the current flush position is an extent item (position on the twig level), + it allocates the extent (allocate_extent_item_in_place) then shifts to the next + coordinate. If the next coordinate's leftmost child needs flushprep, we will continue. + If the next coordinate is an internal item, we descend back to the leaf level, + otherwise we repeat step #4 (labeled ALLOC_EXTENTS below). If the "next coordinate" + brings us past the end of the twig level, then we call + reverse_relocate_end_of_twig to possibly dirty the next (right) twig, prior to + step #5 which moves to the right. + + Step 5: calls squalloc_changed_ancestors, which initiates a recursive call up the + tree to allocate any ancestors of the next-right flush position that are not also + ancestors of the current position. Those ancestors (in top-down order) are the next in + parent-first order. We squeeze adjacent nodes on the way up until the right node and + current node share the same parent, then allocate on the way back down. Finally, this + step sets the flush position to the next-right node. Then repeat steps 4 and 5.
Then repeat steps 4 and 5. +*/ + +/* SQUEEZE CODE */ + + +/* squalloc_right_twig helper function, cut a range of extent items from + cut node to->node from the beginning up to coord @to. */ +static int squalloc_right_twig_cut(coord_t * to, reiser4_key * to_key, znode * left) +{ + coord_t from; + reiser4_key from_key; + + coord_init_first_unit(&from, to->node); + item_key_by_coord(&from, &from_key); + + return cut_node_content(&from, to, &from_key, to_key, NULL); +} + +/* Copy as much of the leading extents from @right to @left, allocating + unallocated extents as they are copied. Returns SQUEEZE_TARGET_FULL or + SQUEEZE_SOURCE_EMPTY when no more can be shifted. If the next item is an + internal item it calls shift_one_internal_unit and may then return + SUBTREE_MOVED. */ +squeeze_result squalloc_extent(znode *left, const coord_t *, flush_pos_t *, reiser4_key *stop_key); +#if REISER4_DEBUG +void *shift_check_prepare(const znode *left, const znode *right); +void shift_check(void *vp, const znode *left, const znode *right); +#endif +static int squeeze_right_twig(znode * left, znode * right, flush_pos_t * pos) +{ + int ret = SUBTREE_MOVED; + coord_t coord; /* used to iterate over items */ + reiser4_key stop_key; + + assert("jmacd-2008", !node_is_empty(right)); + coord_init_first_unit(&coord, right); + + /* FIXME: can be optimized to cut once */ + while (!node_is_empty(coord.node) && item_is_extent(&coord)) { + ON_DEBUG(void *vp); + + assert("vs-1468", coord_is_leftmost_unit(&coord)); + ON_DEBUG(vp = shift_check_prepare(left, coord.node)); + + /* stop_key is used to find what was copied and what to cut */ + stop_key = *min_key(); + ret = squalloc_extent(left, &coord, pos, &stop_key); + if (ret != SQUEEZE_CONTINUE) { + ON_DEBUG(reiser4_kfree(vp)); + break; + } + assert("vs-1465", !keyeq(&stop_key, min_key())); + + /* Helper function to do the cutting. 
*/ + set_key_offset(&stop_key, get_key_offset(&stop_key) - 1); + check_me("vs-1466", squalloc_right_twig_cut(&coord, &stop_key, left) == 0); + + ON_DEBUG(shift_check(vp, left, coord.node)); + } + + if (node_is_empty(coord.node)) + ret = SQUEEZE_SOURCE_EMPTY; + + if (ret == SQUEEZE_TARGET_FULL) { + goto out; + } + + if (node_is_empty(right)) { + /* The whole right node was copied into @left. */ + assert("vs-464", ret == SQUEEZE_SOURCE_EMPTY); + goto out; + } + + coord_init_first_unit(&coord, right); + + if (!item_is_internal(&coord)) { + /* we do not want to squeeze anything else to left neighbor because "slum" + is over */ + ret = SQUEEZE_TARGET_FULL; + goto out; + } + assert("jmacd-433", item_is_internal(&coord)); + + /* Shift an internal unit. The child must be allocated before shifting any more + extents, so we stop here. */ + ret = shift_one_internal_unit(left, right); + +out: + assert("jmacd-8612", ret < 0 || ret == SQUEEZE_TARGET_FULL + || ret == SUBTREE_MOVED || ret == SQUEEZE_SOURCE_EMPTY); + + if (ret == SQUEEZE_TARGET_FULL) { + /* We submit prepped nodes here and expect that this @left twig + * will not be modified again during this jnode_flush() call. */ + int ret1; + + /* NOTE: seems like io is done under long term locks. */ + ret1 = write_prepped_nodes(pos, 1); + if (ret1 < 0) + return ret1; + } + + return ret; +} + +#if REISER4_DEBUG +static void +item_convert_invariant(flush_pos_t * pos) +{ + assert("edward-1225", coord_is_existing_item(&pos->coord)); + if (chaining_data_present(pos)) { + item_plugin * iplug = item_convert_plug(pos); + + assert("edward-1000", iplug == item_plugin_by_coord(&pos->coord)); + assert("edward-1001", iplug->f.convert != NULL); + } + else + assert("edward-1226", pos->child == NULL); +} +#else + +#define item_convert_invariant(pos) noop + +#endif + +/* Scan node items starting from the first one and apply for each + item its flush ->convert() method (if any). This method may + resize/kill the item so the tree will be changed. 
+*/ +static int convert_node(flush_pos_t * pos, znode * node) +{ + int ret = 0; + item_plugin * iplug; + + assert("edward-304", pos != NULL); + assert("edward-305", pos->child == NULL); + assert("edward-475", znode_convertible(node)); + assert("edward-669", znode_is_wlocked(node)); + assert("edward-1210", !node_is_empty(node)); + + if (znode_get_level(node) != LEAF_LEVEL) + /* unsupported */ + goto exit; + + coord_init_first_unit(&pos->coord, node); + + while (1) { + ret = 0; + coord_set_to_left(&pos->coord); + item_convert_invariant(pos); + + iplug = item_plugin_by_coord(&pos->coord); + assert("edward-844", iplug != NULL); + + if (iplug->f.convert) { + ret = iplug->f.convert(pos); + if (ret) + goto exit; + } + assert("edward-307", pos->child == NULL); + + if (coord_next_item(&pos->coord)) { + /* node is over */ + + if (!chaining_data_present(pos)) + /* finished this node */ + break; + if (should_chain_next_node(pos)) { + /* go to next node */ + move_chaining_data(pos, 0 /* to next node */); + break; + } + /* repeat this node */ + move_chaining_data(pos, 1 /* this node */); + continue; + } + /* Node is not over. + Check if there is attached convert data. + If so, roll back one item position and repeat + on this node + */ + if (chaining_data_present(pos)) { + + if (iplug != item_plugin_by_coord(&pos->coord)) + set_item_convert_count(pos, 0); + + ret = coord_prev_item(&pos->coord); + assert("edward-1003", !ret); + + move_chaining_data(pos, 1 /* this node */); + } + } + JF_CLR(ZJNODE(node), JNODE_CONVERTIBLE); + znode_make_dirty(node); + exit: + assert("edward-1004", !ret); + return ret; +} + +/* Squeeze and allocate the right neighbor. This is called after @left and + its current children have been squeezed and allocated already. This + procedure's job is to squeeze items from @right to @left. + + If at the leaf level, use the shift_everything_left memcpy-optimized + version of shifting (squeeze_right_leaf). 
+ + If at the twig level, extents are allocated as they are shifted from @right + to @left (squalloc_right_twig). + + At any other level, shift one internal item and return to the caller + (squalloc_parent_first) so that the shifted-subtree can be processed in + parent-first order. + + When unit of internal item is moved, squeezing stops and SUBTREE_MOVED is + returned. When all content of @right is squeezed, SQUEEZE_SOURCE_EMPTY is + returned. If nothing can be moved into @left anymore, SQUEEZE_TARGET_FULL + is returned. +*/ + +static int squeeze_right_neighbor(flush_pos_t * pos, znode * left, znode * right) +{ + int ret; + + /* FIXME it is possible to see empty hasn't-heard-banshee node in a + * tree owing to error (for example, ENOSPC) in write */ + /* assert("jmacd-9321", !node_is_empty(left)); */ + assert("jmacd-9322", !node_is_empty(right)); + assert("jmacd-9323", znode_get_level(left) == znode_get_level(right)); + + switch (znode_get_level(left)) { + case TWIG_LEVEL: + /* Shift with extent allocating until either an internal item + is encountered or everything is shifted or no free space + left in @left */ + ret = squeeze_right_twig(left, right, pos); + break; + + default: + /* All other levels can use shift_everything until we implement per-item + flush plugins. 
*/ + ret = squeeze_right_non_twig(left, right); + break; + } + + assert("jmacd-2011", (ret < 0 || + ret == SQUEEZE_SOURCE_EMPTY || ret == SQUEEZE_TARGET_FULL || ret == SUBTREE_MOVED)); + return ret; +} + +static int squeeze_right_twig_and_advance_coord (flush_pos_t * pos, znode * right) +{ + int ret; + + ret = squeeze_right_twig(pos->lock.node, right, pos); + if (ret < 0) + return ret; + if (ret > 0) { + coord_init_after_last_item(&pos->coord, pos->lock.node); + return ret; + } + + coord_init_last_unit(&pos->coord, pos->lock.node); + return 0; +} + +#if 0 +/* "prepped" check for parent node without long-term locking it */ +static inline int fast_check_parent_flushprepped (znode * node) +{ + reiser4_tree * tree = current_tree; + int prepped = 1; + + RLOCK_TREE(tree); + + if (node->in_parent.node || !jnode_is_flushprepped(ZJNODE(node))) + prepped = 0; + + RUNLOCK_TREE(tree); + + return prepped; +} +#endif + +/* forward declaration */ +static int squalloc_upper_levels (flush_pos_t *, znode *, znode *); + +/* do a fast check for "same parents" condition before calling + * squalloc_upper_levels() */ +static inline int check_parents_and_squalloc_upper_levels (flush_pos_t * pos, znode *left, znode * right) +{ + if (znode_same_parents(left, right)) + return 0; + + return squalloc_upper_levels(pos, left, right); +} + +/* Check whether the parent of the given @right node needs to be processed + ((re)allocated) prior to processing of the child. If @left and @right do not + share a parent, the parent of @right is after @left but before + @right in parent-first order, so we have to (re)allocate it before @right + gets (re)allocated. 
*/ +static int squalloc_upper_levels (flush_pos_t * pos, znode *left, znode * right) +{ + int ret; + + lock_handle left_parent_lock; + lock_handle right_parent_lock; + + load_count left_parent_load; + load_count right_parent_load; + + + init_lh(&left_parent_lock); + init_lh(&right_parent_lock); + + init_load_count(&left_parent_load); + init_load_count(&right_parent_load); + + ret = reiser4_get_parent(&left_parent_lock, left, ZNODE_WRITE_LOCK, 0); + if (ret) + goto out; + + ret = reiser4_get_parent(&right_parent_lock, right, ZNODE_WRITE_LOCK, 0); + if (ret) + goto out; + + /* Check for same parents */ + if (left_parent_lock.node == right_parent_lock.node) + goto out; + + if (znode_check_flushprepped(right_parent_lock.node)) { + /* Keep parent-first order. In that order, the right parent node stands + before the @right node. If it is already allocated, we set the + preceder (next block search start point) to its block number, so that the @right + node is allocated after it. + + However, the preceder is set only if the right parent is on the twig level. + The explanation is the following: new branch nodes are allocated over + already allocated children while the tree grows, so it is difficult to + keep the tree ordered; we assume that only leaves and twigs are correctly + allocated. So, only twigs are used as a preceder for allocating the + rest of the slum. */ + if (znode_get_level(right_parent_lock.node) == TWIG_LEVEL) { + pos->preceder.blk = *znode_get_block(right_parent_lock.node); + check_preceder(pos->preceder.blk); + } + goto out; + } + + ret = incr_load_count_znode(&left_parent_load, left_parent_lock.node); + if (ret) + goto out; + + ret = incr_load_count_znode(&right_parent_load, right_parent_lock.node); + if (ret) + goto out; + + ret = squeeze_right_neighbor(pos, left_parent_lock.node, right_parent_lock.node); + /* We stop on error. We stop if some items/units were shifted (ret == 0) + * and thus @right changed its parent. 
It means we do not have to process the + * right_parent node prior to processing @right. Positive return + * values mean that no items were shifted because of the "empty + * source" or "target full" conditions. */ + if (ret <= 0) + goto out; + + /* parent(@left) and parent(@right) may also have different parents. We + * make a recursive call to check that. */ + ret = check_parents_and_squalloc_upper_levels(pos, left_parent_lock.node, right_parent_lock.node); + if (ret) + goto out; + + /* allocate znode when going down */ + ret = lock_parent_and_allocate_znode(right_parent_lock.node, pos); + + out: + done_load_count(&left_parent_load); + done_load_count(&right_parent_load); + + done_lh(&left_parent_lock); + done_lh(&right_parent_lock); + + return ret; +} + +/* Check the leftmost child "flushprepped" status, also returns true if child + * node was not found in cache. */ +static int leftmost_child_of_unit_check_flushprepped (const coord_t *coord) +{ + int ret; + int prepped; + + jnode * child; + + ret = get_leftmost_child_of_unit(coord, &child); + + if (ret) + return ret; + + if (child) { + prepped = jnode_check_flushprepped(child); + jput(child); + } else { + /* We treat a nonexistent child as a node to which slum + processing should not continue. A node that is not cached is clean, + so it is flushprepped. 
*/ + prepped = 1; + } + + return prepped; +} + +/* (re)allocate znode with automated getting parent node */ +static int lock_parent_and_allocate_znode (znode * node, flush_pos_t * pos) +{ + int ret; + lock_handle parent_lock; + load_count parent_load; + coord_t pcoord; + + assert ("zam-851", znode_is_write_locked(node)); + + init_lh(&parent_lock); + init_load_count(&parent_load); + + ret = reiser4_get_parent(&parent_lock, node, ZNODE_WRITE_LOCK, 0); + if (ret) + goto out; + + ret = incr_load_count_znode(&parent_load, parent_lock.node); + if (ret) + goto out; + + ret = find_child_ptr(parent_lock.node, node, &pcoord); + if (ret) + goto out; + + ret = allocate_znode(node, &pcoord, pos); + + out: + done_load_count(&parent_load); + done_lh(&parent_lock); + return ret; +} + +/* Process nodes on leaf level until unformatted node or rightmost node in the + * slum reached. */ +static int handle_pos_on_formatted (flush_pos_t * pos) +{ + int ret; + lock_handle right_lock; + load_count right_load; + + init_lh(&right_lock); + init_load_count(&right_load); + + if (should_convert_node(pos, pos->lock.node)) { + ret = convert_node(pos, pos->lock.node); + if (ret) + return ret; + } + + while (1) { + ret = neighbor_in_slum(pos->lock.node, &right_lock, RIGHT_SIDE, ZNODE_WRITE_LOCK); + if (ret) + break; + + /* we don't prep(allocate) nodes for flushing twice. This can be suboptimal, or it + * can be optimal. For now we choose to live with the risk that it will + * be suboptimal because it would be quite complex to code it to be + * smarter. 
*/ + if (znode_check_flushprepped(right_lock.node) && !znode_convertible(right_lock.node)) { + assert("edward-1005", !should_convert_next_node(pos, right_lock.node)); + pos_stop(pos); + break; + } + + ret = incr_load_count_znode(&right_load, right_lock.node); + if (ret) + break; + + if (should_convert_node(pos, right_lock.node)) { + ret = convert_node(pos, right_lock.node); + if (ret) + break; + if (node_is_empty(right_lock.node)) { + /* node became empty after converting, repeat */ + done_load_count(&right_load); + done_lh(&right_lock); + continue; + } + } + + /* squeeze _before_ going upward. */ + ret = squeeze_right_neighbor(pos, pos->lock.node, right_lock.node); + if (ret < 0) + break; + + if (znode_check_flushprepped(right_lock.node)) { + if (should_convert_next_node(pos, right_lock.node)) { + /* in spite of flushprepped status of the node, + its right slum neighbor should be converted */ + assert("edward-953", convert_data(pos)); + assert("edward-954", item_convert_data(pos)); + + if (node_is_empty(right_lock.node)) { + done_load_count(&right_load); + done_lh(&right_lock); + } + else + move_flush_pos(pos, &right_lock, &right_load, NULL); + continue; + } + pos_stop(pos); + break; + } + + if (node_is_empty(right_lock.node)) { + /* repeat if right node was squeezed completely */ + done_load_count(&right_load); + done_lh(&right_lock); + continue; + } + + /* parent(right_lock.node) has to be processed before + * (right_lock.node) due to "parent-first" allocation order. 
*/ + ret = check_parents_and_squalloc_upper_levels(pos, pos->lock.node, right_lock.node); + if (ret) + break; + /* (re)allocate _after_ going upward */ + ret = lock_parent_and_allocate_znode(right_lock.node, pos); + if (ret) + break; + + if (should_terminate_squalloc(pos)) { + set_item_convert_count(pos, 0); + break; + } + + /* advance the flush position to the right neighbor */ + move_flush_pos(pos, &right_lock, &right_load, NULL); + + ret = rapid_flush(pos); + if (ret) + break; + } + + assert("edward-1006", !convert_data(pos) || !item_convert_data(pos)); + + done_load_count(&right_load); + done_lh(&right_lock); + + /* This function indicates via pos whether to stop or go to twig or continue on current + * level. */ + return ret; + +} + +/* Process nodes on leaf level until unformatted node or rightmost node in the + * slum reached. */ +static int handle_pos_on_leaf (flush_pos_t * pos) +{ + int ret; + + assert ("zam-845", pos->state == POS_ON_LEAF); + + ret = handle_pos_on_formatted(pos); + + if (ret == -E_NO_NEIGHBOR) { + /* cannot get right neighbor, go process extents. */ + pos->state = POS_TO_TWIG; + return 0; + } + + return ret; +} + +/* Process slum on level > 1 */ +static int handle_pos_on_internal (flush_pos_t * pos) +{ + assert ("zam-850", pos->state == POS_ON_INTERNAL); + return handle_pos_on_formatted(pos); +} + +/* check whether squalloc should stop before processing given extent */ +static int squalloc_extent_should_stop (flush_pos_t * pos) +{ + assert("zam-869", item_is_extent(&pos->coord)); + + /* pos->child is a jnode handle_pos_on_extent() should start with in + * stead of the first child of the first extent unit. 
*/ + if (pos->child) { + int prepped; + + assert("vs-1383", jnode_is_unformatted(pos->child)); + prepped = jnode_check_flushprepped(pos->child); + pos->pos_in_unit = jnode_get_index(pos->child) - extent_unit_index(&pos->coord); + assert("vs-1470", pos->pos_in_unit < extent_unit_width(&pos->coord)); + assert("nikita-3434", ergo(extent_is_unallocated(&pos->coord), + pos->pos_in_unit == 0)); + jput(pos->child); + pos->child = NULL; + + return prepped; + } + + pos->pos_in_unit = 0; + if (extent_is_unallocated(&pos->coord)) + return 0; + + return leftmost_child_of_unit_check_flushprepped(&pos->coord); +} + +int alloc_extent(flush_pos_t *flush_pos); + +/* Handle the case when regular reiser4 tree (znodes connected one to its + * neighbors by sibling pointers) is interrupted on leaf level by one or more + * unformatted nodes. By having a lock on twig level and use extent code + * routines to process unformatted nodes we swim around an irregular part of + * reiser4 tree. */ +static int handle_pos_on_twig (flush_pos_t * pos) +{ + int ret; + + assert ("zam-844", pos->state == POS_ON_EPOINT); + assert ("zam-843", item_is_extent(&pos->coord)); + + /* We decide should we continue slum processing with current extent + unit: if leftmost child of current extent unit is flushprepped + (i.e. clean or already processed by flush) we stop squalloc(). There + is a fast check for unallocated extents which we assume contain all + not flushprepped nodes. */ + /* FIXME: Here we implement simple check, we are only looking on the + leftmost child. 
*/ + ret = squalloc_extent_should_stop(pos); + if (ret != 0) { + pos_stop(pos); + return ret; + } + + while (pos_valid(pos) && coord_is_existing_unit(&pos->coord) && item_is_extent(&pos->coord)) { + ret = alloc_extent(pos); + if (ret) { + break; + } + coord_next_unit(&pos->coord); + } + + if (coord_is_after_rightmost(&pos->coord)) { + pos->state = POS_END_OF_TWIG; + return 0; + } + if (item_is_internal(&pos->coord)) { + pos->state = POS_TO_LEAF; + return 0; + } + + assert ("zam-860", item_is_extent(&pos->coord)); + + /* "slum" is over */ + pos->state = POS_INVALID; + return 0; +} + +/* When we about to return flush position from twig to leaf level we can process + * the right twig node or move position to the leaf. This processes right twig + * if it is possible and jump to leaf level if not. */ +static int handle_pos_end_of_twig (flush_pos_t * pos) +{ + int ret; + lock_handle right_lock; + load_count right_load; + coord_t at_right; + jnode * child = NULL; + + + assert ("zam-848", pos->state == POS_END_OF_TWIG); + assert ("zam-849", coord_is_after_rightmost(&pos->coord)); + + init_lh(&right_lock); + init_load_count(&right_load); + + /* We get a lock on the right twig node even it is not dirty because + * slum continues or discontinues on leaf level not on next twig. This + * lock on the right twig is needed for getting its leftmost child. */ + ret = reiser4_get_right_neighbor(&right_lock, pos->lock.node, ZNODE_WRITE_LOCK, GN_SAME_ATOM); + if (ret) + goto out; + + ret = incr_load_count_znode(&right_load, right_lock.node); + if (ret) + goto out; + + /* right twig could be not dirty */ + if (znode_check_dirty(right_lock.node)) { + /* If right twig node is dirty we always attempt to squeeze it + * content to the left... 
*/ +became_dirty: + ret = squeeze_right_twig_and_advance_coord(pos, right_lock.node); + if (ret <=0) { + /* pos->coord is on internal item, go to leaf level, or + * we have an error which will be caught in squalloc() */ + pos->state = POS_TO_LEAF; + goto out; + } + + /* If the right twig was squeezed completely we have to re-lock + * the right twig. For now this is done through the top-level squalloc + * routine. */ + if (node_is_empty(right_lock.node)) + goto out; + + /* ... and prep it if it is not yet prepped */ + if (!znode_check_flushprepped(right_lock.node)) { + /* As usual, process parent before ...*/ + ret = check_parents_and_squalloc_upper_levels(pos, pos->lock.node, right_lock.node); + if (ret) + goto out; + + /* ... processing the child */ + ret = lock_parent_and_allocate_znode(right_lock.node, pos); + if (ret) + goto out; + } + } else { + coord_init_first_unit(&at_right, right_lock.node); + + /* check first child of next twig, should we continue there ? */ + ret = get_leftmost_child_of_unit(&at_right, &child); + if (ret || child == NULL || jnode_check_flushprepped(child)) { + pos_stop(pos); + goto out; + } + + /* check clean twig for possible relocation */ + if (!znode_check_flushprepped(right_lock.node)) { + ret = reverse_relocate_check_dirty_parent(child, &at_right, pos); + if (ret) + goto out; + if (znode_check_dirty(right_lock.node)) + goto became_dirty; + } + } + + assert ("zam-875", znode_check_flushprepped(right_lock.node)); + + /* Update the preceder with the block number of the just processed right twig + * node. The code above may have skipped the preceder update because + * allocate_znode() may not have been called for this node. */ + pos->preceder.blk = *znode_get_block(right_lock.node); + check_preceder(pos->preceder.blk); + + coord_init_first_unit(&at_right, right_lock.node); + assert("zam-868", coord_is_existing_unit(&at_right)); + + pos->state = item_is_extent(&at_right) ? 
POS_ON_EPOINT : POS_TO_LEAF; + move_flush_pos(pos, &right_lock, &right_load, &at_right); + + out: + done_load_count(&right_load); + done_lh(&right_lock); + + if (child) + jput(child); + + return ret; +} + +/* Move the pos->lock to leaf node pointed by pos->coord, check should we + * continue there. */ +static int handle_pos_to_leaf (flush_pos_t * pos) +{ + int ret; + lock_handle child_lock; + load_count child_load; + jnode * child; + + assert ("zam-846", pos->state == POS_TO_LEAF); + assert ("zam-847", item_is_internal(&pos->coord)); + + init_lh(&child_lock); + init_load_count(&child_load); + + ret = get_leftmost_child_of_unit(&pos->coord, &child); + if (ret) + return ret; + if (child == NULL) { + pos_stop(pos); + return 0; + } + + if (jnode_check_flushprepped(child)) { + pos->state = POS_INVALID; + goto out; + } + + ret = longterm_lock_znode(&child_lock, JZNODE(child), ZNODE_WRITE_LOCK, ZNODE_LOCK_LOPRI); + if (ret) + goto out; + + ret = incr_load_count_znode(&child_load, JZNODE(child)); + if (ret) + goto out; + + ret = allocate_znode(JZNODE(child), &pos->coord, pos); + if (ret) + goto out; + + /* move flush position to leaf level */ + pos->state = POS_ON_LEAF; + move_flush_pos(pos, &child_lock, &child_load, NULL); + + if (node_is_empty(JZNODE(child))) { + ret = delete_empty_node(JZNODE(child)); + pos->state = POS_INVALID; + } + out: + done_load_count(&child_load); + done_lh(&child_lock); + jput(child); + + return ret; +} +/* move pos from leaf to twig, and move lock from leaf to twig. 
*/ +/* Move pos->lock to upper (twig) level */ +static int handle_pos_to_twig (flush_pos_t * pos) +{ + int ret; + + lock_handle parent_lock; + load_count parent_load; + coord_t pcoord; + + assert ("zam-852", pos->state == POS_TO_TWIG); + + init_lh(&parent_lock); + init_load_count(&parent_load); + + ret = reiser4_get_parent(&parent_lock, pos->lock.node, ZNODE_WRITE_LOCK, 0); + if (ret) + goto out; + + ret = incr_load_count_znode(&parent_load, parent_lock.node); + if (ret) + goto out; + + ret = find_child_ptr(parent_lock.node, pos->lock.node, &pcoord); + if (ret) + goto out; + + assert ("zam-870", item_is_internal(&pcoord)); + coord_next_item(&pcoord); + + if (coord_is_after_rightmost(&pcoord)) + pos->state = POS_END_OF_TWIG; + else if (item_is_extent(&pcoord)) + pos->state = POS_ON_EPOINT; + else { + /* Here we understand that getting -E_NO_NEIGHBOR in + * handle_pos_on_leaf() was because of just a reaching edge of + * slum */ + pos_stop(pos); + goto out; + } + + move_flush_pos(pos, &parent_lock, &parent_load, &pcoord); + + out: + done_load_count(&parent_load); + done_lh(&parent_lock); + + return ret; +} + +typedef int (*pos_state_handle_t)(flush_pos_t*); +static pos_state_handle_t flush_pos_handlers[] = { + /* process formatted nodes on leaf level, keep lock on a leaf node */ + [POS_ON_LEAF] = handle_pos_on_leaf, + /* process unformatted nodes, keep lock on twig node, pos->coord points to extent currently + * being processed */ + [POS_ON_EPOINT] = handle_pos_on_twig, + /* move a lock from leaf node to its parent for further processing of unformatted nodes */ + [POS_TO_TWIG] = handle_pos_to_twig, + /* move a lock from twig to leaf level when a processing of unformatted nodes finishes, + * pos->coord points to the leaf node we jump to */ + [POS_TO_LEAF] = handle_pos_to_leaf, + /* after processing last extent in the twig node, attempting to shift items from the twigs + * right neighbor and process them while shifting */ + [POS_END_OF_TWIG] = handle_pos_end_of_twig, + 
/* process formatted nodes on internal level, keep lock on an internal node */ + [POS_ON_INTERNAL] = handle_pos_on_internal +}; + +/* Advance flush position horizontally, prepare for flushing ((re)allocate, squeeze, + * encrypt) nodes and their ancestors in "parent-first" order */ +static int squalloc (flush_pos_t * pos) +{ + int ret = 0; + + /* maybe needs to be made a case statement with handle_pos_on_leaf as first case, for + * greater CPU efficiency? Measure and see.... -Hans */ + while (pos_valid(pos)) { + ret = flush_pos_handlers[pos->state](pos); + if (ret < 0) + break; + + ret = rapid_flush(pos); + if (ret) + break; + } + + /* any positive value or -E_NO_NEIGHBOR are legal return codes for handle_pos* + routines, -E_NO_NEIGHBOR means that slum edge was reached */ + if (ret > 0 || ret == -E_NO_NEIGHBOR) + ret = 0; + + return ret; +} + +static void update_ldkey(znode *node) +{ + reiser4_key ldkey; + + assert("vs-1630", rw_dk_is_write_locked(znode_get_tree(node))); + if (node_is_empty(node)) + return; + + znode_set_ld_key(node, leftmost_key_in_node(node, &ldkey)); +} + +/* this is to be called after calling of shift node's method to shift data from @right to + @left. It sets left delimiting keys of @left and @right to keys of first items of @left + and @right correspondingly and sets right delimiting key of @left to first key of @right */ +static void +update_znode_dkeys(znode *left, znode *right) +{ + assert("nikita-1470", rw_dk_is_write_locked(znode_get_tree(right))); + assert("vs-1629", znode_is_write_locked(left) && znode_is_write_locked(right)); + + /* we need to update left delimiting of left if it was empty before shift */ + update_ldkey(left); + update_ldkey(right); + if (node_is_empty(right)) + znode_set_rd_key(left, znode_get_rd_key(right)); + else + znode_set_rd_key(left, znode_get_ld_key(right)); +} + +/* try to shift everything from @right to @left. If everything was shifted - + @right is removed from the tree. 
Result is the number of bytes shifted. */ +static int +shift_everything_left(znode * right, znode * left, carry_level * todo) +{ + coord_t from; + node_plugin *nplug; + carry_plugin_info info; + + coord_init_after_last_item(&from, right); + + nplug = node_plugin_by_node(right); + info.doing = NULL; + info.todo = todo; + return nplug->shift(&from, left, SHIFT_LEFT, + 1 /* delete @right if it becomes empty */, + 1 /* move coord @from to node @left if everything will be shifted */, + &info); +} + +/* Shift as much as possible from @right to @left using the memcpy-optimized + shift_everything_left. @left and @right are formatted neighboring nodes on + leaf level. */ +static int +squeeze_right_non_twig(znode * left, znode * right) +{ + int ret; + carry_pool *pool; + carry_level todo; + + assert("nikita-2246", znode_get_level(left) == znode_get_level(right)); + + if (!znode_is_dirty(left) || !znode_is_dirty(right)) + return SQUEEZE_TARGET_FULL; + + pool = init_carry_pool(); + if (IS_ERR(pool)) + return PTR_ERR(pool); + init_carry_level(&todo, pool); + + ret = shift_everything_left(right, left, &todo); + if (ret > 0) { + /* something was shifted */ + reiser4_tree *tree; + __u64 grabbed; + + znode_make_dirty(left); + znode_make_dirty(right); + + /* update delimiting keys of nodes which participated in + shift. FIXME: it would be better to have this in shift + node's operation. But it can not be done there. Nobody + remembers why, though */ + tree = znode_get_tree(left); + UNDER_RW_VOID(dk, tree, write, update_znode_dkeys(left, right)); + + /* Carry is called to update delimiting key and, maybe, to remove empty + node. */ + grabbed = get_current_context()->grabbed_blocks; + ret = reiser4_grab_space_force(tree->height, BA_RESERVED); + assert("nikita-3003", ret == 0); /* reserved space is exhausted. Ask Hans. 
*/ + ret = carry(&todo, NULL /* previous level */ ); + grabbed2free_mark(grabbed); + } else { + /* Shifting impossible, we return appropriate result code */ + ret = node_is_empty(right) ? SQUEEZE_SOURCE_EMPTY : SQUEEZE_TARGET_FULL; + } + + done_carry_pool(pool); + + return ret; +} + +/* Shift first unit of first item if it is an internal one. Return + SQUEEZE_TARGET_FULL if it fails to shift an item, otherwise return + SUBTREE_MOVED. */ +static int +shift_one_internal_unit(znode * left, znode * right) +{ + int ret; + carry_pool *pool; + carry_level todo; + coord_t coord; + int size, moved; + carry_plugin_info info; + + assert("nikita-2247", znode_get_level(left) == znode_get_level(right)); + assert("nikita-2435", znode_is_write_locked(left)); + assert("nikita-2436", znode_is_write_locked(right)); + assert("nikita-2434", UNDER_RW(tree, znode_get_tree(left), read, left->right == right)); + + coord_init_first_unit(&coord, right); + +#if REISER4_DEBUG + if (!node_is_empty(left)) { + coord_t last; + reiser4_key right_key; + reiser4_key left_key; + + coord_init_last_unit(&last, left); + + assert("nikita-2463", + keyle(item_key_by_coord(&last, &left_key), item_key_by_coord(&coord, &right_key))); + } +#endif + + assert("jmacd-2007", item_is_internal(&coord)); + + pool = init_carry_pool(); + if (IS_ERR(pool)) + return PTR_ERR(pool); + init_carry_level(&todo, pool); + + size = item_length_by_coord(&coord); + info.todo = &todo; + info.doing = NULL; + + ret = node_plugin_by_node(left)->shift(&coord, left, SHIFT_LEFT, + 1 /* delete @right if it becomes empty */, + 0 /* do not move coord @coord to node @left */, + &info); + + /* If shift returns positive, then we shifted the item. 
*/ + assert("vs-423", ret <= 0 || size == ret); + moved = (ret > 0); + + if (moved) { + /* something was moved */ + reiser4_tree *tree; + int grabbed; + + znode_make_dirty(left); + znode_make_dirty(right); + tree = znode_get_tree(left); + UNDER_RW_VOID(dk, tree, write, update_znode_dkeys(left, right)); + + /* reserve space for delimiting keys after shifting */ + grabbed = get_current_context()->grabbed_blocks; + ret = reiser4_grab_space_force(tree->height, BA_RESERVED); + assert("nikita-3003", ret == 0); /* reserved space is exhausted. Ask Hans. */ + + ret = carry(&todo, NULL /* previous level */ ); + grabbed2free_mark(grabbed); + } + + done_carry_pool(pool); + + if (ret != 0) { + /* Shift or carry operation failed. */ + assert("jmacd-7325", ret < 0); + return ret; + } + + return moved ? SUBTREE_MOVED : SQUEEZE_TARGET_FULL; +} + +/* ALLOCATE INTERFACE */ +/* Audited by: umka (2002.06.11) */ +reiser4_internal void +jnode_set_block(jnode * node /* jnode to update */ , + const reiser4_block_nr * blocknr /* new block nr */ ) +{ + assert("nikita-2020", node != NULL); + assert("umka-055", blocknr != NULL); + assert("zam-819", ergo(JF_ISSET(node, JNODE_EFLUSH), node->blocknr == 0)); + assert("vs-1453", ergo(JF_ISSET(node, JNODE_EFLUSH), jnode_is_unformatted(node))); + node->blocknr = *blocknr; +} + +/* Make the final relocate/wander decision during forward parent-first squalloc for a + znode. For unformatted nodes this is done in plugin/item/extent.c:extent_needs_allocation(). */ +static int +allocate_znode_loaded(znode * node, + const coord_t * parent_coord, flush_pos_t * pos) +{ + int ret; + reiser4_super_info_data * sbinfo = get_current_super_private(); + /* FIXME(D): We have the node write-locked and should have checked for ! + allocated() somewhere before reaching this point, but there can be a race, so + this assertion is bogus. 
*/ + assert("jmacd-7987", !jnode_check_flushprepped(ZJNODE(node))); + assert("jmacd-7988", znode_is_write_locked(node)); + assert("jmacd-7989", coord_is_invalid(parent_coord) + || znode_is_write_locked(parent_coord->node)); + + if (ZF_ISSET(node, JNODE_REPACK) || znode_created(node) || znode_is_root(node) || + /* We have enough nodes to relocate no matter what. */ + (pos->leaf_relocate != 0 && znode_get_level(node) == LEAF_LEVEL)) + { + /* No need to decide with new nodes, they are treated the same as + relocate. If the root node is dirty, relocate. */ + if (pos->preceder.blk == 0) { + /* preceder is unknown and we have decided to relocate node -- + using of default value for search start is better than search + from block #0. */ + get_blocknr_hint_default(&pos->preceder.blk); + check_preceder(pos->preceder.blk); + } + + goto best_reloc; + + } else if (pos->preceder.blk == 0) { + /* If we don't know the preceder, leave it where it is. */ + jnode_make_wander(ZJNODE(node)); + } else { + /* Make a decision based on block distance. */ + reiser4_block_nr dist; + reiser4_block_nr nblk = *znode_get_block(node); + + assert("jmacd-6172", !blocknr_is_fake(&nblk)); + assert("jmacd-6173", !blocknr_is_fake(&pos->preceder.blk)); + assert("jmacd-6174", pos->preceder.blk != 0); + + if (pos->preceder.blk == nblk - 1) { + /* Ideal. */ + jnode_make_wander(ZJNODE(node)); + } else { + + dist = (nblk < pos->preceder.blk) ? (pos->preceder.blk - nblk) : (nblk - pos->preceder.blk); + + /* See if we can find a closer block (forward direction only). */ + pos->preceder.max_dist = min((reiser4_block_nr)sbinfo->flush.relocate_distance, dist); + pos->preceder.level = znode_get_level(node); + + ret = allocate_znode_update(node, parent_coord, pos); + + pos->preceder.max_dist = 0; + + if (ret && (ret != -ENOSPC)) + return ret; + + if (ret == 0) { + /* Got a better allocation. 
*/ + znode_make_reloc(node, pos->fq); + } else if (dist < sbinfo->flush.relocate_distance) { + /* The present allocation is good enough. */ + jnode_make_wander(ZJNODE(node)); + } else { + /* Otherwise, try to relocate to the best position. */ + best_reloc: + ret = allocate_znode_update(node, parent_coord, pos); + if (ret != 0) + return ret; + + /* set JNODE_RELOC bit _after_ node gets allocated */ + znode_make_reloc(node, pos->fq); + } + } + } + + /* This is the new preceder. */ + pos->preceder.blk = *znode_get_block(node); + check_preceder(pos->preceder.blk); + pos->alloc_cnt += 1; + + assert ("jmacd-4277", !blocknr_is_fake(&pos->preceder.blk)); + + return 0; +} + +static int +allocate_znode(znode * node, const coord_t * parent_coord, flush_pos_t * pos) +{ + /* + * perform znode allocation with znode pinned in memory to avoid races + * with asynchronous emergency flush (which plays with + * JNODE_FLUSH_RESERVED bit). + */ + return WITH_DATA(node, allocate_znode_loaded(node, parent_coord, pos)); +} + + +/* A subroutine of allocate_znode, this is called first to see if there is a close + position to relocate to. It may return ENOSPC if there is no close position. If there + is no close position it may not relocate. This takes care of updating the parent node + with the relocated block address. 
*/ +static int +allocate_znode_update(znode * node, const coord_t * parent_coord, flush_pos_t * pos) +{ + int ret; + reiser4_block_nr blk; + lock_handle uber_lock; + int flush_reserved_used = 0; + int grabbed; + + init_lh(&uber_lock); + + grabbed = get_current_context()->grabbed_blocks; + + /* discard e-flush allocation */ + ret = zload(node); + if (ret) + return ret; + + if (ZF_ISSET(node, JNODE_CREATED)) { + assert ("zam-816", blocknr_is_fake(znode_get_block(node))); + pos->preceder.block_stage = BLOCK_UNALLOCATED; + } else { + pos->preceder.block_stage = BLOCK_GRABBED; + + /* The disk space for relocating the @node is already reserved in "flush reserved" + * counter if @node is a leaf, otherwise we grab space using BA_RESERVED (that is, grab + * space from the whole disk, not from only 95% of it). */ + if (znode_get_level(node) == LEAF_LEVEL) { + /* + * earlier (during do_jnode_make_dirty()) we decided + * that @node can possibly go into overwrite set and + * reserved block for its wandering location. + */ + txn_atom * atom = get_current_atom_locked(); + assert("nikita-3449", + ZF_ISSET(node, JNODE_FLUSH_RESERVED)); + flush_reserved2grabbed(atom, (__u64)1); + spin_unlock_atom(atom); + /* + * we are trying to move node into relocate + * set. Allocation of relocated position "uses" + * reserved block. + */ + ZF_CLR(node, JNODE_FLUSH_RESERVED); + flush_reserved_used = 1; + } else { + ret = reiser4_grab_space_force((__u64)1, BA_RESERVED); + if (ret != 0) + goto exit; + } + } + + /* We may not use the 5% of reserved disk space here, and flush will not pack tightly. 
*/ + ret = reiser4_alloc_block(&pos->preceder, &blk, BA_FORMATTED | BA_PERMANENT); + if(ret) + goto exit; + + + if (!ZF_ISSET(node, JNODE_CREATED) && + (ret = reiser4_dealloc_block(znode_get_block(node), 0, BA_DEFER | BA_FORMATTED))) + goto exit; + + if (likely(!znode_is_root(node))) { + item_plugin *iplug; + + iplug = item_plugin_by_coord(parent_coord); + assert("nikita-2954", iplug->f.update != NULL); + iplug->f.update(parent_coord, &blk); + + znode_make_dirty(parent_coord->node); + + } else { + reiser4_tree *tree = znode_get_tree(node); + znode *uber; + + /* We take a longterm lock on the fake node in order to change + the root block number. This may cause atom fusion. */ + ret = get_uber_znode(tree, ZNODE_WRITE_LOCK, ZNODE_LOCK_HIPRI, + &uber_lock); + /* The fake node cannot be deleted, and we must have priority + here, and may not be confused with ENOSPC. */ + assert("jmacd-74412", + ret != -EINVAL && ret != -E_DEADLOCK && ret != -ENOSPC); + + if (ret) + goto exit; + + uber = uber_lock.node; + + UNDER_RW_VOID(tree, tree, write, tree->root_block = blk); + + znode_make_dirty(uber); + } + + ret = znode_rehash(node, &blk); +exit: + if(ret) { + /* Get flush reserved block back if something fails, because + * callers assume that on error block wasn't relocated and its + * flush reserved block wasn't used. */ + if (flush_reserved_used) { + /* + * ok, we failed to move node into relocate + * set. Restore status quo. + */ + grabbed2flush_reserved((__u64)1); + ZF_SET(node, JNODE_FLUSH_RESERVED); + } + } + zrelse(node); + done_lh(&uber_lock); + grabbed2free_mark(grabbed); + return ret; +} + +/* JNODE INTERFACE */ + +/* Lock a node (if formatted) and then get its parent locked, set the child's + coordinate in the parent. If the child is the root node, the above_root + znode is returned but the coord is not set. This function may cause atom + fusion, but it is only used for read locks (at this point) and therefore + fusion only occurs when the parent is already dirty. 
*/ +/* Hans adds this note: remember to ask how expensive this operation is vs. storing parent + pointer in jnodes. */ +static int +jnode_lock_parent_coord(jnode * node, + coord_t * coord, + lock_handle * parent_lh, + load_count * parent_zh, + znode_lock_mode parent_mode, + int try) +{ + int ret; + + assert("edward-53", jnode_is_unformatted(node) || jnode_is_znode(node)); + assert("edward-54", jnode_is_unformatted(node) || znode_is_any_locked(JZNODE(node))); + + if (!jnode_is_znode(node)) { + reiser4_key key; + tree_level stop_level = TWIG_LEVEL ; + lookup_bias bias = FIND_EXACT; + + assert("edward-168", !(jnode_get_type(node) == JNODE_BITMAP)); + + /* The case when node is not znode, but can have parent coord + (unformatted node, node which represents cluster page, + etc..). Generate a key for the appropriate entry, search + in the tree using coord_by_key, which handles locking for + us. */ + + /* + * nothing is locked at this moment, so, nothing prevents + * concurrent truncate from removing jnode from inode. To + * prevent this spin-lock jnode. jnode can be truncated just + * after call to the jnode_build_key(), but this is ok, + * because coord_by_key() will just fail to find appropriate + * extent. 
+ */ + LOCK_JNODE(node); + if (!JF_ISSET(node, JNODE_HEARD_BANSHEE)) { + jnode_build_key(node, &key); + ret = 0; + } else + ret = RETERR(-ENOENT); + UNLOCK_JNODE(node); + + if (ret != 0) + return ret; + + if (jnode_is_cluster_page(node)) + stop_level = LEAF_LEVEL; + + assert("jmacd-1812", coord != NULL); + + ret = coord_by_key(jnode_get_tree(node), &key, coord, parent_lh, + parent_mode, bias, stop_level, stop_level, CBK_UNIQUE, 0/*ra_info*/); + switch (ret) { + case CBK_COORD_NOTFOUND: + assert("edward-1038", + ergo(jnode_is_cluster_page(node), JF_ISSET(node, JNODE_HEARD_BANSHEE))); + if (!JF_ISSET(node, JNODE_HEARD_BANSHEE)) { + warning("nikita-3177", "Parent not found"); + print_jnode("node", node); + } + return ret; + case CBK_COORD_FOUND: + if (coord->between != AT_UNIT) { + /* FIXME: comment needed */ + done_lh(parent_lh); + if (!JF_ISSET(node, JNODE_HEARD_BANSHEE)) { + warning("nikita-3178", + "Found but not happy: %i", + coord->between); + print_jnode("node", node); + } + return RETERR(-ENOENT); + } + ret = incr_load_count_znode(parent_zh, parent_lh->node); + if (ret != 0) + return ret; + /* if (jnode_is_cluster_page(node)) { + races with write() are possible + check_child_cluster (parent_lh->node); + } + */ + break; + default: + return ret; + } + + } else { + int flags; + znode *z; + + z = JZNODE(node); + /* Formatted node case: */ + assert("jmacd-2061", !znode_is_root(z)); + + flags = GN_ALLOW_NOT_CONNECTED; + if (try) + flags |= GN_TRY_LOCK; + + ret = reiser4_get_parent_flags(parent_lh, z, parent_mode, flags); + if (ret != 0) + /* -E_REPEAT is ok here, it is handled by the caller. */ + return ret; + + /* Make the child's position "hint" up-to-date. (Unless above + root, which caller must check.) 
*/ + if (coord != NULL) { + + ret = incr_load_count_znode(parent_zh, parent_lh->node); + if (ret != 0) { + warning("jmacd-976812386", "incr_load_count_znode failed: %d", ret); + return ret; + } + + ret = find_child_ptr(parent_lh->node, z, coord); + if (ret != 0) { + warning("jmacd-976812", "find_child_ptr failed: %d", ret); + return ret; + } + } + } + + return 0; +} + +/* Get the (locked) next neighbor of a znode which is dirty and a member of the same atom. + If there is no next neighbor or the neighbor is not in memory or if there is a + neighbor but it is not dirty or not in the same atom, -E_NO_NEIGHBOR is returned. */ +static int +neighbor_in_slum( + + znode * node, /* starting point */ + + lock_handle * lock, /* lock on starting point */ + + sideof side, /* left or right direction we seek the next node in */ + + znode_lock_mode mode /* kind of lock we want */ + + ) +{ + int ret; + + assert("jmacd-6334", znode_is_connected(node)); + + ret = reiser4_get_neighbor(lock, node, mode, GN_SAME_ATOM | (side == LEFT_SIDE ? GN_GO_LEFT : 0)); + + if (ret) { + /* May return -ENOENT or -E_NO_NEIGHBOR. */ + /* FIXME(C): check EINVAL, E_DEADLOCK */ + if (ret == -ENOENT) { + ret = RETERR(-E_NO_NEIGHBOR); + } + + return ret; + } + + /* Check dirty bit of locked znode, no races here */ + if (znode_check_dirty(lock->node)) + return 0; + + done_lh(lock); + return RETERR(-E_NO_NEIGHBOR); +} + +/* Return true if two znodes have the same parent. This is called with both nodes + write-locked (for squeezing) so no tree lock is needed. */ +static int +znode_same_parents(znode * a, znode * b) +{ + assert("jmacd-7011", znode_is_write_locked(a)); + assert("jmacd-7012", znode_is_write_locked(b)); + + /* We lock the whole tree for this check.... I really don't like whole tree + * locks... -Hans */ + return UNDER_RW(tree, znode_get_tree(a), read, + (znode_parent(a) == znode_parent(b))); +} + +/* FLUSH SCAN */ + +/* Initialize the flush_scan data structure. 
*/ +static void +scan_init(flush_scan * scan) +{ + memset(scan, 0, sizeof (*scan)); + init_lh(&scan->node_lock); + init_lh(&scan->parent_lock); + init_load_count(&scan->parent_load); + init_load_count(&scan->node_load); + coord_init_invalid(&scan->parent_coord, NULL); +} + +/* Release any resources held by the flush scan, e.g., release locks, free memory, etc. */ +static void +scan_done(flush_scan * scan) +{ + done_load_count(&scan->node_load); + if (scan->node != NULL) { + jput(scan->node); + scan->node = NULL; + } + done_load_count(&scan->parent_load); + done_lh(&scan->parent_lock); + done_lh(&scan->node_lock); +} + +/* Returns true if flush scanning is finished. */ +reiser4_internal int +scan_finished(flush_scan * scan) +{ + return scan->stop || (scan->direction == RIGHT_SIDE && + scan->count >= scan->max_count); +} + +/* Return true if the scan should continue to the @tonode. True if the node meets the + same_slum_check condition. If not, deref the "left" node and stop the scan. */ +reiser4_internal int +scan_goto(flush_scan * scan, jnode * tonode) +{ + int go = same_slum_check(scan->node, tonode, 1, 0); + + if (!go) { + scan->stop = 1; + jput(tonode); + } + + return go; +} + +/* Set the current scan->node, refcount it, increment count by the @add_count (number to + count, e.g., skipped unallocated nodes), deref previous current, and copy the current + parent coordinate. */ +reiser4_internal int +scan_set_current(flush_scan * scan, jnode * node, unsigned add_count, const coord_t * parent) +{ + /* Release the old references, take the new reference. */ + done_load_count(&scan->node_load); + + if (scan->node != NULL) { + jput(scan->node); + } + scan->node = node; + scan->count += add_count; + + /* This next stmt is somewhat inefficient. The scan_extent_coord code could + delay this update step until it finishes and update the parent_coord only once. + It did that before, but there was a bug and this was the easiest way to make it + correct. 
*/ + if (parent != NULL) { + coord_dup(&scan->parent_coord, parent); + } + + /* Failure may happen at the incr_load_count call, but the caller can assume the reference + is safely taken. */ + return incr_load_count_jnode(&scan->node_load, node); +} + +/* Return true if scanning in the leftward direction. */ +reiser4_internal int +scanning_left(flush_scan * scan) +{ + return scan->direction == LEFT_SIDE; +} + +/* Performs leftward scanning starting from either kind of node. Counts the starting + node. The right-scan object is passed in for the left-scan in order to copy the parent + of an unformatted starting position. This way we avoid searching for the unformatted + node's parent when scanning in each direction. If we search for the parent once, it is + set in both scan objects. The limit parameter tells flush-scan when to stop. + + Rapid scanning is used only during scan_left, where we are interested in finding the + 'leftpoint' where we begin flushing. We are interested in stopping at the left child + of a twig that does not have a dirty left neighbor. THIS IS A SPECIAL CASE. The + problem is finding a way to flush only those nodes without unallocated children, and it + is difficult to solve in the bottom-up flushing algorithm we are currently using. The + problem can be solved by scanning left at every level as we go upward, but this would + basically bring us back to using a top-down allocation strategy, which we already tried + (see BK history from May 2002), and has a different set of problems. The top-down + strategy makes avoiding unallocated children easier, but makes it difficult to + properly flush dirty children with clean parents that would otherwise stop the + top-down flush, only later to dirty the parent once the children are flushed. So we + solve the problem in the bottom-up algorithm with a special case for twigs and leaves + only. + + The first step in solving the problem is this rapid leftward scan. 
After we determine + that there are at least enough nodes counted to qualify for FLUSH_RELOCATE_THRESHOLD we + are no longer interested in the exact count; we are only interested in finding the + best place to start the flush. We could choose one of two possibilities: + + 1. Stop at the leftmost child (of a twig) that does not have a dirty left neighbor. + This requires checking one leaf per rapid-scan twig. + + 2. Stop at the leftmost child (of a twig) where there are no dirty children of the twig + to the left. This requires checking possibly all of the in-memory children of each + twig during the rapid scan. + + For now we implement the first policy. +*/ +static int +scan_left(flush_scan * scan, flush_scan * right, jnode * node, unsigned limit) +{ + int ret = 0; + + scan->max_count = limit; + scan->direction = LEFT_SIDE; + + ret = scan_set_current(scan, jref(node), 1, NULL); + if (ret != 0) { + return ret; + } + + ret = scan_common(scan, right); + if (ret != 0) { + return ret; + } + + /* Before rapid scanning, we need a lock on scan->node so that we can get its + parent, only if formatted. */ + if (jnode_is_znode(scan->node)) { + ret = longterm_lock_znode(&scan->node_lock, JZNODE(scan->node), + ZNODE_WRITE_LOCK, ZNODE_LOCK_LOPRI); + } + + /* Rapid_scan would go here (with limit set to FLUSH_RELOCATE_THRESHOLD). */ + return ret; +} + +/* Performs rightward scanning... Does not count the starting node. The limit parameter + is described in scan_left. If the starting node is unformatted then the + parent_coord was already set during scan_left. The rapid_after parameter is not used + during right-scanning. + + scan_right is only called if the scan_left operation does not count at least + FLUSH_RELOCATE_THRESHOLD nodes for flushing. In that case, the limit parameter is set to + the difference between scan-left's count and FLUSH_RELOCATE_THRESHOLD, meaning + scan-right counts as high as FLUSH_RELOCATE_THRESHOLD and then stops. 
*/ +static int +scan_right(flush_scan * scan, jnode * node, unsigned limit) +{ + int ret; + + scan->max_count = limit; + scan->direction = RIGHT_SIDE; + + ret = scan_set_current(scan, jref(node), 0, NULL); + if (ret != 0) { + return ret; + } + + return scan_common(scan, NULL); +} + +/* Common code to perform left or right scanning. */ +static int +scan_common(flush_scan * scan, flush_scan * other) +{ + int ret; + + assert("nikita-2376", scan->node != NULL); + assert("edward-54", jnode_is_unformatted(scan->node) || jnode_is_znode(scan->node)); + + /* Special case for starting at an unformatted node. Optimization: we only want + to search for the parent (which requires a tree traversal) once. Obviously, we + shouldn't have to call it once for the left scan and once for the right scan. + For this reason, if we search for the parent during scan-left we then duplicate + the coord/lock/load into the scan-right object. */ + if (jnode_is_unformatted(scan->node)) { + ret = scan_unformatted(scan, other); + if (ret != 0) + return ret; + } + /* This loop expects to start at a formatted position and performs chaining of + formatted regions */ + while (!scan_finished(scan)) { + + ret = scan_formatted(scan); + if (ret != 0) { + return ret; + } + } + + return 0; +} + +static int +scan_unformatted(flush_scan * scan, flush_scan * other) +{ + int ret = 0; + int try = 0; + + if (!coord_is_invalid(&scan->parent_coord)) + goto scan; + + /* set parent coord from */ + if (!jnode_is_unformatted(scan->node)) { + /* formatted position*/ + + lock_handle lock; + assert("edward-301", jnode_is_znode(scan->node)); + init_lh(&lock); + + /* + * when flush starts from unformatted node, first thing it + * does is tree traversal to find formatted parent of starting + * node. This parent is then kept locked across scans to the + * left and to the right. This means that during scan to the + * left we cannot take left-ward lock, because this is + * dead-lock prone. 
So, if we are scanning to the left and + * there is already lock held by this thread, + * jnode_lock_parent_coord() should use try-lock. + */ + try = scanning_left(scan) && !lock_stack_isclean(get_current_lock_stack()); + /* Need the node locked to get the parent lock, We have to + take write lock since there is at least one call path + where this znode is already write-locked by us. */ + ret = longterm_lock_znode(&lock, JZNODE(scan->node), ZNODE_WRITE_LOCK, + scanning_left(scan) ? ZNODE_LOCK_LOPRI : ZNODE_LOCK_HIPRI); + if (ret != 0) + /* EINVAL or E_DEADLOCK here mean... try again! At this point we've + scanned too far and can't back out, just start over. */ + return ret; + + ret = jnode_lock_parent_coord(scan->node, + &scan->parent_coord, + &scan->parent_lock, + &scan->parent_load, + ZNODE_WRITE_LOCK, try); + + /* FIXME(C): check EINVAL, E_DEADLOCK */ + done_lh(&lock); + if (ret == -E_REPEAT) { + scan->stop = 1; + return 0; + } + if (ret) + return ret; + + } else { + /* unformatted position */ + + ret = jnode_lock_parent_coord(scan->node, &scan->parent_coord, &scan->parent_lock, + &scan->parent_load, ZNODE_WRITE_LOCK, try); + + if (IS_CBKERR(ret)) + return ret; + + if (ret == CBK_COORD_NOTFOUND) + /* FIXME(C): check EINVAL, E_DEADLOCK */ + return ret; + + /* parent was found */ + assert("jmacd-8661", other != NULL); + /* Duplicate the reference into the other flush_scan. */ + coord_dup(&other->parent_coord, &scan->parent_coord); + copy_lh(&other->parent_lock, &scan->parent_lock); + copy_load_count(&other->parent_load, &scan->parent_load); + } + scan: + return scan_by_coord(scan); +} + +/* Performs left- or rightward scanning starting from a formatted node. 
Follow left + pointers under tree lock as long as: + + - node->left/right is non-NULL + - node->left/right is connected, dirty + - node->left/right belongs to the same atom + - scan has not reached maximum count +*/ +static int +scan_formatted(flush_scan * scan) +{ + int ret; + znode *neighbor = NULL; + + assert("jmacd-1401", !scan_finished(scan)); + + do { + znode *node = JZNODE(scan->node); + + /* Node should be connected, but if not stop the scan. */ + if (!znode_is_connected(node)) { + scan->stop = 1; + break; + } + + /* Lock the tree, check-for and reference the next sibling. */ + RLOCK_TREE(znode_get_tree(node)); + + /* It may be that a node is inserted or removed between a node and its + left sibling while the tree lock is released, but the flush-scan count + does not need to be precise. Thus, we release the tree lock as soon as + we get the neighboring node. */ + neighbor = scanning_left(scan) ? node->left : node->right; + if (neighbor != NULL) { + zref(neighbor); + } + + RUNLOCK_TREE(znode_get_tree(node)); + + /* If neighbor is NULL at the leaf level, need to check for an unformatted + sibling using the parent--break in any case. */ + if (neighbor == NULL) { + break; + } + + /* Check the condition for going left, break if it is not met. This also + releases (jputs) the neighbor if false. */ + if (!scan_goto(scan, ZJNODE(neighbor))) { + break; + } + + /* Advance the flush_scan state to the left, repeat. */ + ret = scan_set_current(scan, ZJNODE(neighbor), 1, NULL); + if (ret != 0) { + return ret; + } + + } while (!scan_finished(scan)); + + /* If neighbor is NULL then we reached the end of a formatted region, or else the + sibling is out of memory, now check for an extent to the left (as long as + LEAF_LEVEL). 
*/ + if (neighbor != NULL || jnode_get_level(scan->node) != LEAF_LEVEL || scan_finished(scan)) { + scan->stop = 1; + return 0; + } + /* Otherwise, calls scan_by_coord for the right(left)most item of the + left(right) neighbor on the parent level, then possibly continue. */ + + coord_init_invalid(&scan->parent_coord, NULL); + return scan_unformatted(scan, NULL); +} + +/* NOTE-EDWARD: + This scans adjacent items of the same type and calls scan flush plugin for each one. + Performs left(right)ward scanning starting from a (possibly) unformatted node. If we start + from unformatted node, then we continue only if the next neighbor is also unformatted. + When called from scan_formatted, we skip first iteration (to make sure that + right(left)most item of the left(right) neighbor on the parent level is of the same + type and set appropriate coord). */ +static int +scan_by_coord(flush_scan * scan) +{ + int ret = 0; + int scan_this_coord; + lock_handle next_lock; + load_count next_load; + coord_t next_coord; + jnode *child; + item_plugin *iplug; + + init_lh(&next_lock); + init_load_count(&next_load); + scan_this_coord = (jnode_is_unformatted(scan->node) ? 1 : 0); + + /* set initial item id */ + iplug = item_plugin_by_coord(&scan->parent_coord); + + for (; !scan_finished(scan); scan_this_coord = 1) { + if (scan_this_coord) { + /* Here we expect that unit is scannable. it would not be so due + * to race with extent->tail conversion. */ + if (iplug->f.scan == NULL) { + scan->stop = 1; + ret = -E_REPEAT; + /* skip the check at the end. */ + goto race; + } + + ret = iplug->f.scan(scan); + if (ret != 0) + goto exit; + + if (scan_finished(scan)) { + checkchild(scan); + break; + } + } else { + /* the same race against truncate as above is possible + * here, it seems */ + + /* NOTE-JMACD: In this case, apply the same end-of-node logic but don't scan + the first coordinate. 
*/ + assert("jmacd-1231", item_is_internal(&scan->parent_coord)); + } + + if(iplug->f.utmost_child == NULL || znode_get_level(scan->parent_coord.node) != TWIG_LEVEL) { + /* stop this coord and continue on parent level */ + ret = scan_set_current(scan, ZJNODE(zref(scan->parent_coord.node)), 1, NULL); + if (ret != 0) + goto exit; + break; + } + + /* Either way, the invariant is that scan->parent_coord is set to the + parent of scan->node. Now get the next unit. */ + coord_dup(&next_coord, &scan->parent_coord); + coord_sideof_unit(&next_coord, scan->direction); + + /* If off-the-end of the twig, try the next twig. */ + if (coord_is_after_sideof_unit(&next_coord, scan->direction)) { + /* We take the write lock because we may start flushing from this + * coordinate. */ + ret = neighbor_in_slum(next_coord.node, &next_lock, scan->direction, ZNODE_WRITE_LOCK); + + if (ret == -E_NO_NEIGHBOR) { + scan->stop = 1; + ret = 0; + break; + } + + if (ret != 0) { + goto exit; + } + + ret = incr_load_count_znode(&next_load, next_lock.node); + if (ret != 0) { + goto exit; + } + + coord_init_sideof_unit(&next_coord, next_lock.node, sideof_reverse(scan->direction)); + } + + iplug = item_plugin_by_coord(&next_coord); + + /* Get the next child. */ + ret = iplug->f.utmost_child(&next_coord, sideof_reverse(scan->direction), &child); + if (ret != 0) + goto exit; + /* If the next child is not in memory, or, item_utmost_child + failed (due to race with unlink, most probably), stop + here. */ + if (child == NULL || IS_ERR(child)) { + scan->stop = 1; + checkchild(scan); + break; + } + + assert("nikita-2374", jnode_is_unformatted(child) || jnode_is_znode(child)); + + /* See if it is dirty, part of the same atom. */ + if (!scan_goto(scan, child)) { + checkchild(scan); + break; + } + + /* If so, make this child current. */ + ret = scan_set_current(scan, child, 1, &next_coord); + if (ret != 0) + goto exit; + + /* Now continue. If formatted we release the parent lock and return, then + proceed. 
*/ + if (jnode_is_znode(child)) + break; + + /* Otherwise, repeat the above loop with next_coord. */ + if (next_load.node != NULL) { + done_lh(&scan->parent_lock); + move_lh(&scan->parent_lock, &next_lock); + move_load_count(&scan->parent_load, &next_load); + } + } + + assert("jmacd-6233", scan_finished(scan) || jnode_is_znode(scan->node)); + exit: + checkchild(scan); + race: /* skip the above check */ + if (jnode_is_znode(scan->node)) { + done_lh(&scan->parent_lock); + done_load_count(&scan->parent_load); + } + + done_load_count(&next_load); + done_lh(&next_lock); + return ret; +} + +/* FLUSH POS HELPERS */ + +/* Initialize the fields of a flush_position. */ +static void +pos_init(flush_pos_t * pos) +{ + memset(pos, 0, sizeof *pos); + + pos->state = POS_INVALID; + coord_init_invalid(&pos->coord, NULL); + init_lh(&pos->lock); + init_load_count(&pos->load); + + blocknr_hint_init(&pos->preceder); +} + +/* The flush loop inside squalloc periodically checks pos_valid to + determine when "enough flushing" has been performed. This will return true until one + of the following conditions is met: + + 1. the number of flush-queued nodes has reached the kernel-supplied "int *nr_to_flush" + parameter, meaning we have flushed as many blocks as the kernel requested. When + flushing to commit, this parameter is NULL. + + 2. pos_stop() is called because squalloc discovers that the "next" node in the + flush order is either nonexistent, not dirty, or not in the same atom. +*/ + + +static int pos_valid (flush_pos_t * pos) +{ + return pos->state != POS_INVALID; +} + +/* Release any resources of a flush_position. Called when jnode_flush finishes. */ +static void +pos_done(flush_pos_t * pos) +{ + pos_stop(pos); + blocknr_hint_done(&pos->preceder); + if (convert_data(pos)) + free_convert_data(pos); +} + +/* Reset the point and parent. Called during flush subroutines to terminate the + squalloc loop. 
*/ +static int +pos_stop(flush_pos_t * pos) +{ + pos->state = POS_INVALID; + done_lh(&pos->lock); + done_load_count(&pos->load); + coord_init_invalid(&pos->coord, NULL); + + if (pos->child) { + jput(pos->child); + pos->child = NULL; + } + + return 0; +} + +/* Return the flush_position's block allocator hint. */ +reiser4_internal reiser4_blocknr_hint * +pos_hint(flush_pos_t * pos) +{ + return &pos->preceder; +} + +reiser4_internal flush_queue_t * pos_fq(flush_pos_t * pos) +{ + return pos->fq; +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 90 + LocalWords: preceder + End: +*/ diff -puN /dev/null fs/reiser4/flush.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/flush.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,283 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* DECLARATIONS: */ + +#if !defined(__REISER4_FLUSH_H__) +#define __REISER4_FLUSH_H__ + +#include "plugin/cryptcompress.h" + +/* The flush_scan data structure maintains the state of an in-progress flush-scan on a + single level of the tree. A flush-scan is used for counting the number of adjacent + nodes to flush, which is used to determine whether we should relocate, and it is also + used to find a starting point for flush. A flush-scan object can scan in both right + and left directions via the scan_left() and scan_right() interfaces. The + right- and left-variations are similar but perform different functions. When scanning + left we (optionally perform rapid scanning and then) longterm-lock the endpoint node. + When scanning right we are simply counting the number of adjacent, dirty nodes. */ +struct flush_scan { + + /* The current number of nodes scanned on this level. */ + unsigned count; + + /* There may be a maximum number of nodes for a scan on any single level. 
When + going leftward, max_count is determined by FLUSH_SCAN_MAXNODES (see reiser4.h) */ + unsigned max_count; + + /* Direction: Set to one of the sideof enumeration: { LEFT_SIDE, RIGHT_SIDE }. */ + sideof direction; + + /* Initially @stop is set to false then set true once some condition stops the + search (e.g., we found a clean node before reaching max_count or we found a + node belonging to another atom). */ + int stop; + + /* The current scan position. If @node is non-NULL then its reference count has + been incremented to reflect this reference. */ + jnode *node; + + /* A handle for zload/zrelse of current scan position node. */ + load_count node_load; + + /* During left-scan, if the final position (a.k.a. endpoint node) is formatted the + node is locked using this lock handle. The endpoint needs to be locked for + transfer to the flush_position object after scanning finishes. */ + lock_handle node_lock; + + /* When the position is unformatted, its parent, coordinate, and parent + zload/zrelse handle. */ + lock_handle parent_lock; + coord_t parent_coord; + load_count parent_load; + + /* The block allocator preceder hint. Sometimes flush_scan determines what the + preceder is and if so it sets it here, after which it is copied into the + flush_position. Otherwise, the preceder is computed later. 
*/ + reiser4_block_nr preceder_blk; +}; + +typedef struct convert_item_info { + dc_item_stat d_cur; /* disk cluster state of the current item */ + dc_item_stat d_next; /* disk cluster state of the next slum item */ + struct inode * inode; + flow_t flow; +} convert_item_info_t; + +typedef struct convert_info { + int count; /* for squalloc terminating */ + reiser4_cluster_t clust; /* transform cluster */ + item_plugin * iplug; /* current item plugin */ + convert_item_info_t * itm; /* current item info */ +} convert_info_t; + +typedef enum flush_position_state { + POS_INVALID, /* Invalid or stopped pos, do not continue slum + * processing */ + POS_ON_LEAF, /* pos points to already prepped, locked formatted node at + * leaf level */ + POS_ON_EPOINT, /* pos keeps a lock on twig level, "coord" field is used + * to traverse unformatted nodes */ + POS_TO_LEAF, /* pos is being moved to leaf level */ + POS_TO_TWIG, /* pos is being moved to twig level */ + POS_END_OF_TWIG, /* special case of POS_ON_TWIG, when coord is after + * rightmost unit of the current twig */ + POS_ON_INTERNAL /* same as POS_ON_LEAF, but points to internal node */ + +} flushpos_state_t; + + + +/* An encapsulation of the current flush point and all the parameters that are passed + through the entire squeeze-and-allocate stage of the flush routine. A single + flush_position object is constructed after left- and right-scanning finishes. */ +struct flush_position { + flushpos_state_t state; + + coord_t coord; /* coord to traverse unformatted nodes */ + lock_handle lock; /* current lock we hold */ + load_count load; /* load status for current locked formatted node */ + + jnode * child; /* for passing a reference to unformatted child + * across pos state changes */ + + reiser4_blocknr_hint preceder; /* The flush 'hint' state. */ + int leaf_relocate; /* True if enough leaf-level nodes were + * found to suggest a relocate policy. 
*/ + long *nr_to_flush; /* If called under memory pressure, + * indicates how many nodes the VM asked to flush. */ + int alloc_cnt; /* The number of nodes allocated during squeeze and allocate. */ + int prep_or_free_cnt; /* The number of nodes prepared for write (allocate) or squeezed and freed. */ + flush_queue_t *fq; + long *nr_written; /* number of nodes submitted to disk */ + int flags; /* a copy of jnode_flush flags argument */ + + znode * prev_twig; /* previous parent pointer value, used to catch + * processing of new twig node */ + convert_info_t * sq; /* convert info */ + + unsigned long pos_in_unit; /* for extents only. Position + within an extent unit of first + jnode of slum */ +}; + +static inline int +item_convert_count (flush_pos_t * pos) +{ + return pos->sq->count; +} +static inline void +inc_item_convert_count (flush_pos_t * pos) +{ + pos->sq->count++; +} +static inline void +set_item_convert_count (flush_pos_t * pos, int count) +{ + pos->sq->count = count; +} +static inline item_plugin * +item_convert_plug (flush_pos_t * pos) +{ + return pos->sq->iplug; +} + +static inline convert_info_t * +convert_data (flush_pos_t * pos) +{ + return pos->sq; +} + +static inline convert_item_info_t * +item_convert_data (flush_pos_t * pos) +{ + assert("edward-955", convert_data(pos)); + return pos->sq->itm; +} + +static inline tfm_cluster_t * +tfm_cluster_sq (flush_pos_t * pos) +{ + return &pos->sq->clust.tc; +} + +static inline tfm_stream_t * +tfm_stream_sq (flush_pos_t * pos, tfm_stream_id id) +{ + assert("edward-854", pos->sq != NULL); + return tfm_stream(tfm_cluster_sq(pos), id); +} + +static inline int +chaining_data_present (flush_pos_t * pos) +{ + return convert_data(pos) && item_convert_data(pos); +} + +/* Returns true if next node contains next item of the disk cluster + so item convert data should be moved to the right slum neighbor. 
+*/ +static inline int +should_chain_next_node(flush_pos_t * pos) { + int result = 0; + + assert("edward-1007", chaining_data_present(pos)); + + switch (item_convert_data(pos)->d_next) { + case DC_CHAINED_ITEM: + result = 1; + break; + case DC_AFTER_CLUSTER: + break; + default: + impossible("edward-1009", "bad state of next slum item"); + } + return result; +} + +/* update item state in a disk cluster to assign conversion mode */ +static inline void +move_chaining_data(flush_pos_t * pos, + int this_node /* where is next item */) { + + assert("edward-1010", chaining_data_present(pos)); + + if (this_node == 0) { + /* next item is on the right neighbor */ + assert("edward-1011", + item_convert_data(pos)->d_cur == DC_FIRST_ITEM || + item_convert_data(pos)->d_cur == DC_CHAINED_ITEM); + assert("edward-1012", + item_convert_data(pos)->d_next == DC_CHAINED_ITEM); + + item_convert_data(pos)->d_cur = DC_CHAINED_ITEM; + item_convert_data(pos)->d_next = DC_INVALID_STATE; + } else { + /* next item is on the same node */ + assert("edward-1013", + item_convert_data(pos)->d_cur == DC_FIRST_ITEM || + item_convert_data(pos)->d_cur == DC_CHAINED_ITEM); + assert("edward-1227", + item_convert_data(pos)->d_next == DC_AFTER_CLUSTER || + item_convert_data(pos)->d_next == DC_INVALID_STATE); + + item_convert_data(pos)->d_cur = DC_AFTER_CLUSTER; + item_convert_data(pos)->d_next = DC_INVALID_STATE; + } +} + +static inline int +should_convert_node(flush_pos_t * pos, znode * node) +{ + return znode_convertible(node); +} + +/* true if there is attached convert item info */ +static inline int +should_convert_next_node(flush_pos_t * pos, znode * node) +{ + return convert_data(pos) && item_convert_data(pos); +} + +#define SQUALLOC_THRESHOLD 256 + +static inline int +should_terminate_squalloc(flush_pos_t * pos) +{ + return convert_data(pos) && + !item_convert_data(pos) && + item_convert_count(pos) >= SQUALLOC_THRESHOLD; +} + +void free_convert_data(flush_pos_t * pos); +/* used in extent.c */ +int 
scan_set_current(flush_scan * scan, jnode * node, unsigned add_size, const coord_t * parent); +int scan_finished(flush_scan * scan); +int scanning_left(flush_scan * scan); +int scan_goto(flush_scan * scan, jnode * tonode); +txn_atom *atom_locked_by_fq(flush_queue_t * fq); + +int init_fqs(void); +void done_fqs(void); + +#if REISER4_DEBUG +#define check_preceder(blk) \ +assert("nikita-2588", blk < reiser4_block_count(reiser4_get_current_sb())); +extern void check_pos(flush_pos_t *pos); +#else +#define check_preceder(b) noop +#define check_pos(pos) noop +#endif + +/* __REISER4_FLUSH_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 90 + LocalWords: preceder + End: +*/ diff -puN /dev/null fs/reiser4/flush_queue.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/flush_queue.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,753 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "debug.h" +#include "type_safe_list.h" +#include "super.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "page_cache.h" +#include "wander.h" +#include "vfs_ops.h" +#include "writeout.h" + +#include +#include +#include +#include +#include + +/* A flush queue object is an accumulator for keeping jnodes prepared + by the jnode_flush() function for writing to disk. Those "queued" jnodes are + kept on the flush queue until memory pressure or atom commit asks + flush queues to write some or all from their jnodes. */ + +TYPE_SAFE_LIST_DEFINE(fq, flush_queue_t, alink); + +#if REISER4_DEBUG +# define spin_ordering_pred_fq(fq) (1) +#endif + +SPIN_LOCK_FUNCTIONS(fq, flush_queue_t, guard); + +/* + LOCKING: + + fq->guard spin lock protects fq->atom pointer and nothing else. fq->prepped + list protected by atom spin lock. 
fq->prepped list uses the following + locking: + + two ways to protect fq->prepped list for read-only list traversal: + + 1. atom spin-lock atom. + 2. fq is IN_USE, atom->nr_running_queues increased. + + and one for list modification: + + 1. atom is spin-locked and one condition is true: fq is IN_USE or + atom->nr_running_queues == 0. + + The deadlock-safe order for flush queues and atoms is: first lock atom, then + lock flush queue, then lock jnode. +*/ + +#define fq_in_use(fq) ((fq)->state & FQ_IN_USE) +#define fq_ready(fq) (!fq_in_use(fq)) + +#define mark_fq_in_use(fq) do { (fq)->state |= FQ_IN_USE; } while (0) +#define mark_fq_ready(fq) do { (fq)->state &= ~FQ_IN_USE; } while (0) + +/* get lock on atom from locked flush queue object */ +static txn_atom * +atom_get_locked_by_fq(flush_queue_t * fq) +{ + /* This code is similar to jnode_get_atom(), look at it for the + * explanation. */ + txn_atom *atom; + + assert("zam-729", spin_fq_is_locked(fq)); + + while(1) { + atom = fq->atom; + if (atom == NULL) + break; + + if (spin_trylock_atom(atom)) + break; + + atomic_inc(&atom->refcount); + spin_unlock_fq(fq); + LOCK_ATOM(atom); + spin_lock_fq(fq); + + if (fq->atom == atom) { + atomic_dec(&atom->refcount); + break; + } + + spin_unlock_fq(fq); + atom_dec_and_unlock(atom); + spin_lock_fq(fq); + } + + return atom; +} + +reiser4_internal txn_atom * +atom_locked_by_fq(flush_queue_t * fq) +{ + return UNDER_SPIN(fq, fq, atom_get_locked_by_fq(fq)); +} + +static void +init_fq(flush_queue_t * fq) +{ + memset(fq, 0, sizeof *fq); + + atomic_set(&fq->nr_submitted, 0); + + capture_list_init(ATOM_FQ_LIST(fq)); + + sema_init(&fq->io_sem, 0); + spin_fq_init(fq); +} + +/* slab for flush queues */ +static kmem_cache_t *fq_slab; + +reiser4_internal int init_fqs(void) +{ + fq_slab = kmem_cache_create("fq", + sizeof (flush_queue_t), + 0, + SLAB_HWCACHE_ALIGN, + NULL, + NULL); + return (fq_slab == NULL) ? 
RETERR(-ENOMEM) : 0; +} + +reiser4_internal void done_fqs(void) +{ + kmem_cache_destroy(fq_slab); +} + +/* create new flush queue object */ +static flush_queue_t * +create_fq(int gfp) +{ + flush_queue_t *fq; + + fq = kmem_cache_alloc(fq_slab, gfp); + if (fq) + init_fq(fq); + + return fq; +} + +/* adjust the atom's debug counter of queued nodes */ +static void +count_enqueued_node(flush_queue_t * fq) +{ + ON_DEBUG(fq->atom->num_queued++); +} + +static void +count_dequeued_node(flush_queue_t * fq) +{ + assert("zam-993", fq->atom->num_queued > 0); + ON_DEBUG(fq->atom->num_queued--); +} + +/* attach flush queue object to the atom */ +static void +attach_fq(txn_atom * atom, flush_queue_t * fq) +{ + assert("zam-718", spin_atom_is_locked(atom)); + fq_list_push_front(&atom->flush_queues, fq); + fq->atom = atom; + ON_DEBUG(atom->nr_flush_queues++); +} + +/* detach flush queue object from its atom */ +static void +detach_fq(flush_queue_t * fq) +{ + assert("zam-731", spin_atom_is_locked(fq->atom)); + + spin_lock_fq(fq); + fq_list_remove_clean(fq); + assert("vs-1456", fq->atom->nr_flush_queues > 0); + ON_DEBUG(fq->atom->nr_flush_queues--); + fq->atom = NULL; + spin_unlock_fq(fq); +} + +/* destroy flush queue object */ +static void +done_fq(flush_queue_t * fq) +{ + assert("zam-763", capture_list_empty(ATOM_FQ_LIST(fq))); + assert("zam-766", atomic_read(&fq->nr_submitted) == 0); + + kmem_cache_free(fq_slab, fq); +} + +/* set JNODE_FLUSH_QUEUED on @node and count it as queued */ +reiser4_internal void +mark_jnode_queued(flush_queue_t *fq, jnode *node) +{ + JF_SET(node, JNODE_FLUSH_QUEUED); + count_enqueued_node(fq); +} + +/* Put a jnode into the flush queue. Both atom and jnode should be + spin-locked. 
*/ +reiser4_internal void +queue_jnode(flush_queue_t * fq, jnode * node) +{ + assert("zam-711", spin_jnode_is_locked(node)); + assert("zam-713", node->atom != NULL); + assert("zam-712", spin_atom_is_locked(node->atom)); + assert("zam-714", jnode_is_dirty(node)); + assert("zam-716", fq->atom != NULL); + assert("zam-717", fq->atom == node->atom); + assert("zam-907", fq_in_use(fq)); + + assert("zam-826", JF_ISSET(node, JNODE_RELOC)); + assert("vs-1481", !JF_ISSET(node, JNODE_FLUSH_QUEUED)); + assert("vs-1481", NODE_LIST(node) != FQ_LIST); + + mark_jnode_queued(fq, node); + capture_list_remove_clean(node); + capture_list_push_back(ATOM_FQ_LIST(fq), node); + /*XXXX*/ON_DEBUG(count_jnode(node->atom, node, NODE_LIST(node), FQ_LIST, 1)); +} + +/* repeatable process for waiting io completion on a flush queue object */ +static int +wait_io(flush_queue_t * fq, int *nr_io_errors) +{ + assert("zam-738", fq->atom != NULL); + assert("zam-739", spin_atom_is_locked(fq->atom)); + assert("zam-736", fq_in_use(fq)); + assert("zam-911", capture_list_empty(ATOM_FQ_LIST(fq))); + + if (atomic_read(&fq->nr_submitted) != 0) { + struct super_block *super; + + UNLOCK_ATOM(fq->atom); + + assert("nikita-3013", schedulable()); + + super = reiser4_get_current_sb(); + + /* FIXME: this is instead of blk_run_queues() */ + blk_run_address_space(get_super_fake(super)->i_mapping); + + if ( !(super->s_flags & MS_RDONLY) ) + down(&fq->io_sem); + + /* Ask the caller to re-acquire the locks and call this + function again. Note: this technique is commonly used in + the txnmgr code. 
*/ + return -E_REPEAT; + } + + *nr_io_errors += atomic_read(&fq->nr_errors); + return 0; +} + +/* wait on I/O completion, then detach and free the flush queue */ +static int +finish_fq(flush_queue_t * fq, int *nr_io_errors) +{ + int ret; + txn_atom * atom = fq->atom; + + assert("zam-801", atom != NULL); + assert("zam-744", spin_atom_is_locked(atom)); + assert("zam-762", fq_in_use(fq)); + + ret = wait_io(fq, nr_io_errors); + if (ret) + return ret; + + detach_fq(fq); + done_fq(fq); + + atom_send_event(atom); + + return 0; +} + +/* wait for all i/o of the given atom to complete; actually do one iteration + of that and return -E_REPEAT if more iterations are needed */ +static int +finish_all_fq(txn_atom * atom, int *nr_io_errors) +{ + flush_queue_t *fq; + + assert("zam-730", spin_atom_is_locked(atom)); + + if (fq_list_empty(&atom->flush_queues)) + return 0; + + for_all_type_safe_list(fq, &atom->flush_queues, fq) { + if (fq_ready(fq)) { + int ret; + + mark_fq_in_use(fq); + assert("vs-1247", fq->owner == NULL); + ON_DEBUG(fq->owner = current); + ret = finish_fq(fq, nr_io_errors); + + if ( *nr_io_errors ) + reiser4_handle_error(); + + if (ret) { + fq_put(fq); + return ret; + } + + UNLOCK_ATOM(atom); + + return -E_REPEAT; + } + } + + /* All flush queues are in use; atom remains locked */ + return -EBUSY; +} + +/* wait for all i/o of the current atom */ +reiser4_internal int +current_atom_finish_all_fq(void) +{ + txn_atom *atom; + int nr_io_errors = 0; + int ret = 0; + + do { + while (1) { + atom = get_current_atom_locked(); + ret = finish_all_fq(atom, &nr_io_errors); + if (ret != -EBUSY) + break; + atom_wait_event(atom); + } + } while (ret == -E_REPEAT); + + /* we do not need the atom locked after this function finishes; success + (0) and -EBUSY are the two return codes for which the atom remains + locked after finish_all_fq */ + if (!ret) + UNLOCK_ATOM(atom); + + assert("nikita-2696", spin_atom_is_not_locked(atom)); + + if (ret) + return ret; + + if (nr_io_errors) + return RETERR(-EIO); + + return 0; +} 
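The drop-the-lock-and-retry convention used by wait_io(), finish_all_fq() and current_atom_finish_all_fq() above can be looked at in isolation. What follows is a minimal user-space sketch, not reiser4 code: a plain flag stands in for the atom spin lock, and every name in it (toy_atom, toy_finish_one, toy_finish_all, TOY_E_REPEAT) is hypothetical and chosen only for illustration.

```c
#include <assert.h>

#define TOY_E_REPEAT 1001	/* hypothetical stand-in for reiser4's E_REPEAT */

/* Toy stand-in for txn_atom: a lock flag plus a count of pending queues. */
struct toy_atom {
	int locked;		/* 1 while the "spin lock" is held */
	int pending;		/* flush queues that still have I/O in flight */
	int iterations;		/* how many times the worker ran, for inspection */
};

static void toy_lock(struct toy_atom *a)
{
	assert(!a->locked);
	a->locked = 1;
}

static void toy_unlock(struct toy_atom *a)
{
	assert(a->locked);
	a->locked = 0;
}

/* One iteration of work; must be called with the atom locked.
   Returns 0 with the lock still held when nothing is left to do.
   Otherwise finishes one queue, drops the lock (as if it had to sleep)
   and returns -TOY_E_REPEAT so that the caller re-locks and calls again. */
static int toy_finish_one(struct toy_atom *a)
{
	assert(a->locked);
	a->iterations++;
	if (a->pending == 0)
		return 0;
	a->pending--;
	toy_unlock(a);
	return -TOY_E_REPEAT;
}

/* Caller side: re-acquire the lock and retry for as long as asked to. */
static int toy_finish_all(struct toy_atom *a)
{
	int ret;

	do {
		toy_lock(a);
		ret = toy_finish_one(a);
	} while (ret == -TOY_E_REPEAT);

	/* toy_finish_one() returns 0 with the lock held, so drop it here */
	if (ret == 0)
		toy_unlock(a);
	return ret;
}
```

The point of the pattern is that the callee never sleeps while holding the lock; the caller owns the retry loop and must know, for each return code, whether the lock is still held.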
+ +/* change node->atom field for all jnode from given list */ +static void +scan_fq_and_update_atom_ref(capture_list_head * list, txn_atom * atom) +{ + jnode *cur; + + for_all_type_safe_list(capture, list, cur) { + LOCK_JNODE(cur); + cur->atom = atom; + UNLOCK_JNODE(cur); + } +} + +/* support for atom fusion operation */ +reiser4_internal void +fuse_fq(txn_atom * to, txn_atom * from) +{ + flush_queue_t *fq; + + assert("zam-720", spin_atom_is_locked(to)); + assert("zam-721", spin_atom_is_locked(from)); + + + for_all_type_safe_list(fq, &from->flush_queues, fq) { + scan_fq_and_update_atom_ref(ATOM_FQ_LIST(fq), to); + spin_lock_fq(fq); + fq->atom = to; + spin_unlock_fq(fq); + } + + fq_list_splice(&to->flush_queues, &from->flush_queues); + +#if REISER4_DEBUG + to->num_queued += from->num_queued; + to->nr_flush_queues += from->nr_flush_queues; + from->nr_flush_queues = 0; +#endif +} + +#if REISER4_DEBUG +int atom_fq_parts_are_clean (txn_atom * atom) +{ + assert("zam-915", atom != NULL); + return fq_list_empty(&atom->flush_queues); +} +#endif +/* Bio i/o completion routine for reiser4 write operations. */ +static int +end_io_handler(struct bio *bio, unsigned int bytes_done UNUSED_ARG, int err UNUSED_ARG) +{ + int i; + int nr_errors = 0; + flush_queue_t *fq; + + assert ("zam-958", bio->bi_rw & WRITE); + + /* i/o op. is not fully completed */ + if (bio->bi_size != 0) + return 1; + + /* we expect that bio->private is set to NULL or fq object which is used + * for synchronization and error counting. */ + fq = bio->bi_private; + /* Check all elements of io_vec for correct write completion. */ + for (i = 0; i < bio->bi_vcnt; i += 1) { + struct page *pg = bio->bi_io_vec[i].bv_page; + + if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) { + SetPageError(pg); + nr_errors++; + } + + { + /* jnode WRITEBACK ("write is in progress bit") is + * atomically cleared here. 
*/ + jnode *node; + + assert("zam-736", pg != NULL); + assert("zam-736", PagePrivate(pg)); + node = (jnode *) (pg->private); + + JF_CLR(node, JNODE_WRITEBACK); + } + + end_page_writeback(pg); + page_cache_release(pg); + } + + if (fq) { + /* count i/o error in fq object */ + atomic_add(nr_errors, &fq->nr_errors); + + /* If all write requests registered in this "fq" are done we up + * the semaphore. */ + if (atomic_sub_and_test(bio->bi_vcnt, &fq->nr_submitted)) + up(&fq->io_sem); + } + + bio_put(bio); + return 0; +} + +/* Count I/O requests which will be submitted by @bio in given flush queues + @fq */ +reiser4_internal void +add_fq_to_bio(flush_queue_t * fq, struct bio *bio) +{ + bio->bi_private = fq; + bio->bi_end_io = end_io_handler; + + if (fq) + atomic_add(bio->bi_vcnt, &fq->nr_submitted); +} + +/* Move all queued nodes out from @fq->prepped list. */ +static void release_prepped_list(flush_queue_t * fq) +{ + txn_atom * atom; + + assert ("zam-904", fq_in_use(fq)); + atom = UNDER_SPIN(fq, fq, atom_get_locked_by_fq(fq)); + + while(!capture_list_empty(ATOM_FQ_LIST(fq))) { + jnode * cur; + + cur = capture_list_front(ATOM_FQ_LIST(fq)); + capture_list_remove_clean(cur); + + count_dequeued_node(fq); + LOCK_JNODE(cur); + assert("nikita-3154", !JF_ISSET(cur, JNODE_OVRWR)); + assert("nikita-3154", JF_ISSET(cur, JNODE_RELOC)); + assert("nikita-3154", JF_ISSET(cur, JNODE_FLUSH_QUEUED)); + JF_CLR(cur, JNODE_FLUSH_QUEUED); + + if (JF_ISSET(cur, JNODE_DIRTY)) { + capture_list_push_back(ATOM_DIRTY_LIST(atom, jnode_get_level(cur)), cur); + ON_DEBUG(count_jnode(atom, cur, FQ_LIST, DIRTY_LIST, 1)); + } else { + capture_list_push_back(ATOM_CLEAN_LIST(atom), cur); + ON_DEBUG(count_jnode(atom, cur, FQ_LIST, CLEAN_LIST, 1)); + } + + UNLOCK_JNODE(cur); + } + + if (-- atom->nr_running_queues == 0) + atom_send_event(atom); + + UNLOCK_ATOM(atom); +} + +/* Submit write requests for nodes on the already filled flush queue @fq. 
+ + @fq: flush queue object which contains jnodes we can (and will) write. + @return: number of submitted blocks (>=0) if success, otherwise -- an error + code (<0). */ +reiser4_internal int +write_fq(flush_queue_t * fq, long * nr_submitted, int flags) +{ + int ret; + txn_atom * atom; + + while (1) { + atom = UNDER_SPIN(fq, fq, atom_get_locked_by_fq(fq)); + assert ("zam-924", atom); + /* do not write fq in parallel. */ + if (atom->nr_running_queues == 0 || !(flags & WRITEOUT_SINGLE_STREAM)) + break; + atom_wait_event(atom); + } + + atom->nr_running_queues ++; + UNLOCK_ATOM(atom); + + ret = write_jnode_list(ATOM_FQ_LIST(fq), fq, nr_submitted, flags); + release_prepped_list(fq); + + return ret; +} + +/* Getting flush queue object for exclusive use by one thread. May require + several iterations which is indicated by -E_REPEAT return code. + + This function does not contain code for obtaining an atom lock because an + atom lock is obtained by different ways in different parts of reiser4, + usually it is current atom, but we need a possibility for getting fq for the + atom of given jnode. 
*/ +static int +fq_by_atom_gfp(txn_atom * atom, flush_queue_t ** new_fq, int gfp) +{ + flush_queue_t *fq; + + assert("zam-745", spin_atom_is_locked(atom)); + + fq = fq_list_front(&atom->flush_queues); + while (!fq_list_end(&atom->flush_queues, fq)) { + spin_lock_fq(fq); + + if (fq_ready(fq)) { + mark_fq_in_use(fq); + assert("vs-1246", fq->owner == NULL); + ON_DEBUG(fq->owner = current); + spin_unlock_fq(fq); + + if (*new_fq) + done_fq(*new_fq); + + *new_fq = fq; + + return 0; + } + + spin_unlock_fq(fq); + + fq = fq_list_next(fq); + } + + /* Use previously allocated fq object */ + if (*new_fq) { + mark_fq_in_use(*new_fq); + assert("vs-1248", (*new_fq)->owner == 0); + ON_DEBUG((*new_fq)->owner = current); + attach_fq(atom, *new_fq); + + return 0; + } + + UNLOCK_ATOM(atom); + + *new_fq = create_fq(gfp); + + if (*new_fq == NULL) + return RETERR(-ENOMEM); + + return RETERR(-E_REPEAT); +} + +reiser4_internal int +fq_by_atom(txn_atom * atom, flush_queue_t ** new_fq) +{ + return fq_by_atom_gfp(atom, new_fq, GFP_KERNEL); +} + +/* A wrapper around fq_by_atom for getting a flush queue object for current + * atom, if success fq->atom remains locked. 
*/ +reiser4_internal flush_queue_t * +get_fq_for_current_atom(void) +{ + flush_queue_t *fq = NULL; + txn_atom *atom; + int ret; + + do { + atom = get_current_atom_locked(); + ret = fq_by_atom(atom, &fq); + } while (ret == -E_REPEAT); + + if (ret) + return ERR_PTR(ret); + return fq; +} + +/* Releasing flush queue object after exclusive use */ +reiser4_internal void +fq_put_nolock(flush_queue_t * fq) +{ + assert("zam-747", fq->atom != NULL); + assert("zam-902", capture_list_empty(ATOM_FQ_LIST(fq))); + mark_fq_ready(fq); + assert("vs-1245", fq->owner == current); + ON_DEBUG(fq->owner = NULL); +} + +reiser4_internal void +fq_put(flush_queue_t * fq) +{ + txn_atom *atom; + + spin_lock_fq(fq); + atom = atom_get_locked_by_fq(fq); + + assert("zam-746", atom != NULL); + + fq_put_nolock(fq); + atom_send_event(atom); + + spin_unlock_fq(fq); + UNLOCK_ATOM(atom); +} + +/* A part of atom object initialization related to the embedded flush queue + list head */ + +reiser4_internal void +init_atom_fq_parts(txn_atom * atom) +{ + fq_list_init(&atom->flush_queues); +} + +/* get a flush queue for an atom pointed by given jnode (spin-locked) ; returns + * both atom and jnode locked and found and took exclusive access for flush + * queue object. 
*/ +reiser4_internal int fq_by_jnode_gfp(jnode * node, flush_queue_t ** fq, int gfp) +{ + txn_atom * atom; + int ret; + + assert("zam-835", spin_jnode_is_locked(node)); + + *fq = NULL; + + while (1) { + /* begin with taking lock on atom */ + atom = jnode_get_atom(node); + UNLOCK_JNODE(node); + + if (atom == NULL) { + /* jnode does not point to the atom anymore, it is + * possible because jnode lock could be removed for a + * time in atom_get_locked_by_jnode() */ + if (*fq) { + done_fq(*fq); + *fq = NULL; + } + return 0; + } + + /* atom lock is required for taking flush queue */ + ret = fq_by_atom_gfp(atom, fq, gfp); + + if (ret) { + if (ret == -E_REPEAT) + /* atom lock was released for doing memory + * allocation, start with locked jnode one more + * time */ + goto lock_again; + return ret; + } + + /* It is correct to lock atom first, then lock a jnode */ + LOCK_JNODE(node); + + if (node->atom == atom) + break; /* Yes! it is our jnode. We got all of them: + * flush queue, and both locked atom and + * jnode */ + + /* release all locks and allocated objects and restart from + * locked jnode. */ + UNLOCK_JNODE(node); + + fq_put(*fq); + /* clear the caller's pointer, not the local argument */ + *fq = NULL; + + UNLOCK_ATOM(atom); + + lock_again: + LOCK_JNODE(node); + } + + return 0; +} + + +#if REISER4_DEBUG + +void check_fq(const txn_atom *atom) +{ + /* check number of nodes on all atom's flush queues */ + flush_queue_t *fq; + int count; + jnode *node; + + count = 0; + for_all_type_safe_list(fq, &atom->flush_queues, fq) { + spin_lock_fq(fq); + for_all_type_safe_list(capture, ATOM_FQ_LIST(fq), node) + count ++; + spin_unlock_fq(fq); + } + if (count != atom->fq) + warning("", "fq counter %d, real %d\n", atom->fq, count); + +} + +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 80 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/forward.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/forward.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,258 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Forward declarations. Thank you Kernighan. */ + +#if !defined( __REISER4_FORWARD_H__ ) +#define __REISER4_FORWARD_H__ + +#include + +typedef struct zlock zlock; +typedef struct lock_stack lock_stack; +typedef struct lock_handle lock_handle; +typedef struct znode znode; +typedef struct flow flow_t; +typedef struct coord coord_t; +typedef struct tree_access_pointer tap_t; +typedef struct item_coord item_coord; +typedef struct shift_params shift_params; +typedef struct reiser4_object_create_data reiser4_object_create_data; +typedef union reiser4_plugin reiser4_plugin; +typedef int reiser4_plugin_id; +typedef struct item_plugin item_plugin; +typedef struct jnode_plugin jnode_plugin; +typedef struct reiser4_item_data reiser4_item_data; +typedef union reiser4_key reiser4_key; +typedef union reiser4_dblock_nr reiser4_dblock_nr; +typedef struct reiser4_tree reiser4_tree; +typedef struct carry_cut_data carry_cut_data; +typedef struct carry_kill_data carry_kill_data; +typedef struct carry_tree_op carry_tree_op; +typedef struct carry_tree_node carry_tree_node; +typedef struct carry_plugin_info carry_plugin_info; +typedef struct reiser4_journal reiser4_journal; +typedef struct txn_atom txn_atom; +typedef struct txn_handle txn_handle; +typedef struct txn_mgr txn_mgr; +typedef struct reiser4_dir_entry_desc reiser4_dir_entry_desc; +typedef struct reiser4_context reiser4_context; +typedef struct carry_level carry_level; +typedef struct blocknr_set blocknr_set; +typedef struct blocknr_set_entry blocknr_set_entry; +/* super_block->s_fs_info points to this */ +typedef struct 
reiser4_super_info_data reiser4_super_info_data; +/* next two objects are fields of reiser4_super_info_data */ +typedef struct reiser4_oid_allocator reiser4_oid_allocator; +typedef struct reiser4_space_allocator reiser4_space_allocator; +typedef struct reiser4_file_fsdata reiser4_file_fsdata; + +typedef struct flush_scan flush_scan; +typedef struct flush_position flush_pos_t; + +typedef unsigned short pos_in_node_t; +#define MAX_POS_IN_NODE 65535 + +typedef struct jnode jnode; +typedef struct reiser4_blocknr_hint reiser4_blocknr_hint; + +typedef struct uf_coord uf_coord_t; +typedef struct hint hint_t; + +typedef struct ktxnmgrd_context ktxnmgrd_context; + +typedef struct reiser4_xattr_plugin reiser4_xattr_plugin; + +struct inode; +struct page; +struct file; +struct dentry; +struct super_block; + +/* return values of coord_by_key(). cbk == coord_by_key */ +typedef enum { + CBK_COORD_FOUND = 0, + CBK_COORD_NOTFOUND = -ENOENT, +} lookup_result; + +/* results of lookup with directory file */ +typedef enum { + FILE_NAME_FOUND = 0, + FILE_NAME_NOTFOUND = -ENOENT, + FILE_IO_ERROR = -EIO, /* FIXME: it seems silly to have special OOM, IO_ERROR return codes for each search. */ + FILE_OOM = -ENOMEM /* FIXME: it seems silly to have special OOM, IO_ERROR return codes for each search. */ +} file_lookup_result; + +/* behaviors of lookup. If coord we are looking for is actually in a tree, + both coincide. */ +typedef enum { + /* search exactly for the coord with key given */ + FIND_EXACT, + /* search for coord with the maximal key not greater than one + given */ + FIND_MAX_NOT_MORE_THAN /*LEFT_SLANT_BIAS */ +} lookup_bias; + +typedef enum { + /* number of leaf level of the tree + The fake root has (tree_level=0). */ + LEAF_LEVEL = 1, + + /* number of level one above leaf level of the tree. + + It is supposed that internal tree used by reiser4 to store file + system data and meta data will have height 2 initially (when + created by mkfs). 
+ */ + TWIG_LEVEL = 2, +} tree_level; + +/* The "real" maximum ztree height is the 0-origin size of any per-level + array, since the zero'th level is not used. */ +#define REAL_MAX_ZTREE_HEIGHT (REISER4_MAX_ZTREE_HEIGHT-LEAF_LEVEL) + +/* enumeration of possible mutual position of item and coord. This enum is + return type of ->is_in_item() item plugin method which see. */ +typedef enum { + /* coord is on the left of an item*/ + IP_ON_THE_LEFT, + /* coord is inside item */ + IP_INSIDE, + /* coord is inside item, but to the right of the rightmost unit of + this item */ + IP_RIGHT_EDGE, + /* coord is on the right of an item */ + IP_ON_THE_RIGHT +} interposition; + +/* type of lock to acquire on znode before returning it to caller */ +typedef enum { + ZNODE_NO_LOCK = 0, + ZNODE_READ_LOCK = 1, + ZNODE_WRITE_LOCK = 2, +} znode_lock_mode; + +/* type of lock request */ +typedef enum { + ZNODE_LOCK_LOPRI = 0, + ZNODE_LOCK_HIPRI = (1 << 0), + + /* By setting the ZNODE_LOCK_NONBLOCK flag in a lock request the call to longterm_lock_znode will not sleep + waiting for the lock to become available. If the lock is unavailable, reiser4_znode_lock will immediately + return the value -E_REPEAT. */ + ZNODE_LOCK_NONBLOCK = (1 << 1), + /* An option for longterm_lock_znode which prevents atom fusion */ + ZNODE_LOCK_DONT_FUSE = (1 << 2) +} znode_lock_request; + +typedef enum { READ_OP = 0, WRITE_OP = 1 } rw_op; + +/* used to specify direction of shift. 
These must be -1 and 1 */ +typedef enum { + SHIFT_LEFT = 1, + SHIFT_RIGHT = -1 +} shift_direction; + +typedef enum { + LEFT_SIDE, + RIGHT_SIDE +} sideof; + +#define round_up( value, order ) \ + ( ( typeof( value ) )( ( ( long ) ( value ) + ( order ) - 1U ) & \ + ~( ( order ) - 1 ) ) ) + +/* values returned by squalloc_right_neighbor and its auxiliary functions */ +typedef enum { + /* unit of internal item is moved */ + SUBTREE_MOVED = 0, + /* nothing else can be squeezed into left neighbor */ + SQUEEZE_TARGET_FULL = 1, + /* all content of node is squeezed into its left neighbor */ + SQUEEZE_SOURCE_EMPTY = 2, + /* one more item is copied (this is only returned by + allocate_and_copy_extent to squalloc_twig) */ + SQUEEZE_CONTINUE = 3 +} squeeze_result; + +/* Do not change item ids. If you do, the on-disk format changes. */ +typedef enum { + STATIC_STAT_DATA_ID = 0x0, + SIMPLE_DIR_ENTRY_ID = 0x1, + COMPOUND_DIR_ID = 0x2, + NODE_POINTER_ID = 0x3, + EXTENT_POINTER_ID = 0x5, + FORMATTING_ID = 0x6, + CTAIL_ID = 0x7, + BLACK_BOX_ID = 0x8, + LAST_ITEM_ID = 0x9 +} item_id; + +/* Flags passed to jnode_flush() to allow it to distinguish default settings based on + whether commit() was called or VM memory pressure was applied. */ +typedef enum { + /* submit flush queue to disk at jnode_flush completion */ + JNODE_FLUSH_WRITE_BLOCKS = 1, + + /* flush is called for commit */ + JNODE_FLUSH_COMMIT = 2, + /* not implemented */ + JNODE_FLUSH_MEMORY_FORMATTED = 4, + + /* not implemented */ + JNODE_FLUSH_MEMORY_UNFORMATTED = 8, +} jnode_flush_flags; + +/* Flags to insert/paste carry operations. Currently they are only used in + flushing code, but in the future they could be used to optimize for repetitive + accesses. 
*/ +typedef enum { + /* carry is not allowed to shift data to the left when trying to find + free space */ + COPI_DONT_SHIFT_LEFT = (1 << 0), + /* carry is not allowed to shift data to the right when trying to find + free space */ + COPI_DONT_SHIFT_RIGHT = (1 << 1), + /* carry is not allowed to allocate new node(s) when trying to find + free space */ + COPI_DONT_ALLOCATE = (1 << 2), + /* try to load left neighbor if it is not in a cache */ + COPI_LOAD_LEFT = (1 << 3), + /* try to load right neighbor if it is not in a cache */ + COPI_LOAD_RIGHT = (1 << 4), + /* shift insertion point to the left neighbor */ + COPI_GO_LEFT = (1 << 5), + /* shift insertion point to the right neighbor */ + COPI_GO_RIGHT = (1 << 6), + /* try to step back into original node if insertion into new node + fails after shifting data there. */ + COPI_STEP_BACK = (1 << 7) +} cop_insert_flag; + +typedef enum { + SAFE_UNLINK, /* safe-link for unlink */ + SAFE_TRUNCATE /* safe-link for truncate */ +} reiser4_safe_link_t; + +/* shows which of the atom's lists a jnode is on */ +typedef enum { + NOT_CAPTURED, + DIRTY_LIST, + CLEAN_LIST, + FQ_LIST, + WB_LIST, + OVRWR_LIST, + PROTECT_LIST +} atom_list; + +/* __REISER4_FORWARD_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/init_super.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/init_super.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,526 @@ +/* Copyright by Hans Reiser, 2003 */ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "tree.h" +#include "vfs_ops.h" +#include "inode.h" +#include "page_cache.h" +#include "ktxnmgrd.h" +#include "super.h" +#include "reiser4.h" +#include "entd.h" +#include "emergency_flush.h" +#include "safe_link.h" +#include "plugin/dir/dir.h" + +#include +#include +#include +#include +#include +#include +#include + +#define _INIT_PARAM_LIST (struct super_block * s, reiser4_context * ctx, void * data, int silent) +#define _DONE_PARAM_LIST (struct super_block * s) + +#define _INIT_(subsys) static int _init_##subsys _INIT_PARAM_LIST +#define _DONE_(subsys) static void _done_##subsys _DONE_PARAM_LIST + +#define _DONE_EMPTY(subsys) _DONE_(subsys) {} + +_INIT_(mount_flags_check) +{ +/* if (bdev_read_only(s->s_bdev) || (s->s_flags & MS_RDONLY)) { + warning("nikita-3322", "Readonly reiser4 is not yet supported"); + return RETERR(-EROFS); + }*/ + return 0; +} + +_DONE_EMPTY(mount_flags_check) + +_INIT_(sinfo) +{ + reiser4_super_info_data * sbinfo; + + sbinfo = kmalloc(sizeof (reiser4_super_info_data), GFP_KERNEL); + if (!sbinfo) + return RETERR(-ENOMEM); + + s->s_fs_info = sbinfo; + s->s_op = NULL; + memset(sbinfo, 0, sizeof (*sbinfo)); + + ON_DEBUG(INIT_LIST_HEAD(&sbinfo->all_jnodes)); + ON_DEBUG(spin_lock_init(&sbinfo->all_guard)); + + sema_init(&sbinfo->delete_sema, 1); + sema_init(&sbinfo->flush_sema, 1); + spin_super_init(sbinfo); + spin_super_eflush_init(sbinfo); + + return 0; +} + +_DONE_(sinfo) +{ + assert("zam-990", s->s_fs_info != NULL); + rcu_barrier(); + kfree(s->s_fs_info); + s->s_fs_info = NULL; +} + 
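The _INIT_()/_DONE_() macros above give every mount-time subsystem a matching pair of functions with fixed signatures. One way such pairs could be driven is from a table, running the init steps in order and undoing the completed ones in reverse when a step fails. The sketch below is a user-space illustration under that assumption, not the driver from this patch; toy_sb, TOY_INIT, TOY_DONE, TOY_PAIR, toy_stages and toy_mount are all hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mount state; stands in for struct super_block. */
struct toy_sb {
	int stages_up;	/* number of subsystems currently initialized */
	int fail_at;	/* index of the stage forced to fail, or -1 */
};

/* Same idea as _INIT_/_DONE_: one fixed signature per kind of function. */
#define TOY_INIT(subsys) static int toy_init_##subsys(struct toy_sb *s)
#define TOY_DONE(subsys) static void toy_done_##subsys(struct toy_sb *s)

/* Each stage succeeds unless it was selected to fail. */
static int toy_stage_up(struct toy_sb *s, int idx)
{
	if (s->fail_at == idx)
		return -1;
	s->stages_up++;
	return 0;
}

TOY_INIT(sinfo)   { return toy_stage_up(s, 0); }
TOY_DONE(sinfo)   { s->stages_up--; }
TOY_INIT(context) { return toy_stage_up(s, 1); }
TOY_DONE(context) { s->stages_up--; }
TOY_INIT(tree)    { return toy_stage_up(s, 2); }
TOY_DONE(tree)    { s->stages_up--; }

struct toy_pair {
	int (*init)(struct toy_sb *);
	void (*done)(struct toy_sb *);
};

#define TOY_PAIR(subsys) { toy_init_##subsys, toy_done_##subsys }

static const struct toy_pair toy_stages[] = {
	TOY_PAIR(sinfo),
	TOY_PAIR(context),
	TOY_PAIR(tree),
};

#define TOY_NR_STAGES (sizeof(toy_stages) / sizeof(toy_stages[0]))

/* Run every init in order; on failure undo the completed ones in reverse. */
static int toy_mount(struct toy_sb *s)
{
	size_t i;
	int ret;

	for (i = 0; i < TOY_NR_STAGES; i++) {
		ret = toy_stages[i].init(s);
		if (ret != 0) {
			while (i-- > 0)
				toy_stages[i].done(s);
			return ret;
		}
	}
	return 0;
}
```

Keeping the pairs in a table means the teardown order is derived from the setup order rather than maintained by hand in a chain of error labels.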
+_INIT_(context) +{ + return init_context(ctx, s); +} + +_DONE_(context) +{ + reiser4_super_info_data * sbinfo; + + sbinfo = get_super_private(s); + + /* we don't want ->write_super to be called any more. */ + if (s->s_op) + s->s_op->write_super = NULL; +#if REISER4_DEBUG + { + struct list_head *scan; + + /* print jnodes that survived umount. */ + list_for_each(scan, &sbinfo->all_jnodes) { + jnode *busy; + + busy = list_entry(scan, jnode, jnodes); + info_jnode("\nafter umount", busy); + } + } + if (sbinfo->kmallocs > 0) + warning("nikita-2622", + "%i areas still allocated", sbinfo->kmallocs); +#endif + + get_current_context()->trans = NULL; + done_context(get_current_context()); +} + +_INIT_(parse_options) +{ + return reiser4_parse_options(s, data); +} + +_DONE_(parse_options) +{ + return; +} + +_INIT_(object_ops) +{ + build_object_ops(s, &get_super_private(s)->ops); + return 0; +} + +_DONE_EMPTY(object_ops) + +_INIT_(read_super) +{ + struct buffer_head *super_bh; + struct reiser4_master_sb *master_sb; + int plugin_id; + reiser4_super_info_data * sbinfo = get_super_private(s); + unsigned long blocksize; + + read_super_block: +#ifdef CONFIG_REISER4_BADBLOCKS + if ( sbinfo->altsuper ) + super_bh = sb_bread(s, (sector_t) (sbinfo->altsuper >> s->s_blocksize_bits)); + else +#endif + /* look for reiser4 magic at hardcoded place */ + super_bh = sb_bread(s, (sector_t) (REISER4_MAGIC_OFFSET / s->s_blocksize)); + + if (!super_bh) + return RETERR(-EIO); + + master_sb = (struct reiser4_master_sb *) super_bh->b_data; + /* check reiser4 magic string */ + if (!strncmp(master_sb->magic, REISER4_SUPER_MAGIC_STRING, sizeof(REISER4_SUPER_MAGIC_STRING))) { + /* reset block size if it is not a right one FIXME-VS: better comment is needed */ + blocksize = d16tocpu(&master_sb->blocksize); + + if (blocksize != PAGE_CACHE_SIZE) { + if (!silent) + warning("nikita-2609", "%s: wrong block size %ld\n", s->s_id, blocksize); + brelse(super_bh); + return RETERR(-EINVAL); + } + if (blocksize != 
s->s_blocksize) { + brelse(super_bh); + if (!sb_set_blocksize(s, (int) blocksize)) { + return RETERR(-EINVAL); + } + goto read_super_block; + } + + plugin_id = d16tocpu(&master_sb->disk_plugin_id); + /* only two plugins are available for now */ + assert("vs-476", plugin_id == FORMAT40_ID); + sbinfo->df_plug = disk_format_plugin_by_id(plugin_id); + sbinfo->diskmap_block = d64tocpu(&master_sb->diskmap); + brelse(super_bh); + } else { + if (!silent) { + warning("nikita-2608", "%s: wrong master super block magic.", s->s_id); + } + + /* no standard reiser4 super block found */ + brelse(super_bh); + /* FIXME-VS: call guess method for all available layout + plugins */ + /* umka (2002.06.12) Is it possible when format-specific super + block exists but there no master super block? */ + return RETERR(-EINVAL); + } + return 0; +} + +_DONE_EMPTY(read_super) + +_INIT_(tree0) +{ + reiser4_super_info_data * sbinfo = get_super_private(s); + + init_tree_0(&sbinfo->tree); + sbinfo->tree.super = s; + return 0; +} + +_DONE_EMPTY(tree0) + +_INIT_(txnmgr) +{ + txnmgr_init(&get_super_private(s)->tmgr); + return 0; +} + +_DONE_(txnmgr) +{ + txnmgr_done(&get_super_private(s)->tmgr); +} + +_INIT_(ktxnmgrd_context) +{ + return init_ktxnmgrd_context(&get_super_private(s)->tmgr); +} + +_DONE_(ktxnmgrd_context) +{ + done_ktxnmgrd_context(&get_super_private(s)->tmgr); +} + +_INIT_(ktxnmgrd) +{ + return start_ktxnmgrd(&get_super_private(s)->tmgr); +} + +_DONE_(ktxnmgrd) +{ + stop_ktxnmgrd(&get_super_private(s)->tmgr); +} + +_INIT_(formatted_fake) +{ + return init_formatted_fake(s); +} + +_DONE_(formatted_fake) +{ + reiser4_super_info_data * sbinfo; + + sbinfo = get_super_private(s); + + rcu_barrier(); + + /* done_formatted_fake just has finished with last jnodes (bitmap + * ones) */ + done_tree(&sbinfo->tree); + /* call finish_rcu(), because some znode were "released" in + * done_tree(). 
*/ + rcu_barrier(); + done_formatted_fake(s); +} + +_INIT_(entd) +{ + init_entd_context(s); + return 0; +} + +_DONE_(entd) +{ + done_entd_context(s); +} + +_DONE_(disk_format); + +_INIT_(disk_format) +{ + return get_super_private(s)->df_plug->get_ready(s, data); +} + +_DONE_(disk_format) +{ + reiser4_super_info_data *sbinfo = get_super_private(s); + + sbinfo->df_plug->release(s); +} + +_INIT_(sb_counters) +{ + /* There are some 'committed' versions of reiser4 super block + counters, which correspond to reiser4 on-disk state. These counters + are initialized here */ + reiser4_super_info_data *sbinfo = get_super_private(s); + + sbinfo->blocks_free_committed = sbinfo->blocks_free; + sbinfo->nr_files_committed = oids_used(s); + + return 0; +} + +_DONE_EMPTY(sb_counters) + +_INIT_(d_cursor) +{ + /* this should be done before reading inode of root directory, because + * reiser4_iget() used load_cursors(). */ + return d_cursor_init_at(s); +} + +_DONE_(d_cursor) +{ + d_cursor_done_at(s); +} + +static struct { + reiser4_plugin_type type; + reiser4_plugin_id id; +} default_plugins[PSET_LAST] = { + [PSET_FILE] = { + .type = REISER4_FILE_PLUGIN_TYPE, + .id = UNIX_FILE_PLUGIN_ID + }, + [PSET_DIR] = { + .type = REISER4_DIR_PLUGIN_TYPE, + .id = HASHED_DIR_PLUGIN_ID + }, + [PSET_HASH] = { + .type = REISER4_HASH_PLUGIN_TYPE, + .id = R5_HASH_ID + }, + [PSET_FIBRATION] = { + .type = REISER4_FIBRATION_PLUGIN_TYPE, + .id = FIBRATION_DOT_O + }, + [PSET_PERM] = { + .type = REISER4_PERM_PLUGIN_TYPE, + .id = RWX_PERM_ID + }, + [PSET_FORMATTING] = { + .type = REISER4_FORMATTING_PLUGIN_TYPE, + .id = SMALL_FILE_FORMATTING_ID + }, + [PSET_SD] = { + .type = REISER4_ITEM_PLUGIN_TYPE, + .id = STATIC_STAT_DATA_ID + }, + [PSET_DIR_ITEM] = { + .type = REISER4_ITEM_PLUGIN_TYPE, + .id = COMPOUND_DIR_ID + }, + [PSET_CRYPTO] = { + .type = REISER4_CRYPTO_PLUGIN_TYPE, + .id = NONE_CRYPTO_ID + }, + [PSET_DIGEST] = { + .type = REISER4_DIGEST_PLUGIN_TYPE, + .id = NONE_DIGEST_ID + }, + [PSET_COMPRESSION] = { 
+ .type = REISER4_COMPRESSION_PLUGIN_TYPE, + .id = NONE_COMPRESSION_ID + } +}; + +/* access to default plugin table */ +reiser4_internal reiser4_plugin * +get_default_plugin(pset_member memb) +{ + return plugin_by_id(default_plugins[memb].type, default_plugins[memb].id); +} + +_INIT_(fs_root) +{ + reiser4_super_info_data *sbinfo = get_super_private(s); + struct inode * inode; + int result = 0; + + inode = reiser4_iget(s, sbinfo->df_plug->root_dir_key(s), 0); + if (IS_ERR(inode)) + return RETERR(PTR_ERR(inode)); + + s->s_root = d_alloc_root(inode); + if (!s->s_root) { + iput(inode); + return RETERR(-ENOMEM); + } + + s->s_root->d_op = &sbinfo->ops.dentry; + + if (!is_inode_loaded(inode)) { + pset_member memb; + + for (memb = 0; memb < PSET_LAST; ++ memb) { + reiser4_plugin *plug; + + plug = get_default_plugin(memb); + result = grab_plugin_from(inode, memb, plug); + if (result != 0) + break; + } + + if (result == 0) { + if (REISER4_DEBUG) { + plugin_set *pset; + + pset = reiser4_inode_data(inode)->pset; + for (memb = 0; memb < PSET_LAST; ++ memb) + assert("nikita-3500", + pset_get(pset, memb) != NULL); + } + } else + warning("nikita-3448", "Cannot set plugins of root: %i", + result); + reiser4_iget_complete(inode); + } + s->s_maxbytes = MAX_LFS_FILESIZE; + return result; +} + +_DONE_(fs_root) +{ + shrink_dcache_parent(s->s_root); + assert("vs-1714", hlist_empty(&s->s_anon)); + dput(s->s_root); + s->s_root = NULL; + invalidate_inodes(s); + +} + +_INIT_(safelink) +{ + process_safelinks(s); + /* failure to process safe-links is not critical. Continue with + * mount. 
*/ + return 0; +} + +_DONE_(safelink) +{ +} + +_INIT_(exit_context) +{ + reiser4_exit_context(ctx); + return 0; +} + +_DONE_EMPTY(exit_context) + +struct reiser4_subsys { + int (*init) _INIT_PARAM_LIST; + void (*done) _DONE_PARAM_LIST; +}; + +#define _SUBSYS(subsys) {.init = &_init_##subsys, .done = &_done_##subsys} +static struct reiser4_subsys subsys_array[] = { + _SUBSYS(mount_flags_check), + _SUBSYS(sinfo), + _SUBSYS(context), + _SUBSYS(parse_options), + _SUBSYS(object_ops), + _SUBSYS(read_super), + _SUBSYS(tree0), + _SUBSYS(txnmgr), + _SUBSYS(ktxnmgrd_context), + _SUBSYS(ktxnmgrd), + _SUBSYS(entd), + _SUBSYS(formatted_fake), + _SUBSYS(disk_format), + _SUBSYS(sb_counters), + _SUBSYS(d_cursor), + _SUBSYS(fs_root), + _SUBSYS(safelink), + _SUBSYS(exit_context) +}; + +#define REISER4_NR_SUBSYS (sizeof(subsys_array) / sizeof(struct reiser4_subsys)) + +static void done_super (struct super_block * s, int last_done) +{ + int i; + for (i = last_done; i >= 0; i--) + subsys_array[i].done(s); +} + +/* read super block from device and fill remaining fields in @s. + + This is read_super() of the past. */ +reiser4_internal int +reiser4_fill_super (struct super_block * s, void * data, int silent) +{ + reiser4_context ctx; + int i; + int ret; + + assert ("zam-989", s != NULL); + + for (i = 0; i < REISER4_NR_SUBSYS; i++) { + ret = subsys_array[i].init(s, &ctx, data, silent); + if (ret) { + done_super(s, i - 1); + return ret; + } + } + return 0; +} + +#if 0 + +int reiser4_done_super (struct super_block * s) +{ + reiser4_context ctx; + + init_context(&ctx, s); + done_super(s, REISER4_NR_SUBSYS - 1); + return 0; +} + +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 80 + End: +*/ diff -puN /dev/null fs/reiser4/init_super.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/init_super.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,4 @@ +/* Copyright by Hans Reiser, 2003 */ + +extern int reiser4_fill_super (struct super_block * s, void * data, int silent); +extern int reiser4_done_super (struct super_block * s); diff -puN /dev/null fs/reiser4/inode.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/inode.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,771 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Inode specific operations. */ + +#include "forward.h" +#include "debug.h" +#include "key.h" +#include "kassign.h" +#include "coord.h" +#include "seal.h" +#include "dscale.h" +#include "plugin/item/item.h" +#include "plugin/security/perm.h" +#include "plugin/plugin.h" +#include "plugin/object.h" +#include "plugin/dir/dir.h" +#include "znode.h" +#include "vfs_ops.h" +#include "inode.h" +#include "super.h" +#include "reiser4.h" + +#include /* for struct super_block, address_space */ + +/* return reiser4 internal tree which inode belongs to */ +/* Audited by: green(2002.06.17) */ +reiser4_internal reiser4_tree * +tree_by_inode(const struct inode * inode /* inode queried */ ) +{ + assert("nikita-256", inode != NULL); + assert("nikita-257", inode->i_sb != NULL); + return get_tree(inode->i_sb); +} + +/* return reiser4-specific inode flags */ +static inline unsigned long * +inode_flags(const struct inode * const inode) +{ + assert("nikita-2842", inode != NULL); + return &reiser4_inode_data(inode)->flags; +} + +/* set reiser4-specific flag @f in @inode */ +reiser4_internal void +inode_set_flag(struct inode * inode, reiser4_file_plugin_flags f) +{ + assert("nikita-2248", inode != NULL); + set_bit((int) f, inode_flags(inode)); +} + +/* clear reiser4-specific flag @f in @inode 
*/ +reiser4_internal void +inode_clr_flag(struct inode * inode, reiser4_file_plugin_flags f) +{ + assert("nikita-2250", inode != NULL); + clear_bit((int) f, inode_flags(inode)); +} + +/* true if reiser4-specific flag @f is set in @inode */ +reiser4_internal int +inode_get_flag(const struct inode * inode, reiser4_file_plugin_flags f) +{ + assert("nikita-2251", inode != NULL); + return test_bit((int) f, inode_flags(inode)); +} + +/* convert oid to inode number */ +reiser4_internal ino_t oid_to_ino(oid_t oid) +{ + return (ino_t) oid; +} + +/* convert oid to user visible inode number */ +reiser4_internal ino_t oid_to_uino(oid_t oid) +{ + /* reiser4 object is uniquely identified by oid which is 64 bit + quantity. Kernel in-memory inode is indexed (in the hash table) by + 32 bit i_ino field, but this is not a problem, because there is a + way to further distinguish inodes with identical inode numbers + (find_actor supplied to iget()). + + But user space expects unique 32 bit inode number. Obviously this + is impossible. Work-around is to somehow hash oid into user visible + inode number. + */ + oid_t max_ino = (ino_t) ~ 0; + + if (REISER4_INO_IS_OID || (oid <= max_ino)) + return oid; + else + /* this is remotely similar to algorithm used to find next pid + to use for process: after wrap-around start from some + offset rather than from 0. Idea is that there are some long + living objects with which we don't want to collide. + */ + return REISER4_UINO_SHIFT + ((oid - max_ino) & (max_ino >> 1)); +} + +/* check that "inode" is on reiser4 file-system */ +reiser4_internal int +is_reiser4_inode(const struct inode *inode /* inode queried */ ) +{ + return + inode != NULL && + (is_reiser4_super(inode->i_sb) || + inode->i_op == &reiser4_inode_operations); + +} + +/* Maximal length of a name that can be stored in directory @inode. + + This is used in check during file creation and lookup. 
*/ +reiser4_internal int +reiser4_max_filename_len(const struct inode *inode /* inode queried */ ) +{ + assert("nikita-287", is_reiser4_inode(inode)); + assert("nikita-1710", inode_dir_item_plugin(inode)); + if (inode_dir_item_plugin(inode)->s.dir.max_name_len) + return inode_dir_item_plugin(inode)->s.dir.max_name_len(inode); + else + return 255; +} + +/* Maximal number of hash collisions for this directory. */ +reiser4_internal int +max_hash_collisions(const struct inode *dir /* inode queried */ ) +{ + assert("nikita-1711", dir != NULL); +#if REISER4_USE_COLLISION_LIMIT + return reiser4_inode_data(dir)->plugin.max_collisions; +#else + (void) dir; + return ~0; +#endif +} + +/* Install file, inode, and address_space operations on @inode, depending on + its mode. */ +reiser4_internal int +setup_inode_ops(struct inode *inode /* inode to initialise */ , + reiser4_object_create_data * data /* parameters to create + * object */ ) +{ + reiser4_super_info_data *sinfo; + + sinfo = get_super_private(inode->i_sb); + + switch (inode->i_mode & S_IFMT) { + case S_IFSOCK: + case S_IFBLK: + case S_IFCHR: + case S_IFIFO:{ + dev_t rdev; /* to keep gcc happy */ + + /* ugly hack with rdev */ + if (data == NULL) { + rdev = inode->i_rdev; + inode->i_rdev = 0; + } else + rdev = data->rdev; + inode->i_blocks = 0; + inode->i_op = &sinfo->ops.special; + /* other fields are already initialised. 
*/ + init_special_inode(inode, inode->i_mode, rdev); + break; + } + case S_IFLNK: + inode->i_op = &sinfo->ops.symlink; + inode->i_fop = NULL; + inode->i_mapping->a_ops = &sinfo->ops.as; + break; + case S_IFDIR: + inode->i_op = &sinfo->ops.dir; + inode->i_fop = &sinfo->ops.file; + inode->i_mapping->a_ops = &sinfo->ops.as; + break; + case S_IFREG: + inode->i_op = &sinfo->ops.regular; + inode->i_fop = &sinfo->ops.file; + inode->i_mapping->a_ops = &sinfo->ops.as; + break; + default: + warning("nikita-291", "wrong file mode: %o for %llu", inode->i_mode, + (unsigned long long)get_inode_oid(inode)); + reiser4_make_bad_inode(inode); + return RETERR(-EINVAL); + } + return 0; +} + +/* initialise inode from disk data. Called with inode locked. + Return inode locked. */ +static int +init_inode(struct inode *inode /* inode to initialise */ , + coord_t * coord /* coord of stat data */ ) +{ + int result; + item_plugin *iplug; + void *body; + int length; + reiser4_inode *state; + + assert("nikita-292", coord != NULL); + assert("nikita-293", inode != NULL); + + coord_clear_iplug(coord); + result = zload(coord->node); + if (result) + return result; + iplug = item_plugin_by_coord(coord); + body = item_body_by_coord(coord); + length = item_length_by_coord(coord); + + assert("nikita-295", iplug != NULL); + assert("nikita-296", body != NULL); + assert("nikita-297", length > 0); + + /* inode is under I_LOCK now */ + + state = reiser4_inode_data(inode); + /* call stat-data plugin method to load sd content into inode */ + result = iplug->s.sd.init_inode(inode, body, length); + plugin_set_sd(&state->pset, iplug); + if (result == 0) { + result = setup_inode_ops(inode, NULL); + if (result == 0 && + inode->i_sb->s_root && inode->i_sb->s_root->d_inode) { + struct inode *root; + pset_member ind; + + /* take missing plugins from file-system defaults */ + root = inode->i_sb->s_root->d_inode; + /* file and directory plugins are already initialised. 
*/ + for (ind = PSET_DIR + 1; ind < PSET_LAST; ++ind) { + result = grab_plugin(inode, root, ind); + if (result != 0) + break; + } + if (result != 0) { + warning("nikita-3447", + "Cannot set up plugins for %llu", + (unsigned long long)get_inode_oid(inode)); + } + } + } + zrelse(coord->node); + return result; +} + +/* read `inode' from the disk. This is what was previously in + reiserfs_read_inode2(). + + Must be called with inode locked. Return inode still locked. +*/ +static int +read_inode(struct inode *inode /* inode to read from disk */ , + const reiser4_key * key /* key of stat data */, + int silent) +{ + int result; + lock_handle lh; + reiser4_inode *info; + coord_t coord; + + assert("nikita-298", inode != NULL); + assert("nikita-1945", !is_inode_loaded(inode)); + + info = reiser4_inode_data(inode); + assert("nikita-300", info->locality_id != 0); + + coord_init_zero(&coord); + init_lh(&lh); + /* locate stat-data in a tree and return znode locked */ + result = lookup_sd(inode, ZNODE_READ_LOCK, &coord, &lh, key, silent); + assert("nikita-301", !is_inode_loaded(inode)); + if (result == 0) { + /* use stat-data plugin to load sd into inode. */ + result = init_inode(inode, &coord); + if (result == 0) { + /* initialize stat-data seal */ + spin_lock_inode(inode); + seal_init(&info->sd_seal, &coord, key); + info->sd_coord = coord; + spin_unlock_inode(inode); + + /* call file plugin's method to initialize plugin + * specific part of inode */ + if (inode_file_plugin(inode)->init_inode_data) + inode_file_plugin(inode)->init_inode_data(inode, + NULL, + 0); + /* load detached directory cursors for stateless + * directory readers (NFS). */ + load_cursors(inode); + + /* Check the opened inode for consistency. 
*/ + result = get_super_private(inode->i_sb)->df_plug->check_open(inode); + } + } + /* lookup_sd() doesn't release coord because we want znode + stay read-locked while stat-data fields are accessed in + init_inode() */ + done_lh(&lh); + + if (result != 0) + reiser4_make_bad_inode(inode); + return result; +} + +/* initialise new reiser4 inode being inserted into hash table. */ +static int +init_locked_inode(struct inode *inode /* new inode */ , + void *opaque /* key of stat data passed to the + * iget5_locked as cookie */ ) +{ + reiser4_key *key; + + assert("nikita-1995", inode != NULL); + assert("nikita-1996", opaque != NULL); + key = opaque; + set_inode_oid(inode, get_key_objectid(key)); + reiser4_inode_data(inode)->locality_id = get_key_locality(key); + return 0; +} + +/* reiser4_inode_find_actor() - "find actor" supplied by reiser4 to iget5_locked(). + + This function is called by iget5_locked() to distinguish reiser4 inodes + having the same inode numbers. Such inodes can only exist due to some error + condition. One of them should be bad. Inodes with identical inode numbers + (objectids) are distinguished by their packing locality. + +*/ +static int +reiser4_inode_find_actor(struct inode *inode /* inode from hash table to + * check */ , + void *opaque /* "cookie" passed to + * iget5_locked(). This is stat data + * key */ ) +{ + reiser4_key *key; + + key = opaque; + return + /* oid is unique, so first term is enough, actually. */ + get_inode_oid(inode) == get_key_objectid(key) && + /* + * also, locality should be checked, but locality is stored in + * the reiser4-specific part of the inode, and actor can be + * called against arbitrary inode that happened to be in this + * hash chain. Hence we first have to check that this is + * reiser4 inode at least. is_reiser4_inode() is probably too + * early to call, as inode may have ->i_op not yet + * initialised. 
+ */ + is_reiser4_super(inode->i_sb) && + /* + * usually objectid is unique, but pseudo files use counter to + * generate objectid. All pseudo files are placed into special + * (otherwise unused) locality. + */ + reiser4_inode_data(inode)->locality_id == get_key_locality(key); +} + +/* hook for kmem_cache_create */ +void loading_init_once(reiser4_inode *info) +{ + sema_init(&info->loading, 1); +} + +/* for reiser4_alloc_inode */ +void loading_alloc(reiser4_inode *info) +{ +#if REISER4_DEBUG + assert("vs-1717", down_trylock(&info->loading) == 0); + up(&info->loading); +#endif +} + +/* for reiser4_destroy */ +void loading_destroy(reiser4_inode *info) +{ +#if REISER4_DEBUG + assert("vs-1717", down_trylock(&info->loading) == 0); + up(&info->loading); +#endif +} + +static void loading_down(reiser4_inode *info) +{ + down(&info->loading); +} + +static void loading_up(reiser4_inode *info) +{ + up(&info->loading); +} + +/* + * this is our helper function a la iget(). This is called by + * reiser4_lookup() and reiser4_read_super(). Return inode locked or error + * encountered. + */ +reiser4_internal struct inode * +reiser4_iget(struct super_block *super /* super block */ , + const reiser4_key * key /* key of inode's stat-data */, + int silent) +{ + struct inode *inode; + int result; + reiser4_inode *info; + + assert("nikita-302", super != NULL); + assert("nikita-303", key != NULL); + + result = 0; + + /* call iget(). 
Our ->read_inode() is dummy, so this will either + find inode in cache or return uninitialised inode */ + inode = iget5_locked(super, + (unsigned long) get_key_objectid(key), + reiser4_inode_find_actor, + init_locked_inode, + (reiser4_key *) key); + if (inode == NULL) + return ERR_PTR(RETERR(-ENOMEM)); + if (is_bad_inode(inode)) { + warning("nikita-304", "Bad inode found"); + print_key("key", key); + iput(inode); + return ERR_PTR(RETERR(-EIO)); + } + + info = reiser4_inode_data(inode); + + /* Reiser4 inode state bit REISER4_LOADED is used to distinguish fully + loaded and initialized inode from just allocated inode. If + REISER4_LOADED bit is not set, reiser4_iget() completes loading under + info->loading. The place in reiser4 which uses not initialized inode + is the reiser4 repacker, see repacker-related functions in + plugin/item/extent.c */ + if (!is_inode_loaded(inode)) { + loading_down(info); + if (!is_inode_loaded(inode)) { + /* locking: iget5_locked returns locked inode */ + assert("nikita-1941", !is_inode_loaded(inode)); + assert("nikita-1949", + reiser4_inode_find_actor(inode, + (reiser4_key *)key)); + /* now, inode has objectid as ->i_ino and locality in + reiser4-specific part. 
This is enough for + read_inode() to read stat data from the disk */ + result = read_inode(inode, key, silent); + } else + loading_up(info); + } + + if (inode->i_state & I_NEW) + unlock_new_inode(inode); + + if (is_bad_inode(inode)) { + assert("vs-1717", result != 0); + loading_up(info); + iput(inode); + inode = ERR_PTR(result); + } else if (REISER4_DEBUG) { + reiser4_key found_key; + + assert("vs-1717", result == 0); + build_sd_key(inode, &found_key); + if (!keyeq(&found_key, key)) { + warning("nikita-305", "Wrong key in sd"); + print_key("sought for", key); + print_key("found", &found_key); + } + if (inode_file_plugin(inode)->not_linked(inode)) { + warning("nikita-3559", "Unlinked inode found: %llu\n", + (unsigned long long)get_inode_oid(inode)); + } + } + return inode; +} + +/* reiser4_iget() may return not fully initialized inode, this function should + * be called after one completes reiser4 inode initializing. */ +reiser4_internal void reiser4_iget_complete (struct inode * inode) +{ + assert("zam-988", is_reiser4_inode(inode)); + + if (!is_inode_loaded(inode)) { + inode_set_flag(inode, REISER4_LOADED); + loading_up(reiser4_inode_data(inode)); + } +} + +reiser4_internal void +reiser4_make_bad_inode(struct inode *inode) +{ + assert("nikita-1934", inode != NULL); + + /* clear LOADED bit */ + inode_clr_flag(inode, REISER4_LOADED); + make_bad_inode(inode); + return; +} + +reiser4_internal file_plugin * +inode_file_plugin(const struct inode * inode) +{ + assert("nikita-1997", inode != NULL); + return reiser4_inode_data(inode)->pset->file; +} + +reiser4_internal dir_plugin * +inode_dir_plugin(const struct inode * inode) +{ + assert("nikita-1998", inode != NULL); + return reiser4_inode_data(inode)->pset->dir; +} + +reiser4_internal perm_plugin * +inode_perm_plugin(const struct inode * inode) +{ + assert("nikita-1999", inode != NULL); + return reiser4_inode_data(inode)->pset->perm; +} + +reiser4_internal formatting_plugin * +inode_formatting_plugin(const struct inode 
* inode) +{ + assert("nikita-2000", inode != NULL); + return reiser4_inode_data(inode)->pset->formatting; +} + +reiser4_internal hash_plugin * +inode_hash_plugin(const struct inode * inode) +{ + assert("nikita-2001", inode != NULL); + return reiser4_inode_data(inode)->pset->hash; +} + +reiser4_internal fibration_plugin * +inode_fibration_plugin(const struct inode * inode) +{ + assert("nikita-2001", inode != NULL); + return reiser4_inode_data(inode)->pset->fibration; +} + +reiser4_internal crypto_plugin * +inode_crypto_plugin(const struct inode * inode) +{ + assert("edward-36", inode != NULL); + return reiser4_inode_data(inode)->pset->crypto; +} + +reiser4_internal compression_plugin * +inode_compression_plugin(const struct inode * inode) +{ + assert("edward-37", inode != NULL); + return reiser4_inode_data(inode)->pset->compression; +} + +reiser4_internal digest_plugin * +inode_digest_plugin(const struct inode * inode) +{ + assert("edward-86", inode != NULL); + return reiser4_inode_data(inode)->pset->digest; +} + +reiser4_internal item_plugin * +inode_sd_plugin(const struct inode * inode) +{ + assert("vs-534", inode != NULL); + return reiser4_inode_data(inode)->pset->sd; +} + +reiser4_internal item_plugin * +inode_dir_item_plugin(const struct inode * inode) +{ + assert("vs-534", inode != NULL); + return reiser4_inode_data(inode)->pset->dir_item; +} + +reiser4_internal void +inode_set_extension(struct inode *inode, sd_ext_bits ext) +{ + reiser4_inode *state; + + assert("nikita-2716", inode != NULL); + assert("nikita-2717", ext < LAST_SD_EXTENSION); + assert("nikita-3491", + spin_inode_object_is_locked(reiser4_inode_data(inode))); + + state = reiser4_inode_data(inode); + state->extmask |= 1 << ext; + /* force re-calculation of stat-data length on next call to + update_sd(). 
*/ + inode_clr_flag(inode, REISER4_SDLEN_KNOWN); +} + +reiser4_internal void +inode_set_plugin(struct inode *inode, reiser4_plugin * plug, pset_member memb) +{ + assert("nikita-2718", inode != NULL); + assert("nikita-2719", plug != NULL); + + reiser4_inode_data(inode)->plugin_mask |= (1 << memb); +} + +reiser4_internal void +inode_check_scale_nolock(struct inode *inode, __u64 old, __u64 new) +{ + assert("edward-1287", inode != NULL); + if (!dscale_fit(old, new)) + inode_clr_flag(inode, REISER4_SDLEN_KNOWN); + return; +} + +reiser4_internal void +inode_check_scale(struct inode *inode, __u64 old, __u64 new) +{ + assert("nikita-2875", inode != NULL); + spin_lock_inode(inode); + inode_check_scale_nolock(inode, old, new); + spin_unlock_inode(inode); +} + +/* + * initialize ->ordering field of inode. This field defines how file stat-data + * and body is ordered within a tree with respect to other objects within the + * same parent directory. + */ +reiser4_internal void +init_inode_ordering(struct inode *inode, + reiser4_object_create_data *crd, int create) +{ + reiser4_key key; + + if (create) { + struct inode *parent; + + parent = crd->parent; + assert("nikita-3224", inode_dir_plugin(parent) != NULL); + inode_dir_plugin(parent)->build_entry_key(parent, + &crd->dentry->d_name, + &key); + } else { + coord_t *coord; + + coord = &reiser4_inode_data(inode)->sd_coord; + coord_clear_iplug(coord); + /* safe to use ->sd_coord, because node is under long term + * lock */ + WITH_DATA(coord->node, item_key_by_coord(coord, &key)); + } + + set_inode_ordering(inode, get_key_ordering(&key)); +} + +reiser4_internal znode * +inode_get_vroot(struct inode *inode) +{ + reiser4_block_nr blk; + znode *result; + reiser4_inode *info; + + info = reiser4_inode_data(inode); + LOCK_INODE(info); + blk = info->vroot; + UNLOCK_INODE(info); + if (!disk_addr_eq(&UBER_TREE_ADDR, &blk)) + result = zlook(tree_by_inode(inode), &blk); + else + result = NULL; + return result; +} + +reiser4_internal void 
+inode_set_vroot(struct inode *inode, znode *vroot) +{ + reiser4_inode *info; + + info = reiser4_inode_data(inode); + LOCK_INODE(info); + info->vroot = *znode_get_block(vroot); + UNLOCK_INODE(info); +} + +#if REISER4_DEBUG + +void +inode_invariant(const struct inode *inode) +{ + reiser4_inode * object; + + object = reiser4_inode_data(inode); + assert("nikita-3077", spin_inode_object_is_locked(object)); + + spin_lock_eflush(inode->i_sb); + + assert("nikita-3146", object->anonymous_eflushed >= 0 && object->captured_eflushed >= 0); + assert("nikita-3441", ergo(object->anonymous_eflushed > 0 || object->captured_eflushed > 0, + jnode_tree_by_reiser4_inode(object)->rnode != NULL)); + + spin_unlock_eflush(inode->i_sb); +} + +int +inode_has_no_jnodes(reiser4_inode *r4_inode) +{ + return jnode_tree_by_reiser4_inode(r4_inode)->rnode == NULL && + r4_inode->nr_jnodes == 0 && + r4_inode->captured_eflushed == 0 && + r4_inode->anonymous_eflushed == 0; +} + +void +mark_inode_update(struct inode *object, int immediate) +{ + int i; + int pos; + reiser4_context *ctx; + + ctx = get_current_context(); + for (i = 0, pos = -1; i < TRACKED_DELAYED_UPDATE; ++i) { + if (ctx->dirty[i].ino == object->i_ino) { + pos = i; + break; + } else if (ctx->dirty[i].ino == 0) + pos = i; + } + if (pos == -1) + ;/*warning("nikita-3402", "Too many delayed inode updates");*/ + else if (immediate) { + ctx->dirty[pos].ino = 0; + } else { + ctx->dirty[pos].ino = object->i_ino; + ctx->dirty[pos].delayed = 1; +#ifdef CONFIG_FRAME_POINTER + ctx->dirty[pos].stack[0] = __builtin_return_address(0); + ctx->dirty[pos].stack[1] = __builtin_return_address(1); + ctx->dirty[pos].stack[2] = __builtin_return_address(2); + ctx->dirty[pos].stack[3] = __builtin_return_address(3); +#endif + } +} + + +int +delayed_inode_updates(dirty_inode_info info) +{ + int i; + + for (i = 0; i < TRACKED_DELAYED_UPDATE; ++i) { + if (info[i].ino != 0 && info[i].delayed) + return 1; + } + return 0; +} + +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/inode.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/inode.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,424 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Inode functions. */ + +#if !defined( __REISER4_INODE_H__ ) +#define __REISER4_INODE_H__ + +#include "forward.h" +#include "debug.h" +#include "spin_macros.h" +#include "key.h" +#include "kcond.h" +#include "seal.h" +#include "plugin/plugin.h" +#include "plugin/cryptcompress.h" +#include "plugin/plugin_set.h" +#include "plugin/security/perm.h" +#include "plugin/pseudo/pseudo.h" +#include "vfs_ops.h" +#include "jnode.h" + +#include /* for __u?? , ino_t */ +#include /* for struct super_block, struct + * rw_semaphore, etc */ +#include +#include + +/* reiser4-specific inode flags. They are "transient" and are not + supposed to be stored on disk. Used to trace "state" of + inode +*/ +typedef enum { + /* this is light-weight inode, inheriting some state from its + parent */ + REISER4_LIGHT_WEIGHT = 0, + /* stat data wasn't yet created */ + REISER4_NO_SD = 1, + /* internal immutable flag. Currently is only used + to avoid race condition during file creation. + See comment in create_object(). */ + REISER4_IMMUTABLE = 2, + /* inode was read from storage */ + REISER4_LOADED = 3, + /* this bit is set for symlinks. inode->u.generic_ip points to target + name of symlink. */ + REISER4_GENERIC_PTR_USED = 4, + /* set if size of stat-data item for this inode is known. If this is + * set we can avoid recalculating size of stat-data on each update. 
*/ + REISER4_SDLEN_KNOWN = 5, + /* reiser4_inode->crypt points to the crypto stat */ + REISER4_CRYPTO_STAT_LOADED = 6, + /* reiser4_inode->cluster_shift makes sense */ + REISER4_CLUSTER_KNOWN = 7, + /* cryptcompress_inode_data points to the secret key */ + REISER4_SECRET_KEY_INSTALLED = 8, + /* File (possibly) has pages corresponding to the tail items, that + * were created by ->readpage. It is set by mmap_unix_file() and + * sendfile_unix_file(). This bit is inspected by write_unix_file and + * kill-hook of tail items. It is never cleared once set. This bit is + * modified and inspected under i_sem. */ + REISER4_HAS_MMAP = 9, + /* file was partially converted. Its body consists of a mix of tail + * and extent items. */ + REISER4_PART_CONV = 10, +} reiser4_file_plugin_flags; + +/* state associated with each inode. + reiser4 inode. + + NOTE-NIKITA In 2.5 kernels it is not necessary that all file-system inodes + be of the same size. File-system allocates inodes by itself through + s_op->allocate_inode() method. So, it is possible to adjust size of inode + at the time of its creation. + + + Invariants involving parts of this data-type: + + [inode->eflushed] + +*/ + +typedef struct reiser4_inode reiser4_inode; +/* return pointer to reiser4-specific part of inode */ +static inline reiser4_inode * +reiser4_inode_data(const struct inode * inode /* inode queried */); + +#include "plugin/file/file.h" + +#if BITS_PER_LONG == 64 + +#define REISER4_INO_IS_OID (1) +typedef struct {; +} oid_hi_t; + +/* BITS_PER_LONG == 64 */ +#else + +#define REISER4_INO_IS_OID (0) +typedef __u32 oid_hi_t; + +/* BITS_PER_LONG == 64 */ +#endif + +struct reiser4_inode { + /* spin lock protecting fields of this structure.
*/ + reiser4_spin_data guard; + /* object plugins */ + plugin_set *pset; + /* plugins set for inheritance */ + plugin_set *hset; + /* high 32 bits of object id */ + oid_hi_t oid_hi; + /* seal for stat-data */ + seal_t sd_seal; + /* locality id for this file */ + oid_t locality_id; +#if REISER4_LARGE_KEY + __u64 ordering; +#endif + /* coord of stat-data in sealed node */ + coord_t sd_coord; + /* bit-mask of stat-data extensions used by this file */ + __u64 extmask; + /* bitmask of non-default plugins for this inode */ + __u16 plugin_mask; + /* cluster parameter for crypto and compression */ + __u8 cluster_shift; + /* secret key parameter for crypto */ + crypto_stat_t *crypt; + + union { + readdir_list_head readdir_list; + struct list_head not_used; + } lists; + /* per-inode flags. Filled by values of reiser4_file_plugin_flags */ + unsigned long flags; + union { + /* fields specific to unix_file plugin */ + unix_file_info_t unix_file_info; + /* fields specific to cryptcompress plugin */ + cryptcompress_info_t cryptcompress_info; + /* fields specific to pseudo file plugin */ + pseudo_info_t pseudo_info; + } file_plugin_data; + struct rw_semaphore coc_sem; /* filemap_nopage takes it for read, copy_on_capture - for write. Under this it + tries to unmap page for which it is called. This prevents process from using page which + was copied on capture */ + + /* tree of jnodes. Phantom jnodes (ones not attached to any atom) are + tagged in that tree by EFLUSH_TAG_ANONYMOUS */ + struct radix_tree_root jnodes_tree; +#if REISER4_DEBUG + /* numbers of eflushed jnodes of each type in the above tree */ + int anonymous_eflushed; + int captured_eflushed; + /* number of unformatted node jnodes of this file in jnode hash table */ + unsigned long nr_jnodes; +#endif + + /* block number of virtual root for this object.
See comment above + * fs/reiser4/search.c:handle_vroot() */ + reiser4_block_nr vroot; + struct semaphore loading; +}; + +void loading_init_once(reiser4_inode *); +void loading_alloc(reiser4_inode *); +void loading_destroy(reiser4_inode *); + + +#define I_JNODES (512) /* inode state bit. Set when in hash table there are more than 0 jnodes of unformatted nodes of + an inode */ + +typedef struct reiser4_inode_object { + /* private part */ + reiser4_inode p; + /* generic fields not specific to reiser4, but used by VFS */ + struct inode vfs_inode; +} reiser4_inode_object; + +/* return pointer to the reiser4 specific portion of @inode */ +static inline reiser4_inode * +reiser4_inode_data(const struct inode * inode /* inode queried */) +{ + assert("nikita-254", inode != NULL); + return &container_of(inode, reiser4_inode_object, vfs_inode)->p; +} + +static inline struct inode * +inode_by_reiser4_inode(const reiser4_inode *r4_inode /* inode queried */) +{ + return &container_of(r4_inode, reiser4_inode_object, p)->vfs_inode; +} + +/* + * reiser4 inodes are identified by 64bit object-id (oid_t), but in struct + * inode ->i_ino field is of type ino_t (long) that can be either 32 or 64 + * bits. + * + * If ->i_ino is 32 bits we store remaining 32 bits in reiser4 specific part + * of inode, otherwise whole oid is stored in i_ino. + * + * Wrappers below ([sg]et_inode_oid()) are used to hide this difference. 
+ */ + +#define OID_HI_SHIFT (sizeof(ino_t) * 8) + +#if REISER4_INO_IS_OID + +static inline oid_t +get_inode_oid(const struct inode *inode) +{ + return inode->i_ino; +} + +static inline void +set_inode_oid(struct inode *inode, oid_t oid) +{ + inode->i_ino = oid; +} + +/* REISER4_INO_IS_OID */ +#else + +static inline oid_t +get_inode_oid(const struct inode *inode) +{ + return + ((__u64)reiser4_inode_data(inode)->oid_hi << OID_HI_SHIFT) | + inode->i_ino; +} + +static inline void +set_inode_oid(struct inode *inode, oid_t oid) +{ + assert("nikita-2519", inode != NULL); + inode->i_ino = (ino_t)(oid); + reiser4_inode_data(inode)->oid_hi = (oid) >> OID_HI_SHIFT; + assert("nikita-2521", get_inode_oid(inode) == (oid)); +} + +/* REISER4_INO_IS_OID */ +#endif + +static inline oid_t +get_inode_locality(const struct inode *inode) +{ + return reiser4_inode_data(inode)->locality_id; +} + +#if REISER4_LARGE_KEY +static inline __u64 get_inode_ordering(const struct inode *inode) +{ + return reiser4_inode_data(inode)->ordering; +} + +static inline void set_inode_ordering(const struct inode *inode, __u64 ordering) +{ + reiser4_inode_data(inode)->ordering = ordering; +} + +#else + +#define get_inode_ordering(inode) (0) +#define set_inode_ordering(inode, val) noop + +#endif + +/* return inode in which @uf_info is embedded */ +static inline struct inode * +unix_file_info_to_inode(const unix_file_info_t *uf_info) +{ + return &container_of(uf_info, reiser4_inode_object, + p.file_plugin_data.unix_file_info)->vfs_inode; +} + +/* ordering predicate for inode spin lock: only jnode lock can be held */ +#define spin_ordering_pred_inode_object(inode) \ + ( lock_counters() -> rw_locked_dk == 0 ) && \ + ( lock_counters() -> rw_locked_tree == 0 ) && \ + ( lock_counters() -> spin_locked_txnh == 0 ) && \ + ( lock_counters() -> rw_locked_zlock == 0 ) && \ + ( lock_counters() -> spin_locked_jnode == 0 ) && \ + ( lock_counters() -> spin_locked_atom == 0 ) && \ + ( lock_counters() -> spin_locked_ktxnmgrd 
== 0 ) && \ + ( lock_counters() -> spin_locked_txnmgr == 0 ) + +SPIN_LOCK_FUNCTIONS(inode_object, reiser4_inode, guard); + +extern ino_t oid_to_ino(oid_t oid) __attribute__ ((const)); +extern ino_t oid_to_uino(oid_t oid) __attribute__ ((const)); + +extern reiser4_tree *tree_by_inode(const struct inode *inode); + +#if REISER4_DEBUG +extern void inode_invariant(const struct inode *inode); +extern int inode_has_no_jnodes(reiser4_inode *); +#else +#define inode_invariant(inode) noop +#endif + +#define spin_lock_inode(inode) \ +({ \ + LOCK_INODE(reiser4_inode_data(inode)); \ + inode_invariant(inode); \ +}) + +#define spin_unlock_inode(inode) \ +({ \ + inode_invariant(inode); \ + UNLOCK_INODE(reiser4_inode_data(inode)); \ +}) + +extern znode *inode_get_vroot(struct inode *inode); +extern void inode_set_vroot(struct inode *inode, znode *vroot); + +extern int reiser4_max_filename_len(const struct inode *inode); +extern int max_hash_collisions(const struct inode *dir); +extern void reiser4_unlock_inode(struct inode *inode); +extern int is_reiser4_inode(const struct inode *inode); +extern int setup_inode_ops(struct inode *inode, reiser4_object_create_data *); +extern struct inode *reiser4_iget(struct super_block *super, const reiser4_key * key, int silent); +extern void reiser4_iget_complete (struct inode * inode); +extern void inode_set_flag(struct inode *inode, reiser4_file_plugin_flags f); +extern void inode_clr_flag(struct inode *inode, reiser4_file_plugin_flags f); +extern int inode_get_flag(const struct inode *inode, reiser4_file_plugin_flags f); + +/* has inode been initialized? 
*/ +static inline int +is_inode_loaded(const struct inode *inode /* inode queried */ ) +{ + assert("nikita-1120", inode != NULL); + return inode_get_flag(inode, REISER4_LOADED); +} + +extern file_plugin *inode_file_plugin(const struct inode *inode); +extern dir_plugin *inode_dir_plugin(const struct inode *inode); +extern perm_plugin *inode_perm_plugin(const struct inode *inode); +extern formatting_plugin *inode_formatting_plugin(const struct inode *inode); +extern hash_plugin *inode_hash_plugin(const struct inode *inode); +extern fibration_plugin *inode_fibration_plugin(const struct inode *inode); +extern crypto_plugin *inode_crypto_plugin(const struct inode *inode); +extern digest_plugin *inode_digest_plugin(const struct inode *inode); +extern compression_plugin *inode_compression_plugin(const struct inode *inode); +extern item_plugin *inode_sd_plugin(const struct inode *inode); +extern item_plugin *inode_dir_item_plugin(const struct inode *inode); + +extern void inode_set_plugin(struct inode *inode, + reiser4_plugin * plug, pset_member memb); +extern void reiser4_make_bad_inode(struct inode *inode); + +extern void inode_set_extension(struct inode *inode, sd_ext_bits ext); +extern void inode_check_scale(struct inode *inode, __u64 old, __u64 new); + +/* + * update field @field in inode @i to contain value @value. + */ +#define INODE_SET_FIELD(i, field, value) \ +({ \ + struct inode *__i; \ + typeof(value) __v; \ + \ + __i = (i); \ + __v = (value); \ + inode_check_scale(__i, __i->field, __v); \ + __i->field = __v; \ +}) + +#define INODE_INC_FIELD(i, field) \ +({ \ + struct inode *__i; \ + \ + __i = (i); \ + inode_check_scale(__i, __i->field, __i->field + 1); \ + ++ __i->field; \ +}) + +#define INODE_DEC_FIELD(i, field) \ +({ \ + struct inode *__i; \ + \ + __i = (i); \ + inode_check_scale(__i, __i->field, __i->field - 1); \ + -- __i->field; \ +}) + +/* See comment before readdir_common() for description. 
*/ +static inline readdir_list_head * +get_readdir_list(const struct inode *inode) +{ + return &reiser4_inode_data(inode)->lists.readdir_list; +} + +extern void init_inode_ordering(struct inode *inode, + reiser4_object_create_data *crd, int create); + +static inline struct radix_tree_root * +jnode_tree_by_inode(struct inode *inode) +{ + return &reiser4_inode_data(inode)->jnodes_tree; +} + +static inline struct radix_tree_root * +jnode_tree_by_reiser4_inode(reiser4_inode *r4_inode) +{ + return &r4_inode->jnodes_tree; +} + +#if REISER4_DEBUG +extern void print_inode(const char *prefix, const struct inode *i); +#endif + +/* __REISER4_INODE_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/inode_ops.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/inode_ops.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,612 @@ +/* Copyright 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Interface to VFS. Reiser4 inode_operations are defined here. 
*/ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "coord.h" +#include "plugin/item/item.h" +#include "plugin/file/file.h" +#include "plugin/security/perm.h" +#include "plugin/disk_format/disk_format.h" +#include "plugin/plugin.h" +#include "plugin/plugin_set.h" +#include "plugin/object.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree.h" +#include "vfs_ops.h" +#include "inode.h" +#include "page_cache.h" +#include "ktxnmgrd.h" +#include "super.h" +#include "reiser4.h" +#include "entd.h" +#include "emergency_flush.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* inode operations */ + +static int reiser4_create(struct inode *, struct dentry *, int, + struct nameidata *); +static struct dentry *reiser4_lookup(struct inode *, struct dentry *, + struct nameidata *); +static int reiser4_link(struct dentry *, struct inode *, struct dentry *); +static int reiser4_unlink(struct inode *, struct dentry *); +static int reiser4_rmdir(struct inode *, struct dentry *); +static int reiser4_symlink(struct inode *, struct dentry *, const char *); +static int reiser4_mkdir(struct inode *, struct dentry *, int); +static int reiser4_mknod(struct inode *, struct dentry *, int, dev_t); +static int reiser4_rename(struct inode *, struct dentry *, struct inode *, struct dentry *); +static int reiser4_readlink(struct dentry *, char *, int); +static int reiser4_follow_link(struct dentry *, struct nameidata *); +static void reiser4_truncate(struct inode *); +static int reiser4_permission(struct inode *, int, struct nameidata *); +static int reiser4_setattr(struct dentry *, struct iattr *); +static int reiser4_getattr(struct vfsmount *mnt, struct dentry *, struct kstat *); + +#if 0 +static int reiser4_setxattr(struct dentry *, const char *, void *, size_t, int); +static ssize_t 
reiser4_getxattr(struct dentry *, const char *, void *, size_t); +static ssize_t reiser4_listxattr(struct dentry *, char *, size_t); +static int reiser4_removexattr(struct dentry *, const char *); +#endif + +reiser4_internal int invoke_create_method(struct inode *parent, + struct dentry *dentry, + reiser4_object_create_data * data); + +/* ->create() VFS method in reiser4 inode_operations */ +static int +reiser4_create(struct inode *parent /* inode of parent + * directory */, + struct dentry *dentry /* dentry of new object to + * create */, + int mode /* new object mode */, + struct nameidata *nameidata) +{ + reiser4_object_create_data data; + + memset(&data, 0, sizeof data); + data.mode = S_IFREG | mode; + data.id = UNIX_FILE_PLUGIN_ID; + return invoke_create_method(parent, dentry, &data); +} + +/* ->mkdir() VFS method in reiser4 inode_operations */ +static int +reiser4_mkdir(struct inode *parent /* inode of parent + * directory */ , + struct dentry *dentry /* dentry of new object to + * create */ , + int mode /* new object's mode */ ) +{ + reiser4_object_create_data data; + + data.mode = S_IFDIR | mode; + data.id = DIRECTORY_FILE_PLUGIN_ID; + return invoke_create_method(parent, dentry, &data); +} + +/* ->symlink() VFS method in reiser4 inode_operations */ +static int +reiser4_symlink(struct inode *parent /* inode of parent + * directory */ , + struct dentry *dentry /* dentry of new object to + * create */ , + const char *linkname /* pathname to put into + * symlink */ ) +{ + reiser4_object_create_data data; + + data.name = linkname; + data.id = SYMLINK_FILE_PLUGIN_ID; + data.mode = S_IFLNK | S_IRWXUGO; + return invoke_create_method(parent, dentry, &data); +} + +/* ->mknod() VFS method in reiser4 inode_operations */ +static int +reiser4_mknod(struct inode *parent /* inode of parent directory */ , + struct dentry *dentry /* dentry of new object to + * create */ , + int mode /* new object's mode */ , + dev_t rdev /* minor and major of new device node */ ) +{ + 
reiser4_object_create_data data; + + data.mode = mode; + data.rdev = rdev; + data.id = SPECIAL_FILE_PLUGIN_ID; + return invoke_create_method(parent, dentry, &data); +} + +/* ->rename() inode operation */ +static int +reiser4_rename(struct inode *old_dir, struct dentry *old, struct inode *new_dir, struct dentry *new) +{ + int result; + reiser4_context ctx; + + assert("nikita-2314", old_dir != NULL); + assert("nikita-2315", old != NULL); + assert("nikita-2316", new_dir != NULL); + assert("nikita-2317", new != NULL); + + init_context(&ctx, old_dir->i_sb); + + result = perm_chk(old_dir, rename, old_dir, old, new_dir, new); + if (result == 0) { + dir_plugin *dplug; + + dplug = inode_dir_plugin(old_dir); + if (dplug == NULL) + result = RETERR(-ENOTDIR); + else if (dplug->rename == NULL) + result = RETERR(-EPERM); + else + result = dplug->rename(old_dir, old, new_dir, new); + } + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); + return result; +} + +/* reiser4_lookup() - entry point for ->lookup() method. + + This is a wrapper for lookup_object which is a wrapper for the directory + plugin that does the lookup. + + This is installed in ->lookup() in reiser4_inode_operations. 
+*/ +static struct dentry * +reiser4_lookup(struct inode *parent, /* directory within which we are to + * look for the name specified in + * dentry */ + struct dentry *dentry, /* this contains the name that is to + be looked for on entry, and on exit + contains a filled in dentry with a + pointer to the inode (unless name + not found) */ + struct nameidata *nameidata) +{ + dir_plugin *dplug; + int retval; + struct dentry *result; + reiser4_context ctx; + int (*lookup) (struct inode * parent_inode, struct dentry **dentry); + + assert("nikita-403", parent != NULL); + assert("nikita-404", dentry != NULL); + + init_context(&ctx, parent->i_sb); + + /* find @parent directory plugin and make sure that it has lookup + method */ + dplug = inode_dir_plugin(parent); + if (dplug != NULL && dplug->lookup != NULL) + /* if parent directory has directory plugin with ->lookup + * method, use the latter to do lookup */ + lookup = dplug->lookup; +#if ENABLE_REISER4_PSEUDO + else if (!reiser4_is_set(parent->i_sb, REISER4_NO_PSEUDO)) + /* even if there is no ->lookup method, pseudo file lookup + * should still be performed, but only unless we are in + * "no-pseudo" mode */ + lookup = lookup_pseudo_file; +#endif /* ENABLE_REISER4_PSEUDO */ + else + lookup = NULL; + if (lookup != NULL) { + struct dentry *name; + + name = dentry; + /* call its lookup method */ + retval = lookup(parent, &name); + if (retval == 0) { + if (name == NULL) { + /* + * new object was looked up. Initialize it. 
+ */ + struct inode *obj; + file_plugin *fplug; + + obj = dentry->d_inode; + assert("nikita-2645", obj != NULL); + fplug = inode_file_plugin(obj); + retval = fplug->bind(obj, parent); + } + } else if (retval == -ENOENT) { + /* object not found */ + if (!IS_DEADDIR(parent)) + d_add(dentry, NULL); + retval = 0; + name = NULL; + } + + if (retval == 0) + /* success */ + result = name; + else + result = ERR_PTR(retval); + } else + result = ERR_PTR(-ENOTDIR); + + /* prevent balance_dirty_pages() from being called: we don't want to + * do this under directory i_sem. */ + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); + return result; +} + +/* ->readlink() inode method, returns content of symbolic link */ +static int +reiser4_readlink(struct dentry *dentry, char *buf, int buflen) +{ + assert("vs-852", S_ISLNK(dentry->d_inode->i_mode)); + if (!dentry->d_inode->u.generic_ip || !inode_get_flag(dentry->d_inode, REISER4_GENERIC_PTR_USED)) + return RETERR(-EINVAL); + return vfs_readlink(dentry, buf, buflen, dentry->d_inode->u.generic_ip); +} + +/* ->follow_link() inode method. Follows a symbolic link */ +static int +reiser4_follow_link(struct dentry *dentry, struct nameidata *data) +{ + assert("vs-851", S_ISLNK(dentry->d_inode->i_mode)); + + if (!dentry->d_inode->u.generic_ip || !inode_get_flag(dentry->d_inode, REISER4_GENERIC_PTR_USED)) + return RETERR(-EINVAL); + return vfs_follow_link(data, dentry->d_inode->u.generic_ip); +} + +/* ->setattr() inode operation + + Called from notify_change. 
*/ +static int +reiser4_setattr(struct dentry *dentry, struct iattr *attr) +{ + struct inode *inode; + int result; + reiser4_context ctx; + + assert("nikita-2269", attr != NULL); + + inode = dentry->d_inode; + assert("vs-1108", inode != NULL); + init_context(&ctx, inode->i_sb); + result = perm_chk(inode, setattr, dentry, attr); + if (result == 0) { + if (!inode_get_flag(inode, REISER4_IMMUTABLE)) { + file_plugin *fplug; + + fplug = inode_file_plugin(inode); + assert("nikita-2271", fplug != NULL); + assert("nikita-2296", fplug->setattr != NULL); + result = fplug->setattr(inode, attr); + } else + result = RETERR(-E_REPEAT); + } + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); + return result; +} + +/* ->getattr() inode operation called (indirectly) by sys_stat(). */ +static int +reiser4_getattr(struct vfsmount *mnt UNUSED_ARG, struct dentry *dentry, struct kstat *stat) +{ + struct inode *inode; + int result; + reiser4_context ctx; + + inode = dentry->d_inode; + init_context(&ctx, inode->i_sb); + result = perm_chk(inode, getattr, mnt, dentry, stat); + if (result == 0) { + file_plugin *fplug; + + fplug = inode_file_plugin(inode); + assert("nikita-2295", fplug != NULL); + assert("nikita-2297", fplug->getattr != NULL); + result = fplug->getattr(mnt, dentry, stat); + } + reiser4_exit_context(&ctx); + return result; +} + +/* helper function: call object plugin to truncate file to @size */ +static int +truncate_object(struct inode *inode /* object to truncate */ , + loff_t size /* size to truncate object to */ ) +{ + file_plugin *fplug; + int result; + + assert("nikita-1026", inode != NULL); + assert("nikita-1027", is_reiser4_inode(inode)); + assert("nikita-1028", inode->i_sb != NULL); + + fplug = inode_file_plugin(inode); + assert("vs-142", fplug != NULL); + + assert("nikita-2933", fplug->truncate != NULL); + result = fplug->truncate(inode, size); + if (result != 0) + warning("nikita-1602", "Truncate error: %i for %llu", result, + (unsigned long
long)get_inode_oid(inode)); + + return result; +} + +/* ->truncate() VFS method in reiser4 inode_operations */ +static void +reiser4_truncate(struct inode *inode /* inode to truncate */ ) +{ + reiser4_context ctx; + + assert("umka-075", inode != NULL); + + init_context(&ctx, inode->i_sb); + + truncate_object(inode, inode->i_size); + + /* for mysterious reasons ->truncate() VFS call doesn't return + value */ + reiser4_exit_context(&ctx); +} + +/* ->permission() method in reiser4_inode_operations. */ +static int +reiser4_permission(struct inode *inode /* object */ , + int mask, /* mode bits to check permissions + * for */ + struct nameidata *nameidata) +{ + /* reiser4_context creation/destruction removed from here, + because permission checks currently don't require this. + + Permission plugin have to create context itself if necessary. */ + assert("nikita-1687", inode != NULL); + + return perm_chk(inode, mask, inode, mask); +} + +/* common part of both unlink and rmdir. */ +static int +unlink_file(struct inode *parent /* parent directory */ , + struct dentry *victim /* name of object being + * unlinked */ ) +{ + int result; + dir_plugin *dplug; + reiser4_context ctx; + + init_context(&ctx, parent->i_sb); + + assert("nikita-1435", parent != NULL); + assert("nikita-1436", victim != NULL); + + dplug = inode_dir_plugin(parent); + assert("nikita-1429", dplug != NULL); + if (dplug->unlink != NULL) + result = dplug->unlink(parent, victim); + else + result = RETERR(-EPERM); + /* @victim can be already removed from the disk by this time. Inode is + then marked so that iput() wouldn't try to remove stat data. But + inode itself is still there. + */ + /* we cannot release directory semaphore here, because name has + * already been deleted, but dentry (@victim) still exists. */ + /* prevent balance_dirty_pages() from being called: we don't want to + * do this under directory i_sem. 
*/ + + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); + return result; +} + +/* ->unlink() VFS method in reiser4 inode_operations + + remove link from @parent directory to @victim object: delegate work + to object plugin +*/ +/* Audited by: umka (2002.06.12) */ +static int +reiser4_unlink(struct inode *parent /* parent directory */ , + struct dentry *victim /* name of object being + * unlinked */ ) +{ + assert("nikita-2011", parent != NULL); + assert("nikita-2012", victim != NULL); + assert("nikita-2013", victim->d_inode != NULL); + if (inode_dir_plugin(victim->d_inode) == NULL) + return unlink_file(parent, victim); + else + return RETERR(-EISDIR); +} + +/* ->rmdir() VFS method in reiser4 inode_operations + + The same as unlink, but only for directories. + +*/ +/* Audited by: umka (2002.06.12) */ +static int +reiser4_rmdir(struct inode *parent /* parent directory */ , + struct dentry *victim /* name of directory being + * unlinked */ ) +{ + assert("nikita-2014", parent != NULL); + assert("nikita-2015", victim != NULL); + assert("nikita-2016", victim->d_inode != NULL); + + if (inode_dir_plugin(victim->d_inode) != NULL) + /* there is no difference between unlink and rmdir for + reiser4 */ + return unlink_file(parent, victim); + else + return RETERR(-ENOTDIR); +} + +/* ->link() VFS method in reiser4 inode_operations + + entry point for ->link() method. + + This is installed as ->link inode operation for reiser4 + inodes. 
Delegates all work to object plugin +*/ +/* Audited by: umka (2002.06.12) */ +static int +reiser4_link(struct dentry *existing /* dentry of existing + * object */ , + struct inode *parent /* parent directory */ , + struct dentry *where /* new name for @existing */ ) +{ + int result; + dir_plugin *dplug; + reiser4_context ctx; + + assert("umka-080", existing != NULL); + assert("nikita-1031", parent != NULL); + + init_context(&ctx, parent->i_sb); + context_set_commit_async(&ctx); + + dplug = inode_dir_plugin(parent); + assert("nikita-1430", dplug != NULL); + if (dplug->link != NULL) { + result = dplug->link(parent, existing, where); + if (result == 0) + d_instantiate(where, existing->d_inode); + } else { + result = RETERR(-EPERM); + } + up(&existing->d_inode->i_sem); + up(&parent->i_sem); + reiser4_exit_context(&ctx); + down(&parent->i_sem); + down(&existing->d_inode->i_sem); + return result; +} + +/* call ->create() directory plugin method. */ +reiser4_internal int +invoke_create_method(struct inode *parent /* parent directory */ , + struct dentry *dentry /* dentry of new + * object */ , + reiser4_object_create_data * data /* information + * necessary + * to create + * new + * object */ ) +{ + int result; + dir_plugin *dplug; + reiser4_context ctx; + + init_context(&ctx, parent->i_sb); + context_set_commit_async(&ctx); + + assert("nikita-426", parent != NULL); + assert("nikita-427", dentry != NULL); + assert("nikita-428", data != NULL); + + dplug = inode_dir_plugin(parent); + if (dplug == NULL) + result = RETERR(-ENOTDIR); + else if (dplug->create_child != NULL) { + struct inode *child; + + child = NULL; + + data->parent = parent; + data->dentry = dentry; + + result = dplug->create_child(data, &child); + if (unlikely(result != 0)) { + if (child != NULL) { + /* + * what we actually want to check in the + * assertion below is that @child only + * contains items that iput()->... is going to + * remove (usually stat-data). 
Obvious check + * for child->i_size == 0 doesn't work for + * symlinks. + */ + assert("nikita-3140", S_ISLNK(child->i_mode) || + child->i_size == 0); + reiser4_make_bad_inode(child); + iput(child); + } + } else { + d_instantiate(dentry, child); + } + } else + result = RETERR(-EPERM); + + reiser4_exit_context(&ctx); + return result; +} + +struct inode_operations reiser4_inode_operations = { + .create = reiser4_create, /* d */ + .lookup = reiser4_lookup, /* d */ + .link = reiser4_link, /* d */ + .unlink = reiser4_unlink, /* d */ + .symlink = reiser4_symlink, /* d */ + .mkdir = reiser4_mkdir, /* d */ + .rmdir = reiser4_rmdir, /* d */ + .mknod = reiser4_mknod, /* d */ + .rename = reiser4_rename, /* d */ + .readlink = NULL, + .follow_link = NULL, + .truncate = reiser4_truncate, /* d */ + .permission = reiser4_permission, /* d */ + .setattr = reiser4_setattr, /* d */ + .getattr = reiser4_getattr, /* d */ +}; + +struct inode_operations reiser4_symlink_inode_operations = { + .setattr = reiser4_setattr, /* d */ + .getattr = reiser4_getattr, /* d */ + .readlink = reiser4_readlink, + .follow_link = reiser4_follow_link +}; + +struct inode_operations reiser4_special_inode_operations = { + .setattr = reiser4_setattr, /* d */ + .getattr = reiser4_getattr /* d */ +}; + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/ioctl.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/ioctl.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,41 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +#if !defined( __REISER4_IOCTL_H__ ) +#define __REISER4_IOCTL_H__ + +#include + +/* + * ioctl(2) command used to "unpack" reiser4 file, that is, convert it into + * extents and fix in this state. This is used by applications that rely on + * + * . files being block aligned, and + * + * . 
files never migrating on disk + * + * for example, boot loaders (LILO) need this. + * + * This ioctl should be used as + * + * result = ioctl(fd, REISER4_IOC_UNPACK); + * + * File behind fd descriptor will be converted to the extents (if necessary), + * and its stat-data will be updated so that it will never be converted back + * into tails again. + */ +#define REISER4_IOC_UNPACK _IOW(0xCD,1,long) + +/* __REISER4_IOCTL_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/jnode.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/jnode.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,2035 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ +/* Jnode manipulation functions. */ +/* Jnode is entity used to track blocks with data and meta-data in reiser4. + + In particular, jnodes are used to track transactional information + associated with each block. Each znode contains jnode as ->zjnode field. + + Jnode stands for either Josh or Journal node. +*/ + +/* + * Taxonomy. + * + * Jnode represents block containing data or meta-data. There are jnodes + * for: + * + * unformatted blocks (jnodes proper). There are plans, however to + * have a handle per extent unit rather than per each unformatted + * block, because there are so many of them. + * + * For bitmaps. Each bitmap is actually represented by two jnodes--one + * for working and another for "commit" data, together forming bnode. + * + * For io-heads. These are used by log writer. + * + * For formatted nodes (znode). See comment at the top of znode.c for + * details specific to the formatted nodes (znodes). + * + * Node data. + * + * Jnode provides access to the data of node it represents. Data are + * stored in a page. Page is kept in a page cache. 
This means that jnodes + * are highly interconnected with page cache and VM internals. + * + * jnode has a pointer to page (->pg) containing its data. Pointer to data + * themselves is cached in ->data field to avoid frequent calls to + * page_address(). + * + * jnode and page are attached to each other by jnode_attach_page(). This + * function places pointer to jnode in page->private, sets PG_private flag + * and increments page counter. + * + * Opposite operation is performed by page_clear_jnode(). + * + * jnode->pg is protected by jnode spin lock, and page->private is + * protected by page lock. See comment at the top of page_cache.c for + * more. + * + * page can be detached from jnode for two reasons: + * + * . jnode is removed from a tree (file is truncated, or formatted + * node is removed by balancing). + * + * . during memory pressure, VM calls ->releasepage() method + * (reiser4_releasepage()) to evict page from memory. + * + * (there, of course, is also umount, but this is special case we are not + * concerned with here). + * + * To protect jnode page from eviction, one calls jload() function that + * "pins" page in memory (loading it if necessary), increments + * jnode->d_count, and kmap()s page. Page is unpinned through call to + * jrelse(). + * + * Jnode life cycle. + * + * jnode is created, placed in hash table, and, optionally, in per-inode + * radix tree. Page can be attached to jnode, pinned, released, etc. + * + * When jnode is captured into atom its reference counter is + * increased. While being part of an atom, jnode can be "early + * flushed". This means that as part of flush procedure, jnode is placed + * into "relocate set", and its page is submitted to the disk. After io + * completes, page can be detached, then loaded again, re-dirtied, etc. + * + * A thread acquires a reference to jnode by calling jref() and releases it by + * jput().
When last reference is removed, jnode is still retained in + * memory (cached) if it has page attached, _unless_ it is scheduled for + * destruction (has JNODE_HEARD_BANSHEE bit set). + * + * Tree read-write lock was used as "existential" lock for jnodes. That is, + * jnode->x_count could be changed from 0 to 1 only under tree write lock, + * that is, tree lock protected unreferenced jnodes stored in the hash + * table from recycling. + * + * This resulted in high contention on tree lock, because jref()/jput() is + * frequent operation. To ameliorate this problem, RCU is used: when jput() + * is just about to release last reference on jnode it sets JNODE_RIP bit + * on it, and then proceeds with jnode destruction (removing jnode from hash + * table, cbk_cache, detaching page, etc.). All places that change jnode + * reference counter from 0 to 1 (jlookup(), zlook(), zget(), and + * cbk_cache_scan_slots()) check for JNODE_RIP bit (this is done by + * jnode_rip_check() function), and pretend that nothing was found in hash + * table if bit is set. + * + * jput defers actual return of jnode into slab cache to some later time + * (by call_rcu()); this guarantees that other threads can safely continue + * working with JNODE_RIP-ped jnode.
+ * + */ + +#include "reiser4.h" +#include "debug.h" +#include "dformat.h" +#include "plugin/plugin_header.h" +#include "plugin/plugin.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "tree.h" +#include "tree_walk.h" +#include "super.h" +#include "inode.h" +#include "page_cache.h" + +#include /* UML needs this for PAGE_OFFSET */ +#include +#include +#include +#include /* for vmalloc(), vfree() */ +#include +#include /* for struct address_space */ +#include /* for inode_lock */ + +static kmem_cache_t *_jnode_slab = NULL; + +static void jnode_set_type(jnode * node, jnode_type type); +static int jdelete(jnode * node); +static int jnode_try_drop(jnode * node); + +#if REISER4_DEBUG +static int jnode_invariant(const jnode * node, int tlocked, int jlocked); +#endif + +/* true if valid page is attached to jnode */ +static inline int jnode_is_parsed (jnode * node) +{ + return JF_ISSET(node, JNODE_PARSED); +} + +/* hash table support */ + +/* compare two jnode keys for equality. Used by hash-table macros */ +static inline int +jnode_key_eq(const jnode_key_t * k1, const jnode_key_t * k2) +{ + assert("nikita-2350", k1 != NULL); + assert("nikita-2351", k2 != NULL); + + return (k1->index == k2->index && k1->objectid == k2->objectid); +} + +/* Hash jnode by its key (inode plus offset). Used by hash-table macros */ +static inline __u32 +jnode_key_hashfn(j_hash_table *table, const jnode_key_t * key) +{ + assert("nikita-2352", key != NULL); + assert("nikita-3346", IS_POW(table->_buckets)); + + /* yes, this is a remarkably simple (if not stupid) hash function.
*/ + return (key->objectid + key->index) & (table->_buckets - 1); +} + +/* The hash table definition */ +#define KMALLOC(size) vmalloc(size) +#define KFREE(ptr, size) vfree(ptr) +TYPE_SAFE_HASH_DEFINE(j, jnode, jnode_key_t, key.j, link.j, jnode_key_hashfn, jnode_key_eq); +#undef KFREE +#undef KMALLOC + +/* call this to initialise jnode hash table */ +reiser4_internal int +jnodes_tree_init(reiser4_tree * tree /* tree to initialise jnodes for */ ) +{ + assert("nikita-2359", tree != NULL); + return j_hash_init(&tree->jhash_table, 16384); +} + +/* call this to destroy jnode hash table. This is called during umount. */ +reiser4_internal int +jnodes_tree_done(reiser4_tree * tree /* tree to destroy jnodes for */ ) +{ + j_hash_table *jtable; + jnode *node; + jnode *next; + + assert("nikita-2360", tree != NULL); + + /* + * Scan hash table and free all jnodes. + */ + jtable = &tree->jhash_table; + for_all_in_htable(jtable, j, node, next) { + assert("nikita-2361", !atomic_read(&node->x_count)); + jdrop(node); + } + + j_hash_done(&tree->jhash_table); + return 0; +} + +/* Initialize static variables in this file. */ +reiser4_internal int +jnode_init_static(void) +{ + assert("umka-168", _jnode_slab == NULL); + + _jnode_slab = kmem_cache_create("jnode", sizeof (jnode), 0, + SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT, + NULL, NULL); + + if (_jnode_slab == NULL) + goto error; + + return 0; + +error: + + if (_jnode_slab != NULL) + kmem_cache_destroy(_jnode_slab); + + return RETERR(-ENOMEM); +} + +/* Dual to jnode_init_static */ +reiser4_internal int +jnode_done_static(void) +{ + int ret = 0; + + if (_jnode_slab != NULL) { + ret = kmem_cache_destroy(_jnode_slab); + _jnode_slab = NULL; + } + + return ret; +} + +/* Initialize a jnode. 
*/ +reiser4_internal void +jnode_init(jnode * node, reiser4_tree * tree, jnode_type type) +{ + assert("umka-175", node != NULL); + + memset(node, 0, sizeof (jnode)); + ON_DEBUG(node->magic = JMAGIC); + jnode_set_type(node, type); + atomic_set(&node->d_count, 0); + atomic_set(&node->x_count, 0); + spin_jnode_init(node); + spin_jload_init(node); + node->atom = NULL; + node->tree = tree; + capture_list_clean(node); + + ASSIGN_NODE_LIST(node, NOT_CAPTURED); + + INIT_RCU_HEAD(&node->rcu); + +#if REISER4_DEBUG + { + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(tree->super); + spin_lock_irq(&sbinfo->all_guard); + list_add(&node->jnodes, &sbinfo->all_jnodes); + spin_unlock_irq(&sbinfo->all_guard); + /* link with which jnode is attached to reiser4_inode */ + inode_jnodes_list_clean(node); + } +#endif +} + +#if REISER4_DEBUG +/* + * Remove jnode from ->all_jnodes list. + */ +static void +jnode_done(jnode * node, reiser4_tree * tree) +{ + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(tree->super); + + spin_lock_irq(&sbinfo->all_guard); + assert("nikita-2422", !list_empty(&node->jnodes)); + list_del_init(&node->jnodes); + spin_unlock_irq(&sbinfo->all_guard); +} +#endif + +/* return already existing jnode of page */ +reiser4_internal jnode * +jnode_by_page(struct page *pg) +{ + assert("nikita-2066", pg != NULL); + assert("nikita-2400", PageLocked(pg)); + assert("nikita-2068", PagePrivate(pg)); + assert("nikita-2067", jprivate(pg) != NULL); + return jprivate(pg); +} + +/* exported functions to allocate/free jnode objects outside this file */ +reiser4_internal jnode * +jalloc(void) +{ + jnode *jal = kmem_cache_alloc(_jnode_slab, GFP_KERNEL); + return jal; +} + +/* return jnode back to the slab allocator */ +reiser4_internal inline void +jfree(jnode * node) +{ + assert("zam-449", node != NULL); + + assert("nikita-2663", capture_list_is_clean(node) && NODE_LIST(node) == NOT_CAPTURED); + assert("nikita-2774", !JF_ISSET(node, JNODE_EFLUSH)); + 
assert("nikita-3222", list_empty(&node->jnodes)); + assert("nikita-3221", jnode_page(node) == NULL); + + /* not yet phash_jnode_destroy(node); */ + + /* poison memory. */ + ON_DEBUG(memset(node, 0xad, sizeof *node)); + kmem_cache_free(_jnode_slab, node); +} + +/* + * This function is supplied as RCU callback. It actually frees jnode when + * last reference to it is gone. + */ +static void +jnode_free_actor(struct rcu_head *head) +{ + jnode * node; + jnode_type jtype; + + node = container_of(head, jnode, rcu); + jtype = jnode_get_type(node); + + ON_DEBUG(jnode_done(node, jnode_get_tree(node))); + + switch (jtype) { + case JNODE_IO_HEAD: + case JNODE_BITMAP: + case JNODE_UNFORMATTED_BLOCK: + jfree(node); + break; + case JNODE_FORMATTED_BLOCK: + zfree(JZNODE(node)); + break; + case JNODE_INODE: + default: + wrong_return_value("nikita-3197", "Wrong jnode type"); + } +} + +/* + * Free a jnode. Post a callback to be executed later through RCU when all + * references to @node are released. + */ +static inline void +jnode_free(jnode * node, jnode_type jtype) +{ + if (jtype != JNODE_INODE) { + /*assert("nikita-3219", list_empty(&node->rcu.list));*/ + call_rcu(&node->rcu, jnode_free_actor); + } else + jnode_list_remove(node); +} + +/* allocate new unformatted jnode */ +static jnode * +jnew_unformatted(void) +{ + jnode *jal; + + jal = jalloc(); + if (jal == NULL) + return NULL; + + jnode_init(jal, current_tree, JNODE_UNFORMATTED_BLOCK); + jal->key.j.mapping = 0; + jal->key.j.index = (unsigned long)-1; + jal->key.j.objectid = 0; + return jal; +} + +/* look for jnode with given mapping and offset within hash table */ +reiser4_internal jnode * +jlookup(reiser4_tree * tree, oid_t objectid, unsigned long index) +{ + jnode_key_t jkey; + jnode *node; + + assert("nikita-2353", tree != NULL); + + jkey.objectid = objectid; + jkey.index = index; + + /* + * hash table is _not_ protected by any lock during lookups. All we + * have to do is to disable preemption to keep RCU happy. 
+ */ + + rcu_read_lock(); + node = j_hash_find(&tree->jhash_table, &jkey); + if (node != NULL) { + /* protect @node from recycling */ + jref(node); + assert("nikita-2955", jnode_invariant(node, 0, 0)); + node = jnode_rip_check(tree, node); + } + rcu_read_unlock(); + return node; +} + +/* per inode radix tree of jnodes is protected by tree's read write spin lock */ +static jnode * +jfind_nolock(struct address_space *mapping, unsigned long index) +{ + assert("vs-1694", mapping->host != NULL); + + return radix_tree_lookup(jnode_tree_by_inode(mapping->host), index); +} + +reiser4_internal jnode * +jfind(struct address_space *mapping, unsigned long index) +{ + reiser4_tree *tree; + jnode *node; + + assert("vs-1694", mapping->host != NULL); + tree = tree_by_inode(mapping->host); + + RLOCK_TREE(tree); + node = jfind_nolock(mapping, index); + if (node != NULL) + jref(node); + RUNLOCK_TREE(tree); + return node; +} + +static void inode_attach_jnode(jnode *node) +{ + struct inode * inode; + reiser4_inode * info; + struct radix_tree_root * rtree; + + assert ("zam-1043", node->key.j.mapping != NULL); + inode = node->key.j.mapping->host; + info = reiser4_inode_data(inode); + rtree = jnode_tree_by_reiser4_inode(info); + + spin_lock(&inode_lock); + assert("zam-1049", equi(rtree->rnode != NULL, info->nr_jnodes != 0)); + check_me("zam-1045", !radix_tree_insert(rtree, node->key.j.index, node)); + ON_DEBUG(info->nr_jnodes ++); + inode->i_state |= I_JNODES; + spin_unlock(&inode_lock); +} + +static void inode_detach_jnode(jnode *node) +{ + struct inode *inode; + reiser4_inode *info; + struct radix_tree_root *rtree; + + assert ("zam-1044", node->key.j.mapping != NULL); + inode = node->key.j.mapping->host; + info = reiser4_inode_data(inode); + rtree = jnode_tree_by_reiser4_inode(info); + + spin_lock(&inode_lock); + assert("zam-1051", info->nr_jnodes != 0); + assert("zam-1052", rtree->rnode != NULL); + assert("vs-1730", !JF_ISSET(node, JNODE_EFLUSH)); + ON_DEBUG(info->nr_jnodes --); + + /* 
delete jnode from inode's radix tree of jnodes */ + check_me("zam-1046", radix_tree_delete(rtree, node->key.j.index)); + if (rtree->rnode == NULL) { + inode->i_state &= ~I_JNODES; + } + spin_unlock(&inode_lock); +} + +/* put jnode into hash table (where they can be found by flush who does not know + mapping) and to inode's tree of jnodes (where they can be found (hopefully + faster) in places where mapping is known). Currently it is used by + fs/reiser4/plugin/item/extent_file_ops.c:index_extent_jnode when new jnode is + created */ +static void +hash_unformatted_jnode(jnode *node, struct address_space *mapping, unsigned long index) +{ + j_hash_table *jtable; + + assert("vs-1446", jnode_is_unformatted(node)); + assert("vs-1442", node->key.j.mapping == 0); + assert("vs-1443", node->key.j.objectid == 0); + assert("vs-1444", node->key.j.index == (unsigned long)-1); + assert("nikita-3439", rw_tree_is_write_locked(jnode_get_tree(node))); + + node->key.j.mapping = mapping; + node->key.j.objectid = get_inode_oid(mapping->host); + node->key.j.index = index; + + jtable = &jnode_get_tree(node)->jhash_table; + + /* race with some other thread inserting jnode into the hash table is + * impossible, because we keep the page lock. */ + /* + * following assertion no longer holds because of RCU: it is possible + * jnode is in the hash table, but with JNODE_RIP bit set. 
+ */ + /* assert("nikita-3211", j_hash_find(jtable, &node->key.j) == NULL); */ + j_hash_insert_rcu(jtable, node); + inode_attach_jnode(node); +} + +static void +unhash_unformatted_node_nolock(jnode *node) +{ + assert("vs-1683", node->key.j.mapping != NULL); + assert("vs-1684", node->key.j.objectid == get_inode_oid(node->key.j.mapping->host)); + + /* remove jnode from hash-table */ + j_hash_remove_rcu(&node->tree->jhash_table, node); + inode_detach_jnode(node); + node->key.j.mapping = 0; + node->key.j.index = (unsigned long)-1; + node->key.j.objectid = 0; + +} + +/* remove jnode from hash table and from inode's tree of jnodes. This is used in + reiser4_invalidatepage and in kill_hook_extent -> truncate_inode_jnodes -> + uncapture_jnode */ +reiser4_internal void +unhash_unformatted_jnode(jnode *node) +{ + assert("vs-1445", jnode_is_unformatted(node)); + WLOCK_TREE(node->tree); + + unhash_unformatted_node_nolock(node); + + WUNLOCK_TREE(node->tree); +} + +/* + * search hash table for a jnode with given oid and index. If not found, + * allocate new jnode, insert it, and also insert into radix tree for the + * given inode/mapping. 
+ */ +reiser4_internal jnode * +find_get_jnode(reiser4_tree * tree, struct address_space *mapping, oid_t oid, + unsigned long index) +{ + jnode *result; + jnode *shadow; + int preload; + + result = jnew_unformatted(); + + if (unlikely(result == NULL)) + return ERR_PTR(RETERR(-ENOMEM)); + + preload = radix_tree_preload(GFP_KERNEL); + if (preload != 0) + return ERR_PTR(preload); + + WLOCK_TREE(tree); + shadow = jfind_nolock(mapping, index); + if (likely(shadow == NULL)) { + /* add new jnode to hash table and inode's radix tree of jnodes */ + jref(result); + hash_unformatted_jnode(result, mapping, index); + } else { + /* jnode is found in inode's radix tree of jnodes */ + jref(shadow); + jnode_free(result, JNODE_UNFORMATTED_BLOCK); + assert("vs-1498", shadow->key.j.mapping == mapping); + result = shadow; + } + WUNLOCK_TREE(tree); + + assert("nikita-2955", ergo(result != NULL, jnode_invariant(result, 0, 0))); + radix_tree_preload_end(); + return result; +} + + +/* jget() (a la zget() but for unformatted nodes). Returns (and possibly + creates) jnode corresponding to page @pg. jnode is attached to page and + inserted into jnode hash-table. */ +static jnode * +do_jget(reiser4_tree * tree, struct page * pg) +{ + /* + * There are two ways to create jnode: starting with pre-existing page + * and without page. + * + * When page already exists, jnode is created + * (jnode_of_page()->do_jget()) under page lock. This is done in + * ->writepage(), or when capturing anonymous page dirtied through + * mmap. + * + * Jnode without page is created by index_extent_jnode(). 
+ * + */ + + jnode *result; + oid_t oid = get_inode_oid(pg->mapping->host); + + assert("umka-176", pg != NULL); + assert("nikita-2394", PageLocked(pg)); + + result = jprivate(pg); + if (likely(result != NULL)) + return jref(result); + + tree = tree_by_page(pg); + + /* check hash-table first */ + result = jfind(pg->mapping, pg->index); + if (unlikely(result != NULL)) { + UNDER_SPIN_VOID(jnode, result, jnode_attach_page(result, pg)); + result->key.j.mapping = pg->mapping; + return result; + } + + result = find_get_jnode(tree, pg->mapping, oid, pg->index); + if (unlikely(IS_ERR(result))) + return result; + /* attach jnode to page */ + UNDER_SPIN_VOID(jnode, result, jnode_attach_page(result, pg)); + return result; +} + +/* + * return jnode for @pg, creating it if necessary. + */ +reiser4_internal jnode * +jnode_of_page(struct page * pg) +{ + jnode * result; + + assert("umka-176", pg != NULL); + assert("nikita-2394", PageLocked(pg)); + + result = do_jget(tree_by_page(pg), pg); + + if (REISER4_DEBUG && !IS_ERR(result)) { + assert("nikita-3210", result == jprivate(pg)); + assert("nikita-2046", jnode_page(jprivate(pg)) == pg); + if (jnode_is_unformatted(jprivate(pg))) { + assert("nikita-2364", jprivate(pg)->key.j.index == pg->index); + assert("nikita-2367", + jprivate(pg)->key.j.mapping == pg->mapping); + assert("nikita-2365", + jprivate(pg)->key.j.objectid == get_inode_oid(pg->mapping->host)); + assert("vs-1200", + jprivate(pg)->key.j.objectid == pg->mapping->host->i_ino); + assert("nikita-2356", jnode_is_unformatted(jnode_by_page(pg))); + } + assert("nikita-2956", jnode_invariant(jprivate(pg), 0, 0)); + } + return result; +} + +/* attach page to jnode: set ->pg pointer in jnode, and ->private one in the + * page.*/ +reiser4_internal void +jnode_attach_page(jnode * node, struct page *pg) +{ + assert("nikita-2060", node != NULL); + assert("nikita-2061", pg != NULL); + + assert("nikita-2050", pg->private == 0ul); + assert("nikita-2393", !PagePrivate(pg)); + 
assert("vs-1741", node->pg == NULL); + + assert("nikita-2396", PageLocked(pg)); + assert("nikita-2397", spin_jnode_is_locked(node)); + + page_cache_get(pg); + pg->private = (unsigned long) node; + node->pg = pg; + SetPagePrivate(pg); +} + +/* Dual to jnode_attach_page: break a binding between page and jnode */ +reiser4_internal void +page_clear_jnode(struct page *page, jnode * node) +{ + assert("nikita-2424", page != NULL); + assert("nikita-2425", PageLocked(page)); + assert("nikita-2426", node != NULL); + assert("nikita-2427", spin_jnode_is_locked(node)); + assert("nikita-2428", PagePrivate(page)); + + assert("nikita-3551", !PageWriteback(page)); + + JF_CLR(node, JNODE_PARSED); + page->private = 0ul; + ClearPagePrivate(page); + node->pg = NULL; + page_cache_release(page); +} + +/* it is only used in one place to handle error */ +reiser4_internal void +page_detach_jnode(struct page *page, struct address_space *mapping, unsigned long index) +{ + assert("nikita-2395", page != NULL); + + lock_page(page); + if ((page->mapping == mapping) && (page->index == index) && PagePrivate(page)) { + jnode *node; + + node = jprivate(page); + assert("nikita-2399", spin_jnode_is_not_locked(node)); + UNDER_SPIN_VOID(jnode, node, page_clear_jnode(page, node)); + } + unlock_page(page); +} + +/* return @node page locked. + + Locking ordering requires that one first takes page lock and afterwards + spin lock on node attached to this page. Sometimes it is necessary to go in + the opposite direction. This is done through standard trylock-and-release + loop. 
+*/ +static struct page * +jnode_lock_page(jnode * node) +{ + struct page *page; + + assert("nikita-2052", node != NULL); + assert("nikita-2401", spin_jnode_is_not_locked(node)); + + while (1) { + + LOCK_JNODE(node); + page = jnode_page(node); + if (page == NULL) { + break; + } + + /* no need to page_cache_get( page ) here, because page cannot + be evicted from memory without detaching it from jnode and + this requires spin lock on jnode that we already hold. + */ + if (!TestSetPageLocked(page)) { + /* We won a lock on jnode page, proceed. */ + break; + } + + /* Page is locked by someone else. */ + page_cache_get(page); + UNLOCK_JNODE(node); + wait_on_page_locked(page); + /* it is possible that page was detached from jnode and + returned to the free pool, or re-assigned while we were + waiting on locked bit. This will be rechecked on the next + loop iteration. + */ + page_cache_release(page); + + /* try again */ + } + return page; +} + +/* + * if JNODE_PARSED bit is not set, call ->parse() method of jnode to verify + * validity of jnode content. + */ +static inline int +jparse(jnode * node) +{ + int result; + + assert("nikita-2466", node != NULL); + + LOCK_JNODE(node); + if (likely(!jnode_is_parsed(node))) { + result = jnode_ops(node)->parse(node); + if (likely(result == 0)) + JF_SET(node, JNODE_PARSED); + } else + result = 0; + UNLOCK_JNODE(node); + return result; +} + +/* Lock a page attached to jnode, create and attach page to jnode if it had no + * one.
*/ +reiser4_internal struct page * +jnode_get_page_locked(jnode * node, int gfp_flags) +{ + struct page * page; + + LOCK_JNODE(node); + page = jnode_page(node); + + if (page == NULL) { + UNLOCK_JNODE(node); + page = find_or_create_page(jnode_get_mapping(node), + jnode_get_index(node), gfp_flags); + if (page == NULL) + return ERR_PTR(RETERR(-ENOMEM)); + } else { + if (!TestSetPageLocked(page)) { + UNLOCK_JNODE(node); + return page; + } + page_cache_get(page); + UNLOCK_JNODE(node); + lock_page(page); + assert("nikita-3134", page->mapping == jnode_get_mapping(node)); + } + + LOCK_JNODE(node); + if (!jnode_page(node)) + jnode_attach_page(node, page); + UNLOCK_JNODE(node); + + page_cache_release(page); + assert ("zam-894", jnode_page(node) == page); + return page; +} + +/* Start read operation for jnode's page if page is not up-to-date. */ +static int jnode_start_read (jnode * node, struct page * page) +{ + assert ("zam-893", PageLocked(page)); + + if (PageUptodate(page)) { + unlock_page(page); + return 0; + } + return page_io(page, node, READ, GFP_KERNEL); +} + +#if REISER4_DEBUG +static void check_jload(jnode * node, struct page * page) +{ + if (jnode_is_znode(node)) { + node40_header *nh; + znode *z; + + z = JZNODE(node); + if (znode_is_any_locked(z)) { + nh = (node40_header *)kmap(page); + /* this only works for node40-only file systems. For + * debugging. */ + assert("nikita-3253", + z->nr_items == d16tocpu(&nh->nr_items)); + kunmap(page); + } + assert("nikita-3565", znode_invariant(z)); + } +} +#else +#define check_jload(node, page) noop +#endif + +/* prefetch jnode to speed up next call to jload. Call this when you are going + * to call jload() shortly. This will bring appropriate portion of jnode into + * CPU cache. 
*/ +reiser4_internal void jload_prefetch(jnode *node) +{ + prefetchw(&node->x_count); +} + +/* load jnode's data into memory */ +reiser4_internal int +jload_gfp (jnode * node /* node to load */, + int gfp_flags /* allocation flags*/, + int do_kmap /* true if page should be kmapped */) +{ + struct page * page; + int result = 0; + int parsed; + + assert("nikita-3010", schedulable()); + + prefetchw(&node->pg); + + /* taking d-reference implies taking x-reference. */ + jref(node); + + /* + * acquiring d-reference to @jnode and check for JNODE_PARSED bit + * should be atomic, otherwise there is a race against + * reiser4_releasepage(). + */ + LOCK_JLOAD(node); + add_d_ref(node); + parsed = jnode_is_parsed(node); + UNLOCK_JLOAD(node); + + if (unlikely(!parsed)) { + page = jnode_get_page_locked(node, gfp_flags); + if (unlikely(IS_ERR(page))) { + result = PTR_ERR(page); + goto failed; + } + + result = jnode_start_read(node, page); + if (unlikely(result != 0)) + goto failed; + + wait_on_page_locked(page); + if (unlikely(!PageUptodate(page))) { + result = RETERR(-EIO); + goto failed; + } + + if (do_kmap) + node->data = kmap(page); + + result = jparse(node); + if (unlikely(result != 0)) { + if (do_kmap) + kunmap(page); + goto failed; + } + check_jload(node, page); + } else { + page = jnode_page(node); + check_jload(node, page); + if (do_kmap) + node->data = kmap(page); + } + + if (unlikely(JF_ISSET(node, JNODE_EFLUSH))) + UNDER_SPIN_VOID(jnode, node, eflush_del(node, 0)); + + if (!is_writeout_mode()) + /* We do not mark pages active if jload is called as a part of + * jnode_flush() or reiser4_write_logs(). Both jnode_flush() + * and write_logs() add no value to cached data, there is no + * sense to mark pages as active when they go to disk, it just + * confuses vm scanning routines because clean page could be + * moved out from inactive list as a result of this + * mark_page_accessed() call. 
*/ + mark_page_accessed(page); + + return 0; + + failed: + jrelse_tail(node); + return result; + +} + +/* start asynchronous reading for given jnode's page. */ +reiser4_internal int jstartio (jnode * node) +{ + struct page * page; + + page = jnode_get_page_locked(node, GFP_KERNEL); + if (IS_ERR(page)) + return PTR_ERR(page); + + return jnode_start_read(node, page); +} + + +/* Initialize a node by calling appropriate plugin instead of reading + * node from disk as in jload(). */ +reiser4_internal int jinit_new (jnode * node, int gfp_flags) +{ + struct page * page; + int result; + + jref(node); + add_d_ref(node); + + page = jnode_get_page_locked(node, gfp_flags); + if (IS_ERR(page)) { + result = PTR_ERR(page); + goto failed; + } + + SetPageUptodate(page); + unlock_page(page); + + node->data = kmap(page); + + if (!jnode_is_parsed(node)) { + jnode_plugin * jplug = jnode_ops(node); + result = UNDER_SPIN(jnode, node, jplug->init(node)); + if (result) { + kunmap(page); + goto failed; + } + JF_SET(node, JNODE_PARSED); + } + + return 0; + + failed: + jrelse(node); + return result; +} + +/* release a reference to jnode acquired by jload(), decrement ->d_count */ +reiser4_internal void +jrelse_tail(jnode * node /* jnode to release references to */) +{ + assert("nikita-489", atomic_read(&node->d_count) > 0); + atomic_dec(&node->d_count); + /* release reference acquired in jload_gfp() or jinit_new() */ + jput(node); + LOCK_CNT_DEC(d_refs); +} + +/* drop reference to node data. When last reference is dropped, data are + unloaded. */ +reiser4_internal void +jrelse(jnode * node /* jnode to release references to */) +{ + struct page *page; + + assert("nikita-487", node != NULL); + assert("nikita-1906", spin_jnode_is_not_locked(node)); + + page = jnode_page(node); + if (likely(page != NULL)) { + /* + * it is safe not to lock jnode here, because at this point + * @node->d_count is greater than zero (if jrelse() is used + * correctly, that is). 
JNODE_PARSED may be not set yet, if, + * for example, we got here as a result of error handling path + * in jload(). Anyway, page cannot be detached by + * reiser4_releasepage(). truncate will invalidate page + * regardless, but this should not be a problem. + */ + kunmap(page); + } + jrelse_tail(node); +} + +/* called from jput() to wait for io completion */ +static void jnode_finish_io(jnode * node) +{ + struct page *page; + + assert("nikita-2922", node != NULL); + + LOCK_JNODE(node); + page = jnode_page(node); + if (page != NULL) { + page_cache_get(page); + UNLOCK_JNODE(node); + wait_on_page_writeback(page); + page_cache_release(page); + } else + UNLOCK_JNODE(node); +} + +/* + * This is called by jput() when last reference to jnode is released. This is + * separate function, because we want fast path of jput() to be inline and, + * therefore, small. + */ +reiser4_internal void +jput_final(jnode * node) +{ + int r_i_p; + + /* A fast check for keeping node in cache. We always keep node in cache + * if its page is present and node was not marked for deletion */ + if (jnode_page(node) != NULL && !JF_ISSET(node, JNODE_HEARD_BANSHEE)) { + rcu_read_unlock(); + return; + } + + r_i_p = !JF_TEST_AND_SET(node, JNODE_RIP); + /* + * if r_i_p is true, we were first to set JNODE_RIP on this node. In + * this case it is safe to access node after unlock. + */ + rcu_read_unlock(); + if (r_i_p) { + jnode_finish_io(node); + if (JF_ISSET(node, JNODE_HEARD_BANSHEE)) + /* node is removed from the tree. 
*/ + jdelete(node); + else + jnode_try_drop(node); + } + /* if !r_i_p some other thread is already killing it */ +} + +reiser4_internal int +jwait_io(jnode * node, int rw) +{ + struct page *page; + int result; + + assert("zam-447", node != NULL); + assert("zam-448", jnode_page(node) != NULL); + + page = jnode_page(node); + + result = 0; + if (rw == READ) { + wait_on_page_locked(page); + } else { + assert("nikita-2227", rw == WRITE); + wait_on_page_writeback(page); + } + if (PageError(page)) + result = RETERR(-EIO); + + return result; +} + +/* + * jnode types and plugins. + * + * jnode by itself is a "base type". There are several different jnode + * flavors, called "jnode types" (see jnode_type for a list). Sometimes code + * has to do different things based on jnode type. In the standard reiser4 way + * this is done by having jnode plugin (see fs/reiser4/plugin.h:jnode_plugin). + * + * Functions below deal with jnode types and define methods of jnode plugin. + * + */ + +/* set jnode type. This is done during jnode initialization. */ +static void +jnode_set_type(jnode * node, jnode_type type) +{ + static unsigned long type_to_mask[] = { + [JNODE_UNFORMATTED_BLOCK] = 1, + [JNODE_FORMATTED_BLOCK] = 0, + [JNODE_BITMAP] = 2, + [JNODE_IO_HEAD] = 6, + [JNODE_INODE] = 4 + }; + + assert("zam-647", type < LAST_JNODE_TYPE); + assert("nikita-2815", !jnode_is_loaded(node)); + assert("nikita-3386", node->state == 0); + + node->state |= (type_to_mask[type] << JNODE_TYPE_1); +} + +/* ->init() method of jnode plugin for jnodes that don't require plugin + * specific initialization. */ +static int +init_noinit(jnode * node UNUSED_ARG) +{ + return 0; +} + +/* ->parse() method of jnode plugin for jnodes that don't require plugin + * specific parsing.
*/ +static int +parse_noparse(jnode * node UNUSED_ARG) +{ + return 0; +} + +/* ->mapping() method for unformatted jnode */ +reiser4_internal struct address_space * +mapping_jnode(const jnode * node) +{ + struct address_space *map; + + assert("nikita-2713", node != NULL); + + /* mapping is stored in jnode */ + + map = node->key.j.mapping; + assert("nikita-2714", map != NULL); + assert("nikita-2897", is_reiser4_inode(map->host)); + assert("nikita-2715", get_inode_oid(map->host) == node->key.j.objectid); + assert("vs-1447", !JF_ISSET(node, JNODE_CC)); + return map; +} + +/* ->index() method for unformatted jnodes */ +reiser4_internal unsigned long +index_jnode(const jnode * node) +{ + assert("vs-1447", !JF_ISSET(node, JNODE_CC)); + /* index is stored in jnode */ + return node->key.j.index; +} + +/* ->remove() method for unformatted jnodes */ +static inline void +remove_jnode(jnode * node, reiser4_tree * tree) +{ + /* remove jnode from hash table and radix tree */ + if (node->key.j.mapping) + unhash_unformatted_node_nolock(node); +} + +/* ->mapping() method for znodes */ +static struct address_space * +mapping_znode(const jnode * node) +{ + assert("vs-1447", !JF_ISSET(node, JNODE_CC)); + /* all znodes belong to fake inode */ + return get_super_fake(jnode_get_tree(node)->super)->i_mapping; +} + +extern int znode_shift_order; +/* ->index() method for znodes */ +static unsigned long +index_znode(const jnode * node) +{ + unsigned long addr; + assert("nikita-3317", (1 << znode_shift_order) < sizeof(znode)); + + /* index of znode is just its address (shifted) */ + addr = (unsigned long)node; + return (addr - PAGE_OFFSET) >> znode_shift_order; +} + +/* ->mapping() method for bitmap jnode */ +static struct address_space * +mapping_bitmap(const jnode * node) +{ + /* all bitmap blocks belong to special bitmap inode */ + return get_super_private(jnode_get_tree(node)->super)->bitmap->i_mapping; +} + +/* ->index() method for jnodes that are indexed by address */ +static unsigned 
long +index_is_address(const jnode * node) +{ + unsigned long ind; + + ind = (unsigned long)node; + return ind - PAGE_OFFSET; +} + +/* resolve race with jput */ +reiser4_internal jnode * +jnode_rip_sync(reiser4_tree *t, jnode * node) +{ + /* + * This is used as part of RCU-based jnode handling. + * + * jlookup(), zlook(), zget(), and cbk_cache_scan_slots() have to work + * with unreferenced jnodes (ones with ->x_count == 0). Hash table is + * not protected during this, so concurrent thread may execute + * zget-set-HEARD_BANSHEE-zput, or otherwise cause jnode to be + * freed in jput_final(). To avoid such races, jput_final() sets + * JNODE_RIP on jnode (under tree lock). All places that work with + * unreferenced jnodes call this function. It checks for JNODE_RIP bit + * (first without taking tree lock), and if this bit is set, releases + * reference acquired by the current thread and returns NULL. + * + * As a result, if jnode is being concurrently freed, NULL is returned + * and caller should pretend that jnode wasn't found in the first + * place. + * + * Otherwise it's safe to release "rcu-read-lock" and continue with + * jnode.
+ */ + if (unlikely(JF_ISSET(node, JNODE_RIP))) { + RLOCK_TREE(t); + if (JF_ISSET(node, JNODE_RIP)) { + dec_x_ref(node); + node = NULL; + } + RUNLOCK_TREE(t); + } + return node; +} + + +reiser4_internal reiser4_key * +jnode_build_key(const jnode * node, reiser4_key * key) +{ + struct inode *inode; + item_plugin *iplug; + loff_t off; + + assert("nikita-3092", node != NULL); + assert("nikita-3093", key != NULL); + assert("nikita-3094", jnode_is_unformatted(node)); + + + off = ((loff_t)index_jnode(node)) << PAGE_CACHE_SHIFT; + inode = mapping_jnode(node)->host; + + if (node->parent_item_id != 0) + iplug = item_plugin_by_id(node->parent_item_id); + else + iplug = NULL; + + if (iplug != NULL && iplug->f.key_by_offset) + iplug->f.key_by_offset(inode, off, key); + else { + file_plugin *fplug; + + fplug = inode_file_plugin(inode); + assert ("zam-1007", fplug != NULL); + assert ("zam-1008", fplug->key_by_inode != NULL); + + fplug->key_by_inode(inode, off, key); + } + + return key; +} + +extern int zparse(znode * node); + +/* ->parse() method for formatted nodes */ +static int +parse_znode(jnode * node) +{ + return zparse(JZNODE(node)); +} + +/* ->delete() method for formatted nodes */ +static void +delete_znode(jnode * node, reiser4_tree * tree) +{ + znode *z; + + assert("nikita-2128", rw_tree_is_write_locked(tree)); + assert("vs-898", JF_ISSET(node, JNODE_HEARD_BANSHEE)); + + z = JZNODE(node); + assert("vs-899", z->c_count == 0); + + /* delete znode from sibling list. */ + sibling_list_remove(z); + + znode_remove(z, tree); +} + +/* ->remove() method for formatted nodes */ +static int +remove_znode(jnode * node, reiser4_tree * tree) +{ + znode *z; + + assert("nikita-2128", rw_tree_is_locked(tree)); + z = JZNODE(node); + + if (z->c_count == 0) { + /* detach znode from sibling list. */ + sibling_list_drop(z); + /* this is called with tree spin-lock held, so call + znode_remove() directly (rather than znode_lock_remove()). 
*/ + znode_remove(z, tree); + return 0; + } + return RETERR(-EBUSY); +} + +/* ->init() method for formatted nodes */ +static int +init_znode(jnode * node) +{ + znode *z; + + z = JZNODE(node); + /* call node plugin to do actual initialization */ + return z->nplug->init(z); +} + +/* jplug->clone for formatted nodes (znodes) */ +znode *zalloc(int gfp_flag); +void zinit(znode *, const znode * parent, reiser4_tree *); + +/* ->clone() method for formatted nodes */ +static jnode * +clone_formatted(jnode *node) +{ + znode *clone; + + assert("vs-1430", jnode_is_znode(node)); + clone = zalloc(GFP_KERNEL); + if (clone == NULL) + return ERR_PTR(RETERR(-ENOMEM)); + zinit(clone, 0, current_tree); + jnode_set_block(ZJNODE(clone), jnode_get_block(node)); + /* ZJNODE(clone)->key.z is not initialized */ + clone->level = JZNODE(node)->level; + + return ZJNODE(clone); +} + +/* jplug->clone for unformatted nodes */ +static jnode * +clone_unformatted(jnode *node) +{ + jnode *clone; + + assert("vs-1431", jnode_is_unformatted(node)); + clone = jalloc(); + if (clone == NULL) + return ERR_PTR(RETERR(-ENOMEM)); + + jnode_init(clone, current_tree, JNODE_UNFORMATTED_BLOCK); + jnode_set_block(clone, jnode_get_block(node)); + + return clone; + +} + +/* + * Setup jnode plugin methods for various jnode types. 
+ */ + +jnode_plugin jnode_plugins[LAST_JNODE_TYPE] = { + [JNODE_UNFORMATTED_BLOCK] = { + .h = { + .type_id = REISER4_JNODE_PLUGIN_TYPE, + .id = JNODE_UNFORMATTED_BLOCK, + .pops = NULL, + .label = "unformatted", + .desc = "unformatted node", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .init = init_noinit, + .parse = parse_noparse, + .mapping = mapping_jnode, + .index = index_jnode, + .clone = clone_unformatted + }, + [JNODE_FORMATTED_BLOCK] = { + .h = { + .type_id = REISER4_JNODE_PLUGIN_TYPE, + .id = JNODE_FORMATTED_BLOCK, + .pops = NULL, + .label = "formatted", + .desc = "formatted tree node", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .init = init_znode, + .parse = parse_znode, + .mapping = mapping_znode, + .index = index_znode, + .clone = clone_formatted + }, + [JNODE_BITMAP] = { + .h = { + .type_id = REISER4_JNODE_PLUGIN_TYPE, + .id = JNODE_BITMAP, + .pops = NULL, + .label = "bitmap", + .desc = "bitmap node", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .init = init_noinit, + .parse = parse_noparse, + .mapping = mapping_bitmap, + .index = index_is_address, + .clone = NULL + }, + [JNODE_IO_HEAD] = { + .h = { + .type_id = REISER4_JNODE_PLUGIN_TYPE, + .id = JNODE_IO_HEAD, + .pops = NULL, + .label = "io head", + .desc = "io head", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .init = init_noinit, + .parse = parse_noparse, + .mapping = mapping_bitmap, + .index = index_is_address, + .clone = NULL + }, + [JNODE_INODE] = { + .h = { + .type_id = REISER4_JNODE_PLUGIN_TYPE, + .id = JNODE_INODE, + .pops = NULL, + .label = "inode", + .desc = "inode's builtin jnode", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .init = NULL, + .parse = NULL, + .mapping = NULL, + .index = NULL, + .clone = NULL + } +}; + +/* + * jnode destruction. + * + * Thread may use a jnode after it acquired a reference to it. References are + * counted in ->x_count field. Reference protects jnode from being + * recycled. 
This is different from protecting jnode data (that are stored in + * jnode page) from being evicted from memory. Data are protected by jload() + * and released by jrelse(). + * + * If thread already possesses a reference to the jnode it can acquire another + * one through jref(). Initial reference is obtained (usually) by locating + * jnode in some indexing structure that depends on jnode type: formatted + * nodes are kept in global hash table, where they are indexed by block + * number, and also in the cbk cache. Unformatted jnodes are also kept in hash + * table, which is indexed by oid and offset within file, and in per-inode + * radix tree. + * + * Reference to jnode is released by jput(). If last reference is released, + * jput_final() is called. This function determines whether jnode has to be + * deleted (this happens when corresponding node is removed from the file + * system, jnode is marked with JNODE_HEARD_BANSHEE bit in this case), or it + * should be just "removed" (deleted from memory). + * + * Jnode destruction is a signally delicate dance because of locking and RCU. + */ + +/* + * Returns true if jnode cannot be removed right now. This check is called + * under tree lock. If it returns false, jnode is irrevocably committed to be + * deleted/removed. + */ +static inline int +jnode_is_busy(const jnode * node, jnode_type jtype) +{ + /* if other thread managed to acquire a reference to this jnode, don't + * free it. */ + if (atomic_read(&node->x_count) > 0) + return 1; + /* also, don't free znode that has children in memory */ + if (jtype == JNODE_FORMATTED_BLOCK && JZNODE(node)->c_count > 0) + return 1; + return 0; +} + +/* + * this is called as part of removing jnode. Based on jnode type, call + * corresponding function that removes jnode from indices and returns it back + * to the appropriate slab (through RCU). 
+ */ +static inline void +jnode_remove(jnode * node, jnode_type jtype, reiser4_tree * tree) +{ + switch (jtype) { + case JNODE_UNFORMATTED_BLOCK: + remove_jnode(node, tree); + break; + case JNODE_IO_HEAD: + case JNODE_BITMAP: + break; + case JNODE_INODE: + break; + case JNODE_FORMATTED_BLOCK: + remove_znode(node, tree); + break; + default: + wrong_return_value("nikita-3196", "Wrong jnode type"); + } +} + +/* + * this is called as part of deleting jnode. Based on jnode type, call + * corresponding function that removes jnode from indices and returns it back + * to the appropriate slab (through RCU). + * + * This differs from jnode_remove() only for formatted nodes---for them + * sibling list handling is different for removal and deletion. + */ +static inline void +jnode_delete(jnode * node, jnode_type jtype, reiser4_tree * tree UNUSED_ARG) +{ + switch (jtype) { + case JNODE_UNFORMATTED_BLOCK: + remove_jnode(node, tree); + break; + case JNODE_IO_HEAD: + case JNODE_BITMAP: + break; + case JNODE_FORMATTED_BLOCK: + delete_znode(node, tree); + break; + case JNODE_INODE: + default: + wrong_return_value("nikita-3195", "Wrong jnode type"); + } +} + +#if REISER4_DEBUG +/* + * remove jnode from the debugging list of all jnodes hanging off super-block. + */ +void jnode_list_remove(jnode * node) +{ + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(jnode_get_tree(node)->super); + + spin_lock_irq(&sbinfo->all_guard); + assert("nikita-2422", !list_empty(&node->jnodes)); + list_del_init(&node->jnodes); + spin_unlock_irq(&sbinfo->all_guard); +} +#endif + +/* + * this is called by jput_final() to remove jnode when last reference to it is + * released. 
+ */ +static int +jnode_try_drop(jnode * node) +{ + int result; + reiser4_tree *tree; + jnode_type jtype; + + assert("nikita-2491", node != NULL); + assert("nikita-2583", JF_ISSET(node, JNODE_RIP)); + + tree = jnode_get_tree(node); + jtype = jnode_get_type(node); + + LOCK_JNODE(node); + WLOCK_TREE(tree); + /* + * if jnode has a page---leave it alone. Memory pressure will + * eventually kill page and jnode. + */ + if (jnode_page(node) != NULL) { + UNLOCK_JNODE(node); + WUNLOCK_TREE(tree); + JF_CLR(node, JNODE_RIP); + return RETERR(-EBUSY); + } + + /* re-check ->x_count under tree lock. */ + result = jnode_is_busy(node, jtype); + if (result == 0) { + assert("nikita-2582", !JF_ISSET(node, JNODE_HEARD_BANSHEE)); + assert("nikita-3223", !JF_ISSET(node, JNODE_EFLUSH)); + assert("jmacd-511/b", atomic_read(&node->d_count) == 0); + + UNLOCK_JNODE(node); + /* no page and no references---despatch him. */ + jnode_remove(node, jtype, tree); + WUNLOCK_TREE(tree); + jnode_free(node, jtype); + } else { + /* busy check failed: reference was acquired by concurrent + * thread. */ + WUNLOCK_TREE(tree); + UNLOCK_JNODE(node); + JF_CLR(node, JNODE_RIP); + } + return result; +} + +/* jdelete() -- Delete jnode from the tree and file system */ +static int +jdelete(jnode * node /* jnode to finish with */) +{ + struct page *page; + int result; + reiser4_tree *tree; + jnode_type jtype; + + assert("nikita-467", node != NULL); + assert("nikita-2531", JF_ISSET(node, JNODE_RIP)); + /* jnode cannot be eflushed at this point, because emergency flush + * acquires an additional reference. */ + assert("nikita-2917", !JF_ISSET(node, JNODE_EFLUSH)); + + jtype = jnode_get_type(node); + + page = jnode_lock_page(node); + assert("nikita-2402", spin_jnode_is_locked(node)); + + tree = jnode_get_tree(node); + + WLOCK_TREE(tree); + /* re-check ->x_count under tree lock. 
*/ + result = jnode_is_busy(node, jtype); + if (likely(!result)) { + assert("nikita-2123", JF_ISSET(node, JNODE_HEARD_BANSHEE)); + assert("jmacd-511", atomic_read(&node->d_count) == 0); + + /* detach page */ + if (page != NULL) { + /* + * FIXME this is racy against jnode_extent_write(). + */ + page_clear_jnode(page, node); + } + UNLOCK_JNODE(node); + /* goodbye */ + jnode_delete(node, jtype, tree); + WUNLOCK_TREE(tree); + jnode_free(node, jtype); + /* @node is no longer valid pointer */ + if (page != NULL) + drop_page(page); + } else { + /* busy check failed: reference was acquired by concurrent + * thread. */ + JF_CLR(node, JNODE_RIP); + WUNLOCK_TREE(tree); + UNLOCK_JNODE(node); + if (page != NULL) + unlock_page(page); + } + return result; +} + +/* drop jnode on the floor. + + Return value: + + -EBUSY: failed to drop jnode, because there are still references to it + + 0: successfully dropped jnode + +*/ +static int +jdrop_in_tree(jnode * node, reiser4_tree * tree) +{ + struct page *page; + jnode_type jtype; + int result; + + assert("zam-602", node != NULL); + assert("nikita-2362", rw_tree_is_not_locked(tree)); + assert("nikita-2403", !JF_ISSET(node, JNODE_HEARD_BANSHEE)); + // assert( "nikita-2532", JF_ISSET( node, JNODE_RIP ) ); + + + jtype = jnode_get_type(node); + + page = jnode_lock_page(node); + assert("nikita-2405", spin_jnode_is_locked(node)); + + WLOCK_TREE(tree); + + /* re-check ->x_count under tree lock. 
*/ + result = jnode_is_busy(node, jtype); + if (!result) { + assert("nikita-2488", page == jnode_page(node)); + assert("nikita-2533", atomic_read(&node->d_count) == 0); + if (page != NULL) { + assert("nikita-2126", !PageDirty(page)); + assert("nikita-2127", PageUptodate(page)); + assert("nikita-2181", PageLocked(page)); + page_clear_jnode(page, node); + } + UNLOCK_JNODE(node); + jnode_remove(node, jtype, tree); + WUNLOCK_TREE(tree); + jnode_free(node, jtype); + if (page != NULL) { + drop_page(page); + } + } else { + /* busy check failed: reference was acquired by concurrent + * thread. */ + JF_CLR(node, JNODE_RIP); + WUNLOCK_TREE(tree); + UNLOCK_JNODE(node); + if (page != NULL) + unlock_page(page); + } + return result; +} + +/* This function frees jnode "if possible". In particular, [dcx]_count has to + be 0 (where applicable). */ +reiser4_internal void +jdrop(jnode * node) +{ + jdrop_in_tree(node, jnode_get_tree(node)); +} + + +/* IO head jnode implementation; The io heads are simple j-nodes with limited + functionality (these j-nodes are not in any hash table) just for reading + from and writing to disk. 
*/ + +reiser4_internal jnode * +alloc_io_head(const reiser4_block_nr * block) +{ + jnode *jal = jalloc(); + + if (jal != NULL) { + jnode_init(jal, current_tree, JNODE_IO_HEAD); + jnode_set_block(jal, block); + /* take the initial reference only if allocation succeeded */ + jref(jal); + } + + return jal; +} + +reiser4_internal void +drop_io_head(jnode * node) +{ + assert("zam-648", jnode_get_type(node) == JNODE_IO_HEAD); + + jput(node); + jdrop(node); +} + +/* protect jnode data from reiser4_releasepage() */ +reiser4_internal void +pin_jnode_data(jnode * node) +{ + assert("zam-671", jnode_page(node) != NULL); + page_cache_get(jnode_page(node)); +} + +/* make jnode data free-able again */ +reiser4_internal void +unpin_jnode_data(jnode * node) +{ + assert("zam-672", jnode_page(node) != NULL); + page_cache_release(jnode_page(node)); +} + +reiser4_internal struct address_space * +jnode_get_mapping(const jnode * node) +{ + assert("nikita-3162", node != NULL); + return jnode_ops(node)->mapping(node); +} + +#if REISER4_DEBUG +/* debugging aid: jnode invariant */ +reiser4_internal int +jnode_invariant_f(const jnode * node, + char const **msg) +{ +#define _ergo(ant, con) \ + ((*msg) = "{" #ant "} ergo {" #con "}", ergo((ant), (con))) +#define _check(exp) ((*msg) = #exp, (exp)) + + return + _check(node != NULL) && + + /* [jnode-queued] */ + + /* only relocated node can be queued, except that when znode + * is being deleted, its JNODE_RELOC bit is cleared */ + _ergo(JF_ISSET(node, JNODE_FLUSH_QUEUED), + JF_ISSET(node, JNODE_RELOC) || + JF_ISSET(node, JNODE_HEARD_BANSHEE)) && + + _check(node->jnodes.prev != NULL) && + _check(node->jnodes.next != NULL) && + + /* [jnode-dirty] invariant */ + + /* dirty jnode is part of atom */ + _ergo(jnode_is_dirty(node), node->atom != NULL) && + + /* [jnode-oid] invariant */ + + /* for unformatted node ->objectid and ->mapping fields are + * consistent */ + _ergo(jnode_is_unformatted(node) && node->key.j.mapping != NULL, + node->key.j.objectid == get_inode_oid(node->key.j.mapping->host)) && + /* 
[jnode-atom-valid] invariant */ + + /* node atom has valid state */ + _ergo(node->atom != NULL, + node->atom->stage != ASTAGE_INVALID) && + + /* [jnode-page-binding] invariant */ + + /* if node points to page, it points back to node */ + _ergo(node->pg != NULL, jprivate(node->pg) == node) && + + /* [jnode-refs] invariant */ + + /* only referenced jnode can be loaded */ + _check(atomic_read(&node->x_count) >= atomic_read(&node->d_count)); + +} + +/* debugging aid: check znode invariant and panic if it doesn't hold */ +static int +jnode_invariant(const jnode * node, int tlocked, int jlocked) +{ + char const *failed_msg; + int result; + reiser4_tree *tree; + + tree = jnode_get_tree(node); + + assert("umka-063312", node != NULL); + assert("umka-064321", tree != NULL); + + if (!jlocked && !tlocked) + LOCK_JNODE((jnode *) node); + if (!tlocked) + RLOCK_TREE(jnode_get_tree(node)); + result = jnode_invariant_f(node, &failed_msg); + if (!result) { + info_jnode("corrupted node", node); + warning("jmacd-555", "Condition %s failed", failed_msg); + } + if (!tlocked) + RUNLOCK_TREE(jnode_get_tree(node)); + if (!jlocked && !tlocked) + UNLOCK_JNODE((jnode *) node); + return result; +} + +static const char * +jnode_type_name(jnode_type type) +{ + switch (type) { + case JNODE_UNFORMATTED_BLOCK: + return "unformatted"; + case JNODE_FORMATTED_BLOCK: + return "formatted"; + case JNODE_BITMAP: + return "bitmap"; + case JNODE_IO_HEAD: + return "io head"; + case JNODE_INODE: + return "inode"; + case LAST_JNODE_TYPE: + return "last"; + default:{ + static char unknown[30]; + + sprintf(unknown, "unknown %i", type); + return unknown; + } + } +} + +#define jnode_state_name( node, flag ) \ + ( JF_ISSET( ( node ), ( flag ) ) ? 
((#flag "|")+6) : "" ) + +/* debugging aid: output human readable information about @node */ +reiser4_internal void +info_jnode(const char *prefix /* prefix to print */ , + const jnode * node /* node to print */ ) +{ + assert("umka-068", prefix != NULL); + + if (node == NULL) { + printk("%s: null\n", prefix); + return; + } + + printk("%s: %p: state: %lx: [%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s], level: %i," + " block: %s, d_count: %d, x_count: %d, " + "pg: %p, atom: %p, lock: %i:%i, type: %s, ", + prefix, node, node->state, + jnode_state_name(node, JNODE_PARSED), + jnode_state_name(node, JNODE_HEARD_BANSHEE), + jnode_state_name(node, JNODE_LEFT_CONNECTED), + jnode_state_name(node, JNODE_RIGHT_CONNECTED), + jnode_state_name(node, JNODE_ORPHAN), + jnode_state_name(node, JNODE_CREATED), + jnode_state_name(node, JNODE_RELOC), + jnode_state_name(node, JNODE_OVRWR), + jnode_state_name(node, JNODE_DIRTY), + jnode_state_name(node, JNODE_IS_DYING), + jnode_state_name(node, JNODE_EFLUSH), + jnode_state_name(node, JNODE_FLUSH_QUEUED), + jnode_state_name(node, JNODE_RIP), + jnode_state_name(node, JNODE_MISSED_IN_CAPTURE), + jnode_state_name(node, JNODE_WRITEBACK), + jnode_state_name(node, JNODE_NEW), + jnode_state_name(node, JNODE_DKSET), + jnode_state_name(node, JNODE_EPROTECTED), + jnode_state_name(node, JNODE_REPACK), + jnode_state_name(node, JNODE_CLUSTER_PAGE), + jnode_get_level(node), sprint_address(jnode_get_block(node)), + atomic_read(&node->d_count), atomic_read(&node->x_count), + jnode_page(node), node->atom, + 0, 0, + jnode_type_name(jnode_get_type(node))); + if (jnode_is_unformatted(node)) { + printk("inode: %llu, index: %lu, ", + node->key.j.objectid, node->key.j.index); + } +} + +/* debugging aid: output human readable information about @node */ +reiser4_internal void +print_jnode(const char *prefix /* prefix to print */ , + const jnode * node /* node to print */) +{ + if (jnode_is_znode(node)) + print_znode(prefix, JZNODE(node)); + else + info_jnode(prefix, 
node); +} + +#endif /* REISER4_DEBUG */ + +/* this is only used to create jnode during capture copy */ +reiser4_internal jnode *jclone(jnode *node) +{ + jnode *clone; + + assert("vs-1429", jnode_ops(node)->clone); + clone = jnode_ops(node)->clone(node); + if (IS_ERR(clone)) + return clone; + + jref(clone); + JF_SET(clone, JNODE_HEARD_BANSHEE); + JF_SET(clone, JNODE_CC); + return clone; +} + + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 80 + End: +*/ diff -puN /dev/null fs/reiser4/jnode.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/jnode.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,772 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Declaration of jnode. See jnode.c for details. */ + +#ifndef __JNODE_H__ +#define __JNODE_H__ + +#include "forward.h" +#include "type_safe_hash.h" +#include "type_safe_list.h" +#include "txnmgr.h" +#include "key.h" +#include "debug.h" +#include "dformat.h" +#include "spin_macros.h" +#include "emergency_flush.h" + +#include "plugin/plugin.h" + +#include +#include +#include +#include +#include +#include +#include + +/* declare hash table of jnodes (jnodes proper, that is, unformatted + nodes) */ +TYPE_SAFE_HASH_DECLARE(j, jnode); + +/* declare hash table of znodes */ +TYPE_SAFE_HASH_DECLARE(z, znode); + +typedef struct { + __u64 objectid; + unsigned long index; + struct address_space *mapping; +} jnode_key_t; + +/* + Jnode is the "base class" of other nodes in reiser4. It also happens to + be exactly the node we use for unformatted tree nodes. + + Jnode provides the following basic functionality: + + . reference counting and indexing. + + . integration with page cache. Jnode has ->pg reference to which page can + be attached. + + . interface to transaction manager. It is jnode that is kept in transaction + manager lists, attached to atoms, etc. 
(NOTE-NIKITA one may argue that this + means, there should be special type of jnode for inode.) + + Locking: + + Spin lock: the following fields are protected by the per-jnode spin lock: + + ->state + ->atom + ->capture_link + + Following fields are protected by the global tree lock: + + ->link + ->key.z (content of ->key.z is only changed in znode_rehash()) + ->key.j + + Atomic counters + + ->x_count + ->d_count + + ->pg, and ->data are protected by spin lock for unused jnode and are + immutable for used jnode (one for which fs/reiser4/vfs_ops.c:releasable() + is false). + + ->tree is immutable after creation + + Unclear + + ->blocknr: should be under jnode spin-lock, but current interface is based + on passing of block address. + + If you ever need to spin lock two nodes at once, do this in "natural" + memory order: lock znode with lower address first. (See lock_two_nodes().) + + Invariants involving this data-type: + + [jnode-dirty] + [jnode-refs] + [jnode-oid] + [jnode-queued] + [jnode-atom-valid] + [jnode-page-binding] +*/ + +struct jnode { +#if REISER4_DEBUG +#define JMAGIC 0x52654973 /* "ReIs" */ + int magic; +#endif + /* FIRST CACHE LINE (16 bytes): data used by jload */ + + /* jnode's state: bitwise flags from the reiser4_jnode_state enum. */ + /* 0 */ unsigned long state; + + /* lock, protecting jnode's fields. */ + /* 4 */ reiser4_spin_data load; + + /* counter of references to jnode itself. Increased on jref(). + Decreased on jput(). + */ + /* 8 */ atomic_t x_count; + + /* counter of references to jnode's data. Pin data page(s) in + memory while this is greater than 0. Increased on jload(). + Decreased on jrelse(). 
+ */ + /* 12 */ atomic_t d_count; + + /* SECOND CACHE LINE: data used by hash table lookups */ + + /* 16 */ union { + /* znodes are hashed by block number */ + reiser4_block_nr z; + /* unformatted nodes are hashed by mapping plus offset */ + jnode_key_t j; + } key; + + /* THIRD CACHE LINE */ + + /* 32 */ union { + /* pointers to maintain hash-table */ + z_hash_link z; + j_hash_link j; + } link; + + /* pointer to jnode page. */ + /* 36 */ struct page *pg; + /* pointer to node itself. This is page_address(node->pg) when page is + attached to the jnode + */ + /* 40 */ void *data; + + /* 44 */ reiser4_tree *tree; + + /* FOURTH CACHE LINE: atom related fields */ + + /* 48 */ reiser4_spin_data guard; + + /* atom the block is in, if any */ + /* 52 */ txn_atom *atom; + + /* capture list */ + /* 56 */ capture_list_link capture_link; + + /* FIFTH CACHE LINE */ + + /* 64 */ struct rcu_head rcu; /* crosses cache line */ + + /* SIXTH CACHE LINE */ + + /* the real blocknr (where io is going to/from) */ + /* 80 */ reiser4_block_nr blocknr; + /* Parent item type, unformatted and CRC need it for offset => key conversion. */ + /* NOTE: this parent_item_id looks like jnode type. */ + /* 88 */ reiser4_plugin_id parent_item_id; + /* 92 */ +#if REISER4_DEBUG + /* number of pages referenced by the jnode (meaningful while capturing of + page clusters) */ + int page_count; + /* list of all jnodes for debugging purposes. */ + struct list_head jnodes; + /* how many times this jnode was written in one transaction */ + int written; + /* this indicates which atom's list the jnode is on */ + atom_list list1; + /* for debugging jnodes of one inode are attached to inode via this list */ + inode_jnodes_list_link inode_link; +#endif +} __attribute__((aligned(16))); + + +/* + * jnode types. Enumeration of existing jnode types. 
+ */ +typedef enum { + JNODE_UNFORMATTED_BLOCK, /* unformatted block */ + JNODE_FORMATTED_BLOCK, /* formatted block, znode */ + JNODE_BITMAP, /* bitmap */ + JNODE_IO_HEAD, /* jnode representing a block in the + * wandering log */ + JNODE_INODE, /* jnode embedded into inode */ + LAST_JNODE_TYPE +} jnode_type; + +TYPE_SAFE_LIST_DEFINE(capture, jnode, capture_link); +#if REISER4_DEBUG +TYPE_SAFE_LIST_DEFINE(inode_jnodes, jnode, inode_link); +#endif + +/* jnode states */ +typedef enum { + /* jnode's page is loaded and data checked */ + JNODE_PARSED = 0, + /* node was deleted, not all locks on it were released. This + node is empty and is going to be removed from the tree + shortly. */ + JNODE_HEARD_BANSHEE = 1, + /* left sibling pointer is valid */ + JNODE_LEFT_CONNECTED = 2, + /* right sibling pointer is valid */ + JNODE_RIGHT_CONNECTED = 3, + + /* znode was just created and doesn't yet have a pointer from + its parent */ + JNODE_ORPHAN = 4, + + /* this node was created by its transaction and has not been assigned + a block address. */ + JNODE_CREATED = 5, + + /* this node is currently relocated */ + JNODE_RELOC = 6, + /* this node is currently wandered */ + JNODE_OVRWR = 7, + + /* this znode has been modified */ + JNODE_DIRTY = 8, + + /* znode lock is being invalidated */ + JNODE_IS_DYING = 9, + + /* THIS PLACE IS INTENTIONALLY LEFT BLANK */ + + JNODE_EFLUSH = 11, + + /* jnode is queued for flushing. */ + JNODE_FLUSH_QUEUED = 12, + + /* In the following bits jnode type is encoded. */ + JNODE_TYPE_1 = 13, + JNODE_TYPE_2 = 14, + JNODE_TYPE_3 = 15, + + /* jnode is being destroyed */ + JNODE_RIP = 16, + + /* znode was not captured during locking (it might so be because + ->level != LEAF_LEVEL and lock_mode == READ_LOCK) */ + JNODE_MISSED_IN_CAPTURE = 17, + + /* write is in progress */ + JNODE_WRITEBACK = 18, + + /* FIXME: now it is used by crypto-compress plugin only */ + JNODE_NEW = 19, + + /* delimiting keys are already set for this znode. 
*/ + JNODE_DKSET = 20, + + /* cheap and effective protection of jnode from emergency flush. This + * bit can only be set by thread that holds long term lock on jnode + * parent node (twig node, where extent unit lives). */ + JNODE_EPROTECTED = 21, + JNODE_CLUSTER_PAGE = 22, + /* Jnode is marked for repacking, that means the reiser4 flush and the + * block allocator should process this node in a special way */ + JNODE_REPACK = 23, + /* node should be converted by flush in squalloc phase */ + JNODE_CONVERTIBLE = 24, + + JNODE_SCANNED = 25, + JNODE_JLOADED_BY_GET_OVERWRITE_SET = 26, + /* capture copy jnode */ + JNODE_CC = 27, + /* this jnode is copy of coced original */ + JNODE_CCED = 28, + /* + * When jnode is dirtied for the first time in given transaction, + * do_jnode_make_dirty() checks whether this jnode can possibly become a + * member of the overwrite set. If so, this bit is set, and one block is + * reserved in the ->flush_reserved space of atom. + * + * This block is "used" (and JNODE_FLUSH_RESERVED bit is cleared) when + * + * (1) flush decides that we want this block to go into relocate + * set after all. + * + * (2) wandering log is allocated (by log writer) + * + * (3) extent is allocated + * + */ + JNODE_FLUSH_RESERVED = 29, + /* if page was dirtied through mmap, we don't want to lose data, even + * though page and jnode may be clean. Mark jnode with JNODE_KEEPME so + * that ->releasepage() can tell. This is used only for + * unformatted */ + JNODE_KEEPME = 30, +#if REISER4_DEBUG + /* uneflushed */ + JNODE_UNEFLUSHED = 31 +#endif +} reiser4_jnode_state; + +/* Macros for accessing the jnode state. 
*/ + +static inline void +JF_CLR(jnode * j, int f) +{ + assert("unknown-1", j->magic == JMAGIC); + clear_bit(f, &j->state); +} +static inline int +JF_ISSET(const jnode * j, int f) +{ + assert("unknown-2", j->magic == JMAGIC); + return test_bit(f, &((jnode *) j)->state); +} +static inline void +JF_SET(jnode * j, int f) +{ + assert("unknown-3", j->magic == JMAGIC); + set_bit(f, &j->state); +} + +static inline int +JF_TEST_AND_SET(jnode * j, int f) +{ + assert("unknown-4", j->magic == JMAGIC); + return test_and_set_bit(f, &j->state); +} + +/* ordering constraint for znode spin lock: znode lock is weaker than + tree lock and dk lock */ +#define spin_ordering_pred_jnode( node ) \ + ( ( lock_counters() -> rw_locked_tree == 0 ) && \ + ( lock_counters() -> spin_locked_txnh == 0 ) && \ + ( lock_counters() -> rw_locked_zlock == 0 ) && \ + ( lock_counters() -> rw_locked_dk == 0 ) && \ + /* \ + in addition you cannot hold more than one jnode spin lock at a \ + time. \ + */ \ + ( lock_counters() -> spin_locked_jnode < 2 ) ) + +/* Define spin_lock_jnode, spin_unlock_jnode, and spin_jnode_is_locked. + Take and release short-term spinlocks. Don't hold these across + io. 
+*/ +SPIN_LOCK_FUNCTIONS(jnode, jnode, guard); + +#define spin_ordering_pred_jload(node) (1) + +SPIN_LOCK_FUNCTIONS(jload, jnode, load); + +static inline int +jnode_is_in_deleteset(const jnode * node) +{ + return JF_ISSET(node, JNODE_RELOC); +} + + +extern int jnode_init_static(void); +extern int jnode_done_static(void); + +/* Jnode routines */ +extern jnode *jalloc(void); +extern void jfree(jnode * node) NONNULL; +extern jnode *jclone(jnode *); +extern jnode *jlookup(reiser4_tree * tree, + oid_t objectid, unsigned long ind) NONNULL; +extern jnode *jfind(struct address_space *, unsigned long index) NONNULL; +extern jnode *jnode_by_page(struct page *pg) NONNULL; +extern jnode *jnode_of_page(struct page *pg) NONNULL; +void jnode_attach_page(jnode * node, struct page *pg); +jnode *find_get_jnode(reiser4_tree * tree, + struct address_space *mapping, oid_t oid, + unsigned long index); + +void unhash_unformatted_jnode(jnode *); +struct page *jnode_get_page_locked(jnode *, int gfp_flags); +extern jnode *page_next_jnode(jnode * node) NONNULL; +extern void jnode_init(jnode * node, reiser4_tree * tree, jnode_type) NONNULL; +extern void jnode_make_dirty(jnode * node) NONNULL; +extern void jnode_make_clean(jnode * node) NONNULL; +extern void jnode_make_wander_nolock(jnode * node) NONNULL; +extern void jnode_make_wander(jnode*) NONNULL; +extern void znode_make_reloc(znode*, flush_queue_t*) NONNULL; +extern void unformatted_make_reloc(jnode*, flush_queue_t*) NONNULL; + +extern void jnode_set_block(jnode * node, + const reiser4_block_nr * blocknr) NONNULL; +/*extern struct page *jnode_lock_page(jnode *) NONNULL;*/ +extern struct address_space *jnode_get_mapping(const jnode * node) NONNULL; + +/* block number of node */ +static inline const reiser4_block_nr * +jnode_get_block(const jnode * node /* jnode to query */) +{ + assert("nikita-528", node != NULL); + + return &node->blocknr; +} + +/* block number for IO. 
Usually this is the same as jnode_get_block(), unless + * jnode was emergency flushed---then block number chosen by eflush is + * used. */ +static inline const reiser4_block_nr * +jnode_get_io_block(const jnode * node) +{ + assert("nikita-2768", node != NULL); + assert("nikita-2769", spin_jnode_is_locked(node)); + + if (unlikely(JF_ISSET(node, JNODE_EFLUSH))) + return eflush_get(node); + else + return jnode_get_block(node); +} + +/* Jnode flush interface. */ +extern reiser4_blocknr_hint *pos_hint(flush_pos_t * pos); +extern flush_queue_t * pos_fq(flush_pos_t * pos); + +/* FIXME-VS: these are used in plugin/item/extent.c */ + +/* does extent_get_block have to be called */ +#define jnode_mapped(node) JF_ISSET (node, JNODE_MAPPED) +#define jnode_set_mapped(node) JF_SET (node, JNODE_MAPPED) +/* pointer to this block was just created (either by appending or by plugging a + hole), or zinit_new was called */ +#define jnode_created(node) JF_ISSET (node, JNODE_CREATED) +#define jnode_set_created(node) JF_SET (node, JNODE_CREATED) + +/* the node should be converted during flush squalloc phase */ +#define jnode_convertible(node) JF_ISSET (node, JNODE_CONVERTIBLE) +#define jnode_set_convertible(node) JF_SET (node, JNODE_CONVERTIBLE) + +/* Macros to convert from jnode to znode, znode to jnode. These are macros + because C doesn't allow overloading of const prototypes. 
*/ +#define ZJNODE(x) (& (x) -> zjnode) +#define JZNODE(x) \ +({ \ + typeof (x) __tmp_x; \ + \ + __tmp_x = (x); \ + assert ("jmacd-1300", jnode_is_znode (__tmp_x)); \ + (znode*) __tmp_x; \ +}) + +extern int jnodes_tree_init(reiser4_tree * tree); +extern int jnodes_tree_done(reiser4_tree * tree); + +#if REISER4_DEBUG + +extern int znode_is_any_locked(const znode * node); +extern void jnode_list_remove(jnode * node); +extern void info_jnode(const char *prefix, const jnode * node); +extern void print_jnode(const char *prefix, const jnode * node); + +#else + +#define jnode_list_remove(node) noop +#define info_jnode(p, n) noop +#define print_jnode(p, n) noop + +#endif + +int znode_is_root(const znode * node) NONNULL; + +/* bump reference counter on @node */ +static inline void +add_x_ref(jnode * node /* node to increase x_count of */ ) +{ + assert("nikita-1911", node != NULL); + + atomic_inc(&node->x_count); + LOCK_CNT_INC(x_refs); +} + +static inline void +dec_x_ref(jnode * node) +{ + assert("nikita-3215", node != NULL); + assert("nikita-3216", atomic_read(&node->x_count) > 0); + + atomic_dec(&node->x_count); + assert("nikita-3217", LOCK_CNT_GTZ(x_refs)); + LOCK_CNT_DEC(x_refs); +} + +/* jref() - increase counter of references to jnode/znode (x_count) */ +static inline jnode * +jref(jnode * node) +{ + assert("jmacd-508", (node != NULL) && !IS_ERR(node)); + add_x_ref(node); + return node; +} + +/* get the page of jnode */ +static inline struct page * +jnode_page(const jnode * node) +{ + return node->pg; +} + +/* return pointer to jnode data */ +static inline char * +jdata(const jnode * node) +{ + assert("nikita-1415", node != NULL); + assert("nikita-3198", jnode_page(node) != NULL); + return node->data; +} + +static inline int +jnode_is_loaded(const jnode * node) +{ + assert("zam-506", node != NULL); + return atomic_read(&node->d_count) > 0; +} + +extern void page_detach_jnode(struct page *page, + struct address_space *mapping, + unsigned long index) NONNULL; +extern 
void page_clear_jnode(struct page *page, jnode * node) NONNULL; + +static inline void +jnode_set_reloc(jnode * node) +{ + assert("nikita-2431", node != NULL); + assert("nikita-2432", !JF_ISSET(node, JNODE_OVRWR)); + JF_SET(node, JNODE_RELOC); +} + +/* bump data counter on @node */ +static inline void add_d_ref(jnode * node /* node to increase d_count of */ ) +{ + assert("nikita-1962", node != NULL); + + atomic_inc(&node->d_count); + LOCK_CNT_INC(d_refs); +} + + +/* jload/jwrite/junload give a bread/bwrite/brelse functionality for jnodes */ + +extern int jload_gfp(jnode * node, int gfp, int do_kmap) NONNULL; + +static inline int jload(jnode * node) +{ + return jload_gfp(node, GFP_KERNEL, 1); +} + +extern int jinit_new(jnode * node, int gfp_flags) NONNULL; +extern int jstartio(jnode * node) NONNULL; + +extern void jdrop(jnode * node) NONNULL; +extern int jwait_io(jnode * node, int rw) NONNULL; + +void jload_prefetch(jnode *); + +extern jnode *alloc_io_head(const reiser4_block_nr * block) NONNULL; +extern void drop_io_head(jnode * node) NONNULL; + +static inline reiser4_tree * +jnode_get_tree(const jnode * node) +{ + assert("nikita-2691", node != NULL); + return node->tree; +} + +extern void pin_jnode_data(jnode *); +extern void unpin_jnode_data(jnode *); + +static inline jnode_type +jnode_get_type(const jnode * node) +{ + static const unsigned long state_mask = + (1 << JNODE_TYPE_1) | (1 << JNODE_TYPE_2) | (1 << JNODE_TYPE_3); + + static jnode_type mask_to_type[] = { + /* JNODE_TYPE_3 : JNODE_TYPE_2 : JNODE_TYPE_1 */ + + /* 000 */ + [0] = JNODE_FORMATTED_BLOCK, + /* 001 */ + [1] = JNODE_UNFORMATTED_BLOCK, + /* 010 */ + [2] = JNODE_BITMAP, + /* 011 */ + [3] = LAST_JNODE_TYPE, /*invalid */ + /* 100 */ + [4] = JNODE_INODE, + /* 101 */ + [5] = LAST_JNODE_TYPE, + /* 110 */ + [6] = JNODE_IO_HEAD, + /* 111 */ + [7] = LAST_JNODE_TYPE, /* invalid */ + }; + + return mask_to_type[(node->state & state_mask) >> JNODE_TYPE_1]; +} + +/* returns true if node is a znode */ +static 
inline int +jnode_is_znode(const jnode * node) +{ + return jnode_get_type(node) == JNODE_FORMATTED_BLOCK; +} + +/* return true if "node" is dirty */ +static inline int +jnode_is_dirty(const jnode * node) +{ + assert("nikita-782", node != NULL); + assert("jmacd-1800", spin_jnode_is_locked(node) || (jnode_is_znode(node) && znode_is_any_locked(JZNODE(node)))); + return JF_ISSET(node, JNODE_DIRTY); +} + +/* return true if "node" is dirty, node is unlocked */ +static inline int +jnode_check_dirty(jnode * node) +{ + assert("jmacd-7798", node != NULL); + assert("jmacd-7799", spin_jnode_is_not_locked(node)); + return UNDER_SPIN(jnode, node, jnode_is_dirty(node)); +} + +static inline int +jnode_is_flushprepped(const jnode * node) +{ + assert("jmacd-78212", node != NULL); + assert("jmacd-71276", spin_jnode_is_locked(node)); + return !jnode_is_dirty(node) || JF_ISSET(node, JNODE_RELOC) + || JF_ISSET(node, JNODE_OVRWR); +} + +/* Return true if @node has already been processed by the squeeze and allocate + process. This implies the block address has been finalized for the + duration of this atom (or it is clean and will remain in place). If this + returns true you may use the block number as a hint. */ +static inline int +jnode_check_flushprepped(jnode * node) +{ + /* It must be clean or relocated or wandered. New allocations are set to relocate. 
*/ + assert("jmacd-71275", spin_jnode_is_not_locked(node)); + return UNDER_SPIN(jnode, node, jnode_is_flushprepped(node)); +} + +/* returns true if node is unformatted */ +static inline int +jnode_is_unformatted(const jnode * node) +{ + assert("jmacd-0123", node != NULL); + return jnode_get_type(node) == JNODE_UNFORMATTED_BLOCK; +} + +/* returns true if node represents a cluster cache page */ +static inline int +jnode_is_cluster_page(const jnode * node) +{ + assert("edward-50", node != NULL); + return (JF_ISSET(node, JNODE_CLUSTER_PAGE)); +} + +/* returns true if node is builtin inode's jnode */ +static inline int +jnode_is_inode(const jnode * node) +{ + assert("vs-1240", node != NULL); + return jnode_get_type(node) == JNODE_INODE; +} + +static inline jnode_plugin * +jnode_ops_of(const jnode_type type) +{ + assert("nikita-2367", type < LAST_JNODE_TYPE); + return jnode_plugin_by_id((reiser4_plugin_id) type); +} + +static inline jnode_plugin * +jnode_ops(const jnode * node) +{ + assert("nikita-2366", node != NULL); + + return jnode_ops_of(jnode_get_type(node)); +} + +/* Get the index of a block. */ +static inline unsigned long +jnode_get_index(jnode * node) +{ + return jnode_ops(node)->index(node); +} + +/* return true if "node" is the root */ +static inline int +jnode_is_root(const jnode * node) +{ + return jnode_is_znode(node) && znode_is_root(JZNODE(node)); +} + +extern struct address_space * mapping_jnode(const jnode * node); +extern unsigned long index_jnode(const jnode * node); + +static inline void jput(jnode * node); +extern void jput_final(jnode * node); + + +/* jput() - decrement x_count reference counter on znode. + + Count may drop to 0, jnode stays in cache until memory pressure causes the + eviction of its page. The c_count variable also ensures that children are + pressured out of memory before the parent. The jnode remains hashed as + long as the VM allows its page to stay in memory.
+*/ +static inline void +jput(jnode * node) +{ + assert("jmacd-509", node != NULL); + assert("jmacd-510", atomic_read(&node->x_count) > 0); + assert("nikita-3065", spin_jnode_is_not_locked(node)); + assert("zam-926", schedulable()); + LOCK_CNT_DEC(x_refs); + + rcu_read_lock(); + /* + * we don't need any kind of lock here--jput_final() uses RCU. + */ + if (unlikely(atomic_dec_and_test(&node->x_count))) { + jput_final(node); + } else + rcu_read_unlock(); + assert("nikita-3473", schedulable()); +} + +extern void jrelse(jnode * node); +extern void jrelse_tail(jnode * node); + +extern jnode *jnode_rip_sync(reiser4_tree *t, jnode * node); + +/* resolve race with jput */ +static inline jnode * +jnode_rip_check(reiser4_tree *tree, jnode * node) +{ + if (unlikely(JF_ISSET(node, JNODE_RIP))) + node = jnode_rip_sync(tree, node); + return node; +} + +extern reiser4_key * jnode_build_key(const jnode * node, reiser4_key * key); + +/* __JNODE_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/kassign.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/kassign.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,738 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Key assignment policy implementation */ + +/* + * In reiser4 every piece of file system data and meta-data has a key. Keys + * are used to store information in and retrieve it from reiser4 internal + * tree. In addition to this, keys define _ordering_ of all file system + * information: things having close keys are placed into the same or + * neighboring (in the tree order) nodes of the tree. As our block allocator + * tries to respect tree order (see flush.c), keys also define order in which + * things are laid out on the disk, and hence, affect performance directly. 
+ * + * Obviously, assignment of keys to data and meta-data should be consistent + * across whole file system. Algorithm that calculates a key for a given piece + * of data or meta-data is referred to as "key assignment". + * + * Key assignment is too expensive to be implemented as a plugin (that is, + * with an ability to support different key assignment schemas in the same + * compiled kernel image). As a compromise, all key-assignment functions and + * data-structures are collected in this single file, so that modifications to + * key assignment algorithm can be localized. Additional changes may be + * required in key.[ch]. + * + * Current default reiser4 key assignment algorithm is dubbed "Plan A". As one + * may guess, there is "Plan B" too. + * + */ + +/* + * Additional complication with key assignment implementation is a requirement + * to support different key length. + */ + +/* + * KEY ASSIGNMENT: PLAN A, LONG KEYS. + * + * DIRECTORY ITEMS + * + * | 60 | 4 | 7 |1| 56 | 64 | 64 | + * +--------------+---+---+-+-------------+------------------+-----------------+ + * | dirid | 0 | F |H| prefix-1 | prefix-2 | prefix-3/hash | + * +--------------+---+---+-+-------------+------------------+-----------------+ + * | | | | | + * | 8 bytes | 8 bytes | 8 bytes | 8 bytes | + * + * dirid objectid of directory this item is for + * + * F fibration, see fs/reiser4/plugin/fibration.[ch] + * + * H 1 if last 8 bytes of the key contain hash, + * 0 if last 8 bytes of the key contain prefix-3 + * + * prefix-1 first 7 characters of file name. + * Padded by zeroes if name is not long enough. + * + * prefix-2 next 8 characters of the file name. + * + * prefix-3 next 8 characters of the file name. + * + * hash hash of the rest of file name (i.e., portion of file + * name not included into prefix-1 and prefix-2). + * + * File names shorter than 23 (== 7 + 8 + 8) characters are completely encoded + * in the key. Such file names are called "short". 
They are distinguished by H + * bit set 0 in the key. + * + * Other file names are "long". For long name, H bit is 1, and first 15 (== 7 + * + 8) characters are encoded in prefix-1 and prefix-2 portions of the + * key. Last 8 bytes of the key are occupied by hash of the remaining + * characters of the name. + * + * This key assignment reaches following important goals: + * + * (1) directory entries are sorted in approximately lexicographical + * order. + * + * (2) collisions (when multiple directory items have the same key), while + * principally unavoidable in a tree with fixed length keys, are rare. + * + * STAT DATA + * + * | 60 | 4 | 64 | 4 | 60 | 64 | + * +--------------+---+-----------------+---+--------------+-----------------+ + * | locality id | 1 | ordering | 0 | objectid | 0 | + * +--------------+---+-----------------+---+--------------+-----------------+ + * | | | | | + * | 8 bytes | 8 bytes | 8 bytes | 8 bytes | + * + * locality id object id of a directory where first name was created for + * the object + * + * ordering copy of second 8-byte portion of the key of directory + * entry for the first name of this object. Ordering has a form + * { + * fibration :7; + * h :1; + * prefix1 :56; + * } + * see description of key for directory entry above. + * + * objectid object id for this object + * + * This key assignment policy is designed to keep stat-data in the same order + * as corresponding directory items, thus speeding up readdir/stat types of + * workload. 
+ * + * FILE BODY + * + * | 60 | 4 | 64 | 4 | 60 | 64 | + * +--------------+---+-----------------+---+--------------+-----------------+ + * | locality id | 4 | ordering | 0 | objectid | offset | + * +--------------+---+-----------------+---+--------------+-----------------+ + * | | | | | + * | 8 bytes | 8 bytes | 8 bytes | 8 bytes | + * + * locality id object id of a directory where first name was created for + * the object + * + * ordering the same as in the key of stat-data for this object + * + * objectid object id for this object + * + * offset logical offset from the beginning of this file. + * Measured in bytes. + * + * + * KEY ASSIGNMENT: PLAN A, SHORT KEYS. + * + * DIRECTORY ITEMS + * + * | 60 | 4 | 7 |1| 56 | 64 | + * +--------------+---+---+-+-------------+-----------------+ + * | dirid | 0 | F |H| prefix-1 | prefix-2/hash | + * +--------------+---+---+-+-------------+-----------------+ + * | | | | + * | 8 bytes | 8 bytes | 8 bytes | + * + * dirid objectid of directory this item is for + * + * F fibration, see fs/reiser4/plugin/fibration.[ch] + * + * H 1 if last 8 bytes of the key contain hash, + * 0 if last 8 bytes of the key contain prefix-2 + * + * prefix-1 first 7 characters of file name. + * Padded by zeroes if name is not long enough. + * + * prefix-2 next 8 characters of the file name. + * + * hash hash of the rest of file name (i.e., portion of file + * name not included into prefix-1). + * + * File names shorter than 15 (== 7 + 8) characters are completely encoded in + * the key. Such file names are called "short". They are distinguished by H + * bit set 0 in the key. + * + * Other file names are "long". For long name, H bit is 1, and first 7 + * characters are encoded in prefix-1 portion of the key. Last 8 bytes of the + * key are occupied by hash of the remaining characters of the name.
+ * + * STAT DATA + * + * | 60 | 4 | 4 | 60 | 64 | + * +--------------+---+---+--------------+-----------------+ + * | locality id | 1 | 0 | objectid | 0 | + * +--------------+---+---+--------------+-----------------+ + * | | | | + * | 8 bytes | 8 bytes | 8 bytes | + * + * locality id object id of a directory where first name was created for + * the object + * + * objectid object id for this object + * + * FILE BODY + * + * | 60 | 4 | 4 | 60 | 64 | + * +--------------+---+---+--------------+-----------------+ + * | locality id | 4 | 0 | objectid | offset | + * +--------------+---+---+--------------+-----------------+ + * | | | | + * | 8 bytes | 8 bytes | 8 bytes | + * + * locality id object id of a directory where first name was created for + * the object + * + * objectid object id for this object + * + * offset logical offset from the beginning of this file. + * Measured in bytes. + * + * + */ + +#include "debug.h" +#include "key.h" +#include "kassign.h" +#include "vfs_ops.h" +#include "inode.h" +#include "super.h" +#include "dscale.h" + +#include /* for __u?? */ +#include /* for struct super_block, etc */ + +#if REISER4_LARGE_KEY +#define ORDERING_CHARS (sizeof(__u64) - 1) +#define OID_CHARS (sizeof(__u64)) +#else +#define ORDERING_CHARS (0) +#define OID_CHARS (sizeof(__u64) - 1) +#endif + +#define OFFSET_CHARS (sizeof(__u64)) + +#define INLINE_CHARS (ORDERING_CHARS + OID_CHARS) + +/* bitmask for H bit (see comment at the beginning of this file) */ +static const __u64 longname_mark = 0x0100000000000000ull; +/* bitmask for F and H portions of the key.
*/ +static const __u64 fibration_mask = 0xff00000000000000ull; + +/* return true if name is not completely encoded in @key */ +reiser4_internal int +is_longname_key(const reiser4_key *key) +{ + __u64 highpart; + + assert("nikita-2863", key != NULL); + if (get_key_type(key) != KEY_FILE_NAME_MINOR) + print_key("oops", key); + assert("nikita-2864", get_key_type(key) == KEY_FILE_NAME_MINOR); + + if (REISER4_LARGE_KEY) + highpart = get_key_ordering(key); + else + highpart = get_key_objectid(key); + + return (highpart & longname_mark) ? 1 : 0; +} + +/* return true if @name is too long to be completely encoded in the key */ +reiser4_internal int +is_longname(const char *name UNUSED_ARG, int len) +{ + return len > ORDERING_CHARS + OID_CHARS + OFFSET_CHARS; +} + +/* code ascii string into __u64. + + Put characters of @name into result (@str) one after another starting + from @start_idx-th highest (arithmetically) byte. This produces + endian-safe encoding. memcpy(2) will not do. + +*/ +static __u64 +pack_string(const char *name /* string to encode */ , + int start_idx /* highest byte in result from + * which to start encoding */ ) +{ + unsigned i; + __u64 str; + + str = 0; + for (i = 0; (i < sizeof str - start_idx) && name[i]; ++i) { + str <<= 8; + str |= (unsigned char) name[i]; + } + str <<= (sizeof str - i - start_idx) << 3; + return str; +} + +#if !REISER4_DEBUG +static +#endif +/* opposite to pack_string(). 
Takes value produced by pack_string(), restores + * string encoded in it and stores result in @buf */ +reiser4_internal char * +unpack_string(__u64 value, char *buf) +{ + do { + *buf = value >> (64 - 8); + if (*buf) + ++ buf; + value <<= 8; + } while(value != 0); + *buf = 0; + return buf; +} + +/* obtain name encoded in @key and store it in @buf */ +reiser4_internal char * +extract_name_from_key(const reiser4_key *key, char *buf) +{ + char *c; + + assert("nikita-2868", !is_longname_key(key)); + + c = buf; + if (REISER4_LARGE_KEY) { + c = unpack_string(get_key_ordering(key) & ~fibration_mask, c); + c = unpack_string(get_key_fulloid(key), c); + } else + c = unpack_string(get_key_fulloid(key) & ~fibration_mask, c); + unpack_string(get_key_offset(key), c); + return buf; +} + +/* build key for directory entry. + ->build_entry_key() for directory plugin */ +reiser4_internal void +build_entry_key_common(const struct inode *dir /* directory where entry is + * (or will be) in.*/ , + const struct qstr *qname /* name of file referenced + * by this entry */ , + reiser4_key * result /* resulting key of directory + * entry */ ) +{ + __u64 ordering; + __u64 objectid; + __u64 offset; + const char *name; + int len; + +#if REISER4_LARGE_KEY +#define second_el ordering +#else +#define second_el objectid +#endif + + assert("nikita-1139", dir != NULL); + assert("nikita-1140", qname != NULL); + assert("nikita-1141", qname->name != NULL); + assert("nikita-1142", result != NULL); + + name = qname->name; + len = qname->len; + + assert("nikita-2867", strlen(name) == len); + + reiser4_key_init(result); + /* locality of directory entry's key is objectid of parent + directory */ + set_key_locality(result, get_inode_oid(dir)); + /* minor packing locality is constant */ + set_key_type(result, KEY_FILE_NAME_MINOR); + /* dot is special case---we always want it to be first entry in + a directory. Actually, we just want to have smallest + directory entry. 
+ */ + if (len == 1 && name[0] == '.') + return; + + /* This is our brand new proposed key allocation algorithm for + directory entries: + + If name is shorter than 7 + 8 = 15 characters, put first 7 + characters into objectid field and remaining characters (if + any) into offset field. Dream long dreamt came true: file + name as a key! + + If file name is longer than 15 characters, put first 7 + characters into objectid and hash of remaining characters + into offset field. + + To distinguish above cases, in latter set up unused high bit + in objectid field. + + + With large keys (REISER4_LARGE_KEY) algorithm is updated + appropriately. + */ + + /* objectid of key is composed of seven first characters of + file's name. This imposes global ordering on directory + entries. + */ + second_el = pack_string(name, 1); + if (REISER4_LARGE_KEY) { + if (len > ORDERING_CHARS) + objectid = pack_string(name + ORDERING_CHARS, 0); + else + objectid = 0ull; + } + + if (!is_longname(name, len)) { + if (len > INLINE_CHARS) + offset = pack_string(name + INLINE_CHARS, 0); + else + offset = 0ull; + } else { + /* note in a key the fact that offset contains hash. */ + second_el |= longname_mark; + + /* offset is the hash of the file name. */ + offset = inode_hash_plugin(dir)->hash(name + INLINE_CHARS, + len - INLINE_CHARS); + } + + assert("nikita-3480", inode_fibration_plugin(dir) != NULL); + second_el |= inode_fibration_plugin(dir)->fibre(dir, name, len); + + set_key_ordering(result, ordering); + set_key_fulloid(result, objectid); + set_key_offset(result, offset); + return; +} + +/* build key for directory entry. + ->build_entry_key() for directory plugin + + This is for directories where we want repeatable and restartable readdir() + even in case 32bit user level struct dirent (readdir(3)). +*/ +reiser4_internal void +build_entry_key_stable_entry(const struct inode *dir /* directory where + * entry is (or + * will be) in. 
*/ , + const struct qstr *name /* name of file + * referenced by + * this entry */ , + reiser4_key * result /* resulting key of + * directory entry */ ) +{ + oid_t objectid; + + assert("nikita-2283", dir != NULL); + assert("nikita-2284", name != NULL); + assert("nikita-2285", name->name != NULL); + assert("nikita-2286", result != NULL); + + reiser4_key_init(result); + /* locality of directory entry's key is objectid of parent + directory */ + set_key_locality(result, get_inode_oid(dir)); + /* minor packing locality is constant */ + set_key_type(result, KEY_FILE_NAME_MINOR); + /* dot is special case---we always want it to be first entry in + a directory. Actually, we just want to have smallest + directory entry. + */ + if ((name->len == 1) && (name->name[0] == '.')) + return; + + /* objectid of key is 31 lowest bits of hash. */ + objectid = inode_hash_plugin(dir)->hash(name->name, (int) name->len) & 0x7fffffff; + + assert("nikita-2303", !(objectid & ~KEY_OBJECTID_MASK)); + set_key_objectid(result, objectid); + + /* offset is always 0. */ + set_key_offset(result, (__u64) 0); + return; +} + +/* build key to be used by ->readdir() method. + + See reiser4_readdir() for more detailed comment. + Common implementation of dir plugin's method build_readdir_key +*/ +reiser4_internal int +build_readdir_key_common(struct file *dir /* directory being read */ , + reiser4_key * result /* where to store key */ ) +{ + reiser4_file_fsdata *fdata; + struct inode *inode; + + assert("nikita-1361", dir != NULL); + assert("nikita-1362", result != NULL); + assert("nikita-1363", dir->f_dentry != NULL); + inode = dir->f_dentry->d_inode; + assert("nikita-1373", inode != NULL); + + fdata = reiser4_get_file_fsdata(dir); + if (IS_ERR(fdata)) + return PTR_ERR(fdata); + assert("nikita-1364", fdata != NULL); + return extract_key_from_de_id(get_inode_oid(inode), &fdata->dir.readdir.position.dir_entry_key, result); + +} + +/* true, if @key is the key of "." 
*/ +reiser4_internal int +is_dot_key(const reiser4_key * key /* key to check */ ) +{ + assert("nikita-1717", key != NULL); + assert("nikita-1718", get_key_type(key) == KEY_FILE_NAME_MINOR); + return + (get_key_ordering(key) == 0ull) && + (get_key_objectid(key) == 0ull) && + (get_key_offset(key) == 0ull); +} + +/* build key for stat-data. + + return key of stat-data of this object. This should become sd plugin + method in the future. For now, let it be here. + +*/ +reiser4_internal reiser4_key * +build_sd_key(const struct inode * target /* inode of an object */ , + reiser4_key * result /* resulting key of @target + stat-data */ ) +{ + assert("nikita-261", result != NULL); + + reiser4_key_init(result); + set_key_locality(result, reiser4_inode_data(target)->locality_id); + set_key_ordering(result, get_inode_ordering(target)); + set_key_objectid(result, get_inode_oid(target)); + set_key_type(result, KEY_SD_MINOR); + set_key_offset(result, (__u64) 0); + return result; +} + +/* encode part of key into &obj_key_id + + This encodes into @id part of @key sufficient to restore @key later, + given that latter is key of object (key of stat-data). + + See &obj_key_id +*/ +reiser4_internal int +build_obj_key_id(const reiser4_key * key /* key to encode */ , + obj_key_id * id /* id where key is encoded in */ ) +{ + assert("nikita-1151", key != NULL); + assert("nikita-1152", id != NULL); + + memcpy(id, key, sizeof *id); + return 0; +} + +/* encode reference to @obj in @id. + + This is like build_obj_key_id() above, but takes inode as parameter. */ +reiser4_internal int +build_inode_key_id(const struct inode *obj /* object to build key of */ , + obj_key_id * id /* result */ ) +{ + reiser4_key sdkey; + + assert("nikita-1166", obj != NULL); + assert("nikita-1167", id != NULL); + + build_sd_key(obj, &sdkey); + build_obj_key_id(&sdkey, id); + return 0; +} + +/* decode @id back into @key + + Restore key of object stat-data from @id. This is dual to + build_obj_key_id() above.
+*/ +reiser4_internal int +extract_key_from_id(const obj_key_id * id /* object key id to extract key + * from */ , + reiser4_key * key /* result */ ) +{ + assert("nikita-1153", id != NULL); + assert("nikita-1154", key != NULL); + + reiser4_key_init(key); + memcpy(key, id, sizeof *id); + return 0; +} + +/* extract objectid of directory from key of directory entry within said + directory. + */ +reiser4_internal oid_t +extract_dir_id_from_key(const reiser4_key * de_key /* key of + * directory + * entry */ ) +{ + assert("nikita-1314", de_key != NULL); + return get_key_locality(de_key); +} + +/* encode into @id key of directory entry. + + Encode into @id information sufficient to later distinguish directory + entries within the same directory. This is not whole key, because all + directory entries within directory item share locality which is equal + to objectid of their directory. + +*/ +reiser4_internal int +build_de_id(const struct inode *dir /* inode of directory */ , + const struct qstr *name /* name to be given to @obj by + * directory entry being + * constructed */ , + de_id * id /* short key of directory entry */ ) +{ + reiser4_key key; + + assert("nikita-1290", dir != NULL); + assert("nikita-1292", id != NULL); + + /* NOTE-NIKITA this is suboptimal. */ + inode_dir_plugin(dir)->build_entry_key(dir, name, &key); + return build_de_id_by_key(&key, id); +} + +/* encode into @id key of directory entry. + + Encode into @id information sufficient to later distinguish directory + entries within the same directory. This is not whole key, because all + directory entries within directory item share locality which is equal + to objectid of their directory. + +*/ +reiser4_internal int +build_de_id_by_key(const reiser4_key * entry_key /* full key of directory + * entry */ , + de_id * id /* short key of directory entry */ ) +{ + memcpy(id, ((__u64 *) entry_key) + 1, sizeof *id); + return 0; +} + +/* restore from @id key of directory entry. 
+ + Function dual to build_de_id(): given @id and locality, build full + key of directory entry within directory item. + +*/ +reiser4_internal int +extract_key_from_de_id(const oid_t locality /* locality of directory + * entry */ , + const de_id * id /* directory entry id */ , + reiser4_key * key /* result */ ) +{ + /* no need to initialise key here: all fields are overwritten */ + memcpy(((__u64 *) key) + 1, id, sizeof *id); + set_key_locality(key, locality); + set_key_type(key, KEY_FILE_NAME_MINOR); + return 0; +} + +/* compare two &de_id's */ +reiser4_internal cmp_t +de_id_cmp(const de_id * id1 /* first &de_id to compare */ , + const de_id * id2 /* second &de_id to compare */ ) +{ + /* NOTE-NIKITA ugly implementation */ + reiser4_key k1; + reiser4_key k2; + + extract_key_from_de_id((oid_t) 0, id1, &k1); + extract_key_from_de_id((oid_t) 0, id2, &k2); + return keycmp(&k1, &k2); +} + +/* compare &de_id with key */ +reiser4_internal cmp_t +de_id_key_cmp(const de_id * id /* directory entry id to compare */ , + const reiser4_key * key /* key to compare */ ) +{ + cmp_t result; + reiser4_key *k1; + + k1 = (reiser4_key *)(((unsigned long)id) - sizeof key->el[0]); + result = KEY_DIFF_EL(k1, key, 1); + if (result == EQUAL_TO) { + result = KEY_DIFF_EL(k1, key, 2); + if (REISER4_LARGE_KEY && result == EQUAL_TO) { + result = KEY_DIFF_EL(k1, key, 3); + } + } + return result; +} + +/* + * return number of bytes necessary to encode @inode identity. + */ +int inode_onwire_size(const struct inode *inode) +{ + int result; + + result = dscale_bytes(get_inode_oid(inode)); + result += dscale_bytes(get_inode_locality(inode)); + + /* + * ordering is large (it usually has highest bits set), so it makes + * little sense to dscale it. 
+ */ + if (REISER4_LARGE_KEY) + result += sizeof(get_inode_ordering(inode)); + return result; +} + +/* + * encode @inode identity at @start + */ +char *build_inode_onwire(const struct inode *inode, char *start) +{ + start += dscale_write(start, get_inode_locality(inode)); + start += dscale_write(start, get_inode_oid(inode)); + + if (REISER4_LARGE_KEY) { + cputod64(get_inode_ordering(inode), (d64 *)start); + start += sizeof(get_inode_ordering(inode)); + } + return start; +} + +/* + * extract key that was previously encoded by build_inode_onwire() at @addr + */ +char *extract_obj_key_id_from_onwire(char *addr, obj_key_id *key_id) +{ + __u64 val; + + addr += dscale_read(addr, &val); + val = (val << KEY_LOCALITY_SHIFT) | KEY_SD_MINOR; + cputod64(val, (d64 *)key_id->locality); + addr += dscale_read(addr, &val); + cputod64(val, (d64 *)key_id->objectid); +#if REISER4_LARGE_KEY + memcpy(&key_id->ordering, addr, sizeof key_id->ordering); + addr += sizeof key_id->ordering; +#endif + return addr; +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/kassign.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/kassign.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,97 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Key assignment policy interface. See kassign.c for details. */ + +#if !defined( __KASSIGN_H__ ) +#define __KASSIGN_H__ + +#include "forward.h" +#include "key.h" +#include "dformat.h" + +#include /* for __u?? */ +#include /* for struct super_block, etc */ +#include /* for struct qstr */ + +/* key assignment functions */ + +/* Information from which key of file stat-data can be uniquely + restored. This depends on key assignment policy for + stat-data. 
Currently it's enough to store object id and locality id + (60+60==120 bits), because minor packing locality and offset of + stat-data key are always known constants: KEY_SD_MINOR and 0 + respectively. For simplicity 4 bits are wasted in each id, and just + two 64 bit integers are stored. + + This field has to be byte-aligned, because we don't want to waste + space in directory entries. There is another side of the coin of + course: we waste CPU and bus bandwidth instead, by copying data back + and forth. + + Next optimization: &obj_key_id is mainly used to address stat data from + directory entries. Under the assumption that the majority of files have + only one name (one hard link) from *the* parent directory it seems reasonable + to only store objectid of stat data and take its locality from key of + directory item. + + This requires some flag to be added to the &obj_key_id to distinguish + between these two cases. Remaining bits in flag byte are then asking to be + used to store file type. + + This optimization requires changes in directory item handling code. + +*/ +typedef struct obj_key_id { + d8 locality[sizeof (__u64)]; + ON_LARGE_KEY(d8 ordering[sizeof (__u64)];) + d8 objectid[sizeof (__u64)]; +} obj_key_id; + +/* Information sufficient to uniquely identify directory entry within + compressed directory item. + + For alignment issues see &obj_key_id above.
+*/ +typedef struct de_id { + ON_LARGE_KEY(d8 ordering[sizeof (__u64)];) + d8 objectid[sizeof (__u64)]; + d8 offset[sizeof (__u64)]; +} de_id; + +extern int inode_onwire_size(const struct inode *obj); +extern char *build_inode_onwire(const struct inode *obj, char *area); +extern char *extract_obj_key_id_from_onwire(char *area, obj_key_id * key_id); + +extern int build_inode_key_id(const struct inode *obj, obj_key_id * id); +extern int extract_key_from_id(const obj_key_id * id, reiser4_key * key); +extern int build_obj_key_id(const reiser4_key * key, obj_key_id * id); +extern oid_t extract_dir_id_from_key(const reiser4_key * de_key); +extern int build_de_id(const struct inode *dir, const struct qstr *name, de_id * id); +extern int build_de_id_by_key(const reiser4_key * entry_key, de_id * id); +extern int extract_key_from_de_id(const oid_t locality, const de_id * id, reiser4_key * key); +extern cmp_t de_id_cmp(const de_id * id1, const de_id * id2); +extern cmp_t de_id_key_cmp(const de_id * id, const reiser4_key * key); + +extern int build_readdir_key_common(struct file *dir, reiser4_key * result); +extern void build_entry_key_common(const struct inode *dir, const struct qstr *name, reiser4_key * result); +extern void build_entry_key_stable_entry(const struct inode *dir, const struct qstr *name, reiser4_key * result); +extern int is_dot_key(const reiser4_key * key); +extern reiser4_key *build_sd_key(const struct inode *target, reiser4_key * result); + +extern int is_longname_key(const reiser4_key *key); +extern int is_longname(const char *name, int len); +extern char *extract_name_from_key(const reiser4_key *key, char *buf); + +/* __KASSIGN_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/kcond.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/kcond.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,283 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Kernel condition variables implementation. + + This is simplistic (90 LOC mod comments) condition variable + implementation. Condition variable is the most natural "synchronization + object" in some circumstances. + + Each CS text-book on multi-threading should discuss condition + variables. Also see man/info for: + + pthread_cond_init(3), + pthread_cond_destroy(3), + pthread_cond_signal(3), + pthread_cond_broadcast(3), + pthread_cond_wait(3), + pthread_cond_timedwait(3). + + See comments in kcond_wait(). + + TODO + + 1. Add an option (to kcond_init?) to make conditional variable async-safe + so that signals and broadcasts can be done from interrupt + handlers. Requires using spin_lock_irq in kcond_*(). + + 2. "Predicated" sleeps: add predicate function to the qlink and only wake + sleeper if predicate is true. Probably requires additional parameters to + the kcond_{signal,broadcast}() to supply cookie to the predicate. Standard + wait_queues already have this functionality. Idea is that if one has + object behaving like finite state automaton it is possible to use single + per-object condition variable to signal all state transitions. Predicates + allow waiters to select only transitions they are interested in without + going through context switch. + + 3. It is relatively easy to add support for sleeping on the several + condition variables at once. Does anybody need this? 
+ +*/ + +#include "debug.h" +#include "kcond.h" +#include "spin_macros.h" + +#include +#include + +static void kcond_timeout(unsigned long datum); +static void kcond_remove(kcond_t * cvar, kcond_queue_link_t * link); + +/* initialize condition variable. Initializer for global condition variables + is macro in kcond.h */ +reiser4_internal kcond_t * +kcond_init(kcond_t * cvar /* cvar to init */ ) +{ + assert("nikita-1868", cvar != NULL); + + memset(cvar, 0, sizeof *cvar); + spin_lock_init(&cvar->lock); + cvar->queue = NULL; + return cvar; +} + +/* Wait until condition variable is signalled. Call this with @lock locked. + If @signl is true, then sleep on condition variable will be interruptible + by signals. -EINTR is returned if sleep was interrupted by signal and 0 + otherwise. + + kcond_t is just a queue protected by spinlock. Whenever thread is going to + sleep on the kcond_t it does the following: + + (1) prepares "queue link" @qlink which is semaphore constructed locally on + the stack of the thread going to sleep. + + (2) takes @cvar spinlock + + (3) adds @qlink to the @cvar queue of waiters + + (4) releases @cvar spinlock + + (5) sleeps on semaphore constructed at step (1) + + When @cvar is signalled or broadcast, semaphores enqueued to the + @cvar queue will be upped and kcond_wait() will return. + + By use of local semaphore for each waiter we avoid races between going to + sleep and waking up---endemic plague of condition variables. + + For example, should kcond_broadcast() come in between steps (4) and (5) it + would call up() on semaphores already in a queue and hence, down() in the + step (5) would return immediately. 
+ +*/ +reiser4_internal int +kcond_wait(kcond_t * cvar /* cvar to wait for */ , + spinlock_t * lock /* lock to use */ , + int signl /* if 0, ignore signals during sleep */ ) +{ + kcond_queue_link_t qlink; + int result; + + assert("nikita-1869", cvar != NULL); + assert("nikita-1870", lock != NULL); + assert("nikita-1871", check_spin_is_locked(lock)); + + spin_lock(&cvar->lock); + qlink.next = cvar->queue; + cvar->queue = &qlink; + init_MUTEX_LOCKED(&qlink.wait); + spin_unlock(&cvar->lock); + spin_unlock(lock); + + result = 0; + if (signl) + result = down_interruptible(&qlink.wait); + else + down(&qlink.wait); + spin_lock(&cvar->lock); + if (result != 0) { + /* if thread was woken up by signal, @qlink is probably still + in the queue, remove it. */ + kcond_remove(cvar, &qlink); + } + /* if it wasn't woken up by signal, spinlock here is still useful, + because we want to wait until kcond_{broadcast|signal} + finishes. Otherwise down() could interleave with up() in such a way + that kcond_wait() would exit and up() would see garbage in a + semaphore. 
+ */ + spin_unlock(&cvar->lock); + spin_lock(lock); + return result; +} + +typedef struct { + kcond_queue_link_t *link; + int *woken_up; +} kcond_timer_arg; + +/* like kcond_wait(), but with timeout */ +reiser4_internal int +kcond_timedwait(kcond_t * cvar /* cvar to wait for */ , + spinlock_t * lock /* lock to use */ , + signed long timeout /* timeout in jiffies */ , + int signl /* if 0, ignore signals during sleep */ ) +{ + struct timer_list timer; + kcond_queue_link_t qlink; + int result; + int woken_up; + kcond_timer_arg targ; + + assert("nikita-2437", cvar != NULL); + assert("nikita-2438", lock != NULL); + assert("nikita-2439", check_spin_is_locked(lock)); + + spin_lock(&cvar->lock); + qlink.next = cvar->queue; + cvar->queue = &qlink; + init_MUTEX_LOCKED(&qlink.wait); + spin_unlock(&cvar->lock); + spin_unlock(lock); + + assert("nikita-3011", schedulable()); + + /* prepare timer */ + init_timer(&timer); + timer.expires = jiffies + timeout; + timer.data = (unsigned long) &targ; + timer.function = kcond_timeout; + + woken_up = 0; + + targ.link = &qlink; + targ.woken_up = &woken_up; + + /* ... and set it up */ + add_timer(&timer); + + result = 0; + if (signl) + result = down_interruptible(&qlink.wait); + else + down(&qlink.wait); + + /* cancel timer */ + del_timer_sync(&timer); + + if (woken_up) + result = -ETIMEDOUT; + + spin_lock(&cvar->lock); + if (result != 0) { + /* if thread was woken up by signal, or due to time-out, + @qlink is probably still in the queue, remove it. */ + kcond_remove(cvar, &qlink); + } + spin_unlock(&cvar->lock); + + spin_lock(lock); + return result; +} + +/* Signal condition variable: wake up one waiter, if any. 
*/ +reiser4_internal int +kcond_signal(kcond_t * cvar /* cvar to signal */ ) +{ + kcond_queue_link_t *queue_head; + + assert("nikita-1872", cvar != NULL); + + spin_lock(&cvar->lock); + + queue_head = cvar->queue; + if (queue_head != NULL) { + cvar->queue = queue_head->next; + up(&queue_head->wait); + } + spin_unlock(&cvar->lock); + return 1; +} + +/* Broadcast condition variable: wake up all waiters. */ +reiser4_internal int +kcond_broadcast(kcond_t * cvar /* cvar to broadcast */ ) +{ + kcond_queue_link_t *queue_head; + + assert("nikita-1875", cvar != NULL); + + spin_lock(&cvar->lock); + + for (queue_head = cvar->queue; queue_head != NULL; queue_head = queue_head->next) + up(&queue_head->wait); + + cvar->queue = NULL; + spin_unlock(&cvar->lock); + return 1; +} + +/* timer expiration function used by kcond_timedwait */ +static void +kcond_timeout(unsigned long datum) +{ + kcond_timer_arg *arg; + + arg = (kcond_timer_arg *) datum; + *arg->woken_up = 1; + up(&arg->link->wait); +} + +/* helper function to remove @link from @cvar queue */ +static void +kcond_remove(kcond_t * cvar /* cvar to operate on */ , + kcond_queue_link_t * link /* link to remove */ ) +{ + kcond_queue_link_t *scan; + kcond_queue_link_t *prev; + + assert("nikita-2440", cvar != NULL); + assert("nikita-2441", check_spin_is_locked(&cvar->lock)); + + for (scan = cvar->queue, prev = NULL; scan != NULL; prev = scan, scan = scan->next) { + if (scan == link) { + if (prev == NULL) + cvar->queue = scan->next; + else + prev->next = scan->next; + break; + } + } +} + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/kcond.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/kcond.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,56 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Declaration of kernel condition variables and API. See kcond.c for more + info. */ + +#ifndef __KCOND_H__ +#define __KCOND_H__ + +#include +#include + +typedef struct kcond_queue_link_s kcond_queue_link_t; + +/* condition variable */ +typedef struct kcond_s { + /* lock protecting integrity of @queue */ + spinlock_t lock; + /* queue of waiters */ + kcond_queue_link_t *queue; +} kcond_t; + +/* queue link added to the kcond->queue by each waiter */ +struct kcond_queue_link_s { + /* next link in the queue */ + kcond_queue_link_t *next; + /* semaphore to signal on wake up */ + struct semaphore wait; +}; + +extern kcond_t *kcond_init(kcond_t * cvar); + +extern int kcond_wait(kcond_t * cvar, spinlock_t * lock, int signl); +extern int kcond_timedwait(kcond_t * cvar, spinlock_t * lock, signed long timeout, int signl); +extern int kcond_signal(kcond_t * cvar); +extern int kcond_broadcast(kcond_t * cvar); + +extern void kcond_print(kcond_t * cvar); + +#define KCOND_STATIC_INIT \ + { \ + .lock = SPIN_LOCK_UNLOCKED, \ + .queue = NULL \ + } + +/* __KCOND_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/Kconfig --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/Kconfig Mon Jun 13 15:05:23 2005 @@ -0,0 +1,90 @@ +config REISER4_FS + tristate "Reiser4 (EXPERIMENTAL very fast general purpose filesystem)" + depends on EXPERIMENTAL && !4KSTACKS + help + Reiser4 is more than twice as fast for both reads and writes as + ReiserFS V3, and is the fastest Linux filesystem, by a lot, + for typical IO intensive workloads. [It is slow at fsync + intensive workloads as it is not yet optimized for fsync + (sponsors are welcome for that work), and it is instead + optimized for atomicity, see below.] Benchmarks that define + "a lot" are at http://www.namesys.com/benchmarks.html. + + It is the storage layer of what will become a general purpose naming + system --- like what Microsoft wants WinFS to be except designed with a + clean new semantic layer rather than being SQL based like WinFS. + For details read http://www.namesys.com/whitepaper.html + + It performs all filesystem operations as atomic transactions, which + means that it either performs a write, or it does not, and in the + event of a crash it does not partially perform it or corrupt it. + Many applications that currently use fsync don't need to if they use + reiser4, and that means a lot for performance. An API for performing + multiple file system operations as one high performance atomic write + is almost finished. + + It stores files in dancing trees, which are like balanced trees but + faster. It packs small files together so that they share blocks + without wasting space. This means you can use it to store really + small files. It also means that it saves you disk space. It avoids + hassling you with anachronisms like having a maximum number of + inodes, and wasting space if you use less than that number. 
+ + It can handle really large directories, because its search + algorithms are logarithmic in size, not linear. With Reiser4 you + should use subdirectories because they help YOU, not because they + help your filesystem's performance, or because your filesystem won't + be able to shrink a directory once you have let it grow. For squid + and similar applications, everything in one directory should perform + better. + + It has a plugin-based infrastructure, which means that you can easily + invent new kinds of files, and so can other people, so it will evolve + rapidly. + + We will be adding a variety of security features to it that DARPA has + funded us to write. + + "reiser4" is a distinct filesystem mount type from "reiserfs" (V3), + which means that "reiserfs" filesystems will be unaffected by any + reiser4 bugs. + + ReiserFS V3 is the stablest Linux filesystem, and V4 is the fastest. + + In regards to claims by ext2 that they are the de facto + standard Linux filesystem, the most polite thing to say is that + many persons disagree, and it is interesting that those persons + seem to include the distros that are growing in market share. + See http://www.namesys.com/benchmarks.html for why many disagree. + + If you'd like to upgrade from reiserfs to reiser4, use tar to a + temporary disk, maybe using NFS/ssh/SFS to get to that disk, or ask + your favorite distro to sponsor writing a conversion program. + + Sponsored by the Defense Advanced Research Projects Agency (DARPA) + of the United States Government. DARPA does not endorse this + project, it merely sponsors it. + See http://www.darpa.mil/ato/programs/chats.htm + + If you would like to learn about our plans to add + military grade security to reiser4, please read + http://www.namesys.com/blackbox_security.html. 
+ + To learn more about reiser4, go to http://www.namesys.com + +config REISER4_DEBUG + bool "Enable reiser4 debug mode" + depends on REISER4_FS + help + Don't use this unless you are a developer debugging reiser4. If + using a kernel made by a distro that thinks they are our competitor + (sigh) rather than made by Linus, always check each release to make + sure they have not turned this on to make us look slow as was done + once in the past. This checks everything imaginable while reiser4 + runs. + + When adding features to reiser4 you should set this, and then + extensively test the code, and then send to us and we will test it + again. Include a description of what you did to test it. All + reiser4 code must be tested, reviewed, and signed off on by two + persons before it will be accepted into a stable kernel by Hans. diff -puN /dev/null fs/reiser4/key.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/key.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,168 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Key manipulations. */ + +#include "debug.h" +#include "key.h" +#include "super.h" +#include "reiser4.h" + +#include /* for __u?? */ + +/* Minimal possible key: all components are zero. It is presumed that this is + independent of key scheme. */ +static const reiser4_key MINIMAL_KEY = { + .el = { + {0ull}, + ON_LARGE_KEY({0ull},) + {0ull}, + {0ull} + } +}; + +/* Maximal possible key: all components are ~0. It is presumed that this is + independent of key scheme. */ +static const reiser4_key MAXIMAL_KEY = { + .el = { + {~0ull}, + ON_LARGE_KEY({~0ull},) + {~0ull}, + {~0ull} + } +}; + +/* Initialise key. */ +reiser4_internal void +reiser4_key_init(reiser4_key * key /* key to init */ ) +{ + assert("nikita-1169", key != NULL); + memset(key, 0, sizeof *key); +} + +/* minimal possible key in the tree. Return pointer to the static storage. 
*/ +reiser4_internal const reiser4_key * +min_key(void) +{ + return &MINIMAL_KEY; +} + +/* maximum possible key in the tree. Return pointer to the static storage. */ +reiser4_internal const reiser4_key * +max_key(void) +{ + return &MAXIMAL_KEY; +} + +#if REISER4_DEBUG +/* debugging aid: print symbolic name of key type */ +static const char * +type_name(unsigned int key_type /* key type */ ) +{ + switch (key_type) { + case KEY_FILE_NAME_MINOR: + return "file name"; + case KEY_SD_MINOR: + return "stat data"; + case KEY_ATTR_NAME_MINOR: + return "attr name"; + case KEY_ATTR_BODY_MINOR: + return "attr body"; + case KEY_BODY_MINOR: + return "file body"; + default: + return "unknown"; + } +} + +extern char *unpack_string(__u64 value, char *buf); + +/* debugging aid: print human readable information about key */ +reiser4_internal void +print_key(const char *prefix /* prefix to print */ , + const reiser4_key * key /* key to print */ ) +{ + /* turn bold on */ + /* printf ("\033[1m"); */ + if (key == NULL) + printk("%s: null key\n", prefix); + else { + if (REISER4_LARGE_KEY) + printk("%s: (%Lx:%x:%Lx:%Lx:%Lx:%Lx)", prefix, + get_key_locality(key), + get_key_type(key), + get_key_ordering(key), + get_key_band(key), + get_key_objectid(key), + get_key_offset(key)); + else + printk("%s: (%Lx:%x:%Lx:%Lx:%Lx)", prefix, + get_key_locality(key), + get_key_type(key), + get_key_band(key), + get_key_objectid(key), + get_key_offset(key)); + /* + * if this is a key of directory entry, try to decode part of + * a name stored in the key, and output it. + */ + if (get_key_type(key) == KEY_FILE_NAME_MINOR) { + char buf[DE_NAME_BUF_LEN]; + char *c; + + c = buf; + c = unpack_string(get_key_ordering(key), c); + unpack_string(get_key_fulloid(key), c); + printk("[%s", buf); + if (is_longname_key(key)) + /* + * only part of the name is stored in the key. + */ + printk("...]\n"); + else { + /* + * whole name is stored in the key. 
+ */ + unpack_string(get_key_offset(key), buf); + printk("%s]\n", buf); + } + } else { + printk("[%s]\n", type_name(get_key_type(key))); + } + } + /* turn bold off */ + /* printf ("\033[m\017"); */ +} + +#endif + +/* like print_key() but outputs key representation into @buffer. */ +reiser4_internal int +sprintf_key(char *buffer /* buffer to print key into */ , + const reiser4_key * key /* key to print */ ) +{ + if (REISER4_LARGE_KEY) + return sprintf(buffer, "(%Lx:%x:%Lx:%Lx:%Lx:%Lx)", + (unsigned long long)get_key_locality(key), + get_key_type(key), + (unsigned long long)get_key_ordering(key), + (unsigned long long)get_key_band(key), + (unsigned long long)get_key_objectid(key), + (unsigned long long)get_key_offset(key)); + else + return sprintf(buffer, "(%Lx:%x:%Lx:%Lx:%Lx)", + (unsigned long long)get_key_locality(key), + get_key_type(key), + (unsigned long long)get_key_band(key), + (unsigned long long)get_key_objectid(key), + (unsigned long long)get_key_offset(key)); +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/key.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/key.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,389 @@ +/* Copyright 2000, 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Declarations of key-related data-structures and operations on keys. */ + +#if !defined( __REISER4_KEY_H__ ) +#define __REISER4_KEY_H__ + +#include "dformat.h" +#include "forward.h" +#include "debug.h" + +#include /* for __u?? */ + +/* Operations on keys in reiser4 tree */ + +/* No access to any of these fields shall be done except via a + wrapping macro/function, and that wrapping macro/function shall + convert to little endian order. Compare keys will consider cpu byte order. 
*/ + +/* A storage layer implementation difference between a regular unix file body and its attributes is in the typedef below + which causes all of the attributes of a file to be near in key to all of the other attributes for all of the files + within that directory, and not near to the file itself. It is interesting to consider whether this is the wrong + approach, and whether there should be no difference at all. For current usage patterns this choice is probably the + right one. */ + +/* possible values for minor packing locality (4 bits required) */ +typedef enum { + /* file name */ + KEY_FILE_NAME_MINOR = 0, + /* stat-data */ + KEY_SD_MINOR = 1, + /* file attribute name */ + KEY_ATTR_NAME_MINOR = 2, + /* file attribute value */ + KEY_ATTR_BODY_MINOR = 3, + /* file body (tail or extent) */ + KEY_BODY_MINOR = 4, +} key_minor_locality; + +/* everything stored in the tree has a unique key, which means that the tree is (logically) fully ordered by key. + Physical order is determined by dynamic heuristics that attempt to reflect key order when allocating available space, + and by the repacker. It is stylistically better to put aggregation information into the key. Thus, if you want to + segregate extents from tails, it is better to give them distinct minor packing localities rather than changing + block_alloc.c to check the node type when deciding where to allocate the node. + + The need to randomly displace new directories and large files disturbs this symmetry unfortunately. However, it + should be noted that this is a need that is not clearly established given the existence of a repacker. Also, in our + current implementation tails have a different minor packing locality from extents, and no files have both extents and + tails, so maybe symmetry can be had without performance cost after all. Symmetry is what we ship for now.... 
+*/ + +/* Arbitrary major packing localities can be assigned to objects using + the reiser4(filenameA/..packing<=some_number) system call. + + In reiser4, the creat() syscall creates a directory + + whose default flow (that which is referred to if the directory is + read as a file) is the traditional unix file body. + + whose directory plugin is the 'filedir' + + whose major packing locality is that of the parent of the object created. + + The static_stat item is a particular commonly used directory + compression (the one for normal unix files). + + The filedir plugin checks to see if the static_stat item exists. + There is a unique key for static_stat. If yes, then it uses the + static_stat item for all of the values that it contains. The + static_stat item contains a flag for each stat it contains which + indicates whether one should look outside the static_stat item for its + contents. +*/ + +/* offset of fields in reiser4_key. Value of each element of this enum + is index within key (thought as array of __u64's) where this field + is. */ +typedef enum { + /* major "locale", aka dirid. Sits in 1st element */ + KEY_LOCALITY_INDEX = 0, + /* minor "locale", aka item type. Sits in 1st element */ + KEY_TYPE_INDEX = 0, + ON_LARGE_KEY(KEY_ORDERING_INDEX,) + /* "object band". Sits in 2nd element */ + KEY_BAND_INDEX, + /* objectid. Sits in 2nd element */ + KEY_OBJECTID_INDEX = KEY_BAND_INDEX, + /* full objectid. Sits in 2nd element */ + KEY_FULLOID_INDEX = KEY_BAND_INDEX, + /* Offset. Sits in 3rd element */ + KEY_OFFSET_INDEX, + /* Name hash. Sits in 3rd element */ + KEY_HASH_INDEX = KEY_OFFSET_INDEX, + KEY_CACHELINE_END = KEY_OFFSET_INDEX, + KEY_LAST_INDEX +} reiser4_key_field_index; + +/* key in reiser4 internal "balanced" tree. It is just array of three + 64bit integers in disk byte order (little-endian by default). This + array is actually indexed by reiser4_key_field. Each __u64 within + this array is called "element". 
Logical key components encoded within + elements are called "fields". + + We declare this as union with second component dummy to suppress + inconvenient array<->pointer casts implied in C. */ +union reiser4_key { + d64 el[KEY_LAST_INDEX]; + int pad; +}; + +/* bitmasks showing where within reiser4_key particular field is + stored. */ +typedef enum { + /* major locality occupies higher 60 bits of the first element */ + KEY_LOCALITY_MASK = 0xfffffffffffffff0ull, + /* minor locality occupies lower 4 bits of the first element */ + KEY_TYPE_MASK = 0xfull, + /* controversial band occupies higher 4 bits of the 2nd element */ + KEY_BAND_MASK = 0xf000000000000000ull, + /* objectid occupies lower 60 bits of the 2nd element */ + KEY_OBJECTID_MASK = 0x0fffffffffffffffull, + /* full 64bit objectid*/ + KEY_FULLOID_MASK = 0xffffffffffffffffull, + /* offset is just the 3rd element itself */ + KEY_OFFSET_MASK = 0xffffffffffffffffull, + /* ordering is whole second element */ + KEY_ORDERING_MASK = 0xffffffffffffffffull, +} reiser4_key_field_mask; + +/* how many bits a field is shifted to the left within its key element */ +typedef enum { + KEY_LOCALITY_SHIFT = 4, + KEY_TYPE_SHIFT = 0, + KEY_BAND_SHIFT = 60, + KEY_OBJECTID_SHIFT = 0, + KEY_FULLOID_SHIFT = 0, + KEY_OFFSET_SHIFT = 0, + KEY_ORDERING_SHIFT = 0, +} reiser4_key_field_shift; + +static inline __u64 +get_key_el(const reiser4_key * key, reiser4_key_field_index off) +{ + assert("nikita-753", key != NULL); + assert("nikita-754", off < KEY_LAST_INDEX); + return d64tocpu(&key->el[off]); +} + +static inline void +set_key_el(reiser4_key * key, reiser4_key_field_index off, __u64 value) +{ + assert("nikita-755", key != NULL); + assert("nikita-756", off < KEY_LAST_INDEX); + cputod64(value, &key->el[off]); +} + +/* macro to define getter get_key_L() and setter set_key_L() for field U with type T */ +#define DEFINE_KEY_FIELD( L, U, T ) \ +static inline T get_key_ ## L ( const reiser4_key *key ) \ +{ \ + assert( "nikita-750", key != NULL ); \ + return ( 
T ) ( get_key_el( key, KEY_ ## U ## _INDEX ) & \ + KEY_ ## U ## _MASK ) >> KEY_ ## U ## _SHIFT; \ +} \ + \ +static inline void set_key_ ## L ( reiser4_key *key, T loc ) \ +{ \ + __u64 el; \ + \ + assert( "nikita-752", key != NULL ); \ + \ + el = get_key_el( key, KEY_ ## U ## _INDEX ); \ + /* clear field bits in the key */ \ + el &= ~KEY_ ## U ## _MASK; \ + /* actually it should be \ + \ + el |= ( loc << KEY_ ## U ## _SHIFT ) & KEY_ ## U ## _MASK; \ + \ + but we trust user to never pass values that wouldn't fit \ + into field. Clearing extra bits is one operation, but this \ + function is time-critical. \ + But check this in assertion. */ \ + assert( "nikita-759", ( ( loc << KEY_ ## U ## _SHIFT ) & \ + ~KEY_ ## U ## _MASK ) == 0 ); \ + el |= ( loc << KEY_ ## U ## _SHIFT ); \ + set_key_el( key, KEY_ ## U ## _INDEX, el ); \ +} + +typedef __u64 oid_t; + +/* define get_key_locality(), set_key_locality() */ +DEFINE_KEY_FIELD(locality, LOCALITY, oid_t); +/* define get_key_type(), set_key_type() */ +DEFINE_KEY_FIELD(type, TYPE, key_minor_locality); +/* define get_key_band(), set_key_band() */ +DEFINE_KEY_FIELD(band, BAND, __u64); +/* define get_key_objectid(), set_key_objectid() */ +DEFINE_KEY_FIELD(objectid, OBJECTID, oid_t); +/* define get_key_fulloid(), set_key_fulloid() */ +DEFINE_KEY_FIELD(fulloid, FULLOID, oid_t); +/* define get_key_offset(), set_key_offset() */ +DEFINE_KEY_FIELD(offset, OFFSET, __u64); +#if (REISER4_LARGE_KEY) +/* define get_key_ordering(), set_key_ordering() */ +DEFINE_KEY_FIELD(ordering, ORDERING, __u64); +#else +static inline __u64 get_key_ordering(const reiser4_key *key) +{ + return 0; +} + +static inline void set_key_ordering(reiser4_key *key, __u64 val) +{ +} +#endif + +/* key comparison result */ +typedef enum { LESS_THAN = -1, /* if first key is less than second */ + EQUAL_TO = 0, /* if keys are equal */ + GREATER_THAN = +1 /* if first key is greater than second */ +} cmp_t; + +void reiser4_key_init(reiser4_key * key); + +/* minimal possible 
key in the tree. Return pointer to the static storage. */ +extern const reiser4_key *min_key(void); +extern const reiser4_key *max_key(void); + +/* helper macro for keycmp() */ +#define KEY_DIFF(k1, k2, field) \ +({ \ + typeof (get_key_ ## field (k1)) f1; \ + typeof (get_key_ ## field (k2)) f2; \ + \ + f1 = get_key_ ## field (k1); \ + f2 = get_key_ ## field (k2); \ + \ + (f1 < f2) ? LESS_THAN : ((f1 == f2) ? EQUAL_TO : GREATER_THAN); \ +}) + +/* helper macro for keycmp() */ +#define KEY_DIFF_EL(k1, k2, off) \ +({ \ + __u64 e1; \ + __u64 e2; \ + \ + e1 = get_key_el(k1, off); \ + e2 = get_key_el(k2, off); \ + \ + (e1 < e2) ? LESS_THAN : ((e1 == e2) ? EQUAL_TO : GREATER_THAN); \ +}) + +/* compare `k1' and `k2'. This function is a heart of "key allocation + policy". All you need to implement new policy is to add yet another + clause here. */ +static inline cmp_t +keycmp(const reiser4_key * k1 /* first key to compare */ , + const reiser4_key * k2 /* second key to compare */ ) +{ + cmp_t result; + + /* + * This function is the heart of reiser4 tree-routines. Key comparison + * is among most heavily used operations in the file system. + */ + + assert("nikita-439", k1 != NULL); + assert("nikita-440", k2 != NULL); + + /* there is no actual branch here: condition is compile time constant + * and constant folding and propagation ensures that only one branch + * is actually compiled in. */ + + if (REISER4_PLANA_KEY_ALLOCATION) { + /* if physical order of fields in a key is identical + with logical order, we can implement key comparison + as three 64bit comparisons. */ + /* logical order of fields in plan-a: + locality->type->objectid->offset. 
*/ + /* compare locality and type at once */ + result = KEY_DIFF_EL(k1, k2, 0); + if (result == EQUAL_TO) { + /* compare objectid (and band if it's there) */ + result = KEY_DIFF_EL(k1, k2, 1); + /* compare offset */ + if (result == EQUAL_TO) { + result = KEY_DIFF_EL(k1, k2, 2); + if (REISER4_LARGE_KEY && result == EQUAL_TO) { + result = KEY_DIFF_EL(k1, k2, 3); + } + } + } + } else if (REISER4_3_5_KEY_ALLOCATION) { + result = KEY_DIFF(k1, k2, locality); + if (result == EQUAL_TO) { + result = KEY_DIFF(k1, k2, objectid); + if (result == EQUAL_TO) { + result = KEY_DIFF(k1, k2, type); + if (result == EQUAL_TO) + result = KEY_DIFF(k1, k2, offset); + } + } + } else + impossible("nikita-441", "Unknown key allocation scheme!"); + return result; +} + +/* true if @k1 equals @k2 */ +static inline int +keyeq(const reiser4_key * k1 /* first key to compare */ , + const reiser4_key * k2 /* second key to compare */ ) +{ + assert("nikita-1879", k1 != NULL); + assert("nikita-1880", k2 != NULL); + return !memcmp(k1, k2, sizeof *k1); +} + +/* true if @k1 is less than @k2 */ +static inline int +keylt(const reiser4_key * k1 /* first key to compare */ , + const reiser4_key * k2 /* second key to compare */ ) +{ + assert("nikita-1952", k1 != NULL); + assert("nikita-1953", k2 != NULL); + return keycmp(k1, k2) == LESS_THAN; +} + +/* true if @k1 is less than or equal to @k2 */ +static inline int +keyle(const reiser4_key * k1 /* first key to compare */ , + const reiser4_key * k2 /* second key to compare */ ) +{ + assert("nikita-1954", k1 != NULL); + assert("nikita-1955", k2 != NULL); + return keycmp(k1, k2) != GREATER_THAN; +} + +/* true if @k1 is greater than @k2 */ +static inline int +keygt(const reiser4_key * k1 /* first key to compare */ , + const reiser4_key * k2 /* second key to compare */ ) +{ + assert("nikita-1959", k1 != NULL); + assert("nikita-1960", k2 != NULL); + return keycmp(k1, k2) == GREATER_THAN; +} + +/* true if @k1 is greater than or equal to @k2 */ +static inline int 
+keyge(const reiser4_key * k1 /* first key to compare */ , + const reiser4_key * k2 /* second key to compare */ ) +{ + assert("nikita-1956", k1 != NULL); + assert("nikita-1957", k2 != NULL); /* October 4: sputnik launched + * November 3: Laika */ + return keycmp(k1, k2) != LESS_THAN; +} + +static inline void +prefetchkey(reiser4_key *key) +{ + prefetch(key); + prefetch(&key->el[KEY_CACHELINE_END]); +} + +/* (%Lx:%x:%Lx:%Lx:%Lx:%Lx) = + 1 + 16 + 1 + 1 + 1 + 1 + 1 + 16 + 1 + 16 + 1 + 16 + 1 */ +/* size of a buffer suitable to hold human readable key representation */ +#define KEY_BUF_LEN (80) + +extern int sprintf_key(char *buffer, const reiser4_key * key); +#if REISER4_DEBUG +extern void print_key(const char *prefix, const reiser4_key * key); +#else +#define print_key(p,k) noop +#endif + +/* __REISER4_KEY_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/ktxnmgrd.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/ktxnmgrd.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,274 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ +/* Transaction manager daemon. */ + +/* + * ktxnmgrd is a kernel daemon responsible for committing transactions. It is + * needed/important for the following reasons: + * + * 1. in reiser4 atom is not committed immediately when last transaction + * handle closes, unless atom is either too old or too large (see + * atom_should_commit()). This is done to avoid committing too + * frequently. + * + * 2. sometimes we don't want to commit atom when closing last transaction + * handle even if it is old and fat enough. For example, because we are at + * this point under directory semaphore, and committing would stall all + * accesses to this directory. + * + * ktxnmgrd bides its time sleeping on condition variable. 
When it awakes + * either due to (tunable) timeout or because it was explicitly woken up by + * call to ktxnmgrd_kick(), it scans the list of all atoms and commits the + * eligible ones. + * + */ + +#include "debug.h" +#include "kcond.h" +#include "txnmgr.h" +#include "tree.h" +#include "ktxnmgrd.h" +#include "super.h" +#include "reiser4.h" + +#include /* for struct task_struct */ +#include +#include +#include + +static int scan_mgr(txn_mgr * mgr); + +reiser4_internal int +init_ktxnmgrd_context(txn_mgr * mgr) +{ + ktxnmgrd_context * ctx; + + assert ("zam-1013", mgr != NULL); + assert ("zam-1014", mgr->daemon == NULL); + + ctx = reiser4_kmalloc(sizeof(ktxnmgrd_context), GFP_KERNEL); + if (ctx == NULL) + return RETERR(-ENOMEM); + + assert("nikita-2442", ctx != NULL); + + memset(ctx, 0, sizeof *ctx); + init_completion(&ctx->finish); + kcond_init(&ctx->startup); + kcond_init(&ctx->wait); + spin_lock_init(&ctx->guard); + ctx->timeout = REISER4_TXNMGR_TIMEOUT; + mgr->daemon = ctx; + return 0; +} + +/* change current->comm so that ps, top, and friends will see changed + state. This serves no useful purpose whatsoever, but also costs + nothing. Maybe it will make a lonely system administrator feel less alone + at 3 A.M. +*/ +#define set_comm( state ) \ + snprintf( current -> comm, sizeof( current -> comm ), \ + "%s:%s:%s", __FUNCTION__, (super)->s_id, ( state ) ) + +/* The background transaction manager daemon, started as a kernel thread + during reiser4 initialization. */ +static int +ktxnmgrd(void *arg) +{ + struct task_struct *me; + struct super_block * super; + ktxnmgrd_context *ctx; + txn_mgr * mgr; + + /* standard kernel thread prologue */ + me = current; + /* reparent_to_init() is done by daemonize() */ + daemonize(__FUNCTION__); + + /* block all signals */ + spin_lock_irq(&me->sighand->siglock); + siginitsetinv(&me->blocked, 0); + recalc_sigpending(); + spin_unlock_irq(&me->sighand->siglock); + + /* do_fork() just copies task_struct into the new + thread. 
->fs_context shouldn't be copied of course. This shouldn't + be a problem for the rest of the code though. + */ + me->journal_info = NULL; + + mgr = arg; + ctx = mgr->daemon; + spin_lock(&ctx->guard); + ctx->tsk = me; + super = container_of(mgr, reiser4_super_info_data, tmgr)->tree.super; + kcond_broadcast(&ctx->startup); + while (1) { + int result; + + /* software suspend support. */ + if (me->flags & PF_FREEZE) { + spin_unlock(&ctx->guard); + refrigerator(PF_FREEZE/*PF_IOTHREAD*/); + spin_lock(&ctx->guard); + } + + set_comm("wait"); + /* wait for @ctx -> timeout or explicit wake up. + + kcond_wait() is called with last argument 1 enabling wakeup + by signals so that this thread is not counted in + load-average. This doesn't require any special handling, + because all signals were blocked. + */ + result = kcond_timedwait(&ctx->wait, + &ctx->guard, ctx->timeout, 1); + + if (result != -ETIMEDOUT && result != -EINTR && result != 0) { + /* some other error */ + warning("nikita-2443", "Error: %i", result); + continue; + } + + /* we are asked to exit */ + if (ctx->done) + break; + + set_comm(result ? "timed" : "run"); + + /* wait timed out or ktxnmgrd was woken up by explicit request + to commit something. Scan list of atoms in txnmgr and look + for too old atoms. + */ + do { + ctx->rescan = 0; + scan_mgr(mgr); + spin_lock(&ctx->guard); + if (ctx->rescan) { + /* the list could be modified while ctx + spinlock was released, we have to + repeat scanning from the + beginning */ + break; + } + } while (ctx->rescan); + } + + spin_unlock(&ctx->guard); + + complete_and_exit(&ctx->finish, 0); + /* not reached. 
*/ + return 0; +} + +#undef set_comm + +reiser4_internal void +ktxnmgrd_kick(txn_mgr * mgr) +{ + assert("nikita-3234", mgr != NULL); + assert("nikita-3235", mgr->daemon != NULL); + kcond_signal(&mgr->daemon->wait); +} + +reiser4_internal int +is_current_ktxnmgrd(void) +{ + return (get_current_super_private()->tmgr.daemon->tsk == current); +} + +/* scan one transaction manager for old atoms; should be called with ktxnmgrd + * spinlock, releases this spin lock at exit */ +static int +scan_mgr(txn_mgr * mgr) +{ + int ret; + reiser4_context ctx; + reiser4_tree *tree; + + assert("nikita-2454", mgr != NULL); + + /* NOTE-NIKITA this only works for atoms embedded into super blocks. */ + tree = &container_of(mgr, reiser4_super_info_data, tmgr)->tree; + assert("nikita-2455", tree != NULL); + assert("nikita-2456", tree->super != NULL); + + init_context(&ctx, tree->super); + + ret = commit_some_atoms(mgr); + + reiser4_exit_context(&ctx); + return ret; +} + + +reiser4_internal int start_ktxnmgrd (txn_mgr * mgr) +{ + ktxnmgrd_context * ctx; + + assert("nikita-2448", mgr != NULL); + assert("zam-1015", mgr->daemon != NULL); + + ctx = mgr->daemon; + + spin_lock(&ctx->guard); + + ctx->rescan = 1; + ctx->done = 0; + + spin_unlock(&ctx->guard); + + kernel_thread(ktxnmgrd, mgr, CLONE_KERNEL); + + spin_lock(&ctx->guard); + + /* daemon thread is not yet initialized */ + if (ctx->tsk == NULL) + /* wait until initialization completes */ + kcond_wait(&ctx->startup, &ctx->guard, 0); + + assert("nikita-2452", ctx->tsk != NULL); + + spin_unlock(&ctx->guard); + return 0; +} + +reiser4_internal void stop_ktxnmgrd (txn_mgr * mgr) +{ + ktxnmgrd_context * ctx; + + assert ("zam-1016", mgr != NULL); + assert ("zam-1017", mgr->daemon != NULL); + + ctx = mgr->daemon; + + spin_lock(&ctx->guard); + ctx->tsk = NULL; + ctx->done = 1; + spin_unlock(&ctx->guard); + + kcond_signal(&ctx->wait); + + /* wait until daemon finishes */ + wait_for_completion(&ctx->finish); +} + +reiser4_internal void 
+done_ktxnmgrd_context (txn_mgr * mgr) +{ + assert ("zam-1011", mgr != NULL); + assert ("zam-1012", mgr->daemon != NULL); + + reiser4_kfree(mgr->daemon); + mgr->daemon = NULL; +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/ktxnmgrd.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/ktxnmgrd.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,63 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Transaction manager daemon. See ktxnmgrd.c for comments. */ + +#ifndef __KTXNMGRD_H__ +#define __KTXNMGRD_H__ + +#include "kcond.h" +#include "txnmgr.h" +#include "spin_macros.h" + +#include +#include +#include +#include +#include /* for struct task_struct */ + +/* in this structure all data necessary to start up, shut down and communicate + * with ktxnmgrd are kept. */ +struct ktxnmgrd_context { + /* conditional variable used to synchronize start up of ktxnmgrd */ + kcond_t startup; + /* completion used to synchronize shut down of ktxnmgrd */ + struct completion finish; + /* condition variable on which ktxnmgrd sleeps */ + kcond_t wait; + /* spin lock protecting all fields of this structure */ + spinlock_t guard; + /* timeout of sleeping on ->wait */ + signed long timeout; + /* kernel thread running ktxnmgrd */ + struct task_struct *tsk; + /* list of all file systems served by this ktxnmgrd */ + txn_mgrs_list_head queue; + /* is ktxnmgrd being shut down? */ + int done:1; + /* should ktxnmgrd repeat scanning of atoms? */ + int rescan:1; +}; + +extern int init_ktxnmgrd_context(txn_mgr *); +extern void done_ktxnmgrd_context(txn_mgr *); + +extern int start_ktxnmgrd(txn_mgr *); +extern void stop_ktxnmgrd(txn_mgr *); + +extern void ktxnmgrd_kick(txn_mgr * mgr); + +extern int is_current_ktxnmgrd(void); + +/* __KTXNMGRD_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/lib.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/lib.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,75 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#if !defined (__FS_REISER4_LIB_H__) +#define __FS_REISER4_LIB_H__ + +/* These 2 functions of 64 bit numbers division were taken from + include/sound/pcm.h */ + +/* Helper function for 64 bits numbers division. */ +static inline void +divl(__u32 high, __u32 low, __u32 div, __u32 * q, __u32 * r) +{ + __u64 n = (__u64) high << 32 | low; + __u64 d = (__u64) div << 31; + __u32 q1 = 0; + int c = 32; + + while (n > 0xffffffffU) { + q1 <<= 1; + if (n >= d) { + n -= d; + q1 |= 1; + } + d >>= 1; + c--; + } + q1 <<= c; + if (n) { + low = n; + *q = q1 | (low / div); + if (r) + *r = low % div; + } else { + if (r) + *r = 0; + *q = q1; + } + return; +} + +/* Function for 64 bits numbers division. */ +static inline __u64 +div64_32(__u64 n, __u32 div, __u32 * rem) +{ + __u32 low, high; + + low = n & 0xffffffff; + high = n >> 32; + if (high) { + __u32 high1 = high % div; + __u32 low1 = low; + high /= div; + divl(high1, low1, div, &low, rem); + return (__u64) high << 32 | low; + } else { + if (rem) + *rem = low % div; + return low / div; + } + + return 0; +} + +#endif /* __FS_REISER4_LIB_H__ */ + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/lock.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/lock.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1402 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Traditional deadlock avoidance is achieved by acquiring all locks in a single + order. 
V4 balances the tree from the bottom up, and searches the tree from + the top down, and that is really the way we want it, so tradition won't work + for us. + + Instead we have two lock orderings, a high priority lock ordering, and a low + priority lock ordering. Each node in the tree has a lock in its znode. + + Suppose we have a set of processes which lock (R/W) tree nodes. Each process + has a set (maybe empty) of already locked nodes ("process locked set"). Each + process may have a pending lock request to a node locked by another process. + Note: we lock and unlock, but do not transfer locks: it is possible + transferring locks instead would save some bus locking.... + + Deadlock occurs when we have a loop constructed from process locked sets and + lock request vectors. + + + NOTE: The reiser4 "tree" is a tree on disk, but its cached representation in + memory is extended with "znodes" with which we connect nodes with their left + and right neighbors using sibling pointers stored in the znodes. When we + perform balancing operations we often go from left to right and from right to + left. + + + +-P1-+ +-P3-+ + |+--+| V1 |+--+| + ||N1|| -------> ||N3|| + |+--+| |+--+| + +----+ +----+ + ^ | + |V2 |V3 + | v + +---------P2---------+ + |+--+ +--+| + ||N2| -------- |N4|| + |+--+ +--+| + +--------------------+ + + We solve this by ensuring that only low priority processes lock in top to + bottom order and from right to left, and high priority processes lock from + bottom to top and left to right. + + ZAM-FIXME-HANS: order not just node locks in this way, order atom locks, and + kill those damn busy loops. + ANSWER(ZAM): atom locks (which are introduced by ASTAGE_CAPTURE_WAIT atom + stage) cannot be ordered that way. There are no rules what nodes can belong + to the atom and what nodes cannot. We cannot define what is right or left + direction, what is top or bottom. 
We can take immediate parent or side + neighbor of one node, but nobody guarantees that, say, left neighbor node is + not a far right neighbor for other nodes from the same atom. It breaks + deadlock avoidance rules and hi-low priority locking cannot be applied for + atom locks. + + How does it help to avoid deadlocks? + + Suppose we have a deadlock with n processes. Processes from one priority + class never deadlock because they take locks in one consistent + order. + + So, any possible deadlock loop must have low priority as well as high + priority processes. There are no other lock priority levels except low and + high. We know that any deadlock loop contains at least one node locked by a + low priority process and requested by a high priority process. If this + situation is caught and resolved it is sufficient to avoid deadlocks. + + V4 DEADLOCK PREVENTION ALGORITHM IMPLEMENTATION. + + The deadlock prevention algorithm is based on comparing + priorities of node owners (processes which keep znode locked) and + requesters (processes which want to acquire a lock on znode). We + implement a scheme where low-priority owners yield locks to + high-priority requesters. We created a signal passing system that + is used to ask low-priority processes to yield one or more locked + znodes. + + The condition when a znode needs to change its owners is described by the + following formula: + + ############################################# + # # + # (number of high-priority requesters) > 0 # + # AND # + # (number of high-priority owners) == 0  # + # # + ############################################# + + Note that a low-priority process + delays node releasing if another high-priority process owns this node. 
So, slightly more strictly speaking, to have a deadlock capable cycle you must have a loop in which a high priority process is waiting on a low priority process to yield a node, which is slightly different from saying a high priority process is waiting on a node owned by a low priority process. + + It is enough to avoid deadlocks if we prevent any low-priority process from + falling asleep if its locked set contains a node which satisfies the + deadlock condition. + + That condition is implicitly or explicitly checked in all places where new + high-priority requests may be added or removed from node request queue or + high-priority process takes or releases a lock on node. The main + goal of these checks is to never lose the moment when node becomes "has + wrong owners" and send "must-yield-this-lock" signals to its low-pri owners + at that time. + + The information about received signals is stored in the per-process + structure (lock stack) and analyzed before a low-priority process goes to + sleep but after a "fast" attempt to lock a node fails. Any signal wakes + sleeping process up and forces him to re-check lock status and received + signal info. If "must-yield-this-lock" signals were received the locking + primitive (longterm_lock_znode()) fails with -E_DEADLOCK error code. + + V4 LOCKING DRAWBACKS + + If we have already balanced on one level, and we are propagating our changes upward to a higher level, it could be + very messy to surrender all locks on the lower level because we put so much computational work into it, and reverting + them to their state before they were locked might be very complex. We also don't want to acquire all locks before + performing balancing because that would either be almost as much work as the balancing, or it would be too + conservative and lock too much. We want balancing to be done only at high priority. Yet, we might want to go to the + left one node and use some of its empty space... 
So we make one attempt at getting the node to the left using + try_lock, and if it fails we do without it, because we didn't really need it, it was only a nice to have. + + LOCK STRUCTURES DESCRIPTION + + The following data structures are used in the reiser4 locking + implementation: + + All fields related to long-term locking are stored in znode->lock. + + The lock stack is a per thread object. It owns all znodes locked by the + thread. One znode may be locked by several threads in case of read lock or + one znode may be write locked by one thread several times. The special link + objects (lock handles) support n<->m relation between znodes and lock + owners. + + + + +---------+ +---------+ + | LS1 | | LS2 | + +---------+ +---------+ + ^ ^ + |---------------+ +----------+ + v v v v + +---------+ +---------+ +---------+ +---------+ + | LH1 | | LH2 | | LH3 | | LH4 | + +---------+ +---------+ +---------+ +---------+ + ^ ^ ^ ^ + | +------------+ | + v v v + +---------+ +---------+ +---------+ + | Z1 | | Z2 | | Z3 | + +---------+ +---------+ +---------+ + + Thread 1 locked znodes Z1 and Z2, thread 2 locked znodes Z2 and Z3. The picture above shows that lock stack LS1 has a + list of 2 lock handles LH1 and LH2, lock stack LS2 has a list with lock handles LH3 and LH4 on it. Znode Z1 is + locked by only one thread, znode has only one lock handle LH1 on its list, similar situation is for Z3 which is + locked by the thread 2 only. Z2 is locked (for read) twice by different threads and two lock handles are on its + list. Each lock handle represents a single relation of a locking of a znode by a thread. Locking of a znode is an + establishing of a locking relation between the lock stack and the znode by adding of a new lock handle to a list of + lock handles, the lock stack. The lock stack links all lock handles for all znodes locked by the lock stack. The znode + list groups all lock handles for all locks stacks which locked the znode. 
+ + Yet another relation may exist between znode and lock owners. If lock + procedure cannot immediately take lock on an object it adds the lock owner + to a special `requestors' list belonging to the znode. That list represents a + queue of pending lock requests. Because one lock owner may request + only one lock object at a time, it is a 1->n relation between lock objects + and a lock owner implemented as it is described above. Full information + (priority, pointers to lock and link objects) about each lock request is + stored in lock owner structure in `request' field. + + SHORT_TERM LOCKING + + This is a list of primitive operations over lock stacks / lock handles / + znodes and locking descriptions for them. + + 1. locking / unlocking which is done by two list insertions/deletions, one + to/from znode's list of lock handles, another one is to/from lock stack's + list of lock handles. The first insertion is protected by + znode->lock.guard spinlock. The list owned by the lock stack can be + modified only by the thread that owns the lock stack and nobody else can + modify/read it. There is nothing to be protected by a spinlock or + something else. + + 2. adding/removing a lock request to/from znode requesters list. The rule is + that znode->lock.guard spinlock should be taken for this. + + 3. we can traverse list of lock handles and use references to lock stacks who + locked given znode if znode->lock.guard spinlock is taken. + + 4. If a lock stack is associated with a znode as a lock requestor or lock + owner its existence is guaranteed by znode->lock.guard spinlock. Some of its + (lock stack's) fields should be protected from being accessed in parallel + by two or more threads. Please look at lock_stack structure definition + for information on how those fields are protected. */ + +/* Znode lock and capturing intertwining. */ +/* In current implementation we capture formatted nodes before locking + them. 
Take a look on longterm lock znode, try_capture() request precedes + locking requests. The longterm_lock_znode function unconditionally captures + znode before even checking of locking conditions. + + Another variant is to capture znode after locking it. It was not tested, but + at least one deadlock condition is supposed to be there. One thread has + locked a znode (Node-1) and calls try_capture() for it. Try_capture() sleeps + because znode's atom has CAPTURE_WAIT state. Second thread is a flushing + thread, its current atom is the atom Node-1 belongs to. Second thread wants + to lock Node-1 and sleeps because Node-1 is locked by the first thread. The + described situation is a deadlock. */ + +#include "debug.h" +#include "txnmgr.h" +#include "znode.h" +#include "jnode.h" +#include "tree.h" +#include "plugin/node/node.h" +#include "super.h" + +#include + +#if REISER4_DEBUG +static int request_is_deadlock_safe(znode *, znode_lock_mode, + znode_lock_request); +#endif + +/* Returns a lock owner associated with current thread */ +reiser4_internal lock_stack * +get_current_lock_stack(void) +{ + return &get_current_context()->stack; +} + +/* Wakes up all low priority owners informing them about possible deadlock */ +static void +wake_up_all_lopri_owners(znode * node) +{ + lock_handle *handle; + + assert("nikita-1824", rw_zlock_is_locked(&node->lock)); + for_all_type_safe_list(owners, &node->lock.owners, handle) { + spin_lock_stack(handle->owner); + + assert("nikita-1832", handle->node == node); + /* count this signal in owner->nr_signaled */ + if (!handle->signaled) { + handle->signaled = 1; + atomic_inc(&handle->owner->nr_signaled); + } + /* Wake up a single process */ + __reiser4_wake_up(handle->owner); + + spin_unlock_stack(handle->owner); + } +} + +/* Adds a lock to a lock owner, which means creating a link to the lock and + putting the link into the two lists all links are on (the doubly linked list + that forms the lock_stack, and the doubly linked list of links 
attached + to a lock. +*/ +static inline void +link_object(lock_handle * handle, lock_stack * owner, znode * node) +{ + assert("jmacd-810", handle->owner == NULL); + assert("nikita-1828", owner == get_current_lock_stack()); + assert("nikita-1830", rw_zlock_is_locked(&node->lock)); + + handle->owner = owner; + handle->node = node; + + assert("reiser4-4", ergo(locks_list_empty(&owner->locks), owner->nr_locks == 0)); + locks_list_push_back(&owner->locks, handle); + owner->nr_locks ++; + + owners_list_push_front(&node->lock.owners, handle); + handle->signaled = 0; +} + +/* Breaks a relation between a lock and its owner */ +static inline void +unlink_object(lock_handle * handle) +{ + assert("zam-354", handle->owner != NULL); + assert("nikita-1608", handle->node != NULL); + assert("nikita-1633", rw_zlock_is_locked(&handle->node->lock)); + assert("nikita-1829", handle->owner == get_current_lock_stack()); + + assert("reiser4-5", handle->owner->nr_locks > 0); + locks_list_remove_clean(handle); + handle->owner->nr_locks --; + assert("reiser4-6", ergo(locks_list_empty(&handle->owner->locks), handle->owner->nr_locks == 0)); + + owners_list_remove_clean(handle); + + /* indicates that lock handle is free now */ + handle->owner = NULL; +} + +/* Actually locks an object knowing that we are able to do this */ +static void +lock_object(lock_stack * owner) +{ + lock_request *request; + znode *node; + assert("nikita-1839", owner == get_current_lock_stack()); + + request = &owner->request; + node = request->node; + assert("nikita-1834", rw_zlock_is_locked(&node->lock)); + if (request->mode == ZNODE_READ_LOCK) { + node->lock.nr_readers++; + } else { + /* check that we don't switched from read to write lock */ + assert("nikita-1840", node->lock.nr_readers <= 0); + /* We allow recursive locking; a node can be locked several + times for write by same process */ + node->lock.nr_readers--; + } + + link_object(request->handle, owner, node); + + if (owner->curpri) { + 
node->lock.nr_hipri_owners++; + } +} + +/* Check for recursive write locking */ +static int +recursive(lock_stack * owner) +{ + int ret; + znode *node; + + node = owner->request.node; + + /* Owners list is not empty for a locked node */ + assert("zam-314", !owners_list_empty(&node->lock.owners)); + assert("nikita-1841", owner == get_current_lock_stack()); + assert("nikita-1848", rw_zlock_is_locked(&node->lock)); + + ret = (owners_list_front(&node->lock.owners)->owner == owner); + + /* Recursive read locking should be done usual way */ + assert("zam-315", !ret || owner->request.mode == ZNODE_WRITE_LOCK); + /* mixing of read/write locks is not allowed */ + assert("zam-341", !ret || znode_is_wlocked(node)); + + return ret; +} + +#if REISER4_DEBUG +/* Returns true if the lock is held by the calling thread. */ +int +znode_is_any_locked(const znode * node) +{ + lock_handle *handle; + lock_stack *stack; + int ret; + + if (!znode_is_locked(node)) { + return 0; + } + + stack = get_current_lock_stack(); + + spin_lock_stack(stack); + + ret = 0; + + for_all_type_safe_list(locks, &stack->locks, handle) { + if (handle->node == node) { + ret = 1; + break; + } + } + + spin_unlock_stack(stack); + + return ret; +} + +#endif + +/* Returns true if a write lock is held by the calling thread. */ +reiser4_internal int +znode_is_write_locked(const znode * node) +{ + lock_stack *stack; + lock_handle *handle; + + assert("jmacd-8765", node != NULL); + + if (!znode_is_wlocked(node)) { + return 0; + } + + stack = get_current_lock_stack(); + + /* If it is write locked, then all owner handles must equal the current stack. */ + handle = owners_list_front(&node->lock.owners); + + return (handle->owner == stack); +} + +/* This "deadlock" condition is the essential part of reiser4 locking + implementation. This condition is checked explicitly by calling + check_deadlock_condition() or implicitly in all places where znode lock + state (set of owners and request queue) is changed. 
Locking code is + designed to use this condition to trigger procedure of passing object from + low priority owner(s) to high priority one(s). + + The procedure results in passing an event (setting lock_handle->signaled + flag) and counting this event in nr_signaled field of owner's lock stack + object and wakeup owner's process. +*/ +static inline int +check_deadlock_condition(znode * node) +{ + assert("nikita-1833", rw_zlock_is_locked(&node->lock)); + return node->lock.nr_hipri_requests > 0 && node->lock.nr_hipri_owners == 0; +} + +/* checks lock/request compatibility */ +static int +check_lock_object(lock_stack * owner) +{ + znode *node = owner->request.node; + + assert("nikita-1842", owner == get_current_lock_stack()); + assert("nikita-1843", rw_zlock_is_locked(&node->lock)); + + /* See if the node is disconnected. */ + if (unlikely(ZF_ISSET(node, JNODE_IS_DYING))) { + return RETERR(-EINVAL); + } + + /* Do not ever try to take a lock if we are going in low priority + direction and a node have a high priority request without high + priority owners. */ + if (unlikely(!owner->curpri && check_deadlock_condition(node))) { + return RETERR(-E_REPEAT); + } + + if (unlikely(!is_lock_compatible(node, owner->request.mode))) { + return RETERR(-E_REPEAT); + } + + return 0; +} + +/* check for lock/request compatibility and update tree statistics */ +static int +can_lock_object(lock_stack * owner) +{ + int result; + + result = check_lock_object(owner); + return result; +} + +/* Setting of a high priority to the process. It clears "signaled" flags + because znode locked by high-priority process can't satisfy our "deadlock + condition". 
*/ +static void +set_high_priority(lock_stack * owner) +{ + assert("nikita-1846", owner == get_current_lock_stack()); + /* Do nothing if current priority is already high */ + if (!owner->curpri) { + /* We don't need locking for owner->locks list, because, this + * function is only called with the lock stack of the current + * thread, and no other thread can play with owner->locks list + * and/or change ->node pointers of lock handles in this list. + * + * (Interrupts also are not involved.) + */ + lock_handle *item = locks_list_front(&owner->locks); + while (!locks_list_end(&owner->locks, item)) { + znode *node = item->node; + + WLOCK_ZLOCK(&node->lock); + + node->lock.nr_hipri_owners++; + + /* we can safely set signaled to zero, because + previous statement (nr_hipri_owners ++) guarantees + that signaled will be never set again. */ + item->signaled = 0; + WUNLOCK_ZLOCK(&node->lock); + + item = locks_list_next(item); + } + owner->curpri = 1; + atomic_set(&owner->nr_signaled, 0); + } +} + +/* Sets a low priority to the process. */ +static void +set_low_priority(lock_stack * owner) +{ + assert("nikita-3075", owner == get_current_lock_stack()); + /* Do nothing if current priority is already low */ + if (owner->curpri) { + /* scan all locks (lock handles) held by @owner, which is + actually current thread, and check whether we are reaching + deadlock possibility anywhere. + */ + lock_handle *handle = locks_list_front(&owner->locks); + while (!locks_list_end(&owner->locks, handle)) { + znode *node = handle->node; + WLOCK_ZLOCK(&node->lock); + /* this thread just was hipri owner of @node, so + nr_hipri_owners has to be greater than zero. */ + assert("nikita-1835", node->lock.nr_hipri_owners > 0); + node->lock.nr_hipri_owners--; + /* If we have deadlock condition, adjust a nr_signaled + field. 
It is enough to set "signaled" flag only for + current process, other low-pri owners will be + signaled and woken up after current process unlocks + this object and any high-priority requestor takes + control. */ + if (check_deadlock_condition(node) + && !handle->signaled) { + handle->signaled = 1; + atomic_inc(&owner->nr_signaled); + } + WUNLOCK_ZLOCK(&node->lock); + handle = locks_list_next(handle); + } + owner->curpri = 0; + } +} + +#define MAX_CONVOY_SIZE ((NR_CPUS - 1)) + +/* helper function used by longterm_unlock_znode() to wake up requestor(s). */ +/* + * In certain multi threaded work loads jnode spin lock is the most + * contended one. Wake up of threads waiting for znode is, thus, + * important to do right. There are three well known strategies: + * + * (1) direct hand-off. Hasn't been tried. + * + * (2) wake all (thundering herd). This degrades performance in our + * case. + * + * (3) wake one. Simplest solution where requestor in the front of + * requestors list is awakened under znode spin lock is not very + * good on the SMP, because first thing requestor will try to do + * after waking up on another CPU is to acquire znode spin lock + * that is still held by this thread. As an optimization we grab + * lock stack spin lock, release znode spin lock and wake + * requestor. done_context() synchronizes against stack spin lock + * to avoid (impossible) case where requestor has been woken by + * some other thread (wake_up_all_lopri_owners(), or something + * similar) and managed to exit before we woke it up. + * + * Effect of this optimization wasn't big, after all. 
+ * + */ +static void +wake_up_requestor(znode *node) +{ +#if NR_CPUS > 2 + requestors_list_head *creditors; + lock_stack *convoy[MAX_CONVOY_SIZE]; + int convoyused; + int convoylimit; + + assert("nikita-3180", node != NULL); + assert("nikita-3181", rw_zlock_is_locked(&node->lock)); + + convoyused = 0; + convoylimit = min(num_online_cpus() - 1, MAX_CONVOY_SIZE); + creditors = &node->lock.requestors; + if (!requestors_list_empty(creditors)) { + convoy[0] = requestors_list_front(creditors); + convoyused = 1; + /* + * it has been verified experimentally, that there are no + * convoys on the leaf level. + */ + if (znode_get_level(node) != LEAF_LEVEL && + convoy[0]->request.mode == ZNODE_READ_LOCK && + convoylimit > 1) { + lock_stack *item; + + for (item = requestors_list_next(convoy[0]); + ! requestors_list_end(creditors, item); + item = requestors_list_next(item)) { + if (item->request.mode == ZNODE_READ_LOCK) { + convoy[convoyused] = item; + ++ convoyused; + /* + * it is safe to spin lock multiple + * lock stacks here, because lock + * stack cannot sleep on more than one + * requestors queue. + */ + /* + * use raw spin_lock instead of macro + * wrappers, because spin lock + * profiling code cannot cope with so + * many locks held at the same time. 
+ */ + spin_lock(&item->sguard.lock); + if (convoyused == convoylimit) + break; + } + } + } + spin_lock(&convoy[0]->sguard.lock); + } + + WUNLOCK_ZLOCK(&node->lock); + + while (convoyused > 0) { + -- convoyused; + __reiser4_wake_up(convoy[convoyused]); + spin_unlock(&convoy[convoyused]->sguard.lock); + } +#else + /* uniprocessor case: keep it simple */ + if (!requestors_list_empty(&node->lock.requestors)) { + lock_stack *requestor; + + requestor = requestors_list_front(&node->lock.requestors); + reiser4_wake_up(requestor); + } + + WUNLOCK_ZLOCK(&node->lock); +#endif +} + +#undef MAX_CONVOY_SIZE + +/* release long-term lock, acquired by longterm_lock_znode() */ +reiser4_internal void +longterm_unlock_znode(lock_handle * handle) +{ + znode *node = handle->node; + lock_stack *oldowner = handle->owner; + int hipri; + int readers; + int rdelta; + int youdie; + + /* + * this is time-critical and highly optimized code. Modify carefully. + */ + + assert("jmacd-1021", handle != NULL); + assert("jmacd-1022", handle->owner != NULL); + assert("nikita-1392", LOCK_CNT_GTZ(long_term_locked_znode)); + + assert("zam-130", oldowner == get_current_lock_stack()); + + LOCK_CNT_DEC(long_term_locked_znode); + + + /* + * to minimize amount of operations performed under lock, pre-compute + * all variables used within critical section. This makes code + * obscure. + */ + + /* was this lock of hi or lo priority */ + hipri = oldowner->curpri ? -1 : 0; + /* number of readers */ + readers = node->lock.nr_readers; + /* +1 if write lock, -1 if read lock */ + rdelta = (readers > 0) ? 
-1 : +1; + /* true if node is to die and write lock is released */ + youdie = ZF_ISSET(node, JNODE_HEARD_BANSHEE) && (readers < 0); + + WLOCK_ZLOCK(&node->lock); + + assert("zam-101", znode_is_locked(node)); + + /* Adjust a number of high priority owners of this lock */ + node->lock.nr_hipri_owners += hipri; + assert("nikita-1836", node->lock.nr_hipri_owners >= 0); + + /* Handle znode deallocation on last write-lock release. */ + if (znode_is_wlocked_once(node)) { + if (youdie) { + forget_znode(handle); + assert("nikita-2191", znode_invariant(node)); + zput(node); + return; + } + } + + if (handle->signaled) + atomic_dec(&oldowner->nr_signaled); + + /* Unlocking means owner<->object link deletion */ + unlink_object(handle); + + /* This is enough to be sure whether an object is completely + unlocked. */ + node->lock.nr_readers += rdelta; + + /* If the node is locked it must have an owners list. Likewise, if + the node is unlocked it must have an empty owners list. */ + assert("zam-319", equi(znode_is_locked(node), + !owners_list_empty(&node->lock.owners))); + +#if REISER4_DEBUG + if (!znode_is_locked(node)) + ++ node->times_locked; +#endif + + /* If there are pending lock requests we wake up a requestor */ + if (!znode_is_wlocked(node)) + wake_up_requestor(node); + else + WUNLOCK_ZLOCK(&node->lock); + + assert("nikita-3182", rw_zlock_is_not_locked(&node->lock)); + /* minus one reference from handle->node */ + handle->node = NULL; + assert("nikita-2190", znode_invariant(node)); + ON_DEBUG(check_lock_data()); + ON_DEBUG(check_lock_node_data(node)); + zput(node); +} + +/* final portion of longterm-lock */ +static int +lock_tail(lock_stack *owner, int wake_up_next, int ok, znode_lock_mode mode) +{ + znode *node = owner->request.node; + + assert("jmacd-807", rw_zlock_is_locked(&node->lock)); + + /* If we broke with (ok == 0) it means we can_lock, now do it. 
*/ + if (ok == 0) { + lock_object(owner); + owner->request.mode = 0; + if (mode == ZNODE_READ_LOCK) + wake_up_next = 1; + } + + if (wake_up_next) + wake_up_requestor(node); + else + WUNLOCK_ZLOCK(&node->lock); + + if (ok == 0) { + /* count a reference from lockhandle->node + + znode was already referenced at the entry to this function, + hence taking spin-lock here is not necessary (see comment + in the zref()). + */ + zref(node); + + LOCK_CNT_INC(long_term_locked_znode); + } + + ON_DEBUG(check_lock_data()); + ON_DEBUG(check_lock_node_data(node)); + return ok; +} + +/* + * version of longterm_lock_znode() optimized for the most common case: read + * lock without any special flags. This is the kind of lock that any tree + * traversal takes on the root node of the tree, which is very frequent. + */ +static int +longterm_lock_tryfast(lock_stack * owner) +{ + int result; + int wake_up_next = 0; + znode *node; + zlock *lock; + + node = owner->request.node; + lock = &node->lock; + + assert("nikita-3340", schedulable()); + assert("nikita-3341", request_is_deadlock_safe(node, + ZNODE_READ_LOCK, + ZNODE_LOCK_LOPRI)); + + result = UNDER_RW(zlock, lock, read, can_lock_object(owner)); + + if (likely(result != -EINVAL)) { + spin_lock_znode(node); + result = try_capture( + ZJNODE(node), ZNODE_READ_LOCK, 0, 1/* can copy on capture */); + spin_unlock_znode(node); + WLOCK_ZLOCK(lock); + if (unlikely(result != 0)) { + owner->request.mode = 0; + wake_up_next = 1; + } else { + result = can_lock_object(owner); + if (unlikely(result == -E_REPEAT)) { + /* fall back to longterm_lock_znode() */ + WUNLOCK_ZLOCK(lock); + return 1; + } + } + return lock_tail(owner, wake_up_next, result, ZNODE_READ_LOCK); + } else + return 1; +} + +/* locks given lock object */ +reiser4_internal int +longterm_lock_znode( + /* local link object (allocated by lock owner thread, usually on its own + * stack) */ + lock_handle * handle, + /* znode we want to lock. 
*/ + znode * node, + /* {ZNODE_READ_LOCK, ZNODE_WRITE_LOCK}; */ + znode_lock_mode mode, + /* {0, -EINVAL, -E_DEADLOCK}, see return codes description. */ + znode_lock_request request) +{ + int ret; + int hipri = (request & ZNODE_LOCK_HIPRI) != 0; + int wake_up_next = 0; + int non_blocking = 0; + int has_atom; + txn_capture cap_flags; + zlock *lock; + txn_handle *txnh; + tree_level level; + + /* Get current process context */ + lock_stack *owner = get_current_lock_stack(); + + /* Check that the lock handle is initialized and isn't already being + * used. */ + assert("jmacd-808", handle->owner == NULL); + assert("nikita-3026", schedulable()); + assert("nikita-3219", request_is_deadlock_safe(node, mode, request)); + /* long term locks are not allowed in the VM contexts (->writepage(), + * prune_{d,i}cache()). + * + * FIXME this doesn't work due to unused-dentry-with-unlinked-inode + * bug caused by d_splice_alias() only working for directories. + */ + assert("nikita-3547", 1 || ((current->flags & PF_MEMALLOC) == 0)); + + cap_flags = 0; + if (request & ZNODE_LOCK_NONBLOCK) { + cap_flags |= TXN_CAPTURE_NONBLOCKING; + non_blocking = 1; + } + + if (request & ZNODE_LOCK_DONT_FUSE) + cap_flags |= TXN_CAPTURE_DONT_FUSE; + + /* If we are changing our process priority we must adjust a number + of high priority owners for each znode that we already lock */ + if (hipri) { + set_high_priority(owner); + } else { + set_low_priority(owner); + } + + level = znode_get_level(node); + + /* Fill request structure with our values. */ + owner->request.mode = mode; + owner->request.handle = handle; + owner->request.node = node; + + txnh = get_current_context()->trans; + lock = &node->lock; + + if (mode == ZNODE_READ_LOCK && request == 0) { + ret = longterm_lock_tryfast(owner); + if (ret <= 0) + return ret; + } + + has_atom = (txnh->atom != NULL); + + /* Synchronize on node's zlock guard lock. 
*/ + WLOCK_ZLOCK(lock); + + if (znode_is_locked(node) && + mode == ZNODE_WRITE_LOCK && recursive(owner)) + return lock_tail(owner, 0, 0, mode); + + for (;;) { + /* Check the lock's availability: if it is unavailable we get + E_REPEAT, 0 indicates "can_lock", otherwise the node is + invalid. */ + ret = can_lock_object(owner); + + if (unlikely(ret == -EINVAL)) { + /* @node is dying. Leave it alone. */ + /* wakeup next requestor to support lock invalidating */ + wake_up_next = 1; + break; + } + + if (unlikely(ret == -E_REPEAT && non_blocking)) { + /* either locking of @node by the current thread will + * lead to the deadlock, or lock modes are + * incompatible. */ + break; + } + + assert("nikita-1844", (ret == 0) || ((ret == -E_REPEAT) && !non_blocking)); + /* If we can get the lock... Try to capture first before + taking the lock.*/ + + /* first handle commonest case where node and txnh are already + * in the same atom. */ + /* safe to do without taking locks, because: + * + * 1. read of aligned word is atomic with respect to writes to + * this word + * + * 2. false negatives are handled in try_capture(). + * + * 3. false positives are impossible. + * + * PROOF: left as an exercise to the curious reader. + * + * Just kidding. Here is one: + * + * At the time T0 txnh->atom is stored in txnh_atom. + * + * At the time T1 node->atom is stored in node_atom. + * + * At the time T2 we observe that + * + * txnh_atom != NULL && node_atom == txnh_atom. + * + * Imagine that at this moment we acquire node and txnh spin + * lock in this order. Suppose that under spin lock we have + * + * node->atom != txnh->atom, (S1) + * + * at the time T3. + * + * txnh->atom != NULL still, because txnh is open by the + * current thread. + * + * Suppose node->atom == NULL, that is, node was un-captured + * between T1 and T3. But un-capturing of formatted node is + * always preceded by the call to invalidate_lock(), which + * marks znode as JNODE_IS_DYING under zlock spin + * lock. 
Contradiction, because can_lock_object() above checks + * for JNODE_IS_DYING. Hence, node->atom != NULL at T3. + * + * Suppose that node->atom != node_atom, that is, atom, node + * belongs to was fused into another atom: node_atom was fused + * into node->atom. Atom of txnh was equal to node_atom at T2, + * which means that under spin lock, txnh->atom == node->atom, + * because txnh->atom can only follow fusion + * chain. Contradicts S1. + * + * The same for hypothesis txnh->atom != txnh_atom. Hence, + * node->atom == node_atom == txnh_atom == txnh->atom. Again + * contradicts S1. Hence S1 is false. QED. + * + */ + + if (likely(has_atom && ZJNODE(node)->atom == txnh->atom)) { + ; + } else { + /* + * unlock zlock spin lock here. It is possible for + * longterm_unlock_znode() to sneak in here, but there + * is no harm: invalidate_lock() will mark znode as + * JNODE_IS_DYING and this will be noted by + * can_lock_object() below. + */ + WUNLOCK_ZLOCK(lock); + spin_lock_znode(node); + ret = try_capture( + ZJNODE(node), mode, cap_flags, 1/* can copy on capture*/); + spin_unlock_znode(node); + WLOCK_ZLOCK(lock); + if (unlikely(ret != 0)) { + /* In the failure case, the txnmgr releases + the znode's lock (or in some cases, it was + released a while ago). There's no need to + reacquire it so we should return here, + avoid releasing the lock. */ + owner->request.mode = 0; + /* next requestor may not fail */ + wake_up_next = 1; + break; + } + + /* Check the lock's availability again -- this is + because under some circumstances the capture code + has to release and reacquire the znode spinlock. */ + ret = can_lock_object(owner); + } + + /* This time, a return of (ret == 0) means we can lock, so we + should break out of the loop. */ + if (likely(ret != -E_REPEAT || non_blocking)) { + break; + } + + /* Lock is unavailable, we have to wait. */ + + /* By having semaphore initialization here we cannot lose + wakeup signal even if it comes after `nr_signaled' field + check. 
*/ + ret = prepare_to_sleep(owner); + if (unlikely(ret != 0)) { + break; + } + + assert("nikita-1837", rw_zlock_is_locked(&node->lock)); + if (hipri) { + /* If we are going in high priority direction then + increase high priority requests counter for the + node */ + lock->nr_hipri_requests++; + /* If there are no high priority owners for a node, + then immediately wake up low priority owners, so + they can detect possible deadlock */ + if (lock->nr_hipri_owners == 0) + wake_up_all_lopri_owners(node); + /* And prepare a lock request */ + requestors_list_push_front(&lock->requestors, owner); + } else { + /* If we are going in low priority direction then we + set low priority to our process. This is the only + case when a process may become low priority */ + /* And finally prepare a lock request */ + requestors_list_push_back(&lock->requestors, owner); + } + + /* Ok, here we have prepared a lock request, so unlock + a znode ...*/ + WUNLOCK_ZLOCK(lock); + /* ... and sleep */ + go_to_sleep(owner); + + WLOCK_ZLOCK(lock); + + if (hipri) { + assert("nikita-1838", lock->nr_hipri_requests > 0); + lock->nr_hipri_requests--; + } + + requestors_list_remove(owner); + } + + assert("jmacd-807/a", rw_zlock_is_locked(&node->lock)); + return lock_tail(owner, wake_up_next, ret, mode); +} + +/* lock object invalidation means changing of lock object state to `INVALID' + and waiting for all other processes to cancel their lock requests. */ +reiser4_internal void +invalidate_lock(lock_handle * handle /* path to lock + * owner and lock + * object is being + * invalidated. 
*/ ) +{ + znode *node = handle->node; + lock_stack *owner = handle->owner; + lock_stack *rq; + + assert("zam-325", owner == get_current_lock_stack()); + assert("zam-103", znode_is_write_locked(node)); + assert("nikita-1393", !ZF_ISSET(node, JNODE_LEFT_CONNECTED)); + assert("nikita-1793", !ZF_ISSET(node, JNODE_RIGHT_CONNECTED)); + assert("nikita-1394", ZF_ISSET(node, JNODE_HEARD_BANSHEE)); + assert("nikita-3097", znode_is_wlocked_once(node)); + assert("nikita-3338", rw_zlock_is_locked(&node->lock)); + + if (handle->signaled) + atomic_dec(&owner->nr_signaled); + + ZF_SET(node, JNODE_IS_DYING); + unlink_object(handle); + node->lock.nr_readers = 0; + + /* all requestors will be informed that lock is invalidated. */ + for_all_type_safe_list(requestors, &node->lock.requestors, rq) { + reiser4_wake_up(rq); + } + + /* We use that each unlock() will wakeup first item from requestors + list; our lock stack is the last one. */ + while (!requestors_list_empty(&node->lock.requestors)) { + requestors_list_push_back(&node->lock.requestors, owner); + + prepare_to_sleep(owner); + + WUNLOCK_ZLOCK(&node->lock); + go_to_sleep(owner); + WLOCK_ZLOCK(&node->lock); + + requestors_list_remove(owner); + } + + WUNLOCK_ZLOCK(&node->lock); +} + +/* Initializes lock_stack. */ +reiser4_internal void +init_lock_stack(lock_stack * owner /* pointer to + * allocated + * structure. */ ) +{ + /* xmemset(,0,) is done already as a part of reiser4 context + * initialization */ + /* xmemset(owner, 0, sizeof (lock_stack)); */ + locks_list_init(&owner->locks); + requestors_list_clean(owner); + spin_stack_init(owner); + owner->curpri = 1; + sema_init(&owner->sema, 0); +} + +/* Initializes lock object. */ +reiser4_internal void +reiser4_init_lock(zlock * lock /* pointer on allocated + * uninitialized lock object + * structure. 
*/ ) +{ + memset(lock, 0, sizeof (zlock)); + rw_zlock_init(lock); + requestors_list_init(&lock->requestors); + owners_list_init(&lock->owners); +} + +/* lock handle initialization */ +reiser4_internal void +init_lh(lock_handle * handle) +{ + memset(handle, 0, sizeof *handle); + locks_list_clean(handle); + owners_list_clean(handle); +} + +/* freeing of lock handle resources */ +reiser4_internal void +done_lh(lock_handle * handle) +{ + assert("zam-342", handle != NULL); + if (handle->owner != NULL) + longterm_unlock_znode(handle); +} + +/* Transfer a lock handle (presumably so that variables can be moved between stack and + heap locations). */ +static void +move_lh_internal(lock_handle * new, lock_handle * old, int unlink_old) +{ + znode *node = old->node; + lock_stack *owner = old->owner; + int signaled; + + /* locks_list, modified by link_object() is not protected by + anything. This is valid because only current thread ever modifies + locks_list of its lock_stack. + */ + assert("nikita-1827", owner == get_current_lock_stack()); + assert("nikita-1831", new->owner == NULL); + + WLOCK_ZLOCK(&node->lock); + + signaled = old->signaled; + if (unlink_old) { + unlink_object(old); + } else { + if (node->lock.nr_readers > 0) { + node->lock.nr_readers += 1; + } else { + node->lock.nr_readers -= 1; + } + if (signaled) { + atomic_inc(&owner->nr_signaled); + } + if (owner->curpri) { + node->lock.nr_hipri_owners += 1; + } + LOCK_CNT_INC(long_term_locked_znode); + + zref(node); + } + link_object(new, owner, node); + new->signaled = signaled; + + WUNLOCK_ZLOCK(&node->lock); +} + +reiser4_internal void +move_lh(lock_handle * new, lock_handle * old) +{ + move_lh_internal(new, old, /*unlink_old */ 1); +} + +reiser4_internal void +copy_lh(lock_handle * new, lock_handle * old) +{ + move_lh_internal(new, old, /*unlink_old */ 0); +} + +/* after getting -E_DEADLOCK we unlock znodes until this function returns false */ +reiser4_internal int +check_deadlock(void) +{ + lock_stack *owner = 
get_current_lock_stack(); + return atomic_read(&owner->nr_signaled) != 0; +} + +/* Before going to sleep we re-check "release lock" requests which might come from threads with hi-pri lock + priorities. */ +reiser4_internal int +prepare_to_sleep(lock_stack * owner) +{ + assert("nikita-1847", owner == get_current_lock_stack()); + /* NOTE(Zam): We cannot reset the lock semaphore here because it may + clear wake-up signal. The initial design was to re-check all + conditions under which we continue locking, release locks or sleep + until conditions are changed. However, even lock.c does not follow + that design. So, wake-up signal which is stored in semaphore state + could be lost by semaphore reset. The less complex scheme without + resetting the semaphore is enough not to lose wake-ups. + + if (0) { + + NOTE-NIKITA: I commented call to sema_init() out hoping + that it is the reason for thread sleeping in + down(&owner->sema) without any other thread running. + + Anyway, it is just an optimization: if semaphore is not + reinitialised at this point, in the worst case + longterm_lock_znode() would have to iterate its loop once + more. 
+ spin_lock_stack(owner); + sema_init(&owner->sema, 0); + spin_unlock_stack(owner); + } + */ + + /* We return -E_DEADLOCK if one or more "give me the lock" messages are + * counted in nr_signaled */ + if (unlikely(atomic_read(&owner->nr_signaled) != 0)) { + assert("zam-959", !owner->curpri); + return RETERR(-E_DEADLOCK); + } + return 0; +} + +/* Wakes up a single thread */ +reiser4_internal void +__reiser4_wake_up(lock_stack * owner) +{ + up(&owner->sema); +} + +/* Puts a thread to sleep */ +reiser4_internal void +go_to_sleep(lock_stack * owner) +{ + /* Well, we might sleep here, so holding of any spinlocks is no-no */ + assert("nikita-3027", schedulable()); + /* return down_interruptible(&owner->sema); */ + down(&owner->sema); +} + +reiser4_internal int +lock_stack_isclean(lock_stack * owner) +{ + if (locks_list_empty(&owner->locks)) { + assert("zam-353", atomic_read(&owner->nr_signaled) == 0); + return 1; + } + + return 0; +} + +#if REISER4_DEBUG +/* Debugging help */ +reiser4_internal void +print_lock_stack(const char *prefix, lock_stack * owner) +{ + lock_handle *handle; + + spin_lock_stack(owner); + + printk("%s:\n", prefix); + printk(".... nr_signaled %d\n", atomic_read(&owner->nr_signaled)); + printk(".... curpri %s\n", owner->curpri ? "high" : "low"); + + if (owner->request.mode != 0) { + printk(".... current request: %s", owner->request.mode == ZNODE_WRITE_LOCK ? "write" : "read"); + print_address("", znode_get_block(owner->request.node)); + } + + printk(".... current locks:\n"); + + for_all_type_safe_list(locks, &owner->locks, handle) { + if (handle->node != NULL) + print_address(znode_is_rlocked(handle->node) ? + "...... read" : "...... 
write", znode_get_block(handle->node)); + } + + spin_unlock_stack(owner); +} + +/* + * debugging functions + */ + +/* check consistency of locking data-structures hanging off the @stack */ +static void +check_lock_stack(lock_stack * stack) +{ + spin_lock_stack(stack); + /* check that stack->locks is not corrupted */ + locks_list_check(&stack->locks); + spin_unlock_stack(stack); +} + +/* check consistency of locking data structures */ +void +check_lock_data(void) +{ + check_lock_stack(&get_current_context()->stack); +} + +/* check consistency of locking data structures for @node */ +void +check_lock_node_data(znode * node) +{ + RLOCK_ZLOCK(&node->lock); + owners_list_check(&node->lock.owners); + requestors_list_check(&node->lock.requestors); + RUNLOCK_ZLOCK(&node->lock); +} + +/* check that given lock request is deadlock safe. This check is, of course, + * not exhaustive. */ +static int +request_is_deadlock_safe(znode * node, znode_lock_mode mode, + znode_lock_request request) +{ + lock_stack *owner; + + owner = get_current_lock_stack(); + /* + * check that hipri lock request is not issued when there are locked + * nodes at the higher levels. + */ + if (request & ZNODE_LOCK_HIPRI && !(request & ZNODE_LOCK_NONBLOCK) && + znode_get_level(node) != 0) { + lock_handle *item; + + for_all_type_safe_list(locks, &owner->locks, item) { + znode *other = item->node; + + if (znode_get_level(other) == 0) + continue; + if (znode_get_level(other) > znode_get_level(node)) + return 0; + } + } + return 1; +} + +#endif + +/* return pointer to static storage with name of lock_mode. For + debugging */ +reiser4_internal const char * +lock_mode_name(znode_lock_mode lock /* lock mode to get name of */ ) +{ + if (lock == ZNODE_READ_LOCK) + return "read"; + else if (lock == ZNODE_WRITE_LOCK) + return "write"; + else { + static char buf[30]; + + sprintf(buf, "unknown: %i", lock); + return buf; + } +} + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/lock.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/lock.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,251 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Long term locking data structures. See lock.c for details. */ + +#ifndef __LOCK_H__ +#define __LOCK_H__ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "spin_macros.h" +#include "key.h" +#include "coord.h" +#include "type_safe_list.h" +#include "plugin/node/node.h" +#include "jnode.h" +#include "readahead.h" + +#include +#include +#include /* for PAGE_CACHE_SIZE */ +#include +#include + +/* per-znode lock requests queue; list items are lock owner objects + which want to lock given znode. + + Locking: protected by znode spin lock. */ +TYPE_SAFE_LIST_DECLARE(requestors); +/* per-znode list of lock handles for this znode + + Locking: protected by znode spin lock. */ +TYPE_SAFE_LIST_DECLARE(owners); +/* per-owner list of lock handles that point to locked znodes which + belong to one lock owner + + Locking: this list is only accessed by the thread owning the lock stack this + list is attached to. Hence, no locking is necessary. +*/ +TYPE_SAFE_LIST_DECLARE(locks); + +/* Per-znode lock object */ +struct zlock { + reiser4_rw_data guard; + /* The number of readers if positive; the number of recursively taken + write locks if negative. Protected by zlock spin lock. 
*/ + int nr_readers; + /* A number of processes (lock_stacks) that have this object + locked with high priority */ + unsigned nr_hipri_owners; + /* A number of attempts to lock znode in high priority direction */ + unsigned nr_hipri_requests; + /* A linked list of lock_handle objects that contains pointers + for all lock_stacks which have this lock object locked */ + owners_list_head owners; + /* A linked list of lock_stacks that wait for this lock */ + requestors_list_head requestors; +}; + +#define rw_ordering_pred_zlock(lock) \ + (lock_counters()->spin_locked_stack == 0) + +/* Define spin_lock_zlock, spin_unlock_zlock, etc. */ +RW_LOCK_FUNCTIONS(zlock, zlock, guard); + +#define lock_is_locked(lock) ((lock)->nr_readers != 0) +#define lock_is_rlocked(lock) ((lock)->nr_readers > 0) +#define lock_is_wlocked(lock) ((lock)->nr_readers < 0) +#define lock_is_wlocked_once(lock) ((lock)->nr_readers == -1) +#define lock_can_be_rlocked(lock) ((lock)->nr_readers >=0) +#define lock_mode_compatible(lock, mode) \ + (((mode) == ZNODE_WRITE_LOCK && !lock_is_locked(lock)) \ + || ((mode) == ZNODE_READ_LOCK && lock_can_be_rlocked(lock))) + + +/* Since we have R/W znode locks we need additional bidirectional `link' + objects to implement n<->m relationship between lock owners and lock + objects. We call them `lock handles'. + + Locking: see lock.c/"SHORT-TERM LOCKING" +*/ +struct lock_handle { + /* This flag indicates that a signal to yield a lock was passed to + lock owner and counted in owner->nr_signaled + + Locking: this is accessed under spin lock on ->node. 
+ */ + int signaled; + /* A link to owner of a lock */ + lock_stack *owner; + /* A link to znode locked */ + znode *node; + /* A list of all locks for a process */ + locks_list_link locks_link; + /* A list of all owners for a znode */ + owners_list_link owners_link; +}; + +typedef struct lock_request { + /* A pointer to uninitialized link object */ + lock_handle *handle; + /* A pointer to the object we want to lock */ + znode *node; + /* Lock mode (ZNODE_READ_LOCK or ZNODE_WRITE_LOCK) */ + znode_lock_mode mode; +} lock_request; + +/* A lock stack structure for accumulating locks owned by a process */ +struct lock_stack { + /* A guard lock protecting a lock stack */ + reiser4_spin_data sguard; + /* number of znodes which were requested by high priority processes */ + atomic_t nr_signaled; + /* Current priority of a process + + This is only accessed by the current thread and thus requires no + locking. + */ + int curpri; + /* A list of all locks owned by this process. Elements can be added to + * this list only by the current thread. ->node pointers in this list + * can be only changed by the current thread. */ + locks_list_head locks; + int nr_locks; /* number of lock handles in the above list */ + /* When lock_stack waits for the lock, it puts itself on double-linked + requestors list of that lock */ + requestors_list_link requestors_link; + /* Current lock request info. + + This is only accessed by the current thread and thus requires no + locking. + */ + lock_request request; + /* It is a lock_stack's synchronization object: the process sleeps + on it when a lock it wishes to add to this lock_stack is not + immediately available. It is used + instead of wait_queue_t object due to locking problems (lost wake + up). "lost wakeup" occurs when process is woken up before it actually + becomes 'sleepy' (through sleep_on()). Using of semaphore object is + simplest way to avoid that problem. 
+ + A semaphore is used in the following way: only the process that is + the owner of the lock_stack initializes it (to zero) and calls + down(sema) on it. Usually this causes the process to sleep on the + semaphore. Other processes may wake it up by calling up(sema). The + advantage to a semaphore is that up() and down() calls are not + required to preserve order. Unlike wait_queue it works when process + is woken up before getting to sleep. + + NOTE-NIKITA: Transaction manager is going to have condition variables + (&kcondvar_t) anyway, so this probably will be replaced with + one in the future. + + After further discussion, Nikita has shown me that Zam's implementation is + exactly a condition variable. The znode's {zguard,requestors_list} represents + condition variable and the lock_stack's {sguard,semaphore} guards entry and + exit from the condition variable's wait queue. But the existing code can't + just be replaced with a more general abstraction, and I think it's fine the way + it is. 
*/ + struct semaphore sema; +}; + +/* defining of list manipulation functions for lists above */ +TYPE_SAFE_LIST_DEFINE(requestors, lock_stack, requestors_link); +TYPE_SAFE_LIST_DEFINE(owners, lock_handle, owners_link); +TYPE_SAFE_LIST_DEFINE(locks, lock_handle, locks_link); + +/* + User-visible znode locking functions +*/ + +extern int longterm_lock_znode (lock_handle * handle, + znode * node, + znode_lock_mode mode, + znode_lock_request request); + +extern void longterm_unlock_znode(lock_handle * handle); + +extern int check_deadlock(void); + +extern lock_stack *get_current_lock_stack(void); + +extern void init_lock_stack(lock_stack * owner); +extern void reiser4_init_lock(zlock * lock); + +extern void init_lh(lock_handle *); +extern void move_lh(lock_handle * new, lock_handle * old); +extern void copy_lh(lock_handle * new, lock_handle * old); +extern void done_lh(lock_handle *); + +extern int prepare_to_sleep(lock_stack * owner); +extern void go_to_sleep(lock_stack * owner); +extern void __reiser4_wake_up(lock_stack * owner); + +extern int lock_stack_isclean(lock_stack * owner); + +/* zlock object state check macros: only used in assertions. Both forms imply that the + lock is held by the current thread. 
*/ +extern int znode_is_write_locked(const znode * node); + +#if REISER4_DEBUG +#define spin_ordering_pred_stack_addendum (1) +#else +#define spin_ordering_pred_stack_addendum \ + ((lock_counters()->rw_locked_dk == 0) && \ + (lock_counters()->rw_locked_tree == 0)) +#endif +/* lock ordering is: first take zlock spin lock, then lock stack spin lock */ +#define spin_ordering_pred_stack(stack) \ + ((lock_counters()->spin_locked_stack == 0) && \ + (lock_counters()->spin_locked_txnmgr == 0) && \ + (lock_counters()->spin_locked_super == 0) && \ + (lock_counters()->spin_locked_inode_object == 0) && \ + (lock_counters()->rw_locked_cbk_cache == 0) && \ + (lock_counters()->spin_locked_epoch == 0) && \ + (lock_counters()->spin_locked_super_eflush == 0) && \ + spin_ordering_pred_stack_addendum) + +/* Same for lock_stack */ +SPIN_LOCK_FUNCTIONS(stack, lock_stack, sguard); + +static inline void +reiser4_wake_up(lock_stack * owner) +{ + spin_lock_stack(owner); + __reiser4_wake_up(owner); + spin_unlock_stack(owner); +} + +const char *lock_mode_name(znode_lock_mode lock); + +#if REISER4_DEBUG +extern void check_lock_data(void); +extern void check_lock_node_data(znode * node); +#else +#define check_lock_data() noop +#define check_lock_node_data() noop +#endif + +/* __LOCK_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/Makefile --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/Makefile Mon Jun 13 15:05:23 2005 @@ -0,0 +1,96 @@ +# +# reiser4/Makefile +# + +obj-$(CONFIG_REISER4_FS) += reiser4.o + +reiser4-y := \ + debug.o \ + jnode.o \ + znode.o \ + key.o \ + pool.o \ + tree_mod.o \ + estimate.o \ + carry.o \ + carry_ops.o \ + lock.o \ + tree.o \ + context.o \ + tap.o \ + coord.o \ + block_alloc.o \ + txnmgr.o \ + kassign.o \ + flush.o \ + wander.o \ + eottl.o \ + search.o \ + page_cache.o \ + kcond.o \ + seal.o \ + dscale.o \ + flush_queue.o \ + ktxnmgrd.o \ + blocknrset.o \ + super.o \ + oid.o \ + tree_walk.o \ + inode.o \ + vfs_ops.o \ + inode_ops.o \ + file_ops.o \ + as_ops.o \ + emergency_flush.o \ + entd.o\ + readahead.o \ + crypt.o \ + status_flags.o \ + init_super.o \ + safe_link.o \ + \ + plugin/plugin.o \ + plugin/plugin_set.o \ + plugin/node/node.o \ + plugin/object.o \ + plugin/symlink.o \ + plugin/cryptcompress.o \ + plugin/digest.o \ + plugin/node/node40.o \ + \ + plugin/compress/minilzo.o \ + plugin/compress/compress.o \ + \ + plugin/item/static_stat.o \ + plugin/item/sde.o \ + plugin/item/cde.o \ + plugin/item/blackbox.o \ + plugin/item/internal.o \ + plugin/item/tail.o \ + plugin/item/ctail.o \ + plugin/item/extent.o \ + plugin/item/extent_item_ops.o \ + plugin/item/extent_file_ops.o \ + plugin/item/extent_flush_ops.o \ + \ + plugin/hash.o \ + plugin/fibration.o \ + plugin/tail_policy.o \ + plugin/item/item.o \ + \ + plugin/dir/hashed_dir.o \ + plugin/dir/pseudo_dir.o \ + plugin/dir/dir.o \ + \ + plugin/security/perm.o \ + \ + plugin/pseudo/pseudo.o \ + \ + plugin/space/bitmap.o \ + \ + plugin/disk_format/disk_format40.o \ + plugin/disk_format/disk_format.o \ + \ + plugin/file/pseudo.o \ + plugin/file/file.o \ + plugin/file/tail_conversion.o diff -puN /dev/null fs/reiser4/oid.c --- /dev/null 
Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/oid.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,166 @@ +/* Copyright 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "debug.h" +#include "super.h" +#include "txnmgr.h" + +/* we used to have oid allocation plugin. It was removed because it + was recognized as providing unneeded level of abstraction. If one + ever will find it useful - look at yet_unneeded_abstractions/oid +*/ + +/* + * initialize in-memory data for oid allocator at @super. @nr_files and @next + * are provided by disk format plugin that reads them from the disk during + * mount. + */ +reiser4_internal int +oid_init_allocator(struct super_block *super, oid_t nr_files, oid_t next) +{ + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(super); + + sbinfo->next_to_use = next; + sbinfo->oids_in_use = nr_files; + return 0; +} + +/* + * allocate oid and return it. ABSOLUTE_MAX_OID is returned when allocator + * runs out of oids. + */ +reiser4_internal oid_t +oid_allocate(struct super_block *super) +{ + reiser4_super_info_data *sbinfo; + oid_t oid; + + sbinfo = get_super_private(super); + + reiser4_spin_lock_sb(sbinfo); + if (sbinfo->next_to_use != ABSOLUTE_MAX_OID) { + oid = sbinfo->next_to_use ++; + sbinfo->oids_in_use ++; + } else + oid = ABSOLUTE_MAX_OID; + reiser4_spin_unlock_sb(sbinfo); + return oid; +} + +/* + * Tell oid allocator that @oid is now free. + */ +reiser4_internal int +oid_release(struct super_block *super, oid_t oid UNUSED_ARG) +{ + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(super); + + reiser4_spin_lock_sb(sbinfo); + sbinfo->oids_in_use --; + reiser4_spin_unlock_sb(sbinfo); + return 0; +} + +/* + * return next @oid that would be allocated (i.e., returned by oid_allocate()) + * without actually allocating it. This is used by disk format plugin to save + * oid allocator state on the disk. 
+ */ +reiser4_internal oid_t oid_next(const struct super_block *super) +{ + reiser4_super_info_data *sbinfo; + oid_t oid; + + sbinfo = get_super_private(super); + + reiser4_spin_lock_sb(sbinfo); + oid = sbinfo->next_to_use; + reiser4_spin_unlock_sb(sbinfo); + return oid; +} + +/* + * returns number of currently used oids. This is used by statfs(2) to report + * number of "inodes" and by disk format plugin to save oid allocator state on + * the disk. + */ +reiser4_internal long oids_used(const struct super_block *super) +{ + reiser4_super_info_data *sbinfo; + oid_t used; + + sbinfo = get_super_private(super); + + reiser4_spin_lock_sb(sbinfo); + used = sbinfo->oids_in_use; + reiser4_spin_unlock_sb(sbinfo); + if (used < (__u64) ((long) ~0) >> 1) + return (long) used; + else + return (long) -1; +} + +/* + * return number of "free" oids. This is used by statfs(2) to report "free" + * inodes. + */ +reiser4_internal long oids_free(const struct super_block *super) +{ + reiser4_super_info_data *sbinfo; + oid_t oids; + + sbinfo = get_super_private(super); + + reiser4_spin_lock_sb(sbinfo); + oids = ABSOLUTE_MAX_OID - OIDS_RESERVED - sbinfo->next_to_use; + reiser4_spin_unlock_sb(sbinfo); + if (oids < (__u64) ((long) ~0) >> 1) + return (long) oids; + else + return (long) -1; +} + +/* + * Count oid as allocated in atom. This is done after call to oid_allocate() + * at the point when we are irrevocably committed to creation of the new file + * (i.e., when oid allocation cannot be any longer rolled back due to some + * error). + */ +reiser4_internal void +oid_count_allocated(void) +{ + txn_atom *atom; + + atom = get_current_atom_locked(); + atom->nr_objects_created++; + UNLOCK_ATOM(atom); +} + +/* + * Count oid as free in atom. This is done after call to oid_release() at the + * point when we are irrevocably committed to the deletion of the file (i.e., + * when oid release cannot be any longer rolled back due to some error). 
+ */ +reiser4_internal void +oid_count_released(void) +{ + txn_atom *atom; + + atom = get_current_atom_locked(); + atom->nr_objects_deleted++; + UNLOCK_ATOM(atom); +} + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/page_cache.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/page_cache.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,779 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Memory pressure hooks. Fake inodes handling. */ +/* We store all file system metadata (and data, of course) in the page cache. + + What does this mean? Instead of using bread/brelse we create a special + "fake" inode (one per super block) and store content of formatted nodes + into pages bound to this inode in the page cache. In newer kernels bread() + already uses inode attached to block device (bd_inode). Advantage of having + our own fake inode is that we can install appropriate methods in its + address_space operations. Such methods are called by VM on memory pressure + (or during background page flushing) and we can use them to react + appropriately. + + In initial version we only support one block per page. Support for multiple + blocks per page is complicated by relocation. + + To each page used by reiser4, a jnode is attached. jnode is analogous to + buffer head. Difference is that jnode is bound to the page permanently: + jnode cannot be removed from memory until its backing page is. + + jnode contains a pointer to page (->pg field) and page contains a pointer to + jnode in ->private field. Pointer from jnode to page is protected by + jnode's spinlock and pointer from page to jnode is protected by page lock + (PG_locked bit). Lock ordering is: first take page lock, then jnode spin + lock.
To go in the reverse direction, use jnode_lock_page(), which uses the + standard try-lock-and-release device. + + Properties: + + 1. when jnode-to-page mapping is established (by jnode_attach_page()), page + reference counter is increased. + + 2. when jnode-to-page mapping is destroyed (by jnode_detach_page() and + page_detach_jnode()), page reference counter is decreased. + + 3. on jload() reference counter on jnode page is increased, page is + kmapped and `referenced'. + + 4. on jrelse() inverse operations are performed. + + 5. kmapping/kunmapping of unformatted pages is done by read/write methods. + + + DEADLOCKS RELATED TO MEMORY PRESSURE. [OUTDATED. Only interesting + historically.] + + [In the following discussion, `lock' invariably means long term lock on + znode.] (What about page locks?) + + There is a special class of deadlock possibilities related to memory + pressure. Locks acquired by other reiser4 threads are accounted for in + deadlock prevention mechanism (lock.c), but when ->vm_writeback() is + invoked an additional hidden arc is added to the locking graph: thread that + tries to allocate memory waits for ->vm_writeback() to finish. If this + thread holds a lock and ->vm_writeback() tries to acquire this lock, deadlock + prevention is useless. + + Another related problem is the possibility for ->vm_writeback() to run out of + memory itself. This is not a problem for ext2 and friends, because their + ->vm_writeback() doesn't allocate much memory, but reiser4 flush is + definitely able to allocate huge amounts of memory. + + It seems that there is no reliable way to cope with the problems above. + Instead it was decided that ->vm_writeback() (as invoked in the kswapd + context) wouldn't perform any flushing itself, but rather should just wake + up some auxiliary thread dedicated to this purpose (or, the same thread + that does periodic commit of old atoms (ktxnmgrd.c)). + + Details: + + 1.
Page is called `reclaimable' against particular reiser4 mount F if this + page can be ultimately released by try_to_free_pages() under presumptions + that: + + a. ->vm_writeback() for F is no-op, and + + b. none of the threads accessing F are making any progress, and + + c. other reiser4 mounts obey the same memory reservation protocol as F + (described below). + + For example, clean un-pinned page, or page occupied by ext2 data are + reclaimable against any reiser4 mount. + + When there is more than one reiser4 mount in a system, condition (c) makes + reclaim-ability not easily verifiable beyond trivial cases mentioned above. + + + + + + + + + THIS COMMENT IS VALID FOR "MANY BLOCKS ON PAGE" CASE + + Fake inode is used to bind formatted nodes and each node is indexed within + fake inode by its block number. If block size is smaller than page size, it + may so happen that block mapped to the page with formatted node is occupied + by unformatted node or is unallocated. This leads to some complications, + because flushing whole page can lead to an incorrect overwrite of + unformatted node that, moreover, can be cached in some other place as + part of the file body. To avoid this, buffers for unformatted nodes are + never marked dirty. Also pages in the fake inode are never marked dirty. This + rules out usage of ->writepage() as memory pressure hook. Instead + ->releasepage() is used. + + Josh is concerned that page->buffer is going to die. This should not pose + significant problem though, because we need to add some data structures to + the page anyway (jnode) and all necessary bookkeeping can be put there. + +*/ + +/* Life cycle of pages/nodes. + + jnode contains reference to page and page contains reference back to + jnode. This reference is counted in page ->count. Thus, page bound to jnode + cannot be released back into free pool. + + 1. Formatted nodes. + + 1. formatted node is represented by znode. When new znode is created its + ->pg pointer is NULL initially.
+ + 2. when node content is loaded into znode (by call to zload()) for the + first time, the following happens (in call to ->read_node() or + ->allocate_node()): + + 1. new page is added to the page cache. + + 2. this page is attached to znode and its ->count is increased. + + 3. page is kmapped. + + 3. if more calls to zload() follow (without corresponding zrelses), page + counter is left intact and in its stead ->d_count is increased in znode. + + 4. each call to zrelse decreases ->d_count. When ->d_count drops to zero + ->release_node() is called and page is kunmapped as result. + + 5. at some moment node can be captured by a transaction. Its ->x_count + is then increased by transaction manager. + + 6. if node is removed from the tree (empty node with JNODE_HEARD_BANSHEE + bit set) the following will happen (also see comment at the top of znode.c): + + 1. when last lock is released, node will be uncaptured from + transaction. This releases the reference that transaction manager acquired + at step 5. + + 2. when last reference is released, zput() detects that node is + actually deleted and calls ->delete_node() + operation. page_cache_delete_node() implementation detaches jnode from + page and releases page. + + 7. otherwise (node wasn't removed from the tree), last reference to + znode will be released after transaction manager committed transaction + node was in. This implies squallocing of this node (see + flush.c). Nothing special happens at this point. Znode is still in the + hash table and page is still attached to it. + + 8. znode is actually removed from the memory because of the memory + pressure, or during umount (znodes_tree_done()). Either way, znode is + removed by the call to zdrop(). At this moment, page is detached from + znode and removed from the inode address space.
+ +*/ + +#include "debug.h" +#include "dformat.h" +#include "key.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree.h" +#include "vfs_ops.h" +#include "inode.h" +#include "super.h" +#include "entd.h" +#include "page_cache.h" +#include "ktxnmgrd.h" + +#include +#include +#include /* for struct page */ +#include /* for struct page */ +#include +#include +#include +#include + +static struct bio *page_bio(struct page *page, jnode * node, int rw, int gfp); + +static struct address_space_operations formatted_fake_as_ops; + +static const oid_t fake_ino = 0x1; +static const oid_t bitmap_ino = 0x2; +static const oid_t cc_ino = 0x3; + +/* one-time initialization of fake inodes handling functions. */ +reiser4_internal int +init_fakes() +{ + return 0; +} + +static void +init_fake_inode(struct super_block *super, struct inode *fake, struct inode **pfake) +{ + assert("nikita-2168", fake->i_state & I_NEW); + fake->i_mapping->a_ops = &formatted_fake_as_ops; + fake->i_blkbits = super->s_blocksize_bits; + fake->i_size = ~0ull; + fake->i_rdev = super->s_bdev->bd_dev; + fake->i_bdev = super->s_bdev; + *pfake = fake; + /* NOTE-NIKITA something else? */ + unlock_new_inode(fake); +} + +/* initialize fake inode to which formatted nodes are bound in the page cache. 
*/ +reiser4_internal int +init_formatted_fake(struct super_block *super) +{ + struct inode *fake; + struct inode *bitmap; + struct inode *cc; + reiser4_super_info_data *sinfo; + + assert("nikita-1703", super != NULL); + + sinfo = get_super_private_nocheck(super); + fake = iget_locked(super, oid_to_ino(fake_ino)); + + if (fake != NULL) { + init_fake_inode(super, fake, &sinfo->fake); + + bitmap = iget_locked(super, oid_to_ino(bitmap_ino)); + if (bitmap != NULL) { + init_fake_inode(super, bitmap, &sinfo->bitmap); + + cc = iget_locked(super, oid_to_ino(cc_ino)); + if (cc != NULL) { + init_fake_inode(super, cc, &sinfo->cc); + return 0; + } else { + iput(sinfo->fake); + iput(sinfo->bitmap); + sinfo->fake = NULL; + sinfo->bitmap = NULL; + } + } else { + iput(sinfo->fake); + sinfo->fake = NULL; + } + } + return RETERR(-ENOMEM); +} + +/* release fake inode for @super */ +reiser4_internal int +done_formatted_fake(struct super_block *super) +{ + reiser4_super_info_data *sinfo; + + sinfo = get_super_private_nocheck(super); + + if (sinfo->fake != NULL) { + assert("vs-1426", sinfo->fake->i_data.nrpages == 0); + iput(sinfo->fake); + sinfo->fake = NULL; + } + + if (sinfo->bitmap != NULL) { + iput(sinfo->bitmap); + sinfo->bitmap = NULL; + } + + if (sinfo->cc != NULL) { + iput(sinfo->cc); + sinfo->cc = NULL; + } + return 0; +} + +reiser4_internal void reiser4_wait_page_writeback (struct page * page) +{ + assert ("zam-783", PageLocked(page)); + + do { + unlock_page(page); + wait_on_page_writeback(page); + lock_page(page); + } while (PageWriteback(page)); +} + +/* return tree @page is in */ +reiser4_internal reiser4_tree * +tree_by_page(const struct page *page /* page to query */ ) +{ + assert("nikita-2461", page != NULL); + return &get_super_private(page->mapping->host->i_sb)->tree; +} + +/* completion handler for single page bio-based read. + + mpage_end_io_read() would also do. But it's static. 
+ +*/ +static int +end_bio_single_page_read(struct bio *bio, unsigned int bytes_done UNUSED_ARG, int err UNUSED_ARG) +{ + struct page *page; + + if (bio->bi_size != 0) { + warning("nikita-3332", "Truncated single page read: %i", + bio->bi_size); + return 1; + } + + page = bio->bi_io_vec[0].bv_page; + + if (test_bit(BIO_UPTODATE, &bio->bi_flags)) + SetPageUptodate(page); + else { + ClearPageUptodate(page); + SetPageError(page); + } + unlock_page(page); + bio_put(bio); + return 0; +} + +/* completion handler for single page bio-based write. + + mpage_end_io_write() would also do. But it's static. + +*/ +static int +end_bio_single_page_write(struct bio *bio, unsigned int bytes_done UNUSED_ARG, int err UNUSED_ARG) +{ + struct page *page; + + if (bio->bi_size != 0) { + warning("nikita-3333", "Truncated single page write: %i", + bio->bi_size); + return 1; + } + + page = bio->bi_io_vec[0].bv_page; + + if (!test_bit(BIO_UPTODATE, &bio->bi_flags)) + SetPageError(page); + end_page_writeback(page); + bio_put(bio); + return 0; +} + +/* ->readpage() method for formatted nodes */ +static int +formatted_readpage(struct file *f UNUSED_ARG, struct page *page /* page to read */ ) +{ + assert("nikita-2412", PagePrivate(page) && jprivate(page)); + return page_io(page, jprivate(page), READ, GFP_KERNEL); +} + +/* submit single-page bio request */ +reiser4_internal int +page_io(struct page *page /* page to perform io for */ , + jnode * node /* jnode of page */ , + int rw /* read or write */ , int gfp /* GFP mask */ ) +{ + struct bio *bio; + int result; + + assert("nikita-2094", page != NULL); + assert("nikita-2226", PageLocked(page)); + assert("nikita-2634", node != NULL); + assert("nikita-2893", rw == READ || rw == WRITE); + + if (rw) { + if (unlikely(page->mapping->host->i_sb->s_flags & MS_RDONLY)) { + unlock_page(page); + return 0; + } + } + + bio = page_bio(page, node, rw, gfp); + if (!IS_ERR(bio)) { + if (rw == WRITE) { + SetPageWriteback(page); + unlock_page(page); + } + 
reiser4_submit_bio(rw, bio); + result = 0; + } else { + unlock_page(page); + result = PTR_ERR(bio); + } + + return result; +} + +/* helper function to construct bio for page */ +static struct bio * +page_bio(struct page *page, jnode * node, int rw, int gfp) +{ + struct bio *bio; + assert("nikita-2092", page != NULL); + assert("nikita-2633", node != NULL); + + /* Simple implementation in the assumption that blocksize == pagesize. + + We only have to submit one block, but submit_bh() will allocate bio + anyway, so lets use all the bells-and-whistles of bio code. + */ + + bio = bio_alloc(gfp, 1); + if (bio != NULL) { + int blksz; + struct super_block *super; + reiser4_block_nr blocknr; + + super = page->mapping->host->i_sb; + assert("nikita-2029", super != NULL); + blksz = super->s_blocksize; + assert("nikita-2028", blksz == (int) PAGE_CACHE_SIZE); + + blocknr = *UNDER_SPIN(jnode, node, jnode_get_io_block(node)); + + assert("nikita-2275", blocknr != (reiser4_block_nr) 0); + assert("nikita-2276", !blocknr_is_fake(&blocknr)); + + bio->bi_bdev = super->s_bdev; + /* fill bio->bi_sector before calling bio_add_page(), because + * q->merge_bvec_fn may want to inspect it (see + * drivers/md/linear.c:linear_mergeable_bvec() for example. */ + bio->bi_sector = blocknr * (blksz >> 9); + + if (!bio_add_page(bio, page, blksz, 0)) { + warning("nikita-3452", + "Single page bio cannot be constructed"); + return ERR_PTR(RETERR(-EINVAL)); + } + + /* bio -> bi_idx is filled by bio_init() */ + bio->bi_end_io = (rw == READ) ? 
+ end_bio_single_page_read : end_bio_single_page_write; + + return bio; + } else + return ERR_PTR(RETERR(-ENOMEM)); +} + + +/* this function is internally called by jnode_make_dirty() */ +int set_page_dirty_internal (struct page * page, int tag_as_moved) +{ + struct address_space *mapping; + + mapping = page->mapping; + BUG_ON(mapping == NULL); + + if (!TestSetPageDirty(page)) { + if (mapping_cap_account_dirty(mapping)) + inc_page_state(nr_dirty); + + write_lock_irq(&mapping->tree_lock); + BUG_ON(page->mapping != mapping); + if (tag_as_moved) { + /* write_page_by_ent wants to set this bit on. FIXME: + * MOVED bit must be set already */ + assert("vs-1731", REISER4_USE_ENTD); + radix_tree_tag_set( + &mapping->page_tree, page->index, + PAGECACHE_TAG_REISER4_MOVED); + } else { + radix_tree_tag_clear( + &mapping->page_tree, page->index, + PAGECACHE_TAG_REISER4_MOVED); + } + write_unlock_irq(&mapping->tree_lock); + __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + } + return 0; +} + + +/* Common memory pressure notification. */ +reiser4_internal int +reiser4_writepage(struct page *page /* page to start writeback from */, + struct writeback_control *wbc) +{ + struct super_block *s; + reiser4_context ctx; + reiser4_tree *tree; + txn_atom * atom; + jnode *node; + int result; + + s = page->mapping->host->i_sb; + init_context(&ctx, s); + + assert("vs-828", PageLocked(page)); + +#if REISER4_USE_ENTD + + /* Throttle memory allocations if we were not in reiser4 */ + if (ctx.parent == &ctx) { + write_page_by_ent(page, wbc); + result = 1; + goto out; + } +#endif /* REISER4_USE_ENTD */ + + tree = &get_super_private(s)->tree; + node = jnode_of_page(page); + if (!IS_ERR(node)) { + int phantom; + + assert("nikita-2419", node != NULL); + + LOCK_JNODE(node); + /* + * page was dirty, but jnode is not. This is (only?) + * possible if page was modified through mmap(). We + * want to handle such jnodes specially. 
+ */ + phantom = !jnode_is_dirty(node); + atom = jnode_get_atom(node); + if (atom != NULL) { + if (!(atom->flags & ATOM_FORCE_COMMIT)) { + atom->flags |= ATOM_FORCE_COMMIT; + ktxnmgrd_kick(&get_super_private(s)->tmgr); + } + UNLOCK_ATOM(atom); + } + UNLOCK_JNODE(node); + + result = emergency_flush(page); + if (result == 0) + if (phantom && jnode_is_unformatted(node)) + JF_SET(node, JNODE_KEEPME); + jput(node); + } else { + result = PTR_ERR(node); + } + if (result != 0) { + /* + * shrink list doesn't move page to another mapping + * list when clearing dirty flag. So it is enough to + * just set dirty bit. + */ + set_page_dirty_internal(page, 0); + unlock_page(page); + } + out: + reiser4_exit_context(&ctx); + return result; +} + +/* ->set_page_dirty() method of formatted address_space */ +static int +formatted_set_page_dirty(struct page *page /* page to mark + * dirty */ ) +{ + assert("nikita-2173", page != NULL); + return __set_page_dirty_nobuffers(page); +} + +/* address space operations for the fake inode */ +static struct address_space_operations formatted_fake_as_ops = { + /* Perform a writeback of a single page as a memory-freeing + * operation. */ + .writepage = reiser4_writepage, + /* this is called to read formatted node */ + .readpage = formatted_readpage, + /* ->sync_page() method of fake inode address space operations. Called + from wait_on_page() and lock_page(). + + This is most annoyingly misnomered method. Actually it is called + from wait_on_page_bit() and lock_page() and its purpose is to + actually start io by jabbing device drivers. + */ + .sync_page = reiser4_start_up_io, + /* Write back some dirty pages from this mapping. Called from sync. + called during sync (pdflush) */ + .writepages = reiser4_writepages, + /* Set a page dirty */ + .set_page_dirty = formatted_set_page_dirty, + /* used for read-ahead. 
Not applicable */ + .readpages = NULL, + .prepare_write = NULL, + .commit_write = NULL, + .bmap = NULL, + /* called just before page is being detached from inode mapping and + removed from memory. Called on truncate, cut/squeeze, and + umount. */ + .invalidatepage = reiser4_invalidatepage, + /* this is called by shrink_cache() so that file system can try to + release objects (jnodes, buffers, journal heads) attached to page + and, may be made page itself free-able. + */ + .releasepage = reiser4_releasepage, + .direct_IO = NULL +}; + +/* called just before page is released (no longer used by reiser4). Callers: + jdelete() and extent2tail(). */ +reiser4_internal void +drop_page(struct page *page) +{ + assert("nikita-2181", PageLocked(page)); + clear_page_dirty(page); + ClearPageUptodate(page); +#if defined(PG_skipped) + ClearPageSkipped(page); +#endif + if (page->mapping != NULL) { + remove_from_page_cache(page); + unlock_page(page); + page_cache_release(page); + } else + unlock_page(page); +} + + +/* this is called by truncate_jnodes_range which in its turn is always called + after truncate_mapping_pages_range. Therefore, here jnode can not have + page. 
New pages can not be created because truncate_jnodes_range goes under + exclusive access on file obtained, where as new page creation requires + non-exclusive access obtained */ +static void +invalidate_unformatted(jnode *node) +{ + struct page *page; + + LOCK_JNODE(node); + page = node->pg; + if (page) { + loff_t from, to; + + page_cache_get(page); + UNLOCK_JNODE(node); + /* FIXME: use truncate_complete_page instead */ + from = (loff_t)page->index << PAGE_CACHE_SHIFT; + to = from + PAGE_CACHE_SIZE - 1; + truncate_inode_pages_range(page->mapping, from, to); + page_cache_release(page); + } else { + JF_SET(node, JNODE_HEARD_BANSHEE); + uncapture_jnode(node); + unhash_unformatted_jnode(node); + } +} + +#define JNODE_GANG_SIZE (16) + +/* find all eflushed jnodes from range specified and invalidate them */ +static int +truncate_jnodes_range(struct inode *inode, pgoff_t from, pgoff_t count) +{ + reiser4_inode *info; + int truncated_jnodes; + reiser4_tree *tree; + unsigned long index; + unsigned long end; + + truncated_jnodes = 0; + + info = reiser4_inode_data(inode); + tree = tree_by_inode(inode); + + index = from; + end = from + count; + + while (1) { + jnode *gang[JNODE_GANG_SIZE]; + int taken; + int i; + jnode *node; + + assert("nikita-3466", index <= end); + + RLOCK_TREE(tree); + taken = radix_tree_gang_lookup(jnode_tree_by_reiser4_inode(info), (void **)gang, + index, JNODE_GANG_SIZE); + for (i = 0; i < taken; ++i) { + node = gang[i]; + if (index_jnode(node) < end) + jref(node); + else + gang[i] = NULL; + } + RUNLOCK_TREE(tree); + + for (i = 0; i < taken; ++i) { + node = gang[i]; + if (node != NULL) { + index = max(index, index_jnode(node)); + invalidate_unformatted(node); + truncated_jnodes ++; + jput(node); + } else + break; + } + if (i != taken || taken == 0) + break; + } + return truncated_jnodes; +} + +reiser4_internal void +reiser4_invalidate_pages(struct address_space *mapping, pgoff_t from, + unsigned long count, int even_cows) +{ + loff_t from_bytes, 
count_bytes; + + if (count == 0) + return; + from_bytes = ((loff_t)from) << PAGE_CACHE_SHIFT; + count_bytes = ((loff_t)count) << PAGE_CACHE_SHIFT; + + unmap_mapping_range(mapping, from_bytes, count_bytes, even_cows); + truncate_inode_pages_range(mapping, from_bytes, from_bytes + count_bytes - 1); + truncate_jnodes_range(mapping->host, from, count); +} + + +#if REISER4_DEBUG + +#define page_flag_name( page, flag ) \ + ( test_bit( ( flag ), &( page ) -> flags ) ? ((#flag "|")+3) : "" ) + +reiser4_internal void +print_page(const char *prefix, struct page *page) +{ + if (page == NULL) { + printk("null page\n"); + return; + } + printk("%s: page index: %lu mapping: %p count: %i private: %lx\n", + prefix, page->index, page->mapping, page_count(page), page->private); + printk("\tflags: %s%s%s%s %s%s%s %s%s%s %s%s%s\n", + page_flag_name(page, PG_locked), + page_flag_name(page, PG_error), + page_flag_name(page, PG_referenced), + page_flag_name(page, PG_uptodate), + + page_flag_name(page, PG_dirty), + page_flag_name(page, PG_lru), + page_flag_name(page, PG_slab), + + page_flag_name(page, PG_highmem), + page_flag_name(page, PG_checked), + page_flag_name(page, PG_reserved), + + page_flag_name(page, PG_private), page_flag_name(page, PG_writeback), page_flag_name(page, PG_nosave)); + if (jprivate(page) != NULL) { + print_jnode("\tpage jnode", jprivate(page)); + printk("\n"); + } +} + +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/page_cache.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/page_cache.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,62 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ +/* Memory pressure hooks. Fake inodes handling. See page_cache.c. 
*/ + +#if !defined( __REISER4_PAGE_CACHE_H__ ) +#define __REISER4_PAGE_CACHE_H__ + +#include "forward.h" +#include "debug.h" + +#include /* for struct super_block, address_space */ +#include /* for struct page */ +#include /* for lock_page() */ + +extern int init_fakes(void); +extern int init_formatted_fake(struct super_block *super); +extern int done_formatted_fake(struct super_block *super); + +extern reiser4_tree *tree_by_page(const struct page *page); + +extern int set_page_dirty_internal (struct page * page, int tag_as_moved); + +#define reiser4_submit_bio(rw, bio) submit_bio((rw), (bio)) + +extern void reiser4_wait_page_writeback (struct page * page); +static inline void lock_and_wait_page_writeback (struct page * page) +{ + lock_page(page); + if (unlikely(PageWriteback(page))) + reiser4_wait_page_writeback(page); +} + +#define jprivate(page) ((jnode *) (page)->private) + +extern int page_io(struct page *page, jnode * node, int rw, int gfp); +extern int reiser4_writepage(struct page *page, struct writeback_control *wbc); +extern void drop_page(struct page *page); +extern void reiser4_invalidate_pages(struct address_space *, pgoff_t from, + unsigned long count, int even_cows); +extern void capture_reiser4_inodes (struct super_block *, struct writeback_control *); + +#define PAGECACHE_TAG_REISER4_MOVED PAGECACHE_TAG_DIRTY + +#if REISER4_DEBUG +extern void print_page(const char *prefix, struct page *page); +#else +#define print_page(prf, p) noop +#endif + +/* __REISER4_PAGE_CACHE_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/compress/compress.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/compress/compress.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,429 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ +/* reiser4 compression transform plugins */ + +#include "../../debug.h" +#include "../plugin.h" +#include "../cryptcompress.h" +#include "minilzo.h" + +#include +#include +#include +#include + +/******************************************************************************/ +/* null compression */ +/******************************************************************************/ + +#define NONE_NRCOPY 1 + +static int +null_min_tfm_size(void) +{ + return 1; +} + +static void +null_compress(coa_t coa, __u8 * src_first, unsigned src_len, + __u8 * dst_first, unsigned *dst_len) +{ + int i; + assert("edward-793", coa == NULL); + assert("edward-794", src_first != NULL); + assert("edward-795", dst_first != NULL); + assert("edward-796", src_len != 0); + assert("edward-797", dst_len != NULL); + + for (i = 0; i < NONE_NRCOPY; i++) + memcpy(dst_first, src_first, src_len); + *dst_len = src_len; + return; +} + +static void +null_decompress(coa_t coa, __u8 * src_first, unsigned src_len, + __u8 * dst_first, unsigned *dst_len) +{ + impossible("edward-798", "trying to decompress uncompressed data"); +} + +/******************************************************************************/ +/* gzip1 compression */ +/******************************************************************************/ + +#define GZIP1_DEF_LEVEL Z_BEST_SPEED +#define GZIP1_DEF_WINBITS 15 +#define GZIP1_DEF_MEMLEVEL MAX_MEM_LEVEL + +static int gzip1_overrun(unsigned src_len UNUSED_ARG) +{ + return 0; +} + +static coa_t +gzip1_alloc(tfm_action act) +{ + coa_t coa = NULL; + int ret = -ENXIO; +#if 
REISER4_GZIP_TFM + ret = 0; + switch (act) { + case TFM_WRITE: /* compress */ + coa = vmalloc(zlib_deflate_workspacesize()); + if (!coa) { + ret = -ENOMEM; + break; + } + xmemset(coa, 0, zlib_deflate_workspacesize()); + break; + case TFM_READ: /* decompress */ + coa = vmalloc(zlib_inflate_workspacesize()); + if (!coa) { + ret = -ENOMEM; + break; + } + xmemset(coa, 0, zlib_inflate_workspacesize()); + break; + default: + impossible("edward-767", + "trying to alloc workspace for unknown tfm action"); + } +#endif + if (ret) { + warning("edward-768", + "alloc workspace for gzip1 (tfm action = %d) failed\n", + act); + return ERR_PTR(ret); + } + return coa; +} + +static void gzip1_free(coa_t coa, tfm_action act) +{ +#if REISER4_GZIP_TFM + assert("edward-769", coa != NULL); + + switch (act) { + case TFM_WRITE: /* compress */ + vfree(coa); + break; + case TFM_READ: /* decompress */ + vfree(coa); + break; + default: + impossible("edward-770", + "free workspace for unknown tfm action"); + } +#endif + return; +} + +static int +gzip1_min_tfm_size(void) +{ + return 64; +} + +static void +gzip1_compress(coa_t coa, __u8 * src_first, unsigned src_len, + __u8 * dst_first, unsigned *dst_len) +{ +#if REISER4_GZIP_TFM + int ret = 0; + struct z_stream_s stream; + + xmemset(&stream, 0, sizeof(stream)); + + assert("edward-842", coa != NULL); + assert("edward-875", src_len != 0); + + stream.workspace = coa; + ret = zlib_deflateInit2(&stream, GZIP1_DEF_LEVEL, Z_DEFLATED, + -GZIP1_DEF_WINBITS, GZIP1_DEF_MEMLEVEL, + Z_DEFAULT_STRATEGY); + if (ret != Z_OK) { + warning("edward-771", "zlib_deflateInit2 returned %d\n", ret); + goto rollback; + } + ret = zlib_deflateReset(&stream); + if (ret != Z_OK) { + warning("edward-772", "zlib_deflateReset returned %d\n", ret); + goto rollback; + } + stream.next_in = src_first; + stream.avail_in = src_len; + stream.next_out = dst_first; + stream.avail_out = *dst_len; + + ret = zlib_deflate(&stream, Z_FINISH); + if (ret != Z_STREAM_END) { + 
warning("edward-773", "zlib_deflate returned %d\n", ret); + goto rollback; + } + *dst_len = stream.total_out; + return; + rollback: + *dst_len = src_len; +#endif + return; +} + +static void +gzip1_decompress(coa_t coa, __u8 * src_first, unsigned src_len, + __u8 * dst_first, unsigned *dst_len) +{ +#if REISER4_GZIP_TFM + int ret = 0; + struct z_stream_s stream; + + xmemset(&stream, 0, sizeof(stream)); + + assert("edward-843", coa != NULL); + assert("edward-876", src_len != 0); + + stream.workspace = coa; + ret = zlib_inflateInit2(&stream, -GZIP1_DEF_WINBITS); + if (ret != Z_OK) { + warning("edward-774", "zlib_inflateInit2 returned %d\n", ret); + return; + } + ret = zlib_inflateReset(&stream); + if (ret != Z_OK) { + warning("edward-775", "zlib_inflateReset returned %d\n", ret); + return; + } + + stream.next_in = src_first; + stream.avail_in = src_len; + stream.next_out = dst_first; + stream.avail_out = *dst_len; + + ret = zlib_inflate(&stream, Z_SYNC_FLUSH); + /* + * Work around a bug in zlib, which sometimes wants to taste an extra + * byte when being used in the (undocumented) raw deflate mode. + * (From USAGI). 
+ */ + if (ret == Z_OK && !stream.avail_in && stream.avail_out) { + u8 zerostuff = 0; + stream.next_in = &zerostuff; + stream.avail_in = 1; + ret = zlib_inflate(&stream, Z_FINISH); + } + if (ret != Z_STREAM_END) { + warning("edward-776", "zlib_inflate returned %d\n", ret); + return; + } + *dst_len = stream.total_out; +#endif + return; +} + +/******************************************************************************/ +/* none compression */ +/******************************************************************************/ + +static int none_overrun(unsigned src_len UNUSED_ARG) +{ + return 0; +} + +/******************************************************************************/ +/* lzo1 compression */ +/******************************************************************************/ + +static int lzo1_overrun(unsigned in_len) +{ + return in_len / 64 + 16 + 3; +} + +#define LZO_HEAP_SIZE(size) \ + sizeof(lzo_align_t) * (((size) + (sizeof(lzo_align_t) - 1)) / sizeof(lzo_align_t)) + +static coa_t +lzo1_alloc(tfm_action act) +{ + int ret = 0; + coa_t coa = NULL; + + switch (act) { + case TFM_WRITE: /* compress */ + coa = vmalloc(LZO_HEAP_SIZE(LZO1X_1_MEM_COMPRESS)); + if (!coa) { + ret = -ENOMEM; + break; + } + memset(coa, 0, LZO_HEAP_SIZE(LZO1X_1_MEM_COMPRESS)); + case TFM_READ: /* decompress */ + break; + default: + impossible("edward-877", + "trying to alloc workspace for unknown tfm action"); + } + if (ret) { + warning("edward-878", + "alloc workspace for lzo1 (tfm action = %d) failed\n", + act); + return ERR_PTR(ret); + } + return coa; +} + +static void +lzo1_free(coa_t coa, tfm_action act) +{ + assert("edward-879", coa != NULL); + + switch (act) { + case TFM_WRITE: /* compress */ + vfree(coa); + case TFM_READ: /* decompress */ + break; + default: + impossible("edward-880", + "trying to free workspace for unknown tfm action"); + } + return; +} + +static int +lzo1_min_tfm_size(void) +{ + return 256; +} + +static void +lzo1_compress(coa_t coa, __u8 * src_first, 
unsigned src_len, + __u8 * dst_first, unsigned *dst_len) +{ + int result; + + assert("edward-846", coa != NULL); + assert("edward-847", src_len != 0); + + result = lzo_init(); + + if (result != LZO_E_OK) { + warning("edward-848", "lzo_init() failed\n"); + goto out; + } + + result = + lzo1x_1_compress(src_first, src_len, dst_first, dst_len, coa); + if (result != LZO_E_OK) { + warning("edward-849", "lzo1x_1_compress failed\n"); + goto out; + } + if (*dst_len >= src_len) { + //warning("edward-850", "lzo1x_1_compress: incompressible data\n"); + goto out; + } + return; + out: + *dst_len = src_len; + return; +} + +static void +lzo1_decompress(coa_t coa, __u8 * src_first, unsigned src_len, + __u8 * dst_first, unsigned *dst_len) +{ + int result; + + assert("edward-851", coa == NULL); + assert("edward-852", src_len != 0); + + result = lzo_init(); + + if (result != LZO_E_OK) { + warning("edward-888", "lzo_init() failed\n"); + return; + } + + result = lzo1x_decompress(src_first, src_len, dst_first, dst_len, NULL); + if (result != LZO_E_OK) + warning("edward-853", "lzo1x_decompress failed\n"); + return; +} + +compression_plugin compression_plugins[LAST_COMPRESSION_ID] = { + [NONE_COMPRESSION_ID] = { + .h = { + .type_id = + REISER4_COMPRESSION_PLUGIN_TYPE, + .id = NONE_COMPRESSION_ID, + .pops = NULL, + .label = "none", + .desc = + "absence of any compression transform", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .overrun = none_overrun, + .alloc = NULL, + .free = NULL, + .min_tfm_size = NULL, + .compress = NULL, + .decompress = NULL} + , + [NULL_COMPRESSION_ID] = { + .h = { + .type_id = + REISER4_COMPRESSION_PLUGIN_TYPE, + .id = NULL_COMPRESSION_ID, + .pops = NULL, + .label = "null", + .desc = "NONE_NRCOPY times of memcpy", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .overrun = none_overrun, + .alloc = NULL, + .free = NULL, + .min_tfm_size = null_min_tfm_size, + .compress = null_compress, + .decompress = null_decompress} + , + [LZO1_COMPRESSION_ID] = { + .h = { + .type_id
= + REISER4_COMPRESSION_PLUGIN_TYPE, + .id = LZO1_COMPRESSION_ID, + .pops = NULL, + .label = "lzo1", + .desc = "lzo1 compression transform", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .overrun = lzo1_overrun, + .alloc = lzo1_alloc, + .free = lzo1_free, + .min_tfm_size = lzo1_min_tfm_size, + .compress = lzo1_compress, + .decompress = lzo1_decompress} + , + [GZIP1_COMPRESSION_ID] = { + .h = { + .type_id = + REISER4_COMPRESSION_PLUGIN_TYPE, + .id = GZIP1_COMPRESSION_ID, + .pops = NULL, + .label = "gzip1", + .desc = "gzip1 compression transform", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .overrun = gzip1_overrun, + .alloc = gzip1_alloc, + .free = gzip1_free, + .min_tfm_size = gzip1_min_tfm_size, + .compress = gzip1_compress, + .decompress = gzip1_decompress} +}; + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/compress/compress.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/compress/compress.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,36 @@ +#if !defined( __FS_REISER4_COMPRESS_H__ ) +#define __FS_REISER4_COMPRESS_H__ + +#include +#include + +typedef enum { + TFM_READ, + TFM_WRITE +} tfm_action; + +/* builtin compression plugins */ + +typedef enum { + NONE_COMPRESSION_ID, + NULL_COMPRESSION_ID, + LZO1_COMPRESSION_ID, + GZIP1_COMPRESSION_ID, + LAST_COMPRESSION_ID, +} reiser4_compression_id; + +typedef void * coa_t; +typedef coa_t coa_set[LAST_COMPRESSION_ID]; + +#endif /* __FS_REISER4_COMPRESS_H__ */ + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/compress/lzoconf.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/compress/lzoconf.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,433 @@ +/* lzoconf.h -- configuration for the LZO real-time data compression library + adopted for reiser4 compression transform plugin + + This file is part of the LZO real-time data compression library. + + Copyright (C) 2002 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 2001 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 2000 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 1999 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 1998 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 1997 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 1996 Markus Franz Xaver Johannes Oberhumer + All Rights Reserved. + + The LZO library is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License as + published by the Free Software Foundation; either version 2 of + the License, or (at your option) any later version. + + The LZO library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with the LZO library; see the file COPYING. + If not, write to the Free Software Foundation, Inc., + 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + + Markus F.X.J.
Oberhumer + + http://www.oberhumer.com/opensource/lzo/ + */ + +#include /* for UINT_MAX, ULONG_MAX - edward */ + +#ifndef __LZOCONF_H +#define __LZOCONF_H + +#define LZO_VERSION 0x1080 +#define LZO_VERSION_STRING "1.08" +#define LZO_VERSION_DATE "Jul 12 2002" + +/* internal Autoconf configuration file - only used when building LZO */ +#if defined(LZO_HAVE_CONFIG_H) +# include +#endif +#ifdef __cplusplus +extern "C" { +#endif + + +/*********************************************************************** +// LZO requires a conforming +************************************************************************/ + +#define CHAR_BIT 8 /* -edward */ +#define USHRT_MAX 0xffff /* -edward */ + +#if 0 /* -edward */ +#if !defined(CHAR_BIT) || (CHAR_BIT != 8) +# error "invalid CHAR_BIT" +#endif +#if !defined(UCHAR_MAX) || !defined(UINT_MAX) || !defined(ULONG_MAX) +# error "check your compiler installation" +#endif +#if (USHRT_MAX < 1) || (UINT_MAX < 1) || (ULONG_MAX < 1) +# error "your limits.h macros are broken" +#endif +#endif /* -edward */ +/* workaround a cpp bug under hpux 10.20 */ +#define LZO_0xffffffffL 4294967295ul + +#if 0 /* -edward */ +#if !defined(LZO_UINT32_C) +# if (UINT_MAX < LZO_0xffffffffL) +# define LZO_UINT32_C(c) c ## UL +# else +# define LZO_UINT32_C(c) c ## U +# endif +#endif +#endif /* -edward */ + +/*********************************************************************** +// architecture defines +************************************************************************/ + +#if !defined(__LZO_WIN) && !defined(__LZO_DOS) && !defined(__LZO_OS2) +# if defined(__WINDOWS__) || defined(_WINDOWS) || defined(_Windows) +# define __LZO_WIN +# elif defined(__WIN32__) || defined(_WIN32) || defined(WIN32) +# define __LZO_WIN +# elif defined(__NT__) || defined(__NT_DLL__) || defined(__WINDOWS_386__) +# define __LZO_WIN +# elif defined(__DOS__) || defined(__MSDOS__) || defined(MSDOS) +# define __LZO_DOS +# elif defined(__OS2__) || defined(__OS2V2__) || defined(OS2) +# 
define __LZO_OS2 +# elif defined(__palmos__) +# define __LZO_PALMOS +# elif defined(__TOS__) || defined(__atarist__) +# define __LZO_TOS +# endif +#endif + +#if (UINT_MAX < LZO_0xffffffffL) +# if defined(__LZO_WIN) +# define __LZO_WIN16 +# elif defined(__LZO_DOS) +# define __LZO_DOS16 +# elif defined(__LZO_PALMOS) +# define __LZO_PALMOS16 +# elif defined(__LZO_TOS) +# define __LZO_TOS16 +# elif defined(__C166__) +# else + /* porting hint: for pure 16-bit architectures try compiling + * everything with -D__LZO_STRICT_16BIT */ +# error "16-bit target not supported - contact me for porting hints" +# endif +#endif + +#if !defined(__LZO_i386) +# if defined(__LZO_DOS) || defined(__LZO_WIN16) +# define __LZO_i386 +# elif defined(__i386__) || defined(__386__) || defined(_M_IX86) +# define __LZO_i386 +# endif +#endif + +#if defined(__LZO_STRICT_16BIT) +# if (UINT_MAX < LZO_0xffffffffL) +# include +# endif +#endif + +/* memory checkers */ +#if !defined(__LZO_CHECKER) +# if defined(__BOUNDS_CHECKING_ON) +# define __LZO_CHECKER +# elif defined(__CHECKER__) +# define __LZO_CHECKER +# elif defined(__INSURE__) +# define __LZO_CHECKER +# elif defined(__PURIFY__) +# define __LZO_CHECKER +# endif +#endif + + +/*********************************************************************** +// integral and pointer types +************************************************************************/ + +/* Integral types with 32 bits or more */ +#if !defined(LZO_UINT32_MAX) +# if (UINT_MAX >= LZO_0xffffffffL) + typedef unsigned int lzo_uint32; + typedef int lzo_int32; +# define LZO_UINT32_MAX UINT_MAX +# define LZO_INT32_MAX INT_MAX +# define LZO_INT32_MIN INT_MIN +# elif (ULONG_MAX >= LZO_0xffffffffL) + typedef unsigned long lzo_uint32; + typedef long lzo_int32; +# define LZO_UINT32_MAX ULONG_MAX +# define LZO_INT32_MAX LONG_MAX +# define LZO_INT32_MIN LONG_MIN +# else +# error "lzo_uint32" +# endif +#endif + +/* lzo_uint is used like size_t */ +#if !defined(LZO_UINT_MAX) +# if (UINT_MAX >= 
LZO_0xffffffffL) + typedef unsigned int lzo_uint; + typedef int lzo_int; +# define LZO_UINT_MAX UINT_MAX +# define LZO_INT_MAX INT_MAX +# define LZO_INT_MIN INT_MIN +# elif (ULONG_MAX >= LZO_0xffffffffL) + typedef unsigned long lzo_uint; + typedef long lzo_int; +# define LZO_UINT_MAX ULONG_MAX +# define LZO_INT_MAX LONG_MAX +# define LZO_INT_MIN LONG_MIN +# else +# error "lzo_uint" +# endif +#endif + +typedef int lzo_bool; + + +/*********************************************************************** +// memory models +************************************************************************/ + +/* Memory model for the public code segment. */ +#if !defined(__LZO_CMODEL) +# if defined(__LZO_DOS16) || defined(__LZO_WIN16) +# define __LZO_CMODEL __far +# elif defined(__LZO_i386) && defined(__WATCOMC__) +# define __LZO_CMODEL __near +# else +# define __LZO_CMODEL +# endif +#endif + +/* Memory model for the public data segment. */ +#if !defined(__LZO_DMODEL) +# if defined(__LZO_DOS16) || defined(__LZO_WIN16) +# define __LZO_DMODEL __far +# elif defined(__LZO_i386) && defined(__WATCOMC__) +# define __LZO_DMODEL __near +# else +# define __LZO_DMODEL +# endif +#endif + +/* Memory model that allows to access memory at offsets of lzo_uint. 
*/ +#if !defined(__LZO_MMODEL) +# if (LZO_UINT_MAX <= UINT_MAX) +# define __LZO_MMODEL +# elif defined(__LZO_DOS16) || defined(__LZO_WIN16) +# define __LZO_MMODEL __huge +# define LZO_999_UNSUPPORTED +# elif defined(__LZO_PALMOS16) || defined(__LZO_TOS16) +# define __LZO_MMODEL +# else +# error "__LZO_MMODEL" +# endif +#endif + +/* no typedef here because of const-pointer issues */ +#define lzo_byte unsigned char __LZO_MMODEL +#define lzo_bytep unsigned char __LZO_MMODEL * +#define lzo_charp char __LZO_MMODEL * +#define lzo_voidp void __LZO_MMODEL * +#define lzo_shortp short __LZO_MMODEL * +#define lzo_ushortp unsigned short __LZO_MMODEL * +#define lzo_uint32p lzo_uint32 __LZO_MMODEL * +#define lzo_int32p lzo_int32 __LZO_MMODEL * +#define lzo_uintp lzo_uint __LZO_MMODEL * +#define lzo_intp lzo_int __LZO_MMODEL * +#define lzo_voidpp lzo_voidp __LZO_MMODEL * +#define lzo_bytepp lzo_bytep __LZO_MMODEL * + +#ifndef lzo_sizeof_dict_t +# define lzo_sizeof_dict_t sizeof(lzo_bytep) +#endif + + +/*********************************************************************** +// calling conventions and function types +************************************************************************/ + +/* linkage */ +#if !defined(__LZO_EXTERN_C) +# ifdef __cplusplus +# define __LZO_EXTERN_C extern "C" +# else +# define __LZO_EXTERN_C extern +# endif +#endif + +/* calling convention */ +#if !defined(__LZO_CDECL) +# if defined(__LZO_DOS16) || defined(__LZO_WIN16) +# define __LZO_CDECL __LZO_CMODEL __cdecl +# elif defined(__LZO_i386) && defined(_MSC_VER) +# define __LZO_CDECL __LZO_CMODEL __cdecl +# elif defined(__LZO_i386) && defined(__WATCOMC__) +# define __LZO_CDECL __LZO_CMODEL __cdecl +# else +# define __LZO_CDECL __LZO_CMODEL +# endif +#endif +#if !defined(__LZO_ENTRY) +# define __LZO_ENTRY __LZO_CDECL +#endif + +/* C++ exception specification for extern "C" function types */ +#if !defined(__cplusplus) +# undef LZO_NOTHROW +# define LZO_NOTHROW +#elif !defined(LZO_NOTHROW) +# define 
LZO_NOTHROW +#endif + + +typedef int +(__LZO_ENTRY *lzo_compress_t) ( const lzo_byte *src, lzo_uint src_len, + lzo_byte *dst, lzo_uintp dst_len, + lzo_voidp wrkmem ); + +typedef int +(__LZO_ENTRY *lzo_decompress_t) ( const lzo_byte *src, lzo_uint src_len, + lzo_byte *dst, lzo_uintp dst_len, + lzo_voidp wrkmem ); + +typedef int +(__LZO_ENTRY *lzo_optimize_t) ( lzo_byte *src, lzo_uint src_len, + lzo_byte *dst, lzo_uintp dst_len, + lzo_voidp wrkmem ); + +typedef int +(__LZO_ENTRY *lzo_compress_dict_t)(const lzo_byte *src, lzo_uint src_len, + lzo_byte *dst, lzo_uintp dst_len, + lzo_voidp wrkmem, + const lzo_byte *dict, lzo_uint dict_len ); + +typedef int +(__LZO_ENTRY *lzo_decompress_dict_t)(const lzo_byte *src, lzo_uint src_len, + lzo_byte *dst, lzo_uintp dst_len, + lzo_voidp wrkmem, + const lzo_byte *dict, lzo_uint dict_len ); + + +/* assembler versions always use __cdecl */ +typedef int +(__LZO_CDECL *lzo_compress_asm_t)( const lzo_byte *src, lzo_uint src_len, + lzo_byte *dst, lzo_uintp dst_len, + lzo_voidp wrkmem ); + +typedef int +(__LZO_CDECL *lzo_decompress_asm_t)( const lzo_byte *src, lzo_uint src_len, + lzo_byte *dst, lzo_uintp dst_len, + lzo_voidp wrkmem ); + + +/* a progress indicator callback function */ +typedef void (__LZO_ENTRY *lzo_progress_callback_t) (lzo_uint, lzo_uint); + + +/*********************************************************************** +// export information +************************************************************************/ + +/* DLL export information */ +#if !defined(__LZO_EXPORT1) +# define __LZO_EXPORT1 +#endif +#if !defined(__LZO_EXPORT2) +# define __LZO_EXPORT2 +#endif + +/* exported calling convention for C functions */ +#if !defined(LZO_PUBLIC) +# define LZO_PUBLIC(_rettype) \ + __LZO_EXPORT1 _rettype __LZO_EXPORT2 __LZO_ENTRY +#endif +#if !defined(LZO_EXTERN) +# define LZO_EXTERN(_rettype) __LZO_EXTERN_C LZO_PUBLIC(_rettype) +#endif +#if !defined(LZO_PRIVATE) +# define LZO_PRIVATE(_rettype) static _rettype __LZO_ENTRY 
+#endif + +/* exported __cdecl calling convention for assembler functions */ +#if !defined(LZO_PUBLIC_CDECL) +# define LZO_PUBLIC_CDECL(_rettype) \ + __LZO_EXPORT1 _rettype __LZO_EXPORT2 __LZO_CDECL +#endif +#if !defined(LZO_EXTERN_CDECL) +# define LZO_EXTERN_CDECL(_rettype) __LZO_EXTERN_C LZO_PUBLIC_CDECL(_rettype) +#endif + +/* exported global variables (LZO currently uses no static variables and + * is fully thread safe) */ +#if !defined(LZO_PUBLIC_VAR) +# define LZO_PUBLIC_VAR(_type) \ + __LZO_EXPORT1 _type __LZO_EXPORT2 __LZO_DMODEL +#endif +#if !defined(LZO_EXTERN_VAR) +# define LZO_EXTERN_VAR(_type) extern LZO_PUBLIC_VAR(_type) +#endif + + +/*********************************************************************** +// error codes and prototypes +************************************************************************/ + +/* Error codes for the compression/decompression functions. Negative + * values are errors, positive values will be used for special but + * normal events. + */ +#define LZO_E_OK 0 +#define LZO_E_ERROR (-1) +#define LZO_E_OUT_OF_MEMORY (-2) /* not used right now */ +#define LZO_E_NOT_COMPRESSIBLE (-3) /* not used right now */ +#define LZO_E_INPUT_OVERRUN (-4) +#define LZO_E_OUTPUT_OVERRUN (-5) +#define LZO_E_LOOKBEHIND_OVERRUN (-6) +#define LZO_E_EOF_NOT_FOUND (-7) +#define LZO_E_INPUT_NOT_CONSUMED (-8) + + +/* lzo_init() should be the first function you call. + * Check the return code ! + * + * lzo_init() is a macro to allow checking that the library and the + * compiler's view of various types are consistent. 
+ */ +#define lzo_init() __lzo_init2(LZO_VERSION,(int)sizeof(short),(int)sizeof(int),\ + (int)sizeof(long),(int)sizeof(lzo_uint32),(int)sizeof(lzo_uint),\ + (int)lzo_sizeof_dict_t,(int)sizeof(char *),(int)sizeof(lzo_voidp),\ + (int)sizeof(lzo_compress_t)) +LZO_EXTERN(int) __lzo_init2(unsigned,int,int,int,int,int,int,int,int,int); + +/* checksum functions */ +LZO_EXTERN(lzo_uint32) +lzo_crc32(lzo_uint32 _c, const lzo_byte *_buf, lzo_uint _len); + +/* misc. */ +typedef union { lzo_bytep p; lzo_uint u; } __lzo_pu_u; +typedef union { lzo_bytep p; lzo_uint32 u32; } __lzo_pu32_u; +typedef union { void *vp; lzo_bytep bp; lzo_uint32 u32; long l; } lzo_align_t; + +#define LZO_PTR_ALIGN_UP(_ptr,_size) \ + ((_ptr) + (lzo_uint) __lzo_align_gap((const lzo_voidp)(_ptr),(lzo_uint)(_size))) + +/* deprecated - only for backward compatibility */ +#define LZO_ALIGN(_ptr,_size) LZO_PTR_ALIGN_UP(_ptr,_size) + + +#ifdef __cplusplus +} /* extern "C" */ +#endif + +#endif /* already included */ + diff -puN /dev/null fs/reiser4/plugin/compress/minilzo.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/compress/minilzo.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,2388 @@ +/* minilzo.c -- mini subset of the LZO real-time data compression library + Adopted for reiser4 compression transform plugin. + + This file is part of the LZO real-time data compression library. + + Copyright (C) 2002 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 2001 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 2000 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 1999 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 1998 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 1997 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 1996 Markus Franz Xaver Johannes Oberhumer + All Rights Reserved. 
+ + The LZO library is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License as + published by the Free Software Foundation; either version 2 of + the License, or (at your option) any later version. + + The LZO library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with the LZO library; see the file COPYING. + If not, write to the Free Software Foundation, Inc., + 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + + Markus F.X.J. Oberhumer + + http://www.oberhumer.com/opensource/lzo/ + */ + +/* + * NOTE: + * the full LZO package can be found at + * http://www.oberhumer.com/opensource/lzo/ + */ + +#include "../../debug.h" /* for reiser4 assert macro -edward */ + +#define __LZO_IN_MINILZO +#define LZO_BUILD + +#ifdef MINILZO_HAVE_CONFIG_H +# include +#endif + +#undef LZO_HAVE_CONFIG_H +#include "minilzo.h" + +#if !defined(MINILZO_VERSION) || (MINILZO_VERSION != 0x1080) +# error "version mismatch in miniLZO source files" +#endif + +#ifdef MINILZO_HAVE_CONFIG_H +# define LZO_HAVE_CONFIG_H +#endif + +#if 0 /* -edward */ +#if !defined(LZO_NO_SYS_TYPES_H) +# include +#endif +#include +#endif /* -edward */ + +#ifndef __LZO_CONF_H +#define __LZO_CONF_H + +#if !defined(__LZO_IN_MINILZO) +# ifndef __LZOCONF_H +# include +# endif +#endif + +#if defined(__BOUNDS_CHECKING_ON) +# include +#else +# define BOUNDS_CHECKING_OFF_DURING(stmt) stmt +# define BOUNDS_CHECKING_OFF_IN_EXPR(expr) (expr) +#endif + +#if 0 /* edward */ +#if !defined(LZO_HAVE_CONFIG_H) +# include +# include +# if !defined(NO_STDLIB_H) +# include +# endif +#endif /* edward */ +# define HAVE_MEMCMP +# define HAVE_MEMCPY +# define HAVE_MEMMOVE +# define HAVE_MEMSET +#if 0 /* edward */ +#else +# include 
+# if defined(HAVE_STDDEF_H) +# include +# endif +# if defined(STDC_HEADERS) +# include +# include +# endif +#endif +#endif /* edward */ + +#if defined(__LZO_DOS16) || defined(__LZO_WIN16) +# define HAVE_MALLOC_H +# define HAVE_HALLOC +#endif + +#undef NDEBUG +#if !defined(LZO_DEBUG) +# define NDEBUG +#endif +#if defined(LZO_DEBUG) || !defined(NDEBUG) +# if !defined(NO_STDIO_H) +# include +# endif +#endif +# if 0 /* edward */ +#include +#endif /* edward */ + +#if !defined(LZO_COMPILE_TIME_ASSERT) +# define LZO_COMPILE_TIME_ASSERT(expr) \ + { typedef int __lzo_compile_time_assert_fail[1 - 2 * !(expr)]; } +#endif + +#if !defined(LZO_UNUSED) +# if 1 +# define LZO_UNUSED(var) ((void)&var) +# elif 0 +# define LZO_UNUSED(var) { typedef int __lzo_unused[sizeof(var) ? 2 : 1]; } +# else +# define LZO_UNUSED(parm) (parm = parm) +# endif +#endif + +#if !defined(__inline__) && !defined(__GNUC__) +# if defined(__cplusplus) +# define __inline__ inline +# else +# define __inline__ +# endif +#endif + +#if defined(NO_MEMCMP) +# undef HAVE_MEMCMP +#endif + +#if !defined(HAVE_MEMSET) +# undef memset +# define memset lzo_memset +#endif + +#if 0 +# define LZO_BYTE(x) ((unsigned char) (x)) +#else +# define LZO_BYTE(x) ((unsigned char) ((x) & 0xff)) +#endif + +#define LZO_MAX(a,b) ((a) >= (b) ? (a) : (b)) +#define LZO_MIN(a,b) ((a) <= (b) ? (a) : (b)) +#define LZO_MAX3(a,b,c) ((a) >= (b) ? LZO_MAX(a,c) : LZO_MAX(b,c)) +#define LZO_MIN3(a,b,c) ((a) <= (b) ? 
LZO_MIN(a,c) : LZO_MIN(b,c)) + +#define lzo_sizeof(type) ((lzo_uint) (sizeof(type))) + +#define LZO_HIGH(array) ((lzo_uint) (sizeof(array)/sizeof(*(array)))) + +#define LZO_SIZE(bits) (1u << (bits)) +#define LZO_MASK(bits) (LZO_SIZE(bits) - 1) + +#define LZO_LSIZE(bits) (1ul << (bits)) +#define LZO_LMASK(bits) (LZO_LSIZE(bits) - 1) + +#define LZO_USIZE(bits) ((lzo_uint) 1 << (bits)) +#define LZO_UMASK(bits) (LZO_USIZE(bits) - 1) + +#define LZO_STYPE_MAX(b) (((1l << (8*(b)-2)) - 1l) + (1l << (8*(b)-2))) +#define LZO_UTYPE_MAX(b) (((1ul << (8*(b)-1)) - 1ul) + (1ul << (8*(b)-1))) + +#if !defined(SIZEOF_UNSIGNED) +# if (UINT_MAX == 0xffff) +# define SIZEOF_UNSIGNED 2 +# elif (UINT_MAX == LZO_0xffffffffL) +# define SIZEOF_UNSIGNED 4 +# elif (UINT_MAX >= LZO_0xffffffffL) +# define SIZEOF_UNSIGNED 8 +# else +# error "SIZEOF_UNSIGNED" +# endif +#endif + +#if !defined(SIZEOF_UNSIGNED_LONG) +# if (ULONG_MAX == LZO_0xffffffffL) +# define SIZEOF_UNSIGNED_LONG 4 +# elif (ULONG_MAX >= LZO_0xffffffffL) +# define SIZEOF_UNSIGNED_LONG 8 +# else +# error "SIZEOF_UNSIGNED_LONG" +# endif +#endif + +#if !defined(SIZEOF_SIZE_T) +# define SIZEOF_SIZE_T SIZEOF_UNSIGNED +#endif +#if !defined(SIZE_T_MAX) +# define SIZE_T_MAX LZO_UTYPE_MAX(SIZEOF_SIZE_T) +#endif + +#if 1 && defined(__LZO_i386) && (UINT_MAX == LZO_0xffffffffL) +# if !defined(LZO_UNALIGNED_OK_2) && (USHRT_MAX == 0xffff) +# define LZO_UNALIGNED_OK_2 +# endif +# if !defined(LZO_UNALIGNED_OK_4) && (LZO_UINT32_MAX == LZO_0xffffffffL) +# define LZO_UNALIGNED_OK_4 +# endif +#endif + +#if defined(LZO_UNALIGNED_OK_2) || defined(LZO_UNALIGNED_OK_4) +# if !defined(LZO_UNALIGNED_OK) +# define LZO_UNALIGNED_OK +# endif +#endif + +#if defined(__LZO_NO_UNALIGNED) +# undef LZO_UNALIGNED_OK +# undef LZO_UNALIGNED_OK_2 +# undef LZO_UNALIGNED_OK_4 +#endif + +#if defined(LZO_UNALIGNED_OK_2) && (USHRT_MAX != 0xffff) +# error "LZO_UNALIGNED_OK_2 must not be defined on this system" +#endif +#if defined(LZO_UNALIGNED_OK_4) && (LZO_UINT32_MAX != 
LZO_0xffffffffL) +# error "LZO_UNALIGNED_OK_4 must not be defined on this system" +#endif + +#if defined(__LZO_NO_ALIGNED) +# undef LZO_ALIGNED_OK_4 +#endif + +#if defined(LZO_ALIGNED_OK_4) && (LZO_UINT32_MAX != LZO_0xffffffffL) +# error "LZO_ALIGNED_OK_4 must not be defined on this system" +#endif + +#define LZO_LITTLE_ENDIAN 1234 +#define LZO_BIG_ENDIAN 4321 +#define LZO_PDP_ENDIAN 3412 + +#if !defined(LZO_BYTE_ORDER) +# if defined(MFX_BYTE_ORDER) +# define LZO_BYTE_ORDER MFX_BYTE_ORDER +# elif defined(__LZO_i386) +# define LZO_BYTE_ORDER LZO_LITTLE_ENDIAN +# elif defined(BYTE_ORDER) +# define LZO_BYTE_ORDER BYTE_ORDER +# elif defined(__BYTE_ORDER) +# define LZO_BYTE_ORDER __BYTE_ORDER +# endif +#endif + +#if defined(LZO_BYTE_ORDER) +# if (LZO_BYTE_ORDER != LZO_LITTLE_ENDIAN) && \ + (LZO_BYTE_ORDER != LZO_BIG_ENDIAN) +# error "invalid LZO_BYTE_ORDER" +# endif +#endif + +#if defined(LZO_UNALIGNED_OK) && !defined(LZO_BYTE_ORDER) +# error "LZO_BYTE_ORDER is not defined" +#endif + +#define LZO_OPTIMIZE_GNUC_i386_IS_BUGGY + +#if defined(NDEBUG) && !defined(LZO_DEBUG) && !defined(__LZO_CHECKER) +# if defined(__GNUC__) && defined(__i386__) +# if !defined(LZO_OPTIMIZE_GNUC_i386_IS_BUGGY) +# define LZO_OPTIMIZE_GNUC_i386 +# endif +# endif +#endif + +__LZO_EXTERN_C const lzo_uint32 _lzo_crc32_table[256]; + +#define _LZO_STRINGIZE(x) #x +#define _LZO_MEXPAND(x) _LZO_STRINGIZE(x) + +#define _LZO_CONCAT2(a,b) a ## b +#define _LZO_CONCAT3(a,b,c) a ## b ## c +#define _LZO_CONCAT4(a,b,c,d) a ## b ## c ## d +#define _LZO_CONCAT5(a,b,c,d,e) a ## b ## c ## d ## e + +#define _LZO_ECONCAT2(a,b) _LZO_CONCAT2(a,b) +#define _LZO_ECONCAT3(a,b,c) _LZO_CONCAT3(a,b,c) +#define _LZO_ECONCAT4(a,b,c,d) _LZO_CONCAT4(a,b,c,d) +#define _LZO_ECONCAT5(a,b,c,d,e) _LZO_CONCAT5(a,b,c,d,e) + +#if 0 + +#define __LZO_IS_COMPRESS_QUERY(i,il,o,ol,w) ((lzo_voidp)(o) == (w)) +#define __LZO_QUERY_COMPRESS(i,il,o,ol,w,n,s) \ + (*ol = (n)*(s), LZO_E_OK) + +#define __LZO_IS_DECOMPRESS_QUERY(i,il,o,ol,w) 
((lzo_voidp)(o) == (w)) +#define __LZO_QUERY_DECOMPRESS(i,il,o,ol,w,n,s) \ + (*ol = (n)*(s), LZO_E_OK) + +#define __LZO_IS_OPTIMIZE_QUERY(i,il,o,ol,w) ((lzo_voidp)(o) == (w)) +#define __LZO_QUERY_OPTIMIZE(i,il,o,ol,w,n,s) \ + (*ol = (n)*(s), LZO_E_OK) + +#endif + +#ifndef __LZO_PTR_H +#define __LZO_PTR_H + +#ifdef __cplusplus +extern "C" { +#endif + +#if defined(__LZO_DOS16) || defined(__LZO_WIN16) +# include +# if 1 && defined(__WATCOMC__) +# include + __LZO_EXTERN_C unsigned char _HShift; +# define __LZO_HShift _HShift +# elif 1 && defined(_MSC_VER) + __LZO_EXTERN_C unsigned short __near _AHSHIFT; +# define __LZO_HShift ((unsigned) &_AHSHIFT) +# elif defined(__LZO_WIN16) +# define __LZO_HShift 3 +# else +# define __LZO_HShift 12 +# endif +# if !defined(_FP_SEG) && defined(FP_SEG) +# define _FP_SEG FP_SEG +# endif +# if !defined(_FP_OFF) && defined(FP_OFF) +# define _FP_OFF FP_OFF +# endif +#endif + +#if !defined(lzo_ptrdiff_t) +# if (UINT_MAX >= LZO_0xffffffffL) + typedef ptrdiff_t lzo_ptrdiff_t; +# else + typedef long lzo_ptrdiff_t; +# endif +#endif + +#if !defined(__LZO_HAVE_PTR_T) +# if defined(lzo_ptr_t) +# define __LZO_HAVE_PTR_T +# endif +#endif +#if !defined(__LZO_HAVE_PTR_T) +# if defined(SIZEOF_CHAR_P) && defined(SIZEOF_UNSIGNED_LONG) +# if (SIZEOF_CHAR_P == SIZEOF_UNSIGNED_LONG) + typedef unsigned long lzo_ptr_t; + typedef long lzo_sptr_t; +# define __LZO_HAVE_PTR_T +# endif +# endif +#endif +#if !defined(__LZO_HAVE_PTR_T) +# if defined(SIZEOF_CHAR_P) && defined(SIZEOF_UNSIGNED) +# if (SIZEOF_CHAR_P == SIZEOF_UNSIGNED) + typedef unsigned int lzo_ptr_t; + typedef int lzo_sptr_t; +# define __LZO_HAVE_PTR_T +# endif +# endif +#endif +#if !defined(__LZO_HAVE_PTR_T) +# if defined(SIZEOF_CHAR_P) && defined(SIZEOF_UNSIGNED_SHORT) +# if (SIZEOF_CHAR_P == SIZEOF_UNSIGNED_SHORT) + typedef unsigned short lzo_ptr_t; + typedef short lzo_sptr_t; +# define __LZO_HAVE_PTR_T +# endif +# endif +#endif +#if !defined(__LZO_HAVE_PTR_T) +# if defined(LZO_HAVE_CONFIG_H) || 
defined(SIZEOF_CHAR_P) +# error "no suitable type for lzo_ptr_t" +# else + typedef unsigned long lzo_ptr_t; + typedef long lzo_sptr_t; +# define __LZO_HAVE_PTR_T +# endif +#endif + +#if defined(__LZO_DOS16) || defined(__LZO_WIN16) +#define PTR(a) ((lzo_bytep) (a)) +#define PTR_ALIGNED_4(a) ((_FP_OFF(a) & 3) == 0) +#define PTR_ALIGNED2_4(a,b) (((_FP_OFF(a) | _FP_OFF(b)) & 3) == 0) +#else +#define PTR(a) ((lzo_ptr_t) (a)) +#define PTR_LINEAR(a) PTR(a) +#define PTR_ALIGNED_4(a) ((PTR_LINEAR(a) & 3) == 0) +#define PTR_ALIGNED_8(a) ((PTR_LINEAR(a) & 7) == 0) +#define PTR_ALIGNED2_4(a,b) (((PTR_LINEAR(a) | PTR_LINEAR(b)) & 3) == 0) +#define PTR_ALIGNED2_8(a,b) (((PTR_LINEAR(a) | PTR_LINEAR(b)) & 7) == 0) +#endif + +#define PTR_LT(a,b) (PTR(a) < PTR(b)) +#define PTR_GE(a,b) (PTR(a) >= PTR(b)) +#define PTR_DIFF(a,b) ((lzo_ptrdiff_t) (PTR(a) - PTR(b))) +#define pd(a,b) ((lzo_uint) ((a)-(b))) + +typedef union +{ + char a_char; + unsigned char a_uchar; + short a_short; + unsigned short a_ushort; + int a_int; + unsigned int a_uint; + long a_long; + unsigned long a_ulong; + lzo_int a_lzo_int; + lzo_uint a_lzo_uint; + lzo_int32 a_lzo_int32; + lzo_uint32 a_lzo_uint32; + ptrdiff_t a_ptrdiff_t; + lzo_ptrdiff_t a_lzo_ptrdiff_t; + lzo_ptr_t a_lzo_ptr_t; + lzo_voidp a_lzo_voidp; + void * a_void_p; + lzo_bytep a_lzo_bytep; + lzo_bytepp a_lzo_bytepp; + lzo_uintp a_lzo_uintp; + lzo_uint * a_lzo_uint_p; + lzo_uint32p a_lzo_uint32p; + lzo_uint32 * a_lzo_uint32_p; + unsigned char * a_uchar_p; + char * a_char_p; +} +lzo_full_align_t; + +#ifdef __cplusplus +} +#endif + +#endif + +#define LZO_DETERMINISTIC + +#define LZO_DICT_USE_PTR +#if defined(__LZO_DOS16) || defined(__LZO_WIN16) || defined(__LZO_STRICT_16BIT) +# undef LZO_DICT_USE_PTR +#endif + +#if defined(LZO_DICT_USE_PTR) +# define lzo_dict_t const lzo_bytep +# define lzo_dict_p lzo_dict_t __LZO_MMODEL * +#else +# define lzo_dict_t lzo_uint +# define lzo_dict_p lzo_dict_t __LZO_MMODEL * +#endif + +#if !defined(lzo_moff_t) +#define 
lzo_moff_t lzo_uint +#endif + +#endif + +static lzo_ptr_t +__lzo_ptr_linear(const lzo_voidp ptr) +{ + lzo_ptr_t p; + +#if defined(__LZO_DOS16) || defined(__LZO_WIN16) + p = (((lzo_ptr_t)(_FP_SEG(ptr))) << (16 - __LZO_HShift)) + (_FP_OFF(ptr)); +#else + p = PTR_LINEAR(ptr); +#endif + + return p; +} + +static unsigned +__lzo_align_gap(const lzo_voidp ptr, lzo_uint size) +{ + lzo_ptr_t p, s, n; + + assert("lzo-01", size > 0); + + p = __lzo_ptr_linear(ptr); + s = (lzo_ptr_t) (size - 1); +#if 0 + assert((size & (size - 1)) == 0); + n = ((p + s) & ~s) - p; +#else + n = (((p + s) / size) * size) - p; +#endif + + assert("lzo-02", (long)n >= 0); + assert("lzo-03", n <= s); + + return (unsigned)n; +} + +#ifndef __LZO_UTIL_H +#define __LZO_UTIL_H + +#ifndef __LZO_CONF_H +#endif + +#ifdef __cplusplus +extern "C" { +#endif + +#if 1 && defined(HAVE_MEMCPY) +#if !defined(__LZO_DOS16) && !defined(__LZO_WIN16) + +#define MEMCPY8_DS(dest,src,len) \ + memcpy(dest,src,len); \ + dest += len; \ + src += len + +#endif +#endif + +#if 0 && !defined(MEMCPY8_DS) + +#define MEMCPY8_DS(dest,src,len) \ + { do { \ + *dest++ = *src++; \ + *dest++ = *src++; \ + *dest++ = *src++; \ + *dest++ = *src++; \ + *dest++ = *src++; \ + *dest++ = *src++; \ + *dest++ = *src++; \ + *dest++ = *src++; \ + len -= 8; \ + } while (len > 0); } + +#endif + +#if !defined(MEMCPY8_DS) + +#define MEMCPY8_DS(dest,src,len) \ + { register lzo_uint __l = (len) / 8; \ + do { \ + *dest++ = *src++; \ + *dest++ = *src++; \ + *dest++ = *src++; \ + *dest++ = *src++; \ + *dest++ = *src++; \ + *dest++ = *src++; \ + *dest++ = *src++; \ + *dest++ = *src++; \ + } while (--__l > 0); } + +#endif + +#define MEMCPY_DS(dest,src,len) \ + do *dest++ = *src++; \ + while (--len > 0) + +#define MEMMOVE_DS(dest,src,len) \ + do *dest++ = *src++; \ + while (--len > 0) + +#if 0 && defined(LZO_OPTIMIZE_GNUC_i386) + +#define BZERO8_PTR(s,l,n) \ +__asm__ __volatile__( \ + "movl %0,%%eax \n" \ + "movl %1,%%edi \n" \ + "movl %2,%%ecx \n" \ + "cld \n" \ + 
"rep \n" \ + "stosl %%eax,(%%edi) \n" \ + : \ + :"g" (0),"g" (s),"g" (n) \ + :"eax","edi","ecx", "memory", "cc" \ +) + +#elif (LZO_UINT_MAX <= SIZE_T_MAX) && defined(HAVE_MEMSET) + +#if 1 +#define BZERO8_PTR(s,l,n) memset((s),0,(lzo_uint)(l)*(n)) +#else +#define BZERO8_PTR(s,l,n) memset((lzo_voidp)(s),0,(lzo_uint)(l)*(n)) +#endif + +#else + +#define BZERO8_PTR(s,l,n) \ + lzo_memset((lzo_voidp)(s),0,(lzo_uint)(l)*(n)) + +#endif + +#if 0 +#if defined(__GNUC__) && defined(__i386__) + +unsigned char lzo_rotr8(unsigned char value, int shift); +extern __inline__ unsigned char lzo_rotr8(unsigned char value, int shift) +{ + unsigned char result; + + __asm__ __volatile__ ("movb %b1, %b0; rorb %b2, %b0" + : "=a"(result) : "g"(value), "c"(shift)); + return result; +} + +unsigned short lzo_rotr16(unsigned short value, int shift); +extern __inline__ unsigned short lzo_rotr16(unsigned short value, int shift) +{ + unsigned short result; + + __asm__ __volatile__ ("movw %b1, %b0; rorw %b2, %b0" + : "=a"(result) : "g"(value), "c"(shift)); + return result; +} + +#endif +#endif + +#ifdef __cplusplus +} +#endif + +#endif + +/* If you use the LZO library in a product, you *must* keep this + * copyright string in the executable of your product. 
+ */ + +const lzo_byte __lzo_copyright[] = +#if !defined(__LZO_IN_MINLZO) + LZO_VERSION_STRING; +#else + "\n\n\n" + "LZO real-time data compression library.\n" + "Copyright (C) 1996, 1997, 1998, 1999, 2000, 2001, 2002 Markus Franz Xaver Johannes Oberhumer\n" + "\n" + "http://www.oberhumer.com/opensource/lzo/\n" + "\n" + "LZO version: v" LZO_VERSION_STRING ", " LZO_VERSION_DATE "\n" + "LZO build date: " __DATE__ " " __TIME__ "\n\n" + "LZO special compilation options:\n" +#ifdef __cplusplus + " __cplusplus\n" +#endif +#if defined(__PIC__) + " __PIC__\n" +#elif defined(__pic__) + " __pic__\n" +#endif +#if (UINT_MAX < LZO_0xffffffffL) + " 16BIT\n" +#endif +#if defined(__LZO_STRICT_16BIT) + " __LZO_STRICT_16BIT\n" +#endif +#if (UINT_MAX > LZO_0xffffffffL) + " UINT_MAX=" _LZO_MEXPAND(UINT_MAX) "\n" +#endif +#if (ULONG_MAX > LZO_0xffffffffL) + " ULONG_MAX=" _LZO_MEXPAND(ULONG_MAX) "\n" +#endif +#if defined(LZO_BYTE_ORDER) + " LZO_BYTE_ORDER=" _LZO_MEXPAND(LZO_BYTE_ORDER) "\n" +#endif +#if defined(LZO_UNALIGNED_OK_2) + " LZO_UNALIGNED_OK_2\n" +#endif +#if defined(LZO_UNALIGNED_OK_4) + " LZO_UNALIGNED_OK_4\n" +#endif +#if defined(LZO_ALIGNED_OK_4) + " LZO_ALIGNED_OK_4\n" +#endif +#if defined(LZO_DICT_USE_PTR) + " LZO_DICT_USE_PTR\n" +#endif +#if defined(__LZO_QUERY_COMPRESS) + " __LZO_QUERY_COMPRESS\n" +#endif +#if defined(__LZO_QUERY_DECOMPRESS) + " __LZO_QUERY_DECOMPRESS\n" +#endif +#if defined(__LZO_IN_MINILZO) + " __LZO_IN_MINILZO\n" +#endif + "\n\n" + "$Id: LZO " LZO_VERSION_STRING " built " __DATE__ " " __TIME__ +#if defined(__GNUC__) && defined(__VERSION__) + " by gcc " __VERSION__ +#elif defined(__BORLANDC__) + " by Borland C " _LZO_MEXPAND(__BORLANDC__) +#elif defined(_MSC_VER) + " by Microsoft C " _LZO_MEXPAND(_MSC_VER) +#elif defined(__PUREC__) + " by Pure C " _LZO_MEXPAND(__PUREC__) +#elif defined(__SC__) + " by Symantec C " _LZO_MEXPAND(__SC__) +#elif defined(__TURBOC__) + " by Turbo C " _LZO_MEXPAND(__TURBOC__) +#elif defined(__WATCOMC__) + " by Watcom C " 
_LZO_MEXPAND(__WATCOMC__) +#endif + " $\n" + "$Copyright: LZO (C) 1996, 1997, 1998, 1999, 2000, 2001, 2002 Markus Franz Xaver Johannes Oberhumer $\n"; +#endif + + +#define LZO_BASE 65521u +#define LZO_NMAX 5552 + +#define LZO_DO1(buf,i) {s1 += buf[i]; s2 += s1;} +#define LZO_DO2(buf,i) LZO_DO1(buf,i); LZO_DO1(buf,i+1); +#define LZO_DO4(buf,i) LZO_DO2(buf,i); LZO_DO2(buf,i+2); +#define LZO_DO8(buf,i) LZO_DO4(buf,i); LZO_DO4(buf,i+4); +#define LZO_DO16(buf,i) LZO_DO8(buf,i); LZO_DO8(buf,i+8); + +static lzo_voidp +lzo_memset(lzo_voidp s, int c, lzo_uint len) +{ +#if (LZO_UINT_MAX <= SIZE_T_MAX) && defined(HAVE_MEMSET) + return memset(s,c,len); +#else + lzo_byte *p = (lzo_byte *) s; + + if (len > 0) do + *p++ = LZO_BYTE(c); + while (--len > 0); + return s; +#endif +} + +#if 0 +# define IS_SIGNED(type) (((type) (1ul << (8 * sizeof(type) - 1))) < 0) +# define IS_UNSIGNED(type) (((type) (1ul << (8 * sizeof(type) - 1))) > 0) +#else +# define IS_SIGNED(type) (((type) (-1)) < ((type) 0)) +# define IS_UNSIGNED(type) (((type) (-1)) > ((type) 0)) +#endif + +#define IS_POWER_OF_2(x) (((x) & ((x) - 1)) == 0) + +static lzo_bool schedule_insns_bug(void); +static lzo_bool strength_reduce_bug(int *); + +#if 0 || defined(LZO_DEBUG) +#include <stdio.h> +static lzo_bool __lzo_assert_fail(const char *s, unsigned line) +{ +#if defined(__palmos__) + printf("LZO assertion failed in line %u: '%s'\n",line,s); +#else + fprintf(stderr,"LZO assertion failed in line %u: '%s'\n",line,s); +#endif + return 0; +} +# define __lzo_assert(x) ((x) ? 1 : __lzo_assert_fail(#x,__LINE__)) +#else +# define __lzo_assert(x) ((x) ? 
1 : 0) +#endif + +#undef COMPILE_TIME_ASSERT +#if 0 +# define COMPILE_TIME_ASSERT(expr) r &= __lzo_assert(expr) +#else +# define COMPILE_TIME_ASSERT(expr) LZO_COMPILE_TIME_ASSERT(expr) +#endif + +static lzo_bool basic_integral_check(void) +{ + lzo_bool r = 1; + + COMPILE_TIME_ASSERT(CHAR_BIT == 8); + COMPILE_TIME_ASSERT(sizeof(char) == 1); + COMPILE_TIME_ASSERT(sizeof(short) >= 2); + COMPILE_TIME_ASSERT(sizeof(long) >= 4); + COMPILE_TIME_ASSERT(sizeof(int) >= sizeof(short)); + COMPILE_TIME_ASSERT(sizeof(long) >= sizeof(int)); + + COMPILE_TIME_ASSERT(sizeof(lzo_uint) == sizeof(lzo_int)); + COMPILE_TIME_ASSERT(sizeof(lzo_uint32) == sizeof(lzo_int32)); + + COMPILE_TIME_ASSERT(sizeof(lzo_uint32) >= 4); + COMPILE_TIME_ASSERT(sizeof(lzo_uint32) >= sizeof(unsigned)); +#if defined(__LZO_STRICT_16BIT) + COMPILE_TIME_ASSERT(sizeof(lzo_uint) == 2); +#else + COMPILE_TIME_ASSERT(sizeof(lzo_uint) >= 4); + COMPILE_TIME_ASSERT(sizeof(lzo_uint) >= sizeof(unsigned)); +#endif + +#if (USHRT_MAX == 65535u) + COMPILE_TIME_ASSERT(sizeof(short) == 2); +#elif (USHRT_MAX == LZO_0xffffffffL) + COMPILE_TIME_ASSERT(sizeof(short) == 4); +#elif (USHRT_MAX >= LZO_0xffffffffL) + COMPILE_TIME_ASSERT(sizeof(short) > 4); +#endif +#if 0 /* to make gcc happy -edward */ +#if (UINT_MAX == 65535u) + COMPILE_TIME_ASSERT(sizeof(int) == 2); +#elif (UINT_MAX == LZO_0xffffffffL) + COMPILE_TIME_ASSERT(sizeof(int) == 4); +#elif (UINT_MAX >= LZO_0xffffffffL) + COMPILE_TIME_ASSERT(sizeof(int) > 4); +#endif +#if (ULONG_MAX == 65535ul) + COMPILE_TIME_ASSERT(sizeof(long) == 2); +#elif (ULONG_MAX == LZO_0xffffffffL) + COMPILE_TIME_ASSERT(sizeof(long) == 4); +#elif (ULONG_MAX >= LZO_0xffffffffL) + COMPILE_TIME_ASSERT(sizeof(long) > 4); +#endif +#if defined(SIZEOF_UNSIGNED) + COMPILE_TIME_ASSERT(SIZEOF_UNSIGNED == sizeof(unsigned)); +#endif +#if defined(SIZEOF_UNSIGNED_LONG) + COMPILE_TIME_ASSERT(SIZEOF_UNSIGNED_LONG == sizeof(unsigned long)); +#endif +#if defined(SIZEOF_UNSIGNED_SHORT) + 
COMPILE_TIME_ASSERT(SIZEOF_UNSIGNED_SHORT == sizeof(unsigned short)); +#endif +#if !defined(__LZO_IN_MINILZO) +#if defined(SIZEOF_SIZE_T) + COMPILE_TIME_ASSERT(SIZEOF_SIZE_T == sizeof(size_t)); +#endif +#endif +#endif /* -edward */ + + COMPILE_TIME_ASSERT(IS_UNSIGNED(unsigned char)); + COMPILE_TIME_ASSERT(IS_UNSIGNED(unsigned short)); + COMPILE_TIME_ASSERT(IS_UNSIGNED(unsigned)); + COMPILE_TIME_ASSERT(IS_UNSIGNED(unsigned long)); + COMPILE_TIME_ASSERT(IS_SIGNED(short)); + COMPILE_TIME_ASSERT(IS_SIGNED(int)); + COMPILE_TIME_ASSERT(IS_SIGNED(long)); + + COMPILE_TIME_ASSERT(IS_UNSIGNED(lzo_uint32)); + COMPILE_TIME_ASSERT(IS_UNSIGNED(lzo_uint)); + COMPILE_TIME_ASSERT(IS_SIGNED(lzo_int32)); + COMPILE_TIME_ASSERT(IS_SIGNED(lzo_int)); + + COMPILE_TIME_ASSERT(INT_MAX == LZO_STYPE_MAX(sizeof(int))); + COMPILE_TIME_ASSERT(UINT_MAX == LZO_UTYPE_MAX(sizeof(unsigned))); + COMPILE_TIME_ASSERT(LONG_MAX == LZO_STYPE_MAX(sizeof(long))); + COMPILE_TIME_ASSERT(ULONG_MAX == LZO_UTYPE_MAX(sizeof(unsigned long))); + // COMPILE_TIME_ASSERT(SHRT_MAX == LZO_STYPE_MAX(sizeof(short))); /* edward */ + COMPILE_TIME_ASSERT(USHRT_MAX == LZO_UTYPE_MAX(sizeof(unsigned short))); + COMPILE_TIME_ASSERT(LZO_UINT32_MAX == LZO_UTYPE_MAX(sizeof(lzo_uint32))); + COMPILE_TIME_ASSERT(LZO_UINT_MAX == LZO_UTYPE_MAX(sizeof(lzo_uint))); +#if !defined(__LZO_IN_MINILZO) + COMPILE_TIME_ASSERT(SIZE_T_MAX == LZO_UTYPE_MAX(sizeof(size_t))); +#endif + + r &= __lzo_assert(LZO_BYTE(257) == 1); + + return r; +} + +static lzo_bool basic_ptr_check(void) +{ + lzo_bool r = 1; + + COMPILE_TIME_ASSERT(sizeof(char *) >= sizeof(int)); + COMPILE_TIME_ASSERT(sizeof(lzo_byte *) >= sizeof(char *)); + + COMPILE_TIME_ASSERT(sizeof(lzo_voidp) == sizeof(lzo_byte *)); + COMPILE_TIME_ASSERT(sizeof(lzo_voidp) == sizeof(lzo_voidpp)); + COMPILE_TIME_ASSERT(sizeof(lzo_voidp) == sizeof(lzo_bytepp)); + COMPILE_TIME_ASSERT(sizeof(lzo_voidp) >= sizeof(lzo_uint)); + + COMPILE_TIME_ASSERT(sizeof(lzo_ptr_t) == sizeof(lzo_voidp)); + 
COMPILE_TIME_ASSERT(sizeof(lzo_ptr_t) == sizeof(lzo_sptr_t)); + COMPILE_TIME_ASSERT(sizeof(lzo_ptr_t) >= sizeof(lzo_uint)); + + COMPILE_TIME_ASSERT(sizeof(lzo_ptrdiff_t) >= 4); + COMPILE_TIME_ASSERT(sizeof(lzo_ptrdiff_t) >= sizeof(ptrdiff_t)); + + COMPILE_TIME_ASSERT(sizeof(ptrdiff_t) >= sizeof(size_t)); + COMPILE_TIME_ASSERT(sizeof(lzo_ptrdiff_t) >= sizeof(lzo_uint)); + +#if defined(SIZEOF_CHAR_P) + COMPILE_TIME_ASSERT(SIZEOF_CHAR_P == sizeof(char *)); +#endif +#if defined(SIZEOF_PTRDIFF_T) + COMPILE_TIME_ASSERT(SIZEOF_PTRDIFF_T == sizeof(ptrdiff_t)); +#endif + + COMPILE_TIME_ASSERT(IS_SIGNED(ptrdiff_t)); + COMPILE_TIME_ASSERT(IS_UNSIGNED(size_t)); + COMPILE_TIME_ASSERT(IS_SIGNED(lzo_ptrdiff_t)); + COMPILE_TIME_ASSERT(IS_SIGNED(lzo_sptr_t)); + COMPILE_TIME_ASSERT(IS_UNSIGNED(lzo_ptr_t)); + COMPILE_TIME_ASSERT(IS_UNSIGNED(lzo_moff_t)); + + return r; +} + +static lzo_bool ptr_check(void) +{ + lzo_bool r = 1; + int i; + char _wrkmem[10 * sizeof(lzo_byte *) + sizeof(lzo_full_align_t)]; + lzo_bytep wrkmem; + lzo_bytepp dict; + unsigned char x[4 * sizeof(lzo_full_align_t)]; + long d; + lzo_full_align_t a; + lzo_full_align_t u; + + for (i = 0; i < (int) sizeof(x); i++) + x[i] = LZO_BYTE(i); + + wrkmem = LZO_PTR_ALIGN_UP((lzo_byte *)_wrkmem,sizeof(lzo_full_align_t)); + +#if 0 + dict = (lzo_bytepp) wrkmem; +#else + + u.a_lzo_bytep = wrkmem; dict = u.a_lzo_bytepp; +#endif + + d = (long) ((const lzo_bytep) dict - (const lzo_bytep) _wrkmem); + r &= __lzo_assert(d >= 0); + r &= __lzo_assert(d < (long) sizeof(lzo_full_align_t)); + + memset(&a,0,sizeof(a)); + r &= __lzo_assert(a.a_lzo_voidp == NULL); + + memset(&a,0xff,sizeof(a)); + r &= __lzo_assert(a.a_ushort == USHRT_MAX); + r &= __lzo_assert(a.a_uint == UINT_MAX); + r &= __lzo_assert(a.a_ulong == ULONG_MAX); + r &= __lzo_assert(a.a_lzo_uint == LZO_UINT_MAX); + r &= __lzo_assert(a.a_lzo_uint32 == LZO_UINT32_MAX); + + if (r == 1) + { + for (i = 0; i < 8; i++) + r &= __lzo_assert((const lzo_voidp) (&dict[i]) == (const 
lzo_voidp) (&wrkmem[i * sizeof(lzo_byte *)])); + } + + memset(&a,0,sizeof(a)); + r &= __lzo_assert(a.a_char_p == NULL); + r &= __lzo_assert(a.a_lzo_bytep == NULL); + r &= __lzo_assert(NULL == (void *)0); + if (r == 1) + { + for (i = 0; i < 10; i++) + dict[i] = wrkmem; + BZERO8_PTR(dict+1,sizeof(dict[0]),8); + r &= __lzo_assert(dict[0] == wrkmem); + for (i = 1; i < 9; i++) + r &= __lzo_assert(dict[i] == NULL); + r &= __lzo_assert(dict[9] == wrkmem); + } + + if (r == 1) + { + unsigned k = 1; + const unsigned n = (unsigned) sizeof(lzo_uint32); + lzo_byte *p0; + lzo_byte *p1; + + k += __lzo_align_gap(&x[k],n); + p0 = (lzo_bytep) &x[k]; +#if defined(PTR_LINEAR) + r &= __lzo_assert((PTR_LINEAR(p0) & (n-1)) == 0); +#else + r &= __lzo_assert(n == 4); + r &= __lzo_assert(PTR_ALIGNED_4(p0)); +#endif + + r &= __lzo_assert(k >= 1); + p1 = (lzo_bytep) &x[1]; + r &= __lzo_assert(PTR_GE(p0,p1)); + + r &= __lzo_assert(k < 1+n); + p1 = (lzo_bytep) &x[1+n]; + r &= __lzo_assert(PTR_LT(p0,p1)); + + if (r == 1) + { + lzo_uint32 v0, v1; +#if 0 + v0 = * (lzo_uint32 *) &x[k]; + v1 = * (lzo_uint32 *) &x[k+n]; +#else + + u.a_uchar_p = &x[k]; + v0 = *u.a_lzo_uint32_p; + u.a_uchar_p = &x[k+n]; + v1 = *u.a_lzo_uint32_p; +#endif + r &= __lzo_assert(v0 > 0); + r &= __lzo_assert(v1 > 0); + } + } + + return r; +} + +static int +_lzo_config_check(void) +{ + lzo_bool r = 1; + int i; + union { + lzo_uint32 a; + unsigned short b; + lzo_uint32 aa[4]; + unsigned char x[4*sizeof(lzo_full_align_t)]; + } u; + + COMPILE_TIME_ASSERT( (int) ((unsigned char) ((signed char) -1)) == 255); + COMPILE_TIME_ASSERT( (((unsigned char)128) << (int)(8*sizeof(int)-8)) < 0); + +#if 0 + r &= __lzo_assert((const void *)&u == (const void *)&u.a); + r &= __lzo_assert((const void *)&u == (const void *)&u.b); + r &= __lzo_assert((const void *)&u == (const void *)&u.x[0]); + r &= __lzo_assert((const void *)&u == (const void *)&u.aa[0]); +#endif + + r &= basic_integral_check(); + r &= basic_ptr_check(); + if (r != 1) + return 
LZO_E_ERROR; + + u.a = 0; u.b = 0; + for (i = 0; i < (int) sizeof(u.x); i++) + u.x[i] = LZO_BYTE(i); + +#if defined(LZO_BYTE_ORDER) + if (r == 1) + { +# if (LZO_BYTE_ORDER == LZO_LITTLE_ENDIAN) + lzo_uint32 a = (lzo_uint32) (u.a & LZO_0xffffffffL); + unsigned short b = (unsigned short) (u.b & 0xffff); + r &= __lzo_assert(a == 0x03020100L); + r &= __lzo_assert(b == 0x0100); +# elif (LZO_BYTE_ORDER == LZO_BIG_ENDIAN) + lzo_uint32 a = u.a >> (8 * sizeof(u.a) - 32); + unsigned short b = u.b >> (8 * sizeof(u.b) - 16); + r &= __lzo_assert(a == 0x00010203L); + r &= __lzo_assert(b == 0x0001); +# else +# error "invalid LZO_BYTE_ORDER" +# endif + } +#endif + +#if defined(LZO_UNALIGNED_OK_2) + COMPILE_TIME_ASSERT(sizeof(short) == 2); + if (r == 1) + { + unsigned short b[4]; + + for (i = 0; i < 4; i++) + b[i] = * (const unsigned short *) &u.x[i]; + +# if (LZO_BYTE_ORDER == LZO_LITTLE_ENDIAN) + r &= __lzo_assert(b[0] == 0x0100); + r &= __lzo_assert(b[1] == 0x0201); + r &= __lzo_assert(b[2] == 0x0302); + r &= __lzo_assert(b[3] == 0x0403); +# elif (LZO_BYTE_ORDER == LZO_BIG_ENDIAN) + r &= __lzo_assert(b[0] == 0x0001); + r &= __lzo_assert(b[1] == 0x0102); + r &= __lzo_assert(b[2] == 0x0203); + r &= __lzo_assert(b[3] == 0x0304); +# endif + } +#endif + +#if defined(LZO_UNALIGNED_OK_4) + COMPILE_TIME_ASSERT(sizeof(lzo_uint32) == 4); + if (r == 1) + { + lzo_uint32 a[4]; + + for (i = 0; i < 4; i++) + a[i] = * (const lzo_uint32 *) &u.x[i]; + +# if (LZO_BYTE_ORDER == LZO_LITTLE_ENDIAN) + r &= __lzo_assert(a[0] == 0x03020100L); + r &= __lzo_assert(a[1] == 0x04030201L); + r &= __lzo_assert(a[2] == 0x05040302L); + r &= __lzo_assert(a[3] == 0x06050403L); +# elif (LZO_BYTE_ORDER == LZO_BIG_ENDIAN) + r &= __lzo_assert(a[0] == 0x00010203L); + r &= __lzo_assert(a[1] == 0x01020304L); + r &= __lzo_assert(a[2] == 0x02030405L); + r &= __lzo_assert(a[3] == 0x03040506L); +# endif + } +#endif + +#if defined(LZO_ALIGNED_OK_4) + COMPILE_TIME_ASSERT(sizeof(lzo_uint32) == 4); +#endif + + 
COMPILE_TIME_ASSERT(lzo_sizeof_dict_t == sizeof(lzo_dict_t)); + +#if defined(__LZO_IN_MINLZO) + if (r == 1) + { + lzo_uint32 adler; + adler = lzo_adler32(0, NULL, 0); + adler = lzo_adler32(adler, lzo_copyright(), 200); + r &= __lzo_assert(adler == 0xc76f1751L); + } +#endif + + if (r == 1) + { + r &= __lzo_assert(!schedule_insns_bug()); + } + + if (r == 1) + { + static int x[3]; + static unsigned xn = 3; + register unsigned j; + + for (j = 0; j < xn; j++) + x[j] = (int)j - 3; + r &= __lzo_assert(!strength_reduce_bug(x)); + } + + if (r == 1) + { + r &= ptr_check(); + } + + return r == 1 ? LZO_E_OK : LZO_E_ERROR; +} + +static lzo_bool schedule_insns_bug(void) +{ +#if defined(__LZO_CHECKER) + return 0; +#else + const int clone[] = {1, 2, 0}; + const int *q; + q = clone; + return (*q) ? 0 : 1; +#endif +} + +static lzo_bool strength_reduce_bug(int *x) +{ + return x[0] != -3 || x[1] != -2 || x[2] != -1; +} + +#undef COMPILE_TIME_ASSERT + +LZO_PUBLIC(int) +__lzo_init2(unsigned v, int s1, int s2, int s3, int s4, int s5, + int s6, int s7, int s8, int s9) +{ + int r; + + if (v == 0) + return LZO_E_ERROR; + + r = (s1 == -1 || s1 == (int) sizeof(short)) && + (s2 == -1 || s2 == (int) sizeof(int)) && + (s3 == -1 || s3 == (int) sizeof(long)) && + (s4 == -1 || s4 == (int) sizeof(lzo_uint32)) && + (s5 == -1 || s5 == (int) sizeof(lzo_uint)) && + (s6 == -1 || s6 == (int) lzo_sizeof_dict_t) && + (s7 == -1 || s7 == (int) sizeof(char *)) && + (s8 == -1 || s8 == (int) sizeof(lzo_voidp)) && + (s9 == -1 || s9 == (int) sizeof(lzo_compress_t)); + if (!r) + return LZO_E_ERROR; + + r = _lzo_config_check(); + if (r != LZO_E_OK) + return r; + + return r; +} + +#if !defined(__LZO_IN_MINILZO) + +LZO_EXTERN(int) +__lzo_init(unsigned v,int s1,int s2,int s3,int s4,int s5,int s6,int s7); + +LZO_PUBLIC(int) +__lzo_init(unsigned v,int s1,int s2,int s3,int s4,int s5,int s6,int s7) +{ + if (v == 0 || v > 0x1010) + return LZO_E_ERROR; + return __lzo_init2(v,s1,s2,s3,s4,s5,-1,-1,s6,s7); +} + +#endif + 
+#define do_compress _lzo1x_1_do_compress + +#define LZO_NEED_DICT_H +#define D_BITS 14 +#define D_INDEX1(d,p) d = DM((0x21*DX3(p,5,5,6)) >> 5) +#define D_INDEX2(d,p) d = (d & (D_MASK & 0x7ff)) ^ (D_HIGH | 0x1f) + +#ifndef __LZO_CONFIG1X_H +#define __LZO_CONFIG1X_H + +#if !defined(LZO1X) && !defined(LZO1Y) && !defined(LZO1Z) +# define LZO1X +#endif + +#if !defined(__LZO_IN_MINILZO) +#include +#endif + +#define LZO_EOF_CODE +#undef LZO_DETERMINISTIC + +#define M1_MAX_OFFSET 0x0400 +#ifndef M2_MAX_OFFSET +#define M2_MAX_OFFSET 0x0800 +#endif +#define M3_MAX_OFFSET 0x4000 +#define M4_MAX_OFFSET 0xbfff + +#define MX_MAX_OFFSET (M1_MAX_OFFSET + M2_MAX_OFFSET) + +#define M1_MIN_LEN 2 +#define M1_MAX_LEN 2 +#define M2_MIN_LEN 3 +#ifndef M2_MAX_LEN +#define M2_MAX_LEN 8 +#endif +#define M3_MIN_LEN 3 +#define M3_MAX_LEN 33 +#define M4_MIN_LEN 3 +#define M4_MAX_LEN 9 + +#define M1_MARKER 0 +#define M2_MARKER 64 +#define M3_MARKER 32 +#define M4_MARKER 16 + +#ifndef MIN_LOOKAHEAD +#define MIN_LOOKAHEAD (M2_MAX_LEN + 1) +#endif + +#if defined(LZO_NEED_DICT_H) + +#ifndef LZO_HASH +#define LZO_HASH LZO_HASH_LZO_INCREMENTAL_B +#endif +#define DL_MIN_LEN M2_MIN_LEN + +#ifndef __LZO_DICT_H +#define __LZO_DICT_H + +#ifdef __cplusplus +extern "C" { +#endif + +#if !defined(D_BITS) && defined(DBITS) +# define D_BITS DBITS +#endif +#if !defined(D_BITS) +# error "D_BITS is not defined" +#endif +#if (D_BITS < 16) +# define D_SIZE LZO_SIZE(D_BITS) +# define D_MASK LZO_MASK(D_BITS) +#else +# define D_SIZE LZO_USIZE(D_BITS) +# define D_MASK LZO_UMASK(D_BITS) +#endif +#define D_HIGH ((D_MASK >> 1) + 1) + +#if !defined(DD_BITS) +# define DD_BITS 0 +#endif +#define DD_SIZE LZO_SIZE(DD_BITS) +#define DD_MASK LZO_MASK(DD_BITS) + +#if !defined(DL_BITS) +# define DL_BITS (D_BITS - DD_BITS) +#endif +#if (DL_BITS < 16) +# define DL_SIZE LZO_SIZE(DL_BITS) +# define DL_MASK LZO_MASK(DL_BITS) +#else +# define DL_SIZE LZO_USIZE(DL_BITS) +# define DL_MASK LZO_UMASK(DL_BITS) +#endif + +#if (D_BITS != 
DL_BITS + DD_BITS) +# error "D_BITS does not match" +#endif +#if (D_BITS < 8 || D_BITS > 18) +# error "invalid D_BITS" +#endif +#if (DL_BITS < 8 || DL_BITS > 20) +# error "invalid DL_BITS" +#endif +#if (DD_BITS < 0 || DD_BITS > 6) +# error "invalid DD_BITS" +#endif + +#if !defined(DL_MIN_LEN) +# define DL_MIN_LEN 3 +#endif +#if !defined(DL_SHIFT) +# define DL_SHIFT ((DL_BITS + (DL_MIN_LEN - 1)) / DL_MIN_LEN) +#endif + +#define LZO_HASH_GZIP 1 +#define LZO_HASH_GZIP_INCREMENTAL 2 +#define LZO_HASH_LZO_INCREMENTAL_A 3 +#define LZO_HASH_LZO_INCREMENTAL_B 4 + +#if !defined(LZO_HASH) +# error "choose a hashing strategy" +#endif + +#if (DL_MIN_LEN == 3) +# define _DV2_A(p,shift1,shift2) \ + (((( (lzo_uint32)((p)[0]) << shift1) ^ (p)[1]) << shift2) ^ (p)[2]) +# define _DV2_B(p,shift1,shift2) \ + (((( (lzo_uint32)((p)[2]) << shift1) ^ (p)[1]) << shift2) ^ (p)[0]) +# define _DV3_B(p,shift1,shift2,shift3) \ + ((_DV2_B((p)+1,shift1,shift2) << (shift3)) ^ (p)[0]) +#elif (DL_MIN_LEN == 2) +# define _DV2_A(p,shift1,shift2) \ + (( (lzo_uint32)(p[0]) << shift1) ^ p[1]) +# define _DV2_B(p,shift1,shift2) \ + (( (lzo_uint32)(p[1]) << shift1) ^ p[2]) +#else +# error "invalid DL_MIN_LEN" +#endif +#define _DV_A(p,shift) _DV2_A(p,shift,shift) +#define _DV_B(p,shift) _DV2_B(p,shift,shift) +#define DA2(p,s1,s2) \ + (((((lzo_uint32)((p)[2]) << (s2)) + (p)[1]) << (s1)) + (p)[0]) +#define DS2(p,s1,s2) \ + (((((lzo_uint32)((p)[2]) << (s2)) - (p)[1]) << (s1)) - (p)[0]) +#define DX2(p,s1,s2) \ + (((((lzo_uint32)((p)[2]) << (s2)) ^ (p)[1]) << (s1)) ^ (p)[0]) +#define DA3(p,s1,s2,s3) ((DA2((p)+1,s2,s3) << (s1)) + (p)[0]) +#define DS3(p,s1,s2,s3) ((DS2((p)+1,s2,s3) << (s1)) - (p)[0]) +#define DX3(p,s1,s2,s3) ((DX2((p)+1,s2,s3) << (s1)) ^ (p)[0]) +#define DMS(v,s) ((lzo_uint) (((v) & (D_MASK >> (s))) << (s))) +#define DM(v) DMS(v,0) + +#if (LZO_HASH == LZO_HASH_GZIP) +# define _DINDEX(dv,p) (_DV_A((p),DL_SHIFT)) + +#elif (LZO_HASH == LZO_HASH_GZIP_INCREMENTAL) +# define __LZO_HASH_INCREMENTAL +# 
define DVAL_FIRST(dv,p) dv = _DV_A((p),DL_SHIFT) +# define DVAL_NEXT(dv,p) dv = (((dv) << DL_SHIFT) ^ p[2]) +# define _DINDEX(dv,p) (dv) +# define DVAL_LOOKAHEAD DL_MIN_LEN + +#elif (LZO_HASH == LZO_HASH_LZO_INCREMENTAL_A) +# define __LZO_HASH_INCREMENTAL +# define DVAL_FIRST(dv,p) dv = _DV_A((p),5) +# define DVAL_NEXT(dv,p) \ + dv ^= (lzo_uint32)(p[-1]) << (2*5); dv = (((dv) << 5) ^ p[2]) +# define _DINDEX(dv,p) ((0x9f5f * (dv)) >> 5) +# define DVAL_LOOKAHEAD DL_MIN_LEN + +#elif (LZO_HASH == LZO_HASH_LZO_INCREMENTAL_B) +# define __LZO_HASH_INCREMENTAL +# define DVAL_FIRST(dv,p) dv = _DV_B((p),5) +# define DVAL_NEXT(dv,p) \ + dv ^= p[-1]; dv = (((dv) >> 5) ^ ((lzo_uint32)(p[2]) << (2*5))) +# define _DINDEX(dv,p) ((0x9f5f * (dv)) >> 5) +# define DVAL_LOOKAHEAD DL_MIN_LEN + +#else +# error "choose a hashing strategy" +#endif + +#ifndef DINDEX +#define DINDEX(dv,p) ((lzo_uint)((_DINDEX(dv,p)) & DL_MASK) << DD_BITS) +#endif +#if !defined(DINDEX1) && defined(D_INDEX1) +#define DINDEX1 D_INDEX1 +#endif +#if !defined(DINDEX2) && defined(D_INDEX2) +#define DINDEX2 D_INDEX2 +#endif + +#if !defined(__LZO_HASH_INCREMENTAL) +# define DVAL_FIRST(dv,p) ((void) 0) +# define DVAL_NEXT(dv,p) ((void) 0) +# define DVAL_LOOKAHEAD 0 +#endif + +#if !defined(DVAL_ASSERT) +#if defined(__LZO_HASH_INCREMENTAL) && !defined(NDEBUG) +static void DVAL_ASSERT(lzo_uint32 dv, const lzo_byte *p) +{ + lzo_uint32 df; + DVAL_FIRST(df,(p)); + assert(DINDEX(dv,p) == DINDEX(df,p)); +} +#else +# define DVAL_ASSERT(dv,p) ((void) 0) +#endif +#endif + +#if defined(LZO_DICT_USE_PTR) +# define DENTRY(p,in) (p) +# define GINDEX(m_pos,m_off,dict,dindex,in) m_pos = dict[dindex] +#else +# define DENTRY(p,in) ((lzo_uint) ((p)-(in))) +# define GINDEX(m_pos,m_off,dict,dindex,in) m_off = dict[dindex] +#endif + +#if (DD_BITS == 0) + +# define UPDATE_D(dict,drun,dv,p,in) dict[ DINDEX(dv,p) ] = DENTRY(p,in) +# define UPDATE_I(dict,drun,index,p,in) dict[index] = DENTRY(p,in) +# define UPDATE_P(ptr,drun,p,in) (ptr)[0] = 
DENTRY(p,in) + +#else + +# define UPDATE_D(dict,drun,dv,p,in) \ + dict[ DINDEX(dv,p) + drun++ ] = DENTRY(p,in); drun &= DD_MASK +# define UPDATE_I(dict,drun,index,p,in) \ + dict[ (index) + drun++ ] = DENTRY(p,in); drun &= DD_MASK +# define UPDATE_P(ptr,drun,p,in) \ + (ptr) [ drun++ ] = DENTRY(p,in); drun &= DD_MASK + +#endif + +#if defined(LZO_DICT_USE_PTR) + +#define LZO_CHECK_MPOS_DET(m_pos,m_off,in,ip,max_offset) \ + (m_pos == NULL || (m_off = (lzo_moff_t) (ip - m_pos)) > max_offset) + +#define LZO_CHECK_MPOS_NON_DET(m_pos,m_off,in,ip,max_offset) \ + (BOUNDS_CHECKING_OFF_IN_EXPR( \ + (PTR_LT(m_pos,in) || \ + (m_off = (lzo_moff_t) PTR_DIFF(ip,m_pos)) <= 0 || \ + m_off > max_offset) )) + +#else + +#define LZO_CHECK_MPOS_DET(m_pos,m_off,in,ip,max_offset) \ + (m_off == 0 || \ + ((m_off = (lzo_moff_t) ((ip)-(in)) - m_off) > max_offset) || \ + (m_pos = (ip) - (m_off), 0) ) + +#define LZO_CHECK_MPOS_NON_DET(m_pos,m_off,in,ip,max_offset) \ + ((lzo_moff_t) ((ip)-(in)) <= m_off || \ + ((m_off = (lzo_moff_t) ((ip)-(in)) - m_off) > max_offset) || \ + (m_pos = (ip) - (m_off), 0) ) + +#endif + +#if defined(LZO_DETERMINISTIC) +# define LZO_CHECK_MPOS LZO_CHECK_MPOS_DET +#else +# define LZO_CHECK_MPOS LZO_CHECK_MPOS_NON_DET +#endif + +#ifdef __cplusplus +} +#endif + +#endif + +#endif + +#endif + +#define DO_COMPRESS lzo1x_1_compress + +static +lzo_uint do_compress ( const lzo_byte *in , lzo_uint in_len, + lzo_byte *out, lzo_uintp out_len, + lzo_voidp wrkmem ) +{ +#if 0 && defined(__GNUC__) && defined(__i386__) + register const lzo_byte *ip __asm__("%esi"); +#else + register const lzo_byte *ip; +#endif + lzo_byte *op; + const lzo_byte * const in_end = in + in_len; + const lzo_byte * const ip_end = in + in_len - M2_MAX_LEN - 5; + const lzo_byte *ii; + lzo_dict_p const dict = (lzo_dict_p) wrkmem; + + op = out; + ip = in; + ii = ip; + + ip += 4; + for (;;) + { +#if 0 && defined(__GNUC__) && defined(__i386__) + register const lzo_byte *m_pos __asm__("%edi"); +#else + register const 
lzo_byte *m_pos; +#endif + lzo_moff_t m_off; + lzo_uint m_len; + lzo_uint dindex; + + DINDEX1(dindex,ip); + GINDEX(m_pos,m_off,dict,dindex,in); + if (LZO_CHECK_MPOS_NON_DET(m_pos,m_off,in,ip,M4_MAX_OFFSET)) + goto literal; +#if 1 + if (m_off <= M2_MAX_OFFSET || m_pos[3] == ip[3]) + goto try_match; + DINDEX2(dindex,ip); +#endif + GINDEX(m_pos,m_off,dict,dindex,in); + if (LZO_CHECK_MPOS_NON_DET(m_pos,m_off,in,ip,M4_MAX_OFFSET)) + goto literal; + if (m_off <= M2_MAX_OFFSET || m_pos[3] == ip[3]) + goto try_match; + goto literal; + +try_match: +#if 1 && defined(LZO_UNALIGNED_OK_2) + if (* (const lzo_ushortp) m_pos != * (const lzo_ushortp) ip) +#else + if (m_pos[0] != ip[0] || m_pos[1] != ip[1]) +#endif + { + } + else + { + if (m_pos[2] == ip[2]) + { +#if 0 + if (m_off <= M2_MAX_OFFSET) + goto match; + if (lit <= 3) + goto match; + if (lit == 3) + { + assert(op - 2 > out); op[-2] |= LZO_BYTE(3); + *op++ = *ii++; *op++ = *ii++; *op++ = *ii++; + goto code_match; + } + if (m_pos[3] == ip[3]) +#endif + goto match; + } + else + { +#if 0 +#if 0 + if (m_off <= M1_MAX_OFFSET && lit > 0 && lit <= 3) +#else + if (m_off <= M1_MAX_OFFSET && lit == 3) +#endif + { + register lzo_uint t; + + t = lit; + assert(op - 2 > out); op[-2] |= LZO_BYTE(t); + do *op++ = *ii++; while (--t > 0); + assert(ii == ip); + m_off -= 1; + *op++ = LZO_BYTE(M1_MARKER | ((m_off & 3) << 2)); + *op++ = LZO_BYTE(m_off >> 2); + ip += 2; + goto match_done; + } +#endif + } + } + +literal: + UPDATE_I(dict,0,dindex,ip,in); + ++ip; + if (ip >= ip_end) + break; + continue; + +match: + UPDATE_I(dict,0,dindex,ip,in); + if (pd(ip,ii) > 0) + { + register lzo_uint t = pd(ip,ii); + + if (t <= 3) + { + assert("lzo-04", op - 2 > out); + op[-2] |= LZO_BYTE(t); + } + else if (t <= 18) + *op++ = LZO_BYTE(t - 3); + else + { + register lzo_uint tt = t - 18; + + *op++ = 0; + while (tt > 255) + { + tt -= 255; + *op++ = 0; + } + assert("lzo-05", tt > 0); + *op++ = LZO_BYTE(tt); + } + do *op++ = *ii++; while (--t > 0); + } + + 
assert("lzo-06", ii == ip); + ip += 3; + if (m_pos[3] != *ip++ || m_pos[4] != *ip++ || m_pos[5] != *ip++ || + m_pos[6] != *ip++ || m_pos[7] != *ip++ || m_pos[8] != *ip++ +#ifdef LZO1Y + || m_pos[ 9] != *ip++ || m_pos[10] != *ip++ || m_pos[11] != *ip++ + || m_pos[12] != *ip++ || m_pos[13] != *ip++ || m_pos[14] != *ip++ +#endif + ) + { + --ip; + m_len = ip - ii; + assert("lzo-07", m_len >= 3); assert("lzo-08", m_len <= M2_MAX_LEN); + + if (m_off <= M2_MAX_OFFSET) + { + m_off -= 1; +#if defined(LZO1X) + *op++ = LZO_BYTE(((m_len - 1) << 5) | ((m_off & 7) << 2)); + *op++ = LZO_BYTE(m_off >> 3); +#elif defined(LZO1Y) + *op++ = LZO_BYTE(((m_len + 1) << 4) | ((m_off & 3) << 2)); + *op++ = LZO_BYTE(m_off >> 2); +#endif + } + else if (m_off <= M3_MAX_OFFSET) + { + m_off -= 1; + *op++ = LZO_BYTE(M3_MARKER | (m_len - 2)); + goto m3_m4_offset; + } + else +#if defined(LZO1X) + { + m_off -= 0x4000; + assert("lzo-09", m_off > 0); assert("lzo-10", m_off <= 0x7fff); + *op++ = LZO_BYTE(M4_MARKER | + ((m_off & 0x4000) >> 11) | (m_len - 2)); + goto m3_m4_offset; + } +#elif defined(LZO1Y) + goto m4_match; +#endif + } + else + { + { + const lzo_byte *end = in_end; + const lzo_byte *m = m_pos + M2_MAX_LEN + 1; + while (ip < end && *m == *ip) + m++, ip++; + m_len = (ip - ii); + } + assert("lzo-11", m_len > M2_MAX_LEN); + + if (m_off <= M3_MAX_OFFSET) + { + m_off -= 1; + if (m_len <= 33) + *op++ = LZO_BYTE(M3_MARKER | (m_len - 2)); + else + { + m_len -= 33; + *op++ = M3_MARKER | 0; + goto m3_m4_len; + } + } + else + { +#if defined(LZO1Y) +m4_match: +#endif + m_off -= 0x4000; + assert("lzo-12", m_off > 0); assert("lzo-13", m_off <= 0x7fff); + if (m_len <= M4_MAX_LEN) + *op++ = LZO_BYTE(M4_MARKER | + ((m_off & 0x4000) >> 11) | (m_len - 2)); + else + { + m_len -= M4_MAX_LEN; + *op++ = LZO_BYTE(M4_MARKER | ((m_off & 0x4000) >> 11)); +m3_m4_len: + while (m_len > 255) + { + m_len -= 255; + *op++ = 0; + } + assert("lzo-14", m_len > 0); + *op++ = LZO_BYTE(m_len); + } + } + +m3_m4_offset: + *op++ = 
LZO_BYTE((m_off & 63) << 2); + *op++ = LZO_BYTE(m_off >> 6); + } + +#if 0 +match_done: +#endif + ii = ip; + if (ip >= ip_end) + break; + } + + *out_len = op - out; + return pd(in_end,ii); +} + +LZO_PUBLIC(int) +DO_COMPRESS ( const lzo_byte *in , lzo_uint in_len, + lzo_byte *out, lzo_uintp out_len, + lzo_voidp wrkmem ) +{ + lzo_byte *op = out; + lzo_uint t; + +#if defined(__LZO_QUERY_COMPRESS) + if (__LZO_IS_COMPRESS_QUERY(in,in_len,out,out_len,wrkmem)) + return __LZO_QUERY_COMPRESS(in,in_len,out,out_len,wrkmem,D_SIZE,lzo_sizeof(lzo_dict_t)); +#endif + + if (in_len <= M2_MAX_LEN + 5) + t = in_len; + else + { + t = do_compress(in,in_len,op,out_len,wrkmem); + op += *out_len; + } + + if (t > 0) + { + const lzo_byte *ii = in + in_len - t; + + if (op == out && t <= 238) + *op++ = LZO_BYTE(17 + t); + else if (t <= 3) + op[-2] |= LZO_BYTE(t); + else if (t <= 18) + *op++ = LZO_BYTE(t - 3); + else + { + lzo_uint tt = t - 18; + + *op++ = 0; + while (tt > 255) + { + tt -= 255; + *op++ = 0; + } + assert("lzo-15", tt > 0); + *op++ = LZO_BYTE(tt); + } + do *op++ = *ii++; while (--t > 0); + } + + *op++ = M4_MARKER | 1; + *op++ = 0; + *op++ = 0; + + *out_len = op - out; + return LZO_E_OK; +} + +#undef do_compress +#undef DO_COMPRESS +#undef LZO_HASH + +#undef LZO_TEST_DECOMPRESS_OVERRUN +#undef LZO_TEST_DECOMPRESS_OVERRUN_INPUT +#undef LZO_TEST_DECOMPRESS_OVERRUN_OUTPUT +#undef LZO_TEST_DECOMPRESS_OVERRUN_LOOKBEHIND +#undef DO_DECOMPRESS +#define DO_DECOMPRESS lzo1x_decompress + +#if defined(LZO_TEST_DECOMPRESS_OVERRUN) +# if !defined(LZO_TEST_DECOMPRESS_OVERRUN_INPUT) +# define LZO_TEST_DECOMPRESS_OVERRUN_INPUT 2 +# endif +# if !defined(LZO_TEST_DECOMPRESS_OVERRUN_OUTPUT) +# define LZO_TEST_DECOMPRESS_OVERRUN_OUTPUT 2 +# endif +# if !defined(LZO_TEST_DECOMPRESS_OVERRUN_LOOKBEHIND) +# define LZO_TEST_DECOMPRESS_OVERRUN_LOOKBEHIND +# endif +#endif + +#undef TEST_IP +#undef TEST_OP +#undef TEST_LOOKBEHIND +#undef NEED_IP +#undef NEED_OP +#undef HAVE_TEST_IP +#undef HAVE_TEST_OP 
+#undef HAVE_NEED_IP +#undef HAVE_NEED_OP +#undef HAVE_ANY_IP +#undef HAVE_ANY_OP + +#if defined(LZO_TEST_DECOMPRESS_OVERRUN_INPUT) +# if (LZO_TEST_DECOMPRESS_OVERRUN_INPUT >= 1) +# define TEST_IP (ip < ip_end) +# endif +# if (LZO_TEST_DECOMPRESS_OVERRUN_INPUT >= 2) +# define NEED_IP(x) \ + if ((lzo_uint)(ip_end - ip) < (lzo_uint)(x)) goto input_overrun +# endif +#endif + +#if defined(LZO_TEST_DECOMPRESS_OVERRUN_OUTPUT) +# if (LZO_TEST_DECOMPRESS_OVERRUN_OUTPUT >= 1) +# define TEST_OP (op <= op_end) +# endif +# if (LZO_TEST_DECOMPRESS_OVERRUN_OUTPUT >= 2) +# undef TEST_OP +# define NEED_OP(x) \ + if ((lzo_uint)(op_end - op) < (lzo_uint)(x)) goto output_overrun +# endif +#endif + +#if defined(LZO_TEST_DECOMPRESS_OVERRUN_LOOKBEHIND) +# define TEST_LOOKBEHIND(m_pos,out) if (m_pos < out) goto lookbehind_overrun +#else +# define TEST_LOOKBEHIND(m_pos,op) ((void) 0) +#endif + +#if !defined(LZO_EOF_CODE) && !defined(TEST_IP) +# define TEST_IP (ip < ip_end) +#endif + +#if defined(TEST_IP) +# define HAVE_TEST_IP +#else +# define TEST_IP 1 +#endif +#if defined(TEST_OP) +# define HAVE_TEST_OP +#else +# define TEST_OP 1 +#endif + +#if defined(NEED_IP) +# define HAVE_NEED_IP +#else +# define NEED_IP(x) ((void) 0) +#endif +#if defined(NEED_OP) +# define HAVE_NEED_OP +#else +# define NEED_OP(x) ((void) 0) +#endif + +#if defined(HAVE_TEST_IP) || defined(HAVE_NEED_IP) +# define HAVE_ANY_IP +#endif +#if defined(HAVE_TEST_OP) || defined(HAVE_NEED_OP) +# define HAVE_ANY_OP +#endif + +#undef __COPY4 +#define __COPY4(dst,src) * (lzo_uint32p)(dst) = * (const lzo_uint32p)(src) + +#undef COPY4 +#if defined(LZO_UNALIGNED_OK_4) +# define COPY4(dst,src) __COPY4(dst,src) +#elif defined(LZO_ALIGNED_OK_4) +# define COPY4(dst,src) __COPY4((lzo_ptr_t)(dst),(lzo_ptr_t)(src)) +#endif + +#if defined(DO_DECOMPRESS) +LZO_PUBLIC(int) +DO_DECOMPRESS ( const lzo_byte *in , lzo_uint in_len, + lzo_byte *out, lzo_uintp out_len, + lzo_voidp wrkmem ) +#endif +{ + register lzo_byte *op; + register const 
lzo_byte *ip; + register lzo_uint t; +#if defined(COPY_DICT) + lzo_uint m_off; + const lzo_byte *dict_end; +#else + register const lzo_byte *m_pos; +#endif + + const lzo_byte * const ip_end = in + in_len; +#if defined(HAVE_ANY_OP) + lzo_byte * const op_end = out + *out_len; +#endif +#if defined(LZO1Z) + lzo_uint last_m_off = 0; +#endif + + LZO_UNUSED(wrkmem); + +#if defined(__LZO_QUERY_DECOMPRESS) + if (__LZO_IS_DECOMPRESS_QUERY(in,in_len,out,out_len,wrkmem)) + return __LZO_QUERY_DECOMPRESS(in,in_len,out,out_len,wrkmem,0,0); +#endif + +#if defined(COPY_DICT) + if (dict) + { + if (dict_len > M4_MAX_OFFSET) + { + dict += dict_len - M4_MAX_OFFSET; + dict_len = M4_MAX_OFFSET; + } + dict_end = dict + dict_len; + } + else + { + dict_len = 0; + dict_end = NULL; + } +#endif + + *out_len = 0; + + op = out; + ip = in; + + if (*ip > 17) + { + t = *ip++ - 17; + if (t < 4) + goto match_next; + assert("lzo-16", t > 0); NEED_OP(t); NEED_IP(t+1); + do *op++ = *ip++; while (--t > 0); + goto first_literal_run; + } + + while (TEST_IP && TEST_OP) + { + t = *ip++; + if (t >= 16) + goto match; + if (t == 0) + { + NEED_IP(1); + while (*ip == 0) + { + t += 255; + ip++; + NEED_IP(1); + } + t += 15 + *ip++; + } + assert("lzo-17", t > 0); NEED_OP(t+3); NEED_IP(t+4); +#if defined(LZO_UNALIGNED_OK_4) || defined(LZO_ALIGNED_OK_4) +#if !defined(LZO_UNALIGNED_OK_4) + if (PTR_ALIGNED2_4(op,ip)) + { +#endif + COPY4(op,ip); + op += 4; ip += 4; + if (--t > 0) + { + if (t >= 4) + { + do { + COPY4(op,ip); + op += 4; ip += 4; t -= 4; + } while (t >= 4); + if (t > 0) do *op++ = *ip++; while (--t > 0); + } + else + do *op++ = *ip++; while (--t > 0); + } +#if !defined(LZO_UNALIGNED_OK_4) + } + else +#endif +#endif +#if !defined(LZO_UNALIGNED_OK_4) + { + *op++ = *ip++; *op++ = *ip++; *op++ = *ip++; + do *op++ = *ip++; while (--t > 0); + } +#endif + +first_literal_run: + + t = *ip++; + if (t >= 16) + goto match; +#if defined(COPY_DICT) +#if defined(LZO1Z) + m_off = (1 + M2_MAX_OFFSET) + (t << 6) + (*ip++ >> 
2); + last_m_off = m_off; +#else + m_off = (1 + M2_MAX_OFFSET) + (t >> 2) + (*ip++ << 2); +#endif + NEED_OP(3); + t = 3; COPY_DICT(t,m_off) +#else +#if defined(LZO1Z) + t = (1 + M2_MAX_OFFSET) + (t << 6) + (*ip++ >> 2); + m_pos = op - t; + last_m_off = t; +#else + m_pos = op - (1 + M2_MAX_OFFSET); + m_pos -= t >> 2; + m_pos -= *ip++ << 2; +#endif + TEST_LOOKBEHIND(m_pos,out); NEED_OP(3); + *op++ = *m_pos++; *op++ = *m_pos++; *op++ = *m_pos; +#endif + goto match_done; + + while (TEST_IP && TEST_OP) + { +match: + if (t >= 64) + { +#if defined(COPY_DICT) +#if defined(LZO1X) + m_off = 1 + ((t >> 2) & 7) + (*ip++ << 3); + t = (t >> 5) - 1; +#elif defined(LZO1Y) + m_off = 1 + ((t >> 2) & 3) + (*ip++ << 2); + t = (t >> 4) - 3; +#elif defined(LZO1Z) + m_off = t & 0x1f; + if (m_off >= 0x1c) + m_off = last_m_off; + else + { + m_off = 1 + (m_off << 6) + (*ip++ >> 2); + last_m_off = m_off; + } + t = (t >> 5) - 1; +#endif +#else +#if defined(LZO1X) + m_pos = op - 1; + m_pos -= (t >> 2) & 7; + m_pos -= *ip++ << 3; + t = (t >> 5) - 1; +#elif defined(LZO1Y) + m_pos = op - 1; + m_pos -= (t >> 2) & 3; + m_pos -= *ip++ << 2; + t = (t >> 4) - 3; +#elif defined(LZO1Z) + { + lzo_uint off = t & 0x1f; + m_pos = op; + if (off >= 0x1c) + { + assert(last_m_off > 0); + m_pos -= last_m_off; + } + else + { + off = 1 + (off << 6) + (*ip++ >> 2); + m_pos -= off; + last_m_off = off; + } + } + t = (t >> 5) - 1; +#endif + TEST_LOOKBEHIND(m_pos,out); assert("lzo-18", t > 0); NEED_OP(t+3-1); + goto copy_match; +#endif + } + else if (t >= 32) + { + t &= 31; + if (t == 0) + { + NEED_IP(1); + while (*ip == 0) + { + t += 255; + ip++; + NEED_IP(1); + } + t += 31 + *ip++; + } +#if defined(COPY_DICT) +#if defined(LZO1Z) + m_off = 1 + (ip[0] << 6) + (ip[1] >> 2); + last_m_off = m_off; +#else + m_off = 1 + (ip[0] >> 2) + (ip[1] << 6); +#endif +#else +#if defined(LZO1Z) + { + lzo_uint off = 1 + (ip[0] << 6) + (ip[1] >> 2); + m_pos = op - off; + last_m_off = off; + } +#elif defined(LZO_UNALIGNED_OK_2) && 
(LZO_BYTE_ORDER == LZO_LITTLE_ENDIAN) + m_pos = op - 1; + m_pos -= (* (const lzo_ushortp) ip) >> 2; +#else + m_pos = op - 1; + m_pos -= (ip[0] >> 2) + (ip[1] << 6); +#endif +#endif + ip += 2; + } + else if (t >= 16) + { +#if defined(COPY_DICT) + m_off = (t & 8) << 11; +#else + m_pos = op; + m_pos -= (t & 8) << 11; +#endif + t &= 7; + if (t == 0) + { + NEED_IP(1); + while (*ip == 0) + { + t += 255; + ip++; + NEED_IP(1); + } + t += 7 + *ip++; + } +#if defined(COPY_DICT) +#if defined(LZO1Z) + m_off += (ip[0] << 6) + (ip[1] >> 2); +#else + m_off += (ip[0] >> 2) + (ip[1] << 6); +#endif + ip += 2; + if (m_off == 0) + goto eof_found; + m_off += 0x4000; +#if defined(LZO1Z) + last_m_off = m_off; +#endif +#else +#if defined(LZO1Z) + m_pos -= (ip[0] << 6) + (ip[1] >> 2); +#elif defined(LZO_UNALIGNED_OK_2) && (LZO_BYTE_ORDER == LZO_LITTLE_ENDIAN) + m_pos -= (* (const lzo_ushortp) ip) >> 2; +#else + m_pos -= (ip[0] >> 2) + (ip[1] << 6); +#endif + ip += 2; + if (m_pos == op) + goto eof_found; + m_pos -= 0x4000; +#if defined(LZO1Z) + last_m_off = op - m_pos; +#endif +#endif + } + else + { +#if defined(COPY_DICT) +#if defined(LZO1Z) + m_off = 1 + (t << 6) + (*ip++ >> 2); + last_m_off = m_off; +#else + m_off = 1 + (t >> 2) + (*ip++ << 2); +#endif + NEED_OP(2); + t = 2; COPY_DICT(t,m_off) +#else +#if defined(LZO1Z) + t = 1 + (t << 6) + (*ip++ >> 2); + m_pos = op - t; + last_m_off = t; +#else + m_pos = op - 1; + m_pos -= t >> 2; + m_pos -= *ip++ << 2; +#endif + TEST_LOOKBEHIND(m_pos,out); NEED_OP(2); + *op++ = *m_pos++; *op++ = *m_pos; +#endif + goto match_done; + } + +#if defined(COPY_DICT) + + NEED_OP(t+3-1); + t += 3-1; COPY_DICT(t,m_off) + +#else + + TEST_LOOKBEHIND(m_pos,out); assert("lzo-19", t > 0); NEED_OP(t+3-1); +#if defined(LZO_UNALIGNED_OK_4) || defined(LZO_ALIGNED_OK_4) +#if !defined(LZO_UNALIGNED_OK_4) + if (t >= 2 * 4 - (3 - 1) && PTR_ALIGNED2_4(op,m_pos)) + { + assert((op - m_pos) >= 4); +#else + if (t >= 2 * 4 - (3 - 1) && (op - m_pos) >= 4) + { +#endif + 
COPY4(op,m_pos); + op += 4; m_pos += 4; t -= 4 - (3 - 1); + do { + COPY4(op,m_pos); + op += 4; m_pos += 4; t -= 4; + } while (t >= 4); + if (t > 0) do *op++ = *m_pos++; while (--t > 0); + } + else +#endif + { +copy_match: + *op++ = *m_pos++; *op++ = *m_pos++; + do *op++ = *m_pos++; while (--t > 0); + } + +#endif + +match_done: +#if defined(LZO1Z) + t = ip[-1] & 3; +#else + t = ip[-2] & 3; +#endif + if (t == 0) + break; + +match_next: + assert("lzo-20", t > 0); NEED_OP(t); NEED_IP(t+1); + do *op++ = *ip++; while (--t > 0); + t = *ip++; + } + } + +#if defined(HAVE_TEST_IP) || defined(HAVE_TEST_OP) + *out_len = op - out; + return LZO_E_EOF_NOT_FOUND; +#endif + +eof_found: + assert("lzo-21", t == 1); + *out_len = op - out; + return (ip == ip_end ? LZO_E_OK : + (ip < ip_end ? LZO_E_INPUT_NOT_CONSUMED : LZO_E_INPUT_OVERRUN)); + +#if defined(HAVE_NEED_IP) +input_overrun: + *out_len = op - out; + return LZO_E_INPUT_OVERRUN; +#endif + +#if defined(HAVE_NEED_OP) +output_overrun: + *out_len = op - out; + return LZO_E_OUTPUT_OVERRUN; +#endif + +#if defined(LZO_TEST_DECOMPRESS_OVERRUN_LOOKBEHIND) +lookbehind_overrun: + *out_len = op - out; + return LZO_E_LOOKBEHIND_OVERRUN; +#endif +} + +#define LZO_TEST_DECOMPRESS_OVERRUN +#undef DO_DECOMPRESS +#define DO_DECOMPRESS lzo1x_decompress_safe + +#if defined(LZO_TEST_DECOMPRESS_OVERRUN) +# if !defined(LZO_TEST_DECOMPRESS_OVERRUN_INPUT) +# define LZO_TEST_DECOMPRESS_OVERRUN_INPUT 2 +# endif +# if !defined(LZO_TEST_DECOMPRESS_OVERRUN_OUTPUT) +# define LZO_TEST_DECOMPRESS_OVERRUN_OUTPUT 2 +# endif +# if !defined(LZO_TEST_DECOMPRESS_OVERRUN_LOOKBEHIND) +# define LZO_TEST_DECOMPRESS_OVERRUN_LOOKBEHIND +# endif +#endif + +#undef TEST_IP +#undef TEST_OP +#undef TEST_LOOKBEHIND +#undef NEED_IP +#undef NEED_OP +#undef HAVE_TEST_IP +#undef HAVE_TEST_OP +#undef HAVE_NEED_IP +#undef HAVE_NEED_OP +#undef HAVE_ANY_IP +#undef HAVE_ANY_OP + +#if defined(LZO_TEST_DECOMPRESS_OVERRUN_INPUT) +# if (LZO_TEST_DECOMPRESS_OVERRUN_INPUT >= 1) +# define 
TEST_IP (ip < ip_end) +# endif +# if (LZO_TEST_DECOMPRESS_OVERRUN_INPUT >= 2) +# define NEED_IP(x) \ + if ((lzo_uint)(ip_end - ip) < (lzo_uint)(x)) goto input_overrun +# endif +#endif + +#if defined(LZO_TEST_DECOMPRESS_OVERRUN_OUTPUT) +# if (LZO_TEST_DECOMPRESS_OVERRUN_OUTPUT >= 1) +# define TEST_OP (op <= op_end) +# endif +# if (LZO_TEST_DECOMPRESS_OVERRUN_OUTPUT >= 2) +# undef TEST_OP +# define NEED_OP(x) \ + if ((lzo_uint)(op_end - op) < (lzo_uint)(x)) goto output_overrun +# endif +#endif + +#if defined(LZO_TEST_DECOMPRESS_OVERRUN_LOOKBEHIND) +# define TEST_LOOKBEHIND(m_pos,out) if (m_pos < out) goto lookbehind_overrun +#else +# define TEST_LOOKBEHIND(m_pos,op) ((void) 0) +#endif + +#if !defined(LZO_EOF_CODE) && !defined(TEST_IP) +# define TEST_IP (ip < ip_end) +#endif + +#if defined(TEST_IP) +# define HAVE_TEST_IP +#else +# define TEST_IP 1 +#endif +#if defined(TEST_OP) +# define HAVE_TEST_OP +#else +# define TEST_OP 1 +#endif + +#if defined(NEED_IP) +# define HAVE_NEED_IP +#else +# define NEED_IP(x) ((void) 0) +#endif +#if defined(NEED_OP) +# define HAVE_NEED_OP +#else +# define NEED_OP(x) ((void) 0) +#endif + +#if defined(HAVE_TEST_IP) || defined(HAVE_NEED_IP) +# define HAVE_ANY_IP +#endif +#if defined(HAVE_TEST_OP) || defined(HAVE_NEED_OP) +# define HAVE_ANY_OP +#endif + +#undef __COPY4 +#define __COPY4(dst,src) * (lzo_uint32p)(dst) = * (const lzo_uint32p)(src) + +#undef COPY4 +#if defined(LZO_UNALIGNED_OK_4) +# define COPY4(dst,src) __COPY4(dst,src) +#elif defined(LZO_ALIGNED_OK_4) +# define COPY4(dst,src) __COPY4((lzo_ptr_t)(dst),(lzo_ptr_t)(src)) +#endif + +/***** End of minilzo.c *****/ + diff -puN /dev/null fs/reiser4/plugin/compress/minilzo.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/compress/minilzo.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,100 @@ +/* minilzo.h -- mini subset of the LZO real-time data compression library + + This file is part of the LZO real-time data compression library. 
+ + Copyright (C) 2002 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 2001 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 2000 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 1999 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 1998 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 1997 Markus Franz Xaver Johannes Oberhumer + Copyright (C) 1996 Markus Franz Xaver Johannes Oberhumer + All Rights Reserved. + + The LZO library is free software; you can redistribute it and/or + modify it under the terms of the GNU General Public License as + published by the Free Software Foundation; either version 2 of + the License, or (at your option) any later version. + + The LZO library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with the LZO library; see the file COPYING. + If not, write to the Free Software Foundation, Inc., + 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + + Markus F.X.J. Oberhumer + + http://www.oberhumer.com/opensource/lzo/ + */ + +/* + * NOTE: + * the full LZO package can be found at + * http://www.oberhumer.com/opensource/lzo/ + */ + + +#ifndef __MINILZO_H +#define __MINILZO_H + +#define MINILZO_VERSION 0x1080 + +#ifdef __LZOCONF_H +# error "you cannot use both LZO and miniLZO" +#endif + +#undef LZO_HAVE_CONFIG_H +#include "lzoconf.h" + +#if !defined(LZO_VERSION) || (LZO_VERSION != MINILZO_VERSION) +# error "version mismatch in header files" +#endif + + +#ifdef __cplusplus +extern "C" { +#endif + + +/*********************************************************************** +// +************************************************************************/ + +/* Memory required for the wrkmem parameter. + * When the required size is 0, you can also pass a NULL pointer. 
+ */ + +#define LZO1X_MEM_COMPRESS LZO1X_1_MEM_COMPRESS +#define LZO1X_1_MEM_COMPRESS ((lzo_uint32) (16384L * lzo_sizeof_dict_t)) +#define LZO1X_MEM_DECOMPRESS (0) + + +/* compression */ +LZO_EXTERN(int) +lzo1x_1_compress ( const lzo_byte *src, lzo_uint src_len, + lzo_byte *dst, lzo_uintp dst_len, + lzo_voidp wrkmem ); + +/* decompression */ +LZO_EXTERN(int) +lzo1x_decompress ( const lzo_byte *src, lzo_uint src_len, + lzo_byte *dst, lzo_uintp dst_len, + lzo_voidp wrkmem /* NOT USED */ ); + +/* safe decompression with overrun testing */ +LZO_EXTERN(int) +lzo1x_decompress_safe ( const lzo_byte *src, lzo_uint src_len, + lzo_byte *dst, lzo_uintp dst_len, + lzo_voidp wrkmem /* NOT USED */ ); + + +#ifdef __cplusplus +} /* extern "C" */ +#endif + +#endif /* already included */ + diff -puN /dev/null fs/reiser4/plugin/cryptcompress.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/cryptcompress.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,3459 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README + +This file contains all cluster operations and methods of the reiser4 +cryptcompress object plugin (see http://www.namesys.com/cryptcompress_design.html +for details). 
+ +Cryptcompress specific fields of reiser4 inode/stat-data: + + Incore inode Disk stat-data +******************************************************************************************** +* data structure * field * data structure * field * +******************************************************************************************** +* plugin_set *file plugin id * reiser4_plugin_stat *file plugin id * +* *crypto plugin id * *crypto plugin id * +* *digest plugin id * *digest plugin id * +* *compression plugin id * *compression plugin id* +******************************************************************************************** +* crypto_stat_t * keysize * reiser4_crypto_stat * keysize * +* * keyid * * keyid * +******************************************************************************************** +* cluster_stat_t * cluster_shift * reiser4_cluster_stat * cluster_shift * +******************************************************************************************** +* cryptcompress_info_t * crypto_tfm * * * +******************************************************************************************** +*/ +#include "../debug.h" +#include "../inode.h" +#include "../jnode.h" +#include "../tree.h" +#include "../page_cache.h" +#include "../readahead.h" +#include "../forward.h" +#include "../super.h" +#include "../context.h" +#include "../cluster.h" +#include "../seal.h" +#include "../vfs_ops.h" +#include "plugin.h" +#include "object.h" +#include "../tree_walk.h" +#include "file/funcs.h" + +#include +#include +#include +#include +#include +#include +#include + +int do_readpage_ctail(reiser4_cluster_t *, struct page * page); +int ctail_read_cluster (reiser4_cluster_t *, struct inode *, int); +reiser4_key * append_cluster_key_ctail(const coord_t *, reiser4_key *); +int setattr_reserve(reiser4_tree *); +int writepage_ctail(struct page *); +int update_file_size(struct inode * inode, reiser4_key * key, int update_sd); +int cut_file_items(struct inode *inode, loff_t 
new_size, int update_sd, loff_t cur_size, + int (*update_actor)(struct inode *, reiser4_key *, int)); +int delete_object(struct inode *inode, int mode); +int ctail_insert_unprepped_cluster(reiser4_cluster_t * clust, struct inode * inode); +int hint_is_set(const hint_t *hint); +reiser4_plugin * get_default_plugin(pset_member memb); +void inode_check_scale_nolock(struct inode * inode, __u64 old, __u64 new); + +/* get cryptcompress specific portion of inode */ +reiser4_internal cryptcompress_info_t * +cryptcompress_inode_data(const struct inode * inode) +{ + return &reiser4_inode_data(inode)->file_plugin_data.cryptcompress_info; +} + +/* plugin->u.file.init_inode_data */ +reiser4_internal void +init_inode_data_cryptcompress(struct inode *inode, + reiser4_object_create_data *crd, int create) +{ + cryptcompress_info_t * data; + + data = cryptcompress_inode_data(inode); + assert("edward-685", data != NULL); + + memset(data, 0, sizeof (*data)); + + init_rwsem(&data->lock); + init_inode_ordering(inode, crd, create); +} + +#if REISER4_DEBUG +static int +crc_generic_check_ok(void) +{ + return MIN_CRYPTO_BLOCKSIZE == DC_CHECKSUM_SIZE << 1; +} + +reiser4_internal int +crc_inode_ok(struct inode * inode) +{ + reiser4_inode * info = reiser4_inode_data(inode); + cryptcompress_info_t * data = cryptcompress_inode_data(inode); + + if ((info->cluster_shift <= MAX_CLUSTER_SHIFT) && + (data->tfm[CRYPTO_TFM] == NULL) && + (data->tfm[DIGEST_TFM] == NULL)) + return 1; + assert("edward-686", 0); + return 0; +} +#endif + +static crypto_stat_t * inode_crypto_stat (struct inode * inode) +{ + assert("edward-90", inode != NULL); + assert("edward-91", reiser4_inode_data(inode) != NULL); + return (reiser4_inode_data(inode)->crypt); +} + +/* NOTE-EDWARD: Do not use crypto without digest */ +static int +alloc_crypto_tfm(struct inode * inode, crypto_data_t * data) +{ + int result; + crypto_plugin * cplug = crypto_plugin_by_id(data->cra); + digest_plugin * dplug = digest_plugin_by_id(data->dia); + + 
assert("edward-414", dplug != NULL); + assert("edward-415", cplug != NULL); + + result = dplug->alloc(inode); + if (result) + return result; + result = cplug->alloc(inode); + if (result) { + dplug->free(inode); + return result; + } + return 0; +} + +static void +free_crypto_tfm(struct inode * inode) +{ + reiser4_inode * info; + + assert("edward-410", inode != NULL); + + info = reiser4_inode_data(inode); + + if (!inode_get_crypto(inode)) + return; + + assert("edward-411", inode_crypto_plugin(inode)); + assert("edward-763", inode_digest_plugin(inode)); + + inode_crypto_plugin(inode)->free(inode); + inode_digest_plugin(inode)->free(inode); +} + +static int +attach_crypto_stat(struct inode * inode, crypto_data_t * data) +{ + __u8 * txt; + + crypto_stat_t * stat; + struct scatterlist sg; + struct crypto_tfm * dtfm; + + assert("edward-690", inode_get_crypto(inode)); + assert("edward-766", inode_get_digest(inode)); + + dtfm = inode_get_digest(inode); + + stat = reiser4_kmalloc(sizeof(*stat), GFP_KERNEL); + if (!stat) + return -ENOMEM; + + stat->keyid = reiser4_kmalloc((size_t)crypto_tfm_alg_digestsize(dtfm), GFP_KERNEL); + if (!stat->keyid) { + reiser4_kfree(stat); + return -ENOMEM; + } + txt = reiser4_kmalloc(data->keyid_size, GFP_KERNEL); + if (!txt) { + reiser4_kfree(stat->keyid); + reiser4_kfree(stat); + return -ENOMEM; + } + memcpy(txt, data->keyid, data->keyid_size); + sg.page = virt_to_page (txt); + sg.offset = offset_in_page (txt); + sg.length = data->keyid_size; + + crypto_digest_init (dtfm); + crypto_digest_update (dtfm, &sg, 1); + crypto_digest_final (dtfm, stat->keyid); + + reiser4_inode_data(inode)->crypt = stat; + reiser4_kfree(txt); + + return 0; +} + +static void +detach_crypto_stat(struct inode * object) +{ + crypto_stat_t * stat; + + stat = inode_crypto_stat(object); + + assert("edward-691", crc_inode_ok(object)); + + if (!inode_get_crypto(object)) + return; + + assert("edward-412", stat != NULL); + + reiser4_kfree(stat->keyid); + reiser4_kfree(stat); +} 
+ +static void +init_default_crypto(crypto_data_t * data) +{ + assert("edward-692", data != NULL); + + memset(data, 0, sizeof(*data)); + + data->cra = get_default_plugin(PSET_CRYPTO)->h.id; + data->dia = get_default_plugin(PSET_DIGEST)->h.id; + return; +} + +static void +init_default_compression(compression_data_t * data) +{ + assert("edward-693", data != NULL); + + memset(data, 0, sizeof(*data)); + + data->coa = get_default_plugin(PSET_COMPRESSION)->h.id; +} + +static void +init_default_cluster(cluster_data_t * data) +{ + assert("edward-694", data != NULL); + + *data = DEFAULT_CLUSTER_SHIFT; +} + +/* 1) fill crypto specific part of inode + 2) set inode crypto stat which is supposed to be saved in stat-data */ +static int +inode_set_crypto(struct inode * object, crypto_data_t * data) +{ + int result; + crypto_data_t def; + struct crypto_tfm * tfm; + crypto_plugin * cplug; + digest_plugin * dplug; + reiser4_inode * info = reiser4_inode_data(object); + + if (!data) { + init_default_crypto(&def); + data = &def; + } + cplug = crypto_plugin_by_id(data->cra); + dplug = digest_plugin_by_id(data->dia); + + plugin_set_crypto(&info->pset, cplug); + plugin_set_digest(&info->pset, dplug); + + result = alloc_crypto_tfm(object, data); + if (!result) + return result; + + if (!inode_get_crypto(object)) + /* nothing to do anymore */ + return 0; + + assert("edward-416", data != NULL); + assert("edward-414", dplug != NULL); + assert("edward-415", cplug != NULL); + assert("edward-417", data->key!= NULL); + assert("edward-88", data->keyid != NULL); + assert("edward-83", data->keyid_size != 0); + assert("edward-89", data->keysize != 0); + + tfm = inode_get_tfm(object, CRYPTO_TFM); + assert("edward-695", tfm != NULL); + + result = cplug->setkey(tfm, data->key, data->keysize); + if (result) { + free_crypto_tfm(object); + return result; + } + assert ("edward-34", !inode_get_flag(object, REISER4_SECRET_KEY_INSTALLED)); + inode_set_flag(object, REISER4_SECRET_KEY_INSTALLED); + + 
info->extmask |= (1 << CRYPTO_STAT); + + result = attach_crypto_stat(object, data); + if (result) + goto error; + + info->plugin_mask |= (1 << PSET_CRYPTO) | (1 << PSET_DIGEST); + + return 0; + error: + free_crypto_tfm(object); + inode_clr_flag(object, REISER4_SECRET_KEY_INSTALLED); + return result; +} + +static void +inode_set_compression(struct inode * object, compression_data_t * data) +{ + compression_data_t def; + reiser4_inode * info = reiser4_inode_data(object); + + if (!data) { + init_default_compression(&def); + data = &def; + } + plugin_set_compression(&info->pset, compression_plugin_by_id(data->coa)); + info->plugin_mask |= (1 << PSET_COMPRESSION); + + return; +} + +static int +inode_set_cluster(struct inode * object, cluster_data_t * data) +{ + int result = 0; + cluster_data_t def; + reiser4_inode * info; + + assert("edward-696", object != NULL); + + info = reiser4_inode_data(object); + + if(!data) { + /* NOTE-EDWARD: + this is a necessary parameter for cryptcompress object */ + warning("edward-418", "create_cryptcompress: default cluster size" + " (%u) was assigned for the object %llu\n", + (1U << PAGE_CACHE_SHIFT << DEFAULT_CLUSTER_SHIFT), + (unsigned long long)get_inode_oid(object)); + init_default_cluster(&def); + data = &def; + } + assert("edward-697", *data <= MAX_CLUSTER_SHIFT); + + info->cluster_shift = *data; + info->extmask |= (1 << CLUSTER_STAT); + return result; +} + +/* plugin->create() method for crypto-compressed files + +. install plugins +. attach crypto info if specified +. attach compression info if specified +. 
attach cluster info +*/ +reiser4_internal int +create_cryptcompress(struct inode *object, struct inode *parent, reiser4_object_create_data * data) +{ + int result; + reiser4_inode * info; + + assert("edward-23", object != NULL); + assert("edward-24", parent != NULL); + assert("edward-30", data != NULL); + assert("edward-26", inode_get_flag(object, REISER4_NO_SD)); + assert("edward-27", data->id == CRC_FILE_PLUGIN_ID); + assert("edward-1170", crc_generic_check_ok()); + + info = reiser4_inode_data(object); + + assert("edward-29", info != NULL); + + /* set file bit */ + info->plugin_mask |= (1 << PSET_FILE); + + /* set crypto */ + result = inode_set_crypto(object, data->crypto); + if (result) + goto error; + + /* set compression */ + inode_set_compression(object, data->compression); + + /* set cluster info */ + result = inode_set_cluster(object, data->cluster); + if (result) + goto error; + /* set plugin mask */ + info->extmask |= (1 << PLUGIN_STAT); + + /* save everything in disk stat-data */ + result = write_sd_by_inode_common(object); + if (!result) + return 0; + /* save() method failed, release attached crypto info */ + inode_clr_flag(object, REISER4_CRYPTO_STAT_LOADED); + inode_clr_flag(object, REISER4_CLUSTER_KNOWN); + error: + free_crypto_tfm(object); + detach_crypto_stat(object); + inode_clr_flag(object, REISER4_SECRET_KEY_INSTALLED); + return result; +} + +reiser4_internal int open_cryptcompress(struct inode * inode, struct file * file) +{ + /* FIXME-EDWARD: should be powered by key management */ + assert("edward-698", inode_file_plugin(inode) == file_plugin_by_id(CRC_FILE_PLUGIN_ID)); + return 0; +} + +/* plugin->destroy_inode() */ +reiser4_internal void +destroy_inode_cryptcompress(struct inode * inode) +{ + assert("edward-802", inode_file_plugin(inode) == file_plugin_by_id(CRC_FILE_PLUGIN_ID)); + assert("edward-803", !is_bad_inode(inode) && is_inode_loaded(inode)); + assert("edward-804", inode_get_flag(inode, REISER4_CLUSTER_KNOWN)); + + 
free_crypto_tfm(inode); + if (inode_get_flag(inode, REISER4_CRYPTO_STAT_LOADED)) + detach_crypto_stat(inode); + inode_clr_flag(inode, REISER4_CLUSTER_KNOWN); + inode_clr_flag(inode, REISER4_CRYPTO_STAT_LOADED); + inode_clr_flag(inode, REISER4_SECRET_KEY_INSTALLED); +} + +/* returns translated offset */ +static loff_t inode_scaled_offset (struct inode * inode, + const loff_t src_off /* input offset */) +{ + assert("edward-97", inode != NULL); + + if (!inode_get_crypto(inode) || src_off == get_key_offset(max_key())) + return src_off; + + return inode_crypto_plugin(inode)->scale(inode, crypto_blocksize(inode), src_off); +} + +/* returns disk cluster size */ +reiser4_internal size_t +inode_scaled_cluster_size (struct inode * inode) +{ + assert("edward-110", inode != NULL); + assert("edward-111", inode_get_flag(inode, REISER4_CLUSTER_KNOWN)); + + return inode_scaled_offset(inode, inode_cluster_size(inode)); +} + +static int +new_cluster(reiser4_cluster_t * clust, struct inode * inode) +{ + return (clust_to_off(clust->index, inode) >= inode->i_size); +} + +/* set number of cluster pages */ +static void +set_cluster_nrpages(reiser4_cluster_t * clust, struct inode * inode) +{ + reiser4_slide_t * win; + + assert("edward-180", clust != NULL); + assert("edward-1040", inode != NULL); + assert("edward-1042", inode_get_flag(inode, REISER4_CLUSTER_KNOWN)); + + win = clust->win; + if (!win) { + /* FIXME-EDWARD: i_size should be protected */ + clust->nr_pages = count_to_nrpages(fsize_to_count(clust, inode)); + return; + } + assert("edward-1176", clust->op != PCL_UNKNOWN); + assert("edward-1064", win->off + win->count + win->delta != 0); + + if (win->stat == HOLE_WINDOW && + win->off == 0 && + win->count == inode_cluster_size(inode)) { + /* special case: we start write hole from fake cluster */ + clust->nr_pages = 0; + return; + } + clust->nr_pages = + count_to_nrpages(max_count(win->off + win->count + win->delta, + fsize_to_count(clust, inode))); + return; +} + +/* 
plugin->key_by_inode() */ +/* see plugin/plugin.h for details */ +reiser4_internal int +key_by_inode_cryptcompress(struct inode *inode, loff_t off, reiser4_key * key) +{ + loff_t clust_off; + + assert("edward-64", inode != 0); + // assert("edward-112", ergo(off != get_key_offset(max_key()), !off_to_cloff(off, inode))); + /* don't come here with other offsets */ + + clust_off = (off == get_key_offset(max_key()) ? get_key_offset(max_key()) : off_to_clust_to_off(off, inode)); + + key_by_inode_and_offset_common(inode, 0, key); + set_key_offset(key, (__u64) (!inode_crypto_stat(inode) ? clust_off : inode_scaled_offset(inode, clust_off))); + return 0; +} + +/* plugin->flow_by_inode */ +reiser4_internal int +flow_by_inode_cryptcompress(struct inode *inode /* file to build flow for */ , + char *buf /* user level buffer */ , + int user /* 1 if @buf is of user space, 0 - if it is + kernel space */ , + loff_t size /* buffer size */ , + loff_t off /* offset to start io from */ , + rw_op op /* READ or WRITE */ , + flow_t * f /* resulting flow */) +{ + assert("edward-436", f != NULL); + assert("edward-149", inode != NULL); + assert("edward-150", inode_file_plugin(inode) != NULL); + assert("edward-151", inode_file_plugin(inode)->key_by_inode == key_by_inode_cryptcompress); + + + f->length = size; + f->data = buf; + f->user = user; + f->op = op; + + if (op == WRITE_OP && user == 1) + return 0; + return key_by_inode_cryptcompress(inode, off, &f->key); +} + +static int +crc_hint_validate(hint_t *hint, const reiser4_key *key, znode_lock_mode lock_mode) +{ + coord_t * coord; + + assert("edward-704", hint != NULL); + assert("edward-1089", !hint->ext_coord.valid); + assert("edward-706", hint->ext_coord.lh->owner == NULL); + + coord = &hint->ext_coord.coord; + + if (!hint || !hint_is_set(hint) || hint->mode != lock_mode) + /* hint either not set or set by different operation */ + return RETERR(-E_REPEAT); + + if (get_key_offset(key) != hint->offset) + /* hint is set for different key */ + 
return RETERR(-E_REPEAT); + + assert("edward-707", schedulable()); + + return seal_validate(&hint->seal, &hint->ext_coord.coord, + key, hint->ext_coord.lh, + lock_mode, + ZNODE_LOCK_LOPRI); +} + +static int +__reserve4cluster(struct inode * inode, reiser4_cluster_t * clust) +{ + int result = 0; + + assert("edward-965", schedulable()); + assert("edward-439", inode != NULL); + assert("edward-440", clust != NULL); + assert("edward-441", clust->pages != NULL); + assert("edward-1261", get_current_context()->grabbed_blocks == 0); + + if (clust->nr_pages == 0) { + assert("edward-1152", clust->win != NULL); + assert("edward-1153", clust->win->stat == HOLE_WINDOW); + /* don't reserve space for fake disk cluster */ + return 0; + } + assert("edward-442", jprivate(clust->pages[0]) != NULL); + + result = reiser4_grab_space_force(/* for prepped disk cluster */ + estimate_insert_cluster(inode, 0) + + /* for unprepped disk cluster */ + estimate_insert_cluster(inode, 1), + BA_CAN_COMMIT); + if (result) + return result; + clust->reserved = 1; + grabbed2cluster_reserved(estimate_insert_cluster(inode, 0) + + estimate_insert_cluster(inode, 1)); +#if REISER4_DEBUG + clust->reserved_prepped = estimate_insert_cluster(inode, 0); + clust->reserved_unprepped = estimate_insert_cluster(inode, 1); +#endif + assert("edward-1262", get_current_context()->grabbed_blocks == 0); + return 0; +} + +#if REISER4_TRACE +#define reserve4cluster(inode, clust, msg) __reserve4cluster(inode, clust) +#else +#define reserve4cluster(inode, clust, msg) __reserve4cluster(inode, clust) +#endif + +static void +free_reserved4cluster(struct inode * inode, reiser4_cluster_t * clust, int count) +{ + assert("edward-967", clust->reserved == 1); + + cluster_reserved2free(count); + clust->reserved = 0; +} +#if REISER4_DEBUG +static int +eq_to_ldk(znode *node, const reiser4_key *key) +{ + return UNDER_RW(dk, current_tree, read, keyeq(key, znode_get_ld_key(node))); +} +#endif + +/* The core search procedure. 
+ If the result is not cbk_errored, the current znode is locked */ +static int +find_cluster_item(hint_t * hint, + const reiser4_key *key, /* key of the item we are + looking for */ + znode_lock_mode lock_mode /* which lock */, + ra_info_t *ra_info, + lookup_bias bias, + __u32 flags) +{ + int result; + reiser4_key ikey; + coord_t * coord = &hint->ext_coord.coord; + coord_t orig = *coord; + + assert("edward-152", hint != NULL); + + if (hint->ext_coord.valid == 0) { + result = crc_hint_validate(hint, key, lock_mode); + if (result == -E_REPEAT) + goto traverse_tree; + else if (result) { + assert("edward-1216", 0); + return result; + } + hint->ext_coord.valid = 1; + } + assert("edward-709", znode_is_any_locked(coord->node)); + + /* An in-place lookup is done here: we just need to + check whether the next item after @coord matches @key */ + + if (equal_to_rdk(coord->node, key)) { + result = goto_right_neighbor(coord, hint->ext_coord.lh); + if (result == -E_NO_NEIGHBOR) { + assert("edward-1217", 0); + return RETERR(-EIO); + } + if (result) + return result; + assert("edward-1218", eq_to_ldk(coord->node, key)); + } + else { + coord->item_pos++; + coord->unit_pos = 0; + coord->between = AT_UNIT; + } + result = zload(coord->node); + if (result) + return result; + assert("edward-1219", !node_is_empty(coord->node)); + + if (!coord_is_existing_item(coord)) { + zrelse(coord->node); + goto not_found; + } + item_key_by_coord(coord, &ikey); + zrelse(coord->node); + if (!keyeq(key, &ikey)) + goto not_found; + return CBK_COORD_FOUND; + + not_found: + assert("edward-1220", coord->item_pos > 0); + //coord->item_pos--; + /* roll back */ + *coord = orig; + ON_DEBUG(coord_update_v(coord)); + return CBK_COORD_NOTFOUND; + + traverse_tree: + assert("edward-713", hint->ext_coord.lh->owner == NULL); + assert("edward-714", schedulable()); + + unset_hint(hint); + coord_init_zero(coord); + result = coord_by_key(current_tree, key, coord, hint->ext_coord.lh, + lock_mode, bias, LEAF_LEVEL, 
LEAF_LEVEL, + CBK_UNIQUE | flags, ra_info); + if (cbk_errored(result)) + return result; + hint->ext_coord.valid = 1; + return result; +} + +/* FIXME-EDWARD */ +#if 0 +/* This represent reiser4 crypto alignment policy. + Returns the size > 0 of aligning overhead, if we should align/cut, + returns 0, if we shouldn't (alignment assumes appending an overhead of the size > 0) */ +static int +crypto_overhead(size_t len /* advised length */, + reiser4_cluster_t * clust, + struct inode * inode, rw_op rw) +{ + size_t size = 0; + int result = 0; + int oh; + + assert("edward-486", clust != 0); + + if (!inode_get_crypto(inode) || !inode_crypto_plugin(inode)->align_cluster) + return 0; + if (!len) + size = clust->len; + + assert("edward-615", size != 0); + assert("edward-489", crypto_blocksize(inode) != 0); + + switch (rw) { + case WRITE_OP: /* align */ + assert("edward-488", size <= inode_cluster_size(inode)); + + oh = size % crypto_blocksize(inode); + + if (!oh && size == fsize_to_count(clust, inode)) + /* cluster don't need alignment and didn't get compressed */ + return 0; + result = (crypto_blocksize(inode) - oh); + break; + case READ_OP: /* cut */ + assert("edward-490", size <= inode_scaled_cluster_size(inode)); + if (size >= inode_scaled_offset(inode, fsize_to_count(clust, inode))) + /* cluster didn't get aligned */ + return 0; + assert("edward-491", tfm_stream_data(clust, OUTPUT_STREAM) != NULL); + assert("edward-900", 0); + /* FIXME-EDWARD: the stuff above */ + result = *(tfm_stream_data(clust, OUTPUT_STREAM) + size - 1); + break; + default: + impossible("edward-493", "bad option for getting alignment"); + } + return result; +} +#endif + +/* maximal aligning overhead which can be appended + to the flow before encryption if any */ +static unsigned +max_crypto_overhead(struct inode * inode) +{ + if (!inode_get_crypto(inode) || !inode_crypto_plugin(inode)->align_cluster) + return 0; + return crypto_blocksize(inode); +} + +static unsigned +compress_overhead(struct inode * 
inode, int in_len) +{ + return inode_compression_plugin(inode)->overrun(in_len); +} + +/* Since a small input stream cannot be compressed, + we try to avoid a lot of useless work */ +static int +min_size_to_compress(struct inode * inode) +{ + assert("edward-1036", + inode_compression_plugin(inode)->min_tfm_size != NULL); + return inode_compression_plugin(inode)->min_tfm_size(); +} + + +/* The following two functions represent reiser4 compression policy */ +static int +try_compress(tfm_cluster_t * tc, struct inode * inode) +{ + assert("edward-1037", min_size_to_compress(inode) > 0 && + min_size_to_compress(inode) < inode_cluster_size(inode)); + + return (inode_compression_plugin(inode) != compression_plugin_by_id(NONE_COMPRESSION_ID)) && + (tc->len >= min_size_to_compress(inode)); +} + +static int +try_encrypt(struct inode * inode) +{ + return inode_get_crypto(inode) != NULL; +} + +/* Decide from the lengths of the compressed and uncompressed cluster whether to save or + discard the result of compression. The policy is that the length of the compressed and then + encrypted cluster, including _all_ appended infrastructure, should be _less_ than its length + before compression. */ +static int +save_compressed(int old_size, int new_size, struct inode * inode) +{ + return (new_size + DC_CHECKSUM_SIZE + max_crypto_overhead(inode) < old_size); +} + +/* guess if the cluster was compressed */ +static int +need_decompression(reiser4_cluster_t * clust, struct inode * inode, + int encrypted /* is cluster encrypted */) +{ + tfm_cluster_t * tc = &clust->tc; + + assert("edward-142", tc != 0); + assert("edward-143", inode != NULL); + + return (inode_compression_plugin(inode) != compression_plugin_by_id(NONE_COMPRESSION_ID)) && + (tc->len < (encrypted ? 
inode_scaled_offset(inode, fsize_to_count(clust, inode)) : fsize_to_count(clust, inode))); + +} + +static void set_compression_magic(__u8 * magic) +{ + /* FIXME-EDWARD: Use a checksum here */ + assert("edward-279", magic != NULL); + memset(magic, 0, DC_CHECKSUM_SIZE); +} + +reiser4_internal int +grab_tfm_stream(struct inode * inode, tfm_cluster_t * tc, + tfm_action act, tfm_stream_id id) +{ + size_t size = inode_scaled_cluster_size(inode); + + assert("edward-901", tc != NULL); + assert("edward-1027", inode_compression_plugin(inode) != NULL); + + if (act == TFM_WRITE) + size += compress_overhead(inode, inode_cluster_size(inode)); + + if (!tfm_stream(tc, id) && id == INPUT_STREAM) + alternate_streams(tc); + if (!tfm_stream(tc, id)) + return alloc_tfm_stream(tc, size, id); + + assert("edward-902", tfm_stream_is_set(tc, id)); + + if (tfm_stream_size(tc, id) < size) + return realloc_tfm_stream(tc, size, id); + return 0; +} + +/* Common deflate cluster manager */ + +reiser4_internal int +deflate_cluster(reiser4_cluster_t * clust, struct inode * inode) +{ + int result = 0; + int transformed = 0; + tfm_cluster_t * tc = &clust->tc; + + assert("edward-401", inode != NULL); + assert("edward-903", tfm_stream_is_set(tc, INPUT_STREAM)); + assert("edward-498", !tfm_cluster_is_uptodate(tc)); + + if (result) + return result; + if (try_compress(tc, inode)) { + /* try to compress, discard bad results */ + __u32 dst_len; + compression_plugin * cplug = inode_compression_plugin(inode); + + assert("edward-602", cplug != NULL); + + result = grab_tfm_stream(inode, tc, TFM_WRITE, OUTPUT_STREAM); + if (result) + return result; + dst_len = tfm_stream_size(tc, OUTPUT_STREAM); + cplug->compress(get_coa(tc, cplug->h.id), + tfm_stream_data(tc, INPUT_STREAM), tc->len, + tfm_stream_data(tc, OUTPUT_STREAM), &dst_len); + + /* make sure we didn't overwrite extra bytes */ + assert("edward-603", dst_len <= tfm_stream_size(tc, OUTPUT_STREAM)); + + /* should we accept or discard the result of compression 
transform */ + if (save_compressed(tc->len, dst_len, inode)) { + /* accept */ + tc->len = dst_len; + + set_compression_magic(tfm_stream_data(tc, OUTPUT_STREAM) + tc->len); + tc->len += DC_CHECKSUM_SIZE; + transformed = 1; + } + } + if (try_encrypt(inode)) { + crypto_plugin * cplug; + /* FIXME-EDWARD */ + assert("edward-904", 0); + + cplug = inode_crypto_plugin(inode); + if (transformed) + alternate_streams(tc); + result = grab_tfm_stream(inode, tc, TFM_WRITE, OUTPUT_STREAM); + if (result) + return result; + /* FIXME: set src_len, dst_len, encrypt */ + transformed = 1; + } + if (!transformed) + alternate_streams(tc); + return result; +} + +/* Common inflate cluster manager. + Is used in readpage() or readpages() methods of + cryptcompress object plugins. */ +reiser4_internal int +inflate_cluster(reiser4_cluster_t * clust, struct inode * inode) +{ + int result = 0; + int transformed = 0; + + tfm_cluster_t * tc = &clust->tc; + + assert("edward-905", inode != NULL); + assert("edward-1178", clust->dstat == PREP_DISK_CLUSTER); + assert("edward-906", tfm_stream_is_set(&clust->tc, INPUT_STREAM)); + assert("edward-907", !tfm_cluster_is_uptodate(tc)); + + if (inode_get_crypto(inode) != NULL) { + crypto_plugin * cplug; + + /* FIXME-EDWARD: isn't supported yet */ + assert("edward-908", 0); + cplug = inode_crypto_plugin(inode); + assert("edward-617", cplug != NULL); + + result = grab_tfm_stream(inode, tc, TFM_READ, OUTPUT_STREAM); + if (result) + return result; + assert("edward-909", tfm_cluster_is_set(tc)); + + /* set src_len, dst_len and decrypt */ + /* tc->len = dst_len */ + + transformed = 1; + } + if (need_decompression(clust, inode, 0)) { + __u8 magic[DC_CHECKSUM_SIZE]; + unsigned dst_len = inode_cluster_size(inode); + compression_plugin * cplug = inode_compression_plugin(inode); + + if(transformed) + alternate_streams(tc); + + result = grab_tfm_stream(inode, tc, TFM_READ, OUTPUT_STREAM); + if (result) + return result; + assert("edward-910", tfm_cluster_is_set(tc)); + + 
/* Check compression magic for possible IO errors. + + End-of-cluster format created before encryption: + + data + compression_magic (4) Indicates presence of compression + infrastructure, should be private. + Can be absent. + crypto_overhead Created by ->align() method of crypto-plugin, + Can be absent. + + Crypto overhead format: + + data + tail_size (1) size of aligning tail, + 1 <= tail_size <= blksize + */ + set_compression_magic(magic); + + if (memcmp(tfm_stream_data(tc, INPUT_STREAM) + (tc->len - (size_t)DC_CHECKSUM_SIZE), + magic, (size_t)DC_CHECKSUM_SIZE)) { + printk("edward-156: wrong compression magic %d (should be %d)\n", + *((int *)(tfm_stream_data(tc, INPUT_STREAM) + (tc->len - (size_t)DC_CHECKSUM_SIZE))), *((int *)magic)); + result = -EIO; + return result; + } + tc->len -= (size_t)DC_CHECKSUM_SIZE; + + /* decompress cluster */ + cplug->decompress(get_coa(tc, cplug->h.id), + tfm_stream_data(tc, INPUT_STREAM), tc->len, + tfm_stream_data(tc, OUTPUT_STREAM), &dst_len); + + /* check length */ + tc->len = dst_len; + assert("edward-157", dst_len == fsize_to_count(clust, inode)); + transformed = 1; + } + if (!transformed) + alternate_streams(tc); + return result; +} + +/* plugin->read() : + * generic_file_read() + * All key offsets don't make sense in traditional unix semantics unless they + * represent the beginning of clusters, so the only thing we can do is start + * right from mapping to the address space (this is precisely what filemap + * generic method does) */ +/* plugin->readpage() */ +reiser4_internal int +readpage_cryptcompress(void *vp, struct page *page) +{ + reiser4_cluster_t clust; + struct file * file; + item_plugin * iplug; + int result; + + assert("edward-88", PageLocked(page)); + assert("edward-89", page->mapping && page->mapping->host); + + file = vp; + if (file) + assert("edward-113", page->mapping == file->f_dentry->d_inode->i_mapping); + + if (PageUptodate(page)) { + printk("readpage_cryptcompress: page became already uptodate\n"); + 
unlock_page(page); + return 0; + } + reiser4_cluster_init(&clust, 0); + clust.file = file; + iplug = item_plugin_by_id(CTAIL_ID); + if (!iplug->s.file.readpage) { + put_cluster_handle(&clust, TFM_READ); + return -EINVAL; + } + result = iplug->s.file.readpage(&clust, page); + + assert("edward-64", ergo(result == 0, (PageLocked(page) || PageUptodate(page)))); + /* if page has jnode - that jnode is mapped + assert("edward-65", ergo(result == 0 && PagePrivate(page), + jnode_mapped(jprivate(page)))); + */ + put_cluster_handle(&clust, TFM_READ); + return result; +} + +/* plugin->readpages() */ +reiser4_internal void +readpages_cryptcompress(struct file *file, struct address_space *mapping, + struct list_head *pages) +{ + file_plugin *fplug; + item_plugin *iplug; + + assert("edward-1112", mapping != NULL); + assert("edward-1113", mapping->host != NULL); + + fplug = inode_file_plugin(mapping->host); + + assert("edward-1114", fplug == file_plugin_by_id(CRC_FILE_PLUGIN_ID)); + + iplug = item_plugin_by_id(CTAIL_ID); + + iplug->s.file.readpages(file, mapping, pages); + + return; +} + +/* how many pages will be captured */ +static int +cluster_nrpages_to_capture(reiser4_cluster_t * clust) +{ + switch (clust->op) { + case PCL_APPEND: + return clust->nr_pages; + case PCL_TRUNCATE: + assert("edward-1179", clust->win != NULL); + return count_to_nrpages(clust->win->off + clust->win->count); + default: + impossible("edward-1180","bad page cluster option"); + return 0; + } +} + +static void +set_cluster_pages_dirty(reiser4_cluster_t * clust) +{ + int i; + struct page * pg; + int nrpages = cluster_nrpages_to_capture(clust); + + for (i=0; i < nrpages; i++) { + + pg = clust->pages[i]; + + assert("edward-968", pg != NULL); + + lock_page(pg); + + assert("edward-1065", PageUptodate(pg)); + + set_page_dirty_internal(pg, 0); + + if (!PageReferenced(pg)) + SetPageReferenced(pg); + mark_page_accessed(pg); + + unlock_page(pg); + } +} + +static void +clear_cluster_pages_dirty(reiser4_cluster_t * 
clust) +{ + int i; + assert("edward-1275", clust != NULL); + + for (i = 0; i < clust->nr_pages; i++) { + assert("edward-1276", clust->pages[i] != NULL); + + lock_page(clust->pages[i]); + if (!PageDirty(clust->pages[i])) { + warning("edward-985", "Page of index %lu (inode %llu)" + " is not dirty\n", clust->pages[i]->index, + (unsigned long long)get_inode_oid(clust->pages[i]->mapping->host)); + } + else { + assert("edward-1277", PageUptodate(clust->pages[i])); + reiser4_clear_page_dirty(clust->pages[i]); + } + unlock_page(clust->pages[i]); + } +} + +/* update i_size by window */ +static void +inode_set_new_size(reiser4_cluster_t * clust, struct inode * inode) +{ + loff_t size; + reiser4_slide_t * win; + + assert("edward-1181", clust != NULL); + assert("edward-1182", inode != NULL); + + win = clust->win; + assert("edward-1183", win != NULL); + + size = clust_to_off(clust->index, inode) + win->off; + + switch (clust->op) { + case PCL_APPEND: + if (size + win->count <= inode->i_size) + /* overwrite only */ + return; + size += win->count; + break; + case PCL_TRUNCATE: + break; + default: + impossible("edward-1184", "bad page cluster option"); + break; + } + inode_check_scale_nolock(inode, inode->i_size, size); + inode->i_size = size; + return; +} + +/* . reserve space for a disk cluster if its jnode is not dirty; + . update set of pages referenced by this jnode + . 
update jnode's counter of referenced pages (excluding first one) +*/ +static void +make_cluster_jnode_dirty_locked(reiser4_cluster_t * clust, jnode * node, + loff_t * old_isize, struct inode * inode) +{ + int i; + int old_refcnt; + int new_refcnt; + + assert("edward-221", node != NULL); + assert("edward-971", clust->reserved == 1); + assert("edward-1028", spin_jnode_is_locked(node)); + assert("edward-972", node->page_count < cluster_nrpages(inode)); + assert("edward-1263", clust->reserved_prepped == estimate_insert_cluster(inode, 0)); + assert("edward-1264", clust->reserved_unprepped == 0); + + + if (jnode_is_dirty(node)) { + /* there are >= 1 pages already referenced by this jnode */ + assert("edward-973", count_to_nrpages(off_to_count(*old_isize, clust->index, inode))); + old_refcnt = count_to_nrpages(off_to_count(*old_isize, clust->index, inode)) - 1; + /* space for the disk cluster is already reserved */ + + free_reserved4cluster(inode, clust, estimate_insert_cluster(inode, 0)); + } + else { + /* there is only one page referenced by this jnode */ + assert("edward-1043", node->page_count == 0); + old_refcnt = 0; + jnode_make_dirty_locked(node); + clust->reserved = 0; + } +#if REISER4_DEBUG + clust->reserved_prepped -= estimate_insert_cluster(inode, 0); +#endif + new_refcnt = cluster_nrpages_to_capture(clust) - 1; + + /* get rid of duplicated references */ + for (i = 0; i <= old_refcnt; i++) { + assert("edward-975", clust->pages[i]); + assert("edward-976", old_refcnt < inode_cluster_size(inode)); + assert("edward-1185", PageUptodate(clust->pages[i])); + + page_cache_release(clust->pages[i]); + } + /* truncate old references */ + if (new_refcnt < old_refcnt) { + assert("edward-1186", clust->op == PCL_TRUNCATE); + for (i = new_refcnt + 1; i <= old_refcnt; i++) { + assert("edward-1187", clust->pages[i]); + assert("edward-1188", PageUptodate(clust->pages[i])); + + page_cache_release(clust->pages[i]); + } + } +#if REISER4_DEBUG + node->page_count = new_refcnt; +#endif 
+ return; +} + +/* This is the interface to capture page cluster. + All the cluster pages contain dependent modifications + and should be committed at the same time */ +static int +try_capture_cluster(reiser4_cluster_t * clust, struct inode * inode) +{ + int result = 0; + loff_t old_size = inode->i_size; + jnode * node; + + assert("edward-1029", clust != NULL); + assert("edward-1030", clust->reserved == 1); + assert("edward-1031", clust->nr_pages != 0); + assert("edward-1032", clust->pages != NULL); + assert("edward-1033", clust->pages[0] != NULL); + + node = jprivate(clust->pages[0]); + + assert("edward-1035", node != NULL); + + if (clust->win) { + spin_lock_inode(inode); + LOCK_JNODE(node); + inode_set_new_size(clust, inode); + } + else + LOCK_JNODE(node); + result = try_capture(node, ZNODE_WRITE_LOCK, 0, 0); + if (result) + goto exit; + make_cluster_jnode_dirty_locked(clust, node, &old_size, inode); + exit: + assert("edward-1034", !result); + UNLOCK_JNODE(node); + if (clust->win) + spin_unlock_inode(inode); + jput(node); + return result; +} + +/* Collect unlocked cluster pages and jnode */ +static int +grab_cluster_pages_jnode(struct inode * inode, reiser4_cluster_t * clust) +{ + int i; + int result = 0; + jnode * node = NULL; + + assert("edward-182", clust != NULL); + assert("edward-183", clust->pages != NULL); + assert("edward-184", clust->nr_pages <= cluster_nrpages(inode)); + + if (clust->nr_pages == 0) + return 0; + + for (i = 0; i < clust->nr_pages; i++) { + + assert("edward-1044", clust->pages[i] == NULL); + + clust->pages[i] = grab_cache_page(inode->i_mapping, clust_to_pg(clust->index, inode) + i); + if (!clust->pages[i]) { + result = RETERR(-ENOMEM); + break; + } + if (i == 0) { + node = jnode_of_page(clust->pages[i]); + unlock_page(clust->pages[i]); + if (IS_ERR(node)) { + result = PTR_ERR(node); + break; + } + assert("edward-919", node); + continue; + } + unlock_page(clust->pages[i]); + } + if (result) { + while(i) + 
page_cache_release(clust->pages[--i]); + if (node && !IS_ERR(node)) + jput(node); + return result; + } + assert("edward-920", jprivate(clust->pages[0])); + LOCK_JNODE(node); + JF_SET(node, JNODE_CLUSTER_PAGE); + UNLOCK_JNODE(node); + return 0; +} + +/* collect unlocked cluster pages */ +static int +grab_cluster_pages(struct inode * inode, reiser4_cluster_t * clust) +{ + int i; + int result = 0; + + assert("edward-787", clust != NULL); + assert("edward-788", clust->pages != NULL); + assert("edward-789", clust->nr_pages != 0); + assert("edward-790", clust->nr_pages <= cluster_nrpages(inode)); + + for (i = 0; i < clust->nr_pages; i++) { + clust->pages[i] = grab_cache_page(inode->i_mapping, clust_to_pg(clust->index, inode) + i); + if (!clust->pages[i]) { + result = RETERR(-ENOMEM); + break; + } + unlock_page(clust->pages[i]); + } + if (result) + while(i) + page_cache_release(clust->pages[--i]); + return result; +} + +UNUSED_ARG static void +set_cluster_unlinked(reiser4_cluster_t * clust, struct inode * inode) +{ + jnode * node; + + node = jprivate(clust->pages[0]); + + assert("edward-640", node); + + LOCK_JNODE(node); + JF_SET(node, JNODE_NEW); + UNLOCK_JNODE(node); +} + +/* put cluster pages */ +static void +release_cluster_pages(reiser4_cluster_t * clust, int from) +{ + int i; + + assert("edward-447", clust != NULL); + assert("edward-448", from <= clust->nr_pages); + + for (i = from; i < clust->nr_pages; i++) { + + assert("edward-449", clust->pages[i] != NULL); + + page_cache_release(clust->pages[i]); + } +} + +static void +release_cluster_pages_capture(reiser4_cluster_t * clust) +{ + assert("edward-1278", clust != NULL); + assert("edward-1279", clust->nr_pages != 0); + + return release_cluster_pages(clust, 1); +} + +reiser4_internal void +release_cluster_pages_nocapture(reiser4_cluster_t * clust) +{ + return release_cluster_pages(clust, 0); +} + +static void +release_cluster_pages_and_jnode(reiser4_cluster_t * clust) +{ + jnode * node; + + assert("edward-445", clust 
!= NULL); + assert("edward-922", clust->pages != NULL); + assert("edward-446", clust->pages[0] != NULL); + + node = jprivate(clust->pages[0]); + + assert("edward-447", node != NULL); + + release_cluster_pages(clust, 0); + + jput(node); +} + +#if REISER4_DEBUG +static int +window_ok(reiser4_slide_t * win, struct inode * inode) +{ + assert ("edward-1115", win != NULL); + assert ("edward-1116", ergo(win->delta, win->stat == HOLE_WINDOW)); + + return (win->off != inode_cluster_size(inode)) && + (win->off + win->count + win->delta <= inode_cluster_size(inode)); +} + +static int +cluster_ok(reiser4_cluster_t * clust, struct inode * inode) +{ + assert("edward-279", clust != NULL); + + if (!clust->pages) + return 0; + return (clust->win ? window_ok(clust->win, inode) : 1); +} +#endif + +/* guess next window stat */ +static inline window_stat +next_window_stat(reiser4_slide_t * win) +{ + assert ("edward-1130", win != NULL); + return ((win->stat == HOLE_WINDOW && win->delta == 0) ? + HOLE_WINDOW : DATA_WINDOW); +} + +/* guess next cluster index and window params */ +static void +update_cluster(struct inode * inode, reiser4_cluster_t * clust, loff_t file_off, loff_t to_file) +{ + reiser4_slide_t * win; + + assert ("edward-185", clust != NULL); + assert ("edward-438", clust->pages != NULL); + assert ("edward-281", cluster_ok(clust, inode)); + + win = clust->win; + if (!win) + return; + + switch (win->stat) { + case DATA_WINDOW: + /* increment window position */ + clust->index++; + win->stat = DATA_WINDOW; + win->off = 0; + win->count = min_count(inode_cluster_size(inode), to_file); + break; + case HOLE_WINDOW: + switch(next_window_stat(win)) { + case HOLE_WINDOW: + /* set window to fit the offset we start write from */ + clust->index = off_to_clust(file_off, inode); + win->stat = HOLE_WINDOW; + win->off = 0; + win->count = off_to_cloff(file_off, inode); + win->delta = min_count(inode_cluster_size(inode) - win->count, to_file); + break; + case DATA_WINDOW: + /* do not move the 
window, just change its state, + off+count+delta=inv */ + win->stat = DATA_WINDOW; + win->off = win->off + win->count; + win->count = win->delta; + win->delta = 0; + break; + default: + impossible ("edward-282", "wrong next window state"); + } + break; + default: + impossible ("edward-283", "wrong current window state"); + } + assert ("edward-1068", cluster_ok(clust, inode)); +} + +static int +update_sd_cryptcompress(struct inode *inode) +{ + int result = 0; + + assert("edward-978", schedulable()); + assert("edward-1265", get_current_context()->grabbed_blocks == 0); + + result = reiser4_grab_space_force(/* one for stat data update */ + estimate_update_common(inode), + BA_CAN_COMMIT); + assert("edward-979", !result); + if (result) + return result; + inode->i_ctime = inode->i_mtime = CURRENT_TIME; + result = reiser4_update_sd(inode); + + all_grabbed2free(); + return result; +} + +static void +uncapture_cluster_jnode(jnode *node) +{ + txn_atom *atom; + + assert("edward-1023", spin_jnode_is_locked(node)); + + /*jnode_make_clean(node);*/ + atom = jnode_get_atom(node); + if (atom == NULL) { + assert("jmacd-7111", !jnode_is_dirty(node)); + UNLOCK_JNODE (node); + return; + } + + uncapture_block(node); + UNLOCK_ATOM(atom); + jput(node); +} + +reiser4_internal void +forget_cluster_pages(struct page ** pages, int nr) +{ + int i; + for (i = 0; i < nr; i++) { + + assert("edward-1045", pages[i] != NULL); + page_cache_release(pages[i]); + } +} + +/* Prepare input stream for transform operations. + Try to do it in one step. Return -E_REPEAT when it is + impossible because of races with concurrent processes. 
+*/ +reiser4_internal int +flush_cluster_pages(reiser4_cluster_t * clust, jnode * node, + struct inode * inode) +{ + int result = 0; + int i; + int nr_pages = 0; + tfm_cluster_t * tc = &clust->tc; + + assert("edward-980", node != NULL); + assert("edward-236", inode != NULL); + assert("edward-237", clust != NULL); + assert("edward-240", !clust->win); + assert("edward-241", schedulable()); + assert("edward-718", crc_inode_ok(inode)); + + LOCK_JNODE(node); + + if (!jnode_is_dirty(node)) { + + assert("edward-981", node->page_count == 0); + warning("edward-982", "flush_cluster_pages: jnode is not dirty " + "clust %lu, inode %llu\n", + clust->index, (unsigned long long)get_inode_oid(inode)); + + /* race with another flush */ + UNLOCK_JNODE(node); + return RETERR(-E_REPEAT); + } + tc->len = fsize_to_count(clust, inode); + clust->nr_pages = count_to_nrpages(tc->len); + + assert("edward-983", clust->nr_pages == node->page_count + 1); +#if REISER4_DEBUG + node->page_count = 0; +#endif + cluster_reserved2grabbed(estimate_insert_cluster(inode, 0)); + uncapture_cluster_jnode(node); + + /* Try to create input stream for the found size (tc->len). 
+ Starting from this point the page cluster can be modified + (truncated, appended) by concurrent processes, so we need + to worry if the constructed stream is valid */ + + assert("edward-1224", schedulable()); + + result = grab_tfm_stream(inode, tc, TFM_WRITE, INPUT_STREAM); + if (result) + return result; + + nr_pages = find_get_pages(inode->i_mapping, clust_to_pg(clust->index, inode), + clust->nr_pages, clust->pages); + + if (nr_pages != clust->nr_pages) { + /* the page cluster get truncated, try again */ + assert("edward-1280", nr_pages < clust->nr_pages); + warning("edward-1281", "Page cluster of index %lu (inode %llu)" + " get truncated from %u to %u pages\n", + clust->index, + (unsigned long long)get_inode_oid(inode), + clust->nr_pages, + nr_pages); + forget_cluster_pages(clust->pages, nr_pages); + return RETERR(-E_REPEAT); + } + for (i = 0; i < clust->nr_pages; i++){ + char * data; + + assert("edward-242", clust->pages[i] != NULL); + + if (clust->pages[i]->index != clust_to_pg(clust->index, inode) + i) { + /* holes in the indices of found group of pages: + page cluster get truncated, transform impossible */ + warning("edward-1282", + "Hole in the indices: " + "Page %d in the cluster of index %lu " + "(inode %llu) has index %lu\n", + i, clust->index, + (unsigned long long)get_inode_oid(inode), + clust->pages[i]->index); + + forget_cluster_pages(clust->pages, nr_pages); + result = RETERR(-E_REPEAT); + goto finish; + } + if (!PageUptodate(clust->pages[i])) { + /* page cluster get truncated, transform impossible */ + assert("edward-1283", !PageDirty(clust->pages[i])); + warning("edward-1284", + "Page of index %lu (inode %llu) " + "is not uptodate\n", clust->pages[i]->index, + (unsigned long long)get_inode_oid(inode)); + + forget_cluster_pages(clust->pages, nr_pages); + result = RETERR(-E_REPEAT); + goto finish; + } + /* ok with this page, flush it to the input stream */ + lock_page(clust->pages[i]); + data = kmap(clust->pages[i]); + + assert("edward-986", 
off_to_pgcount(tc->len, i) != 0); + + memcpy(tfm_stream_data(tc, INPUT_STREAM) + pg_to_off(i), + data, off_to_pgcount(tc->len, i)); + kunmap(clust->pages[i]); + unlock_page(clust->pages[i]); + } + /* input stream is ready for transform */ + + clear_cluster_pages_dirty(clust); + finish: + release_cluster_pages_capture(clust); + return result; +} + +/* set hint for the cluster of the index @index */ +reiser4_internal void +set_hint_cluster(struct inode * inode, hint_t * hint, + cloff_t index, znode_lock_mode mode) +{ + reiser4_key key; + assert("edward-722", crc_inode_ok(inode)); + assert("edward-723", inode_file_plugin(inode) == file_plugin_by_id(CRC_FILE_PLUGIN_ID)); + + inode_file_plugin(inode)->key_by_inode(inode, clust_to_off(index, inode), &key); + + seal_init(&hint->seal, &hint->ext_coord.coord, &key); + hint->offset = get_key_offset(&key); + hint->mode = mode; +} + +reiser4_internal void +invalidate_hint_cluster(reiser4_cluster_t * clust) +{ + assert("edward-1291", clust != NULL); + assert("edward-1292", clust->hint != NULL); + + longterm_unlock_znode(clust->hint->ext_coord.lh); + clust->hint->ext_coord.valid = 0; +} + +static void +put_hint_cluster(reiser4_cluster_t * clust, struct inode * inode, + znode_lock_mode mode) +{ + assert("edward-1286", clust != NULL); + assert("edward-1287", clust->hint != NULL); + + set_hint_cluster(inode, clust->hint, clust->index + 1, mode); + invalidate_hint_cluster(clust); +} + +static int +balance_dirty_page_cluster(reiser4_cluster_t * clust, struct inode * inode, + loff_t off, loff_t to_file) +{ + int result; + + assert("edward-724", inode != NULL); + assert("edward-725", crc_inode_ok(inode)); + assert("edward-1272", get_current_context()->grabbed_blocks == 0); + + /* set next window params */ + update_cluster(inode, clust, off, to_file); + + result = update_sd_cryptcompress(inode); + assert("edward-988", !result); + if (result) + return result; + assert("edward-726", clust->hint->ext_coord.lh->owner == NULL); + + 
reiser4_throttle_write(inode); + all_grabbed2free(); + return 0; +} + +/* set zeroes to the cluster, update it, and maybe, try to capture its pages */ +static int +write_hole(struct inode *inode, reiser4_cluster_t * clust, loff_t file_off, loff_t to_file) +{ + char * data; + int result = 0; + unsigned cl_off, cl_count = 0; + unsigned to_pg, pg_off; + reiser4_slide_t * win; + + assert ("edward-190", clust != NULL); + assert ("edward-1069", clust->win != NULL); + assert ("edward-191", inode != NULL); + assert ("edward-727", crc_inode_ok(inode)); + assert ("edward-1171", clust->dstat != INVAL_DISK_CLUSTER); + assert ("edward-1154", + ergo(clust->dstat != FAKE_DISK_CLUSTER, clust->reserved == 1)); + + win = clust->win; + + assert ("edward-1070", win != NULL); + assert ("edward-201", win->stat == HOLE_WINDOW); + assert ("edward-192", cluster_ok(clust, inode)); + + if (win->off == 0 && win->count == inode_cluster_size(inode)) { + /* the hole will be represented by fake disk cluster */ + update_cluster(inode, clust, file_off, to_file); + return 0; + } + cl_count = win->count; /* number of zeroes to write */ + cl_off = win->off; + pg_off = off_to_pgoff(win->off); + + while (cl_count) { + struct page * page; + page = clust->pages[off_to_pg(cl_off)]; + + assert ("edward-284", page != NULL); + + to_pg = min_count(PAGE_CACHE_SIZE - pg_off, cl_count); + lock_page(page); + data = kmap_atomic(page, KM_USER0); + memset(data + pg_off, 0, to_pg); + flush_dcache_page(page); + kunmap_atomic(data, KM_USER0); + SetPageUptodate(page); + unlock_page(page); + + cl_off += to_pg; + cl_count -= to_pg; + pg_off = 0; + } + if (!win->delta) { + /* only zeroes, try to capture */ + + set_cluster_pages_dirty(clust); + result = try_capture_cluster(clust, inode); + if (result) + return result; + put_hint_cluster(clust, inode, ZNODE_WRITE_LOCK); + result = balance_dirty_page_cluster(clust, inode, file_off, to_file); + } + else + update_cluster(inode, clust, file_off, to_file); + return result; +} + 
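The zero-filling loop in write_hole() above can be illustrated outside the kernel. The following is a minimal user-space sketch, not part of the patch: SKETCH_PAGE_SIZE and sketch_zero_window() are hypothetical names, pages are plain buffers rather than struct page, and the kmap/locking/uptodate handling of the real code is deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for PAGE_CACHE_SIZE; an assumption of this sketch. */
#define SKETCH_PAGE_SIZE 4096u

/*
 * Zero `count` bytes starting at byte offset `off` inside a cluster whose
 * pages are ordinary buffers of SKETCH_PAGE_SIZE bytes each.  Mirrors the
 * cl_off / cl_count / pg_off bookkeeping of write_hole(): only the first
 * page can start at a non-zero in-page offset.
 */
static void
sketch_zero_window(unsigned char **pages, size_t nr_pages,
		   size_t off, size_t count)
{
	size_t pg_off = off % SKETCH_PAGE_SIZE;

	/* the window must lie inside the cluster */
	assert(off + count <= nr_pages * SKETCH_PAGE_SIZE);

	while (count) {
		size_t room = SKETCH_PAGE_SIZE - pg_off;
		size_t to_pg = count < room ? count : room;

		memset(pages[off / SKETCH_PAGE_SIZE] + pg_off, 0, to_pg);

		off += to_pg;
		count -= to_pg;
		pg_off = 0;	/* every following page starts at offset 0 */
	}
}
```

A caller would pass an array of page-sized buffers together with the window offset and count; a window that straddles a page boundary is split into one memset per page.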
+/* + The main disk search procedure for cryptcompress plugins, which + . scans all items of disk cluster + . maybe reads each one (if @read != 0) + . maybe makes its znode dirty (if @write != 0) + + NOTE-EDWARD: Callers should handle the case when disk cluster + is incomplete (-EIO) +*/ +reiser4_internal int +find_cluster(reiser4_cluster_t * clust, + struct inode * inode, + int read, + int write) +{ + flow_t f; + hint_t * hint; + int result; + unsigned long cl_idx; + ra_info_t ra_info; + file_plugin * fplug; + item_plugin * iplug; + tfm_cluster_t * tc; + +#if REISER4_DEBUG + reiser4_context *ctx; + ctx = get_current_context(); +#endif + assert("edward-138", clust != NULL); + assert("edward-728", clust->hint != NULL); + assert("edward-225", read || write); + assert("edward-226", schedulable()); + assert("edward-137", inode != NULL); + assert("edward-729", crc_inode_ok(inode)); + assert("edward-474", get_current_context()->grabbed_blocks == 0); + + hint = clust->hint; + cl_idx = clust->index; + fplug = inode_file_plugin(inode); + + tc = &clust->tc; + + assert("edward-462", !tfm_cluster_is_uptodate(tc)); + assert("edward-461", ergo(read, tfm_stream_is_set(tc, INPUT_STREAM))); + + /* set key of the first disk cluster item */ + fplug->flow_by_inode(inode, + (read ? tfm_stream_data(tc, INPUT_STREAM) : 0), + 0 /* kernel space */, + inode_scaled_cluster_size(inode), + clust_to_off(cl_idx, inode), READ_OP, &f); + if (write) { + /* reserve for flush to make dirty all the leaf nodes + which contain disk cluster */ + result = reiser4_grab_space_force(estimate_disk_cluster(inode), BA_CAN_COMMIT); + assert("edward-990", !result); + if (result) + goto out2; + } + + ra_info.key_to_stop = f.key; + set_key_offset(&ra_info.key_to_stop, get_key_offset(max_key())); + + while (f.length) { + result = find_cluster_item(hint, &f.key, (write ? 
ZNODE_WRITE_LOCK : ZNODE_READ_LOCK), NULL, FIND_EXACT, 0); + switch (result) { + case CBK_COORD_NOTFOUND: + if (inode_scaled_offset(inode, clust_to_off(cl_idx, inode)) == get_key_offset(&f.key)) { + /* first item not found, this is treated + as disk cluster is absent */ + clust->dstat = FAKE_DISK_CLUSTER; + result = 0; + goto out2; + } + /* we are outside the cluster, stop search here */ + assert("edward-146", f.length != inode_scaled_cluster_size(inode)); + goto ok; + case CBK_COORD_FOUND: + assert("edward-148", hint->ext_coord.coord.between == AT_UNIT); + assert("edward-460", hint->ext_coord.coord.unit_pos == 0); + + coord_clear_iplug(&hint->ext_coord.coord); + result = zload_ra(hint->ext_coord.coord.node, &ra_info); + if (unlikely(result)) + goto out2; + iplug = item_plugin_by_coord(&hint->ext_coord.coord); + assert("edward-147", + item_id_by_coord(&hint->ext_coord.coord) == CTAIL_ID); + + result = iplug->s.file.read(NULL, &f, hint); + if (result) + goto out; + if (write) { + znode_make_dirty(hint->ext_coord.coord.node); + znode_set_convertible(hint->ext_coord.coord.node); + } + zrelse(hint->ext_coord.coord.node); + break; + default: + goto out2; + } + } + ok: + /* at least one item was found */ + /* NOTE-EDWARD: Callers should handle the case when disk cluster is incomplete (-EIO) */ + tc->len = inode_scaled_cluster_size(inode) - f.length; + assert("edward-1196", tc->len > 0); + + if (hint_is_unprepped_dclust(clust->hint)) + clust->dstat = UNPR_DISK_CLUSTER; + else + clust->dstat = PREP_DISK_CLUSTER; + all_grabbed2free(); + return 0; + out: + zrelse(hint->ext_coord.coord.node); + out2: + all_grabbed2free(); + return result; +} + +reiser4_internal int +get_disk_cluster_locked(reiser4_cluster_t * clust, struct inode * inode, + znode_lock_mode lock_mode) +{ + reiser4_key key; + ra_info_t ra_info; + + assert("edward-730", schedulable()); + assert("edward-731", clust != NULL); + assert("edward-732", inode != NULL); + + if (clust->hint->ext_coord.valid) { + 
assert("edward-1293", clust->dstat != INVAL_DISK_CLUSTER); + assert("edward-1294", znode_is_write_locked(clust->hint->ext_coord.lh->node)); + /* already have a valid locked position */ + return (clust->dstat == FAKE_DISK_CLUSTER ? CBK_COORD_NOTFOUND : CBK_COORD_FOUND); + } + key_by_inode_cryptcompress(inode, clust_to_off(clust->index, inode), &key); + ra_info.key_to_stop = key; + set_key_offset(&ra_info.key_to_stop, get_key_offset(max_key())); + + return find_cluster_item(clust->hint, &key, lock_mode, NULL, FIND_EXACT, CBK_FOR_INSERT); +} + +/* Read needed cluster pages before modifying. + If success, @clust->hint contains locked position in the tree. + Also: + . find and set disk cluster state + . make disk cluster dirty if its state is not FAKE_DISK_CLUSTER. +*/ +static int +read_some_cluster_pages(struct inode * inode, reiser4_cluster_t * clust) +{ + int i; + int result = 0; + item_plugin * iplug; + reiser4_slide_t * win = clust->win; + + iplug = item_plugin_by_id(CTAIL_ID); + + assert("edward-733", get_current_context()->grabbed_blocks == 0); + assert("edward-924", !tfm_cluster_is_uptodate(&clust->tc)); + +#if REISER4_DEBUG + if (clust->nr_pages == 0) { + /* start write hole from fake disk cluster */ + assert("edward-1117", win != NULL); + assert("edward-1118", win->stat == HOLE_WINDOW); + assert("edward-1119", new_cluster(clust, inode)); + } +#endif + if (new_cluster(clust, inode)) { + /* + new page cluster is about to be written, nothing to read, + */ + assert("edward-734", schedulable()); + assert("edward-735", clust->hint->ext_coord.lh->owner == NULL); + + clust->dstat = FAKE_DISK_CLUSTER; + return 0; + } + /* + Here we should search for disk cluster to figure out its real state. 
+ Also there is one more important reason to do disk search: we need + to make disk cluster _dirty_ if it exists + */ + + /* if window is specified, read only the pages + that will be modified partially */ + + for (i = 0; i < clust->nr_pages; i++) { + struct page * pg = clust->pages[i]; + + lock_page(pg); + if (PageUptodate(pg)) { + unlock_page(pg); + continue; + } + unlock_page(pg); + + if (win && + i >= count_to_nrpages(win->off) && + i < off_to_pg(win->off + win->count + win->delta)) + /* page will be completely overwritten */ + continue; + if (win && (i == clust->nr_pages - 1) && + /* the last page is + partially modified, + not uptodate .. */ + (count_to_nrpages(inode->i_size) <= pg->index)) { + /* .. and appended, + so set zeroes to the rest */ + char * data; + int offset; + lock_page(pg); + data = kmap_atomic(pg, KM_USER0); + + assert("edward-1260", + count_to_nrpages(win->off + win->count + win->delta) - 1 == i); + + offset = off_to_pgoff(win->off + win->count + win->delta); + memset(data + offset, 0, PAGE_CACHE_SIZE - offset); + flush_dcache_page(pg); + kunmap_atomic(data, KM_USER0); + unlock_page(pg); + /* still not uptodate */ + break; + } + if (!tfm_cluster_is_uptodate(&clust->tc)) { + result = ctail_read_cluster(clust, inode, 1 /* write */); + assert("edward-992", !result); + if (result) + goto out; + assert("edward-925", tfm_cluster_is_uptodate(&clust->tc)); + } + lock_page(pg); + result = do_readpage_ctail(clust, pg); + unlock_page(pg); + assert("edward-993", !result); + if (result) { + impossible("edward-219", "do_readpage_ctail returned crap"); + goto out; + } + } + if (!tfm_cluster_is_uptodate(&clust->tc)) { + /* disk cluster unclaimed, but we need to make its znodes dirty + to make flush update convert its content */ + result = find_cluster(clust, inode, 0 /* do not read */, 1 /*write */); + assert("edward-994", !cbk_errored(result)); + if (!cbk_errored(result)) + result = 0; + } + out: + tfm_cluster_clr_uptodate(&clust->tc); + return result; +} 
+ +static int +should_create_unprepped_cluster(reiser4_cluster_t * clust, struct inode * inode) +{ + assert("edward-737", clust != NULL); + + switch (clust->dstat) { + case PREP_DISK_CLUSTER: + case UNPR_DISK_CLUSTER: + return 0; + case FAKE_DISK_CLUSTER: + if (clust->win && + clust->win->stat == HOLE_WINDOW && + clust->nr_pages == 0) { + assert("edward-1172", new_cluster(clust, inode)); + return 0; + } + return 1; + default: + impossible("edward-1173", "bad disk cluster state"); + return 0; + } +} + +static int +crc_make_unprepped_cluster (reiser4_cluster_t * clust, struct inode * inode) +{ + int result; + + assert("edward-1123", schedulable()); + assert("edward-737", clust != NULL); + assert("edward-738", inode != NULL); + assert("edward-739", crc_inode_ok(inode)); + assert("edward-1053", clust->hint != NULL); + assert("edward-1266", get_current_context()->grabbed_blocks == 0); + + if (clust->reserved){ + cluster_reserved2grabbed(estimate_insert_cluster(inode, 1)); +#if REISER4_DEBUG + assert("edward-1267", clust->reserved_unprepped == estimate_insert_cluster(inode, 1)); + clust->reserved_unprepped -= estimate_insert_cluster(inode, 1); +#endif + } + if (!should_create_unprepped_cluster(clust, inode)) { + all_grabbed2free(); + return 0; + } else { + assert("edward-1268", clust->reserved == 1); + } + result = ctail_insert_unprepped_cluster(clust, inode); + all_grabbed2free(); + if (result) + return result; + + assert("edward-743", crc_inode_ok(inode)); + assert("edward-1269", get_current_context()->grabbed_blocks == 0); + assert("edward-744", znode_is_write_locked(clust->hint->ext_coord.lh->node)); + + clust->dstat = UNPR_DISK_CLUSTER; + return 0; +} + +#if REISER4_DEBUG +static int +jnode_truncate_ok(struct inode *inode, cloff_t index) +{ + jnode * node; + node = jlookup(current_tree, get_inode_oid(inode), clust_to_pg(index, inode)); + if (node) + jput(node); + return (node == NULL); +} +#endif + +/* Collect unlocked cluster pages and jnode (the last is in the + 
case when the page cluster will be modified and captured) */ +reiser4_internal int +prepare_page_cluster(struct inode *inode, reiser4_cluster_t *clust, int capture) +{ + assert("edward-177", inode != NULL); + assert("edward-741", crc_inode_ok(inode)); + assert("edward-740", clust->pages != NULL); + + set_cluster_nrpages(clust, inode); + reset_cluster_pgset(clust, cluster_nrpages(inode)); + return (capture ? + grab_cluster_pages_jnode(inode, clust) : + grab_cluster_pages (inode, clust)); +} + +/* Truncate all the pages and jnode bound with the cluster of index @index */ +reiser4_internal void +truncate_page_cluster(struct inode *inode, cloff_t index) +{ + int i; + int found = 0; + int nr_pages; + jnode * node; + struct page * pages[MAX_CLUSTER_NRPAGES]; + + node = jlookup(current_tree, get_inode_oid(inode), clust_to_pg(index, inode)); + /* jnode is absent, just drop pages which can not + acquire jnode because of exclusive access */ + if (!node) { + truncate_inode_pages_range(inode->i_mapping, + clust_to_off(index, inode), + clust_to_off(index, inode) + inode_cluster_size(inode) - 1); + return; + } + /* jnode is present and may be dirty, if so, put + all the cluster pages except the first one */ + nr_pages = count_to_nrpages(off_to_count(inode->i_size, index, inode)); + + found = find_get_pages(inode->i_mapping, clust_to_pg(index, inode), + nr_pages, pages); + + LOCK_JNODE(node); + if (jnode_is_dirty(node)) { + /* jnode is dirty => space for disk cluster + conversion grabbed */ + cluster_reserved2grabbed(estimate_insert_cluster(inode, 0)); + grabbed2free(get_current_context(), + get_current_super_private(), + estimate_insert_cluster(inode, 0)); + + assert("edward-1198", found == nr_pages); + /* This will clear dirty bit so concurrent flush + won't start to convert the disk cluster */ + assert("edward-1199", PageUptodate(pages[0])); + uncapture_cluster_jnode(node); + + for (i = 1; i < nr_pages ; i++) { + assert("edward-1200", PageUptodate(pages[i])); + + 
page_cache_release(pages[i]); + } + } + else + UNLOCK_JNODE(node); + /* now drop pages and jnode */ + /* FIXME-EDWARD: Use truncate_complete_page in the loop above instead */ + + jput(node); + forget_cluster_pages(pages, found); + truncate_inode_pages_range(inode->i_mapping, + clust_to_off(index, inode), + clust_to_off(index, inode) + inode_cluster_size(inode) - 1); + assert("edward-1201", jnode_truncate_ok(inode, index)); + return; +} + +/* Prepare cluster handle before write. Called by all the clients which + are going to modify the page cluster and put it into a transaction + (file_write, truncate, writepages, etc.) + + . grab cluster pages; + . reserve disk space; + . maybe read pages from disk and set the disk cluster dirty; + . maybe write hole; + . maybe create 'unprepped' disk cluster (if the disk cluster is fake, i.e. isn't represented + by any items on disk) +*/ + +static int +prepare_cluster(struct inode *inode, + loff_t file_off /* write position in the file */, + loff_t to_file, /* bytes of user's data to write to the file */ + reiser4_cluster_t *clust, + page_cluster_op op) + +{ + int result = 0; + reiser4_slide_t * win = clust->win; + + assert("edward-1273", get_current_context()->grabbed_blocks == 0); + reset_cluster_params(clust); +#if REISER4_DEBUG + clust->ctx = get_current_context(); +#endif + assert("edward-1190", op != PCL_UNKNOWN); + + clust->op = op; + + result = prepare_page_cluster(inode, clust, 1); + if (result) + return result; + result = reserve4cluster(inode, clust, msg); + if (result) + goto err1; + result = read_some_cluster_pages(inode, clust); + if (result) { + free_reserved4cluster(inode, + clust, + estimate_insert_cluster(inode, 0) + + estimate_insert_cluster(inode, 1)); + goto err1; + } + assert("edward-1124", clust->dstat != INVAL_DISK_CLUSTER); + + result = crc_make_unprepped_cluster(clust, inode); + if (result) + goto err2; + if (win && win->stat == HOLE_WINDOW) { + result = write_hole(inode, clust, file_off, to_file); + if 
(result) + goto err2; + } + return 0; + err2: + free_reserved4cluster(inode, + clust, + estimate_insert_cluster(inode, 0)); + err1: + page_cache_release(clust->pages[0]); + release_cluster_pages_and_jnode(clust); + assert("edward-1125", 0); + return result; +} + +/* set window by two offsets */ +static void +set_window(reiser4_cluster_t * clust, reiser4_slide_t * win, + struct inode * inode, loff_t o1, loff_t o2) +{ + assert("edward-295", clust != NULL); + assert("edward-296", inode != NULL); + assert("edward-1071", win != NULL); + assert("edward-297", o1 <= o2); + + clust->index = off_to_clust(o1, inode); + + win->off = off_to_cloff(o1, inode); + win->count = min_count(inode_cluster_size(inode) - win->off, o2 - o1); + win->delta = 0; + + clust->win = win; +} + +static int +set_cluster_params(struct inode * inode, reiser4_cluster_t * clust, + reiser4_slide_t * win, flow_t * f, loff_t file_off) +{ + int result; + + assert("edward-197", clust != NULL); + assert("edward-1072", win != NULL); + assert("edward-198", inode != NULL); + assert("edward-747", reiser4_inode_data(inode)->cluster_shift <= MAX_CLUSTER_SHIFT); + + result = alloc_cluster_pgset(clust, cluster_nrpages(inode)); + if (result) + return result; + + if (file_off > inode->i_size) { + /* Uhmm, hole in crypto-file... 
*/ + loff_t hole_size; + hole_size = file_off - inode->i_size; + + printk("edward-176, Warning: Hole of size %llu in " + "cryptcompress file (inode %llu, offset %llu) \n", + hole_size, (unsigned long long)get_inode_oid(inode), file_off); + + set_window(clust, win, inode, inode->i_size, file_off); + win->stat = HOLE_WINDOW; + if (win->off + hole_size < inode_cluster_size(inode)) + /* there is also user's data to append to the hole */ + win->delta = min_count(inode_cluster_size(inode) - (win->off + win->count), f->length); + return 0; + } + set_window(clust, win, inode, file_off, file_off + f->length); + win->stat = DATA_WINDOW; + return 0; +} + +/* reset all the params that not get updated */ +reiser4_internal void +reset_cluster_params(reiser4_cluster_t * clust) +{ + assert("edward-197", clust != NULL); + + clust->dstat = INVAL_DISK_CLUSTER; + clust->tc.uptodate = 0; + clust->tc.len = 0; +} + +/* Main write procedure for cryptcompress objects, + this slices user's data into clusters and copies to page cache. 
+ If @buf != NULL, returns number of bytes in successfully written clusters, + otherwise returns error */ +/* FIXME_EDWARD replace flow by something lightweight */ + +static loff_t +write_cryptcompress_flow(struct file * file , struct inode * inode, const char *buf, size_t count, loff_t pos) +{ + int i; + flow_t f; + hint_t hint; + lock_handle lh; + int result = 0; + size_t to_write = 0; + loff_t file_off; + reiser4_slide_t win; + reiser4_cluster_t clust; + + assert("edward-161", schedulable()); + assert("edward-748", crc_inode_ok(inode)); + assert("edward-159", current_blocksize == PAGE_CACHE_SIZE); + assert("edward-749", reiser4_inode_data(inode)->cluster_shift <= MAX_CLUSTER_SHIFT); + assert("edward-1274", get_current_context()->grabbed_blocks == 0); + + result = load_file_hint(file, &hint); + if (result) + return result; + init_lh(&lh); + hint.ext_coord.lh = &lh; + + result = flow_by_inode_cryptcompress(inode, (char *)buf, 1 /* user space */, count, pos, WRITE_OP, &f); + if (result) + goto out; + to_write = f.length; + + /* current write position in file */ + file_off = pos; + reiser4_slide_init(&win); + reiser4_cluster_init(&clust, &win); + clust.hint = &hint; + + result = set_cluster_params(inode, &clust, &win, &f, file_off); + if (result) + goto out; + + if (next_window_stat(&win) == HOLE_WINDOW) { + result = prepare_cluster(inode, file_off, f.length, &clust, PCL_APPEND); + if (result) + goto out; + } + do { + char *src; + unsigned page_off, page_count; + + assert("edward-750", schedulable()); + + result = prepare_cluster(inode, file_off, f.length, &clust, PCL_APPEND); + if (result) + goto out; + + assert("edward-751", crc_inode_ok(inode)); + assert("edward-204", win.stat == DATA_WINDOW); + assert("edward-1288", clust.hint->ext_coord.valid); + assert("edward-752", znode_is_write_locked(hint.ext_coord.coord.node)); + + put_hint_cluster(&clust, inode, ZNODE_WRITE_LOCK); + + /* set write position in page */ + page_off = off_to_pgoff(win.off); + + /* copy user's 
data to cluster pages */ + for (i = off_to_pg(win.off), src = f.data; i < count_to_nrpages(win.off + win.count); i++, src += page_count) { + page_count = off_to_pgcount(win.off + win.count, i) - page_off; + + assert("edward-1039", page_off + page_count <= PAGE_CACHE_SIZE); + assert("edward-287", clust.pages[i] != NULL); + + lock_page(clust.pages[i]); + result = __copy_from_user((char *)kmap(clust.pages[i]) + page_off, src, page_count); + kunmap(clust.pages[i]); + if (unlikely(result)) { + unlock_page(clust.pages[i]); + result = -EFAULT; + goto err3; + } + SetPageUptodate(clust.pages[i]); + unlock_page(clust.pages[i]); + page_off = 0; + } + assert("edward-753", crc_inode_ok(inode)); + + set_cluster_pages_dirty(&clust); + + result = try_capture_cluster(&clust, inode); + if (result) + goto err2; + + assert("edward-998", f.user == 1); + + move_flow_forward(&f, win.count); + + /* disk cluster may be already clean at this point */ + + /* . update cluster + . set hint for new offset + . unlock znode + . update inode + . balance dirty pages + */ + result = balance_dirty_page_cluster(&clust, inode, 0, f.length); + if(result) + goto err1; + assert("edward-755", hint.ext_coord.lh->owner == NULL); + reset_cluster_params(&clust); + continue; + err3: + page_cache_release(clust.pages[0]); + err2: + release_cluster_pages_and_jnode(&clust); + err1: + if (clust.reserved) + free_reserved4cluster(inode, + &clust, + estimate_insert_cluster(inode, 0)); + break; + } while (f.length); + out: + done_lh(&lh); + if (result == -EEXIST) + printk("write returns EEXIST!\n"); + + put_cluster_handle(&clust, TFM_READ); + save_file_hint(file, &hint); + if (buf) { + /* if nothing were written - there must be an error */ + assert("edward-195", ergo((to_write == f.length), result < 0)); + return (to_write - f.length) ? 
(to_write - f.length) : result; + } + return result; +} + +static ssize_t +write_crc_file(struct file * file, /* file to write to */ + struct inode *inode, /* inode */ + const char *buf, /* address of user-space buffer */ + size_t count, /* number of bytes to write */ + loff_t * off /* position to write at */) +{ + + int result; + loff_t pos; + ssize_t written; + cryptcompress_info_t * info = cryptcompress_inode_data(inode); + + assert("edward-196", crc_inode_ok(inode)); + + result = generic_write_checks(file, off, &count, 0); + if (unlikely(result != 0)) + return result; + + if (unlikely(count == 0)) + return 0; + + /* FIXME-EDWARD: other UNIX features */ + + down_write(&info->lock); + LOCK_CNT_INC(inode_sem_w); + + pos = *off; + written = write_cryptcompress_flow(file, inode, (char *)buf, count, pos); + up_write(&info->lock); /* release on every path, including the error return */ + LOCK_CNT_DEC(inode_sem_w); + + if (written < 0) { + if (written == -EEXIST) + printk("write_crc_file returns EEXIST!\n"); + return written; + } + + /* update position in a file */ + *off = pos + written; + + /* return number of written bytes */ + return written; +} + +/* plugin->u.file.write */ +reiser4_internal ssize_t +write_cryptcompress(struct file * file, /* file to write to */ + const char *buf, /* address of user-space buffer */ + size_t count, /* number of bytes to write */ + loff_t * off /* position to write at */) +{ + ssize_t result; + struct inode *inode; + + inode = file->f_dentry->d_inode; + + down(&inode->i_sem); + + result = write_crc_file(file, inode, buf, count, off); + + up(&inode->i_sem); + return result; +} + +static void +readpages_crc(struct address_space *mapping, struct list_head *pages, void *data) +{ + file_plugin *fplug; + item_plugin *iplug; + + assert("edward-1112", mapping != NULL); + assert("edward-1113", mapping->host != NULL); + + fplug = inode_file_plugin(mapping->host); + assert("edward-1114", fplug == file_plugin_by_id(CRC_FILE_PLUGIN_ID)); + iplug = item_plugin_by_id(CTAIL_ID); + + 
iplug->s.file.readpages(data, mapping, pages); + + return; +} + +static reiser4_block_nr +cryptcompress_estimate_read(struct inode *inode) +{ + /* reserve one block to update stat data item */ + assert("edward-1193", + inode_file_plugin(inode)->estimate.update == estimate_update_common); + return estimate_update_common(inode); +} + +/* plugin->u.file.read */ +ssize_t read_cryptcompress(struct file * file, char *buf, size_t size, loff_t * off) +{ + ssize_t result; + struct inode *inode; + reiser4_file_fsdata * fsdata; + cryptcompress_info_t * info; + reiser4_block_nr needed; + + inode = file->f_dentry->d_inode; + assert("edward-1194", !inode_get_flag(inode, REISER4_NO_SD)); + assert("edward-1195", inode_get_flag(inode, REISER4_CLUSTER_KNOWN)); + + info = cryptcompress_inode_data(inode); + needed = cryptcompress_estimate_read(inode); + /* FIXME-EDWARD: + Grab space for sd_update so find_cluster will be happy */ +#if 0 + result = reiser4_grab_space(needed, BA_CAN_COMMIT); + if (result != 0) + return result; +#endif + fsdata = reiser4_get_file_fsdata(file); + fsdata->ra2.data = file; + fsdata->ra2.readpages = readpages_crc; + + down_read(&info->lock); + LOCK_CNT_INC(inode_sem_r); + + result = generic_file_read(file, buf, size, off); + + up_read(&info->lock); + LOCK_CNT_DEC(inode_sem_r); + + return result; +} + +static void +set_append_cluster_key(const coord_t *coord, reiser4_key *key, struct inode *inode) +{ + item_key_by_coord(coord, key); + set_key_offset(key, ((__u64)(clust_by_coord(coord, inode)) + 1) << inode_cluster_shift(inode) << PAGE_CACHE_SHIFT); +} + +/* If @index > 0, find real disk cluster of the index (@index - 1). + If @index == 0, find the real disk cluster of the object of maximal index. + Keep incremented index of the result in @found. + If success was returned: + (@index == 0 && @found == 0) means that the object doesn't have real disk + clusters. + (@index != 0 && @found == 0) means that disk cluster of @index doesn't exist. 
+*/ +static int +find_real_disk_cluster(struct inode * inode, cloff_t * found, cloff_t index) +{ + int result; + reiser4_key key; + loff_t offset; + hint_t hint; + lookup_bias bias; + coord_t *coord; + lock_handle lh; + item_plugin *iplug; + file_plugin *fplug = inode_file_plugin(inode); + + assert("edward-1131", fplug == file_plugin_by_id(CRC_FILE_PLUGIN_ID)); + assert("edward-95", crc_inode_ok(inode)); + + init_lh(&lh); + hint_init_zero(&hint); + hint.ext_coord.lh = &lh; + + bias = (index ? FIND_EXACT : FIND_MAX_NOT_MORE_THAN); + offset = (index ? clust_to_off(index, inode) - 1 : get_key_offset(max_key())); + + fplug->key_by_inode(inode, offset, &key); + + /* find the last item of this object */ + result = find_cluster_item(&hint, &key, ZNODE_READ_LOCK, 0/* ra_info */, bias, 0); + if (cbk_errored(result)) { + done_lh(&lh); + return result; + } + if (result == CBK_COORD_NOTFOUND) { + /* no real disk clusters */ + done_lh(&lh); + *found = 0; + return 0; + } + /* disk cluster is found */ + coord = &hint.ext_coord.coord; + coord_clear_iplug(coord); + result = zload(coord->node); + if (unlikely(result)) { + done_lh(&lh); + return result; + } + iplug = item_plugin_by_coord(coord); + assert("edward-277", iplug == item_plugin_by_id(CTAIL_ID)); + assert("edward-1202", ctail_ok(coord)); + + set_append_cluster_key(coord, &key, inode); + + *found = off_to_clust(get_key_offset(&key), inode); + + assert("edward-1132", ergo(index, index == *found)); + + zrelse(coord->node); + done_lh(&lh); + + return 0; +} + +static int +find_actual_cloff(struct inode *inode, cloff_t * index) +{ + return find_real_disk_cluster(inode, index, 0 /* find last real one */); +} + +/* Set left coord when unit is not found after node_lookup() + This takes into account that there can be holes in a sequence + of disk clusters */ + +static void +adjust_left_coord(coord_t * left_coord) +{ + switch(left_coord->between) { + case AFTER_UNIT: + left_coord->between = AFTER_ITEM; + case AFTER_ITEM: + case 
BEFORE_UNIT: + break; + default: + impossible("edward-1204", "bad left coord to cut"); + } + return; +} + +#define CRC_CUT_TREE_MIN_ITERATIONS 64 +reiser4_internal int +cut_tree_worker_cryptcompress(tap_t * tap, const reiser4_key * from_key, + const reiser4_key * to_key, reiser4_key * smallest_removed, + struct inode * object, int truncate, int *progress) +{ + lock_handle next_node_lock; + coord_t left_coord; + int result; + + assert("edward-1158", tap->coord->node != NULL); + assert("edward-1159", znode_is_write_locked(tap->coord->node)); + assert("edward-1160", znode_get_level(tap->coord->node) == LEAF_LEVEL); + + *progress = 0; + init_lh(&next_node_lock); + + while (1) { + znode *node; /* node from which items are cut */ + node_plugin *nplug; /* node plugin for @node */ + + node = tap->coord->node; + + /* Move next_node_lock to the next node on the left. */ + result = reiser4_get_left_neighbor( + &next_node_lock, node, ZNODE_WRITE_LOCK, GN_CAN_USE_UPPER_LEVELS); + if (result != 0 && result != -E_NO_NEIGHBOR) + break; + /* FIXME-EDWARD: Check can we delete the node as a whole. 
*/ + result = tap_load(tap); + if (result) + return result; + + /* Prepare the second (right) point for cut_node() */ + if (*progress) + coord_init_last_unit(tap->coord, node); + + else if (item_plugin_by_coord(tap->coord)->b.lookup == NULL) + /* set rightmost unit for the items without lookup method */ + tap->coord->unit_pos = coord_last_unit_pos(tap->coord); + + nplug = node->nplug; + + assert("edward-1161", nplug); + assert("edward-1162", nplug->lookup); + + /* left_coord is leftmost unit cut from @node */ + result = nplug->lookup(node, from_key, + FIND_EXACT, &left_coord); + + if (IS_CBKERR(result)) + break; + + if (result == CBK_COORD_NOTFOUND) + adjust_left_coord(&left_coord); + + /* adjust coordinates so that they are set to existing units */ + if (coord_set_to_right(&left_coord) || coord_set_to_left(tap->coord)) { + result = 0; + break; + } + + if (coord_compare(&left_coord, tap->coord) == COORD_CMP_ON_RIGHT) { + /* keys from @from_key to @to_key are not in the tree */ + result = 0; + break; + } + + /* cut data from one node */ + *smallest_removed = *min_key(); + result = kill_node_content(&left_coord, + tap->coord, + from_key, + to_key, + smallest_removed, + next_node_lock.node, + object, truncate); +#if REISER4_DEBUG + /*node_check(node, ~0U);*/ +#endif + tap_relse(tap); + + if (result) + break; + + ++ (*progress); + + /* Check whether all items with keys >= from_key were removed + * from the tree. */ + if (keyle(smallest_removed, from_key)) + /* result = 0;*/ + break; + + if (next_node_lock.node == NULL) + break; + + result = tap_move(tap, &next_node_lock); + done_lh(&next_node_lock); + if (result) + break; + + /* Break long cut_tree operation (deletion of a large file) if + * atom requires commit. */ + if (*progress > CRC_CUT_TREE_MIN_ITERATIONS + && current_atom_should_commit()) + { + result = -E_REPEAT; + break; + } + } + done_lh(&next_node_lock); + return result; +} + +/* Append or expand hole in two steps (exclusive access should be acquired!) 
+ 1) write zeroes to the last existing cluster, + 2) expand hole via fake clusters (just increase i_size) */ +static int +cryptcompress_append_hole(struct inode * inode /*contains old i_size */, + loff_t new_size) +{ + int result = 0; + hint_t hint; + loff_t hole_size; + int nr_zeroes; + lock_handle lh; + reiser4_slide_t win; + reiser4_cluster_t clust; + + assert("edward-1133", inode->i_size < new_size); + assert("edward-1134", schedulable()); + assert("edward-1135", crc_inode_ok(inode)); + assert("edward-1136", current_blocksize == PAGE_CACHE_SIZE); + + init_lh(&lh); + hint_init_zero(&hint); + hint.ext_coord.lh = &lh; + + reiser4_slide_init(&win); + reiser4_cluster_init(&clust, &win); + clust.hint = &hint; + + if (off_to_cloff(inode->i_size, inode) == 0) + /* appending hole to cluster boundary */ + goto fake_append; + + /* set cluster handle */ + + result = alloc_cluster_pgset(&clust, cluster_nrpages(inode)); + if (result) + goto out; + hole_size = new_size - inode->i_size; + nr_zeroes = min_count(inode_cluster_size(inode) - off_to_cloff(inode->i_size, inode), hole_size); + + set_window(&clust, &win, inode, inode->i_size, inode->i_size + nr_zeroes); + win.stat = HOLE_WINDOW; + + assert("edward-1137", clust.index == off_to_clust(inode->i_size, inode)); +#if REISER4_DEBUG + printk("edward-1138, Warning: Hole of size %llu in " + "cryptcompress file (inode %llu); " + "%u zeroes appended to cluster (index = %lu) \n", + hole_size, (unsigned long long)get_inode_oid(inode), nr_zeroes, clust.index); +#endif + result = prepare_cluster(inode, 0, 0, &clust, PCL_APPEND); + assert("edward-1271", !result); + if (result) + goto out; + assert("edward-1139", + clust.dstat == PREP_DISK_CLUSTER || + clust.dstat == UNPR_DISK_CLUSTER); + + hole_size -= nr_zeroes; + if (!hole_size) + /* nothing to append anymore */ + goto out; + fake_append: + + INODE_SET_FIELD(inode, i_size, new_size); + out: + done_lh(&lh); + put_cluster_handle(&clust, TFM_READ); + return result; +} + +#if 
REISER4_DEBUG +static int +page_truncate_ok(struct inode * inode, loff_t old_size, pgoff_t start) +{ + struct pagevec pvec; + int i; + int count; + int rest; + + rest = count_to_nrpages(old_size) - start; + + pagevec_init(&pvec, 0); + count = min_count(pagevec_space(&pvec), rest); + + while (rest) { + count = min_count(pagevec_space(&pvec), rest); + pvec.nr = find_get_pages(inode->i_mapping, start, + count, pvec.pages); + for (i = 0; i < pagevec_count(&pvec); i++) { + if (PageUptodate(pvec.pages[i])) { + warning("edward-1205", + "truncated page of index %lu is uptodate", + pvec.pages[i]->index); + return 0; + } + } + start += count; + rest -= count; + pagevec_release(&pvec); + } + return 1; +} + +static int +body_truncate_ok(struct inode * inode, cloff_t aidx) +{ + int result; + cloff_t raidx; + + result = find_actual_cloff(inode, &raidx); + return !result && (aidx == raidx); +} +#endif + +static int +update_cryptcompress_size(struct inode * inode, reiser4_key * key, int update_sd) +{ + return (get_key_offset(key) & ((loff_t)(inode_cluster_size(inode)) - 1) ? + 0 : + update_file_size(inode, key, update_sd)); +} + +/* prune cryptcompress file in two steps (exclusive access should be acquired!) 
+ 1) cut all disk clusters but the last one partially truncated, + 2) set zeroes and capture last partially truncated page cluster if the last + one exists, otherwise truncate via prune fake cluster (just decrease i_size) +*/ +static int +prune_cryptcompress(struct inode * inode, loff_t new_size, int update_sd, + cloff_t aidx) +{ + int result = 0; + unsigned nr_zeroes; + loff_t to_prune; + loff_t old_size; + cloff_t fidx; + + hint_t hint; + lock_handle lh; + reiser4_slide_t win; + reiser4_cluster_t clust; + + assert("edward-1140", inode->i_size > new_size); + assert("edward-1141", schedulable()); + assert("edward-1142", crc_inode_ok(inode)); + assert("edward-1143", current_blocksize == PAGE_CACHE_SIZE); + + init_lh(&lh); + hint_init_zero(&hint); + hint.ext_coord.lh = &lh; + + reiser4_slide_init(&win); + reiser4_cluster_init(&clust, &win); + clust.hint = &hint; + + /* first completely truncated cluster */ + fidx = count_to_nrclust(new_size, inode); + + assert("edward-1174", fidx <= aidx); + old_size = inode->i_size; + if (fidx != aidx) { + result = cut_file_items(inode, + clust_to_off(fidx, inode), + update_sd, + clust_to_off(aidx, inode), + update_cryptcompress_size); + if (result) + goto out; + } + if (!off_to_cloff(new_size, inode)) { + /* no partially truncated clusters */ + assert("edward-1145", inode->i_size == new_size); + goto finish; + } + assert("edward-1146", new_size < inode->i_size); + + to_prune = inode->i_size - new_size; + + /* check if partially truncated cluster is fake */ + result = find_real_disk_cluster(inode, &aidx, fidx); + if (result) + goto out; + if (!aidx) + /* yup, this is fake one */ + goto finish; + + assert("edward-1148", aidx == fidx); + + /* try to capture partially truncated page cluster */ + result = alloc_cluster_pgset(&clust, cluster_nrpages(inode)); + if (result) + goto out; + nr_zeroes = (off_to_pgoff(new_size) ? 
+ PAGE_CACHE_SIZE - off_to_pgoff(new_size) : + 0); + set_window(&clust, &win, inode, new_size, new_size + nr_zeroes); + win.stat = HOLE_WINDOW; + + assert("edward-1149", clust.index == fidx - 1); + + result = prepare_cluster(inode, 0, 0, &clust, PCL_TRUNCATE); + if (result) + goto out; + assert("edward-1151", + clust.dstat == PREP_DISK_CLUSTER || + clust.dstat == UNPR_DISK_CLUSTER); + + assert("edward-1191", inode->i_size == new_size); + assert("edward-1206", body_truncate_ok(inode, fidx)); + finish: + /* drop all the pages that don't have jnodes + because of holes represented by fake disk clusters + including the pages of partially truncated cluster + which was released by prepare_cluster() */ + truncate_inode_pages(inode->i_mapping, + pg_to_off(count_to_nrpages(new_size))); + INODE_SET_FIELD(inode, i_size, new_size); + out: + done_lh(&lh); + put_cluster_handle(&clust, TFM_READ); + return result; +} + +/* returns true if the cluster we prune or append to is fake */ +static int +truncating_last_fake_dc(struct inode * inode, cloff_t aidx, loff_t new_size) +{ + return aidx == 0 /* no items */|| + (aidx <= off_to_clust(inode->i_size, inode) && + aidx <= off_to_clust(new_size, inode)); +} + +/* This is called in setattr_cryptcompress when it is used to truncate, + and in delete_cryptcompress */ + +static int +cryptcompress_truncate(struct inode *inode, /* old size */ + loff_t new_size, /* new size */ + int update_sd) +{ + int result; + cloff_t aidx; /* appended index to the last actual one */ + loff_t old_size = inode->i_size; + + assert("edward-1167", (new_size != old_size) || (!new_size && !old_size)); + + result = find_actual_cloff(inode, &aidx); + if (result) + return result; + + assert("edward-1208", + ergo(aidx > 0, inode->i_size > clust_to_off(aidx - 1, inode))); + + if (truncating_last_fake_dc(inode, aidx, new_size)) { + /* we do not need to truncate items, so just drop pages + which can not acquire jnodes because of exclusive access */ + + 
INODE_SET_FIELD(inode, i_size, new_size); + if (old_size > new_size) { + truncate_inode_pages(inode->i_mapping, + pg_to_off(count_to_nrpages(new_size))); + assert("edward-663", ergo(!new_size, + reiser4_inode_data(inode)->anonymous_eflushed == 0 && + reiser4_inode_data(inode)->captured_eflushed == 0)); + } + if (update_sd) + result = update_sd_cryptcompress(inode); + return result; + } + result = (old_size < new_size ? cryptcompress_append_hole(inode, new_size) : + prune_cryptcompress(inode, new_size, update_sd, aidx)); + + assert("edward-1209", + page_truncate_ok(inode, old_size, count_to_nrpages(new_size))); + return result; +} + +/* plugin->u.file.truncate */ +reiser4_internal int +truncate_cryptcompress(struct inode *inode, loff_t new_size) +{ + return 0; +} + +/* page cluster is anonymous if it contains at least one anonymous page */ +static int +capture_anonymous_cluster(reiser4_cluster_t * clust, struct inode * inode) +{ + int result; + + assert("edward-1073", clust != NULL); + assert("edward-1074", inode != NULL); + assert("edward-1075", clust->dstat == INVAL_DISK_CLUSTER); + + result = prepare_cluster(inode, 0, 0, clust, PCL_APPEND); + if (result) + return result; + set_cluster_pages_dirty(clust); + + result = try_capture_cluster(clust, inode); + set_hint_cluster(inode, clust->hint, clust->index + 1, ZNODE_WRITE_LOCK); + if (result) + release_cluster_pages_and_jnode(clust); + return result; +} + +static void +redirty_inode(struct inode *inode) +{ + spin_lock(&inode_lock); + inode->i_state |= I_DIRTY; + spin_unlock(&inode_lock); +} + +#define CAPTURE_APAGE_BURST (1024) + +static int +capture_anonymous_clusters(struct address_space * mapping, pgoff_t * index) +{ + int result = 0; + int to_capture; + int found; + struct page * page = NULL; + hint_t hint; + lock_handle lh; + reiser4_cluster_t clust; + + assert("edward-1127", mapping != NULL); + assert("edward-1128", mapping->host != NULL); + + init_lh(&lh); + hint_init_zero(&hint); + hint.ext_coord.lh = &lh; + 
reiser4_cluster_init(&clust, 0); + clust.hint = &hint; + + result = alloc_cluster_pgset(&clust, cluster_nrpages(mapping->host)); + if (result) + goto out; + to_capture = (__u32)CAPTURE_APAGE_BURST >> inode_cluster_shift(mapping->host); + + do { + found = find_get_pages_tag(mapping, index, PAGECACHE_TAG_REISER4_MOVED, 1, &page); + if (!found) + break; + assert("edward-1109", page != NULL); + + clust.index = pg_to_clust(*index, mapping->host); + + result = capture_anonymous_cluster(&clust, mapping->host); + if (result) { + page_cache_release(page); + break; + } + page_cache_release(page); + to_capture --; + + assert("edward-1076", clust.index <= pg_to_clust(*index, mapping->host)); + /* index of the next cluster to capture */ + if (clust.index == pg_to_clust(*index, mapping->host)) + *index = clust_to_pg(clust.index + 1, mapping->host); + } while (to_capture); + + if (result) { + warning("edward-1077", "Cannot capture anon pages: result=%i (captured=%d)\n", + result, + ((__u32)CAPTURE_APAGE_BURST >> inode_cluster_shift(mapping->host)) - to_capture); + } else { + /* something had to be found */ + assert("edward-1078", to_capture <= CAPTURE_APAGE_BURST); + if (to_capture == 0) + /* there may be left more pages */ + redirty_inode(mapping->host); + } + out: + done_lh(&lh); + put_cluster_handle(&clust, TFM_READ); + return result; +} + +/* Check mapping for existence of not captured dirty pages. 
+ This returns !0 if either page tree contains pages tagged + PAGECACHE_TAG_REISER4_MOVED */ +static int +crc_inode_has_anon_pages(struct inode *inode) +{ + return mapping_tagged(inode->i_mapping, PAGECACHE_TAG_REISER4_MOVED); +} + +/* plugin->u.file.capture */ +reiser4_internal int +capture_cryptcompress(struct inode *inode, struct writeback_control *wbc) +{ + int result; + pgoff_t index = 0; + cryptcompress_info_t * info; + + if (!crc_inode_has_anon_pages(inode)) + return 0; + + info = cryptcompress_inode_data(inode); + + do { + reiser4_context ctx; + + if (is_in_reiser4_context()) { + /* It can be in the context of write system call from + balance_dirty_pages() */ + if (down_read_trylock(&info->lock) == 0) { + result = RETERR(-EBUSY); + break; + } + } else + down_read(&info->lock); + + init_context(&ctx, inode->i_sb); + ctx.nobalance = 1; + + assert("edward-1079", lock_stack_isclean(get_current_lock_stack())); + + LOCK_CNT_INC(inode_sem_r); + + result = capture_anonymous_clusters(inode->i_mapping, &index); + + up_read(&info->lock); + + LOCK_CNT_DEC(inode_sem_r); + + if (result != 0 || wbc->sync_mode != WB_SYNC_ALL) { + reiser4_exit_context(&ctx); + break; + } + result = txnmgr_force_commit_all(inode->i_sb, 0); + reiser4_exit_context(&ctx); + } while (result == 0 && crc_inode_has_anon_pages(inode)); + + return result; +} + +/* plugin->u.file.mmap */ +reiser4_internal int +mmap_cryptcompress(struct file * file, struct vm_area_struct * vma) +{ + return -ENOSYS; + //return generic_file_mmap(file, vma); +} + + +/* plugin->u.file.release */ +/* plugin->u.file.get_block */ +/* This function is used for ->bmap() VFS method in reiser4 address_space_operations */ +reiser4_internal int +get_block_cryptcompress(struct inode *inode, sector_t block, struct buffer_head *bh_result, int create UNUSED_ARG) +{ + if (current_blocksize != inode_cluster_size(inode)) + return RETERR(-EINVAL); + else { + int result; + reiser4_key key; + hint_t hint; + lock_handle lh; + item_plugin 
*iplug; + + assert("edward-1166", 0); + assert("edward-420", create == 0); + key_by_inode_cryptcompress(inode, (loff_t)block * current_blocksize, &key); + init_lh(&lh); + hint_init_zero(&hint); + hint.ext_coord.lh = &lh; + result = find_cluster_item(&hint, &key, ZNODE_READ_LOCK, 0, FIND_EXACT, 0); + if (result != CBK_COORD_FOUND) { + done_lh(&lh); + return result; + } + result = zload(hint.ext_coord.coord.node); + if (unlikely(result)) { + done_lh(&lh); + return result; + } + iplug = item_plugin_by_coord(&hint.ext_coord.coord); + + assert("edward-421", iplug == item_plugin_by_id(CTAIL_ID)); + + if (iplug->s.file.get_block) + result = iplug->s.file.get_block(&hint.ext_coord.coord, block, bh_result); + else + result = RETERR(-EINVAL); + + zrelse(hint.ext_coord.coord.node); + done_lh(&lh); + return result; + } +} + +/* plugin->u.file.delete method + see plugin.h for description */ +reiser4_internal int +delete_cryptcompress(struct inode *inode) +{ + int result; + + assert("edward-429", inode->i_nlink == 0); + + if (inode->i_size) { + result = cryptcompress_truncate(inode, 0, 0); + if (result) { + warning("edward-430", "cannot truncate cryptcompress file %lli: %i", + (unsigned long long)get_inode_oid(inode), result); + return result; + } + } + return delete_object(inode, 0); +} + +/* plugin->u.file.pre_delete method + see plugin.h for description */ +reiser4_internal int +pre_delete_cryptcompress(struct inode *inode) +{ + return cryptcompress_truncate(inode, 0, 0); +} + +/* plugin->u.file.setattr method + see plugin.h for description */ +reiser4_internal int +setattr_cryptcompress(struct inode *inode, /* Object to change attributes */ + struct iattr *attr /* change description */ ) +{ + int result; + + if (attr->ia_valid & ATTR_SIZE) { + /* EDWARD-FIXME-HANS: VS-FIXME-HANS: + Q: this case occurs when? truncate? + A: yes + + Q: If so, why isn't this code in truncate itself instead of here? 
+ + A: because vfs calls fs's truncate after it has called truncate_inode_pages to get rid of pages + corresponding to part of file being truncated. In reiser4 it may cause existence of unallocated + extents which do not have jnodes. Flush code does not expect that. Solution of this problem is + straightforward. As vfs's truncate is implemented using setattr operation (common implementation of + which calls truncate_inode_pages and fs's truncate in case when size of file changes) - it seems + reasonable to have reiser4_setattr which will take care of removing pages, jnodes and extents + simultaneously in case of truncate. + Q: do you think implementing truncate using setattr is ugly, + and vfs needs improving, or is there some sense in which this is a good design? + + A: VS-FIXME-HANS: + */ + + /* truncate does reservation itself and requires exclusive access to be obtained */ + if (inode->i_size != attr->ia_size) { + loff_t old_size; + cryptcompress_info_t * info = cryptcompress_inode_data(inode); + + down_write(&info->lock); + LOCK_CNT_INC(inode_sem_w); + + inode_check_scale(inode, inode->i_size, attr->ia_size); + + old_size = inode->i_size; + + result = cryptcompress_truncate(inode, attr->ia_size, 1/* update stat data */); + if (result) { + warning("edward-1192", "truncate_cryptcompress failed: oid %lli, " + "old size %lld, new size %lld, retval %d", + (unsigned long long)get_inode_oid(inode), + old_size, attr->ia_size, result); + } + up_write(&info->lock); + LOCK_CNT_DEC(inode_sem_w); + } else + result = 0; + } else + result = setattr_common(inode, attr); + return result; +} + +static int +save_len_cryptcompress_plugin(struct inode * inode, reiser4_plugin * plugin) +{ + assert("edward-457", inode != NULL); + assert("edward-458", plugin != NULL); + assert("edward-459", plugin->h.id == CRC_FILE_PLUGIN_ID); + return 0; +} + +static int +load_cryptcompress_plugin(struct inode * inode, reiser4_plugin * plugin, char **area, int *len) +{ + assert("edward-455", inode != 
NULL); + assert("edward-456", (reiser4_inode_data(inode)->pset != NULL)); + + plugin_set_file(&reiser4_inode_data(inode)->pset, file_plugin_by_id(CRC_FILE_PLUGIN_ID)); + return 0; +} + +static int +change_crypto_file(struct inode * inode, reiser4_plugin * plugin) +{ + /* cannot change object plugin of already existing object */ + return RETERR(-EINVAL); +} + +struct reiser4_plugin_ops cryptcompress_plugin_ops = { + .load = load_cryptcompress_plugin, + .save_len = save_len_cryptcompress_plugin, + .save = NULL, + .alignment = 8, + .change = change_crypto_file +}; + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/cryptcompress.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/cryptcompress.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,518 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ +/* See http://www.namesys.com/cryptcompress_design.html */ + +#if !defined( __FS_REISER4_CRYPTCOMPRESS_H__ ) +#define __FS_REISER4_CRYPTCOMPRESS_H__ + +#include "compress/compress.h" + +#include +#include +#include + +#define MIN_CLUSTER_SIZE PAGE_CACHE_SIZE +#define MAX_CLUSTER_SHIFT 4 +#define MAX_CLUSTER_NRPAGES (1 << MAX_CLUSTER_SHIFT) +#define DEFAULT_CLUSTER_SHIFT 0 +#define DC_CHECKSUM_SIZE 4 +#define MIN_CRYPTO_BLOCKSIZE 8 + +typedef unsigned long cloff_t; + +/* Set of transform id's supported by reiser4, + each transform is implemented by appropriate transform plugin: */ +typedef enum { + CRYPTO_TFM, /* crypto plugin */ + DIGEST_TFM, /* digest plugin */ + COMPRESS_TFM, /* compression plugin */ + LAST_TFM +} reiser4_tfm; + +typedef struct tfm_stream { + __u8 * data; + size_t size; +} tfm_stream_t; + +typedef enum { + INPUT_STREAM, + OUTPUT_STREAM, + LAST_STREAM +} tfm_stream_id; + +typedef tfm_stream_t * tfm_unit[LAST_STREAM]; + +static inline __u8 * +ts_data(tfm_stream_t * stm) +{ + 
assert("edward-928", stm != NULL); + return stm->data; +} + +static inline size_t +ts_size(tfm_stream_t * stm) +{ + assert("edward-929", stm != NULL); + return stm->size; +} + +static inline void +set_ts_size(tfm_stream_t * stm, size_t size) +{ + assert("edward-930", stm != NULL); + + stm->size = size; +} + +static inline int +alloc_ts(tfm_stream_t ** stm) +{ + assert("edward-931", stm); + assert("edward-932", *stm == NULL); + + *stm = reiser4_kmalloc(sizeof ** stm, GFP_KERNEL); + if (*stm == NULL) + return -ENOMEM; + memset(*stm, 0, sizeof ** stm); + return 0; +} + +static inline void +free_ts(tfm_stream_t * stm) +{ + assert("edward-933", !ts_data(stm)); + assert("edward-934", !ts_size(stm)); + + reiser4_kfree(stm); +} + +static inline int +alloc_ts_data(tfm_stream_t * stm, size_t size) +{ + assert("edward-935", !ts_data(stm)); + assert("edward-936", !ts_size(stm)); + assert("edward-937", size != 0); + + stm->data = vmalloc(size); + if (!stm->data) + return -ENOMEM; + set_ts_size(stm, size); + return 0; +} + +static inline void +free_ts_data(tfm_stream_t * stm) +{ + assert("edward-938", equi(ts_data(stm), ts_size(stm))); + + if (ts_data(stm)) + vfree(ts_data(stm)); + memset(stm, 0, sizeof *stm); +} + +/* Write modes for item conversion in flush convert phase */ +typedef enum { + CRC_APPEND_ITEM = 1, + CRC_OVERWRITE_ITEM = 2, + CRC_CUT_ITEM = 3 +} crc_write_mode_t; + +typedef enum { + PCL_UNKNOWN = 0, /* invalid option */ + PCL_APPEND = 1, /* append and/or overwrite */ + PCL_TRUNCATE = 2 /* truncate */ +} page_cluster_op; + +/* Reiser4 file write/read transforms page cluster into disk cluster (and back) + using crypto/compression transforms implemented by reiser4 transform plugins. + Before each transform we allocate a pair of streams (tfm_unit) and assemble + page cluster into the input one. After transform we split output stream into + a set of items (disk cluster). 
+*/ +typedef struct tfm_cluster{ + coa_set coa; + tfm_unit tun; + int uptodate; + int len; +} tfm_cluster_t; + +static inline coa_t +get_coa(tfm_cluster_t * tc, reiser4_compression_id id) +{ + return tc->coa[id]; +} + +static inline void +set_coa(tfm_cluster_t * tc, reiser4_compression_id id, coa_t coa) +{ + tc->coa[id] = coa; +} + +static inline int +alloc_coa(tfm_cluster_t * tc, compression_plugin * cplug, tfm_action act) +{ + coa_t coa; + + coa = cplug->alloc(act); + if (IS_ERR(coa)) + return PTR_ERR(coa); + set_coa(tc, cplug->h.id, coa); + return 0; +} + +static inline void +free_coa_set(tfm_cluster_t * tc, tfm_action act) +{ + reiser4_compression_id i; + compression_plugin * cplug; + + assert("edward-810", tc != NULL); + + for(i = 0; i < LAST_COMPRESSION_ID; i++) { + if (!get_coa(tc, i)) + continue; + cplug = compression_plugin_by_id(i); + assert("edward-812", cplug->free != NULL); + cplug->free(get_coa(tc, i), act); + set_coa(tc, i, 0); + } + return; +} + +static inline tfm_stream_t * +tfm_stream (tfm_cluster_t * tc, tfm_stream_id id) +{ + return tc->tun[id]; +} + +static inline void +set_tfm_stream (tfm_cluster_t * tc, tfm_stream_id id, tfm_stream_t * ts) +{ + tc->tun[id] = ts; +} + +static inline __u8 * +tfm_stream_data (tfm_cluster_t * tc, tfm_stream_id id) +{ + return ts_data(tfm_stream(tc, id)); +} + +static inline void +set_tfm_stream_data(tfm_cluster_t * tc, tfm_stream_id id, __u8 * data) +{ + tfm_stream(tc, id)->data = data; +} + +static inline size_t +tfm_stream_size (tfm_cluster_t * tc, tfm_stream_id id) +{ + return ts_size(tfm_stream(tc, id)); +} + +static inline void +set_tfm_stream_size(tfm_cluster_t * tc, tfm_stream_id id, size_t size) +{ + tfm_stream(tc, id)->size = size; +} + +static inline int +alloc_tfm_stream(tfm_cluster_t * tc, size_t size, tfm_stream_id id) +{ + assert("edward-939", tc != NULL); + assert("edward-940", !tfm_stream(tc, id)); + + tc->tun[id] = reiser4_kmalloc(sizeof(tfm_stream_t), GFP_KERNEL); + if (!tc->tun[id]) + return 
-ENOMEM; + memset(tfm_stream(tc, id), 0, sizeof(tfm_stream_t)); + return alloc_ts_data(tfm_stream(tc, id), size); +} + +static inline int +realloc_tfm_stream(tfm_cluster_t * tc, size_t size, tfm_stream_id id) +{ + assert("edward-941", tfm_stream_size(tc, id) < size); + free_ts_data(tfm_stream(tc, id)); + return alloc_ts_data(tfm_stream(tc, id), size); +} + +static inline void +free_tfm_stream(tfm_cluster_t * tc, tfm_stream_id id) +{ + free_ts_data(tfm_stream(tc, id)); + free_ts(tfm_stream(tc, id)); + set_tfm_stream(tc, id, 0); +} + +static inline void +free_tfm_unit(tfm_cluster_t * tc) +{ + tfm_stream_id id; + for (id = 0; id < LAST_STREAM; id++) { + if (!tfm_stream(tc, id)) + continue; + free_tfm_stream(tc, id); + } +} + +static inline void +put_tfm_cluster(tfm_cluster_t * tc, tfm_action act) +{ + assert("edward-942", tc != NULL); + free_coa_set(tc, act); + free_tfm_unit(tc); +} + +static inline int +tfm_cluster_is_uptodate (tfm_cluster_t * tc) +{ + assert("edward-943", tc != NULL); + assert("edward-944", tc->uptodate == 0 || tc->uptodate == 1); + return (tc->uptodate == 1); +} + +static inline void +tfm_cluster_set_uptodate (tfm_cluster_t * tc) +{ + assert("edward-945", tc != NULL); + assert("edward-946", tc->uptodate == 0 || tc->uptodate == 1); + tc->uptodate = 1; + return; +} + +static inline void +tfm_cluster_clr_uptodate (tfm_cluster_t * tc) +{ + assert("edward-947", tc != NULL); + assert("edward-948", tc->uptodate == 0 || tc->uptodate == 1); + tc->uptodate = 0; + return; +} + +static inline int +tfm_stream_is_set(tfm_cluster_t * tc, tfm_stream_id id) +{ + return (tfm_stream(tc, id) && + tfm_stream_data(tc, id) && + tfm_stream_size(tc, id)); +} + +static inline int +tfm_cluster_is_set(tfm_cluster_t * tc) +{ + int i; + for (i = 0; i < LAST_STREAM; i++) + if (!tfm_stream_is_set(tc, i)) + return 0; + return 1; +} + +static inline void +alternate_streams(tfm_cluster_t * tc) +{ + tfm_stream_t * tmp = tfm_stream(tc, INPUT_STREAM); + + set_tfm_stream(tc, 
INPUT_STREAM, tfm_stream(tc, OUTPUT_STREAM)); + set_tfm_stream(tc, OUTPUT_STREAM, tmp); +} + +/* a kind of data that we can write to the window */ +typedef enum { + DATA_WINDOW, /* the data we copy from user space */ + HOLE_WINDOW /* zeroes if we write hole */ +} window_stat; + +/* Sliding window of cluster size which should be set to the appropriate position + (defined by cluster index) in a file before page cluster modification by + file_write. Then we translate file size, offset to write from, number of + bytes to write, etc. to the following configuration needed to estimate + number of pages to read before write, etc. +*/ +typedef struct reiser4_slide { + unsigned off; /* offset we start to write/truncate from */ + unsigned count; /* number of bytes (zeroes) to write/truncate */ + unsigned delta; /* number of bytes to append to the hole */ + window_stat stat; /* a kind of data to write to the window */ +} reiser4_slide_t; + +/* The following is a set of possible disk cluster states */ +typedef enum { + INVAL_DISK_CLUSTER,/* unknown state */ + PREP_DISK_CLUSTER, /* disk cluster got converted by flush + at least 1 time */ + UNPR_DISK_CLUSTER, /* disk cluster just created and should be + converted by flush */ + FAKE_DISK_CLUSTER /* disk cluster exists neither in memory + nor on disk */ +} disk_cluster_stat; + +/* + While implementing all transforms (from page to disk cluster, and back) + reiser4 cluster manager fills the following structure encapsulating pointers + to all the clusters for the same index including the sliding window above +*/ +typedef struct reiser4_cluster{ + tfm_cluster_t tc; /* transform cluster */ + int nr_pages; /* number of pages */ + struct page ** pages; /* page cluster */ + page_cluster_op op; /* page cluster operation */ + struct file * file; + hint_t * hint; /* disk cluster item for traversal */ + disk_cluster_stat dstat; /* state of the current disk cluster */ + cloff_t index; /* offset in the units of cluster size */ + 
reiser4_slide_t * win; /* sliding window of cluster size */ + int reserved; /* this indicates that space for disk + cluster modification is reserved */ +#if REISER4_DEBUG + reiser4_context * ctx; + int reserved_prepped; + int reserved_unprepped; +#endif +} reiser4_cluster_t; + +static inline void +reset_cluster_pgset(reiser4_cluster_t * clust, int nrpages) +{ + assert("edward-1057", clust->pages != NULL); + memset(clust->pages, 0, sizeof(*clust->pages) * nrpages); +} + +static inline int +alloc_cluster_pgset(reiser4_cluster_t * clust, int nrpages) +{ + assert("edward-949", clust != NULL); + assert("edward-950", nrpages != 0 && nrpages <= MAX_CLUSTER_NRPAGES); + + clust->pages = reiser4_kmalloc(sizeof(*clust->pages) * nrpages, GFP_KERNEL); + if (!clust->pages) + return RETERR(-ENOMEM); + reset_cluster_pgset(clust, nrpages); + return 0; +} + +static inline void +free_cluster_pgset(reiser4_cluster_t * clust) +{ + assert("edward-951", clust->pages != NULL); + reiser4_kfree(clust->pages); +} + +static inline void +put_cluster_handle(reiser4_cluster_t * clust, tfm_action act) +{ + assert("edward-435", clust != NULL); + + put_tfm_cluster(&clust->tc, act); + if (clust->pages) + free_cluster_pgset(clust); + memset(clust, 0, sizeof *clust); +} + +/* security attributes supposed to be stored on disk + are loaded by stat-data methods (see plugin/item/static_stat.c */ +typedef struct crypto_stat { + __u8 * keyid; /* pointer to a fingerprint */ + __u16 keysize; /* key size, bits */ +} crypto_stat_t; + +/* cryptcompress specific part of reiser4_inode */ +typedef struct cryptcompress_info { + struct rw_semaphore lock; + struct crypto_tfm *tfm[LAST_TFM]; + __u32 * expkey; +} cryptcompress_info_t; + +cryptcompress_info_t *cryptcompress_inode_data(const struct inode * inode); +int equal_to_rdk(znode *, const reiser4_key *); +int goto_right_neighbor(coord_t *, lock_handle *); +int load_file_hint(struct file *, hint_t *); +void save_file_hint(struct file *, const hint_t *); + +/* 
declarations of functions implementing methods of cryptcompress object plugin */ +void init_inode_data_cryptcompress(struct inode *inode, reiser4_object_create_data *crd, int create); +int create_cryptcompress(struct inode *, struct inode *, reiser4_object_create_data *); +int open_cryptcompress(struct inode * inode, struct file * file); +int truncate_cryptcompress(struct inode *, loff_t size); +int readpage_cryptcompress(void *, struct page *); +int capture_cryptcompress(struct inode *inode, struct writeback_control *wbc); +ssize_t read_cryptcompress(struct file * file, char *buf, size_t size, loff_t * off); +ssize_t write_cryptcompress(struct file *, const char *buf, size_t size, loff_t *off); +int release_cryptcompress(struct inode *inode, struct file *); +int mmap_cryptcompress(struct file *, struct vm_area_struct *vma); +int get_block_cryptcompress(struct inode *, sector_t block, struct buffer_head *bh_result, int create); +int flow_by_inode_cryptcompress(struct inode *, char *buf, int user, loff_t, loff_t, rw_op, flow_t *); +int key_by_inode_cryptcompress(struct inode *, loff_t off, reiser4_key *); +int delete_cryptcompress(struct inode *); +int owns_item_cryptcompress(const struct inode *, const coord_t *); +int setattr_cryptcompress(struct inode *, struct iattr *); +void readpages_cryptcompress(struct file *, struct address_space *, struct list_head *pages); +void init_inode_data_cryptcompress(struct inode *, reiser4_object_create_data *, int create); +int pre_delete_cryptcompress(struct inode *); +int cut_tree_worker_cryptcompress(tap_t * tap, const reiser4_key * from_key, + const reiser4_key * to_key, reiser4_key * smallest_removed, + struct inode * object, int, int*); +void hint_init_zero(hint_t *); +void destroy_inode_cryptcompress(struct inode * inode); +int crc_inode_ok(struct inode * inode); + +static inline struct crypto_tfm * +inode_get_tfm (struct inode * inode, reiser4_tfm tfm) +{ + return cryptcompress_inode_data(inode)->tfm[tfm]; +} + +static 
inline struct crypto_tfm * +inode_get_crypto (struct inode * inode) +{ + return (inode_get_tfm(inode, CRYPTO_TFM)); +} + +static inline struct crypto_tfm * +inode_get_digest (struct inode * inode) +{ + return (inode_get_tfm(inode, DIGEST_TFM)); +} + +static inline unsigned int +crypto_blocksize(struct inode * inode) +{ + assert("edward-758", inode_get_tfm(inode, CRYPTO_TFM) != NULL); + return crypto_tfm_alg_blocksize(inode_get_tfm(inode, CRYPTO_TFM)); +} + +#define REGISTER_NONE_ALG(ALG, TFM) \ +static int alloc_none_ ## ALG (struct inode * inode) \ +{ \ + cryptcompress_info_t * info; \ + assert("edward-760", inode != NULL); \ + \ + info = cryptcompress_inode_data(inode); \ + \ + \ + cryptcompress_inode_data(inode)->tfm[TFM ## _TFM] = NULL; \ + return 0; \ + \ +} \ +static void free_none_ ## ALG (struct inode * inode) \ +{ \ + cryptcompress_info_t * info; \ + assert("edward-761", inode != NULL); \ + \ + info = cryptcompress_inode_data(inode); \ + \ + assert("edward-762", info != NULL); \ + \ + info->tfm[TFM ## _TFM] = NULL; \ +} + +#endif /* __FS_REISER4_CRYPTCOMPRESS_H__ */ + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/digest.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/digest.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,32 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* reiser4 digest transform plugin (is used by cryptcompress object plugin) */ +/* EDWARD-FIXME-HANS: and it does what? a digest is a what? 
*/ +#include "../debug.h" +#include "plugin_header.h" +#include "plugin.h" +#include "cryptcompress.h" + +#include + +#define NONE_DIGEST_SIZE 0 + +REGISTER_NONE_ALG(digest, DIGEST) + +/* digest plugins */ +digest_plugin digest_plugins[LAST_DIGEST_ID] = { + [NONE_DIGEST_ID] = { + .h = { + .type_id = REISER4_DIGEST_PLUGIN_TYPE, + .id = NONE_DIGEST_ID, + .pops = NULL, + .label = "none", + .desc = "trivial digest", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .dsize = NONE_DIGEST_SIZE, + .alloc = alloc_none_digest, + .free = free_none_digest, + } +}; + diff -puN /dev/null fs/reiser4/plugin/dir/dir.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/dir/dir.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1885 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Methods of directory plugin. */ + +#include "../../forward.h" +#include "../../debug.h" +#include "../../spin_macros.h" +#include "../plugin_header.h" +#include "../../key.h" +#include "../../kassign.h" +#include "../../coord.h" +#include "../../type_safe_list.h" +#include "../plugin.h" +#include "dir.h" +#include "../item/item.h" +#include "../security/perm.h" +#include "../../jnode.h" +#include "../../znode.h" +#include "../../tap.h" +#include "../../vfs_ops.h" +#include "../../inode.h" +#include "../../super.h" +#include "../../safe_link.h" +#include "../object.h" + +#include "hashed_dir.h" +#include "pseudo_dir.h" + +#include /* for __u?? */ +#include /* for struct file */ +#include +#include /* for struct dentry */ + +/* helper function. Standards require that for many file-system operations + on success ctime and mtime of parent directory is to be updated. */ +reiser4_internal int +reiser4_update_dir(struct inode *dir) +{ + assert("nikita-2525", dir != NULL); + + dir->i_ctime = dir->i_mtime = CURRENT_TIME; + return reiser4_mark_inode_dirty(dir); +} + +/* estimate disk space necessary to add a link from @parent to @object. 
*/ +static reiser4_block_nr common_estimate_link( + struct inode *parent /* parent directory */, + struct inode *object /* object to which new link is being created */) +{ + reiser4_block_nr res = 0; + file_plugin *fplug; + dir_plugin *dplug; + + assert("vpf-317", object != NULL); + assert("vpf-318", parent != NULL ); + + fplug = inode_file_plugin(object); + dplug = inode_dir_plugin(parent); + /* VS-FIXME-HANS: why do we do fplug->estimate.update(object) twice instead of multiplying by 2? */ + /* reiser4_add_nlink(object) */ + res += fplug->estimate.update(object); + /* add_entry(parent) */ + res += dplug->estimate.add_entry(parent); + /* reiser4_del_nlink(object) */ + res += fplug->estimate.update(object); + /* update_dir(parent) */ + res += inode_file_plugin(parent)->estimate.update(parent); + /* safe-link */ + res += estimate_one_item_removal(tree_by_inode(object)); + + return res; +} + +/* add link from @parent directory to @existing object. + + . get plugins + . check permissions + . check that "existing" can hold yet another link + . start transaction + . add link to "existing" + . add entry to "parent" + . 
if last step fails, remove link from "existing" + +*/ +static int +link_common(struct inode *parent /* parent directory */ , + struct dentry *existing /* dentry of object to which + * new link is being + * created */ , + struct dentry *newname /* new name */ ) +{ + int result; + struct inode *object; + dir_plugin *parent_dplug; + reiser4_dir_entry_desc entry; + reiser4_object_create_data data; + reiser4_block_nr reserve; + + assert("nikita-1431", existing != NULL); + assert("nikita-1432", parent != NULL); + assert("nikita-1433", newname != NULL); + + object = existing->d_inode; + assert("nikita-1434", object != NULL); + + /* check for race with create_object() */ + if (inode_get_flag(object, REISER4_IMMUTABLE)) + return RETERR(-E_REPEAT); + + /* links to directories are not allowed if file-system + logical name-space should be ADG */ + if (S_ISDIR(object->i_mode) && reiser4_is_set(parent->i_sb, REISER4_ADG)) + return RETERR(-EISDIR); + + /* check permissions */ + result = perm_chk(parent, link, existing, parent, newname); + if (result != 0) + return result; + + parent_dplug = inode_dir_plugin(parent); + + memset(&entry, 0, sizeof entry); + entry.obj = object; + + data.mode = object->i_mode; + data.id = inode_file_plugin(object)->h.id; + + reserve = common_estimate_link(parent, existing->d_inode); + if ((__s64)reserve < 0) + return reserve; + + if (reiser4_grab_space(reserve, BA_CAN_COMMIT)) + return RETERR(-ENOSPC); + + /* + * Subtle race handling: sys_link() doesn't take i_sem on @parent. It + * means that link(2) can race against unlink(2) or rename(2), and + * inode is dead (->i_nlink == 0) when reiser4_link() is entered. + * + * For such inode we have to undo special processing done in + * reiser4_unlink() viz. creation of safe-link. 
+ */ + if (unlikely(inode_file_plugin(object)->not_linked(object))) { + result = safe_link_del(object, SAFE_UNLINK); + if (result != 0) + return result; + } + + result = reiser4_add_nlink(object, parent, 1); + if (result == 0) { + /* add entry to the parent */ + result = parent_dplug->add_entry(parent, newname, &data, &entry); + if (result != 0) { + /* failure to add entry to the parent, remove + link from "existing" */ + reiser4_del_nlink(object, parent, 1); + /* now, if this fails, we have a file with too + big nlink---space leak, much better than + directory entry pointing to nowhere */ + /* may be it should be recorded somewhere, but + if addition of link to parent and update of + object's stat data both failed, chances are + that something is going really wrong */ + } + } + if (result == 0) { + atomic_inc(&object->i_count); + /* Upon successful completion, link() shall mark for update + the st_ctime field of the file. Also, the st_ctime and + st_mtime fields of the directory that contains the new + entry shall be marked for update. --SUS + */ + result = reiser4_update_dir(parent); + } + return result; +} + +/* estimate disk space necessary to remove a link between @parent and + * @object. 
*/ +static reiser4_block_nr common_estimate_unlink ( + struct inode *parent /* parent directory */, + struct inode *object /* object whose link is being removed */) +{ + reiser4_block_nr res = 0; + file_plugin *fplug; + dir_plugin *dplug; + + assert("vpf-317", object != NULL); + assert("vpf-318", parent != NULL ); + + fplug = inode_file_plugin(object); + dplug = inode_dir_plugin(parent); + + /* rem_entry(parent) */ + res += dplug->estimate.rem_entry(parent); + /* reiser4_del_nlink(object) */ + res += fplug->estimate.update(object); + /* update_dir(parent) */ + res += inode_file_plugin(parent)->estimate.update(parent); + /* fplug->unlink */ + res += fplug->estimate.unlink(object, parent); + /* safe-link */ + res += estimate_one_insert_item(tree_by_inode(object)); + + return res; +} + +/* grab space for unlink. */ +static int +unlink_check_and_grab(struct inode *parent, struct dentry *victim) +{ + file_plugin *fplug; + struct inode *child; + int result; + + result = 0; + child = victim->d_inode; + fplug = inode_file_plugin(child); + + /* check for race with create_object() */ + if (inode_get_flag(child, REISER4_IMMUTABLE)) + return RETERR(-E_REPEAT); + /* object being deleted should have stat data */ + assert("vs-949", !inode_get_flag(child, REISER4_NO_SD)); + + /* check permissions */ + result = perm_chk(parent, unlink, parent, victim); + if (result != 0) + return result; + + /* ask object plugin */ + if (fplug->can_rem_link != NULL && !fplug->can_rem_link(child)) + return RETERR(-ENOTEMPTY); + + result = (int)common_estimate_unlink(parent, child); + if (result < 0) + return result; + + return reiser4_grab_reserved(child->i_sb, result, BA_CAN_COMMIT); +} + +/* remove link from @parent directory to @victim object. + + . get plugins + . find entry in @parent + . check permissions + . decrement nlink on @victim + . 
if nlink drops to 0, delete object +*/ +static int +unlink_common(struct inode *parent /* parent object */ , + struct dentry *victim /* name being removed from @parent */) +{ + int result; + struct inode *object; + file_plugin *fplug; + + object = victim->d_inode; + fplug = inode_file_plugin(object); + assert("nikita-2882", fplug->detach != NULL); + + result = unlink_check_and_grab(parent, victim); + if (result != 0) + return result; + + result = fplug->detach(object, parent); + if (result == 0) { + dir_plugin *parent_dplug; + reiser4_dir_entry_desc entry; + + parent_dplug = inode_dir_plugin(parent); + memset(&entry, 0, sizeof entry); + + /* first, delete directory entry */ + result = parent_dplug->rem_entry(parent, victim, &entry); + if (result == 0) { + /* + * if name was removed successfully, we _have_ to + * return 0 from this function, because upper level + * caller (vfs_{rmdir,unlink}) expect this. + */ + /* now that directory entry is removed, update + * stat-data */ + reiser4_del_nlink(object, parent, 1); + /* Upon successful completion, unlink() shall mark for + update the st_ctime and st_mtime fields of the + parent directory. Also, if the file's link count is + not 0, the st_ctime field of the file shall be + marked for update. --SUS */ + reiser4_update_dir(parent); + /* add safe-link for this file */ + if (fplug->not_linked(object)) + safe_link_add(object, SAFE_UNLINK); + } + } + + if (unlikely(result != 0)) { + if (result != -ENOMEM) + warning("nikita-3398", "Cannot unlink %llu (%i)", + (unsigned long long)get_inode_oid(object), + result); + /* if operation failed commit pending inode modifications to + * the stat-data */ + reiser4_update_sd(object); + reiser4_update_sd(parent); + } + + reiser4_release_reserved(object->i_sb); + + /* @object's i_ctime was updated by ->rem_link() method(). 
*/ + + return result; +} + +/* Estimate the maximum number of nodes that will be allocated or changed for: + - inserting an entry in the parent + - updating the SD of parent + - creating the child +*/ +static reiser4_block_nr common_estimate_create_child( + struct inode *parent, /* parent object */ + struct inode *object /* object */) +{ + assert("vpf-309", parent != NULL); + assert("vpf-307", object != NULL); + + return + /* object creation estimation */ + inode_file_plugin(object)->estimate.create(object) + + /* stat data of parent directory estimation */ + inode_file_plugin(parent)->estimate.update(parent) + + /* adding entry estimation */ + inode_dir_plugin(parent)->estimate.add_entry(parent) + + /* to undo in the case of failure */ + inode_dir_plugin(parent)->estimate.rem_entry(parent); +} + +/* Create child in directory. + + . get object's plugin + . get fresh inode + . initialize inode + . add object's stat-data + . initialize object's directory + . add entry to the parent + . instantiate dentry + +*/ +/* ->create_child method of directory plugin */ +static int +create_child_common(reiser4_object_create_data * data /* parameters + * of new + * object */, + struct inode ** retobj) +{ + int result; + + struct dentry *dentry; /* new name */ + struct inode *parent; /* parent object */ + + dir_plugin *par_dir; /* directory plugin on the parent */ + dir_plugin *obj_dir; /* directory plugin on the new object */ + file_plugin *obj_plug; /* object plugin on the new object */ + struct inode *object; /* new object */ + reiser4_block_nr reserve; + + reiser4_dir_entry_desc entry; /* new directory entry */ + + assert("nikita-1420", data != NULL); + parent = data->parent; + dentry = data->dentry; + + assert("nikita-1418", parent != NULL); + assert("nikita-1419", dentry != NULL); + par_dir = inode_dir_plugin(parent); + /* check permissions */ + result = perm_chk(parent, create, parent, dentry, data); + if (result != 0) + return result; + + /* check, that name is acceptable for 
parent */ + if (par_dir->is_name_acceptable && + !par_dir->is_name_acceptable(parent, + dentry->d_name.name, + (int) dentry->d_name.len)) + return RETERR(-ENAMETOOLONG); + + result = 0; + obj_plug = file_plugin_by_id((int) data->id); + if (obj_plug == NULL) { + warning("nikita-430", "Cannot find plugin %i", data->id); + return RETERR(-ENOENT); + } + object = new_inode(parent->i_sb); + if (object == NULL) + return RETERR(-ENOMEM); + /* we'll update i_nlink below */ + object->i_nlink = 0; + /* new_inode() initializes i_ino to "arbitrary" value. Reset it to 0, + * to simplify error handling: if some error occurs before i_ino is + * initialized with oid, i_ino should already be set to some + * distinguished value. */ + object->i_ino = 0; + + /* So that on error iput will be called. */ + *retobj = object; + + if (DQUOT_ALLOC_INODE(object)) { + DQUOT_DROP(object); + object->i_flags |= S_NOQUOTA; + return RETERR(-EDQUOT); + } + + memset(&entry, 0, sizeof entry); + entry.obj = object; + + plugin_set_file(&reiser4_inode_data(object)->pset, obj_plug); + result = obj_plug->set_plug_in_inode(object, parent, data); + if (result) { + warning("nikita-431", "Cannot install plugin %i on %llx", + data->id, (unsigned long long)get_inode_oid(object)); + DQUOT_FREE_INODE(object); + object->i_flags |= S_NOQUOTA; + return result; + } + + /* reget plugin after installation */ + obj_plug = inode_file_plugin(object); + + if (obj_plug->create == NULL) { + DQUOT_FREE_INODE(object); + object->i_flags |= S_NOQUOTA; + return RETERR(-EPERM); + } + + /* if any of hash, tail, sd or permission plugins for newly created + object are not set yet set them here inheriting them from parent + directory + */ + assert("nikita-2070", obj_plug->adjust_to_parent != NULL); + result = obj_plug->adjust_to_parent(object, + parent, + object->i_sb->s_root->d_inode); + if (result != 0) { + warning("nikita-432", "Cannot inherit from %llx to %llx", + (unsigned long long)get_inode_oid(parent), + (unsigned long 
long)get_inode_oid(object)); + DQUOT_FREE_INODE(object); + object->i_flags |= S_NOQUOTA; + return result; + } + + /* call file plugin's method to initialize plugin specific part of + * inode */ + if (obj_plug->init_inode_data) + obj_plug->init_inode_data(object, data, 1/*create*/); + + /* obtain directory plugin (if any) for new object. */ + obj_dir = inode_dir_plugin(object); + if (obj_dir != NULL && obj_dir->init == NULL) { + DQUOT_FREE_INODE(object); + object->i_flags |= S_NOQUOTA; + return RETERR(-EPERM); + } + + reiser4_inode_data(object)->locality_id = get_inode_oid(parent); + + reserve = common_estimate_create_child(parent, object); + if (reiser4_grab_space(reserve, BA_CAN_COMMIT)) { + DQUOT_FREE_INODE(object); + object->i_flags |= S_NOQUOTA; + return RETERR(-ENOSPC); + } + + /* mark inode `immutable'. We disable changes to the file being + created until valid directory entry for it is inserted. Otherwise, + if file were expanded and insertion of directory entry fails, we + have to remove file, but we only allotted enough space in + transaction to remove _empty_ file. 3.x code used to remove stat + data in different transaction thus possibly leaking disk space on + crash. This all only matters if it's possible to access file + without name, for example, by inode number + */ + inode_set_flag(object, REISER4_IMMUTABLE); + + /* create empty object, this includes allocation of new objectid. For + directories this implies creation of dot and dotdot */ + assert("nikita-2265", inode_get_flag(object, REISER4_NO_SD)); + + /* mark inode as `loaded'. From this point onward + reiser4_delete_inode() will try to remove its stat-data. 
*/ + inode_set_flag(object, REISER4_LOADED); + + result = obj_plug->create(object, parent, data); + if (result != 0) { + inode_clr_flag(object, REISER4_IMMUTABLE); + if (result != -ENAMETOOLONG && result != -ENOMEM) + warning("nikita-2219", + "Failed to create sd for %llu", + (unsigned long long)get_inode_oid(object)); + DQUOT_FREE_INODE(object); + object->i_flags |= S_NOQUOTA; + return result; + } + + if (obj_dir != NULL) + result = obj_dir->init(object, parent, data); + if (result == 0) { + assert("nikita-434", !inode_get_flag(object, REISER4_NO_SD)); + /* insert inode into VFS hash table */ + insert_inode_hash(object); + /* create entry */ + result = par_dir->add_entry(parent, dentry, data, &entry); + if (result == 0) { + result = reiser4_add_nlink(object, parent, 0); + /* If O_CREAT is set and the file did not previously + exist, upon successful completion, open() shall + mark for update the st_atime, st_ctime, and + st_mtime fields of the file and the st_ctime and + st_mtime fields of the parent directory. --SUS + */ + /* @object times are already updated by + reiser4_add_nlink() */ + if (result == 0) + reiser4_update_dir(parent); + if (result != 0) + /* cleanup failure to add nlink */ + par_dir->rem_entry(parent, dentry, &entry); + } + if (result != 0) + /* cleanup failure to add entry */ + obj_plug->detach(object, parent); + } else if (result != -ENOMEM) + warning("nikita-2219", "Failed to initialize dir for %llu: %i", + (unsigned long long)get_inode_oid(object), result); + + /* + * update stat-data, committing all pending modifications to the inode + * fields. 
+ */ + reiser4_update_sd(object); + if (result != 0) { + DQUOT_FREE_INODE(object); + object->i_flags |= S_NOQUOTA; + /* if everything was ok (result == 0), parent stat-data is + * already updated above (update_parent_dir()) */ + reiser4_update_sd(parent); + /* failure to create entry, remove object */ + obj_plug->delete(object); + } + + /* file has name now, clear immutable flag */ + inode_clr_flag(object, REISER4_IMMUTABLE); + + /* on error, iput() will call ->delete_inode(). We should keep track + of the existence of stat-data for this inode and avoid attempt to + remove it in reiser4_delete_inode(). This is accomplished through + REISER4_NO_SD bit in inode.u.reiser4_i.plugin.flags + */ + return result; +} + +/* ->is_name_acceptable() method of directory plugin */ +/* Audited by: green(2002.06.15) */ +reiser4_internal int +is_name_acceptable(const struct inode *inode /* directory to check */ , + const char *name UNUSED_ARG /* name to check */ , + int len /* @name's length */ ) +{ + assert("nikita-733", inode != NULL); + assert("nikita-734", name != NULL); + assert("nikita-735", len > 0); + + return len <= reiser4_max_filename_len(inode); +} + +/* return true, iff @coord points to the valid directory item that is part of + * @inode directory. */ +static int +is_valid_dir_coord(struct inode * inode, coord_t * coord) +{ + return + item_type_by_coord(coord) == DIR_ENTRY_ITEM_TYPE && + inode_file_plugin(inode)->owns_item(inode, coord); +} + +/* true if directory is empty (only contains dot and dotdot) */ +reiser4_internal int +is_dir_empty(const struct inode *dir) +{ + assert("nikita-1976", dir != NULL); + + /* rely on our method to maintain directory i_size being equal to the + number of entries. */ + return dir->i_size <= 2 ? 
0 : RETERR(-ENOTEMPTY); +} + +/* compare two logical positions within the same directory */ +static cmp_t dir_pos_cmp(const dir_pos * p1, const dir_pos * p2) +{ + cmp_t result; + + assert("nikita-2534", p1 != NULL); + assert("nikita-2535", p2 != NULL); + + result = de_id_cmp(&p1->dir_entry_key, &p2->dir_entry_key); + if (result == EQUAL_TO) { + int diff; + + diff = p1->pos - p2->pos; + result = (diff < 0) ? LESS_THAN : (diff ? GREATER_THAN : EQUAL_TO); + } + return result; +} + +/* true, if file descriptor @f is created by NFS server on "demand" to serve + * one file system operation. This means that there may be "detached state" + * for underlying inode. */ +static inline int +file_is_stateless(struct file *f) +{ + return reiser4_get_dentry_fsdata(f->f_dentry)->stateless; +} + +#define CID_SHIFT (20) +#define CID_MASK (0xfffffull) + +/* calculate ->fpos from user-supplied cookie. Normally it is dir->f_pos, but + * in the case of stateless directory operation (readdir-over-nfs), client id + * was encoded in the high bits of cookie and should be masked off. */ +static loff_t +get_dir_fpos(struct file * dir) +{ + if (file_is_stateless(dir)) + return dir->f_pos & CID_MASK; + else + return dir->f_pos; +} + +/* see comment before readdir_common() for overview of why "adjustment" is + * necessary. */ +static void +adjust_dir_pos(struct file * dir, + readdir_pos * readdir_spot, + const dir_pos * mod_point, + int adj) +{ + dir_pos *pos; + + /* + * new directory entry was added (adj == +1) or removed (adj == -1) at + * the @mod_point. Directory file descriptor @dir is doing readdir and + * is currently positioned at @readdir_spot. Latter has to be updated + * to maintain stable readdir. + */ + /* directory is positioned to the beginning. 
*/ + if (readdir_spot->entry_no == 0) + return; + + pos = &readdir_spot->position; + switch (dir_pos_cmp(mod_point, pos)) { + case LESS_THAN: + /* @mod_point is _before_ @readdir_spot, that is, entry was + * added/removed on the left (in key order) of current + * position. */ + /* logical number of directory entry readdir is "looking" at + * changes */ + readdir_spot->entry_no += adj; + assert("nikita-2577", + ergo(dir != NULL, get_dir_fpos(dir) + adj >= 0)); + if (de_id_cmp(&pos->dir_entry_key, + &mod_point->dir_entry_key) == EQUAL_TO) { + assert("nikita-2575", mod_point->pos < pos->pos); + /* + * if entry added/removed has the same key as current + * for readdir, update counter of duplicate keys in + * @readdir_spot. + */ + pos->pos += adj; + } + break; + case GREATER_THAN: + /* directory is modified after @pos: nothing to do. */ + break; + case EQUAL_TO: + /* cannot insert an entry readdir is looking at, because it + already exists. */ + assert("nikita-2576", adj < 0); + /* directory entry to which @pos points is being + removed. + + NOTE-NIKITA: Right thing to do is to update @pos to point + to the next entry. This is complex (we are under spin-lock + for one thing). Just rewind it to the beginning. Next + readdir will have to scan the beginning of + directory. Proper solution is to use semaphore in + spin lock's stead and use rewind_right() here. + + NOTE-NIKITA: now, semaphore is used, so... + */ + memset(readdir_spot, 0, sizeof *readdir_spot); + } +} + +/* scan all file-descriptors for this directory and adjust their positions + respectively. 
*/ +reiser4_internal void +adjust_dir_file(struct inode *dir, const struct dentry * de, int offset, int adj) +{ + reiser4_file_fsdata *scan; + dir_pos mod_point; + + assert("nikita-2536", dir != NULL); + assert("nikita-2538", de != NULL); + assert("nikita-2539", adj != 0); + + build_de_id(dir, &de->d_name, &mod_point.dir_entry_key); + mod_point.pos = offset; + + spin_lock_inode(dir); + + /* + * new entry was added/removed in directory @dir. Scan all file + * descriptors for @dir that are currently involved into @readdir and + * update them. + */ + + for_all_type_safe_list(readdir, get_readdir_list(dir), scan) + adjust_dir_pos(scan->back, &scan->dir.readdir, &mod_point, adj); + + spin_unlock_inode(dir); +} + +/* + * traverse tree to start/continue readdir from the readdir position @pos. + */ +static int +dir_go_to(struct file *dir, readdir_pos * pos, tap_t * tap) +{ + reiser4_key key; + int result; + struct inode *inode; + + assert("nikita-2554", pos != NULL); + + inode = dir->f_dentry->d_inode; + result = inode_dir_plugin(inode)->build_readdir_key(dir, &key); + if (result != 0) + return result; + result = object_lookup(inode, + &key, + tap->coord, + tap->lh, + tap->mode, + FIND_EXACT, + LEAF_LEVEL, + LEAF_LEVEL, + 0, + &tap->ra_info); + if (result == CBK_COORD_FOUND) + result = rewind_right(tap, (int) pos->position.pos); + else { + tap->coord->node = NULL; + done_lh(tap->lh); + result = RETERR(-EIO); + } + return result; +} + +/* + * handling of non-unique keys: calculate at what ordinal position within + * sequence of directory items with identical keys @pos is. 
+ */ +static int +set_pos(struct inode * inode, readdir_pos * pos, tap_t * tap) +{ + int result; + coord_t coord; + lock_handle lh; + tap_t scan; + de_id *did; + reiser4_key de_key; + + coord_init_zero(&coord); + init_lh(&lh); + tap_init(&scan, &coord, &lh, ZNODE_READ_LOCK); + tap_copy(&scan, tap); + tap_load(&scan); + pos->position.pos = 0; + + did = &pos->position.dir_entry_key; + + if (is_valid_dir_coord(inode, scan.coord)) { + + build_de_id_by_key(unit_key_by_coord(scan.coord, &de_key), did); + + while (1) { + + result = go_prev_unit(&scan); + if (result != 0) + break; + + if (!is_valid_dir_coord(inode, scan.coord)) { + result = -EINVAL; + break; + } + + /* get key of directory entry */ + unit_key_by_coord(scan.coord, &de_key); + if (de_id_key_cmp(did, &de_key) != EQUAL_TO) { + /* duplicate-sequence is over */ + break; + } + pos->position.pos ++; + } + } else + result = RETERR(-ENOENT); + tap_relse(&scan); + tap_done(&scan); + return result; +} + + +/* + * "rewind" directory to @offset, i.e., set @pos and @tap correspondingly. 
+ */ +static int +dir_rewind(struct file *dir, readdir_pos * pos, tap_t * tap) +{ + __u64 destination; + __s64 shift; + int result; + struct inode *inode; + loff_t dirpos; + + assert("nikita-2553", dir != NULL); + assert("nikita-2548", pos != NULL); + assert("nikita-2551", tap->coord != NULL); + assert("nikita-2552", tap->lh != NULL); + + dirpos = get_dir_fpos(dir); + shift = dirpos - pos->fpos; + /* this is logical directory entry within @dir which we are rewinding + * to */ + destination = pos->entry_no + shift; + + inode = dir->f_dentry->d_inode; + if (dirpos < 0) + return RETERR(-EINVAL); + else if (destination == 0ll || dirpos == 0) { + /* rewind to the beginning of directory */ + memset(pos, 0, sizeof *pos); + return dir_go_to(dir, pos, tap); + } else if (destination >= inode->i_size) + return RETERR(-ENOENT); + + if (shift < 0) { + /* I am afraid of negative numbers */ + shift = -shift; + /* rewinding to the left */ + if (shift <= (int) pos->position.pos) { + /* destination is within sequence of entries with + duplicate keys. */ + result = dir_go_to(dir, pos, tap); + } else { + shift -= pos->position.pos; + while (1) { + /* repetitions: deadlock is possible when + going to the left. */ + result = dir_go_to(dir, pos, tap); + if (result == 0) { + result = rewind_left(tap, shift); + if (result == -E_DEADLOCK) { + tap_done(tap); + continue; + } + } + break; + } + } + } else { + /* rewinding to the right */ + result = dir_go_to(dir, pos, tap); + if (result == 0) + result = rewind_right(tap, shift); + } + if (result == 0) { + result = set_pos(inode, pos, tap); + if (result == 0) { + /* update pos->position.pos */ + pos->entry_no = destination; + pos->fpos = dirpos; + } + } + return result; +} + +/* + * Function that is called by common_readdir() on each directory entry while + * doing readdir. ->filldir callback may block, so we had to release long term + * lock while calling it. To avoid repeating tree traversal, seal is used. 
If + * seal is broken, we return -E_REPEAT. Node is unlocked in this case. + * + * Whether node is unlocked in case of any other error is undefined. It is + * guaranteed to be still locked if success (0) is returned. + * + * When ->filldir() wants no more, feed_entry() returns 1, and node is + * unlocked. + */ +static int +feed_entry(struct file *f, + readdir_pos * pos, tap_t *tap, filldir_t filldir, void *dirent) +{ + item_plugin *iplug; + char *name; + reiser4_key sd_key; + int result; + char buf[DE_NAME_BUF_LEN]; + char name_buf[32]; + char *local_name; + unsigned file_type; + seal_t seal; + coord_t *coord; + reiser4_key entry_key; + + coord = tap->coord; + iplug = item_plugin_by_coord(coord); + + /* pointer to name within the node */ + name = iplug->s.dir.extract_name(coord, buf); + assert("nikita-1371", name != NULL); + + /* key of object the entry points to */ + if (iplug->s.dir.extract_key(coord, &sd_key) != 0) + return RETERR(-EIO); + + /* we must release longterm znode lock before calling filldir to avoid + deadlock which may happen if filldir causes page fault. So, copy + name to intermediate buffer */ + if (strlen(name) + 1 > sizeof(name_buf)) { + local_name = kmalloc(strlen(name) + 1, GFP_KERNEL); + if (local_name == NULL) + return RETERR(-ENOMEM); + } else + local_name = name_buf; + + strcpy(local_name, name); + file_type = iplug->s.dir.extract_file_type(coord); + + unit_key_by_coord(coord, &entry_key); + seal_init(&seal, coord, &entry_key); + + longterm_unlock_znode(tap->lh); + + /* + * send information about directory entry to the ->filldir() filler + * supplied to us by caller (VFS). + * + * ->filldir is entitled to do weird things. For example, ->filldir + * supplied by knfsd re-enters file system. Make sure no locks are + * held. 
+ */ + assert("nikita-3436", lock_stack_isclean(get_current_lock_stack())); + + result = filldir(dirent, name, (int) strlen(name), + /* offset of this entry */ + f->f_pos, + /* inode number of object bounden by this entry */ + oid_to_uino(get_key_objectid(&sd_key)), + file_type); + if (local_name != name_buf) + kfree(local_name); + if (result < 0) + /* ->filldir() is satisfied. (no space in buffer, IOW) */ + result = 1; + else + result = seal_validate(&seal, coord, &entry_key, + tap->lh, tap->mode, ZNODE_LOCK_HIPRI); + return result; +} + +static void +move_entry(readdir_pos * pos, coord_t * coord) +{ + reiser4_key de_key; + de_id *did; + + /* update @pos */ + ++pos->entry_no; + did = &pos->position.dir_entry_key; + + /* get key of directory entry */ + unit_key_by_coord(coord, &de_key); + + if (de_id_key_cmp(did, &de_key) == EQUAL_TO) + /* we are within sequence of directory entries + with duplicate keys. */ + ++pos->position.pos; + else { + pos->position.pos = 0; + build_de_id_by_key(&de_key, did); + } + ++pos->fpos; +} + +/* + * STATELESS READDIR + * + * readdir support in reiser4 relies on ability to update readdir_pos embedded + * into reiser4_file_fsdata on each directory modification (name insertion and + * removal), see readdir_common() function below. This obviously doesn't work + * when reiser4 is accessed over NFS, because NFS doesn't keep any state + * across client READDIR requests for the same directory. + * + * To address this we maintain a "pool" of detached reiser4_file_fsdata + * (d_cursor). Whenever NFS readdir request comes, we detect this, and try to + * find detached reiser4_file_fsdata corresponding to previous readdir + * request. In other words, additional state is maintained on the + * server. (This is somewhat contrary to the design goals of NFS protocol.) 
+ * + * To efficiently detect when our ->readdir() method is called by NFS server, + * dentry is marked as "stateless" in reiser4_decode_fh() (this is checked by + * file_is_stateless() function). + * + * To find a d_cursor in the pool, we encode client id (cid) in the highest + * bits of NFS readdir cookie: when first readdir request comes to the given + * directory from the given client, cookie is set to 0. This situation is + * detected, global cid_counter is incremented, and stored in highest bits of + * all direntry offsets returned to the client, including last one. As the + * only valid readdir cookie is one obtained as direntry->offset, we are + * guaranteed that next readdir request (continuing current one) will have + * current cid in the highest bits of starting readdir cookie. All d_cursors + * are hashed into per-super-block hash table by (oid, cid) key. + * + * In addition d_cursors are placed into per-super-block radix tree where they + * are keyed by oid alone. This is necessary to efficiently remove them during + * rmdir. + * + * Finally, currently unused d_cursors are linked into a special list. This list + * is used by d_cursor_shrink() to reclaim d_cursors under memory pressure. 
+ * + */ + +TYPE_SAFE_LIST_DECLARE(d_cursor); +TYPE_SAFE_LIST_DECLARE(a_cursor); + +typedef struct { + __u16 cid; + __u64 oid; +} d_cursor_key; + +struct dir_cursor { + int ref; + reiser4_file_fsdata *fsdata; + d_cursor_hash_link hash; + d_cursor_list_link list; + d_cursor_key key; + d_cursor_info *info; + a_cursor_list_link alist; +}; + +static kmem_cache_t *d_cursor_slab; +static struct shrinker *d_cursor_shrinker; +static unsigned long d_cursor_unused = 0; +static spinlock_t d_lock = SPIN_LOCK_UNLOCKED; +static a_cursor_list_head cursor_cache = TYPE_SAFE_LIST_HEAD_INIT(cursor_cache); + +#define D_CURSOR_TABLE_SIZE (256) + +static inline unsigned long +d_cursor_hash(d_cursor_hash_table *table, const d_cursor_key * key) +{ + assert("nikita-3555", IS_POW(D_CURSOR_TABLE_SIZE)); + return (key->oid + key->cid) & (D_CURSOR_TABLE_SIZE - 1); +} + +static inline int +d_cursor_eq(const d_cursor_key * k1, const d_cursor_key * k2) +{ + return k1->cid == k2->cid && k1->oid == k2->oid; +} + +#define KMALLOC(size) kmalloc((size), GFP_KERNEL) +#define KFREE(ptr, size) kfree(ptr) +TYPE_SAFE_HASH_DEFINE(d_cursor, + dir_cursor, + d_cursor_key, + key, + hash, + d_cursor_hash, + d_cursor_eq); +#undef KFREE +#undef KMALLOC + +TYPE_SAFE_LIST_DEFINE(d_cursor, dir_cursor, list); +TYPE_SAFE_LIST_DEFINE(a_cursor, dir_cursor, alist); + +static void kill_cursor(dir_cursor *cursor); + +/* + * shrink d_cursors cache. Scan LRU list of unused cursors, freeing requested + * number. Return number of still freeable cursors. 
+ */ +static int d_cursor_shrink(int nr, unsigned int gfp_mask) +{ + if (nr != 0) { + dir_cursor *scan; + int killed; + + killed = 0; + spin_lock(&d_lock); + while (!a_cursor_list_empty(&cursor_cache)) { + scan = a_cursor_list_front(&cursor_cache); + assert("nikita-3567", scan->ref == 0); + kill_cursor(scan); + ++ killed; + -- nr; + if (nr == 0) + break; + } + spin_unlock(&d_lock); + } + return d_cursor_unused; +} + +/* + * perform global initializations for the d_cursor sub-system. + */ +reiser4_internal int +d_cursor_init(void) +{ + d_cursor_slab = kmem_cache_create("d_cursor", sizeof (dir_cursor), 0, + SLAB_HWCACHE_ALIGN, NULL, NULL); + if (d_cursor_slab == NULL) + return RETERR(-ENOMEM); + else { + /* actually, d_cursors are "priceless", because there is no + * way to recover information stored in them. On the other + * hand, we don't want to consume all kernel memory by + * them. As a compromise, just assign higher "seeks" value to + * d_cursor cache, so that it will be shrunk only if system is + * really tight on memory. */ + d_cursor_shrinker = set_shrinker(DEFAULT_SEEKS << 3, + d_cursor_shrink); + if (d_cursor_shrinker == NULL) + return RETERR(-ENOMEM); + else + return 0; + } +} + +/* + * Dual to d_cursor_init(): release global d_cursor resources. 
+ */ +reiser4_internal void +d_cursor_done(void) +{ + if (d_cursor_shrinker != NULL) { + remove_shrinker(d_cursor_shrinker); + d_cursor_shrinker = NULL; + } + if (d_cursor_slab != NULL) { + kmem_cache_destroy(d_cursor_slab); + d_cursor_slab = NULL; + } +} + +/* + * initialize per-super-block d_cursor resources + */ +reiser4_internal int +d_cursor_init_at(struct super_block *s) +{ + d_cursor_info *p; + + p = &get_super_private(s)->d_info; + + INIT_RADIX_TREE(&p->tree, GFP_KERNEL); + return d_cursor_hash_init(&p->table, D_CURSOR_TABLE_SIZE); +} + +/* + * Dual to d_cursor_init_at: release per-super-block d_cursor resources + */ +reiser4_internal void +d_cursor_done_at(struct super_block *s) +{ + d_cursor_hash_done(&get_super_private(s)->d_info.table); +} + +/* + * return d_cursor data for the file system @inode is in. + */ +static inline d_cursor_info * d_info(struct inode *inode) +{ + return &get_super_private(inode->i_sb)->d_info; +} + +/* + * lookup d_cursor in the per-super-block radix tree. + */ +static inline dir_cursor *lookup(d_cursor_info *info, unsigned long index) +{ + return (dir_cursor *)radix_tree_lookup(&info->tree, index); +} + +/* + * attach @cursor to the radix tree. There may be multiple cursors for the + * same oid, they are chained into circular list. + */ +static void bind_cursor(dir_cursor *cursor, unsigned long index) +{ + dir_cursor *head; + + head = lookup(cursor->info, index); + if (head == NULL) { + /* this is the first cursor for this index */ + d_cursor_list_clean(cursor); + radix_tree_insert(&cursor->info->tree, index, cursor); + } else { + /* some cursor already exists. 
Chain ours */ + d_cursor_list_insert_after(head, cursor); + } +} + +/* + * remove @cursor from indices and free it + */ +static void +kill_cursor(dir_cursor *cursor) +{ + unsigned long index; + + assert("nikita-3566", cursor->ref == 0); + assert("nikita-3572", cursor->fsdata != NULL); + + index = (unsigned long)cursor->key.oid; + readdir_list_remove_clean(cursor->fsdata); + reiser4_free_fsdata(cursor->fsdata); + cursor->fsdata = NULL; + + if (d_cursor_list_is_clean(cursor)) + /* this is last cursor for a file. Kill radix-tree entry */ + radix_tree_delete(&cursor->info->tree, index); + else { + void **slot; + + /* + * there are other cursors for the same oid. + */ + + /* + * if radix tree points to the cursor being removed, re-target + * radix tree slot to the next cursor in the circular list of + * all cursors for this oid (the list is non-empty, as was + * checked above). + */ + slot = radix_tree_lookup_slot(&cursor->info->tree, index); + assert("nikita-3571", *slot != NULL); + if (*slot == cursor) + *slot = d_cursor_list_next(cursor); + /* remove cursor from circular list */ + d_cursor_list_remove_clean(cursor); + } + /* remove cursor from the list of unused cursors */ + a_cursor_list_remove_clean(cursor); + /* remove cursor from the hash table */ + d_cursor_hash_remove(&cursor->info->table, cursor); + /* and free it */ + kmem_cache_free(d_cursor_slab, cursor); + -- d_cursor_unused; +} + +/* possible actions that can be performed on all cursors for the given file */ +enum cursor_action { + /* load all detached state: this is called when stat-data is loaded + * from the disk to recover information about all pending readdirs */ + CURSOR_LOAD, + /* detach all state from inode, leaving it in the cache. This is + * called when inode is removed from the memory by memory pressure */ + CURSOR_DISPOSE, + /* detach cursors from the inode, and free them. This is called when + * inode is destroyed. 
*/ + CURSOR_KILL +}; + +static void +process_cursors(struct inode *inode, enum cursor_action act) +{ + oid_t oid; + dir_cursor *start; + readdir_list_head *head; + reiser4_context ctx; + d_cursor_info *info; + + /* this can be called by + * + * kswapd->...->prune_icache->..reiser4_destroy_inode + * + * without reiser4_context + */ + init_context(&ctx, inode->i_sb); + + assert("nikita-3558", inode != NULL); + + info = d_info(inode); + oid = get_inode_oid(inode); + spin_lock_inode(inode); + head = get_readdir_list(inode); + spin_lock(&d_lock); + /* find any cursor for this oid: reference to it is hanging off the + * radix tree */ + start = lookup(info, (unsigned long)oid); + if (start != NULL) { + dir_cursor *scan; + reiser4_file_fsdata *fsdata; + + /* process circular list of cursors for this oid */ + scan = start; + do { + dir_cursor *next; + + next = d_cursor_list_next(scan); + fsdata = scan->fsdata; + assert("nikita-3557", fsdata != NULL); + if (scan->key.oid == oid) { + switch (act) { + case CURSOR_DISPOSE: + readdir_list_remove_clean(fsdata); + break; + case CURSOR_LOAD: + readdir_list_push_front(head, fsdata); + break; + case CURSOR_KILL: + kill_cursor(scan); + break; + } + } + if (scan == next) + /* last cursor was just killed */ + break; + scan = next; + } while (scan != start); + } + spin_unlock(&d_lock); + /* check that we killed 'em all */ + assert("nikita-3568", ergo(act == CURSOR_KILL, + readdir_list_empty(get_readdir_list(inode)))); + assert("nikita-3569", ergo(act == CURSOR_KILL, + lookup(info, oid) == NULL)); + spin_unlock_inode(inode); + reiser4_exit_context(&ctx); +} + +/* detach all cursors from inode. This is called when inode is removed from + * the memory by memory pressure */ +reiser4_internal void dispose_cursors(struct inode *inode) +{ + process_cursors(inode, CURSOR_DISPOSE); +} + +/* attach all detached cursors to the inode. 
This is done when inode is loaded + * into memory */ +reiser4_internal void load_cursors(struct inode *inode) +{ + process_cursors(inode, CURSOR_LOAD); +} + +/* free all cursors for this inode. This is called when inode is destroyed. */ +reiser4_internal void kill_cursors(struct inode *inode) +{ + process_cursors(inode, CURSOR_KILL); +} + +/* global counter used to generate "client ids". These ids are encoded into + * high bits of fpos. */ +static __u32 cid_counter = 0; + +/* + * detach fsdata (if detachable) from file descriptor, and put cursor on the + * "unused" list. Called when file descriptor is no longer in active use. + */ +static void +clean_fsdata(struct file *f) +{ + dir_cursor *cursor; + reiser4_file_fsdata *fsdata; + + assert("nikita-3570", file_is_stateless(f)); + + fsdata = (reiser4_file_fsdata *)f->private_data; + if (fsdata != NULL) { + cursor = fsdata->cursor; + if (cursor != NULL) { + spin_lock(&d_lock); + -- cursor->ref; + if (cursor->ref == 0) { + a_cursor_list_push_back(&cursor_cache, cursor); + ++ d_cursor_unused; + } + spin_unlock(&d_lock); + f->private_data = NULL; + } + } +} + +/* add detachable readdir state to the @f */ +static int +insert_cursor(dir_cursor *cursor, struct file *f, struct inode *inode) +{ + int result; + reiser4_file_fsdata *fsdata; + + memset(cursor, 0, sizeof *cursor); + + /* this is either first call to readdir, or rewind. Anyway, create new + * cursor. */ + fsdata = create_fsdata(NULL, GFP_KERNEL); + if (fsdata != NULL) { + result = radix_tree_preload(GFP_KERNEL); + if (result == 0) { + d_cursor_info *info; + oid_t oid; + + info = d_info(inode); + oid = get_inode_oid(inode); + /* cid occupies higher 12 bits of f->f_pos. 
Don't + * allow it to become negative: this confuses + * nfsd_readdir() */ + cursor->key.cid = (++ cid_counter) & 0x7ff; + cursor->key.oid = oid; + cursor->fsdata = fsdata; + cursor->info = info; + cursor->ref = 1; + spin_lock_inode(inode); + /* install cursor as @f's private_data, discarding old + * one if necessary */ + clean_fsdata(f); + reiser4_free_file_fsdata(f); + f->private_data = fsdata; + fsdata->cursor = cursor; + spin_unlock_inode(inode); + spin_lock(&d_lock); + /* insert cursor into hash table */ + d_cursor_hash_insert(&info->table, cursor); + /* and chain it into radix-tree */ + bind_cursor(cursor, (unsigned long)oid); + spin_unlock(&d_lock); + radix_tree_preload_end(); + f->f_pos = ((__u64)cursor->key.cid) << CID_SHIFT; + } + } else + result = RETERR(-ENOMEM); + return result; +} + +/* find or create cursor for readdir-over-nfs */ +static int +try_to_attach_fsdata(struct file *f, struct inode *inode) +{ + loff_t pos; + int result; + dir_cursor *cursor; + + /* + * we are serialized by inode->i_sem + */ + + if (!file_is_stateless(f)) + return 0; + + pos = f->f_pos; + result = 0; + if (pos == 0) { + /* + * first call to readdir (or rewind to the beginning of + * directory) + */ + cursor = kmem_cache_alloc(d_cursor_slab, GFP_KERNEL); + if (cursor != NULL) + result = insert_cursor(cursor, f, inode); + else + result = RETERR(-ENOMEM); + } else { + /* try to find existing cursor */ + d_cursor_key key; + + key.cid = pos >> CID_SHIFT; + key.oid = get_inode_oid(inode); + spin_lock(&d_lock); + cursor = d_cursor_hash_find(&d_info(inode)->table, &key); + if (cursor != NULL) { + /* cursor was found */ + if (cursor->ref == 0) { + /* move it from unused list */ + a_cursor_list_remove_clean(cursor); + -- d_cursor_unused; + } + ++ cursor->ref; + } + spin_unlock(&d_lock); + if (cursor != NULL) { + spin_lock_inode(inode); + assert("nikita-3556", cursor->fsdata->back == NULL); + clean_fsdata(f); + reiser4_free_file_fsdata(f); + f->private_data = cursor->fsdata; + 
spin_unlock_inode(inode); + } + } + return result; +} + +/* detach fsdata, if necessary */ +static void +detach_fsdata(struct file *f) +{ + struct inode *inode; + + if (!file_is_stateless(f)) + return; + + inode = f->f_dentry->d_inode; + spin_lock_inode(inode); + clean_fsdata(f); + spin_unlock_inode(inode); +} + +/* + * prepare for readdir. + */ +static int +dir_readdir_init(struct file *f, tap_t * tap, readdir_pos ** pos) +{ + struct inode *inode; + reiser4_file_fsdata *fsdata; + int result; + + assert("nikita-1359", f != NULL); + inode = f->f_dentry->d_inode; + assert("nikita-1360", inode != NULL); + + if (!S_ISDIR(inode->i_mode)) + return RETERR(-ENOTDIR); + + /* try to find detached readdir state */ + result = try_to_attach_fsdata(f, inode); + if (result != 0) + return result; + + fsdata = reiser4_get_file_fsdata(f); + assert("nikita-2571", fsdata != NULL); + if (IS_ERR(fsdata)) + return PTR_ERR(fsdata); + + /* add file descriptor to the readdir list hanging off directory + * inode. This list is used to scan "readdirs-in-progress" while + * inserting or removing names in the directory. */ + spin_lock_inode(inode); + if (readdir_list_is_clean(fsdata)) + readdir_list_push_front(get_readdir_list(inode), fsdata); + *pos = &fsdata->dir.readdir; + spin_unlock_inode(inode); + + /* move @tap to the current position */ + return dir_rewind(f, *pos, tap); +} + +/* + * ->readdir method of directory plugin + * + * readdir problems: + * + * Traditional UNIX API for scanning through directory + * (readdir/seekdir/telldir/opendir/closedir/rewinddir/getdents) is based + * on the assumption that directory is structured very much like regular + * file, in particular, it is implied that each name within given + * directory (directory entry) can be uniquely identified by scalar offset + * and that such offset is stable across the life-time of the name it + * identifies. + * + * This is manifestly not so for reiser4.
In reiser4 the only stable + * unique identifier for the directory entry is its key, which doesn't fit + * into seekdir/telldir API. + * + * solution: + * + * Within each file descriptor participating in readdir-ing of directory + * plugin/dir/dir.h:readdir_pos is maintained. This structure keeps track + * of the "current" directory entry that file descriptor looks at. It + * contains a key of directory entry (plus some additional info to deal + * with non-unique keys that we won't dwell on here) and a logical + * position of this directory entry starting from the beginning of the + * directory, that is, the ordinal number of this entry in the readdir order. + * + * Obviously this logical position is not stable in the face of directory + * modifications. To work around this, on each addition or removal of + * directory entry all file descriptors for directory inode are scanned + * and their readdir_pos are updated accordingly (adjust_dir_pos()). + * + */ +static int +readdir_common(struct file *f /* directory file being read */ , + void *dirent /* opaque data passed to us by VFS */ , + filldir_t filld /* filler function passed to us + * by VFS */ ) +{ + int result; + struct inode *inode; + coord_t coord; + lock_handle lh; + tap_t tap; + readdir_pos *pos; + + assert("nikita-1359", f != NULL); + inode = f->f_dentry->d_inode; + assert("nikita-1360", inode != NULL); + + if (!S_ISDIR(inode->i_mode)) + return RETERR(-ENOTDIR); + + coord_init_zero(&coord); + init_lh(&lh); + tap_init(&tap, &coord, &lh, ZNODE_READ_LOCK); + + reiser4_readdir_readahead_init(inode, &tap); + + repeat: + result = dir_readdir_init(f, &tap, &pos); + if (result == 0) { + result = tap_load(&tap); + /* scan entries one by one feeding them to @filld */ + while (result == 0) { + coord_t *coord; + + coord = tap.coord; + assert("nikita-2572", coord_is_existing_unit(coord)); + assert("nikita-3227", is_valid_dir_coord(inode, coord)); + + result = feed_entry(f, pos, &tap, filld, dirent); + if (result > 0) { + 
break; + } else if (result == 0) { + ++ f->f_pos; + result = go_next_unit(&tap); + if (result == -E_NO_NEIGHBOR || + result == -ENOENT) { + result = 0; + break; + } else if (result == 0) { + if (is_valid_dir_coord(inode, coord)) + move_entry(pos, coord); + else + break; + } + } else if (result == -E_REPEAT) { + /* feed_entry() had to restart. */ + ++ f->f_pos; + tap_relse(&tap); + goto repeat; + } else + warning("vs-1617", + "readdir_common: unexpected error %d", + result); + } + tap_relse(&tap); + + if (result >= 0) + f->f_version = inode->i_version; + } else if (result == -E_NO_NEIGHBOR || result == -ENOENT) + result = 0; + tap_done(&tap); + detach_fsdata(f); + return (result <= 0) ? result : 0; +} + +/* + * seek method for directory. See comment before readdir_common() for + * explanation. + */ +loff_t +seek_dir(struct file *file, loff_t off, int origin) +{ + loff_t result; + struct inode *inode; + + inode = file->f_dentry->d_inode; + down(&inode->i_sem); + + /* update ->f_pos */ + result = default_llseek(file, off, origin); + if (result >= 0) { + int ff; + coord_t coord; + lock_handle lh; + tap_t tap; + readdir_pos *pos; + + coord_init_zero(&coord); + init_lh(&lh); + tap_init(&tap, &coord, &lh, ZNODE_READ_LOCK); + + ff = dir_readdir_init(file, &tap, &pos); + detach_fsdata(file); + if (ff != 0) + result = (loff_t) ff; + tap_done(&tap); + } + detach_fsdata(file); + up(&inode->i_sem); + return result; +} + +/* ->attach method of directory plugin */ +static int +attach_common(struct inode *child UNUSED_ARG, struct inode *parent UNUSED_ARG) +{ + assert("nikita-2647", child != NULL); + assert("nikita-2648", parent != NULL); + + return 0; +} + +/* ->estimate.add_entry method of directory plugin + estimation of adding entry which supposes that entry is inserting a unit into item +*/ +static reiser4_block_nr +estimate_add_entry_common(struct inode *inode) +{ + return estimate_one_insert_into_item(tree_by_inode(inode)); +} + +/* ->estimate.rem_entry method of directory 
plugin */ +static reiser4_block_nr +estimate_rem_entry_common(struct inode *inode) +{ + return estimate_one_item_removal(tree_by_inode(inode)); +} + +/* placeholder for VFS methods not-applicable to the object */ +static ssize_t +noperm(void) +{ + return RETERR(-EPERM); +} + +#define dir_eperm ((void *)noperm) + +static int +_noop(void) +{ + return 0; +} + +#define enoop ((void *)_noop) + +static int +change_dir(struct inode * inode, reiser4_plugin * plugin) +{ + /* cannot change dir plugin of already existing object */ + return RETERR(-EINVAL); +} + +static reiser4_plugin_ops dir_plugin_ops = { + .init = NULL, + .load = NULL, + .save_len = NULL, + .save = NULL, + .change = change_dir +}; + +/* + * definition of directory plugins + */ + +dir_plugin dir_plugins[LAST_DIR_ID] = { + /* standard hashed directory plugin */ + [HASHED_DIR_PLUGIN_ID] = { + .h = { + .type_id = REISER4_DIR_PLUGIN_TYPE, + .id = HASHED_DIR_PLUGIN_ID, + .pops = &dir_plugin_ops, + .label = "dir", + .desc = "hashed directory", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .get_parent = get_parent_hashed, + .lookup = lookup_hashed, + .unlink = unlink_common, + .link = link_common, + .is_name_acceptable = is_name_acceptable, + .build_entry_key = build_entry_key_common, + .build_readdir_key = build_readdir_key_common, + .add_entry = add_entry_hashed, + .rem_entry = rem_entry_hashed, + .create_child = create_child_common, + .rename = rename_hashed, + .readdir = readdir_common, + .init = init_hashed, + .done = done_hashed, + .attach = attach_common, + .detach = detach_hashed, + .estimate = { + .add_entry = estimate_add_entry_common, + .rem_entry = estimate_rem_entry_common, + .unlink = estimate_unlink_hashed + } + }, + /* hashed directory for which seekdir/telldir are guaranteed to + * work. Brain-damage. 
*/ + [SEEKABLE_HASHED_DIR_PLUGIN_ID] = { + .h = { + .type_id = REISER4_DIR_PLUGIN_TYPE, + .id = SEEKABLE_HASHED_DIR_PLUGIN_ID, + .pops = &dir_plugin_ops, + .label = "dir32", + .desc = "directory hashed with 31 bit hash", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .get_parent = get_parent_hashed, + .lookup = lookup_hashed, + .unlink = unlink_common, + .link = link_common, + .is_name_acceptable = is_name_acceptable, + .build_entry_key = build_entry_key_stable_entry, + .build_readdir_key = build_readdir_key_common, + .add_entry = add_entry_hashed, + .rem_entry = rem_entry_hashed, + .create_child = create_child_common, + .rename = rename_hashed, + .readdir = readdir_common, + .init = init_hashed, + .done = done_hashed, + .attach = attach_common, + .detach = detach_hashed, + .estimate = { + .add_entry = estimate_add_entry_common, + .rem_entry = estimate_rem_entry_common, + .unlink = estimate_unlink_hashed + } + }, + /* pseudo directory. */ + [PSEUDO_DIR_PLUGIN_ID] = { + .h = { + .type_id = REISER4_DIR_PLUGIN_TYPE, + .id = PSEUDO_DIR_PLUGIN_ID, + .pops = &dir_plugin_ops, + .label = "pseudo", + .desc = "pseudo directory", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .get_parent = get_parent_pseudo, + .lookup = lookup_pseudo, + .unlink = dir_eperm, + .link = dir_eperm, + .is_name_acceptable = NULL, + .build_entry_key = NULL, + .build_readdir_key = NULL, + .add_entry = dir_eperm, + .rem_entry = dir_eperm, + .create_child = NULL, + .rename = dir_eperm, + .readdir = readdir_pseudo, + .init = enoop, + .done = enoop, + .attach = enoop, + .detach = enoop, + .estimate = { + .add_entry = NULL, + .rem_entry = NULL, + .unlink = NULL + } + } +}; + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/dir/dir.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/dir/dir.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,106 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Directory plugin's methods. See dir.c for more details. */ + +#if !defined( __REISER4_DIR_H__ ) +#define __REISER4_DIR_H__ + +#include "../../forward.h" +#include "../../kassign.h" +#include "../../type_safe_hash.h" + +#include /* for __u?? */ +#include /* for struct file */ +#include + +/* locking: fields of per file descriptor readdir_pos and ->f_pos are + * protected by ->i_sem on inode. Under this lock following invariant + * holds: + * + * file descriptor is "looking" at the entry_no-th directory entry from + * the beginning of directory. This entry has key dir_entry_key and is + * pos-th entry with duplicate-key sequence. + * + */ + +/* logical position within directory */ +typedef struct { + /* key of directory entry (actually, part of a key sufficient to + identify directory entry) */ + de_id dir_entry_key; + /* ordinal number of directory entry among all entries with the same + key. (Starting from 0.) */ + unsigned pos; +} dir_pos; + +typedef struct { + /* f_pos corresponding to this readdir position */ + __u64 fpos; + /* logical position within directory */ + dir_pos position; + /* logical number of directory entry within + directory */ + __u64 entry_no; +} readdir_pos; + +extern void adjust_dir_file(struct inode *dir, const struct dentry *de, + int offset, int adj); +extern loff_t seek_dir(struct file *file, loff_t off, int origin); + +/* description of directory entry being created/destroyed/sought for + + It is passed down to the directory plugin and farther to the + directory item plugin methods. 
Creation of new directory entry is done in + several stages: first we search for an entry with the same name, then + create a new one. reiser4_dir_entry_desc is used to store some information + collected at some stage of this process and required later: key of + item that we want to insert/delete and pointer to an object that will + be bound by the new directory entry. Probably some more fields will + be added there. + +*/ +struct reiser4_dir_entry_desc { + /* key of directory entry */ + reiser4_key key; + /* object bound by this entry. */ + struct inode *obj; +}; + +int is_name_acceptable(const struct inode *inode, const char *name UNUSED_ARG, int len); +int is_dir_empty(const struct inode *dir); +int reiser4_update_dir(struct inode *dir); + +void dispose_cursors(struct inode *inode); +void load_cursors(struct inode *inode); +void kill_cursors(struct inode *inode); + +typedef struct dir_cursor dir_cursor; + +TYPE_SAFE_HASH_DECLARE(d_cursor, dir_cursor); + +int d_cursor_init_at(struct super_block *s); +void d_cursor_done_at(struct super_block *s); + +/* + * information about d_cursors (detached readdir state) maintained in reiser4 + * specific portion of reiser4 super-block. See dir.c for more information on + * d_cursors. + */ +typedef struct d_cursor_info { + d_cursor_hash_table table; + struct radix_tree_root tree; +} d_cursor_info; + +/* __REISER4_DIR_H__ */ +#endif + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/dir/hashed_dir.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/dir/hashed_dir.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1459 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Directory plugin using hashes (see fs/reiser4/plugin/hash.c) to map file + names to the files. */ + +/* See fs/reiser4/doc/directory-service for initial design note.
*/ + +/* + * Hashed directory logically consists of persistent directory + * entries. Directory entry is a pair of a file name and a key of stat-data of + * a file that has this name in the given directory. + * + * Directory entries are stored in the tree in the form of directory + * items. Directory item should implement dir_entry_ops portion of item plugin + * interface (see plugin/item/item.h). Hashed directory interacts with + * directory item plugin exclusively through dir_entry_ops operations. + * + * Currently there are two implementations of directory items: "simple + * directory item" (plugin/item/sde.[ch]), and "compound directory item" + * (plugin/item/cde.[ch]) with the latter being the default. + * + * There is, however, one subtle way in which directory code interacts + * with item plugin: key assignment policy. A key for a directory item is + * chosen by directory code, and as described in kassign.c, this key contains + * a portion of file name. Directory item uses this knowledge to avoid storing + * this portion of file name twice: in the key and in the directory item body.
+ * + */ + +#include "../../forward.h" +#include "../../debug.h" +#include "../../spin_macros.h" +#include "../../key.h" +#include "../../kassign.h" +#include "../../coord.h" +#include "../../seal.h" +#include "dir.h" +#include "../item/item.h" +#include "../security/perm.h" +#include "../pseudo/pseudo.h" +#include "../plugin.h" +#include "../object.h" +#include "../../jnode.h" +#include "../../znode.h" +#include "../../tree.h" +#include "../../vfs_ops.h" +#include "../../inode.h" +#include "../../reiser4.h" +#include "../../safe_link.h" + +#include /* for struct inode */ +#include /* for struct dentry */ + +static int create_dot_dotdot(struct inode *object, struct inode *parent); +static int find_entry(struct inode *dir, struct dentry *name, + lock_handle * lh, znode_lock_mode mode, + reiser4_dir_entry_desc * entry); +static int check_item(const struct inode *dir, + const coord_t * coord, const char *name); + +static reiser4_block_nr +hashed_estimate_init(struct inode *parent, struct inode *object) +{ + reiser4_block_nr res = 0; + + assert("vpf-321", parent != NULL); + assert("vpf-322", object != NULL); + + /* hashed_add_entry(object) */ + res += inode_dir_plugin(object)->estimate.add_entry(object); + /* reiser4_add_nlink(object) */ + res += inode_file_plugin(object)->estimate.update(object); + /* hashed_add_entry(object) */ + res += inode_dir_plugin(object)->estimate.add_entry(object); + /* reiser4_add_nlink(parent) */ + res += inode_file_plugin(parent)->estimate.update(parent); + + return 0; +} + +/* plugin->u.dir.init + create sd for directory file. Create stat-data, dot, and dotdot. 
*/ +reiser4_internal int +init_hashed(struct inode *object /* new directory */ , + struct inode *parent /* parent directory */ , + reiser4_object_create_data * data UNUSED_ARG /* info passed + * to us, this + * is filled by + * reiser4() + * syscall in + * particular */ ) +{ + reiser4_block_nr reserve; + + assert("nikita-680", object != NULL); + assert("nikita-681", S_ISDIR(object->i_mode)); + assert("nikita-682", parent != NULL); + assert("nikita-684", data != NULL); + assert("nikita-686", data->id == DIRECTORY_FILE_PLUGIN_ID); + assert("nikita-687", object->i_mode & S_IFDIR); + + reserve = hashed_estimate_init(parent, object); + if (reiser4_grab_space(reserve, BA_CAN_COMMIT)) + return RETERR(-ENOSPC); + + return create_dot_dotdot(object, parent); +} + +static reiser4_block_nr +hashed_estimate_done(struct inode *object) +{ + reiser4_block_nr res = 0; + + /* hashed_rem_entry(object) */ + res += inode_dir_plugin(object)->estimate.rem_entry(object); + return res; +} + +/* plugin->u.dir.estimate.unlink */ +reiser4_internal reiser4_block_nr +estimate_unlink_hashed(struct inode *parent, struct inode *object) +{ + reiser4_block_nr res = 0; + + /* hashed_rem_entry(object) */ + res += inode_dir_plugin(object)->estimate.rem_entry(object); + /* del_nlink(parent) */ + res += 2 * inode_file_plugin(parent)->estimate.update(parent); + + return res; +} + +/* ->delete() method of directory plugin + plugin->u.dir.done + Delete dot, and call common_file_delete() to delete stat data. +*/ +reiser4_internal int +done_hashed(struct inode *object /* object being deleted */) +{ + int result; + reiser4_block_nr reserve; + struct dentry goodby_dots; + reiser4_dir_entry_desc entry; + + assert("nikita-1449", object != NULL); + + if (inode_get_flag(object, REISER4_NO_SD)) + return 0; + + /* of course, this can be rewritten to sweep everything in one + cut_tree(). 
*/ + memset(&entry, 0, sizeof entry); + + /* FIXME: this done method is called from delete_directory_common which + * reserved space already */ + reserve = hashed_estimate_done(object); + if (reiser4_grab_space(reserve, BA_CAN_COMMIT | BA_RESERVED)) + return RETERR(-ENOSPC); + + memset(&goodby_dots, 0, sizeof goodby_dots); + entry.obj = goodby_dots.d_inode = object; + goodby_dots.d_name.name = "."; + goodby_dots.d_name.len = 1; + result = rem_entry_hashed(object, &goodby_dots, &entry); + reiser4_free_dentry_fsdata(&goodby_dots); + if (unlikely(result != 0 && result != -ENOMEM && result != -ENOENT)) + /* only worth a warning + + "values of B will give rise to dom!\n" + -- v6src/s2/mv.c:89 + */ + warning("nikita-2252", "Cannot remove dot of %lli: %i", + (unsigned long long)get_inode_oid(object), result); + return 0; +} + +/* ->detach() method of directory plugin + plugin->u.dir.detach + Delete dotdot, decrease nlink on parent +*/ +reiser4_internal int +detach_hashed(struct inode *object, struct inode *parent) +{ + int result; + struct dentry goodby_dots; + reiser4_dir_entry_desc entry; + + assert("nikita-2885", object != NULL); + assert("nikita-2886", !inode_get_flag(object, REISER4_NO_SD)); + + memset(&entry, 0, sizeof entry); + + /* NOTE-NIKITA this only works if @parent is -the- parent of + @object, viz. object whose key is stored in dotdot + entry. Wouldn't work with hard-links on directories. */ + memset(&goodby_dots, 0, sizeof goodby_dots); + entry.obj = goodby_dots.d_inode = parent; + goodby_dots.d_name.name = ".."; + goodby_dots.d_name.len = 2; + result = rem_entry_hashed(object, &goodby_dots, &entry); + reiser4_free_dentry_fsdata(&goodby_dots); + if (result == 0) { + /* the dot should be the only entry remaining at this time... */ + assert("nikita-3400", object->i_size == 1); + /* and, together with the only name directory can have, they + * account for the last 2 remaining references.
If we get + * here as part of error handling during mkdir, @object + * possibly has no name yet, so its nlink == 1. If we get here + * from rename (targeting empty directory), it no longer has + * a name, so its nlink == 1. */ + assert("nikita-3401", + object->i_nlink == 2 || object->i_nlink == 1); + + reiser4_del_nlink(parent, object, 0); + } + return result; +} + + +/* ->owns_item() for hashed directory object plugin. */ +reiser4_internal int +owns_item_hashed(const struct inode *inode /* object to check against */ , + const coord_t * coord /* coord of item to check */ ) +{ + reiser4_key item_key; + + assert("nikita-1335", inode != NULL); + assert("nikita-1334", coord != NULL); + + if (item_type_by_coord(coord) == DIR_ENTRY_ITEM_TYPE) + return get_key_locality(item_key_by_coord(coord, &item_key)) == get_inode_oid(inode); + else + return owns_item_common(inode, coord); +} + +/* helper function for directory_file_create(). Create "." and ".." */ +static int +create_dot_dotdot(struct inode *object /* object to create dot and + * dotdot for */ , + struct inode *parent /* parent of @object */ ) +{ + int result; + struct dentry dots_entry; + reiser4_dir_entry_desc entry; + + assert("nikita-688", object != NULL); + assert("nikita-689", S_ISDIR(object->i_mode)); + assert("nikita-691", parent != NULL); + + /* We store dot and dotdot as normal directory entries. This is + not necessary, because almost all information stored in them + is already in the stat-data of directory, the only thing + missing is objectid of grand-parent directory that can + easily be added there as extension. + + But it is done the way it is done, because not storing dot + and dotdot will lead to the following complications: + + . special case handling in ->lookup(). + . addition of another extension to the sd. + . dependency on key allocation policy for stat data.
+ + */ + + memset(&entry, 0, sizeof entry); + memset(&dots_entry, 0, sizeof dots_entry); + entry.obj = dots_entry.d_inode = object; + dots_entry.d_name.name = "."; + dots_entry.d_name.len = 1; + result = add_entry_hashed(object, &dots_entry, NULL, &entry); + reiser4_free_dentry_fsdata(&dots_entry); + + if (result == 0) { + result = reiser4_add_nlink(object, object, 0); + if (result == 0) { + entry.obj = dots_entry.d_inode = parent; + dots_entry.d_name.name = ".."; + dots_entry.d_name.len = 2; + result = add_entry_hashed(object, + &dots_entry, NULL, &entry); + reiser4_free_dentry_fsdata(&dots_entry); + /* if creation of ".." failed, iput() will delete + object with ".". */ + if (result == 0) { + result = reiser4_add_nlink(parent, object, 0); + if (result != 0) + /* + * if we failed to bump i_nlink, try + * to remove ".." + */ + detach_hashed(object, parent); + } + } + } + + if (result != 0) { + /* + * in the case of error, at least update stat-data so that, + * ->i_nlink updates are not lingering. 
+ */ + reiser4_update_sd(object); + reiser4_update_sd(parent); + } + + return result; +} + +/* looks for name specified in @dentry in directory @parent and if name is + found - key of object found entry points to is stored in @entry->key */ +static int +lookup_name_hashed(struct inode *parent /* inode of directory to lookup for + * name in */, + struct dentry *dentry /* name to look for */, + reiser4_key *key /* place to store key */) +{ + int result; + coord_t *coord; + lock_handle lh; + const char *name; + int len; + reiser4_dir_entry_desc entry; + reiser4_dentry_fsdata *fsdata; + + assert("nikita-1247", parent != NULL); + assert("nikita-1248", dentry != NULL); + assert("nikita-1123", dentry->d_name.name != NULL); + assert("vs-1486", + dentry->d_op == &get_super_private(parent->i_sb)->ops.dentry); + + result = perm_chk(parent, lookup, parent, dentry); + if (result != 0) + return 0; + + name = dentry->d_name.name; + len = dentry->d_name.len; + + if (!is_name_acceptable(parent, name, len)) + /* some arbitrary error code to return */ + return RETERR(-ENAMETOOLONG); + + fsdata = reiser4_get_dentry_fsdata(dentry); + if (IS_ERR(fsdata)) + return PTR_ERR(fsdata); + + coord = &fsdata->dec.entry_coord; + coord_clear_iplug(coord); + init_lh(&lh); + + /* find entry in a directory. This is plugin method. */ + result = find_entry(parent, dentry, &lh, ZNODE_READ_LOCK, &entry); + if (result == 0) { + /* entry was found, extract object key from it. */ + result = WITH_COORD(coord, item_plugin_by_coord(coord)->s.dir.extract_key(coord, key)); + } + done_lh(&lh); + return result; + +} + +/* + * helper for ->lookup() and ->get_parent() methods: if @inode is a + * light-weight file, setup its credentials that are not stored in the + * stat-data in this case + */ +static void +check_light_weight(struct inode *inode, struct inode *parent) +{ + if (inode_get_flag(inode, REISER4_LIGHT_WEIGHT)) { + inode->i_uid = parent->i_uid; + inode->i_gid = parent->i_gid; + /* clear light-weight flag. 
If inode would be read by any + other name, [ug]id wouldn't change. */ + inode_clr_flag(inode, REISER4_LIGHT_WEIGHT); + } +} + +/* implementation of ->lookup() method for hashed directories. */ +reiser4_internal int +lookup_hashed(struct inode * parent /* inode of directory to + * lookup into */ , + struct dentry **dentryloc /* name to look for */ ) +{ + int result; + struct inode *inode; + struct dentry *dentry; + reiser4_dir_entry_desc entry; + + dentry = *dentryloc; + /* set up operations on dentry. */ + dentry->d_op = &get_super_private(parent->i_sb)->ops.dentry; + + result = lookup_name_hashed(parent, dentry, &entry.key); + if (result == 0) { + inode = reiser4_iget(parent->i_sb, &entry.key, 0); + if (!IS_ERR(inode)) { + check_light_weight(inode, parent); + /* success */ + *dentryloc = d_splice_alias(inode, dentry); + reiser4_iget_complete(inode); + } else + result = PTR_ERR(inode); + } else if (result == -ENOENT) + result = lookup_pseudo_file(parent, dentryloc); + + return result; +} + +/* + * ->get_parent() method of hashed directory. This is used by NFS kernel + * server to "climb" up directory tree to check permissions. + */ +reiser4_internal struct dentry * +get_parent_hashed(struct inode *child) +{ + struct super_block *s; + struct inode *parent; + struct dentry dotdot; + struct dentry *dentry; + reiser4_key key; + int result; + + /* + * lookup dotdot entry. + */ + + s = child->i_sb; + memset(&dotdot, 0, sizeof(dotdot)); + dotdot.d_name.name = ".."; + dotdot.d_name.len = 2; + dotdot.d_op = &get_super_private(s)->ops.dentry; + + result = lookup_name_hashed(child, &dotdot, &key); + if (result != 0) + return ERR_PTR(result); + + parent = reiser4_iget(s, &key, 1); + if (!IS_ERR(parent)) { + /* + * FIXME-NIKITA dubious: attributes are inherited from @child + * to @parent. 
But: + * + * (*) this is the only thing we can do + * + * (*) attributes of light-weight object are inherited + * from a parent through which object was looked up first, + * so it is ambiguous anyway. + * + */ + check_light_weight(parent, child); + reiser4_iget_complete(parent); + dentry = d_alloc_anon(parent); + if (dentry == NULL) { + iput(parent); + dentry = ERR_PTR(RETERR(-ENOMEM)); + } else + dentry->d_op = &get_super_private(s)->ops.dentry; + } else if (PTR_ERR(parent) == -ENOENT) + dentry = ERR_PTR(RETERR(-ESTALE)); + else + dentry = (void *)parent; + return dentry; +} + +static const char *possible_leak = "Possible disk space leak."; + +/* re-bind existing name at @from_coord in @from_dir to point to @to_inode. + + Helper function called from rename_hashed() */ +static int +replace_name(struct inode *to_inode /* inode where @from_coord is + * to be re-targeted at */ , + struct inode *from_dir /* directory where @from_coord + * lives */ , + struct inode *from_inode /* inode @from_coord + * originally pointed to */ , + coord_t * from_coord /* where directory entry is in + * the tree */ , + lock_handle * from_lh /* lock handle on @from_coord */ ) +{ + item_plugin *from_item; + int result; + znode *node; + + coord_clear_iplug(from_coord); + node = from_coord->node; + result = zload(node); + if (result != 0) + return result; + from_item = item_plugin_by_coord(from_coord); + if (item_type_by_coord(from_coord) == DIR_ENTRY_ITEM_TYPE) { + reiser4_key to_key; + + build_sd_key(to_inode, &to_key); + + /* everything is found and prepared to change directory entry + at @from_coord to point to @to_inode. + + @to_inode is just about to get new name, so bump its link + counter.
+ + */ + result = reiser4_add_nlink(to_inode, from_dir, 0); + if (result != 0) { + /* Don't issue warning: this may be plain -EMLINK */ + zrelse(node); + return result; + } + + result = from_item->s.dir.update_key(from_coord, &to_key, from_lh); + if (result != 0) { + reiser4_del_nlink(to_inode, from_dir, 0); + zrelse(node); + return result; + } + + /* @from_inode just lost its name, he-he. + + If @from_inode was directory, it contained dotdot pointing + to @from_dir. @from_dir i_nlink will be decreased when + iput() is called on @from_inode. + + If file-system is not ADG (hard-links are + supported on directories), iput(from_inode) will not remove + @from_inode, and thus above is incorrect, but hard-links on + directories are problematic in many other respects. + */ + result = reiser4_del_nlink(from_inode, from_dir, 0); + if (result != 0) { + warning("nikita-2330", + "Cannot remove link from source: %i. %s", + result, possible_leak); + } + /* Has to return success, because entry is already + * modified. */ + result = 0; + + /* NOTE-NIKITA consider calling plugin method instead of + accessing inode fields directly. */ + from_dir->i_mtime = CURRENT_TIME; + } else { + warning("nikita-2326", "Unexpected item type"); + result = RETERR(-EIO); + } + zrelse(node); + return result; +} + +/* add new entry pointing to @inode into @dir at @coord, locked by @lh + + Helper function used by rename_hashed().
*/ +static int +add_name(struct inode *inode /* inode where @coord is to be + * re-targeted at */ , + struct inode *dir /* directory where @coord lives */ , + struct dentry *name /* new name */ , + coord_t * coord /* where directory entry is in the tree */ , + lock_handle * lh /* lock handle on @coord */ , + int is_dir /* true, if @inode is directory */ ) +{ + int result; + reiser4_dir_entry_desc entry; + + assert("nikita-2333", lh->node == coord->node); + assert("nikita-2334", is_dir == S_ISDIR(inode->i_mode)); + + memset(&entry, 0, sizeof entry); + entry.obj = inode; + /* build key of directory entry description */ + inode_dir_plugin(dir)->build_entry_key(dir, &name->d_name, &entry.key); + + /* ext2 does this in different order: first inserts new entry, + then increases directory nlink. We don't want to do this, + because reiser4_add_nlink() calls ->add_link() plugin + method that can fail for whatever reason, leaving us with + cleanup problems. + */ + /* @inode is getting new name */ + reiser4_add_nlink(inode, dir, 0); + /* create @new_name in @new_dir pointing to + @old_inode */ + result = WITH_COORD(coord, + inode_dir_item_plugin(dir)->s.dir.add_entry(dir, + coord, + lh, + name, + &entry)); + if (result != 0) { + int result2; + result2 = reiser4_del_nlink(inode, dir, 0); + if (result2 != 0) { + warning("nikita-2327", "Cannot drop link on %lli %i.
%s", + (unsigned long long)get_inode_oid(inode), + result2, possible_leak); + } + } else + INODE_INC_FIELD(dir, i_size); + return result; +} + +static reiser4_block_nr +hashed_estimate_rename( + struct inode *old_dir /* directory where @old is located */, + struct dentry *old_name /* old name */, + struct inode *new_dir /* directory where @new is located */, + struct dentry *new_name /* new name */) +{ + reiser4_block_nr res1, res2; + dir_plugin *p_parent_old, *p_parent_new; + file_plugin *p_child_old, *p_child_new; + + assert("vpf-311", old_dir != NULL); + assert("vpf-312", new_dir != NULL); + assert("vpf-313", old_name != NULL); + assert("vpf-314", new_name != NULL); + + p_parent_old = inode_dir_plugin(old_dir); + p_parent_new = inode_dir_plugin(new_dir); + p_child_old = inode_file_plugin(old_name->d_inode); + if (new_name->d_inode) + p_child_new = inode_file_plugin(new_name->d_inode); + else + p_child_new = 0; + + /* find_entry - can insert one leaf. */ + res1 = res2 = 1; + + /* replace_name */ + { + /* reiser4_add_nlink(p_child_old) and reiser4_del_nlink(p_child_old) */ + res1 += 2 * p_child_old->estimate.update(old_name->d_inode); + /* update key */ + res1 += 1; + /* reiser4_del_nlink(p_child_new) */ + if (p_child_new) + res1 += p_child_new->estimate.update(new_name->d_inode); + } + + /* else add_name */ + { + /* reiser4_add_nlink(p_parent_new) and reiser4_del_nlink(p_parent_new) */ + res2 += 2 * inode_file_plugin(new_dir)->estimate.update(new_dir); + /* reiser4_add_nlink(p_parent_old) */ + res2 += p_child_old->estimate.update(old_name->d_inode); + /* add_entry(p_parent_new) */ + res2 += p_parent_new->estimate.add_entry(new_dir); + /* reiser4_del_nlink(p_parent_old) */ + res2 += p_child_old->estimate.update(old_name->d_inode); + } + + res1 = res1 < res2 ? 
res2 : res1; + + + /* reiser4_write_sd(p_parent_new) */ + res1 += inode_file_plugin(new_dir)->estimate.update(new_dir); + + /* reiser4_write_sd(p_child_new) */ + if (p_child_new) + res1 += p_child_new->estimate.update(new_name->d_inode); + + /* hashed_rem_entry(p_parent_old) */ + res1 += p_parent_old->estimate.rem_entry(old_dir); + + /* reiser4_del_nlink(p_child_old) */ + res1 += p_child_old->estimate.update(old_name->d_inode); + + /* replace_name */ + { + /* reiser4_add_nlink(p_parent_dir_new) */ + res1 += inode_file_plugin(new_dir)->estimate.update(new_dir); + /* update_key */ + res1 += 1; + /* reiser4_del_nlink(p_parent_new) */ + res1 += inode_file_plugin(new_dir)->estimate.update(new_dir); + /* reiser4_del_nlink(p_parent_old) */ + res1 += inode_file_plugin(old_dir)->estimate.update(old_dir); + } + + /* reiser4_write_sd(p_parent_old) */ + res1 += inode_file_plugin(old_dir)->estimate.update(old_dir); + + /* reiser4_write_sd(p_child_old) */ + res1 += p_child_old->estimate.update(old_name->d_inode); + + return res1; +} + +static int +hashed_rename_estimate_and_grab( + struct inode *old_dir /* directory where @old is located */ , + struct dentry *old_name /* old name */ , + struct inode *new_dir /* directory where @new is located */ , + struct dentry *new_name /* new name */ ) +{ + reiser4_block_nr reserve; + + reserve = hashed_estimate_rename(old_dir, old_name, new_dir, new_name); + + if (reiser4_grab_space(reserve, BA_CAN_COMMIT)) + return RETERR(-ENOSPC); + + return 0; +} + +/* check whether @old_inode and @new_inode can be moved within file system + * tree. This singles out attempts to rename pseudo-files, for example. 
*/ +static int +can_rename(struct inode *old_dir, struct inode *old_inode, + struct inode *new_dir, struct inode *new_inode) +{ + file_plugin *fplug; + dir_plugin *dplug; + + assert("nikita-3370", old_inode != NULL); + + dplug = inode_dir_plugin(new_dir); + fplug = inode_file_plugin(old_inode); + + if (dplug == NULL) + return RETERR(-ENOTDIR); + else if (dplug->create_child == NULL) + return RETERR(-EPERM); + else if (!fplug->can_add_link(old_inode)) + return RETERR(-EMLINK); + else if (new_inode != NULL) { + fplug = inode_file_plugin(new_inode); + if (fplug->can_rem_link != NULL && + !fplug->can_rem_link(new_inode)) + return RETERR(-EBUSY); + } + return 0; +} + +/* ->rename directory plugin method implementation for hashed directories. + plugin->u.dir.rename + See comments in the body. + + Arguably, this function could be made generic, so that it applies + to any kind of directory plugin that deals with directories + composed of directory entries. The only obstacle is that we don't + have any data type to represent a directory entry. This should be + reconsidered when more than one directory plugin is + implemented. +*/ +reiser4_internal int +rename_hashed(struct inode *old_dir /* directory where @old is located */ , + struct dentry *old_name /* old name */ , + struct inode *new_dir /* directory where @new is located */ , + struct dentry *new_name /* new name */ ) +{ + /* From `The Open Group Base Specifications Issue 6' + + + If either the old or new argument names a symbolic link, rename() + shall operate on the symbolic link itself, and shall not resolve + the last component of the argument. If the old argument and the new + argument resolve to the same existing file, rename() shall return + successfully and perform no other action. 
+ + [this is done by VFS: vfs_rename()] + + + If the old argument points to the pathname of a file that is not a + directory, the new argument shall not point to the pathname of a + directory. + + [checked by VFS: vfs_rename->may_delete()] + + If the link named by the new argument exists, it shall + be removed and old renamed to new. In this case, a link named new + shall remain visible to other processes throughout the renaming + operation and refer either to the file referred to by new or old + before the operation began. + + [we should assure this] + + Write access permission is required for + both the directory containing old and the directory containing new. + + [checked by VFS: vfs_rename->may_delete(), may_create()] + + If the old argument points to the pathname of a directory, the new + argument shall not point to the pathname of a file that is not a + directory. + + [checked by VFS: vfs_rename->may_delete()] + + If the directory named by the new argument exists, it + shall be removed and old renamed to new. In this case, a link named + new shall exist throughout the renaming operation and shall refer + either to the directory referred to by new or old before the + operation began. + + [we should assure this] + + If new names an existing directory, it shall be + required to be an empty directory. + + [we should check this] + + If the old argument points to a pathname of a symbolic link, the + symbolic link shall be renamed. If the new argument points to a + pathname of a symbolic link, the symbolic link shall be removed. + + The new pathname shall not contain a path prefix that names + old. Write access permission is required for the directory + containing old and the directory containing new. If the old + argument points to the pathname of a directory, write access + permission may be required for the directory named by old, and, if + it exists, the directory named by new. 
+ + [checked by VFS: vfs_rename(), vfs_rename_dir()] + + If the link named by the new argument exists and the file's link + count becomes 0 when it is removed and no process has the file + open, the space occupied by the file shall be freed and the file + shall no longer be accessible. If one or more processes have the + file open when the last link is removed, the link shall be removed + before rename() returns, but the removal of the file contents shall + be postponed until all references to the file are closed. + + [iput() handles this, but we can do this manually, a la + reiser4_unlink()] + + Upon successful completion, rename() shall mark for update the + st_ctime and st_mtime fields of the parent directory of each file. + + [N/A] + + */ + + int result; + int is_dir; /* is @old_name directory */ + + struct inode *old_inode; + struct inode *new_inode; + + reiser4_dir_entry_desc old_entry; + reiser4_dir_entry_desc new_entry; + + coord_t *new_coord; + + reiser4_dentry_fsdata *new_fsdata; + + lock_handle new_lh; + + dir_plugin *dplug; + file_plugin *fplug; + + assert("nikita-2318", old_dir != NULL); + assert("nikita-2319", new_dir != NULL); + assert("nikita-2320", old_name != NULL); + assert("nikita-2321", new_name != NULL); + + old_inode = old_name->d_inode; + new_inode = new_name->d_inode; + + dplug = inode_dir_plugin(old_dir); + fplug = NULL; + + new_fsdata = reiser4_get_dentry_fsdata(new_name); + if (IS_ERR(new_fsdata)) + return PTR_ERR(new_fsdata); + + new_coord = &new_fsdata->dec.entry_coord; + coord_clear_iplug(new_coord); + + is_dir = S_ISDIR(old_inode->i_mode); + + assert("nikita-3461", old_inode->i_nlink >= 1 + !!is_dir); + + /* if the target is an existing directory and it is not empty, return an error. + + This check is done separately, because is_dir_empty() requires + tree traversal and has to be done before locks are taken. 
+ */ + if (is_dir && new_inode != NULL && is_dir_empty(new_inode) != 0) + return RETERR(-ENOTEMPTY); + + result = can_rename(old_dir, old_inode, new_dir, new_inode); + if (result != 0) + return result; + + result = hashed_rename_estimate_and_grab(old_dir, old_name, + new_dir, new_name); + if (result != 0) + return result; + + init_lh(&new_lh); + + /* find entry for @new_name */ + result = find_entry(new_dir, + new_name, &new_lh, ZNODE_WRITE_LOCK, &new_entry); + + if (IS_CBKERR(result)) { + done_lh(&new_lh); + return result; + } + + seal_done(&new_fsdata->dec.entry_seal); + + /* add or replace name for @old_inode as @new_name */ + if (new_inode != NULL) { + /* target (@new_name) exists. */ + /* Not clear what to do with objects that are + both directories and files at the same time. */ + if (result == CBK_COORD_FOUND) { + result = replace_name(old_inode, + new_dir, + new_inode, + new_coord, + &new_lh); + if (result == 0) + fplug = inode_file_plugin(new_inode); + } else if (result == CBK_COORD_NOTFOUND) { + /* VFS told us that @new_name is bound to an existing + inode, but we failed to find the directory entry. */ + warning("nikita-2324", "Target not found"); + result = RETERR(-ENOENT); + } + } else { + /* target (@new_name) doesn't exist. */ + if (result == CBK_COORD_NOTFOUND) + result = add_name(old_inode, + new_dir, + new_name, + new_coord, + &new_lh, is_dir); + else if (result == CBK_COORD_FOUND) { + /* VFS told us that @new_name is a "negative" dentry, + but we found a directory entry. */ + warning("nikita-2331", "Target found unexpectedly"); + result = RETERR(-EIO); + } + } + + assert("nikita-3462", ergo(result == 0, + old_inode->i_nlink >= 2 + !!is_dir)); + + /* We are done with all modifications to the @new_dir, release lock on + node. */ + done_lh(&new_lh); + + if (fplug != NULL) { + /* detach @new_inode from name-space */ + result = fplug->detach(new_inode, new_dir); + if (result != 0) + warning("nikita-2330", "Cannot detach %lli: %i. 
%s", + (unsigned long long)get_inode_oid(new_inode), + result, possible_leak); + } + + if (new_inode != NULL) + reiser4_mark_inode_dirty(new_inode); + + if (result == 0) { + memset(&old_entry, 0, sizeof old_entry); + old_entry.obj = old_inode; + + dplug->build_entry_key(old_dir, + &old_name->d_name, &old_entry.key); + + /* At this stage new name was introduced for + @old_inode. @old_inode, @new_dir, and @new_inode i_nlink + counters were updated. + + We want to remove @old_name now. If @old_inode wasn't + directory this is simple. + */ + result = rem_entry_hashed(old_dir, old_name, &old_entry); + if (result != 0 && result != -ENOMEM) { + warning("nikita-2335", + "Cannot remove old name: %i", result); + } else { + result = reiser4_del_nlink(old_inode, old_dir, 0); + if (result != 0 && result != -ENOMEM) { + warning("nikita-2337", + "Cannot drop link on old: %i", result); + } + } + + if (result == 0 && is_dir) { + /* @old_inode is directory. We also have to update + dotdot entry. */ + coord_t *dotdot_coord; + lock_handle dotdot_lh; + struct dentry dotdot_name; + reiser4_dir_entry_desc dotdot_entry; + reiser4_dentry_fsdata dataonstack; + reiser4_dentry_fsdata *fsdata; + + memset(&dataonstack, 0, sizeof dataonstack); + memset(&dotdot_entry, 0, sizeof dotdot_entry); + dotdot_entry.obj = old_dir; + memset(&dotdot_name, 0, sizeof dotdot_name); + dotdot_name.d_name.name = ".."; + dotdot_name.d_name.len = 2; + /* + * allocate ->d_fsdata on the stack to avoid using + * reiser4_get_dentry_fsdata(). Locking is not needed, + * because dentry is private to the current thread. 
+ */ + dotdot_name.d_fsdata = &dataonstack; + init_lh(&dotdot_lh); + + fsdata = &dataonstack; + dotdot_coord = &fsdata->dec.entry_coord; + coord_clear_iplug(dotdot_coord); + + result = find_entry(old_inode, &dotdot_name, &dotdot_lh, + ZNODE_WRITE_LOCK, &dotdot_entry); + if (result == 0) { + /* replace_name() decreases i_nlink on + * @old_dir */ + result = replace_name(new_dir, + old_inode, + old_dir, + dotdot_coord, + &dotdot_lh); + } else + result = RETERR(-EIO); + done_lh(&dotdot_lh); + } + } + reiser4_update_dir(new_dir); + reiser4_update_dir(old_dir); + reiser4_mark_inode_dirty(old_inode); + if (result == 0) { + file_plugin *fplug; + + if (new_inode != NULL) { + /* add safe-link for target file (in case we removed + * last reference to the poor fellow */ + fplug = inode_file_plugin(new_inode); + if (fplug->not_linked(new_inode)) + result = safe_link_add(new_inode, SAFE_UNLINK); + } + } + return result; +} + +/* ->add_entry() method for hashed directory object plugin. + plugin->u.dir.add_entry +*/ +reiser4_internal int +add_entry_hashed(struct inode *object /* directory to add new name + * in */ , + struct dentry *where /* new name */ , + reiser4_object_create_data * data UNUSED_ARG /* parameters + * of new + * object */ , + reiser4_dir_entry_desc * entry /* parameters of new + * directory entry */ ) +{ + int result; + coord_t *coord; + lock_handle lh; + reiser4_dentry_fsdata *fsdata; + reiser4_block_nr reserve; + + assert("nikita-1114", object != NULL); + assert("nikita-1250", where != NULL); + + fsdata = reiser4_get_dentry_fsdata(where); + if (unlikely(IS_ERR(fsdata))) + return PTR_ERR(fsdata); + + reserve = inode_dir_plugin(object)->estimate.add_entry(object); + if (reiser4_grab_space(reserve, BA_CAN_COMMIT)) + return RETERR(-ENOSPC); + + init_lh(&lh); + coord = &fsdata->dec.entry_coord; + coord_clear_iplug(coord); + + /* check for this entry in a directory. This is plugin method. 
*/ + result = find_entry(object, where, &lh, ZNODE_WRITE_LOCK, entry); + if (likely(result == -ENOENT)) { + /* add new entry. Just pass control to the directory + item plugin. */ + assert("nikita-1709", inode_dir_item_plugin(object)); + assert("nikita-2230", coord->node == lh.node); + seal_done(&fsdata->dec.entry_seal); + result = inode_dir_item_plugin(object)->s.dir.add_entry(object, coord, &lh, where, entry); + if (result == 0) { + adjust_dir_file(object, where, fsdata->dec.pos + 1, +1); + INODE_INC_FIELD(object, i_size); + } + } else if (result == 0) { + assert("nikita-2232", coord->node == lh.node); + result = RETERR(-EEXIST); + } + done_lh(&lh); + + return result; +} + +/* ->rem_entry() method for hashed directory object plugin. + plugin->u.dir.rem_entry + */ +reiser4_internal int +rem_entry_hashed(struct inode *object /* directory from which entry + * is being removed */ , + struct dentry *where /* name that is being + * removed */ , + reiser4_dir_entry_desc * entry /* description of entry being + * removed */ ) +{ + int result; + coord_t *coord; + lock_handle lh; + reiser4_dentry_fsdata *fsdata; + __u64 tograb; + + /* yes, nested function, so what? Sue me. 
*/ + int rem_entry(void) { + item_plugin *iplug; + struct inode *child; + + iplug = inode_dir_item_plugin(object); + child = where->d_inode; + assert("nikita-3399", child != NULL); + + /* check that we are really destroying an entry for @child */ + if (REISER4_DEBUG) { + int result; + reiser4_key key; + + result = iplug->s.dir.extract_key(coord, &key); + if (result != 0) + return result; + if (get_key_objectid(&key) != get_inode_oid(child)) { + warning("nikita-3397", + "rem_entry: %#llx != %#llx\n", + get_key_objectid(&key), + (unsigned long long)get_inode_oid(child)); + return RETERR(-EIO); + } + } + return iplug->s.dir.rem_entry(object, + &where->d_name, coord, &lh, entry); + } + + assert("nikita-1124", object != NULL); + assert("nikita-1125", where != NULL); + + tograb = inode_dir_plugin(object)->estimate.rem_entry(object); + result = reiser4_grab_space(tograb, BA_CAN_COMMIT | BA_RESERVED); + if (result != 0) + return RETERR(-ENOSPC); + + init_lh(&lh); + + /* check for this entry in a directory. This is plugin method. */ + result = find_entry(object, where, &lh, ZNODE_WRITE_LOCK, entry); + fsdata = reiser4_get_dentry_fsdata(where); + if (IS_ERR(fsdata)) + return PTR_ERR(fsdata); + + coord = &fsdata->dec.entry_coord; + + assert("nikita-3404", + get_inode_oid(where->d_inode) != get_inode_oid(object) || + object->i_size <= 1); + + coord_clear_iplug(coord); + if (result == 0) { + /* remove entry. Just pass control to the directory item + plugin. 
*/ + assert("vs-542", inode_dir_item_plugin(object)); + seal_done(&fsdata->dec.entry_seal); + adjust_dir_file(object, where, fsdata->dec.pos, -1); + result = WITH_COORD(coord, rem_entry()); + if (result == 0) { + if (object->i_size >= 1) + INODE_DEC_FIELD(object, i_size); + else { + warning("nikita-2509", "Dir %llu is runt", + (unsigned long long)get_inode_oid(object)); + result = RETERR(-EIO); + } + + assert("nikita-3405", where->d_inode->i_nlink != 1 || + where->d_inode->i_size != 2 || + inode_dir_plugin(where->d_inode) == NULL); + } + } + done_lh(&lh); + + return result; +} + +static int entry_actor(reiser4_tree * tree /* tree being scanned */ , + coord_t * coord /* current coord */ , + lock_handle * lh /* current lock handle */ , + void *args /* argument to scan */ ); + +/* + * argument package used by entry_actor to scan entries with identical keys. + */ +typedef struct entry_actor_args { + /* name we are looking for */ + const char *name; + /* key of directory entry. entry_actor() scans through the sequence of + * items/units having the same key */ + reiser4_key *key; + /* how many entries with duplicate keys were scanned so far. */ + int non_uniq; +#if REISER4_USE_COLLISION_LIMIT || REISER4_STATS + /* scan limit */ + int max_non_uniq; +#endif + /* return parameter: set to true, if ->name wasn't found */ + int not_found; + /* what type of lock to take when moving to the next node during + * scan */ + znode_lock_mode mode; + + /* last coord that was visited during scan */ + coord_t last_coord; + /* last node locked during scan */ + lock_handle last_lh; + /* inode of directory */ + const struct inode *inode; +} entry_actor_args; + +static int +check_entry(const struct inode *dir, coord_t *coord, const struct qstr *name) +{ + return WITH_COORD(coord, check_item(dir, coord, name->name)); +} + +/* Look for given @name within directory @dir. + + This is called during lookup, creation and removal of directory + entries. 
+ + First calculate key that directory entry for @name would have. Search + for this key in the tree. If such key is found, scan all items with + the same key, checking name in each directory entry along the way. +*/ +static int +find_entry(struct inode *dir /* directory to scan */, + struct dentry *de /* name to search for */, + lock_handle * lh /* resulting lock handle */, + znode_lock_mode mode /* required lock mode */, + reiser4_dir_entry_desc * entry /* parameters of found directory + * entry */) +{ + const struct qstr *name; + seal_t *seal; + coord_t *coord; + int result; + __u32 flags; + de_location *dec; + reiser4_dentry_fsdata *fsdata; + + assert("nikita-1130", lh != NULL); + assert("nikita-1128", dir != NULL); + + name = &de->d_name; + assert("nikita-1129", name != NULL); + + /* dentry private data don't require lock, because dentry + manipulations are protected by i_sem on parent. + + This is not so for inodes, because there is no -the- parent in + inode case. + */ + fsdata = reiser4_get_dentry_fsdata(de); + if (IS_ERR(fsdata)) + return PTR_ERR(fsdata); + dec = &fsdata->dec; + + coord = &dec->entry_coord; + coord_clear_iplug(coord); + seal = &dec->entry_seal; + /* compose key of directory entry for @name */ + inode_dir_plugin(dir)->build_entry_key(dir, name, &entry->key); + + if (seal_is_set(seal)) { + /* check seal */ + result = seal_validate(seal, coord, &entry->key, + lh, mode, ZNODE_LOCK_LOPRI); + if (result == 0) { + /* key was found. Check that it is really item we are + looking for. */ + result = check_entry(dir, coord, name); + if (result == 0) + return 0; + } + } + flags = (mode == ZNODE_WRITE_LOCK) ? CBK_FOR_INSERT : 0; + /* + * find place in the tree where directory item should be located. 
+ */ + result = object_lookup(dir, + &entry->key, + coord, + lh, + mode, + FIND_EXACT, + LEAF_LEVEL, + LEAF_LEVEL, + flags, + 0/*ra_info*/); + + if (result == CBK_COORD_FOUND) { + entry_actor_args arg; + + /* fast path: no hash collisions */ + result = check_entry(dir, coord, name); + if (result == 0) { + seal_init(seal, coord, &entry->key); + dec->pos = 0; + } else if (result > 0) { + /* Iterate through all units with the same keys. */ + arg.name = name->name; + arg.key = &entry->key; + arg.not_found = 0; + arg.non_uniq = 0; +#if REISER4_USE_COLLISION_LIMIT + arg.max_non_uniq = max_hash_collisions(dir); + assert("nikita-2851", arg.max_non_uniq > 1); +#endif + arg.mode = mode; + arg.inode = dir; + coord_init_zero(&arg.last_coord); + init_lh(&arg.last_lh); + + result = iterate_tree(tree_by_inode(dir), coord, lh, + entry_actor, &arg, mode, 1); + /* if end of the tree or extent was reached during + scanning. */ + if (arg.not_found || (result == -E_NO_NEIGHBOR)) { + /* step back */ + done_lh(lh); + + result = zload(arg.last_coord.node); + if (result == 0) { + coord_clear_iplug(&arg.last_coord); + coord_dup(coord, &arg.last_coord); + move_lh(lh, &arg.last_lh); + result = RETERR(-ENOENT); + zrelse(arg.last_coord.node); + --arg.non_uniq; + } + } + + done_lh(&arg.last_lh); + if (result == 0) + seal_init(seal, coord, &entry->key); + + if (result == 0 || result == -ENOENT) { + assert("nikita-2580", arg.non_uniq > 0); + dec->pos = arg.non_uniq - 1; + } + } + } else + dec->pos = -1; + return result; +} + +/* Function called by find_entry() to look for given name in the directory. 
*/ +static int +entry_actor(reiser4_tree * tree UNUSED_ARG /* tree being scanned */ , + coord_t * coord /* current coord */ , + lock_handle * lh /* current lock handle */ , + void *entry_actor_arg /* argument to scan */ ) +{ + reiser4_key unit_key; + entry_actor_args *args; + + assert("nikita-1131", tree != NULL); + assert("nikita-1132", coord != NULL); + assert("nikita-1133", entry_actor_arg != NULL); + + args = entry_actor_arg; + ++args->non_uniq; +#if REISER4_USE_COLLISION_LIMIT + if (args->non_uniq > args->max_non_uniq) { + args->not_found = 1; + /* hash collision overflow. */ + return RETERR(-EBUSY); + } +#endif + + /* + * did we just reach the end of the sequence of items/units with + * identical keys? + */ + if (!keyeq(args->key, unit_key_by_coord(coord, &unit_key))) { + assert("nikita-1791", keylt(args->key, unit_key_by_coord(coord, &unit_key))); + args->not_found = 1; + args->last_coord.between = AFTER_UNIT; + return 0; + } + + coord_dup(&args->last_coord, coord); + /* + * did the scan just move to the next node? + */ + if (args->last_lh.node != lh->node) { + int lock_result; + + /* + * if so, lock the new node with the mode requested by the caller + */ + done_lh(&args->last_lh); + assert("nikita-1896", znode_is_any_locked(lh->node)); + lock_result = longterm_lock_znode(&args->last_lh, lh->node, + args->mode, ZNODE_LOCK_HIPRI); + if (lock_result != 0) + return lock_result; + } + return check_item(args->inode, coord, args->name); +} + +/* + * return 0 iff @coord contains a directory entry for the file with the name + * @name. 
+ */ +static int +check_item(const struct inode *dir, const coord_t * coord, const char *name) +{ + item_plugin *iplug; + char buf[DE_NAME_BUF_LEN]; + + iplug = item_plugin_by_coord(coord); + if (iplug == NULL) { + warning("nikita-1135", "Cannot get item plugin"); + print_coord("coord", coord, 1); + return RETERR(-EIO); + } else if (item_id_by_coord(coord) != item_id_by_plugin(inode_dir_item_plugin(dir))) { + /* item id of the current item does not match the id of the items + the directory is built of */ + warning("nikita-1136", "Wrong item plugin"); + print_coord("coord", coord, 1); + return RETERR(-EIO); + } + assert("nikita-1137", iplug->s.dir.extract_name); + + /* Compare name stored in this entry with name we are looking for. + + NOTE-NIKITA Here should go code for support of something like + unicode, code tables, etc. + */ + return !!strcmp(name, iplug->s.dir.extract_name(coord, buf)); +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/dir/hashed_dir.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/dir/hashed_dir.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,46 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Directory plugin using hashes (see fs/reiser4/plugin/hash.c) to map + file names to files. See hashed_dir.c */ + +#if !defined( __HASHED_DIR_H__ ) +#define __HASHED_DIR_H__ + +#include "../../forward.h" + +#include <linux/fs.h> /* for struct inode */ +#include <linux/dcache.h> /* for struct dentry */ + +/* create sd for directory file. Create stat-data, dot, and dotdot. 
*/ +extern int init_hashed(struct inode *object, struct inode *parent, reiser4_object_create_data *); +extern int done_hashed(struct inode *object); +extern int detach_hashed(struct inode *object, struct inode *parent); +extern int owns_item_hashed(const struct inode *inode, const coord_t * coord); +extern int lookup_hashed(struct inode *inode, struct dentry **dentry); +extern int rename_hashed(struct inode *old_dir, + struct dentry *old_name, struct inode *new_dir, struct dentry *new_name); +extern int add_entry_hashed(struct inode *object, + struct dentry *where, reiser4_object_create_data *, reiser4_dir_entry_desc * entry); +extern int rem_entry_hashed(struct inode *object, struct dentry *where, reiser4_dir_entry_desc * entry); +extern reiser4_block_nr estimate_rename_hashed(struct inode *old_dir, + struct dentry *old_name, + struct inode *new_dir, + struct dentry *new_name); +extern reiser4_block_nr estimate_unlink_hashed(struct inode *parent, + struct inode *object); + +extern struct dentry *get_parent_hashed(struct inode *child); + +/* __HASHED_DIR_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/dir/pseudo_dir.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/dir/pseudo_dir.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,97 @@ +/* Copyright 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Directory plugin for pseudo files that operate like a directory. */ + +#include "../../debug.h" +#include "../../inode.h" +#include "../pseudo/pseudo.h" +#include "dir.h" + +#include <linux/fs.h> /* for struct inode */ +#include <linux/dcache.h> /* for struct dentry */ + +/* implementation of ->lookup() method for pseudo files. 
*/ +reiser4_internal int lookup_pseudo(struct inode * parent, struct dentry **dentry) +{ + pseudo_plugin *pplug; + int result; + + /* + * call ->lookup method of pseudo plugin + */ + + pplug = reiser4_inode_data(parent)->file_plugin_data.pseudo_info.plugin; + assert("nikita-3222", pplug->lookup != NULL); + result = pplug->lookup(parent, dentry); + if (result == -ENOENT) + result = lookup_pseudo_file(parent, dentry); + return result; +} + + +/* ->readdir() method for pseudo file acting like a directory */ +reiser4_internal int +readdir_pseudo(struct file *f, void *dirent, filldir_t filld) +{ + pseudo_plugin *pplug; + struct inode *inode; + struct dentry *dentry; + int result = 0; + + dentry = f->f_dentry; + inode = dentry->d_inode; + pplug = reiser4_inode_data(inode)->file_plugin_data.pseudo_info.plugin; + if (pplug->readdir != NULL) + /* + * if pseudo plugin defines ->readdir() method---call it to do + * actual work. + */ + result = pplug->readdir(f, dirent, filld); + else { + ino_t ino; + int i; + + /* + * if there is no ->readdir() method in the pseudo plugin, + * make sure that at least dot and dotdot are returned to keep + * user-level happy. + */ + + i = f->f_pos; + switch (i) { + case 0: + ino = get_inode_oid(dentry->d_inode); + if (filld(dirent, ".", 1, i, ino, DT_DIR) < 0) + break; + f->f_pos++; + i++; + /* fallthrough */ + case 1: + ino = parent_ino(dentry); + if (filld(dirent, "..", 2, i, ino, DT_DIR) < 0) + break; + f->f_pos++; + i++; + /* fallthrough */ + } + } + return result; +} + +/* pseudo files are not serializable (currently). So, this should just return an + * error. */ +reiser4_internal struct dentry * +get_parent_pseudo(struct inode *child) +{ + return ERR_PTR(RETERR(-ENOTSUPP)); +} + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/dir/pseudo_dir.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/dir/pseudo_dir.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,29 @@ +/* Copyright 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Directory plugin for pseudo files. See pseudo_dir.c for details. */ + +#if !defined( __PSEUDO_DIR_H__ ) +#define __PSEUDO_DIR_H__ + +#include "../../forward.h" + +#include <linux/fs.h> /* for struct inode */ +#include <linux/dcache.h> /* for struct dentry */ + +extern int lookup_pseudo(struct inode * parent, struct dentry **dentry); +extern int readdir_pseudo(struct file *f, void *dirent, filldir_t filld); +extern struct dentry *get_parent_pseudo(struct inode *child); + +/* __PSEUDO_DIR_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/disk_format/disk_format40.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/disk_format/disk_format40.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,556 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "../../debug.h" +#include "../../dformat.h" +#include "../../key.h" +#include "../node/node.h" +#include "../space/space_allocator.h" +#include "disk_format40.h" +#include "../plugin.h" +#include "../../txnmgr.h" +#include "../../jnode.h" +#include "../../tree.h" +#include "../../super.h" +#include "../../wander.h" +#include "../../inode.h" +#include "../../ktxnmgrd.h" +#include "../../status_flags.h" + +#include <linux/types.h> /* for __u?? 
*/ +#include <linux/fs.h> /* for struct super_block */ +#include <linux/buffer_head.h> + +/* reiser 4.0 default disk layout */ + +/* Amount of free blocks needed to perform release_format40 when fs gets + mounted RW: 1 for SB, 1 for non-leaves in overwrite set, 2 for tx header + & tx record. */ +#define RELEASE_RESERVED 4 + +/* functions to access fields of format40_disk_super_block */ +static __u64 +get_format40_block_count(const format40_disk_super_block * sb) +{ + return d64tocpu(&sb->block_count); +} + +static __u64 +get_format40_free_blocks(const format40_disk_super_block * sb) +{ + return d64tocpu(&sb->free_blocks); +} + +static __u64 +get_format40_root_block(const format40_disk_super_block * sb) +{ + return d64tocpu(&sb->root_block); +} + +static __u16 +get_format40_tree_height(const format40_disk_super_block * sb) +{ + return d16tocpu(&sb->tree_height); +} + +static __u64 +get_format40_file_count(const format40_disk_super_block * sb) +{ + return d64tocpu(&sb->file_count); +} + +static __u64 +get_format40_oid(const format40_disk_super_block * sb) +{ + return d64tocpu(&sb->oid); +} + +static __u32 +get_format40_mkfs_id(const format40_disk_super_block * sb) +{ + return d32tocpu(&sb->mkfs_id); +} + +static __u64 +get_format40_flags(const format40_disk_super_block * sb) +{ + return d64tocpu(&sb->flags); +} + +static format40_super_info * +get_sb_info(struct super_block *super) +{ + return &get_super_private(super)->u.format40; +} + +static int +consult_diskmap(struct super_block *s) +{ + format40_super_info *info; + journal_location *jloc; + + info = get_sb_info(s); + jloc = &get_super_private(s)->jloc; + /* Default format-specific locations, if there is nothing in + * diskmap */ + jloc->footer = FORMAT40_JOURNAL_FOOTER_BLOCKNR; + jloc->header = FORMAT40_JOURNAL_HEADER_BLOCKNR; + info->loc.super = FORMAT40_OFFSET / s->s_blocksize; +#ifdef CONFIG_REISER4_BADBLOCKS + reiser4_get_diskmap_value(FORMAT40_PLUGIN_DISKMAP_ID, FORMAT40_JF, + &jloc->footer); + 
reiser4_get_diskmap_value(FORMAT40_PLUGIN_DISKMAP_ID, FORMAT40_JH, + &jloc->header); + reiser4_get_diskmap_value(FORMAT40_PLUGIN_DISKMAP_ID, FORMAT40_SUPER, + &info->loc.super); +#endif + return 0; +} + +/* find any valid super block of disk_format40 (even if the first + super block is destroyed), will change block numbers of actual journal header/footer (jf/jh) + if needed */ +static struct buffer_head * +find_a_disk_format40_super_block(struct super_block *s) +{ + struct buffer_head *super_bh; + format40_disk_super_block *disk_sb; + format40_super_info *info; + + assert("umka-487", s != NULL); + + info = get_sb_info(s); + + super_bh = sb_bread(s, info->loc.super); + if (super_bh == NULL) + return ERR_PTR(RETERR(-EIO)); + + disk_sb = (format40_disk_super_block *) super_bh->b_data; + if (strncmp(disk_sb->magic, FORMAT40_MAGIC, sizeof(FORMAT40_MAGIC))) { + brelse(super_bh); + return ERR_PTR(RETERR(-EINVAL)); + } + + reiser4_set_block_count(s, d64tocpu(&disk_sb->block_count)); + reiser4_set_data_blocks(s, d64tocpu(&disk_sb->block_count) - + d64tocpu(&disk_sb->free_blocks)); + reiser4_set_free_blocks(s, (d64tocpu(&disk_sb->free_blocks))); + + return super_bh; +} + +/* find the most recent version of super block. This is called after journal is + replayed */ +static struct buffer_head * +read_super_block(struct super_block *s UNUSED_ARG) +{ + /* Here the most recent superblock copy has to be read. However, as + journal replay isn't complete, we are using + find_a_disk_format40_super_block() function. 
*/ + return find_a_disk_format40_super_block(s); +} + +static int +get_super_jnode(struct super_block *s) +{ + reiser4_super_info_data *sbinfo = get_super_private(s); + jnode *sb_jnode; + int ret; + + sb_jnode = alloc_io_head(&get_sb_info(s)->loc.super); + + ret = jload(sb_jnode); + + if (ret) { + drop_io_head(sb_jnode); + return ret; + } + + pin_jnode_data(sb_jnode); + jrelse(sb_jnode); + + sbinfo->u.format40.sb_jnode = sb_jnode; + + return 0; +} + +static void +done_super_jnode(struct super_block *s) +{ + jnode *sb_jnode = get_super_private(s)->u.format40.sb_jnode; + + if (sb_jnode) { + unpin_jnode_data(sb_jnode); + drop_io_head(sb_jnode); + } +} + +typedef enum format40_init_stage { + NONE_DONE = 0, + CONSULT_DISKMAP, + FIND_A_SUPER, + INIT_JOURNAL_INFO, + INIT_EFLUSH, + INIT_STATUS, + JOURNAL_REPLAY, + READ_SUPER, + KEY_CHECK, + INIT_OID, + INIT_TREE, + JOURNAL_RECOVER, + INIT_SA, + INIT_JNODE, + ALL_DONE +} format40_init_stage; + +static int +try_init_format40(struct super_block *s, format40_init_stage *stage) +{ + int result; + struct buffer_head *super_bh; + reiser4_super_info_data *sbinfo; + format40_disk_super_block sb; + /* FIXME-NIKITA ugly work-around: keep copy of on-disk super-block */ + format40_disk_super_block *sb_copy = &sb; + tree_level height; + reiser4_block_nr root_block; + node_plugin *nplug; + + cassert(sizeof sb == 512); + + assert("vs-475", s != NULL); + assert("vs-474", get_super_private(s)); + + /* initialize reiser4_super_info_data */ + sbinfo = get_super_private(s); + + *stage = NONE_DONE; + + result = consult_diskmap(s); + if (result) + return result; + *stage = CONSULT_DISKMAP; + + super_bh = find_a_disk_format40_super_block(s); + if (IS_ERR(super_bh)) + return PTR_ERR(super_bh); + brelse(super_bh); + *stage = FIND_A_SUPER; + + /* map jnodes for journal control blocks (header, footer) to disk */ + result = init_journal_info(s); + if (result) + return result; + *stage = INIT_JOURNAL_INFO; + + result = eflush_init_at(s); + if (result) 
+ return result; + *stage = INIT_EFLUSH; + + /* ok, we are sure that filesystem format is a format40 format */ + /* Now check its state */ + result = reiser4_status_init(FORMAT40_STATUS_BLOCKNR); + if (result != 0 && result != -EINVAL) + /* -EINVAL means there is no magic, so probably just old + * fs. */ + return result; + *stage = INIT_STATUS; + + result = reiser4_status_query(NULL, NULL); + if (result == REISER4_STATUS_MOUNT_WARN) + printk("Warning, mounting filesystem with errors\n"); + if (result == REISER4_STATUS_MOUNT_RO) { + printk("Warning, mounting filesystem with fatal errors, forcing read-only mount\n"); + /* FIXME: here we should actually enforce read-only mount, + * but that is not supported yet. */ + } + + result = reiser4_journal_replay(s); + if (result) + return result; + *stage = JOURNAL_REPLAY; + + super_bh = read_super_block(s); + if (IS_ERR(super_bh)) + return PTR_ERR(super_bh); + *stage = READ_SUPER; + + memcpy(sb_copy, ((format40_disk_super_block *) super_bh->b_data), sizeof (*sb_copy)); + brelse(super_bh); + + if (!equi(REISER4_LARGE_KEY, + get_format40_flags(sb_copy) & (1 << FORMAT40_LARGE_KEYS))) { + warning("nikita-3228", "Key format mismatch. " + "Only %s keys are supported.", + REISER4_LARGE_KEY ? 
"large" : "small"); + return RETERR(-EINVAL); + } + *stage = KEY_CHECK; + + result = oid_init_allocator(s, get_format40_file_count(sb_copy), get_format40_oid(sb_copy)); + if (result) + return result; + *stage = INIT_OID; + + /* get things necessary to init reiser4_tree */ + root_block = get_format40_root_block(sb_copy); + height = get_format40_tree_height(sb_copy); + nplug = node_plugin_by_id(NODE40_ID); + + sbinfo->tree.super = s; + /* init reiser4_tree for the filesystem */ + result = init_tree(&sbinfo->tree, &root_block, height, nplug); + if (result) + return result; + *stage = INIT_TREE; + + /* initialize reiser4_super_info_data */ + sbinfo->default_uid = 0; + sbinfo->default_gid = 0; + + reiser4_set_mkfs_id(s, get_format40_mkfs_id(sb_copy)); + reiser4_set_block_count(s, get_format40_block_count(sb_copy)); + reiser4_set_free_blocks(s, get_format40_free_blocks(sb_copy)); + + sbinfo->fsuid = 0; + sbinfo->fs_flags |= (1 << REISER4_ADG); /* hard links for directories + * are not supported */ + sbinfo->fs_flags |= (1 << REISER4_ONE_NODE_PLUGIN); /* all nodes in + * layout 40 are + * of one + * plugin */ + /* sbinfo->tmgr is initialized already */ + + /* recover sb data which were logged separately from sb block */ + + /* NOTE-NIKITA: reiser4_journal_recover_sb_data() calls + * oid_init_allocator() and reiser4_set_free_blocks() with new + * data. What's the reason to call them above? */ + result = reiser4_journal_recover_sb_data(s); + if (result != 0) + return result; + *stage = JOURNAL_RECOVER; + + /* Set number of used blocks. The number of used blocks is not stored + neither in on-disk super block nor in the journal footer blocks. At + this moment actual values of total blocks and free block counters are + set in the reiser4 super block (in-memory structure) and we can + calculate number of used blocks from them. 
*/ + reiser4_set_data_blocks(s, + reiser4_block_count(s) - reiser4_free_blocks(s)); + +#if REISER4_DEBUG + sbinfo->min_blocks_used = + 16 /* reserved area */ + + 2 /* super blocks */ + + 2 /* journal footer and header */; +#endif + + /* init disk space allocator */ + result = sa_init_allocator(get_space_allocator(s), s, 0); + if (result) + return result; + *stage = INIT_SA; + + result = get_super_jnode(s); + if (result == 0) + *stage = ALL_DONE; + return result; +} + +/* plugin->u.format.get_ready */ +reiser4_internal int +get_ready_format40(struct super_block *s, void *data UNUSED_ARG) +{ + int result; + format40_init_stage stage; + + result = try_init_format40(s, &stage); + switch (stage) { + case ALL_DONE: + assert("nikita-3458", result == 0); + break; + case INIT_JNODE: + done_super_jnode(s); + case INIT_SA: + sa_destroy_allocator(get_space_allocator(s), s); + case JOURNAL_RECOVER: + case INIT_TREE: + done_tree(&get_super_private(s)->tree); + case INIT_OID: + case KEY_CHECK: + case READ_SUPER: + case JOURNAL_REPLAY: + case INIT_STATUS: + reiser4_status_finish(); + case INIT_EFLUSH: + eflush_done_at(s); + case INIT_JOURNAL_INFO: + done_journal_info(s); + case FIND_A_SUPER: + case CONSULT_DISKMAP: + case NONE_DONE: + break; + default: + impossible("nikita-3457", "init stage: %i", stage); + } + + if (!rofs_super(s) && reiser4_free_blocks(s) < RELEASE_RESERVED) + return RETERR(-ENOSPC); + + return result; +} + +static void +pack_format40_super(const struct super_block *s, char *data) +{ + format40_disk_super_block *super_data = (format40_disk_super_block *) data; + reiser4_super_info_data *sbinfo = get_super_private(s); + + assert("zam-591", data != NULL); + + cputod64(reiser4_free_committed_blocks(s), &super_data->free_blocks); + cputod64(sbinfo->tree.root_block, &super_data->root_block); + + cputod64(oid_next(s), &super_data->oid); + cputod64(oids_used(s), &super_data->file_count); + + cputod16(sbinfo->tree.height, &super_data->tree_height); +} + +/* 
plugin->u.format.log_super + return a jnode which should be added to transaction when the super block + gets logged */ +reiser4_internal jnode * +log_super_format40(struct super_block *s) +{ + jnode *sb_jnode; + + sb_jnode = get_super_private(s)->u.format40.sb_jnode; + + jload(sb_jnode); + + pack_format40_super(s, jdata(sb_jnode)); + + jrelse(sb_jnode); + + return sb_jnode; +} + +/* plugin->u.format.release */ +reiser4_internal int +release_format40(struct super_block *s) +{ + int ret; + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(s); + assert("zam-579", sbinfo != NULL); + + if (!rofs_super(s)) { + ret = capture_super_block(s); + if (ret != 0) + warning("vs-898", "capture_super_block failed: %d", ret); + + ret = txnmgr_force_commit_all(s, 1); + if (ret != 0) + warning("jmacd-74438", "txn_force failed: %d", ret); + + all_grabbed2free(); + } + + sa_destroy_allocator(&sbinfo->space_allocator, s); + done_journal_info(s); + eflush_done_at(s); + done_super_jnode(s); + + return 0; +} + +#define FORMAT40_ROOT_LOCALITY 41 +#define FORMAT40_ROOT_OBJECTID 42 + +/* plugin->u.format.root_dir_key */ +reiser4_internal const reiser4_key * +root_dir_key_format40(const struct super_block *super UNUSED_ARG) +{ + static const reiser4_key FORMAT40_ROOT_DIR_KEY = { + .el = {{(FORMAT40_ROOT_LOCALITY << 4) | KEY_SD_MINOR}, +#if REISER4_LARGE_KEY + {0ull}, +#endif + {FORMAT40_ROOT_OBJECTID}, {0ull}} + }; + + return &FORMAT40_ROOT_DIR_KEY; +} + +/* plugin->u.format.print_info */ +reiser4_internal void +print_info_format40(const struct super_block *s) +{ +#if 0 + format40_disk_super_block *sb_copy; + + sb_copy = &get_super_private(s)->u.format40.actual_sb; + + printk("\tblock count %llu\n" + "\tfree blocks %llu\n" + "\troot_block %llu\n" + "\ttail policy %s\n" + "\tmin free oid %llu\n" + "\tfile count %llu\n" + "\ttree height %d\n", + get_format40_block_count(sb_copy), + get_format40_free_blocks(sb_copy), + get_format40_root_block(sb_copy), + 
formatting_plugin_by_id(get_format40_formatting_policy(sb_copy))->h.label, + get_format40_oid(sb_copy), get_format40_file_count(sb_copy), get_format40_tree_height(sb_copy)); +#endif +} + +/* plugin->u.format.check_open. + Check the opened object for validity. For now it checks only for a valid oid & + locality; this can be improved later, and its work may depend on the mount + options. */ +reiser4_internal int +check_open_format40(const struct inode *object) { + oid_t max, oid; + + max = oid_next(object->i_sb) - 1; + + /* Check the oid. */ + oid = get_inode_oid(object); + if (oid > max) { + warning("vpf-1360", "The object with the oid %llu " + "greater than the max used oid %llu found.", + (unsigned long long)oid, + (unsigned long long)max); + + return RETERR(-EIO); + } + + /* Check the locality. */ + oid = reiser4_inode_data(object)->locality_id; + if (oid > max) { + warning("vpf-1360", "The object with the locality %llu " + "greater than the max used oid %llu found.", + (unsigned long long)oid, + (unsigned long long)max); + + return RETERR(-EIO); + } + + return 0; +} + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/disk_format/disk_format40.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/disk_format/disk_format40.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,100 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* this file contains: + - definition of ondisk super block of standard disk layout for + reiser 4.0 (layout 40) + - definition of layout 40 specific portion of in-core super block + - declarations of functions implementing methods of layout plugin + for layout 40 + - declarations of functions used to get/set fields in layout 40 super block +*/ + +#ifndef __DISK_FORMAT40_H__ +#define __DISK_FORMAT40_H__ + +/* magic for default reiser4 layout */ +#define FORMAT40_MAGIC "ReIsEr40FoRmAt" +#define FORMAT40_OFFSET (REISER4_MASTER_OFFSET + PAGE_CACHE_SIZE) + +#include "../../dformat.h" + +#include /* for struct super_block */ + +typedef enum { + FORMAT40_LARGE_KEYS +} format40_flags; + +/* ondisk super block for format 40. It is 512 bytes long */ +typedef struct format40_disk_super_block { + /* 0 */ d64 block_count; + /* number of blocks in a filesystem */ + /* 8 */ d64 free_blocks; + /* number of free blocks */ + /* 16 */ d64 root_block; + /* filesystem tree root block */ + /* 24 */ d64 oid; + /* smallest free objectid */ + /* 32 */ d64 file_count; + /* number of files in a filesystem */ + /* 40 */ d64 flushes; + /* number of times super block was + flushed. 
Needed if format 40 + will have a few super blocks */ + /* 48 */ d32 mkfs_id; + /* unique identifier of fs */ + /* 52 */ char magic[16]; + /* magic string ReIsEr40FoRmAt */ + /* 68 */ d16 tree_height; + /* height of filesystem tree */ + /* 70 */ d16 formatting_policy; + /* 72 */ d64 flags; + /* 80 */ char not_used[432]; +} format40_disk_super_block; + +/* format 40 specific part of reiser4_super_info_data */ +typedef struct format40_super_info { +/* format40_disk_super_block actual_sb; */ + jnode *sb_jnode; + struct { + reiser4_block_nr super; + } loc; +} format40_super_info; + +/* Defines for journal header and footer respectively. */ +#define FORMAT40_JOURNAL_HEADER_BLOCKNR \ + ((REISER4_MASTER_OFFSET / PAGE_CACHE_SIZE) + 3) + +#define FORMAT40_JOURNAL_FOOTER_BLOCKNR \ + ((REISER4_MASTER_OFFSET / PAGE_CACHE_SIZE) + 4) + +#define FORMAT40_STATUS_BLOCKNR \ + ((REISER4_MASTER_OFFSET / PAGE_CACHE_SIZE) + 5) + +/* Diskmap declarations */ +#define FORMAT40_PLUGIN_DISKMAP_ID ((REISER4_FORMAT_PLUGIN_TYPE<<16) | (FORMAT40_ID)) +#define FORMAT40_SUPER 1 +#define FORMAT40_JH 2 +#define FORMAT40_JF 3 + +/* declarations of functions implementing methods of layout plugin for + format 40. The functions themselves are in disk_format40.c */ +int get_ready_format40(struct super_block *, void *data); +const reiser4_key *root_dir_key_format40(const struct super_block *); +int release_format40(struct super_block *s); +jnode *log_super_format40(struct super_block *s); +void print_info_format40(const struct super_block *s); +int check_open_format40(const struct inode *object); + +/* __DISK_FORMAT40_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/disk_format/disk_format.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/disk_format/disk_format.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,38 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "../../debug.h" +#include "../plugin_header.h" +#include "disk_format40.h" +#include "disk_format.h" +#include "../plugin.h" + +/* initialization of disk layout plugins */ +disk_format_plugin format_plugins[LAST_FORMAT_ID] = { + [FORMAT40_ID] = { + .h = { + .type_id = REISER4_FORMAT_PLUGIN_TYPE, + .id = FORMAT40_ID, + .pops = NULL, + .label = "reiser40", + .desc = "standard disk layout for reiser40", + .linkage = TYPE_SAFE_LIST_LINK_ZERO, + }, + .get_ready = get_ready_format40, + .root_dir_key = root_dir_key_format40, + .release = release_format40, + .log_super = log_super_format40, + .print_info = print_info_format40, + .check_open = check_open_format40 + } +}; + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/disk_format/disk_format.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/disk_format/disk_format.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,41 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* identifiers for disk layouts, they are also used as indexes in array of disk + plugins */ + +#if !defined( __REISER4_DISK_FORMAT_H__ ) +#define __REISER4_DISK_FORMAT_H__ + +typedef enum { + /* standard reiser4 disk layout plugin id */ + FORMAT40_ID, + LAST_FORMAT_ID +} disk_format_id; + +/* __REISER4_DISK_FORMAT_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ + + + + + + + + + + + + + + diff -puN /dev/null fs/reiser4/plugin/fibration.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/fibration.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,173 @@ +/* Copyright 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Directory fibrations */ + +/* + * Suppose we have a directory tree with sources of some project. During + * compilation .o files are created within this tree. This makes access + * to the original source files less efficient, because source files are + * now "diluted" by object files: default directory plugin uses prefix + * of a file name as a part of the key for directory entry (and this + * part is also inherited by the key of file body). This means that + * foo.o will be located close to foo.c and foo.h in the tree. + * + * To avoid this effect directory plugin fill highest 7 (unused + * originally) bits of the second component of the directory entry key + * by bit-pattern depending on the file name (see + * fs/reiser4/kassign.c:build_entry_key_common()). These bits are called + * "fibre". Fibre of the file name key is inherited by key of stat data + * and keys of file body (in the case of REISER4_LARGE_KEY). + * + * Fibre for a given file is chosen by per-directory fibration + * plugin. Names within given fibre are ordered lexicographically. + */ + +#include "../debug.h" +#include "plugin_header.h" +#include "plugin.h" +#include "../super.h" +#include "../inode.h" + +#include + +static const int fibre_shift = 57; + +#define FIBRE_NO(n) (((__u64)(n)) << fibre_shift) + +/* + * Trivial fibration: all files of directory are just ordered + * lexicographically. + */ +static __u64 fibre_trivial(const struct inode *dir, const char *name, int len) +{ + return FIBRE_NO(0); +} + +/* + * dot-o fibration: place .o files after all others. 
+ */ +static __u64 fibre_dot_o(const struct inode *dir, const char *name, int len) +{ + /* special treatment for .*\.o */ + if (len > 2 && name[len - 1] == 'o' && name[len - 2] == '.') + return FIBRE_NO(1); + else + return FIBRE_NO(0); +} + +/* + * ext.1 fibration: subdivide directory into 128 fibrations one for each + * 7bit extension character (file "foo.h" goes into fibre "h"), plus + * default fibre for the rest. + */ +static __u64 fibre_ext_1(const struct inode *dir, const char *name, int len) +{ + if (len > 2 && name[len - 2] == '.') + return FIBRE_NO(name[len - 1]); + else + return FIBRE_NO(0); +} + +/* + * ext.3 fibration: try to separate files with different 3-character + * extensions from each other. + */ +static __u64 fibre_ext_3(const struct inode *dir, const char *name, int len) +{ + if (len > 4 && name[len - 4] == '.') + return FIBRE_NO(name[len - 3] + name[len - 2] + name[len - 1]); + else + return FIBRE_NO(0); +} + +static int +change_fibration(struct inode * inode, reiser4_plugin * plugin) +{ + int result; + + assert("nikita-3503", inode != NULL); + assert("nikita-3504", plugin != NULL); + + assert("nikita-3505", is_reiser4_inode(inode)); + assert("nikita-3506", inode_dir_plugin(inode) != NULL); + assert("nikita-3507", plugin->h.type_id == REISER4_FIBRATION_PLUGIN_TYPE); + + result = 0; + if (inode_fibration_plugin(inode) == NULL || + inode_fibration_plugin(inode)->h.id != plugin->h.id) { + if (is_dir_empty(inode) == 0) + result = plugin_set_fibration(&reiser4_inode_data(inode)->pset, + &plugin->fibration); + else + result = RETERR(-ENOTEMPTY); + + } + return result; +} + +static reiser4_plugin_ops fibration_plugin_ops = { + .init = NULL, + .load = NULL, + .save_len = NULL, + .save = NULL, + .change = change_fibration +}; + +/* fibration plugins */ +fibration_plugin fibration_plugins[LAST_FIBRATION_ID] = { + [FIBRATION_LEXICOGRAPHIC] = { + .h = { + .type_id = REISER4_FIBRATION_PLUGIN_TYPE, + .id = FIBRATION_LEXICOGRAPHIC, + .pops = 
&fibration_plugin_ops, + .label = "lexicographic", + .desc = "no fibration", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .fibre = fibre_trivial + }, + [FIBRATION_DOT_O] = { + .h = { + .type_id = REISER4_FIBRATION_PLUGIN_TYPE, + .id = FIBRATION_DOT_O, + .pops = &fibration_plugin_ops, + .label = "dot-o", + .desc = "fibrate .o files separately", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .fibre = fibre_dot_o + }, + [FIBRATION_EXT_1] = { + .h = { + .type_id = REISER4_FIBRATION_PLUGIN_TYPE, + .id = FIBRATION_EXT_1, + .pops = &fibration_plugin_ops, + .label = "ext-1", + .desc = "fibrate file by single character extension", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .fibre = fibre_ext_1 + }, + [FIBRATION_EXT_3] = { + .h = { + .type_id = REISER4_FIBRATION_PLUGIN_TYPE, + .id = FIBRATION_EXT_3, + .pops = &fibration_plugin_ops, + .label = "ext-3", + .desc = "fibrate file by three character extension", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .fibre = fibre_ext_3 + } +}; + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/fibration.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/fibration.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,37 @@ +/* Copyright 2004 by Hans Reiser, licensing governed by reiser4/README */ + +/* Fibration plugin used by hashed directory plugin to segment content + * of directory. See fs/reiser4/plugin/fibration.c for more on this. 
*/ + +#if !defined( __FS_REISER4_PLUGIN_FIBRATION_H__ ) +#define __FS_REISER4_PLUGIN_FIBRATION_H__ + +#include "plugin_header.h" + +typedef struct fibration_plugin { + /* generic fields */ + plugin_header h; + + __u64 (*fibre)(const struct inode *dir, const char *name, int len); +} fibration_plugin; + +typedef enum { + FIBRATION_LEXICOGRAPHIC, + FIBRATION_DOT_O, + FIBRATION_EXT_1, + FIBRATION_EXT_3, + LAST_FIBRATION_ID +} reiser4_fibration_id; + +/* __FS_REISER4_PLUGIN_FIBRATION_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/file/file.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/file/file.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,2740 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +#include "../../inode.h" +#include "../../super.h" +#include "../../tree_walk.h" +#include "../../carry.h" +#include "../../page_cache.h" +#include "../../ioctl.h" +#include "../object.h" +#include "../../safe_link.h" +#include "funcs.h" + +#include +#include +#include + +/* this file contains file plugin methods of reiser4 unix files. 
+ + Those files are either built of tail items only (FORMATTING_ID) or of extent + items only (EXTENT_POINTER_ID) or empty (have no items but stat data) */ + +static int unpack(struct inode *inode, int forever); + +/* get unix file plugin specific portion of inode */ +reiser4_internal unix_file_info_t * +unix_file_inode_data(const struct inode * inode) +{ + return &reiser4_inode_data(inode)->file_plugin_data.unix_file_info; +} + +static int +file_is_built_of_tails(const struct inode *inode) +{ + return unix_file_inode_data(inode)->container == UF_CONTAINER_TAILS; +} + +#if REISER4_DEBUG + +static int +file_is_built_of_extents(const struct inode *inode) +{ + return unix_file_inode_data(inode)->container == UF_CONTAINER_EXTENTS; +} + +static int +file_is_empty(const struct inode *inode) +{ + return unix_file_inode_data(inode)->container == UF_CONTAINER_EMPTY; +} + +#endif + +static int +file_state_is_unknown(const struct inode *inode) +{ + return unix_file_inode_data(inode)->container == UF_CONTAINER_UNKNOWN; +} + +static void +set_file_state_extents(struct inode *inode) +{ + unix_file_inode_data(inode)->container = UF_CONTAINER_EXTENTS; +} + +static void +set_file_state_tails(struct inode *inode) +{ + unix_file_inode_data(inode)->container = UF_CONTAINER_TAILS; +} + +static void +set_file_state_empty(struct inode *inode) +{ + unix_file_inode_data(inode)->container = UF_CONTAINER_EMPTY; +} + +static void +set_file_state_unknown(struct inode *inode) +{ + unix_file_inode_data(inode)->container = UF_CONTAINER_UNKNOWN; +} + +static int +less_than_ldk(znode *node, const reiser4_key *key) +{ + return UNDER_RW(dk, current_tree, read, keylt(key, znode_get_ld_key(node))); +} + +reiser4_internal int +equal_to_rdk(znode *node, const reiser4_key *key) +{ + return UNDER_RW(dk, current_tree, read, keyeq(key, znode_get_rd_key(node))); +} + +#if REISER4_DEBUG + +static int +less_than_rdk(znode *node, const reiser4_key *key) +{ + return UNDER_RW(dk, current_tree, read, keylt(key, 
znode_get_rd_key(node))); +} + +static int +equal_to_ldk(znode *node, const reiser4_key *key) +{ + return UNDER_RW(dk, current_tree, read, keyeq(key, znode_get_ld_key(node))); +} + +/* get key of item next to one @coord is set to */ +static reiser4_key * +get_next_item_key(const coord_t *coord, reiser4_key *next_key) +{ + if (coord->item_pos == node_num_items(coord->node) - 1) { + /* get key of next item if it is in right neighbor */ + UNDER_RW_VOID(dk, znode_get_tree(coord->node), read, + *next_key = *znode_get_rd_key(coord->node)); + } else { + /* get key of next item if it is in the same node */ + coord_t next; + + coord_dup_nocheck(&next, coord); + next.unit_pos = 0; + check_me("vs-730", coord_next_item(&next) == 0); + item_key_by_coord(&next, next_key); + } + return next_key; +} + +static int +item_of_that_file(const coord_t *coord, const reiser4_key *key) +{ + reiser4_key max_possible; + item_plugin *iplug; + + iplug = item_plugin_by_coord(coord); + assert("vs-1011", iplug->b.max_key_inside); + return keylt(key, iplug->b.max_key_inside(coord, &max_possible)); +} + +static int +check_coord(const coord_t *coord, const reiser4_key *key) +{ + coord_t twin; + + if (!REISER4_DEBUG) + return 1; + node_plugin_by_node(coord->node)->lookup(coord->node, key, FIND_MAX_NOT_MORE_THAN, &twin); + return coords_equal(coord, &twin); +} + +#endif /* REISER4_DEBUG */ + +static void +init_uf_coord(uf_coord_t *uf_coord, lock_handle *lh) +{ + coord_init_zero(&uf_coord->coord); + coord_clear_iplug(&uf_coord->coord); + uf_coord->lh = lh; + init_lh(lh); + memset(&uf_coord->extension, 0, sizeof(uf_coord->extension)); + uf_coord->valid = 0; +} + +static inline void +validate_extended_coord(uf_coord_t *uf_coord, loff_t offset) +{ + assert("vs-1333", uf_coord->valid == 0); + assert("vs-1348", item_plugin_by_coord(&uf_coord->coord)->s.file.init_coord_extension); + + item_body_by_coord(&uf_coord->coord); + item_plugin_by_coord(&uf_coord->coord)->s.file.init_coord_extension(uf_coord, 
offset); +} + +reiser4_internal write_mode_t +how_to_write(uf_coord_t *uf_coord, const reiser4_key *key) +{ + write_mode_t result; + coord_t *coord; + ON_DEBUG(reiser4_key check); + + coord = &uf_coord->coord; + + assert("vs-1252", znode_is_wlocked(coord->node)); + assert("vs-1253", znode_is_loaded(coord->node)); + + if (uf_coord->valid == 1) { + assert("vs-1332", check_coord(coord, key)); + return (coord->between == AFTER_UNIT) ? APPEND_ITEM : OVERWRITE_ITEM; + } + + if (less_than_ldk(coord->node, key)) { + assert("vs-1014", get_key_offset(key) == 0); + + coord_init_before_first_item(coord, coord->node); + uf_coord->valid = 1; + result = FIRST_ITEM; + goto ok; + } + + assert("vs-1335", less_than_rdk(coord->node, key)); + + if (node_is_empty(coord->node)) { + assert("vs-879", znode_get_level(coord->node) == LEAF_LEVEL); + assert("vs-880", get_key_offset(key) == 0); + /* + * The situation that the check below tries to handle is as follows: some + * other thread writes to another file and has to insert an empty + * leaf between two adjacent extents. Generally, we are not + * supposed to muck with this node. But it is possible that + * said other thread fails due to some error (out of disk + * space, for example) and leaves the empty leaf + * lingering. Nothing prevents us from reusing it. 
+ */ + assert("vs-1000", UNDER_RW(dk, current_tree, read, + keylt(key, znode_get_rd_key(coord->node)))); + assert("vs-1002", coord->between == EMPTY_NODE); + result = FIRST_ITEM; + uf_coord->valid = 1; + goto ok; + } + + assert("vs-1336", coord->item_pos < node_num_items(coord->node)); + assert("vs-1007", ergo(coord->between == AFTER_UNIT || coord->between == AT_UNIT, keyle(item_key_by_coord(coord, &check), key))); + assert("vs-1008", ergo(coord->between == AFTER_UNIT || coord->between == AT_UNIT, keylt(key, get_next_item_key(coord, &check)))); + + switch(coord->between) { + case AFTER_ITEM: + uf_coord->valid = 1; + result = FIRST_ITEM; + break; + case AFTER_UNIT: + assert("vs-1323", (item_is_tail(coord) || item_is_extent(coord)) && item_of_that_file(coord, key)); + assert("vs-1208", keyeq(item_plugin_by_coord(coord)->s.file.append_key(coord, &check), key)); + result = APPEND_ITEM; + validate_extended_coord(uf_coord, get_key_offset(key)); + break; + case AT_UNIT: + /* FIXME: it would be nice to check that coord matches to key */ + assert("vs-1324", (item_is_tail(coord) || item_is_extent(coord)) && item_of_that_file(coord, key)); + validate_extended_coord(uf_coord, get_key_offset(key)); + result = OVERWRITE_ITEM; + break; + default: + assert("vs-1337", 0); + result = OVERWRITE_ITEM; + break; + } + +ok: + assert("vs-1349", uf_coord->valid == 1); + assert("vs-1332", check_coord(coord, key)); + return result; +} + +/* obtain lock on right neighbor and drop lock on current node */ +reiser4_internal int +goto_right_neighbor(coord_t * coord, lock_handle * lh) +{ + int result; + lock_handle lh_right; + + assert("vs-1100", znode_is_locked(coord->node)); + + init_lh(&lh_right); + result = reiser4_get_right_neighbor( + &lh_right, coord->node, + znode_is_wlocked(coord->node) ? 
ZNODE_WRITE_LOCK : ZNODE_READ_LOCK, + GN_CAN_USE_UPPER_LEVELS); + if (result) { + done_lh(&lh_right); + return result; + } + + done_lh(lh); + + coord_init_first_unit_nocheck(coord, lh_right.node); + move_lh(lh, &lh_right); + + return 0; + +} + +/* this is to be used after find_file_item and in find_file_item_nohint to determine real state of file */ +static void +set_file_state(struct inode *inode, int cbk_result, tree_level level) +{ + assert("vs-1649", inode != NULL); + + if (cbk_errored(cbk_result)) + /* error happened in find_file_item */ + return; + + assert("vs-1164", level == LEAF_LEVEL || level == TWIG_LEVEL); + + if (inode_get_flag(inode, REISER4_PART_CONV)) { + set_file_state_unknown(inode); + return; + } + + if (file_state_is_unknown(inode)) { + if (cbk_result == CBK_COORD_NOTFOUND) + set_file_state_empty(inode); + else if (level == LEAF_LEVEL) + set_file_state_tails(inode); + else + set_file_state_extents(inode); + } else { + /* file state is known, check that it is set correctly */ + assert("vs-1161", ergo(cbk_result == CBK_COORD_NOTFOUND, + file_is_empty(inode))); + assert("vs-1162", ergo(level == LEAF_LEVEL && cbk_result == CBK_COORD_FOUND, + file_is_built_of_tails(inode))); + assert("vs-1165", ergo(level == TWIG_LEVEL && cbk_result == CBK_COORD_FOUND, + file_is_built_of_extents(inode))); + } +} + +static int +find_file_item(hint_t *hint, /* coord, lock handle and seal are here */ + const reiser4_key *key, /* key of position in a file of next read/write */ + znode_lock_mode lock_mode, /* which lock (read/write) to put on returned node */ + ra_info_t *ra_info, + struct inode *inode) +{ + int result; + coord_t *coord; + lock_handle *lh; + __u32 cbk_flags; + + assert("nikita-3030", schedulable()); + assert("vs-1707", hint != NULL); + + coord = &hint->ext_coord.coord; + lh = hint->ext_coord.lh; + init_lh(lh); + + result = hint_validate(hint, key, 1/*check key*/, lock_mode); + if (!result) { + if (coord->between == AFTER_UNIT && equal_to_rdk(coord->node, 
key)) { + result = goto_right_neighbor(coord, lh); + if (result == -E_NO_NEIGHBOR) + return RETERR(-EIO); + if (result) + return result; + assert("vs-1152", equal_to_ldk(coord->node, key)); + /* we moved to a different node. Invalidate the coord extension; zload is necessary to init it + again */ + hint->ext_coord.valid = 0; + } + + set_file_state(inode, CBK_COORD_FOUND, znode_get_level(coord->node)); + return CBK_COORD_FOUND; + } + + coord_init_zero(coord); + cbk_flags = (lock_mode == ZNODE_READ_LOCK) ? CBK_UNIQUE : (CBK_UNIQUE | CBK_FOR_INSERT); + if (inode != NULL) { + result = object_lookup(inode, + key, + coord, + lh, + lock_mode, + FIND_MAX_NOT_MORE_THAN, + TWIG_LEVEL, + LEAF_LEVEL, + cbk_flags, + ra_info); + } else { + result = coord_by_key(current_tree, + key, + coord, + lh, + lock_mode, + FIND_MAX_NOT_MORE_THAN, + TWIG_LEVEL, + LEAF_LEVEL, + cbk_flags, + ra_info); + } + + set_file_state(inode, result, znode_get_level(coord->node)); + + /* FIXME: we might already have coord extension initialized */ + hint->ext_coord.valid = 0; + return result; +} + +reiser4_internal int +find_file_item_nohint(coord_t *coord, lock_handle *lh, const reiser4_key *key, + znode_lock_mode lock_mode, struct inode *inode) +{ + int result; + + result = object_lookup(inode, key, coord, lh, lock_mode, + FIND_MAX_NOT_MORE_THAN, + TWIG_LEVEL, LEAF_LEVEL, + (lock_mode == ZNODE_READ_LOCK) ? CBK_UNIQUE : (CBK_UNIQUE | CBK_FOR_INSERT), + NULL /* ra_info */); + set_file_state(inode, result, znode_get_level(coord->node)); + return result; +} + +/* plugin->u.file.write_flow = NULL + plugin->u.file.read_flow = NULL */ + +reiser4_internal void +hint_init_zero(hint_t *hint) +{ + memset(hint, 0, sizeof (*hint)); +} + +/* find position of last byte of last item of the file plus 1. 
This is used by truncate and mmap to find real file + size */ +static int +find_file_size(struct inode *inode, loff_t *file_size) +{ + int result; + reiser4_key key; + coord_t coord; + lock_handle lh; + item_plugin *iplug; + + assert("vs-1247", inode_file_plugin(inode)->key_by_inode == key_by_inode_unix_file); + key_by_inode_unix_file(inode, get_key_offset(max_key()), &key); + + init_lh(&lh); + result = find_file_item_nohint(&coord, &lh, &key, ZNODE_READ_LOCK, inode); + if (cbk_errored(result)) { + /* error happened */ + done_lh(&lh); + return result; + } + + if (result == CBK_COORD_NOTFOUND) { + /* empty file */ + done_lh(&lh); + *file_size = 0; + return 0; + } + + /* there are items of this file (at least one) */ + /*coord_clear_iplug(&coord);*/ + result = zload(coord.node); + if (unlikely(result)) { + done_lh(&lh); + return result; + } + iplug = item_plugin_by_coord(&coord); + + assert("vs-853", iplug->s.file.append_key); + iplug->s.file.append_key(&coord, &key); + + *file_size = get_key_offset(&key); + + zrelse(coord.node); + done_lh(&lh); + + return 0; +} + +static int +find_file_state(unix_file_info_t *uf_info) +{ + int result; + + assert("vs-1628", ea_obtained(uf_info)); + + result = 0; + if (uf_info->container == UF_CONTAINER_UNKNOWN) { + loff_t file_size; + + result = find_file_size(unix_file_info_to_inode(uf_info), &file_size); + } + assert("vs-1074", ergo(result == 0, uf_info->container != UF_CONTAINER_UNKNOWN)); + return result; +} + +/* estimate and reserve space needed to truncate page which gets partially truncated: one block for page itself, stat + data update (estimate_one_insert_into_item) and one item insertion (estimate_one_insert_into_item) which may happen + if page corresponds to hole extent and unallocated one will have to be created */ +static int reserve_partial_page(reiser4_tree *tree) +{ + grab_space_enable(); + return reiser4_grab_reserved(reiser4_get_current_sb(), + 1 + + 2 * estimate_one_insert_into_item(tree), + BA_CAN_COMMIT); +} + 
+/* estimate and reserve space needed to cut one item and update one stat data */ +static int +reserve_cut_iteration(reiser4_tree *tree) +{ + __u64 estimate = estimate_one_item_removal(tree) + + estimate_one_insert_into_item(tree); + + assert("nikita-3172", lock_stack_isclean(get_current_lock_stack())); + + grab_space_enable(); + /* We need to double our estimate now that we can delete more than one + node. */ + return reiser4_grab_reserved(reiser4_get_current_sb(), estimate*2, + BA_CAN_COMMIT); +} + +reiser4_internal int +update_file_size(struct inode *inode, reiser4_key * key, int update_sd) +{ + int result = 0; + INODE_SET_FIELD(inode, i_size, get_key_offset(key)); + if (update_sd) { + inode->i_ctime = inode->i_mtime = CURRENT_TIME; + result = reiser4_update_sd(inode); + } + return result; +} + +/* cut file items one by one starting from the last one until new file size (inode->i_size) is reached. Reserve space + and update file stat data on every single cut from the tree */ +reiser4_internal int +cut_file_items(struct inode *inode, loff_t new_size, int update_sd, loff_t cur_size, + int (*update_actor)(struct inode *, reiser4_key *, int)) +{ + reiser4_key from_key, to_key; + reiser4_key smallest_removed; + file_plugin * fplug = inode_file_plugin(inode); + int result; + int progress = 0; + + assert("vs-1248", + fplug == file_plugin_by_id(UNIX_FILE_PLUGIN_ID) || + fplug == file_plugin_by_id(CRC_FILE_PLUGIN_ID)); + + fplug->key_by_inode(inode, new_size, &from_key); + to_key = from_key; + set_key_offset(&to_key, cur_size - 1/*get_key_offset(max_key())*/); + /* this loop normally runs just once */ + while (1) { + result = reserve_cut_iteration(tree_by_inode(inode)); + if (result) + break; + + result = cut_tree_object(current_tree, &from_key, &to_key, + &smallest_removed, inode, 1, &progress); + if (result == -E_REPEAT) { + /* -E_REPEAT is a signal to interrupt a long file truncation process */ + if (progress) { + result = update_actor(inode, &smallest_removed, 
update_sd); + if (result) + break; + } + all_grabbed2free(); + reiser4_release_reserved(inode->i_sb); + + /* cut_tree_object() was interrupted probably because + * current atom requires commit, we have to release + * transaction handle to allow atom commit. */ + txn_restart_current(); + continue; + } + if (result && !(result == CBK_COORD_NOTFOUND && new_size == 0 && inode->i_size == 0)) + break; + + set_key_offset(&smallest_removed, new_size); + /* Final sd update after the file gets its correct size */ + result = update_actor(inode, &smallest_removed, update_sd); + break; + } + all_grabbed2free(); + reiser4_release_reserved(inode->i_sb); + + return result; +} + +int find_or_create_extent(struct page *page); + +/* part of unix_file_truncate: it is called when truncate is used to make file shorter */ +static int +shorten_file(struct inode *inode, loff_t new_size) +{ + int result; + struct page *page; + int padd_from; + unsigned long index; + char *kaddr; + + /* all items of ordinary reiser4 file are grouped together. That is why we can use cut_tree. Plan B files (for + instance) can not be truncated that simply */ + result = cut_file_items(inode, new_size, 1/*update_sd*/, get_key_offset(max_key()), update_file_size); + if (result) + return result; + + assert("vs-1105", new_size == inode->i_size); + if (new_size == 0) { + set_file_state_empty(inode); + return 0; + } + + /* FIXME: not sure how crypto files will work here. Probably they will not. 
*/ + result = find_file_state(unix_file_inode_data(inode)); + if (result) + return result; + if (file_is_built_of_tails(inode)) + /* No need to worry about zeroing last page after new file end */ + return 0; + + padd_from = inode->i_size & (PAGE_CACHE_SIZE - 1); + if (!padd_from) + /* file is truncated to page boundary */ + return 0; + + result = reserve_partial_page(tree_by_inode(inode)); + if (result) { + reiser4_release_reserved(inode->i_sb); + return result; + } + + /* last page is partially truncated - zero its content */ + index = (inode->i_size >> PAGE_CACHE_SHIFT); + page = read_cache_page(inode->i_mapping, index, readpage_unix_file/*filler*/, 0); + if (IS_ERR(page)) { + all_grabbed2free(); + reiser4_release_reserved(inode->i_sb); + if (likely(PTR_ERR(page) == -EINVAL)) { + /* looks like file is built of tail items */ + return 0; + } + return PTR_ERR(page); + } + wait_on_page_locked(page); + if (!PageUptodate(page)) { + all_grabbed2free(); + page_cache_release(page); + reiser4_release_reserved(inode->i_sb); + return RETERR(-EIO); + } + + /* if page corresponds to hole extent unit - unallocated one will be created here. This is not necessary */ + result = find_or_create_extent(page); + + /* FIXME: cut_file_items has already updated inode. 
Probably it would be better to update it here when file is + really truncated */ + all_grabbed2free(); + if (result) { + page_cache_release(page); + reiser4_release_reserved(inode->i_sb); + return result; + } + + lock_page(page); + assert("vs-1066", PageLocked(page)); + kaddr = kmap_atomic(page, KM_USER0); + memset(kaddr + padd_from, 0, PAGE_CACHE_SIZE - padd_from); + flush_dcache_page(page); + kunmap_atomic(kaddr, KM_USER0); + unlock_page(page); + page_cache_release(page); + reiser4_release_reserved(inode->i_sb); + return 0; +} + +static loff_t +write_flow(hint_t *, struct file *, struct inode *, const char *buf, loff_t count, loff_t pos, int exclusive); + +/* it is called when truncate is used to make file longer and when write position is set past real end of file. It + appends file which has size @cur_size with hole of certain size (@hole_size). It returns 0 on success, error code + otherwise */ +static int +append_hole(hint_t *hint, struct inode *inode, loff_t new_size, int exclusive) +{ + int result; + loff_t written; + loff_t hole_size; + + assert("vs-1107", inode->i_size < new_size); + + result = 0; + hole_size = new_size - inode->i_size; + written = write_flow(hint, NULL, inode, NULL/*buf*/, hole_size, + inode->i_size, exclusive); + if (written != hole_size) { + /* return error because file is not expanded as required */ + if (written > 0) + result = RETERR(-ENOSPC); + else + result = written; + } else { + assert("vs-1081", inode->i_size == new_size); + } + return result; +} + +/* this either cuts or add items of/to the file so that items match new_size. 
It is used in unix_file_setattr when it is + used to truncate +VS-FIXME-HANS: explain that +and in unix_file_delete */ +static int +truncate_file_body(struct inode *inode, loff_t new_size) +{ + int result; + hint_t hint; + + hint_init_zero(&hint); + if (inode->i_size < new_size) + result = append_hole(&hint, inode, new_size, 1/* exclusive access is obtained */); + else + result = shorten_file(inode, new_size); + + return result; +} + +/* plugin->u.file.truncate + all the work is done on reiser4_setattr->unix_file_setattr->truncate_file_body +*/ +reiser4_internal int +truncate_unix_file(struct inode *inode, loff_t new_size) +{ + return 0; +} + +/* plugin->u.write_sd_by_inode = write_sd_by_inode_common */ + +/* get access hint (seal, coord, key, level) stored in reiser4 private part of + struct file if it was stored in a previous access to the file */ +reiser4_internal int +load_file_hint(struct file *file, hint_t *hint) +{ + reiser4_file_fsdata *fsdata; + + if (file) { + fsdata = reiser4_get_file_fsdata(file); + if (IS_ERR(fsdata)) + return PTR_ERR(fsdata); + + if (seal_is_set(&fsdata->reg.hint.seal)) { + *hint = fsdata->reg.hint; + /* force re-validation of the coord on the first + * iteration of the read/write loop. 
*/ + hint->ext_coord.valid = 0; + assert("nikita-19892", coords_equal(&hint->seal.coord1, + &hint->ext_coord.coord)); + return 0; + } + memset(&fsdata->reg.hint, 0, sizeof(hint_t)); + } + hint_init_zero(hint); + return 0; +} + +/* this copies hint for future tree accesses back to reiser4 private part of + struct file */ +reiser4_internal void +save_file_hint(struct file *file, const hint_t *hint) +{ + reiser4_file_fsdata *fsdata; + + if (!file || !seal_is_set(&hint->seal)) + return; + + fsdata = reiser4_get_file_fsdata(file); + assert("vs-965", !IS_ERR(fsdata)); + assert("nikita-19891", + coords_equal(&hint->seal.coord1, &hint->ext_coord.coord)); + fsdata->reg.hint = *hint; + return; +} + +reiser4_internal void +unset_hint(hint_t *hint) +{ + assert("vs-1315", hint); + hint->ext_coord.valid = 0; + seal_done(&hint->seal); +} + +/* coord must be set properly. So, that set_hint has nothing to do */ +reiser4_internal void +set_hint(hint_t *hint, const reiser4_key *key, znode_lock_mode mode) +{ + ON_DEBUG(coord_t *coord = &hint->ext_coord.coord); + assert("vs-1207", WITH_DATA(coord->node, check_coord(coord, key))); + + seal_init(&hint->seal, &hint->ext_coord.coord, key); + hint->offset = get_key_offset(key); + hint->mode = mode; +} + +reiser4_internal int +hint_is_set(const hint_t *hint) +{ + return seal_is_set(&hint->seal); +} + +#if REISER4_DEBUG +static int all_but_offset_key_eq(const reiser4_key *k1, const reiser4_key *k2) +{ + return (get_key_locality(k1) == get_key_locality(k2) && + get_key_type(k1) == get_key_type(k2) && + get_key_band(k1) == get_key_band(k2) && + get_key_ordering(k1) == get_key_ordering(k2) && + get_key_objectid(k1) == get_key_objectid(k2)); +} +#endif + +reiser4_internal int +hint_validate(hint_t *hint, const reiser4_key *key, int check_key, znode_lock_mode lock_mode) +{ + if (!hint || !hint_is_set(hint) || hint->mode != lock_mode) + /* hint either not set or set by different operation */ + return RETERR(-E_REPEAT); + + assert("vs-1277", 
all_but_offset_key_eq(key, &hint->seal.key)); + + if (check_key && get_key_offset(key) != hint->offset) + /* hint is set for different key */ + return RETERR(-E_REPEAT); + + return seal_validate(&hint->seal, &hint->ext_coord.coord, key, + hint->ext_coord.lh, + lock_mode, ZNODE_LOCK_LOPRI); +} + +/* look for place at twig level for extent corresponding to page, call extent's writepage method to create + unallocated extent if it does not exist yet, initialize jnode, capture page */ +reiser4_internal int +find_or_create_extent(struct page *page) +{ + int result; + uf_coord_t uf_coord; + coord_t *coord; + lock_handle lh; + reiser4_key key; + item_plugin *iplug; + znode *loaded; + struct inode *inode; + + assert("vs-1065", page->mapping && page->mapping->host); + inode = page->mapping->host; + + /* get key of first byte of the page */ + key_by_inode_unix_file(inode, (loff_t) page->index << PAGE_CACHE_SHIFT, &key); + + init_uf_coord(&uf_coord, &lh); + coord = &uf_coord.coord; + + result = find_file_item_nohint(coord, &lh, &key, ZNODE_WRITE_LOCK, inode); + if (IS_CBKERR(result)) { + done_lh(&lh); + return result; + } + + /*coord_clear_iplug(coord);*/ + result = zload(coord->node); + if (result) { + done_lh(&lh); + return result; + } + loaded = coord->node; + + /* get plugin of extent item */ + iplug = item_plugin_by_id(EXTENT_POINTER_ID); + result = iplug->s.file.capture(&key, &uf_coord, page, how_to_write(&uf_coord, &key)); + assert("vs-429378", result != -E_REPEAT); + zrelse(loaded); + done_lh(&lh); + return result; +} + +#if REISER4_USE_EFLUSH +static int inode_has_eflushed_jnodes(struct inode * inode) +{ + reiser4_tree * tree = &get_super_private(inode->i_sb)->tree; + int ret; + + RLOCK_TREE(tree); + ret = radix_tree_tagged(jnode_tree_by_inode(inode), EFLUSH_TAG_ANONYMOUS); + RUNLOCK_TREE(tree); + return ret; +} +# else +#define inode_has_eflushed_jnodes(inode) (0) +#endif + +/* Check mapping for existence of not captured dirty pages. 
This returns !0 if either page tree contains pages tagged + PAGECACHE_TAG_REISER4_MOVED or if eflushed jnode tree is not empty */ +static int +inode_has_anonymous_pages(struct inode *inode) +{ + return (mapping_tagged(inode->i_mapping, PAGECACHE_TAG_REISER4_MOVED) || + inode_has_eflushed_jnodes(inode)); +} + +static int +capture_page_and_create_extent(struct page *page) +{ + int result; + struct inode *inode; + + assert("vs-1084", page->mapping && page->mapping->host); + inode = page->mapping->host; + assert("vs-1139", file_is_built_of_extents(inode)); + /* page belongs to file */ + assert("vs-1393", inode->i_size > ((loff_t) page->index << PAGE_CACHE_SHIFT)); + + /* page capture may require extent creation (if it does not exist yet) and stat data's update (number of blocks + changes on extent creation) */ + grab_space_enable (); + result = reiser4_grab_space(2 * estimate_one_insert_into_item(tree_by_inode(inode)), BA_CAN_COMMIT); + if (likely(!result)) + result = find_or_create_extent(page); + + all_grabbed2free(); + if (result != 0) + SetPageError(page); + return result; +} + +/* plugin->u.file.capturepage handler */ +reiser4_internal int +capturepage_unix_file(struct page * page) { + int result; + + page_cache_get(page); + unlock_page(page); + result = capture_page_and_create_extent(page); + lock_page(page); + page_cache_release(page); + return result; +} + +static void +redirty_inode(struct inode *inode) +{ + spin_lock(&inode_lock); + inode->i_state |= I_DIRTY; + spin_unlock(&inode_lock); +} + +/* + * Support for "anonymous" pages and jnodes. + * + * When file is write-accessed through mmap pages can be dirtied from the user + * level. In this case kernel is not notified until one of following happens: + * + * (1) msync() + * + * (2) truncate() (either explicit or through unlink) + * + * (3) VM scanner starts reclaiming mapped pages, dirtying them before + * starting write-back. 
+ * + * As a result of (3) ->writepage may be called on a dirty page without + * jnode. Such a page is called "anonymous" in reiser4. Certain work-loads + * (iozone) generate a huge number of anonymous pages. Emergency flush handles + * this situation by creating jnode for anonymous page, starting IO on the + * page, and marking jnode with JNODE_KEEPME bit so that it's not thrown out of + * memory. Such a jnode is also called anonymous. + * + * reiser4_sync_sb() method tries to insert anonymous pages and jnodes into + * tree. This is done by capture_anonymous_*() functions below. + * + */ + +/* this returns 1 if it captured page */ +static int +capture_anonymous_page(struct page *pg, int keepme) +{ + struct address_space *mapping; + jnode *node; + int result; + + if (PageWriteback(pg)) + /* FIXME: do nothing? */ + return 0; + + mapping = pg->mapping; + + lock_page(pg); + /* page is guaranteed to be in the mapping, because we are operating under rw-semaphore. */ + assert("nikita-3336", pg->mapping == mapping); + node = jnode_of_page(pg); + unlock_page(pg); + if (!IS_ERR(node)) { + result = jload(node); + assert("nikita-3334", result == 0); + assert("nikita-3335", jnode_page(node) == pg); + result = capture_page_and_create_extent(pg); + if (result == 0) { + /* + * node will be captured into atom by + * capture_page_and_create_extent(). Atom + * cannot commit (because we have open + * transaction handle), and node cannot be + * truncated, because we have non-exclusive + * access to the file. 
+ */ + assert("nikita-3327", node->atom != NULL); + JF_CLR(node, JNODE_KEEPME); + result = 1; + } else + warning("nikita-3329", + "Cannot capture anon page: %i", result); + jrelse(node); + jput(node); + } else + result = PTR_ERR(node); + + return result; +} + + +#define CAPTURE_APAGE_BURST (1024l) + +/* look for pages tagged REISER4_MOVED starting from the index-th page, return + number of captured pages, update index to next page after the last found + one */ +static int +capture_anonymous_pages(struct address_space *mapping, pgoff_t *index, + int to_capture) +{ + int result; + struct pagevec pvec; + unsigned found_pages; + int count; + int i; + int nr; + + pagevec_init(&pvec, 0); + count = min(pagevec_space(&pvec), (unsigned)to_capture); + nr = 0; + + found_pages = pagevec_lookup_tag(&pvec, mapping, index, PAGECACHE_TAG_REISER4_MOVED, count); + if (found_pages != 0) { + for (i = 0; i < pagevec_count(&pvec); i ++) { + /* tag PAGECACHE_TAG_REISER4_MOVED will be cleared by + set_page_dirty_internal which is called when jnode + is captured */ + result = capture_anonymous_page(pvec.pages[i], 0); + if (result == 1) + nr ++; + else if (result < 0) { + warning("vs-1454", "failed for moved page: result=%d, captured=%d)\n", + result, i); + pagevec_release(&pvec); + return result; + } else { + /* result == 0. 
capture_anonymous_page returns 0 for Writeback-ed page */ + ; + } + } + pagevec_release(&pvec); + } else + /* there are no tagged pages starting from *index */ + *index = (pgoff_t)-1; + + return nr; +} + +static int +capture_anonymous_jnodes(struct address_space *mapping, + pgoff_t *from, pgoff_t to, + int to_capture) +{ +#if REISER4_USE_EFLUSH + int found_jnodes; + int count; + int nr; + int i; + int result; + jnode *jvec[PAGEVEC_SIZE]; + reiser4_tree *tree; + + count = min(PAGEVEC_SIZE, to_capture); + nr = 0; + result = 0; + + tree = &get_super_private(mapping->host->i_sb)->tree; + RLOCK_TREE(tree); + found_jnodes = radix_tree_gang_lookup_tag(jnode_tree_by_inode(mapping->host), + (void **)&jvec, *from, count, + EFLUSH_TAG_ANONYMOUS); + if (found_jnodes == 0) { + /* there are no anonymous jnodes starting at index *from + through the end of file */ + RUNLOCK_TREE(tree); + *from = to; + return 0; + } + + for (i = 0; i < found_jnodes; i ++) { + if (index_jnode(jvec[i]) < to) + jref(jvec[i]); + else { + found_jnodes = i; + break; + } + } + + RUNLOCK_TREE(tree); + if (found_jnodes == 0) { + /* there are no anonymous jnodes in the given range of + indexes */ + *from = to; + return 0; + } + + /* there are anonymous jnodes from given range */ + + /* start i/o for eflushed nodes */ + for (i = 0; i < found_jnodes; i ++) + jstartio(jvec[i]); + + for (i = 0; i < found_jnodes; i ++) { + result = jload(jvec[i]); + if (result == 0) { + result = capture_anonymous_page(jnode_page(jvec[i]), 0); + if (result == 1) + nr ++; + else if (result < 0) { + jrelse(jvec[i]); + warning("nikita-3328", + "failed for anonymous jnode: result=%i, captured %d\n", + result, i); + break; + } else { + /* result == 0. 
capture_anonymous_page returns 0 for Writeback-ed page */ + ; + } + jrelse(jvec[i]); + } else { + warning("vs-1454", "jload for anonymous jnode failed: result=%i, captured %d\n", + result, i); + break; + } + } + *from = index_jnode(jvec[found_jnodes - 1]) + 1; + + for (i = 0; i < found_jnodes; i ++) + jput(jvec[i]); + if (result) + return result; + return nr; +#else /* REISER4_USE_EFLUSH */ + return 0; +#endif +} + +/* + * Commit atom of the jnode of a page. + */ +static int +sync_page(struct page *page) +{ + int result; + do { + jnode *node; + txn_atom *atom; + + lock_page(page); + node = jprivate(page); + if (node != NULL) + atom = UNDER_SPIN(jnode, node, jnode_get_atom(node)); + else + atom = NULL; + unlock_page(page); + result = sync_atom(atom); + } while (result == -E_REPEAT); +/* ZAM-FIXME-HANS: document the logic of this loop, is it just to handle the case where more pages get added to the atom while we are syncing it? */ + assert("nikita-3485", ergo(result == 0, + get_current_context()->trans->atom == NULL)); + return result; +} + +/* + * Commit atoms of pages on @pages list. 
+ * call sync_page for each page from mapping's page tree + */ +static int +sync_page_list(struct inode *inode) +{ + int result; + struct address_space *mapping; + unsigned long from; /* start index for radix_tree_gang_lookup */ + unsigned int found; /* return value for radix_tree_gang_lookup */ + + mapping = inode->i_mapping; + from = 0; + result = 0; + read_lock_irq(&mapping->tree_lock); + while (result == 0) { + struct page *page; + + found = radix_tree_gang_lookup(&mapping->page_tree, (void **)&page, from, 1); + assert("", found < 2); + if (found == 0) + break; + + /* page may not leave radix tree because it is protected from truncating by inode->i_sem downed by + sys_fsync */ + page_cache_get(page); + read_unlock_irq(&mapping->tree_lock); + + from = page->index + 1; + + result = sync_page(page); + + page_cache_release(page); + read_lock_irq(&mapping->tree_lock); + } + + read_unlock_irq(&mapping->tree_lock); + return result; +} + +static int +commit_file_atoms(struct inode *inode) +{ + int result; + unix_file_info_t *uf_info; + reiser4_context *ctx; + + /* + * close current transaction + */ + + ctx = get_current_context(); + txn_restart(ctx); + + uf_info = unix_file_inode_data(inode); + + /* + * finish extent<->tail conversion if necessary + */ + + get_exclusive_access(uf_info); + if (inode_get_flag(inode, REISER4_PART_CONV)) { + result = finish_conversion(inode); + if (result != 0) { + drop_exclusive_access(uf_info); + return result; + } + } + + /* + * find what items file is made from + */ + + result = find_file_state(uf_info); + drop_exclusive_access(uf_info); + if (result != 0) + return result; + + /* + * file state cannot change because we are under ->i_sem + */ + + switch(uf_info->container) { + case UF_CONTAINER_EXTENTS: + result = + /* + * when we are called by + * filemap_fdatawrite-> + * do_writepages()-> + * reiser4_writepages() + * + * inode->i_mapping->dirty_pages are spliced into + * ->io_pages, leaving ->dirty_pages dirty. 
+ * + * When we are called from + * reiser4_fsync()->sync_unix_file(), we have to + * commit atoms of all pages on the ->dirty_list. + * + * So for simplicity we just commit ->io_pages and + * ->dirty_pages. + */ + sync_page_list(inode); + break; + case UF_CONTAINER_TAILS: + /* + * NOTE-NIKITA probably we can be smarter for tails. For now + * just commit all existing atoms. + */ + result = txnmgr_force_commit_all(inode->i_sb, 0); + break; + case UF_CONTAINER_EMPTY: + result = 0; + break; + case UF_CONTAINER_UNKNOWN: + default: + result = -EIO; + break; + } + + /* + * commit current transaction: there can be captured nodes from + * find_file_state() and finish_conversion(). + */ + txn_restart(ctx); + return result; +} + +reiser4_internal int +capture_unix_file(struct inode *inode, struct writeback_control *wbc) +{ + int result; + unix_file_info_t *uf_info; + pgoff_t pindex, jindex, nr_pages; + long to_capture; + + if (!inode_has_anonymous_pages(inode)) + return 0; + + uf_info = unix_file_inode_data(inode); + + result = 0; + pindex = 0; + jindex = 0; + nr_pages = (i_size_read(inode) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + do { + reiser4_context ctx; + + if (wbc->sync_mode != WB_SYNC_ALL) + to_capture = min(wbc->nr_to_write, CAPTURE_APAGE_BURST); + else + to_capture = CAPTURE_APAGE_BURST; + + init_context(&ctx, inode->i_sb); + /* avoid recursive calls to ->sync_inodes */ + ctx.nobalance = 1; + assert("zam-760", lock_stack_isclean(get_current_lock_stack())); + /* + * locking: creation of extent requires read-semaphore on + * file. _But_, this function can also be called in the + * context of write system call from + * balance_dirty_pages(). So, write keeps semaphore (possible + * in write mode) on file A, and this function tries to + * acquire semaphore on (possibly) different file B. A/B + * deadlock is on a way. To avoid this try-lock is used + * here. 
When invoked from sys_fsync() and sys_fdatasync(), + * this function is out of reiser4 context and may safely + * sleep on semaphore. + */ + if (is_in_reiser4_context()) { + if (down_read_trylock(&uf_info->latch) == 0) { +/* ZAM-FIXME-HANS: please explain this error handling here, grep for + * all instances of returning EBUSY, and tell me whether any of them + * represent busy loops that we should recode. Also tell me whether + * any of them fail to return EBUSY to user space, and if yes, then + * recode them to not use the EBUSY macro.*/ + result = RETERR(-EBUSY); + reiser4_exit_context(&ctx); + break; + } + } else + down_read(&uf_info->latch); + LOCK_CNT_INC(inode_sem_r); + + while (to_capture > 0) { + pgoff_t start; + + assert("vs-1727", jindex <= pindex); + if (pindex == jindex) { + start = pindex; + result = capture_anonymous_pages(inode->i_mapping, &pindex, to_capture); + if (result < 0) + break; + to_capture -= result; + wbc->nr_to_write -= result; + if (start + result == pindex) { + jindex = pindex; + continue; + } + if (to_capture <= 0) + break; + } + /* deal with anonymous jnodes between jindex and pindex */ + result = capture_anonymous_jnodes(inode->i_mapping, &jindex, pindex, to_capture); + if (result < 0) + break; + to_capture -= result; + wbc->nr_to_write -= result; + + if (jindex == (pgoff_t)-1) { + assert("vs-1728", pindex == (pgoff_t)-1); + break; + } + } + if (to_capture <= 0) + /* there may be left more pages */ + redirty_inode(inode); + + up_read(&uf_info->latch); + LOCK_CNT_DEC(inode_sem_r); + if (result < 0) { + /* error happened */ + reiser4_exit_context(&ctx); + return result; + } + if (wbc->sync_mode != WB_SYNC_ALL) { + reiser4_exit_context(&ctx); + return 0; + } + result = commit_file_atoms(inode); + reiser4_exit_context(&ctx); + if (pindex >= nr_pages && jindex == pindex) + break; + } while (1); + + return result; +} + +/* + * ->sync() method for unix file. + * + * We are trying to be smart here. 
Instead of committing all atoms (original + * solution), we scan dirty pages of this file and commit all atoms they are + * part of. + * + * Situation is complicated by anonymous pages: i.e., extent-less pages + * dirtied through mmap. Fortunately sys_fsync() first calls + * filemap_fdatawrite() that will ultimately call reiser4_writepages(), insert + * all missing extents and capture anonymous pages. + */ +reiser4_internal int +sync_unix_file(struct inode *inode, int datasync) +{ + int result; + reiser4_context *ctx; + + ctx = get_current_context(); + assert("nikita-3486", ctx->trans->atom == NULL); + result = commit_file_atoms(inode); + assert("nikita-3484", ergo(result == 0, ctx->trans->atom == NULL)); + if (result == 0 && !datasync) { + do { + /* commit "meta-data"---stat data in our case */ + lock_handle lh; + coord_t coord; + reiser4_key key; + + coord_init_zero(&coord); + init_lh(&lh); + /* locate stat-data in a tree and return with znode + * locked */ + result = locate_inode_sd(inode, &key, &coord, &lh); + if (result == 0) { + jnode *node; + txn_atom *atom; + + node = jref(ZJNODE(coord.node)); + done_lh(&lh); + txn_restart(ctx); + LOCK_JNODE(node); + atom = jnode_get_atom(node); + UNLOCK_JNODE(node); + result = sync_atom(atom); + jput(node); + } else + done_lh(&lh); + } while (result == -E_REPEAT); + } + return result; +} + +/* plugin->u.file.readpage + page must be not out of file. 
This is called either via page fault and in + that case vp is struct file *file, or on truncate when last page of a file + is to be read to perform its partial truncate and in that case vp is 0 +*/ +reiser4_internal int +readpage_unix_file(void *vp, struct page *page) +{ + int result; + struct inode *inode; + lock_handle lh; + reiser4_key key; + item_plugin *iplug; + hint_t hint; + coord_t *coord; + struct file *file; + + assert("vs-1062", PageLocked(page)); + assert("vs-1061", page->mapping && page->mapping->host); + assert("vs-1078", (page->mapping->host->i_size > ((loff_t) page->index << PAGE_CACHE_SHIFT))); + + inode = page->mapping->host; + file = vp; + result = load_file_hint(file, &hint); + if (result) + return result; + init_lh(&lh); + hint.ext_coord.lh = &lh; + + /* get key of first byte of the page */ + key_by_inode_unix_file(inode, (loff_t) page->index << PAGE_CACHE_SHIFT, &key); + + /* look for file metadata corresponding to first byte of page */ + unlock_page(page); + result = find_file_item(&hint, &key, ZNODE_READ_LOCK, 0/* ra_info */, inode); + lock_page(page); + if (result != CBK_COORD_FOUND) { + /* this indicates file corruption */ + done_lh(&lh); + return result; + } + + if (PageUptodate(page)) { + done_lh(&lh); + unlock_page(page); + return 0; + } + + coord = &hint.ext_coord.coord; + result = zload(coord->node); + if (result) { + done_lh(&lh); + return result; + } + + if (hint.ext_coord.valid == 0) + validate_extended_coord(&hint.ext_coord, (loff_t) page->index << PAGE_CACHE_SHIFT); + + if (!coord_is_existing_unit(coord)) { + /* this indicates corruption */ + warning("vs-280", + "Looking for page %lu of file %llu (size %lli). " + "No file items found (%d). 
File is corrupted?\n", + page->index, (unsigned long long)get_inode_oid(inode), + inode->i_size, result); + + zrelse(coord->node); + done_lh(&lh); + return RETERR(-EIO); + } + + /* get plugin of found item or use plugin if extent if there are no + one */ + iplug = item_plugin_by_coord(coord); + if (iplug->s.file.readpage) + result = iplug->s.file.readpage(coord, page); + else + result = RETERR(-EINVAL); + + if (!result) { + set_key_offset(&key, (loff_t) (page->index + 1) << PAGE_CACHE_SHIFT); + /* FIXME should call set_hint() */ + unset_hint(&hint); + } else + unset_hint(&hint); + zrelse(coord->node); + done_lh(&lh); + + save_file_hint(file, &hint); + + assert("vs-979", ergo(result == 0, (PageLocked(page) || PageUptodate(page)))); + + return result; +} + +/* returns 1 if file of that size (@new_size) has to be stored in unformatted + nodes */ +/* Audited by: green(2002.06.15) */ +static int +should_have_notail(const unix_file_info_t *uf_info, loff_t new_size) +{ + if (!uf_info->tplug) + return 1; + return !uf_info->tplug->have_tail(unix_file_info_to_inode(uf_info), + new_size); + +} + +static reiser4_block_nr unix_file_estimate_read(struct inode *inode, + loff_t count UNUSED_ARG) +{ + /* We should reserve one block, because of updating of the stat data + item */ + assert("vs-1249", inode_file_plugin(inode)->estimate.update == estimate_update_common); + return estimate_update_common(inode); +} + +#define NR_PAGES_TO_PIN 8 + +static int +get_nr_pages_nr_bytes(unsigned long addr, size_t count, int *nr_pages) +{ + int nr_bytes; + + /* number of pages through which count bytes starting of address addr + are spread */ + *nr_pages = ((addr + count + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT) - + (addr >> PAGE_CACHE_SHIFT); + if (*nr_pages > NR_PAGES_TO_PIN) { + *nr_pages = NR_PAGES_TO_PIN; + nr_bytes = (*nr_pages * PAGE_CACHE_SIZE) - (addr & (PAGE_CACHE_SIZE - 1)); + } else + nr_bytes = count; + + return nr_bytes; +} + +static size_t +adjust_nr_bytes(unsigned long addr, 
size_t count, int nr_pages) +{ + if (count > nr_pages * PAGE_CACHE_SIZE) + return (nr_pages * PAGE_CACHE_SIZE) - (addr & (PAGE_CACHE_SIZE - 1)); + return count; +} + +static int +reiser4_get_user_pages(struct page **pages, unsigned long addr, int nr_pages, + int rw) +{ + down_read(&current->mm->mmap_sem); + nr_pages = get_user_pages(current, current->mm, addr, + nr_pages, (rw == READ), 0, + pages, NULL); + up_read(&current->mm->mmap_sem); + return nr_pages; +} + +static void +reiser4_put_user_pages(struct page **pages, int nr_pages) +{ + int i; + + for (i = 0; i < nr_pages; i ++) + page_cache_release(pages[i]); +} + +/* this is called with nonexclusive access obtained, file's container can not change */ +static size_t +read_file(hint_t *hint, file_container_t container, + struct file *file, /* file to read from */ + char *buf, /* address of user-space buffer */ + size_t count, /* number of bytes to read */ + loff_t *off) +{ + int result; + struct inode *inode; + flow_t flow; + int (*read_f) (struct file *, flow_t *, hint_t *); + coord_t *coord; + znode *loaded; + + inode = file->f_dentry->d_inode; + + /* we have nonexclusive access (NA) obtained. File's container may not + change until we drop NA. 
If possible - calculate read function + beforehand */ + switch(container) { + case UF_CONTAINER_EXTENTS: + read_f = item_plugin_by_id(EXTENT_POINTER_ID)->s.file.read; + break; + + case UF_CONTAINER_TAILS: + /* this is read-ahead for tails-only files */ + result = reiser4_file_readahead(file, *off, count); + if (result) + return result; + + read_f = item_plugin_by_id(FORMATTING_ID)->s.file.read; + break; + + case UF_CONTAINER_UNKNOWN: + read_f = 0; + break; + + case UF_CONTAINER_EMPTY: + default: + warning("vs-1297", "File (ino %llu) has unexpected state: %d\n", + (unsigned long long)get_inode_oid(inode), container); + return RETERR(-EIO); + } + + /* build flow */ + assert("vs-1250", inode_file_plugin(inode)->flow_by_inode == flow_by_inode_unix_file); + result = flow_by_inode_unix_file(inode, buf, 1 /* user space */ , count, *off, READ_OP, &flow); + if (unlikely(result)) + return result; + + /* get seal and coord sealed with it from reiser4 private data + of struct file. The coord will tell us where our last read + of this file finished, and the seal will help to determine + if that location is still valid. + */ + coord = &hint->ext_coord.coord; + while (flow.length && result == 0) { + result = find_file_item(hint, &flow.key, ZNODE_READ_LOCK, NULL, inode); + if (cbk_errored(result)) + /* error happened */ + break; + + if (coord->between != AT_UNIT) + /* there were no items corresponding to given offset */ + break; + + loaded = coord->node; + result = zload(loaded); + if (unlikely(result)) + break; + + if (hint->ext_coord.valid == 0) + validate_extended_coord(&hint->ext_coord, get_key_offset(&flow.key)); + + /* call item's read method */ + if (!read_f) + read_f = item_plugin_by_coord(coord)->s.file.read; + result = read_f(file, &flow, hint); + zrelse(loaded); + done_lh(hint->ext_coord.lh); + } + + return (count - flow.length) ? 
(count - flow.length) : result; +} + +static int is_user_space(const char *buf) +{ + return (unsigned long)buf < PAGE_OFFSET; +} + +/* plugin->u.file.read + + the read method for the unix_file plugin + +*/ +reiser4_internal ssize_t +read_unix_file(struct file *file, char *buf, size_t read_amount, loff_t *off) +{ + int result; + struct inode *inode; + lock_handle lh; + hint_t hint; + unix_file_info_t *uf_info; + struct page *pages[NR_PAGES_TO_PIN]; + int nr_pages; + size_t count, read, left; + reiser4_block_nr needed; + loff_t size; + int user_space; + + if (unlikely(read_amount == 0)) + return 0; + + inode = file->f_dentry->d_inode; + assert("vs-972", !inode_get_flag(inode, REISER4_NO_SD)); + + uf_info = unix_file_inode_data(inode); + + needed = unix_file_estimate_read(inode, read_amount); + result = reiser4_grab_space(needed, BA_CAN_COMMIT); + if (result != 0) + return result; + + result = load_file_hint(file, &hint); + if (result) + return result; + init_lh(&lh); + hint.ext_coord.lh = &lh; + + left = read_amount; + count = 0; + user_space = is_user_space(buf); + nr_pages = 0; + while (left > 0) { + unsigned long addr; + size_t to_read; + + addr = (unsigned long)buf; + txn_restart_current(); + + size = i_size_read(inode); + if (*off >= size) + /* position to read from is past the end of file */ + break; + if (*off + left > size) + left = size - *off; + + if (user_space) { + to_read = get_nr_pages_nr_bytes(addr, left, &nr_pages); + nr_pages = reiser4_get_user_pages(pages, addr, nr_pages, READ); + if (nr_pages < 0) { + result = nr_pages; + break; + } + to_read = adjust_nr_bytes(addr, to_read, nr_pages); + /* get_user_pages might create a transaction */ + txn_restart_current(); + } else + to_read = left; + + get_nonexclusive_access(uf_info, 0); + + /* define more precisely read size now when filesize can not change */ + if (*off >= inode->i_size) { + if (user_space) + reiser4_put_user_pages(pages, nr_pages); + + /* position to read from is past the end of file */ + 
drop_nonexclusive_access(uf_info); + break; + } + if (*off + left > inode->i_size) + left = inode->i_size - *off; + if (*off + to_read > inode->i_size) + to_read = inode->i_size - *off; + + assert("vs-1706", to_read <= left); + read = read_file(&hint, uf_info->container, file, buf, to_read, off); + + if (user_space) + reiser4_put_user_pages(pages, nr_pages); + + drop_nonexclusive_access(uf_info); + + if ((ssize_t)read < 0) { + result = read; + break; + } + left -= read; + buf += read; + + /* update position in a file */ + *off += read; + /* total number of read bytes */ + count += read; + } + save_file_hint(file, &hint); + + if (count) { + /* something was read. Update inode's atime and stat data */ + update_atime(inode); + } + + /* return number of read bytes or error code if nothing is read */ + return count ? count : result; +} + +typedef int (*write_f_t)(struct inode *, flow_t *, hint_t *, int grabbed, write_mode_t); + +/* This searches for write position in the tree and calls write method of + appropriate item to actually copy user data into filesystem. This loops + until all the data from flow @flow are written to a file. 
*/ +static loff_t +append_and_or_overwrite(hint_t *hint, struct file *file, struct inode *inode, flow_t *flow, + int exclusive /* if 1 - exclusive access on a file is obtained */) +{ + int result; + lock_handle lh; + loff_t to_write; + write_f_t write_f; + file_container_t cur_container, new_container; + znode *loaded; + unix_file_info_t *uf_info; + + assert("nikita-3031", schedulable()); + assert("vs-1109", get_current_context()->grabbed_blocks == 0); + assert("vs-1708", hint != NULL); + + init_lh(&lh); + hint->ext_coord.lh = &lh; + + result = 0; + uf_info = unix_file_inode_data(inode); + + to_write = flow->length; + while (flow->length) { + + assert("vs-1123", get_current_context()->grabbed_blocks == 0); + + if (to_write == flow->length) { + /* it may happen that find_file_item will have to insert an empty node into the tree (empty leaf + node between two extent items) */ + result = reiser4_grab_space_force(1 + estimate_one_insert_item(tree_by_inode(inode)), 0); + if (result) + return result; + } + /* when hint is set - hint's coord matches seal's coord */ + assert("nikita-19894", + !hint_is_set(hint) || + coords_equal(&hint->seal.coord1, &hint->ext_coord.coord)); + + /* look for file's metadata (extent or tail item) corresponding to position we write to */ + result = find_file_item(hint, &flow->key, ZNODE_WRITE_LOCK, NULL/* ra_info */, inode); + all_grabbed2free(); + if (IS_CBKERR(result)) { + /* error occurred */ + done_lh(&lh); + return result; + } + + cur_container = uf_info->container; + switch (cur_container) { + case UF_CONTAINER_EMPTY: + assert("vs-1196", get_key_offset(&flow->key) == 0); + if (should_have_notail(uf_info, get_key_offset(&flow->key) + flow->length)) { + new_container = UF_CONTAINER_EXTENTS; + write_f = item_plugin_by_id(EXTENT_POINTER_ID)->s.file.write; + } else { + new_container = UF_CONTAINER_TAILS; + write_f = item_plugin_by_id(FORMATTING_ID)->s.file.write; + } + break; + + case UF_CONTAINER_EXTENTS: + write_f = 
item_plugin_by_id(EXTENT_POINTER_ID)->s.file.write; + new_container = cur_container; + break; + + case UF_CONTAINER_TAILS: + if (should_have_notail(uf_info, get_key_offset(&flow->key) + flow->length)) { + done_lh(&lh); + if (!exclusive) { + drop_nonexclusive_access(uf_info); + txn_restart_current(); + get_exclusive_access(uf_info); + } + result = tail2extent(uf_info); + if (!exclusive) { + drop_exclusive_access(uf_info); + txn_restart_current(); + get_nonexclusive_access(uf_info, 0); + } + if (result) + return result; + all_grabbed2free(); + unset_hint(hint); + continue; + } + write_f = item_plugin_by_id(FORMATTING_ID)->s.file.write; + new_container = cur_container; + break; + + default: + done_lh(&lh); + return RETERR(-EIO); + } + + result = zload(lh.node); + if (result) { + done_lh(&lh); + return result; + } + loaded = lh.node; + + result = write_f(inode, + flow, + hint, + 0/* not grabbed */, + how_to_write(&hint->ext_coord, &flow->key)); + + assert("nikita-3142", get_current_context()->grabbed_blocks == 0); + /* seal has to be either not set or set properly */ + assert("nikita-19893", + ((!hint_is_set(hint) && hint->ext_coord.valid == 0) || + (coords_equal(&hint->seal.coord1, &hint->ext_coord.coord) && + keyeq(&flow->key, &hint->seal.key)))); + + if (cur_container == UF_CONTAINER_EMPTY && to_write != flow->length) { + /* file was empty, we have written something, and we have exclusive access to the file - + change file state */ + assert("vs-1195", (new_container == UF_CONTAINER_TAILS || + new_container == UF_CONTAINER_EXTENTS)); + uf_info->container = new_container; + } + zrelse(loaded); + done_lh(&lh); + if (result && result != -E_REPEAT && result != -E_DEADLOCK) + break; + preempt_point(); + } + + /* if nothing was written - there must be an error */ + assert("vs-951", ergo((to_write == flow->length), result < 0)); + assert("vs-1110", get_current_context()->grabbed_blocks == 0); + + return (to_write - flow->length) ? 
(to_write - flow->length) : result; +} + +/* make flow and write data (@buf) to the file. If @buf == 0 - hole of size @count will be created. This is called with + uf_info->latch either read- or write-locked */ +static loff_t +write_flow(hint_t *hint, struct file *file, struct inode *inode, const char *buf, loff_t count, loff_t pos, + int exclusive) +{ + int result; + flow_t flow; + + assert("vs-1251", inode_file_plugin(inode)->flow_by_inode == flow_by_inode_unix_file); + + result = flow_by_inode_unix_file(inode, + (char *)buf, 1 /* user space */, count, pos, WRITE_OP, &flow); + if (result) + return result; + + return append_and_or_overwrite(hint, file, inode, &flow, exclusive); +} + +static struct page * +unix_file_filemap_nopage(struct vm_area_struct *area, unsigned long address, int * unused) +{ + struct page *page; + struct inode *inode; + reiser4_context ctx; + + inode = area->vm_file->f_dentry->d_inode; + init_context(&ctx, inode->i_sb); + + /* block filemap_nopage if copy on capture is processing with a node of this file */ + down_read(&reiser4_inode_data(inode)->coc_sem); + /* second argument is to note that current atom may exist */ + get_nonexclusive_access(unix_file_inode_data(inode), 1); + + page = filemap_nopage(area, address, 0); + + drop_nonexclusive_access(unix_file_inode_data(inode)); + up_read(&reiser4_inode_data(inode)->coc_sem); + + /*txn_restart_current();*/ + + reiser4_exit_context(&ctx); + return page; +} + +static struct vm_operations_struct unix_file_vm_ops = { + .nopage = unix_file_filemap_nopage, +}; + +/* This function takes care of @file's pages. First of all it checks if + the filesystem is read-only and if so gets out. Otherwise, it throws out all + pages of the file if it was mapped for read, is going to be mapped for write, + and consists of tails. This is done in order to avoid managing several copies + of the data (one in the page cache and a second one in the tails themselves) + for the case of mapping files consisting of tails. 
+ + Here also tail2extent conversion is performed if it is allowed and file + is going to be written or mapped for write. This functions may be called + from write_unix_file() or mmap_unix_file(). */ +static int +check_pages_unix_file(struct inode *inode) +{ + reiser4_invalidate_pages(inode->i_mapping, 0, + (inode->i_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT, 0); + return unpack(inode, 0 /* not forever */); +} + +/* plugin->u.file.mmap + make sure that file is built of extent blocks. An estimation is in tail2extent */ + +/* This sets inode flags: file has mapping. if file is mmaped with VM_MAYWRITE - invalidate pages and convert. */ +reiser4_internal int +mmap_unix_file(struct file *file, struct vm_area_struct *vma) +{ + int result; + struct inode *inode; + unix_file_info_t *uf_info; + + inode = file->f_dentry->d_inode; + uf_info = unix_file_inode_data(inode); + + get_exclusive_access(uf_info); + + if (!IS_RDONLY(inode) && (vma->vm_flags & (VM_MAYWRITE | VM_SHARED))) { + /* we need file built of extent items. If it is still built of tail items we have to convert it. Find + what items the file is built of */ + result = finish_conversion(inode); + if (result) { + drop_exclusive_access(uf_info); + return result; + } + + result = find_file_state(uf_info); + if (result != 0) { + drop_exclusive_access(uf_info); + return result; + } + + assert("vs-1648", (uf_info->container == UF_CONTAINER_TAILS || + uf_info->container == UF_CONTAINER_EXTENTS || + uf_info->container == UF_CONTAINER_EMPTY)); + if (uf_info->container == UF_CONTAINER_TAILS) { + /* invalidate all pages and convert file from tails to extents */ + result = check_pages_unix_file(inode); + if (result) { + drop_exclusive_access(uf_info); + return result; + } + } + } + + result = generic_file_mmap(file, vma); + if (result == 0) { + /* mark file as having mapping. 
*/ + inode_set_flag(inode, REISER4_HAS_MMAP); + vma->vm_ops = &unix_file_vm_ops; + } + + drop_exclusive_access(uf_info); + return result; +} + +static ssize_t +write_file(hint_t *hint, + struct file *file, /* file to write to */ + const char *buf, /* address of user-space buffer */ + size_t count, /* number of bytes to write */ + loff_t *off /* position in file to write to */, + int exclusive) +{ + struct inode *inode; + ssize_t written; /* amount actually written so far */ + loff_t pos; /* current location in the file */ + + inode = file->f_dentry->d_inode; + + /* estimation for write is entrusted to write item plugins */ + pos = *off; + + if (inode->i_size < pos) { + /* pos is set past real end of file */ + written = append_hole(hint, inode, pos, exclusive); + if (written) + return written; + assert("vs-1081", pos == inode->i_size); + } + + /* write user data to the file */ + written = write_flow(hint, file, inode, buf, count, pos, exclusive); + if (written > 0) + /* update position in a file */ + *off = pos + written; + + /* return number of written bytes, or error code */ + return written; +} + +/* plugin->u.file.write */ +reiser4_internal ssize_t +write_unix_file(struct file *file, /* file to write to */ + const char *buf, /* address of user-space buffer */ + size_t write_amount, /* number of bytes to write */ + loff_t *off /* position in file to write to */) +{ + int result; + struct inode *inode; + hint_t hint; + unix_file_info_t *uf_info; + struct page *pages[NR_PAGES_TO_PIN]; + int nr_pages; + size_t count, written, left; + int user_space; + int try_free_space; + + if (unlikely(write_amount == 0)) + return 0; + + inode = file->f_dentry->d_inode; + assert("vs-947", !inode_get_flag(inode, REISER4_NO_SD)); + + uf_info = unix_file_inode_data(inode); + + down(&uf_info->write); + + result = generic_write_checks(file, off, &write_amount, 0); + if (result) { + up(&uf_info->write); + return result; + } + + /* linux's VM requires this. 
See mm/vmscan.c:shrink_list() */ + current->backing_dev_info = inode->i_mapping->backing_dev_info; + + if (inode_get_flag(inode, REISER4_PART_CONV)) { + /* we can not currently write to a file which is partially converted */ + get_exclusive_access(uf_info); + result = finish_conversion(inode); + drop_exclusive_access(uf_info); + if (result) { + current->backing_dev_info = NULL; + up(&uf_info->write); + return result; + } + } + + if (inode_get_flag(inode, REISER4_HAS_MMAP) && uf_info->container == UF_CONTAINER_TAILS) { + /* file built of tails was mmaped. So, there might be + faultin-ed pages filled by tail item contents and mapped to + process address space. + Before starting write: + + 1) block new page creation by obtaining exclusive access to + the file + + 2) unmap address space of all mmap - now it is by + reiser4_invalidate_pages which invalidate pages as well + + 3) convert file to extents to not enter here on each write + to mmaped file */ + get_exclusive_access(uf_info); + result = check_pages_unix_file(inode); + drop_exclusive_access(uf_info); + if (result) { + current->backing_dev_info = NULL; + up(&uf_info->write); + return result; + } + } + + /* UNIX behavior: clear suid bit on file modification. This cannot be + done earlier, because removing suid bit captures blocks into + transaction, which should be done after taking either exclusive or + non-exclusive access on the file. 
*/ + result = remove_suid(file->f_dentry); + if (result != 0) { + current->backing_dev_info = NULL; + up(&uf_info->write); + return result; + } + grab_space_enable(); + + /* get seal and coord sealed with it from reiser4 private data of + * struct file */ + result = load_file_hint(file, &hint); + if (result) + return result; + + left = write_amount; + count = 0; + user_space = is_user_space(buf); + nr_pages = 0; + try_free_space = 1; + + while (left > 0) { + unsigned long addr; + size_t to_write; + int excl = 0; + + addr = (unsigned long)buf; + + /* getting exclusive or not exclusive access requires no + transaction open */ + txn_restart_current(); + + if (user_space) { + to_write = get_nr_pages_nr_bytes(addr, left, &nr_pages); + nr_pages = reiser4_get_user_pages(pages, addr, nr_pages, WRITE); + if (nr_pages < 0) { + result = nr_pages; + break; + } + to_write = adjust_nr_bytes(addr, to_write, nr_pages); + /* get_user_pages might create a transaction */ + txn_restart_current(); + } else + to_write = left; + + if (inode->i_size == 0) { + get_exclusive_access(uf_info); + excl = 1; + } else { + get_nonexclusive_access(uf_info, 0); + excl = 0; + } + + all_grabbed2free(); + written = write_file(&hint, file, buf, to_write, off, excl); + if (user_space) + reiser4_put_user_pages(pages, nr_pages); + + if (excl) + drop_exclusive_access(uf_info); + else + drop_nonexclusive_access(uf_info); + + /* With no locks held we can commit atoms in attempt to recover + * free space. 
*/ + if ((ssize_t)written == -ENOSPC && try_free_space) { + txnmgr_force_commit_all(inode->i_sb, 0); + try_free_space = 0; + continue; + } + if ((ssize_t)written < 0) { + result = written; + break; + } + left -= written; + buf += written; + + /* total number of written bytes */ + count += written; + } + + if ((file->f_flags & O_SYNC) || IS_SYNC(inode)) { + txn_restart_current(); + result = sync_unix_file(inode, 0/* data and stat data */); + if (result) + warning("reiser4-7", "failed to sync file %llu", + (unsigned long long)get_inode_oid(inode)); + } + + up(&uf_info->write); + current->backing_dev_info = 0; + save_file_hint(file, &hint); + + return count ? count : result; +} + +/* plugin->u.file.release() convert all extent items into tail items if + necessary */ +reiser4_internal int +release_unix_file(struct inode *object, struct file *file) +{ + unix_file_info_t *uf_info; + int result; + + uf_info = unix_file_inode_data(object); + result = 0; + + get_exclusive_access(uf_info); + if (atomic_read(&file->f_dentry->d_count) == 1 && + uf_info->container == UF_CONTAINER_EXTENTS && + !should_have_notail(uf_info, object->i_size) && + !rofs_inode(object)) { + result = extent2tail(uf_info); + if (result != 0) { + warning("nikita-3233", "Failed to convert in %s (%llu)", + __FUNCTION__, (unsigned long long)get_inode_oid(object)); + } + } + drop_exclusive_access(uf_info); + return 0; +} + +static void +set_file_notail(struct inode *inode) +{ + reiser4_inode *state; + formatting_plugin *tplug; + + state = reiser4_inode_data(inode); + tplug = formatting_plugin_by_id(NEVER_TAILS_FORMATTING_ID); + plugin_set_formatting(&state->pset, tplug); + inode_set_plugin(inode, + formatting_plugin_to_plugin(tplug), PSET_FORMATTING); +} + +/* if file is built of tails - convert it to extents */ +static int +unpack(struct inode *inode, int forever) +{ + int result = 0; + unix_file_info_t *uf_info; + + + uf_info = unix_file_inode_data(inode); + assert("vs-1628", ea_obtained(uf_info)); + + 
result = find_file_state(uf_info); + assert("vs-1074", ergo(result == 0, uf_info->container != UF_CONTAINER_UNKNOWN)); + if (result == 0) { + if (uf_info->container == UF_CONTAINER_TAILS) + result = tail2extent(uf_info); + if (result == 0 && forever) + set_file_notail(inode); + if (result == 0) { + __u64 tograb; + + grab_space_enable(); + tograb = inode_file_plugin(inode)->estimate.update(inode); + result = reiser4_grab_space(tograb, BA_CAN_COMMIT); + if (result == 0) + update_atime(inode); + } + } + + return result; +} + +/* plugin->u.file.ioctl */ +reiser4_internal int +ioctl_unix_file(struct inode *inode, struct file *filp UNUSED_ARG, unsigned int cmd, unsigned long arg UNUSED_ARG) +{ + int result; + + switch (cmd) { + case REISER4_IOC_UNPACK: + get_exclusive_access(unix_file_inode_data(inode)); + result = unpack(inode, 1 /* forever */); + drop_exclusive_access(unix_file_inode_data(inode)); + break; + + default: + result = RETERR(-ENOSYS); + break; + } + return result; +} + +/* plugin->u.file.get_block */ +reiser4_internal int +get_block_unix_file(struct inode *inode, + sector_t block, struct buffer_head *bh_result, int create UNUSED_ARG) +{ + int result; + reiser4_key key; + coord_t coord; + lock_handle lh; + item_plugin *iplug; + + assert("vs-1091", create == 0); + + key_by_inode_unix_file(inode, (loff_t) block * current_blocksize, &key); + + init_lh(&lh); + result = find_file_item_nohint(&coord, &lh, &key, ZNODE_READ_LOCK, inode); + if (cbk_errored(result)) { + done_lh(&lh); + return result; + } + + /*coord_clear_iplug(&coord);*/ + result = zload(coord.node); + if (result) { + done_lh(&lh); + return result; + } + iplug = item_plugin_by_coord(&coord); + if (iplug->s.file.get_block) + result = iplug->s.file.get_block(&coord, block, bh_result); + else + result = RETERR(-EINVAL); + + zrelse(coord.node); + done_lh(&lh); + return result; +} + +/* plugin->u.file.flow_by_inode + initialize flow (key, length, buf, etc) */ +reiser4_internal int 
+flow_by_inode_unix_file(struct inode *inode /* file to build flow for */ , + char *buf /* user level buffer */ , + int user /* 1 if @buf is of user space, 0 - if it is kernel space */ , + loff_t size /* buffer size */ , + loff_t off /* offset to start operation(read/write) from */ , + rw_op op /* READ or WRITE */ , + flow_t *flow /* resulting flow */ ) +{ + assert("nikita-1100", inode != NULL); + + flow->length = size; + flow->data = buf; + flow->user = user; + flow->op = op; + assert("nikita-1931", inode_file_plugin(inode) != NULL); + assert("nikita-1932", inode_file_plugin(inode)->key_by_inode == key_by_inode_unix_file); + /* calculate key of write position and insert it into flow->key */ + return key_by_inode_unix_file(inode, off, &flow->key); +} + +/* plugin->u.file.key_by_inode */ +reiser4_internal int +key_by_inode_unix_file(struct inode *inode, loff_t off, reiser4_key *key) +{ + return key_by_inode_and_offset_common(inode, off, key); +} + +/* plugin->u.file.set_plug_in_sd = NULL + plugin->u.file.set_plug_in_inode = NULL + plugin->u.file.create_blank_sd = NULL */ +/* plugin->u.file.delete */ +/* + plugin->u.file.add_link = add_link_common + plugin->u.file.rem_link = NULL */ + +/* plugin->u.file.owns_item + this is common_file_owns_item with assertion */ +/* Audited by: green(2002.06.15) */ +reiser4_internal int +owns_item_unix_file(const struct inode *inode /* object to check against */ , + const coord_t *coord /* coord to check */ ) +{ + int result; + + result = owns_item_common(inode, coord); + if (!result) + return 0; + if (item_type_by_coord(coord) != UNIX_FILE_METADATA_ITEM_TYPE) + return 0; + assert("vs-547", + item_id_by_coord(coord) == EXTENT_POINTER_ID || + item_id_by_coord(coord) == FORMATTING_ID); + return 1; +} + +static int +setattr_truncate(struct inode *inode, struct iattr *attr) +{ + int result; + int s_result; + loff_t old_size; + reiser4_tree *tree; + + inode_check_scale(inode, inode->i_size, attr->ia_size); + + old_size = inode->i_size; + 
tree = tree_by_inode(inode); + + result = safe_link_grab(tree, BA_CAN_COMMIT); + if (result == 0) + result = safe_link_add(inode, SAFE_TRUNCATE); + all_grabbed2free(); + if (result == 0) + result = truncate_file_body(inode, attr->ia_size); + if (result) + warning("vs-1588", "truncate_file failed: oid %lli, " + "old size %lld, new size %lld, retval %d", + (unsigned long long)get_inode_oid(inode), + old_size, attr->ia_size, result); + + s_result = safe_link_grab(tree, BA_CAN_COMMIT); + if (s_result == 0) + s_result = safe_link_del(inode, SAFE_TRUNCATE); + if (s_result != 0) { + warning("nikita-3417", "Cannot kill safelink %lli: %i", + (unsigned long long)get_inode_oid(inode), s_result); + } + safe_link_release(tree); + all_grabbed2free(); + return result; +} + +/* plugin->u.file.setattr method */ +/* This calls inode_setattr and if truncate is in effect it also takes + exclusive inode access to avoid races */ +reiser4_internal int +setattr_unix_file(struct inode *inode, /* Object to change attributes */ + struct iattr *attr /* change description */ ) +{ + int result; + + if (attr->ia_valid & ATTR_SIZE) { + /* truncate does reservation itself and requires exclusive + * access obtained */ + unix_file_info_t *uf_info; + + uf_info = unix_file_inode_data(inode); + down(&uf_info->write); + get_exclusive_access(uf_info); + result = setattr_truncate(inode, attr); + drop_exclusive_access(uf_info); + up(&uf_info->write); + } else + result = setattr_common(inode, attr); + + return result; +} + +/* plugin->u.file.init_inode_data */ +reiser4_internal void +init_inode_data_unix_file(struct inode *inode, + reiser4_object_create_data *crd, int create) +{ + unix_file_info_t *data; + + data = unix_file_inode_data(inode); + data->container = create ? 
UF_CONTAINER_EMPTY : UF_CONTAINER_UNKNOWN; + init_rwsem(&data->latch); + sema_init(&data->write, 1); + data->tplug = inode_formatting_plugin(inode); + data->exclusive_use = 0; + +#if REISER4_DEBUG + data->ea_owner = 0; + atomic_set(&data->nr_neas, 0); +#endif + init_inode_ordering(inode, crd, create); +} + +/* VS-FIXME-HANS: what is pre deleting all about? */ +/* plugin->u.file.pre_delete */ +reiser4_internal int +pre_delete_unix_file(struct inode *inode) +{ + unix_file_info_t *uf_info; + int result; + + /* FIXME: put comment here */ + uf_info = unix_file_inode_data(inode); + get_exclusive_access(uf_info); + result = truncate_file_body(inode, 0/* size */); + drop_exclusive_access(uf_info); + return result; +} + +/* Reads @count bytes from @file and calls @actor for every page read. This is + needed for loop back devices support. */ +reiser4_internal ssize_t sendfile_common ( + struct file *file, loff_t *ppos, size_t count, read_actor_t actor, void __user *target) +{ + file_plugin *fplug; + struct inode *inode; + read_descriptor_t desc; + struct page *page = NULL; + int ret = 0; + + assert("umka-3108", file != NULL); + + inode = file->f_dentry->d_inode; + + desc.error = 0; + desc.written = 0; + desc.arg.data = target; + desc.count = count; + + fplug = inode_file_plugin(inode); + if (fplug->readpage == NULL) + return RETERR(-EINVAL); + + while (desc.count != 0) { + unsigned long read_request_size; + unsigned long index; + unsigned long offset; + loff_t file_size = i_size_read(inode); + + if (*ppos >= file_size) + break; + + index = *ppos >> PAGE_CACHE_SHIFT; + offset = *ppos & ~PAGE_CACHE_MASK; + + page_cache_readahead(inode->i_mapping, &file->f_ra, file, offset, (file_size + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT); + + /* determine valid read request size. 
*/ + read_request_size = PAGE_CACHE_SIZE - offset; + if (read_request_size > desc.count) + read_request_size = desc.count; + if (*ppos + read_request_size >= file_size) { + read_request_size = file_size - *ppos; + if (read_request_size == 0) + break; + } + page = grab_cache_page(inode->i_mapping, index); + if (unlikely(page == NULL)) { + desc.error = RETERR(-ENOMEM); + break; + } + + if (PageUptodate(page)) + /* process locked, up-to-date page by read actor */ + goto actor; + + ret = fplug->readpage(file, page); + if (ret != 0) { + SetPageError(page); + ClearPageUptodate(page); + desc.error = ret; + goto fail_locked_page; + } + + lock_page(page); + if (!PageUptodate(page)) { + desc.error = RETERR(-EIO); + goto fail_locked_page; + } + + actor: + ret = actor(&desc, page, offset, read_request_size); + unlock_page(page); + page_cache_release(page); + + (*ppos) += ret; + + if (ret != read_request_size) + break; + } + + if (0) { + fail_locked_page: + unlock_page(page); + page_cache_release(page); + } + + update_atime(inode); + + if (desc.written) + return desc.written; + return desc.error; +} + +reiser4_internal ssize_t sendfile_unix_file(struct file *file, loff_t *ppos, size_t count, + read_actor_t actor, void __user *target) +{ + ssize_t ret; + struct inode *inode; + unix_file_info_t *ufo; + + inode = file->f_dentry->d_inode; + ufo = unix_file_inode_data(inode); + + down(&inode->i_sem); + inode_set_flag(inode, REISER4_HAS_MMAP); + up(&inode->i_sem); + + get_nonexclusive_access(ufo, 0); + ret = sendfile_common(file, ppos, count, actor, target); + drop_nonexclusive_access(ufo); + return ret; +} + +reiser4_internal int prepare_write_unix_file(struct file *file, struct page *page, + unsigned from, unsigned to) +{ + unix_file_info_t *uf_info; + int ret; + + uf_info = unix_file_inode_data(file->f_dentry->d_inode); + get_exclusive_access(uf_info); + ret = find_file_state(uf_info); + if (ret == 0) { + if (uf_info->container == UF_CONTAINER_TAILS) + ret = -EINVAL; + else + ret 
= prepare_write_common(file, page, from, to); + } + drop_exclusive_access(uf_info); + return ret; +} + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/file/file.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/file/file.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,152 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +#if !defined( __REISER4_FILE_H__ ) +#define __REISER4_FILE_H__ + +/* declarations of functions implementing file plugin for unix file plugin */ +int truncate_unix_file(struct inode *, loff_t size); +int readpage_unix_file(void *, struct page *); +int capturepage_unix_file(struct page *); +int capture_unix_file(struct inode *, struct writeback_control *); +ssize_t read_unix_file(struct file *, char *buf, size_t size, loff_t *off); +ssize_t write_unix_file(struct file *, const char *buf, size_t size, loff_t *off); +int release_unix_file(struct inode *inode, struct file *); +int ioctl_unix_file(struct inode *, struct file *, unsigned int cmd, unsigned long arg); +int mmap_unix_file(struct file *, struct vm_area_struct *vma); +int get_block_unix_file(struct inode *, sector_t block, struct buffer_head *bh_result, int create); +int flow_by_inode_unix_file(struct inode *, char *buf, int user, loff_t, loff_t, rw_op, flow_t *); +int key_by_inode_unix_file(struct inode *, loff_t off, reiser4_key *); +int owns_item_unix_file(const struct inode *, const coord_t *); +int setattr_unix_file(struct inode *, struct iattr *); +void init_inode_data_unix_file(struct inode *, reiser4_object_create_data *, int create); +int pre_delete_unix_file(struct inode *); + +extern ssize_t sendfile_common ( + struct file *file, loff_t *ppos, size_t count, read_actor_t actor, void __user *target); +extern ssize_t sendfile_unix_file ( + struct file *file, loff_t *ppos, size_t 
count, read_actor_t actor, void __user *target); +extern int prepare_write_unix_file (struct file *, struct page *, unsigned, unsigned); + +int sync_unix_file(struct inode *, int datasync); + + +/* all writes into a unix file are performed by the item write method. The write method of the unix file plugin only decides which + item plugin (extent or tail) to call and in which mode (one of those in the enum below) */ +typedef enum { + FIRST_ITEM = 1, + APPEND_ITEM = 2, + OVERWRITE_ITEM = 3 +} write_mode_t; + + +/* unix file may be in one of the following states */ +typedef enum { + UF_CONTAINER_UNKNOWN = 0, + UF_CONTAINER_TAILS = 1, + UF_CONTAINER_EXTENTS = 2, + UF_CONTAINER_EMPTY = 3 +} file_container_t; + +struct formatting_plugin; +struct inode; + +/* unix file plugin specific part of reiser4 inode */ +typedef struct unix_file_info { + /* this read-write lock protects file containerization change. Accesses + which do not change file containerization (see file_container_t) + (read, readpage, writepage, write (until tail conversion is + involved)) take read-lock. Accesses which modify file + containerization (truncate, conversion from tail to extent and back) + take write-lock. 
*/ + struct rw_semaphore latch; + /* this semaphore is used to serialize writes instead of inode->i_sem, + because write_unix_file uses get_user_pages which is to be used + under mm->mmap_sem and because it is required to take mm->mmap_sem + before inode->i_sem, so inode->i_sem would have to be up()-ed before + calling to get_user_pages which is unacceptable */ + struct semaphore write; + /* this enum specifies which items are used to build the file */ + file_container_t container; + /* plugin which controls when file is to be converted to extents and + back to tail */ + struct formatting_plugin *tplug; + /* if this is set, file is in exclusive use */ + int exclusive_use; +#if REISER4_DEBUG + /* pointer to task struct of thread owning exclusive access to file */ + void *ea_owner; + atomic_t nr_neas; + void *last_reader; +#ifdef CONFIG_FRAME_POINTER + void *where[5]; +#endif +#endif +} unix_file_info_t; + +struct unix_file_info *unix_file_inode_data(const struct inode * inode); + +#include "../item/extent.h" +#include "../item/tail.h" +#include "../item/ctail.h" + +struct uf_coord { + coord_t coord; + lock_handle *lh; + int valid; + union { + extent_coord_extension_t extent; + tail_coord_extension_t tail; + ctail_coord_extension_t ctail; + } extension; +}; + +#include "../../seal.h" + +/* structure used to speed up file operations (reads and writes). It contains + * a seal over last file item accessed. 
*/ +struct hint { + seal_t seal; + uf_coord_t ext_coord; + loff_t offset; + znode_lock_mode mode; +#if REISER4_DEBUG && defined(CONFIG_FRAME_POINTER) + void *bt[5]; +#endif +}; + +void set_hint(hint_t *, const reiser4_key *, znode_lock_mode); +int hint_is_set(const hint_t *); +void unset_hint(hint_t *); +int hint_validate(hint_t *, const reiser4_key *, int check_key, znode_lock_mode); + + +#if REISER4_DEBUG + +/* return 1 if exclusive access is obtained, 0 - otherwise */ +static inline int ea_obtained(unix_file_info_t *uf_info) +{ + int ret; + + ret = down_read_trylock(&uf_info->latch); + if (ret) + up_read(&uf_info->latch); + return !ret; +} + +#endif + + +/* __REISER4_FILE_H__ */ +#endif + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/file/funcs.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/file/funcs.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,25 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by reiser4/README */ + +/* prototypes of functions used by both file.c and tail_conversion.c */ +void get_exclusive_access(unix_file_info_t *); +void drop_exclusive_access(unix_file_info_t *); +void get_nonexclusive_access(unix_file_info_t *, int); +void drop_nonexclusive_access(unix_file_info_t *); + +int tail2extent(unix_file_info_t *); +int extent2tail(unix_file_info_t *); +int finish_conversion(struct inode *inode); + +void hint_init_zero(hint_t *); +int find_file_item_nohint(coord_t *, lock_handle *, const reiser4_key *, + znode_lock_mode, struct inode *); + +int goto_right_neighbor(coord_t *, lock_handle *); +int find_or_create_extent(struct page *); +write_mode_t how_to_write(uf_coord_t *, const reiser4_key *); + +extern inline int +cbk_errored(int cbk_result) +{ + return (cbk_result != CBK_COORD_NOTFOUND && cbk_result != CBK_COORD_FOUND); +} diff -puN /dev/null 
fs/reiser4/plugin/file/invert.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/file/invert.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,511 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Suppose you want to conveniently read and write a large variety of small files within a single emacs + buffer, without having a separate buffer for each 8 byte or so file. Inverts are the way to do that. An invert + provides you with the contents of a set of subfiles plus its own contents. It is a file which inherits other files + when you read it, and allows you to write to it and through it to the files that it inherits from. In order for it + to know which subfiles each part of your write should go into, there must be delimiters indicating that. It tries to + make that easy for you by providing those delimiters in what you read from it. + + When you read it, an invert performs an inverted assignment. Instead of taking an assignment command and writing a + bunch of files, it takes a bunch of files and composes an assignment command for you to read from it that if executed + would create those files. But which files? Well, that must be specified in the body of the invert using a special + syntax, and that specification is called the invert of the assignment. + + When written to, an invert performs the assignment command that is written + to it, and modifies its own body to contain the invert of that + assignment. + + In other words, writing to an invert file what you have read from it + is the identity operation. + + Malformed assignments cause write errors. Partial writes are not + supported in v4.0, but will be. + + Example: + + If an invert contains: + + /filenameA/<>+"(some text stored in the invert)+/filenameB/<> + +====================== +Each element in this definition should be an invert, and all files +should be called recursively - too. This is bad. 
If one of the +included files is not a regular or invert file, then we can't read +main file. + +I think it is possible to make this easier: + +internal structure of invert file should be like symlink file. But +read and write method should be explicitly indicated in i/o operation. + +By default we read and write (if possible) as symlink, and if we +specify ..invert at read time then we can also specify it at write time. + +example: +/my_invert_file/..invert<- ( (/filenameA<-"(The contents of filenameA))+"(some text stored in the invert)+(/filenameB<-"(The contents of filenameB) ) ) +will create /my_invert_file as invert, and will create /filenameA and /filenameB with specified body. + +read of /my_invert_file/..invert will be +/filenameA<-"(The contents of filenameA)+"(some text stored in the invert)+/filenameB<-"(The contents of filenameB) + +but read of /my_invert_file/ will be +The contents of filenameAsome text stored in the invertThe contents of filenameB + +we also can create this file as +/my_invert_file/<-/filenameA+"(some text stored in the invert)+/filenameB +will create /my_invert_file , and use existing files /filenameA and /filenameB. + +and when we read it, it will behave like the previous invert file. + +Is this correct? + + vv +DEMIDOV-FIXME-HANS: + +Maybe you are right, but then you must disable writes to /my_invert_file/ and only allow writes to /my_invert_file/..invert + +Do you agree? Discuss it on reiserfs-list.... + +-Hans +======================= + + Then a read will return: + + /filenameA<-"(The contents of filenameA)+"(some text stored in the invert)+/filenameB<-"(The contents of filenameB) + + and a write of the line above to the invert will set the contents of + the invert and filenameA and filenameB to their original values. + + Note that the contents of an invert have no influence on the effect + of a write unless the write is a partial write (and a write of a + shorter file without using truncate first is a partial write). 
+ + truncate() has no effect on filenameA and filenameB, it merely + resets the value of the invert. + + Writes to subfiles via the invert are implemented by preceding them + with truncates. + + Parse failures cause write failures. + + Questions to ponder: should the invert be acted on prior to file + close when writing to an open filedescriptor? + + Example: + + If an invert contains: + + "(This text and a pair of quotes are all that is here.) + +Then a read will return: + + "(This text and a pair of quotes are all that is here.) + +*/ + +/* OPEN method places a struct file in memory associated with invert body + and returns something like file descriptor to the user for the future access + to the invert file. + During opening we parse the body of invert and get a list of the 'entries' + (that describes all its subfiles) and place pointer on the first struct in + reiserfs-specific part of invert inode (arbitrary decision). + + Each subfile is described by the struct inv_entry that has a pointer @sd on + in-core based stat-data and a pointer on struct file @f (if we find that the + subfile uses more than one unformatted node (arbitrary decision), we load + struct file in memory, otherwise we load base stat-data (and maybe 1-2 bytes + of some other information we need)). + + Since READ and WRITE methods for inverts were formulated in assignment + language, they don't contain arguments 'size' and 'offset' that make sense + only in ordinary read/write methods. + + READ method is a combination of two methods: + 1) ordinary read method (with offset=0, length = @f->...->i_size) for entries + with @f != 0, this method uses pointer on struct file as an argument + 2) read method for inode-less files with @sd != 0, this method uses + in-core based stat-data instead of struct file as an argument. + in the first case we don't use pagecache, just copy data that we got after + cbk() into userspace. + + WRITE method for invert files is more complex. 
+ Besides declared WRITE-interface in assignment language above we need + to have an opportunity to edit unwrapped body of invert file with some + text editor, it means we need GENERIC WRITE METHOD for invert file: + + my_invert_file/..invert <- "string" + + this method parses "string" and looks for correct subfile signatures, also + the parsing process splits this "string" into the set of flows in accordance + with the set of subfiles specified by this signature. + The found list of signatures #S is compared with the opened one #I of invert + file. If it doesn't have this one (#I==0, it will be so for instance if we + have just created this invert file) the write method assigns found signature + (#I=#S;) to the invert file. Then if #I==#S, generic write method splits + itself into some write methods for ordinary or light-weight, or calls itself + recursively for invert files with corresponding flows. + I am not sure, but the list of signatures looks like what mr.Demidov means + by 'delimiters'. + + The cases when #S<#I (#I<#S) (in the sense of set-theory) are also available + and cause delete (create new) subfiles (arbitrary decision - it may look + too complex, but this interface will be the most complete). The order of entries + of list #S (#I) and inherited order on #I (#S) must coincide. + The other parsing results give malformed signature that aborts READ method + and releases all resources. + + + Format of subfile (entry) signature: + + "START_MAGIC"<>(TYPE="...",LOOKUP_ARG="...")SUBFILE_BODY"END_MAGIC" + + Legend: + + START_MAGIC - keyword indicates the start of subfile signature; + + <> indicates the start of 'subfile metadata', that is the pair + (TYPE="...",LOOKUP_ARG="...") in parenthesis separated by comma. 
+ + TYPE - the string "type" indicates the start of one of the three words: + - ORDINARY_FILE, + - LIGHT_WEIGHT_FILE, + - INVERT_FILE; + + LOOKUP_ARG - lookup argument depends on previous type: + */ + + /************************************************************/ + /* TYPE * LOOKUP ARGUMENT */ + /************************************************************/ + /* LIGHT_WEIGHT_FILE * stat-data key */ + /************************************************************/ + /* ORDINARY_FILE * filename */ + /************************************************************/ + /* INVERT_FILE * filename */ + /************************************************************/ + + /* where: + *stat-data key - the string contains stat data key of this subfile, it will be + passed to fast-access lookup method for light-weight files; + *filename - pathname of this subfile, it will be passed to VFS lookup methods + for ordinary and invert files; + + SUBFILE_BODY - data of this subfile (it will go to the flow) + END_MAGIC - the keyword indicates the end of subfile signature. + + The other symbols inside the signature are interpreted as 'unformatted content', + which is available with VFS's read_link() (arbitrary decision). + + NOTE: Parse method for a body of invert file uses mentioned signatures _without_ + subfile bodies. + + Now the only unclear thing is WRITE in regular light-weight subfile A that we + can describe only in assignment language: + + A <- "some_string" + + I guess we don't want to change stat-data and body items of file A + if this file exists, and size(A) != size("some_string") because this operation is + expensive, so we only do the partial write if size(A) > size("some_string") + and do truncate of the "some_string", and then do A <- "truncated string", if + size(A) < size("some_string"). This decision is also arbitrary. 
+ */ + +/* here is infrastructure for formatted flows */ + +#define SUBFILE_HEADER_MAGIC 0x19196605 +#define FLOW_HEADER_MAGIC 0x01194304 + +#include "../plugin.h" +#include "../../debug.h" +#include "../../forward.h" +#include "../object.h" +#include "../item/item.h" +#include "../item/static_stat.h" +#include "../../dformat.h" +#include "../znode.h" +#include "../inode.h" + +#include +#include /* for struct file */ +#include /* for struct list_head */ + +typedef enum { + LIGHT_WEIGHT_FILE, + ORDINARY_FILE, + INVERT_FILE +} inv_entry_type; + +typedef struct flow_header { + d32 fl_magic; + d16 fl_nr; /* number of subfiles in the flow */ +} flow_header; + +typedef struct subfile_header { + d32 sh_magic; /* subfile magic */ + d16 sh_type; /* type of subfile: light-weight, ordinary, invert */ + d16 sh_arg_len; /* length of lookup argument (filename, key) */ + d32 sh_body_len; /* length of subfile body */ +} subfile_header; + +/* functions to get/set fields of flow header */ + +static void +fl_set_magic(flow_header * fh, __u32 value) +{ + cputod32(value, &fh->fl_magic); +} + +static __u32 +fl_get_magic(flow_header * fh) +{ + return d32tocpu(&fh->fl_magic); +} +static void +fl_set_number(flow_header * fh, __u16 value) +{ + cputod16(value, &fh->fl_nr); +} +static unsigned +fl_get_number(flow_header * fh) +{ + return d16tocpu(&fh->fl_nr); +} + +/* functions to get/set fields of subfile header */ + +static void +sh_set_magic(subfile_header * sh, __u32 value) +{ + cputod32(value, &sh->sh_magic); +} + +static __u32 +sh_get_magic(subfile_header * sh) +{ + return d32tocpu(&sh->sh_magic); +} +static void +sh_set_type(subfile_header * sh, __u16 value) +{ + cputod16(value, &sh->sh_type); +} +static unsigned +sh_get_type(subfile_header * sh) +{ + return d16tocpu(&sh->sh_type); +} +static void +sh_set_arg_len(subfile_header * sh, __u16 value) +{ + cputod16(value, &sh->sh_arg_len); +} +static unsigned +sh_get_arg_len(subfile_header * sh) +{ + return d16tocpu(&sh->sh_arg_len); +} +static void 
+sh_set_body_len(subfile_header * sh, __u32 value) +{ + cputod32(value, &sh->sh_body_len); +} + +static __u32 +sh_get_body_len(subfile_header * sh) +{ + return d32tocpu(&sh->sh_body_len); +} + +/* in-core minimal stat-data, light-weight analog of inode */ + +struct incore_sd_base { + umode_t isd_mode; + nlink_t isd_nlink; + loff_t isd_size; + char *isd_data; /* 'subflow' to write */ +}; + +/* open invert creates a list of invert entries, + every entry is represented by structure inv_entry */ + +struct inv_entry { + struct list_head ie_list; + struct file *ie_file; /* this is NULL if the file doesn't + have unformatted nodes */ + struct incore_sd_base *ie_sd; /* inode-less analog of struct file */ +}; + +/* allocate and init invert entry */ + +static struct inv_entry * +allocate_inv_entry(void) +{ + struct inv_entry *inv_entry; + + inv_entry = reiser4_kmalloc(sizeof (struct inv_entry), GFP_KERNEL); + if (!inv_entry) + return ERR_PTR(RETERR(-ENOMEM)); + inv_entry->ie_file = NULL; + inv_entry->ie_sd = NULL; + INIT_LIST_HEAD(&inv_entry->ie_list); + return inv_entry; +} + +static int +put_inv_entry(struct inv_entry *ientry) +{ + int result = 0; + + assert("edward-96", ientry != NULL); + + list_del(&ientry->ie_list); + if (ientry->ie_sd != NULL) + kfree(ientry->ie_sd); + if (ientry->ie_file != NULL) + result = filp_close(ientry->ie_file, NULL); + /* free the entry itself only after its members are released */ + kfree(ientry); + return result; +} + +static int +allocate_incore_sd_base(struct inv_entry *inv_entry) +{ + struct incore_sd_base *isd_base; + + assert("edward-98", inv_entry != NULL); + assert("edward-99", inv_entry->ie_file == NULL); + assert("edward-100", inv_entry->ie_sd == NULL); + + isd_base = reiser4_kmalloc(sizeof (struct incore_sd_base), GFP_KERNEL); + if (!isd_base) + return RETERR(-ENOMEM); + inv_entry->ie_sd = isd_base; + return 0; +} + +/* this can be installed as ->init_inv_entry () method of + item_plugins[ STATIC_STAT_DATA_IT ] (fs/reiser4/plugin/item/item.c). 
+ Copies data from on-disk stat-data format into light-weight analog of inode . + Doesn't handle stat-data extensions. */ + +static void +sd_base_load(struct inv_entry *inv_entry, char *sd) +{ + reiser4_stat_data_base *sd_base; + + assert("edward-101", inv_entry != NULL); + assert("edward-101", inv_entry->ie_sd != NULL); + assert("edward-102", sd != NULL); + + sd_base = (reiser4_stat_data_base *) sd; + inv_entry->ie_sd->isd_mode = d16tocpu(&sd_base->mode); + inv_entry->ie_sd->isd_nlink = d32tocpu(&sd_base->nlink); + inv_entry->ie_sd->isd_size = d64tocpu(&sd_base->size); + inv_entry->ie_sd->isd_data = NULL; +} + +/* initialise incore stat-data */ + +static void +init_incore_sd_base(struct inv_entry *inv_entry, coord_t * coord) +{ + reiser4_plugin *plugin = item_plugin_by_coord(coord); + void *body = item_body_by_coord(coord); + + assert("edward-103", inv_entry != NULL); + assert("edward-104", plugin != NULL); + assert("edward-105", body != NULL); + + sd_base_load(inv_entry, body); +} + +/* takes a key or filename and allocates new invert_entry, + init and adds it into the list, + we use lookup_sd_by_key() for light-weight files and VFS lookup by filename */ + +int +get_inv_entry(struct inode *invert_inode, /* inode of invert's body */ + inv_entry_type type, /* LIGHT-WEIGHT or ORDINARY */ + const reiser4_key * key, /* key of invert entry stat-data */ + char *filename, /* filename of the file to be opened */ + int flags, int mode) +{ + int result; + struct inv_entry *ientry; + + assert("edward-107", invert_inode != NULL); + + ientry = allocate_inv_entry(); + if (IS_ERR(ientry)) + return (PTR_ERR(ientry)); + + if (type == LIGHT_WEIGHT_FILE) { + coord_t coord; + lock_handle lh; + + assert("edward-108", key != NULL); + + init_coord(&coord); + init_lh(&lh); + result = lookup_sd_by_key(tree_by_inode(invert_inode), ZNODE_READ_LOCK, &coord, &lh, key); + if (result == 0) + init_incore_sd_base(ientry, &coord); + + done_lh(&lh); + 
done_coord(&coord); + return (result); + } else { + struct file *file = filp_open(filename, flags, mode); + /* FIXME_EDWARD here we need to check if we + did't follow to any mount point */ + + assert("edward-108", filename != NULL); + + if (IS_ERR(file)) + return (PTR_ERR(file)); + ientry->ie_file = file; + return 0; + } +} + +/* takes inode of invert, reads the body of this invert, parses it, + opens all invert entries and return pointer on the first inv_entry */ + +struct inv_entry * +open_invert(struct file *invert_file) +{ + +} + +ssize_t subfile_read(struct *invert_entry, flow * f) +{ + +} + +ssize_t subfile_write(struct *invert_entry, flow * f) +{ + +} + +ssize_t invert_read(struct *file, flow * f) +{ + +} + +ssize_t invert_write(struct *file, flow * f) +{ + +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/file/pseudo.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/file/pseudo.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,180 @@ +/* Copyright 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* + * Pseudo file plugin. This contains helper functions used by pseudo files. + */ + +#include "pseudo.h" +#include "../plugin.h" + +#include "../../inode.h" + +#include +#include + +/* extract pseudo file plugin, stored in @file */ +static pseudo_plugin * +get_pplug(struct file * file) +{ + struct inode *inode; + + inode = file->f_dentry->d_inode; + return reiser4_inode_data(inode)->file_plugin_data.pseudo_info.plugin; +} + +/* common routine to open pseudo file. 
*/ +reiser4_internal int open_pseudo(struct inode * inode, struct file * file) +{ + int result; + pseudo_plugin *pplug; + + pplug = get_pplug(file); + + /* for pseudo files based on seq_file interface */ + if (pplug->read_type == PSEUDO_READ_SEQ) { + result = seq_open(file, &pplug->read.ops); + if (result == 0) { + struct seq_file *m; + + m = file->private_data; + m->private = file; + } + } else if (pplug->read_type == PSEUDO_READ_SINGLE) + /* for pseudo files containing one record */ + result = single_open(file, pplug->read.single_show, file); + else + result = 0; + + return result; +} + +/* common read method for pseudo files */ +reiser4_internal ssize_t read_pseudo(struct file *file, + char __user *buf, size_t size, loff_t *ppos) +{ + switch (get_pplug(file)->read_type) { + case PSEUDO_READ_SEQ: + case PSEUDO_READ_SINGLE: + /* seq_file behaves like pipe, requiring @ppos to always be + * address of file->f_pos */ + return seq_read(file, buf, size, &file->f_pos); + case PSEUDO_READ_FORWARD: + return get_pplug(file)->read.read(file, buf, size, ppos); + default: + return 0; + } +} + +/* common seek method for pseudo files */ +reiser4_internal loff_t seek_pseudo(struct file *file, loff_t offset, int origin) +{ + switch (get_pplug(file)->read_type) { + case PSEUDO_READ_SEQ: + case PSEUDO_READ_SINGLE: + return seq_lseek(file, offset, origin); + default: + return 0; + } +} + +/* common release method for pseudo files */ +reiser4_internal int release_pseudo(struct inode *inode, struct file *file) +{ + int result; + + switch (get_pplug(file)->read_type) { + case PSEUDO_READ_SEQ: + case PSEUDO_READ_SINGLE: + result = seq_release(inode, file); + file->private_data = NULL; + break; + default: + result = 0; + } + return result; +} + +/* pseudo files need special ->drop() method, because they don't have nlink + * and only exist while host object does. 
*/ +reiser4_internal void drop_pseudo(struct inode * object) +{ + /* pseudo files are not protected from deletion by their ->i_nlink */ + generic_delete_inode(object); +} + +/* common write method for pseudo files */ +reiser4_internal ssize_t +write_pseudo(struct file *file, + const char __user *buf, size_t size, loff_t *ppos) +{ + ssize_t result; + + switch (get_pplug(file)->write_type) { + case PSEUDO_WRITE_STRING: { + char * inkernel; + + inkernel = getname(buf); + if (!IS_ERR(inkernel)) { + result = get_pplug(file)->write.gets(file, inkernel); + putname(inkernel); + if (result == 0) + result = size; + } else + result = PTR_ERR(inkernel); + break; + } + case PSEUDO_WRITE_FORWARD: + result = get_pplug(file)->write.write(file, buf, size, ppos); + break; + default: + result = size; + } + return result; +} + +/* on-wire serialization of pseudo files. */ + +/* this is not implemented so far (and, hence, pseudo files are not accessible + * over NFS, closing remote exploits a fortiori) */ + +reiser4_internal int +wire_size_pseudo(struct inode *inode) +{ + return RETERR(-ENOTSUPP); +} + +reiser4_internal char * +wire_write_pseudo(struct inode *inode, char *start) +{ + return ERR_PTR(RETERR(-ENOTSUPP)); +} + +reiser4_internal char * +wire_read_pseudo(char *addr, reiser4_object_on_wire *obj) +{ + return ERR_PTR(RETERR(-ENOTSUPP)); +} + +reiser4_internal void +wire_done_pseudo(reiser4_object_on_wire *obj) +{ + /* nothing to do */ +} + +reiser4_internal struct dentry * +wire_get_pseudo(struct super_block *sb, reiser4_object_on_wire *obj) +{ + return ERR_PTR(RETERR(-ENOTSUPP)); +} + + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/file/pseudo.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/file/pseudo.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,39 @@ +/* Copyright 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#if !defined(__REISER4_PSEUDO_FILE_H__) +#define __REISER4_PSEUDO_FILE_H__ + +#include "../plugin.h" + +#include + +extern int open_pseudo(struct inode * inode, struct file * file); +extern ssize_t read_pseudo(struct file *file, + char __user *buf, size_t size, loff_t *ppos); +extern ssize_t write_pseudo(struct file *file, + const char __user *buf, size_t size, loff_t *ppos); +extern loff_t seek_pseudo(struct file *file, loff_t offset, int origin); +extern int release_pseudo(struct inode *inode, struct file *file); +extern void drop_pseudo(struct inode * object); + +extern int wire_size_pseudo(struct inode *inode); +extern char *wire_write_pseudo(struct inode *inode, char *start); +extern char *wire_read_pseudo(char *addr, reiser4_object_on_wire *obj); +extern void wire_done_pseudo(reiser4_object_on_wire *obj); +extern struct dentry *wire_get_pseudo(struct super_block *sb, + reiser4_object_on_wire *obj); + +/* __REISER4_PSEUDO_FILE_H__ */ +#endif + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ + diff -puN /dev/null fs/reiser4/plugin/file/symfile.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/file/symfile.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,98 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Symfiles are a generalization of Unix symlinks. 
+ + A symfile when read behaves as though you took its contents and + substituted them into the reiser4 naming system as the right hand side + of an assignment, and then read that which you had assigned to it. + + A key issue for symfiles is how to implement writes through to + subfiles. In general, one must have some method of determining what + of that which is written to the symfile is written to what subfile. + This can be done by use of custom plugin methods written by users, or + by using a few general methods we provide for those willing to endure + the insertion of delimiters into what is read. + + Writing to symfiles without delimiters to denote what is written to + what subfile is not supported by any plugins we provide in this + release. Our most sophisticated support for writes is that embodied + by the invert plugin (see invert.c). + + A read only version of the /etc/passwd file might be + constructed as a symfile whose contents are as follows: + + /etc/passwd/userlines/* + + or + + /etc/passwd/userlines/demidov+/etc/passwd/userlines/edward+/etc/passwd/userlines/reiser+/etc/passwd/userlines/root + + or + + /etc/passwd/userlines/(demidov+edward+reiser+root) + + A symfile with contents + + /filenameA+"(some text stored in the uninvertable symfile)+/filenameB + + will return when read + + The contents of filenameAsome text stored in the uninvertable symfileThe contents of filenameB + + and write of what has been read will not be possible to implement as + an identity operation because there are no delimiters denoting the + boundaries of what is to be written to what subfile. + + Note that one could make this a read/write symfile if one specified + delimiters, and the write method understood those delimiters delimited + what was written to subfiles. 
+ + So, specifying the symfile in a manner that allows writes: + + /etc/passwd/userlines/demidov+"( + )+/etc/passwd/userlines/edward+"( + )+/etc/passwd/userlines/reiser+"( + )+/etc/passwd/userlines/root+"( + ) + + or + + /etc/passwd/userlines/(demidov+"( + )+edward+"( + )+reiser+"( + )+root+"( + )) + + and the file demidov might be specified as: + + /etc/passwd/userlines/demidov/username+"(:)+/etc/passwd/userlines/demidov/password+"(:)+/etc/passwd/userlines/demidov/userid+"(:)+/etc/passwd/userlines/demidov/groupid+"(:)+/etc/passwd/userlines/demidov/gecos+"(:)+/etc/passwd/userlines/demidov/home+"(:)+/etc/passwd/userlines/demidov/shell + + or + + /etc/passwd/userlines/demidov/(username+"(:)+password+"(:)+userid+"(:)+groupid+"(:)+gecos+"(:)+home+"(:)+shell) + + Notice that if the file demidov has a carriage return in it, the + parsing fails, but then if you put carriage returns in the wrong place + in a normal /etc/passwd file it breaks things also. + + Note that it is forbidden to have no text between two interpolations + if one wants to be able to define what parts of a write go to what + subfiles referenced in an interpolation. + + If one wants to be able to add new lines by writing to the file, one + must either write a custom plugin for /etc/passwd that knows how to + name an added line, or one must use an invert, or one must use a more + sophisticated symfile syntax that we are not planning to write for + version 4.0. 
+*/ + + + + + + + + + + + diff -puN /dev/null fs/reiser4/plugin/file/tail_conversion.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/file/tail_conversion.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,720 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "../../inode.h" +#include "../../super.h" +#include "../../page_cache.h" +#include "../../carry.h" +#include "../../lib.h" +#include "../../safe_link.h" +#include "../../vfs_ops.h" +#include "funcs.h" + +#include + +/* this file contains: + tail2extent and extent2tail */ + + +/* exclusive access to a file is acquired when file state changes: tail2extent, empty2tail, extent2tail, etc */ +reiser4_internal void +get_exclusive_access(unix_file_info_t *uf_info) +{ + assert("nikita-3028", schedulable()); + assert("nikita-3047", LOCK_CNT_NIL(inode_sem_w)); + assert("nikita-3048", LOCK_CNT_NIL(inode_sem_r)); + /* + * "deadlock detection": sometimes we commit a transaction under + * rw-semaphore on a file. Such commit can deadlock with another + * thread that captured some block (hence preventing atom from being + * committed) and waits on rw-semaphore. 
+ */ + assert("nikita-3361", get_current_context()->trans->atom == NULL); + BUG_ON(get_current_context()->trans->atom != NULL); + LOCK_CNT_INC(inode_sem_w); + down_write(&uf_info->latch); + assert("vs-1713", uf_info->ea_owner == NULL); + assert("vs-1713", atomic_read(&uf_info->nr_neas) == 0); + ON_DEBUG(uf_info->ea_owner = current); +} + +reiser4_internal void +drop_exclusive_access(unix_file_info_t *uf_info) +{ + assert("vs-1714", uf_info->ea_owner == current); + assert("vs-1715", atomic_read(&uf_info->nr_neas) == 0); + ON_DEBUG(uf_info->ea_owner = NULL); + up_write(&uf_info->latch); + assert("nikita-3049", LOCK_CNT_NIL(inode_sem_r)); + assert("nikita-3049", LOCK_CNT_GTZ(inode_sem_w)); + LOCK_CNT_DEC(inode_sem_w); +} + +/* nonexclusive access to a file is acquired for read, write, readpage */ +reiser4_internal void +get_nonexclusive_access(unix_file_info_t *uf_info, int atom_may_exist) +{ + assert("nikita-3029", schedulable()); + /* unix_file_filemap_nopage may call this when current atom exist already */ + assert("nikita-3361", ergo(atom_may_exist == 0, get_current_context()->trans->atom == NULL)); + BUG_ON(atom_may_exist == 0 && get_current_context()->trans->atom != NULL); + down_read(&uf_info->latch); + LOCK_CNT_INC(inode_sem_r); + assert("vs-1716", uf_info->ea_owner == NULL); + ON_DEBUG(atomic_inc(&uf_info->nr_neas)); +#if REISER4_DEBUG + uf_info->last_reader = current; +#ifdef CONFIG_FRAME_POINTER + uf_info->where[0] = __builtin_return_address(0); + uf_info->where[1] = __builtin_return_address(1); + uf_info->where[2] = __builtin_return_address(2); + uf_info->where[3] = __builtin_return_address(3); + uf_info->where[4] = __builtin_return_address(4); +#endif +#endif +} + +reiser4_internal void +drop_nonexclusive_access(unix_file_info_t *uf_info) +{ + assert("vs-1718", uf_info->ea_owner == NULL); + assert("vs-1719", atomic_read(&uf_info->nr_neas) > 0); + ON_DEBUG(atomic_dec(&uf_info->nr_neas)); + up_read(&uf_info->latch); + LOCK_CNT_DEC(inode_sem_r); +} + +/* 
part of tail2extent. Cut all items covering @count bytes starting from + @offset */ +/* Audited by: green(2002.06.15) */ +static int +cut_formatting_items(struct inode *inode, loff_t offset, int count) +{ + reiser4_key from, to; + + /* AUDIT: How about putting an assertion here, what would check + all provided range is covered by tail items only? */ + /* key of first byte in the range to be cut */ + key_by_inode_unix_file(inode, offset, &from); + + /* key of last byte in that range */ + to = from; + set_key_offset(&to, (__u64) (offset + count - 1)); + + /* cut everything between those keys */ + return cut_tree(tree_by_inode(inode), &from, &to, inode, 0); +} + +static void +release_all_pages(struct page **pages, unsigned nr_pages) +{ + unsigned i; + + for (i = 0; i < nr_pages; i++) { + if (pages[i] == NULL) { + unsigned j; + for (j = i + 1; j < nr_pages; j ++) + assert("vs-1620", pages[j] == NULL); + break; + } + page_cache_release(pages[i]); + pages[i] = NULL; + } +} + +/* part of tail2extent. replace tail items with extent one. Content of tail + items (@count bytes) being cut are copied already into + pages. 
extent_writepage method is called to create extents corresponding to + those pages */ +static int +replace(struct inode *inode, struct page **pages, unsigned nr_pages, int count) +{ + int result; + unsigned i; + STORE_COUNTERS; + + if (nr_pages == 0) + return 0; + + assert("vs-596", pages[0]); + + /* cut copied items */ + result = cut_formatting_items(inode, (loff_t) pages[0]->index << PAGE_CACHE_SHIFT, count); + if (result) + return result; + + CHECK_COUNTERS; + + /* put into tree replacement for just removed items: extent item, namely */ + for (i = 0; i < nr_pages; i++) { + result = add_to_page_cache_lru(pages[i], inode->i_mapping, + pages[i]->index, mapping_gfp_mask(inode->i_mapping)); + if (result) + break; + unlock_page(pages[i]); + result = find_or_create_extent(pages[i]); + if (result) + break; + SetPageUptodate(pages[i]); + } + return result; +} + +#define TAIL2EXTENT_PAGE_NUM 3 /* number of pages to fill before cutting tail + * items */ + +static int +reserve_tail2extent_iteration(struct inode *inode) +{ + reiser4_block_nr unformatted_nodes; + reiser4_tree *tree; + + tree = tree_by_inode(inode); + + /* number of unformatted nodes which will be created */ + unformatted_nodes = TAIL2EXTENT_PAGE_NUM; + + /* + * space required for one iteration of tail->extent conversion: + * + * 1. kill N tail items + * + * 2. insert TAIL2EXTENT_PAGE_NUM unformatted nodes + * + * 3. insert TAIL2EXTENT_PAGE_NUM (worst-case single-block + * extents) extent units. + * + * 4. drilling to the leaf level by coord_by_key() + * + * 5.
possible update of stat-data + * + */ + grab_space_enable(); + return reiser4_grab_space + (2 * tree->height + + TAIL2EXTENT_PAGE_NUM + + TAIL2EXTENT_PAGE_NUM * estimate_one_insert_into_item(tree) + + 1 + estimate_one_insert_item(tree) + + inode_file_plugin(inode)->estimate.update(inode), + BA_CAN_COMMIT); +} + +/* this is used by tail2extent and extent2tail to detect where previous uncompleted conversion stopped */ +static int +find_start(struct inode *object, reiser4_plugin_id id, __u64 *offset) +{ + int result; + lock_handle lh; + coord_t coord; + unix_file_info_t *ufo; + int found; + reiser4_key key; + + ufo = unix_file_inode_data(object); + init_lh(&lh); + result = 0; + found = 0; + key_by_inode_unix_file(object, *offset, &key); + do { + init_lh(&lh); + result = find_file_item_nohint(&coord, &lh, &key, + ZNODE_READ_LOCK, object); + + if (result == CBK_COORD_FOUND) { + if (coord.between == AT_UNIT) { + /*coord_clear_iplug(&coord);*/ + result = zload(coord.node); + if (result == 0) { + if (item_id_by_coord(&coord) == id) + found = 1; + else + item_plugin_by_coord(&coord)->s.file.append_key(&coord, &key); + zrelse(coord.node); + } + } else + result = RETERR(-ENOENT); + } + done_lh(&lh); + } while (result == 0 && !found); + *offset = get_key_offset(&key); + return result; +} + +/* clear stat data's flag indicating that file is being converted */ +static int +complete_conversion(struct inode *inode) +{ + int result; + + all_grabbed2free(); + grab_space_enable(); + result = reiser4_grab_space(inode_file_plugin(inode)->estimate.update(inode), + BA_CAN_COMMIT); + if (result == 0) { + inode_clr_flag(inode, REISER4_PART_CONV); + result = reiser4_update_sd(inode); + } + if (result) + warning("vs-1696", "Failed to clear converting bit of %llu: %i", + (unsigned long long)get_inode_oid(inode), result); + return 0; +} + +reiser4_internal int +tail2extent(unix_file_info_t *uf_info) +{ + int result; + reiser4_key key; /* key of next byte to be moved to page */ +
ON_DEBUG(reiser4_key tmp;) + char *p_data; /* data of page */ + unsigned page_off = 0, /* offset within the page where to copy data */ + count; /* number of bytes of item which can be + * copied to page */ + struct page *pages[TAIL2EXTENT_PAGE_NUM]; + struct page *page; + int done; /* set to 1 when all file is read */ + char *item; + int i; + struct inode *inode; + __u64 offset; + int first_iteration; + int bytes; + + assert("nikita-3362", ea_obtained(uf_info)); + inode = unix_file_info_to_inode(uf_info); + assert("nikita-3412", !IS_RDONLY(inode)); + assert("vs-1649", uf_info->container != UF_CONTAINER_EXTENTS); + + offset = 0; + if (inode_get_flag(inode, REISER4_PART_CONV)) { + /* find_start() doesn't need block reservation */ + result = find_start(inode, FORMATTING_ID, &offset); + if (result == -ENOENT) { + /* no tail items found, everything is converted */ + uf_info->container = UF_CONTAINER_EXTENTS; + complete_conversion(inode); + return 0; + } else if (result != 0) + /* some other error */ + return result; + } + + /* get key of first byte of a file */ + key_by_inode_unix_file(inode, offset, &key); + + done = 0; + result = 0; + first_iteration = 1; + while (done == 0) { + memset(pages, 0, sizeof (pages)); + all_grabbed2free(); + result = reserve_tail2extent_iteration(inode); + if (result != 0) + goto out; + if (first_iteration) { + inode_set_flag(inode, REISER4_PART_CONV); + reiser4_update_sd(inode); + first_iteration = 0; + } + bytes = 0; + for (i = 0; i < sizeof_array(pages) && done == 0; i++) { + assert("vs-598", (get_key_offset(&key) & ~PAGE_CACHE_MASK) == 0); + page = alloc_page(mapping_gfp_mask(inode->i_mapping)); + if (!page) { + result = RETERR(-ENOMEM); + goto error; + } + + page->index = (unsigned long) (get_key_offset(&key) >> PAGE_CACHE_SHIFT); + /* usually when one is going to longterm lock znode (as + find_file_item does, for instance) he must not hold + locked pages. However, there is an exception for + case tail2extent. 
Pages appearing here are not + reachable to everyone else, they are clean, they do + not have jnodes attached so keeping them locked do + not risk deadlock appearance + */ + assert("vs-983", !PagePrivate(page)); + + for (page_off = 0; page_off < PAGE_CACHE_SIZE;) { + coord_t coord; + lock_handle lh; + + /* get next item */ + /* FIXME: we might want to readahead here */ + init_lh(&lh); + result = find_file_item_nohint(&coord, &lh, &key, ZNODE_READ_LOCK, inode); + if (cbk_errored(result) || result == CBK_COORD_NOTFOUND) { + /* error happened of not items of file were found */ + done_lh(&lh); + page_cache_release(page); + goto error; + } + + if (coord.between == AFTER_UNIT) { + /* this is used to detect end of file when inode->i_size can not be used */ + done_lh(&lh); + done = 1; + p_data = kmap_atomic(page, KM_USER0); + memset(p_data + page_off, 0, PAGE_CACHE_SIZE - page_off); + kunmap_atomic(p_data, KM_USER0); + break; + } + + result = zload(coord.node); + if (result) { + page_cache_release(page); + done_lh(&lh); + goto error; + } + assert("vs-562", owns_item_unix_file(inode, &coord)); + assert("vs-856", coord.between == AT_UNIT); + assert("green-11", keyeq(&key, unit_key_by_coord(&coord, &tmp))); + item = ((char *)item_body_by_coord(&coord)) + coord.unit_pos; + + /* how many bytes to copy */ + count = item_length_by_coord(&coord) - coord.unit_pos; + /* limit length of copy to end of page */ + if (count > PAGE_CACHE_SIZE - page_off) + count = PAGE_CACHE_SIZE - page_off; + + /* kmap/kunmap are necessary for pages which are not addressable by direct kernel + virtual addresses */ + p_data = kmap_atomic(page, KM_USER0); + /* copy item (as much as will fit starting from the beginning of the item) into the + page */ + memcpy(p_data + page_off, item, (unsigned) count); + kunmap_atomic(p_data, KM_USER0); + + page_off += count; + bytes += count; + set_key_offset(&key, get_key_offset(&key) + count); + + zrelse(coord.node); + done_lh(&lh); + } /* end of loop which fills one 
page by content of formatting items */ + + if (page_off) { + /* something was copied into page */ + pages[i] = page; + } else { + page_cache_release(page); + assert("vs-1648", done == 1); + break; + } + } /* end of loop through pages of one conversion iteration */ + + if (i > 0) { + result = replace(inode, pages, i, bytes); + release_all_pages(pages, sizeof_array(pages)); + if (result) + goto error; + /* throttle the conversion */ + reiser4_throttle_write(inode); + } + } + + if (result == 0) { + /* file is converted to extent items */ + assert("vs-1697", inode_get_flag(inode, REISER4_PART_CONV)); + + uf_info->container = UF_CONTAINER_EXTENTS; + complete_conversion(inode); + } else { + /* conversion is not complete. Inode was already marked as + * REISER4_PART_CONV and stat-data were updated at the first + * iteration of the loop above. */ + error: + release_all_pages(pages, sizeof_array(pages)); + warning("nikita-2282", "Partial conversion of %llu: %i", + (unsigned long long)get_inode_oid(inode), result); + } + + out: + all_grabbed2free(); + return result; +} + + +/* part of extent2tail. Page contains data which are to be put into tree by + tail items. Use tail_write for this. flow is composed like in + unix_file_write. 
The only difference is that data for writing are in + kernel space */ +/* Audited by: green(2002.06.15) */ +static int +write_page_by_tail(struct inode *inode, struct page *page, unsigned count) +{ + flow_t f; + hint_t hint; + coord_t *coord; + lock_handle lh; + znode *loaded; + item_plugin *iplug; + int result; + + result = 0; + + assert("vs-1089", count); + assert("vs-1647", inode_file_plugin(inode)->flow_by_inode == flow_by_inode_unix_file); + + /* build flow */ + /* FIXME: do not kmap here */ + flow_by_inode_unix_file(inode, kmap(page), 0 /* not user space */ , + count, (loff_t) (page->index << PAGE_CACHE_SHIFT), WRITE_OP, &f); + iplug = item_plugin_by_id(FORMATTING_ID); + hint_init_zero(&hint); + init_lh(&lh); + hint.ext_coord.lh = &lh; + coord = &hint.ext_coord.coord; + while (f.length) { + result = find_file_item_nohint(coord, &lh, &f.key, ZNODE_WRITE_LOCK, inode); + if (IS_CBKERR(result)) + break; + + assert("vs-957", ergo(result == CBK_COORD_NOTFOUND, get_key_offset(&f.key) == 0)); + assert("vs-958", ergo(result == CBK_COORD_FOUND, get_key_offset(&f.key) != 0)); + + /*coord_clear_iplug(coord);*/ + result = zload(coord->node); + if (result) + break; + loaded = coord->node; + + result = iplug->s.file.write(inode, &f, &hint, 1/*grabbed*/, how_to_write(&hint.ext_coord, &f.key)); + zrelse(loaded); + done_lh(&lh); + + if (result == -E_REPEAT) + result = 0; + else if (result) + break; + } + + done_lh(&lh); + kunmap(page); + + /* result of write is 0 or error */ + assert("vs-589", result <= 0); + /* if result is 0 - all @count bytes is written completely */ + assert("vs-588", ergo(result == 0, f.length == 0)); + return result; +} + +static int +reserve_extent2tail_iteration(struct inode *inode) +{ + reiser4_tree *tree; + + tree = tree_by_inode(inode); + /* + * reserve blocks for (in this order): + * + * 1. removal of extent item + * + * 2. insertion of tail by insert_flow() + * + * 3. drilling to the leaf level by coord_by_key() + * + * 4. 
possible update of stat-data + */ + grab_space_enable(); + return reiser4_grab_space + (estimate_one_item_removal(tree) + + estimate_insert_flow(tree->height) + + 1 + estimate_one_insert_item(tree) + + inode_file_plugin(inode)->estimate.update(inode), + BA_CAN_COMMIT); +} + +/* for every page of file: read page, cut part of extent pointing to this page, + put data of page tree by tail item */ +reiser4_internal int +extent2tail(unix_file_info_t *uf_info) +{ + int result; + struct inode *inode; + struct page *page; + unsigned long num_pages, i; + unsigned long start_page; + reiser4_key from; + reiser4_key to; + unsigned count; + __u64 offset; + + assert("nikita-3362", ea_obtained(uf_info)); + inode = unix_file_info_to_inode(uf_info); + assert("nikita-3412", !IS_RDONLY(inode)); + assert("vs-1649", uf_info->container != UF_CONTAINER_TAILS); + + offset = 0; + if (inode_get_flag(inode, REISER4_PART_CONV)) { + /* find_start() doesn't need block reservation */ + result = find_start(inode, EXTENT_POINTER_ID, &offset); + if (result == -ENOENT) { + /* no extent found, everything is converted */ + uf_info->container = UF_CONTAINER_TAILS; + complete_conversion(inode); + return 0; + } else if (result != 0) + /* some other error */ + return result; + } + + /* number of pages in the file */ + num_pages = + (inode->i_size - offset + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + start_page = offset >> PAGE_CACHE_SHIFT; + + key_by_inode_unix_file(inode, offset, &from); + to = from; + + result = 0; + for (i = 0; i < num_pages; i++) { + __u64 start_byte; + + all_grabbed2free(); + result = reserve_extent2tail_iteration(inode); + if (result != 0) + break; + if (i == 0) { + inode_set_flag(inode, REISER4_PART_CONV); + reiser4_update_sd(inode); + } + + page = read_cache_page(inode->i_mapping, + (unsigned) (i + start_page), + readpage_unix_file/*filler*/, 0); + if (IS_ERR(page)) { + result = PTR_ERR(page); + break; + } + + wait_on_page_locked(page); + + if (!PageUptodate(page)) { + 
page_cache_release(page); + result = RETERR(-EIO); + break; + } + + /* cut part of file we have read */ + start_byte = (__u64) (i << PAGE_CACHE_SHIFT) + offset; + set_key_offset(&from, start_byte); + set_key_offset(&to, start_byte + PAGE_CACHE_SIZE - 1); + /* + * cut_tree_object() returns -E_REPEAT to allow atom + * commits during over-long truncates. But + * extent->tail conversion should be performed in one + * transaction. + */ + result = cut_tree(tree_by_inode(inode), &from, &to, inode, 0); + + if (result) { + page_cache_release(page); + break; + } + + /* put page data into tree via tail_write */ + count = PAGE_CACHE_SIZE; + if (i == num_pages - 1) + count = (inode->i_size & ~PAGE_CACHE_MASK) ? : PAGE_CACHE_SIZE; + result = write_page_by_tail(inode, page, count); + if (result) { + page_cache_release(page); + break; + } + + /* release page */ + lock_page(page); + /* page is already detached from jnode and mapping. */ + assert("vs-1086", page->mapping == NULL); + assert("nikita-2690", (!PagePrivate(page) && page->private == 0)); + /* waiting for writeback completion with page lock held is + * perfectly valid. */ + wait_on_page_writeback(page); + drop_page(page); + /* release reference taken by read_cache_page() above */ + page_cache_release(page); + } + + if (i == num_pages) { + /* file is converted to formatted items */ + assert("vs-1698", inode_get_flag(inode, REISER4_PART_CONV)); + assert("vs-1260", inode_has_no_jnodes(reiser4_inode_data(inode))); + + uf_info->container = UF_CONTAINER_TAILS; + complete_conversion(inode); + } else { + /* conversion is not complete. Inode was already marked as + * REISER4_PART_CONV and stat-data were updated at the first + * iteration of the loop above. 
*/ + warning("nikita-2282", + "Partial conversion of %llu: %lu of %lu: %i", + (unsigned long long)get_inode_oid(inode), i, + num_pages, result); + } + all_grabbed2free(); + return result; +} + +/* this is used to find which conversion did not complete */ +static int +find_first_item(struct inode *inode) +{ + coord_t coord; + lock_handle lh; + reiser4_key key; + int result; + + coord_init_zero(&coord); + init_lh(&lh); + key_by_inode_unix_file(inode, 0, &key); + result = find_file_item_nohint(&coord, &lh, &key, ZNODE_READ_LOCK, inode); + if (result == CBK_COORD_FOUND) { + if (coord.between == AT_UNIT) { + /*coord_clear_iplug(&coord);*/ + result = zload(coord.node); + if (result == 0) { + result = item_id_by_coord(&coord); + zrelse(coord.node); + if (result != EXTENT_POINTER_ID && result != FORMATTING_ID) + result = RETERR(-EIO); + } + } else + result = RETERR(-EIO); + } + done_lh(&lh); + return result; +} + +/* exclusive access is obtained. File may be "partially converted" - that is file body may have both formatting and + extent items. Find which conversion did not complete and complete it */ +reiser4_internal int +finish_conversion(struct inode *inode) +{ + int result; + + if (inode_get_flag(inode, REISER4_PART_CONV)) { + result = find_first_item(inode); + if (result == EXTENT_POINTER_ID) + /* first item is extent, therefore there was incomplete tail2extent conversion. Complete it */ + result = tail2extent(unix_file_inode_data(inode)); + else if (result == FORMATTING_ID) + /* first item is formatting item, therefore there was incomplete extent2tail + conversion.
Complete it */ + result = extent2tail(unix_file_inode_data(inode)); + } else + result = 0; + assert("vs-1712", ergo(result == 0, !inode_get_flag(inode, REISER4_PART_CONV))); + return result; +} + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/hash.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/hash.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,346 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Hash functions */ + +#include "../debug.h" +#include "plugin_header.h" +#include "plugin.h" +#include "../super.h" +#include "../inode.h" +#include "../plugin/dir/dir.h" + +#include + +/* old rupasov (yura) hash */ +static __u64 +hash_rupasov(const unsigned char *name /* name to hash */ , + int len /* @name's length */ ) +{ + int i; + int j; + int pow; + __u64 a; + __u64 c; + + assert("nikita-672", name != NULL); + assert("nikita-673", len >= 0); + + for (pow = 1, i = 1; i < len; ++i) + pow = pow * 10; + + if (len == 1) + a = name[0] - 48; + else + a = (name[0] - 48) * pow; + + for (i = 1; i < len; ++i) { + c = name[i] - 48; + for (pow = 1, j = i; j < len - 1; ++j) + pow = pow * 10; + a = a + c * pow; + } + for (; i < 40; ++i) { + c = '0' - 48; + for (pow = 1, j = i; j < len - 1; ++j) + pow = pow * 10; + a = a + c * pow; + } + + for (; i < 256; ++i) { + c = i; + for (pow = 1, j = i; j < len - 1; ++j) + pow = pow * 10; + a = a + c * pow; + } + + a = a << 7; + return a; +} + +/* r5 hash */ +static __u64 +hash_r5(const unsigned char *name /* name to hash */ , + int len UNUSED_ARG /* @name's length */ ) +{ + __u64 a = 0; + + assert("nikita-674", name != NULL); + assert("nikita-675", len >= 0); + + while (*name) { + a += *name << 4; + a += *name >> 4; + a *= 11; + name++; + } + return a; +} + +/* Keyed 32-bit hash function using TEA in a Davies-Meyer function + H0 = Key +
Hi = E Mi(Hi-1) + Hi-1 + + (see Applied Cryptography, 2nd edition, p448). + + Jeremy Fitzhardinge 1998 + + Jeremy has agreed to the contents of reiserfs/README. -Hans + + This code was blindly upgraded to __u64 by s/__u32/__u64/g. +*/ +static __u64 +hash_tea(const unsigned char *name /* name to hash */ , + int len /* @name's length */ ) +{ + __u64 k[] = { 0x9464a485u, 0x542e1a94u, 0x3e846bffu, 0xb75bcfc3u }; + + __u64 h0 = k[0], h1 = k[1]; + __u64 a, b, c, d; + __u64 pad; + int i; + + assert("nikita-676", name != NULL); + assert("nikita-677", len >= 0); + +#define DELTA 0x9E3779B9u +#define FULLROUNDS 10 /* 32 is overkill, 16 is strong crypto */ +#define PARTROUNDS 6 /* 6 gets complete mixing */ + +/* a, b, c, d - data; h0, h1 - accumulated hash */ +#define TEACORE(rounds) \ + do { \ + __u64 sum = 0; \ + int n = rounds; \ + __u64 b0, b1; \ + \ + b0 = h0; \ + b1 = h1; \ + \ + do \ + { \ + sum += DELTA; \ + b0 += ((b1 << 4)+a) ^ (b1+sum) ^ ((b1 >> 5)+b); \ + b1 += ((b0 << 4)+c) ^ (b0+sum) ^ ((b0 >> 5)+d); \ + } while(--n); \ + \ + h0 += b0; \ + h1 += b1; \ + } while(0) + + pad = (__u64) len | ((__u64) len << 8); + pad |= pad << 16; + + while (len >= 16) { + a = (__u64) name[0] | (__u64) name[1] << 8 | (__u64) name[2] << 16 | (__u64) name[3] << 24; + b = (__u64) name[4] | (__u64) name[5] << 8 | (__u64) name[6] << 16 | (__u64) name[7] << 24; + c = (__u64) name[8] | (__u64) name[9] << 8 | (__u64) name[10] << 16 | (__u64) name[11] << 24; + d = (__u64) name[12] | (__u64) name[13] << 8 | (__u64) name[14] << 16 | (__u64) name[15] << 24; + + TEACORE(PARTROUNDS); + + len -= 16; + name += 16; + } + + if (len >= 12) { + //assert(len < 16); + if (len >= 16) + *(int *) 0 = 0; + + a = (__u64) name[0] | (__u64) name[1] << 8 | (__u64) name[2] << 16 | (__u64) name[3] << 24; + b = (__u64) name[4] | (__u64) name[5] << 8 | (__u64) name[6] << 16 | (__u64) name[7] << 24; + c = (__u64) name[8] | (__u64) name[9] << 8 | (__u64) name[10] << 16 | (__u64) name[11] << 24; + + d = pad; + for (i = 
12; i < len; i++) { + d <<= 8; + d |= name[i]; + } + } else if (len >= 8) { + //assert(len < 12); + if (len >= 12) + *(int *) 0 = 0; + a = (__u64) name[0] | (__u64) name[1] << 8 | (__u64) name[2] << 16 | (__u64) name[3] << 24; + b = (__u64) name[4] | (__u64) name[5] << 8 | (__u64) name[6] << 16 | (__u64) name[7] << 24; + + c = d = pad; + for (i = 8; i < len; i++) { + c <<= 8; + c |= name[i]; + } + } else if (len >= 4) { + //assert(len < 8); + if (len >= 8) + *(int *) 0 = 0; + a = (__u64) name[0] | (__u64) name[1] << 8 | (__u64) name[2] << 16 | (__u64) name[3] << 24; + + b = c = d = pad; + for (i = 4; i < len; i++) { + b <<= 8; + b |= name[i]; + } + } else { + //assert(len < 4); + if (len >= 4) + *(int *) 0 = 0; + a = b = c = d = pad; + for (i = 0; i < len; i++) { + a <<= 8; + a |= name[i]; + } + } + + TEACORE(FULLROUNDS); + +/* return 0;*/ + return h0 ^ h1; + +} + +/* classical 64 bit Fowler/Noll/Vo-1 (FNV-1) hash. + + See http://www.isthe.com/chongo/tech/comp/fnv/ for details. + + Excerpts: + + FNV hashes are designed to be fast while maintaining a low collision + rate. + + [This version also seems to preserve lexicographical order locally.] + + FNV hash algorithms and source code have been released into the public + domain. 
+ +*/ +static __u64 +hash_fnv1(const unsigned char *name /* name to hash */ , + int len UNUSED_ARG /* @name's length */ ) +{ + unsigned long long a = 0xcbf29ce484222325ull; + const unsigned long long fnv_64_prime = 0x100000001b3ull; + + assert("nikita-678", name != NULL); + assert("nikita-679", len >= 0); + + /* FNV-1 hash each octet in the buffer */ + for (; *name; ++name) { + /* multiply by the 64 bit FNV magic prime mod 2^64 */ + a *= fnv_64_prime; + /* xor the bottom with the current octet */ + a ^= (unsigned long long) (*name); + } + /* return our new hash value */ + return a; +} + +/* degenerate hash function used to simplify testing of non-unique key + handling */ +static __u64 +hash_deg(const unsigned char *name UNUSED_ARG /* name to hash */ , + int len UNUSED_ARG /* @name's length */ ) +{ + return 0xc0c0c0c010101010ull; +} + +static int +change_hash(struct inode * inode, reiser4_plugin * plugin) +{ + int result; + + assert("nikita-3503", inode != NULL); + assert("nikita-3504", plugin != NULL); + + assert("nikita-3505", is_reiser4_inode(inode)); + assert("nikita-3506", inode_dir_plugin(inode) != NULL); + assert("nikita-3507", plugin->h.type_id == REISER4_HASH_PLUGIN_TYPE); + + result = 0; + if (inode_hash_plugin(inode) == NULL || + inode_hash_plugin(inode)->h.id != plugin->h.id) { + if (is_dir_empty(inode) == 0) + result = plugin_set_hash(&reiser4_inode_data(inode)->pset, + &plugin->hash); + else + result = RETERR(-ENOTEMPTY); + + } + return result; +} + +static reiser4_plugin_ops hash_plugin_ops = { + .init = NULL, + .load = NULL, + .save_len = NULL, + .save = NULL, + .change = change_hash +}; + +/* hash plugins */ +hash_plugin hash_plugins[LAST_HASH_ID] = { + [RUPASOV_HASH_ID] = { + .h = { + .type_id = REISER4_HASH_PLUGIN_TYPE, + .id = RUPASOV_HASH_ID, + .pops = &hash_plugin_ops, + .label = "rupasov", + .desc = "Original Yura's hash", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .hash = hash_rupasov + }, + [R5_HASH_ID] = { + .h = { + .type_id =
REISER4_HASH_PLUGIN_TYPE, + .id = R5_HASH_ID, + .pops = &hash_plugin_ops, + .label = "r5", + .desc = "r5 hash", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .hash = hash_r5 + }, + [TEA_HASH_ID] = { + .h = { + .type_id = REISER4_HASH_PLUGIN_TYPE, + .id = TEA_HASH_ID, + .pops = &hash_plugin_ops, + .label = "tea", + .desc = "tea hash", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .hash = hash_tea + }, + [FNV1_HASH_ID] = { + .h = { + .type_id = REISER4_HASH_PLUGIN_TYPE, + .id = FNV1_HASH_ID, + .pops = &hash_plugin_ops, + .label = "fnv1", + .desc = "fnv1 hash", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .hash = hash_fnv1 + }, + [DEGENERATE_HASH_ID] = { + .h = { + .type_id = REISER4_HASH_PLUGIN_TYPE, + .id = DEGENERATE_HASH_ID, + .pops = &hash_plugin_ops, + .label = "degenerate hash", + .desc = "Degenerate hash: only for testing", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .hash = hash_deg + } +}; + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/acl.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/acl.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,64 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Directory entry. */ + +#if !defined( __FS_REISER4_PLUGIN_DIRECTORY_ENTRY_H__ ) +#define __FS_REISER4_PLUGIN_DIRECTORY_ENTRY_H__ + +#include "../../forward.h" +#include "../../dformat.h" +#include "../../kassign.h" +#include "../../key.h" + +#include +#include /* for struct dentry */ + +typedef struct directory_entry_format { + /* key of object stat-data. It's not necessary to store whole + key here, because it's always key of stat-data, so minor + packing locality and offset can be omitted here. But this + relies on particular key allocation scheme for stat-data, so, + for extensibility sake, whole key can be stored here. 
+ + We store key as array of bytes, because we don't want 8-byte + alignment of dir entries. + */ + obj_key_id id; + /* file name. Null terminated string. */ + d8 name[0]; +} directory_entry_format; + +void print_de(const char *prefix, coord_t * coord); +int extract_key_de(const coord_t * coord, reiser4_key * key); +int update_key_de(const coord_t * coord, const reiser4_key * key, lock_handle * lh); +char *extract_name_de(const coord_t * coord, char *buf); +unsigned extract_file_type_de(const coord_t * coord); +int add_entry_de(struct inode *dir, coord_t * coord, + lock_handle * lh, const struct dentry *name, reiser4_dir_entry_desc * entry); +int rem_entry_de(struct inode *dir, const struct qstr * name, coord_t * coord, lock_handle * lh, reiser4_dir_entry_desc * entry); +int max_name_len_de(const struct inode *dir); + + +int de_rem_and_shrink(struct inode *dir, coord_t * coord, int length); + +char *extract_dent_name(const coord_t * coord, + directory_entry_format *dent, char *buf); + +#if REISER4_LARGE_KEY +#define DE_NAME_BUF_LEN (24) +#else +#define DE_NAME_BUF_LEN (16) +#endif + +/* __FS_REISER4_PLUGIN_DIRECTORY_ENTRY_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/blackbox.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/blackbox.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,142 @@ +/* Copyright 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Black box item implementation */ + +#include "../../forward.h" +#include "../../debug.h" +#include "../../dformat.h" +#include "../../kassign.h" +#include "../../coord.h" +#include "../../tree.h" +#include "../../lock.h" + +#include "blackbox.h" +#include "item.h" +#include "../plugin.h" + + +reiser4_internal int +store_black_box(reiser4_tree *tree, + const reiser4_key *key, void *data, int length) +{ + int result; + reiser4_item_data idata; + coord_t coord; + lock_handle lh; + + memset(&idata, 0, sizeof idata); + + idata.data = data; + idata.user = 0; + idata.length = length; + idata.iplug = item_plugin_by_id(BLACK_BOX_ID); + + init_lh(&lh); + result = insert_by_key(tree, key, + &idata, &coord, &lh, LEAF_LEVEL, CBK_UNIQUE); + + assert("nikita-3413", + ergo(result == 0, + WITH_COORD(&coord, item_length_by_coord(&coord) == length))); + + done_lh(&lh); + return result; +} + +reiser4_internal int +load_black_box(reiser4_tree *tree, + reiser4_key *key, void *data, int length, int exact) +{ + int result; + coord_t coord; + lock_handle lh; + + init_lh(&lh); + result = coord_by_key(tree, key, + &coord, &lh, ZNODE_READ_LOCK, + exact ? FIND_EXACT : FIND_MAX_NOT_MORE_THAN, + LEAF_LEVEL, LEAF_LEVEL, CBK_UNIQUE, NULL); + + if (result == 0) { + int ilen; + + result = zload(coord.node); + if (result == 0) { + ilen = item_length_by_coord(&coord); + if (ilen <= length) { + memcpy(data, item_body_by_coord(&coord), ilen); + unit_key_by_coord(&coord, key); + } else if (exact) { + /* + * item is larger than buffer provided by the + * user. Only issue a warning if @exact is + * set. 
If @exact is false, we are iterating + * over all safe-links and here we are reaching + * the end of the iteration. + */ + warning("nikita-3415", + "Wrong black box length: %i > %i", + ilen, length); + result = RETERR(-EIO); + } + zrelse(coord.node); + } + } + + done_lh(&lh); + return result; + +} + +reiser4_internal int +update_black_box(reiser4_tree *tree, + const reiser4_key *key, void *data, int length) +{ + int result; + coord_t coord; + lock_handle lh; + + init_lh(&lh); + result = coord_by_key(tree, key, + &coord, &lh, ZNODE_READ_LOCK, + FIND_EXACT, + LEAF_LEVEL, LEAF_LEVEL, CBK_UNIQUE, NULL); + if (result == 0) { + int ilen; + + result = zload(coord.node); + if (result == 0) { + ilen = item_length_by_coord(&coord); + if (length <= ilen) { + memcpy(item_body_by_coord(&coord), data, length); + } else { + warning("nikita-3437", + "Wrong black box length: %i < %i", + ilen, length); + result = RETERR(-EIO); + } + zrelse(coord.node); + } + } + + done_lh(&lh); + return result; + +} + +reiser4_internal int kill_black_box(reiser4_tree *tree, const reiser4_key *key) +{ + return cut_tree(tree, key, key, NULL, 1); +} + + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/blackbox.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/blackbox.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,33 @@ +/* Copyright 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* "Black box" entry to contain fixed-width user supplied data */ + +#if !defined( __FS_REISER4_BLACK_BOX_H__ ) +#define __FS_REISER4_BLACK_BOX_H__ + +#include "../../forward.h" +#include "../../dformat.h" +#include "../../kassign.h" +#include "../../key.h" + +extern int store_black_box(reiser4_tree *tree, + const reiser4_key *key, void *data, int length); +extern int load_black_box(reiser4_tree *tree, + reiser4_key *key, void *data, int length, int exact); +extern int kill_black_box(reiser4_tree *tree, const reiser4_key *key); +extern int update_black_box(reiser4_tree *tree, + const reiser4_key *key, void *data, int length); + +/* __FS_REISER4_BLACK_BOX_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/cde.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/cde.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1070 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Directory entry implementation */ + +/* DESCRIPTION: + + This is "compound" directory item plugin implementation. This directory + item type is compound (as opposed to the "simple directory item" in + fs/reiser4/plugin/item/sde.[ch]), because it consists of several directory + entries. + + The reason behind this decision is disk space efficiency: all directory + entries inside the same directory have an identical fragment in their + keys. This, of course, depends on key assignment policy.
In our default key + assignment policy, all directory entries have the same locality, which is + equal to the object id of their directory. + + Composing a directory item out of several directory entries for the same + directory allows us to store said key fragment only once. That is, this is + an ad hoc form of key compression (stem compression) that is implemented + here, because general key compression is not supposed to be implemented in + v4.0. + + Another decision that was made regarding all directory item plugins is + that they will store entry keys unaligned. This is for the sake of disk + space efficiency again. + + It should be noted that storing keys unaligned increases CPU consumption, + at least on some architectures. + + Internal on-disk structure of the compound directory item is the following: + + HEADER cde_item_format. Here the number of entries is stored. + ENTRY_HEADER_0 cde_unit_header. Here part of entry key and + ENTRY_HEADER_1 offset of entry body are stored. + ENTRY_HEADER_2 (basically two last parts of key) + ... + ENTRY_HEADER_N + ENTRY_BODY_0 directory_entry_format. Here part of stat data key and + ENTRY_BODY_1 NUL-terminated name are stored. + ENTRY_BODY_2 (part of stat data key in the + sense that since all SDs have + zero offset, this offset is not + stored on disk). + ... + ENTRY_BODY_N + + When it comes to balancing, each directory entry in a compound directory + item is a unit, that is, something that can be cut from one item and pasted + into another item of the same type. Handling of unit cut and paste is the major + reason for the complexity of the code below. 
+ +*/ + +#include "../../forward.h" +#include "../../debug.h" +#include "../../dformat.h" +#include "../../kassign.h" +#include "../../key.h" +#include "../../coord.h" +#include "sde.h" +#include "cde.h" +#include "item.h" +#include "../node/node.h" +#include "../plugin.h" +#include "../../znode.h" +#include "../../carry.h" +#include "../../tree.h" +#include "../../inode.h" + +#include /* for struct inode */ +#include /* for struct dentry */ +#include + +#if 0 +#define CHECKME(coord) \ +({ \ + const char *message; \ + coord_t dup; \ + \ + coord_dup_nocheck(&dup, (coord)); \ + dup.unit_pos = 0; \ + assert("nikita-2871", cde_check(&dup, &message) == 0); \ +}) +#else +#define CHECKME(coord) noop +#endif + + +/* return body of compound directory item at @coord */ +static inline cde_item_format * +formatted_at(const coord_t * coord) +{ + assert("nikita-1282", coord != NULL); + return item_body_by_coord(coord); +} + +/* return entry header at @coord */ +static inline cde_unit_header * +header_at(const coord_t * coord /* coord of item */ , + int idx /* index of unit */ ) +{ + assert("nikita-1283", coord != NULL); + return &formatted_at(coord)->entry[idx]; +} + +/* return number of units in compound directory item at @coord */ +static int +units(const coord_t * coord /* coord of item */ ) +{ + return d16tocpu(&formatted_at(coord)->num_of_entries); +} + +/* return offset of the body of @idx-th entry in @coord */ +static unsigned int +offset_of(const coord_t * coord /* coord of item */ , + int idx /* index of unit */ ) +{ + if (idx < units(coord)) + return d16tocpu(&header_at(coord, idx)->offset); + else if (idx == units(coord)) + return item_length_by_coord(coord); + else + impossible("nikita-1308", "Wrong idx"); + return 0; +} + +/* set offset of the body of @idx-th entry in @coord */ +static void +set_offset(const coord_t * coord /* coord of item */ , + int idx /* index of unit */ , + unsigned int offset /* new offset */ ) +{ + cputod16((__u16) offset, &header_at(coord, 
idx)->offset); +} + +static void +adj_offset(const coord_t * coord /* coord of item */ , + int idx /* index of unit */ , + int delta /* offset change */ ) +{ + d16 *doffset; + __u16 offset; + + doffset = &header_at(coord, idx)->offset; + offset = d16tocpu(doffset); + offset += delta; + cputod16((__u16) offset, doffset); +} + +/* return pointer to @offset-th byte from the beginning of @coord */ +static char * +address(const coord_t * coord /* coord of item */ , + int offset) +{ + return ((char *) item_body_by_coord(coord)) + offset; +} + +/* return pointer to the body of @idx-th entry in @coord */ +static directory_entry_format * +entry_at(const coord_t * coord /* coord of + * item */ , + int idx /* index of unit */ ) +{ + return (directory_entry_format *) address(coord, (int) offset_of(coord, idx)); +} + +/* return number of unit referenced by @coord */ +static int +idx_of(const coord_t * coord /* coord of item */ ) +{ + assert("nikita-1285", coord != NULL); + return coord->unit_pos; +} + +/* find position where entry with @entry_key would be inserted into @coord */ +static int +find(const coord_t * coord /* coord of item */ , + const reiser4_key * entry_key /* key to look for */ , + cmp_t * last /* result of last comparison */ ) +{ + int entries; + + int left; + int right; + + cde_unit_header *header; + + assert("nikita-1295", coord != NULL); + assert("nikita-1296", entry_key != NULL); + assert("nikita-1297", last != NULL); + + entries = units(coord); + left = 0; + right = entries - 1; + while (right - left >= REISER4_SEQ_SEARCH_BREAK) { + int median; + + median = (left + right) >> 1; + + header = header_at(coord, median); + *last = de_id_key_cmp(&header->hash, entry_key); + switch (*last) { + case LESS_THAN: + left = median; + break; + case GREATER_THAN: + right = median; + break; + case EQUAL_TO: { + do { + median --; + header --; + } while (median >= 0 && + de_id_key_cmp(&header->hash, + entry_key) == EQUAL_TO); + return median + 1; + } + } + } + header = 
header_at(coord, left); + for (; left < entries; ++ left, ++ header) { + prefetch(header + 1); + *last = de_id_key_cmp(&header->hash, entry_key); + if (*last != LESS_THAN) + break; + } + if (left < entries) + return left; + else + return RETERR(-ENOENT); + +} + +/* expand @coord so as to accommodate insertion of @no new entries starting + from @pos; @size is the length of the item content before expansion. */ +static int +expand_item(const coord_t * coord /* coord of item */ , + int pos /* unit position */ , int no /* number of new + * units*/ , + int size /* length of item content before expansion */ , + unsigned int data_size /* total size of the new + * entries' bodies */ ) +{ + int entries; + cde_unit_header *header; + char *dent; + int i; + + assert("nikita-1310", coord != NULL); + assert("nikita-1311", pos >= 0); + assert("nikita-1312", no > 0); + assert("nikita-1313", data_size >= no * sizeof (directory_entry_format)); + assert("nikita-1343", item_length_by_coord(coord) >= (int) (size + data_size + no * sizeof *header)); + + entries = units(coord); + + if (pos == entries) + dent = address(coord, size); + else + dent = (char *) entry_at(coord, pos); + /* place where the new headers will go */ + header = header_at(coord, pos); + /* free space for new entry headers */ + memmove(header + no, header, (unsigned) (address(coord, size) - (char *) header)); + /* if adding to the end, initialise the first new header */ + if (pos == entries) { + set_offset(coord, pos, (unsigned) size); + } + + /* adjust entry pointer and size */ + dent = dent + no * sizeof *header; + size += no * sizeof *header; + /* free space for new entries */ + memmove(dent + data_size, dent, (unsigned) (address(coord, size) - dent)); + + /* increase counter */ + entries += no; + cputod16((__u16) entries, &formatted_at(coord)->num_of_entries); + + /* [ 0 ... pos ] entries were shifted by no * ( sizeof *header ) + bytes. */ + for (i = 0; i <= pos; ++i) + adj_offset(coord, i, no * sizeof *header); + /* [ pos + no ... 
+\infty ) entries were shifted by ( no * + sizeof *header + data_size ) bytes */ + for (i = pos + no; i < entries; ++i) + adj_offset(coord, i, no * sizeof *header + data_size); + return 0; +} + +/* insert new @entry into item */ +static int +expand(const coord_t * coord /* coord of item */ , + cde_entry * entry /* entry to insert */ , + int len /* length of @entry data */ , + int *pos /* position to insert */ , + reiser4_dir_entry_desc * dir_entry /* parameters for new + * entry */ ) +{ + cmp_t cmp_res; + int datasize; + + *pos = find(coord, &dir_entry->key, &cmp_res); + if (*pos < 0) + *pos = units(coord); + + datasize = sizeof (directory_entry_format); + if (is_longname(entry->name->name, entry->name->len)) + datasize += entry->name->len + 1; + + expand_item(coord, *pos, 1, item_length_by_coord(coord) - len, datasize); + return 0; +} + +/* paste body of @entry into item */ +static int +paste_entry(const coord_t * coord /* coord of item */ , + cde_entry * entry /* new entry */ , + int pos /* position to insert */ , + reiser4_dir_entry_desc * dir_entry /* parameters for + * new entry */ ) +{ + cde_unit_header *header; + directory_entry_format *dent; + const char *name; + int len; + + header = header_at(coord, pos); + dent = entry_at(coord, pos); + + build_de_id_by_key(&dir_entry->key, &header->hash); + build_inode_key_id(entry->obj, &dent->id); + /* AUDIT unsafe strcpy() operation! 
It should be replaced with the + much less CPU-hungry + memcpy( ( char * ) dent -> name, entry -> name -> name , entry -> name -> len ); + + A more important point is that there should be a way to figure out the + amount of space in dent -> name and to check that we are + not going to overwrite more than we are supposed to */ + name = entry->name->name; + len = entry->name->len; + if (is_longname(name, len)) { + strcpy((unsigned char *) dent->name, name); + cputod8(0, &dent->name[len]); + } + return 0; +} + +/* estimate how much space is necessary in the item to insert/paste the set of entries + described in @data. */ +reiser4_internal int +estimate_cde(const coord_t * coord /* coord of item */ , + const reiser4_item_data * data /* parameters for new item */ ) +{ + cde_entry_data *e; + int result; + int i; + + e = (cde_entry_data *) data->data; + + assert("nikita-1288", e != NULL); + assert("nikita-1289", e->num_of_entries >= 0); + + if (coord == NULL) + /* insert */ + result = sizeof (cde_item_format); + else + /* paste */ + result = 0; + + result += e->num_of_entries * + (sizeof (cde_unit_header) + sizeof (directory_entry_format)); + for (i = 0; i < e->num_of_entries; ++i) { + const char *name; + int len; + + name = e->entry[i].name->name; + len = e->entry[i].name->len; + assert("nikita-2054", strlen(name) == len); + if (is_longname(name, len)) + result += len + 1; + } + ((reiser4_item_data *) data)->length = result; + return result; +} + +/* ->nr_units() method for this item plugin. */ +reiser4_internal pos_in_node_t +nr_units_cde(const coord_t * coord /* coord of item */ ) +{ + return units(coord); +} + +/* ->unit_key() method for this item plugin. 
*/ +reiser4_internal reiser4_key * +unit_key_cde(const coord_t * coord /* coord of item */ , + reiser4_key * key /* resulting key */ ) +{ + assert("nikita-1452", coord != NULL); + assert("nikita-1345", idx_of(coord) < units(coord)); + assert("nikita-1346", key != NULL); + + item_key_by_coord(coord, key); + extract_key_from_de_id(extract_dir_id_from_key(key), &header_at(coord, idx_of(coord))->hash, key); + return key; +} + +/* mergeable_cde(): implementation of ->mergeable() item method. + + Two directory items are mergeable iff they are from the same + directory. That simple. + +*/ +reiser4_internal int +mergeable_cde(const coord_t * p1 /* coord of first item */ , + const coord_t * p2 /* coord of second item */ ) +{ + reiser4_key k1; + reiser4_key k2; + + assert("nikita-1339", p1 != NULL); + assert("nikita-1340", p2 != NULL); + + return + (item_plugin_by_coord(p1) == item_plugin_by_coord(p2)) && + (extract_dir_id_from_key(item_key_by_coord(p1, &k1)) == + extract_dir_id_from_key(item_key_by_coord(p2, &k2))); + +} + +/* ->max_key_inside() method for this item plugin. */ +reiser4_internal reiser4_key * +max_key_inside_cde(const coord_t * coord /* coord of item */ , + reiser4_key * result /* resulting key */) +{ + assert("nikita-1342", coord != NULL); + + item_key_by_coord(coord, result); + set_key_ordering(result, get_key_ordering(max_key())); + set_key_fulloid(result, get_key_fulloid(max_key())); + set_key_offset(result, get_key_offset(max_key())); + return result; +} + +/* @data contains data which are to be put into tree */ +reiser4_internal int +can_contain_key_cde(const coord_t * coord /* coord of item */ , + const reiser4_key * key /* key to check */ , + const reiser4_item_data * data /* parameters of new + * item/unit being + * created */ ) +{ + reiser4_key item_key; + + /* FIXME-VS: do not rely on anything but iplug field of @data. 
Only + data->iplug is initialized */ + assert("vs-457", data && data->iplug); +/* assert( "vs-553", data -> user == 0 );*/ + item_key_by_coord(coord, &item_key); + + return (item_plugin_by_coord(coord) == data->iplug) && + (extract_dir_id_from_key(&item_key) == extract_dir_id_from_key(key)); +} + +#if REISER4_DEBUG_OUTPUT +/* ->print() method for this item plugin. */ +reiser4_internal void +print_cde(const char *prefix /* prefix to print */ , + coord_t * coord /* coord of item to print */ ) +{ + assert("nikita-1077", prefix != NULL); + assert("nikita-1078", coord != NULL); + + if (item_length_by_coord(coord) < (int) sizeof (cde_item_format)) { + printk("%s: wrong size: %i < %i\n", prefix, item_length_by_coord(coord), sizeof (cde_item_format)); + } else { + char *name; + char *end; + char *start; + int i; + oid_t dirid; + reiser4_key key; + + start = address(coord, 0); + end = address(coord, item_length_by_coord(coord)); + item_key_by_coord(coord, &key); + dirid = extract_dir_id_from_key(&key); + + printk("%s: units: %i\n", prefix, nr_units_cde(coord)); + for (i = 0; i < units(coord); ++i) { + cde_unit_header *header; + + header = header_at(coord, i); + indent_znode(coord->node); + printk("\theader %i: ", i); + if ((char *) (header + 1) > end) { + printk("out of bounds: %p [%p, %p]\n", header, start, end); + } else { + extract_key_from_de_id(dirid, &header->hash, &key); + printk("%i: at %i, offset: %i, ", i, i * sizeof (*header), d16tocpu(&header->offset)); + print_key("key", &key); + } + } + for (i = 0; i < units(coord); ++i) { + directory_entry_format *entry; + char buf[DE_NAME_BUF_LEN]; + + entry = entry_at(coord, i); + indent_znode(coord->node); + printk("\tentry: %i: ", i); + if (((char *) (entry + 1) > end) || ((char *) entry < start)) { + printk("out of bounds: %p [%p, %p]\n", entry, start, end); + } else { + coord->unit_pos = i; + extract_key_cde(coord, &key); + name = extract_name_cde(coord, buf); + printk("at %i, name: %s, ", (char *) entry - start, name); 
+ print_key("sdkey", &key); + } + } + } +} +#endif + +#if REISER4_DEBUG +/* cde_check ->check() method for compressed directory items + + used for debugging, every item should have here the most complete + possible check of the consistency of the item that the inventor can + construct +*/ +reiser4_internal int +check_cde(const coord_t * coord /* coord of item to check */ , + const char **error /* where to store error message */ ) +{ + int i; + int result; + char *item_start; + char *item_end; + reiser4_key key; + + coord_t c; + + assert("nikita-1357", coord != NULL); + assert("nikita-1358", error != NULL); + + if (!ergo(coord->item_pos != 0, + is_dot_key(item_key_by_coord(coord, &key)))) { + *error = "CDE doesn't start with dot"; + return -1; + } + item_start = item_body_by_coord(coord); + item_end = item_start + item_length_by_coord(coord); + + coord_dup(&c, coord); + result = 0; + for (i = 0; i < units(coord); ++i) { + directory_entry_format *entry; + + if ((char *) (header_at(coord, i) + 1) > item_end - units(coord) * sizeof *entry) { + *error = "CDE header is out of bounds"; + result = -1; + break; + } + entry = entry_at(coord, i); + if ((char *) entry < item_start + sizeof (cde_item_format)) { + *error = "CDE header is too low"; + result = -1; + break; + } + if ((char *) (entry + 1) > item_end) { + *error = "CDE header is too high"; + result = -1; + break; + } + } + + return result; +} +#endif + +/* ->init() method for this item plugin. */ +reiser4_internal int +init_cde(coord_t * coord /* coord of item */ , + coord_t * from UNUSED_ARG, + reiser4_item_data * data /* structure used for insertion */ + UNUSED_ARG) +{ + cputod16(0u, &formatted_at(coord)->num_of_entries); + return 0; +} + +/* ->lookup() method for this item plugin. 
*/ +reiser4_internal lookup_result +lookup_cde(const reiser4_key * key /* key to search for */ , + lookup_bias bias /* search bias */ , + coord_t * coord /* coord of item to lookup in */ ) +{ + cmp_t last_comp; + int pos; + + reiser4_key utmost_key; + + assert("nikita-1293", coord != NULL); + assert("nikita-1294", key != NULL); + + CHECKME(coord); + + if (keygt(item_key_by_coord(coord, &utmost_key), key)) { + coord->unit_pos = 0; + coord->between = BEFORE_UNIT; + return CBK_COORD_NOTFOUND; + } + pos = find(coord, key, &last_comp); + if (pos >= 0) { + coord->unit_pos = (int) pos; + switch (last_comp) { + case EQUAL_TO: + coord->between = AT_UNIT; + return CBK_COORD_FOUND; + case GREATER_THAN: + coord->between = BEFORE_UNIT; + return RETERR(-ENOENT); + case LESS_THAN: + default: + impossible("nikita-1298", "Broken find"); + return RETERR(-EIO); + } + } else { + coord->unit_pos = units(coord) - 1; + coord->between = AFTER_UNIT; + return (bias == FIND_MAX_NOT_MORE_THAN) ? CBK_COORD_FOUND : CBK_COORD_NOTFOUND; + } +} + +/* ->paste() method for this item plugin. */ +reiser4_internal int +paste_cde(coord_t * coord /* coord of item */ , + reiser4_item_data * data /* parameters of new unit being + * inserted */ , + carry_plugin_info * info UNUSED_ARG /* todo carry queue */ ) +{ + cde_entry_data *e; + int result; + int i; + + CHECKME(coord); + e = (cde_entry_data *) data->data; + + result = 0; + for (i = 0; i < e->num_of_entries; ++i) { + int pos; + int phantom_size; + + phantom_size = data->length; + if (units(coord) == 0) + phantom_size -= sizeof (cde_item_format); + + result = expand(coord, e->entry + i, phantom_size, &pos, data->arg); + if (result != 0) + break; + result = paste_entry(coord, e->entry + i, pos, data->arg); + if (result != 0) + break; + } + CHECKME(coord); + return result; +} + +/* amount of space occupied by all entries starting from @idx both headers and + bodies. 
*/ +static unsigned int +part_size(const coord_t * coord /* coord of item */ , + int idx /* index of unit */ ) +{ + assert("nikita-1299", coord != NULL); + assert("nikita-1300", idx < (int) units(coord)); + + return sizeof (cde_item_format) + + (idx + 1) * sizeof (cde_unit_header) + offset_of(coord, idx + 1) - offset_of(coord, 0); +} + +/* how many units of @source, but not more than @want, can be merged with + the item in @target node. If pend == SHIFT_LEFT, we try to append the first + units of @source to the last item of @target. If pend == SHIFT_RIGHT, we try + to prepend the last units of @source to the first item in @target. @target + node has @free_space bytes of free space. The total size of those units + is returned via @size */ +reiser4_internal int +can_shift_cde(unsigned free_space /* free space in target node */ , + coord_t * coord /* coord of source item */ , + znode * target /* target node */ , + shift_direction pend /* shift direction */ , + unsigned *size /* resulting number of shifted bytes */ , + unsigned want /* maximal number of units to shift */ ) +{ + int shift; + + CHECKME(coord); + if (want == 0) { + *size = 0; + return 0; + } + + /* pend == SHIFT_LEFT <==> shifting to the left */ + if (pend == SHIFT_LEFT) { + for (shift = min((int) want - 1, units(coord)); shift >= 0; --shift) { + *size = part_size(coord, shift); + if (target != NULL) + *size -= sizeof (cde_item_format); + if (*size <= free_space) + break; + } + shift = shift + 1; + } else { + int total_size; + + assert("nikita-1301", pend == SHIFT_RIGHT); + + total_size = item_length_by_coord(coord); + for (shift = units(coord) - want - 1; shift < units(coord) - 1; ++shift) { + *size = total_size - part_size(coord, shift); + if (target == NULL) + *size += sizeof (cde_item_format); + if (*size <= free_space) + break; + } + shift = units(coord) - shift - 1; + } + if (shift == 0) + *size = 0; + CHECKME(coord); + return shift; +} + +/* ->copy_units() method for this item plugin. 
*/ +reiser4_internal void +copy_units_cde(coord_t * target /* coord of target item */ , + coord_t * source /* coord of source item */ , + unsigned from /* starting unit */ , + unsigned count /* how many units to copy */ , + shift_direction where_is_free_space /* shift direction */ , + unsigned free_space /* free space in item */ ) +{ + char *header_from; + char *header_to; + + char *entry_from; + char *entry_to; + + int pos_in_target; + int data_size; + int data_delta; + int i; + + assert("nikita-1303", target != NULL); + assert("nikita-1304", source != NULL); + assert("nikita-1305", (int) from < units(source)); + assert("nikita-1307", (int) (from + count) <= units(source)); + + if (where_is_free_space == SHIFT_LEFT) { + assert("nikita-1453", from == 0); + pos_in_target = units(target); + } else { + assert("nikita-1309", (int) (from + count) == units(source)); + pos_in_target = 0; + memmove(item_body_by_coord(target), + (char *) item_body_by_coord(target) + free_space, item_length_by_coord(target) - free_space); + } + + CHECKME(target); + CHECKME(source); + + /* expand @target */ + data_size = offset_of(source, (int) (from + count)) - offset_of(source, (int) from); + + if (units(target) == 0) + free_space -= sizeof (cde_item_format); + + expand_item(target, pos_in_target, (int) count, + (int) (item_length_by_coord(target) - free_space), (unsigned) data_size); + + /* copy first @count units of @source into @target */ + data_delta = offset_of(target, pos_in_target) - offset_of(source, (int) from); + + /* copy entries */ + entry_from = (char *) entry_at(source, (int) from); + entry_to = (char *) entry_at(source, (int) (from + count)); + memmove(entry_at(target, pos_in_target), entry_from, (unsigned) (entry_to - entry_from)); + + /* copy headers */ + header_from = (char *) header_at(source, (int) from); + header_to = (char *) header_at(source, (int) (from + count)); + memmove(header_at(target, pos_in_target), header_from, (unsigned) (header_to - header_from)); + + /* 
update offsets */ + for (i = pos_in_target; i < (int) (pos_in_target + count); ++i) + adj_offset(target, i, data_delta); + CHECKME(target); + CHECKME(source); +} + +/* ->cut_units() method for this item plugin. */ +reiser4_internal int +cut_units_cde(coord_t * coord /* coord of item */ , + pos_in_node_t from /* start unit pos */ , + pos_in_node_t to /* stop unit pos */ , + struct carry_cut_data *cdata UNUSED_ARG, reiser4_key *smallest_removed, + reiser4_key *new_first) +{ + char *header_from; + char *header_to; + + char *entry_from; + char *entry_to; + + int size; + int entry_delta; + int header_delta; + int i; + + unsigned count; + + CHECKME(coord); + + count = to - from + 1; + + assert("nikita-1454", coord != NULL); + assert("nikita-1455", (int) (from + count) <= units(coord)); + + if (smallest_removed) + unit_key_by_coord(coord, smallest_removed); + + if (new_first) { + coord_t next; + + /* not everything is cut from item head */ + assert("vs-1527", from == 0); + assert("vs-1528", to < units(coord) - 1); + + coord_dup(&next, coord); + next.unit_pos ++; + unit_key_by_coord(&next, new_first); + } + + size = item_length_by_coord(coord); + if (count == (unsigned) units(coord)) { + return size; + } + + header_from = (char *) header_at(coord, (int) from); + header_to = (char *) header_at(coord, (int) (from + count)); + + entry_from = (char *) entry_at(coord, (int) from); + entry_to = (char *) entry_at(coord, (int) (from + count)); + + /* move headers */ + memmove(header_from, header_to, (unsigned) (address(coord, size) - header_to)); + + header_delta = header_to - header_from; + + entry_from -= header_delta; + entry_to -= header_delta; + size -= header_delta; + + /* copy entries */ + memmove(entry_from, entry_to, (unsigned) (address(coord, size) - entry_to)); + + entry_delta = entry_to - entry_from; + size -= entry_delta; + + /* update offsets */ + + for (i = 0; i < (int) from; ++i) + adj_offset(coord, i, - header_delta); + + for (i = from; i < units(coord) - (int) 
count; ++i) + adj_offset(coord, i, - header_delta - entry_delta); + + cputod16((__u16) units(coord) - count, &formatted_at(coord)->num_of_entries); + + if (from == 0) { + /* entries were removed from the head - move the remainder to the right */ + memmove((char *) item_body_by_coord(coord) + + header_delta + entry_delta, item_body_by_coord(coord), (unsigned) size); + if (REISER4_DEBUG) + memset(item_body_by_coord(coord), 0, (unsigned) header_delta + entry_delta); + } else { + /* freed space is already at the end of item */ + if (REISER4_DEBUG) + memset((char *) item_body_by_coord(coord) + size, 0, (unsigned) header_delta + entry_delta); + } + + return header_delta + entry_delta; +} + +reiser4_internal int +kill_units_cde(coord_t * coord /* coord of item */ , + pos_in_node_t from /* start unit pos */ , + pos_in_node_t to /* stop unit pos */ , + struct carry_kill_data *kdata UNUSED_ARG, reiser4_key *smallest_removed, + reiser4_key *new_first) +{ + return cut_units_cde(coord, from, to, 0, smallest_removed, new_first); +} + +/* ->s.dir.extract_key() method for this item plugin. */ +reiser4_internal int +extract_key_cde(const coord_t * coord /* coord of item */ , + reiser4_key * key /* resulting key */ ) +{ + directory_entry_format *dent; + + assert("nikita-1155", coord != NULL); + assert("nikita-1156", key != NULL); + + dent = entry_at(coord, idx_of(coord)); + return extract_key_from_id(&dent->id, key); +} + +reiser4_internal int +update_key_cde(const coord_t * coord, const reiser4_key * key, lock_handle * lh UNUSED_ARG) +{ + directory_entry_format *dent; + obj_key_id obj_id; + int result; + + assert("nikita-2344", coord != NULL); + assert("nikita-2345", key != NULL); + + dent = entry_at(coord, idx_of(coord)); + result = build_obj_key_id(key, &obj_id); + if (result == 0) { + dent->id = obj_id; + znode_make_dirty(coord->node); + } + return 0; +} + +/* ->s.dir.extract_name() method for this item plugin. 
*/ +reiser4_internal char * +extract_name_cde(const coord_t * coord /* coord of item */, char *buf) +{ + directory_entry_format *dent; + + assert("nikita-1157", coord != NULL); + + dent = entry_at(coord, idx_of(coord)); + return extract_dent_name(coord, dent, buf); +} + +static int +cde_bytes(int pasting, const reiser4_item_data * data) +{ + int result; + + result = data->length; + if (!pasting) + result -= sizeof (cde_item_format); + return result; +} + +/* ->s.dir.add_entry() method for this item plugin */ +reiser4_internal int +add_entry_cde(struct inode *dir /* directory object */ , + coord_t * coord /* coord of item */ , + lock_handle * lh /* lock handle for insertion */ , + const struct dentry *name /* name to insert */ , + reiser4_dir_entry_desc * dir_entry /* parameters of new + * directory entry */ ) +{ + reiser4_item_data data; + cde_entry entry; + cde_entry_data edata; + int result; + + assert("nikita-1656", coord->node == lh->node); + assert("nikita-1657", znode_is_write_locked(coord->node)); + + edata.num_of_entries = 1; + edata.entry = &entry; + + entry.dir = dir; + entry.obj = dir_entry->obj; + entry.name = &name->d_name; + + data.data = (char *) &edata; + data.user = 0; /* &edata is not user space */ + data.iplug = item_plugin_by_id(COMPOUND_DIR_ID); + data.arg = dir_entry; + assert("nikita-1302", data.iplug != NULL); + + result = is_dot_key(&dir_entry->key); + data.length = estimate_cde(result ? coord : NULL, &data); + + /* NOTE-NIKITA quota plugin? 
*/ + if (DQUOT_ALLOC_SPACE_NODIRTY(dir, cde_bytes(result, &data))) + return RETERR(-EDQUOT); + + if (result) + result = insert_by_coord(coord, &data, &dir_entry->key, lh, 0); + else + result = resize_item(coord, &data, &dir_entry->key, lh, 0); + return result; +} + +/* ->s.dir.rem_entry() */ +reiser4_internal int +rem_entry_cde(struct inode *dir /* directory of item */ , + const struct qstr * name, + coord_t * coord /* coord of item */ , + lock_handle * lh UNUSED_ARG /* lock handle for + * removal */ , + reiser4_dir_entry_desc * entry UNUSED_ARG /* parameters of + * directory entry + * being removed */ ) +{ + coord_t shadow; + int result; + int length; + ON_DEBUG(char buf[DE_NAME_BUF_LEN]); + + assert("nikita-2870", strlen(name->name) == name->len); + assert("nikita-2869", !strcmp(name->name, extract_name_cde(coord, buf))); + + length = sizeof (directory_entry_format) + sizeof (cde_unit_header); + if (is_longname(name->name, name->len)) + length += name->len + 1; + + if (inode_get_bytes(dir) < length) { + warning("nikita-2628", "Dir is broke: %llu: %llu", + (unsigned long long)get_inode_oid(dir), + inode_get_bytes(dir)); + + return RETERR(-EIO); + } + + /* cut_node() is supposed to take pointers to _different_ + coords, because it will modify them without respect to + possible aliasing. To work around this, create temporary copy + of @coord. + */ + coord_dup(&shadow, coord); + result = kill_node_content(coord, &shadow, NULL, NULL, NULL, NULL, NULL, 0); + if (result == 0) { + /* NOTE-NIKITA quota plugin? */ + DQUOT_FREE_SPACE_NODIRTY(dir, length); + } + return result; +} + +/* ->s.dir.max_name_len() method for this item plugin */ +reiser4_internal int +max_name_len_cde(const struct inode *dir /* directory */ ) +{ + return + tree_by_inode(dir)->nplug->max_item_size() - + sizeof (directory_entry_format) - sizeof (cde_item_format) - sizeof (cde_unit_header) - 2; +} + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/cde.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/cde.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,78 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Compound directory item. See cde.c for description. */ + +#if !defined( __FS_REISER4_PLUGIN_COMPRESSED_DE_H__ ) +#define __FS_REISER4_PLUGIN_COMPRESSED_DE_H__ + +#include "../../forward.h" +#include "../../kassign.h" +#include "../../dformat.h" + +#include /* for struct inode */ +#include /* for struct dentry, etc */ + +typedef struct cde_unit_header { + de_id hash; + d16 offset; +} cde_unit_header; + +typedef struct cde_item_format { + d16 num_of_entries; + cde_unit_header entry[0]; +} cde_item_format; + +typedef struct cde_entry { + const struct inode *dir; + const struct inode *obj; + const struct qstr *name; +} cde_entry; + +typedef struct cde_entry_data { + int num_of_entries; + cde_entry *entry; +} cde_entry_data; + +/* plugin->item.b.* */ +reiser4_key *max_key_inside_cde(const coord_t * coord, reiser4_key * result); +int can_contain_key_cde(const coord_t * coord, const reiser4_key * key, const reiser4_item_data *); +int mergeable_cde(const coord_t * p1, const coord_t * p2); +pos_in_node_t nr_units_cde(const coord_t * coord); +reiser4_key *unit_key_cde(const coord_t * coord, reiser4_key * key); +int estimate_cde(const coord_t * coord, const reiser4_item_data * data); +void print_cde(const char *prefix, coord_t * coord); +int init_cde(coord_t * coord, coord_t * from, reiser4_item_data * data); +lookup_result lookup_cde(const reiser4_key * key, lookup_bias bias, coord_t * coord); +int paste_cde(coord_t * coord, reiser4_item_data * data, carry_plugin_info * info UNUSED_ARG); +int can_shift_cde(unsigned free_space, coord_t * coord, + znode * target, shift_direction pend, 
unsigned *size, unsigned want); +void copy_units_cde(coord_t * target, coord_t * source, + unsigned from, unsigned count, shift_direction where_is_free_space, unsigned free_space); +int cut_units_cde(coord_t * coord, pos_in_node_t from, pos_in_node_t to, + struct carry_cut_data *, reiser4_key * smallest_removed, reiser4_key *new_first); +int kill_units_cde(coord_t * coord, pos_in_node_t from, pos_in_node_t to, + struct carry_kill_data *, reiser4_key * smallest_removed, reiser4_key *new_first); +void print_cde(const char *prefix, coord_t * coord); +int check_cde(const coord_t * coord, const char **error); + +/* plugin->u.item.s.dir.* */ +int extract_key_cde(const coord_t * coord, reiser4_key * key); +int update_key_cde(const coord_t * coord, const reiser4_key * key, lock_handle * lh); +char *extract_name_cde(const coord_t * coord, char *buf); +int add_entry_cde(struct inode *dir, coord_t * coord, + lock_handle * lh, const struct dentry *name, reiser4_dir_entry_desc * entry); +int rem_entry_cde(struct inode *dir, const struct qstr * name, coord_t * coord, lock_handle * lh, reiser4_dir_entry_desc * entry); +int max_name_len_cde(const struct inode *dir); + +/* __FS_REISER4_PLUGIN_COMPRESSED_DE_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/ctail.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/ctail.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1627 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* ctails (aka "clustered tails") are items for cryptcompress objects */ + +/* DESCRIPTION: + +Each cryptcompress object is stored on disk as a set of clusters sliced +into ctails. 
+ +Internal on-disk structure: + + HEADER (1) Here stored disk cluster shift + BODY +*/ + +#include "../../forward.h" +#include "../../debug.h" +#include "../../dformat.h" +#include "../../kassign.h" +#include "../../key.h" +#include "../../coord.h" +#include "item.h" +#include "../node/node.h" +#include "../plugin.h" +#include "../object.h" +#include "../../znode.h" +#include "../../carry.h" +#include "../../tree.h" +#include "../../inode.h" +#include "../../super.h" +#include "../../context.h" +#include "../../page_cache.h" +#include "../../cluster.h" +#include "../../flush.h" +#include "../../tree_walk.h" +#include "../file/funcs.h" + +#include +#include +#include + +/* return body of ctail item at @coord */ +static ctail_item_format * +ctail_formatted_at(const coord_t * coord) +{ + assert("edward-60", coord != NULL); + return item_body_by_coord(coord); +} + +reiser4_internal int +cluster_shift_by_coord(const coord_t * coord) +{ + return d8tocpu(&ctail_formatted_at(coord)->cluster_shift); +} + +static unsigned long +pg_by_coord(const coord_t * coord) +{ + reiser4_key key; + + return get_key_offset(item_key_by_coord(coord, &key)) >> PAGE_CACHE_SHIFT; +} + +static int +coord_is_unprepped_ctail(const coord_t * coord) +{ + assert("edward-1233", coord != NULL); + assert("edward-1234", item_id_by_coord(coord) == CTAIL_ID); + assert("edward-1235", + ergo((int)cluster_shift_by_coord(coord) == (int)UCTAIL_SHIFT, + nr_units_ctail(coord) == (pos_in_node_t)UCTAIL_NR_UNITS)); + + return (int)cluster_shift_by_coord(coord) == (int)UCTAIL_SHIFT; +} + +reiser4_internal unsigned long +clust_by_coord(const coord_t * coord, struct inode * inode) +{ + int shift; + + if (inode != NULL) { + shift = inode_cluster_shift(inode); + assert("edward-1236", + ergo(!coord_is_unprepped_ctail(coord), + shift == cluster_shift_by_coord(coord))); + } + else { + assert("edward-1237", !coord_is_unprepped_ctail(coord)); + shift = cluster_shift_by_coord(coord); + } + return pg_by_coord(coord) >> shift; 
+} + +static unsigned long +disk_cluster_size (const coord_t * coord) +{ + assert("edward-1156", + item_plugin_by_coord(coord) == item_plugin_by_id(CTAIL_ID)); + /* calculation of disk cluster size + is meaningless if ctail is unprepped */ + assert("edward-1238", !coord_is_unprepped_ctail(coord)); + + return PAGE_CACHE_SIZE << cluster_shift_by_coord(coord); +} + +/* true if the key is of first disk cluster item */ +static int +is_disk_cluster_key(const reiser4_key * key, const coord_t * coord) +{ + assert("edward-1239", item_id_by_coord(coord) == CTAIL_ID); + + return coord_is_unprepped_ctail(coord) || + ((get_key_offset(key) & ((loff_t)disk_cluster_size(coord) - 1)) == 0); +} + +static char * +first_unit(coord_t * coord) +{ + /* FIXME: warning: pointer of type `void *' used in arithmetic */ + return (char *)item_body_by_coord(coord) + sizeof (ctail_item_format); +} + +/* plugin->u.item.b.max_key_inside : + tail_max_key_inside */ + +/* plugin->u.item.b.can_contain_key */ +reiser4_internal int +can_contain_key_ctail(const coord_t *coord, const reiser4_key *key, const reiser4_item_data *data) +{ + reiser4_key item_key; + + if (item_plugin_by_coord(coord) != data->iplug) + return 0; + + item_key_by_coord(coord, &item_key); + if (get_key_locality(key) != get_key_locality(&item_key) || + get_key_objectid(key) != get_key_objectid(&item_key)) + return 0; + if (get_key_offset(&item_key) + nr_units_ctail(coord) != get_key_offset(key)) + return 0; + if (is_disk_cluster_key(key, coord)) + return 0; + return 1; +} + +/* plugin->u.item.b.mergeable + c-tails of different clusters are not mergeable */ +reiser4_internal int +mergeable_ctail(const coord_t * p1, const coord_t * p2) +{ + reiser4_key key1, key2; + + assert("edward-62", item_id_by_coord(p1) == CTAIL_ID); + assert("edward-61", item_type_by_coord(p1) == UNIX_FILE_METADATA_ITEM_TYPE); + + if (item_id_by_coord(p2) != CTAIL_ID) { + /* second item is of another type */ + return 0; + } + + item_key_by_coord(p1, &key1); +
item_key_by_coord(p2, &key2); + if (get_key_locality(&key1) != get_key_locality(&key2) || + get_key_objectid(&key1) != get_key_objectid(&key2) || + get_key_type(&key1) != get_key_type(&key2)) { + /* items of different objects */ + return 0; + } + if (get_key_offset(&key1) + nr_units_ctail(p1) != get_key_offset(&key2)) + /* not adjacent items */ + return 0; + if (is_disk_cluster_key(&key2, p2)) + return 0; + return 1; +} + +/* plugin->u.item.b.nr_units */ +reiser4_internal pos_in_node_t +nr_units_ctail(const coord_t * coord) +{ + return (item_length_by_coord(coord) - sizeof(ctail_formatted_at(coord)->cluster_shift)); +} + +/* plugin->u.item.b.estimate: + estimate how much space is needed to insert/paste @data->length bytes + into ctail at @coord */ +reiser4_internal int +estimate_ctail(const coord_t * coord /* coord of item */, + const reiser4_item_data * data /* parameters for new item */) +{ + if (coord == NULL) + /* insert */ + return (sizeof(ctail_item_format) + data->length); + else + /* paste */ + return data->length; +} + +#if REISER4_DEBUG_OUTPUT +/* ->print() method for this item plugin. */ +reiser4_internal void +print_ctail(const char *prefix /* prefix to print */ , + coord_t * coord /* coord of item to print */ ) +{ + assert("edward-63", prefix != NULL); + assert("edward-64", coord != NULL); + + if (item_length_by_coord(coord) < (int) sizeof (ctail_item_format)) + printk("%s: wrong size: %i < %i\n", prefix, item_length_by_coord(coord), sizeof (ctail_item_format)); + else + printk("%s: disk cluster shift: %d\n", prefix, cluster_shift_by_coord(coord)); +} +#endif + +/* ->init() method for this item plugin. 
*/ +reiser4_internal int +init_ctail(coord_t * to /* coord of item */, + coord_t * from /* old_item */, + reiser4_item_data * data /* structure used for insertion */) +{ + int cluster_shift; /* cpu value to convert */ + + if (data) { + assert("edward-463", data->length > sizeof(ctail_item_format)); + cluster_shift = *((int *)(data->arg)); + data->length -= sizeof(ctail_item_format); + } + else { + assert("edward-464", from != NULL); + assert("edward-855", ctail_ok(from)); + cluster_shift = (int)(cluster_shift_by_coord(from)); + } + cputod8(cluster_shift, &ctail_formatted_at(to)->cluster_shift); + assert("edward-856", ctail_ok(to)); + return 0; +} + +reiser4_internal int +ctail_ok (const coord_t *coord) +{ + return coord_is_unprepped_ctail(coord) || + (cluster_shift_by_coord(coord) <= MAX_CLUSTER_SHIFT); +} + +/* plugin->u.item.b.lookup: + NULL: We are looking for item keys only */ + +/* plugin->u.item.b.check */ +reiser4_internal int +check_ctail (const coord_t * coord, const char **error) +{ + if (!ctail_ok(coord)) { + if (error) + *error = "bad cluster shift in ctail"; + return 1; + } + return 0; +} + + +/* plugin->u.item.b.paste */ +reiser4_internal int +paste_ctail(coord_t * coord, reiser4_item_data * data, carry_plugin_info * info UNUSED_ARG) +{ + unsigned old_nr_units; + + assert("edward-268", data->data != NULL); + /* copy only from kernel space */ + assert("edward-66", data->user == 0); + + old_nr_units = item_length_by_coord(coord) - sizeof(ctail_item_format) - data->length; + + /* ctail items never get pasted in the middle */ + + if (coord->unit_pos == 0 && coord->between == AT_UNIT) { + + /* paste at the beginning when creating a new item */ + assert("edward-450", item_length_by_coord(coord) == data->length + sizeof(ctail_item_format)); + assert("edward-451", old_nr_units == 0); + } + else if (coord->unit_pos == old_nr_units - 1 && coord->between == AFTER_UNIT) { + + /* paste at the end */ + coord->unit_pos++; + } + else + impossible("edward-453", "bad paste 
position"); + + memcpy(first_unit(coord) + coord->unit_pos, data->data, data->length); + + assert("edward-857", ctail_ok(coord)); + + return 0; +} + +/* plugin->u.item.b.fast_paste */ + +/* plugin->u.item.b.can_shift + number of units is returned via return value, number of bytes via @size. For + ctail items they coincide */ +reiser4_internal int +can_shift_ctail(unsigned free_space, coord_t * source, + znode * target, shift_direction direction UNUSED_ARG, + unsigned *size /* number of bytes */ , unsigned want) +{ + /* make sure that we do not want to shift more than we have */ + assert("edward-68", want > 0 && want <= nr_units_ctail(source)); + + *size = min(want, free_space); + + if (!target) { + /* new item will be created */ + if (*size <= sizeof(ctail_item_format)) { + *size = 0; + return 0; + } + return *size - sizeof(ctail_item_format); + } + return *size; +} + +/* plugin->u.item.b.copy_units + cooperates with ->can_shift() */ +reiser4_internal void +copy_units_ctail(coord_t * target, coord_t * source, + unsigned from, unsigned count /* units */, + shift_direction where_is_free_space, + unsigned free_space /* bytes */) +{ + /* make sure that item @target is expanded already */ + assert("edward-69", (unsigned) item_length_by_coord(target) >= count); + assert("edward-70", free_space == count || free_space == count + 1); + + assert("edward-858", ctail_ok(source)); + + if (where_is_free_space == SHIFT_LEFT) { + /* append item @target with @count first bytes of @source: + this restriction came from ordinary tails */ + assert("edward-71", from == 0); + assert("edward-860", ctail_ok(target)); + + memcpy(first_unit(target) + nr_units_ctail(target) - count, first_unit(source), count); + } else { + /* target item is moved to right already */ + reiser4_key key; + + assert("edward-72", nr_units_ctail(source) == from + count); + + if (free_space == count) { + init_ctail(target, source, NULL); + //assert("edward-861", cluster_shift_by_coord(target) == 
d8tocpu(&ctail_formatted_at(target)->body[count])); + } + else { + /* new item has been created */ + assert("edward-862", ctail_ok(target)); + } + memcpy(first_unit(target), first_unit(source) + from, count); + + assert("edward-863", ctail_ok(target)); + + /* new units are inserted before first unit in an item, + therefore, we have to update item key */ + item_key_by_coord(source, &key); + set_key_offset(&key, get_key_offset(&key) + from); + + node_plugin_by_node(target->node)->update_item_key(target, &key, 0 /*info */); + } +} + +/* plugin->u.item.b.create_hook */ +reiser4_internal int +create_hook_ctail (const coord_t * coord, void * arg) +{ + assert("edward-864", znode_is_loaded(coord->node)); + + znode_set_convertible(coord->node); + return 0; +} + +/* plugin->u.item.b.kill_hook */ +reiser4_internal int +kill_hook_ctail(const coord_t *coord, pos_in_node_t from, pos_in_node_t count, carry_kill_data *kdata) +{ + struct inode *inode; + + assert("edward-1157", item_id_by_coord(coord) == CTAIL_ID); + assert("edward-291", znode_is_write_locked(coord->node)); + + inode = kdata->inode; + if (inode) { + reiser4_key key; + item_key_by_coord(coord, &key); + + if (from == 0 && is_disk_cluster_key(&key, coord)) { + cloff_t start = off_to_clust(get_key_offset(&key), inode); + truncate_page_cluster(inode, start); + } + } + return 0; +} + +/* for shift_hook_ctail(), + return true if the first disk cluster item has dirty child +*/ +static int +ctail_convertible (const coord_t *coord) +{ + int result; + reiser4_key key; + jnode * child = NULL; + + assert("edward-477", coord != NULL); + assert("edward-478", item_id_by_coord(coord) == CTAIL_ID); + + if (coord_is_unprepped_ctail(coord)) + /* unprepped ctail should be converted */ + return 1; + + item_key_by_coord(coord, &key); + child = jlookup(current_tree, + get_key_objectid(&key), + clust_by_coord(coord, NULL) << cluster_shift_by_coord(coord)); + if (!child) + return 0; + LOCK_JNODE(child); + if (jnode_is_dirty(child)) + result 
= 1; + else + result = 0; + UNLOCK_JNODE(child); + jput(child); + return result; +} + +/* plugin->u.item.b.shift_hook */ +reiser4_internal int +shift_hook_ctail(const coord_t * item /* coord of item */ , + unsigned from UNUSED_ARG /* start unit */ , + unsigned count UNUSED_ARG /* stop unit */ , + znode * old_node /* old parent */ ) +{ + assert("edward-479", item != NULL); + assert("edward-480", item->node != old_node); + + if (!znode_convertible(old_node) || znode_convertible(item->node)) + return 0; + if (ctail_convertible(item)) + znode_set_convertible(item->node); + return 0; +} + +static int +cut_or_kill_ctail_units(coord_t * coord, pos_in_node_t from, pos_in_node_t to, int cut, + void *p, reiser4_key * smallest_removed, reiser4_key *new_first) +{ + pos_in_node_t count; /* number of units to cut */ + char *item; + + count = to - from + 1; + item = item_body_by_coord(coord); + + assert("edward-74", ergo(from != 0, to == coord_last_unit_pos(coord))); + + if (smallest_removed) { + /* store smallest key removed */ + item_key_by_coord(coord, smallest_removed); + set_key_offset(smallest_removed, get_key_offset(smallest_removed) + from); + } + + if (new_first) { + assert("vs-1531", from == 0); + + item_key_by_coord(coord, new_first); + set_key_offset(new_first, get_key_offset(new_first) + from + count); + } + + if (!cut) + kill_hook_ctail(coord, from, 0, (struct carry_kill_data *)p); + + if (from == 0) { + if (count != nr_units_ctail(coord)) { + /* part of item is removed, so move free space at the beginning + of the item and update item key */ + reiser4_key key; + memcpy(item + to + 1, item, sizeof(ctail_item_format)); + item_key_by_coord(coord, &key); + set_key_offset(&key, get_key_offset(&key) + count); + node_plugin_by_node(coord->node)->update_item_key(coord, &key, 0 /*info */ ); + } + else { + /* cut_units should not be called to cut everything */ + assert("vs-1532", ergo(cut, 0)); + /* whole item is cut, so more than the amount of space occupied + by units got freed 
*/ + count += sizeof(ctail_item_format); + } + if (REISER4_DEBUG) + memset(item, 0, count); + } + else if (REISER4_DEBUG) + memset(item + sizeof(ctail_item_format) + from, 0, count); + return count; +} + +/* plugin->u.item.b.cut_units */ +reiser4_internal int +cut_units_ctail(coord_t *item, pos_in_node_t from, pos_in_node_t to, + carry_cut_data *cdata, reiser4_key *smallest_removed, reiser4_key *new_first) +{ + return cut_or_kill_ctail_units(item, from, to, 1, NULL, smallest_removed, new_first); +} + +/* plugin->u.item.b.kill_units */ +reiser4_internal int +kill_units_ctail(coord_t *item, pos_in_node_t from, pos_in_node_t to, + struct carry_kill_data *kdata, reiser4_key *smallest_removed, reiser4_key *new_first) +{ + return cut_or_kill_ctail_units(item, from, to, 0, kdata, smallest_removed, new_first); +} + +/* plugin->u.item.s.file.read */ +reiser4_internal int +read_ctail(struct file *file UNUSED_ARG, flow_t *f, hint_t *hint) +{ + uf_coord_t *uf_coord; + coord_t *coord; + + uf_coord = &hint->ext_coord; + coord = &uf_coord->coord; + assert("edward-127", f->user == 0); + assert("edward-129", coord && coord->node); + assert("edward-130", coord_is_existing_unit(coord)); + assert("edward-132", znode_is_loaded(coord->node)); + + /* start read only from the beginning of ctail */ + assert("edward-133", coord->unit_pos == 0); + /* read only whole ctails */ + assert("edward-135", nr_units_ctail(coord) <= f->length); + + assert("edward-136", schedulable()); + assert("edward-886", ctail_ok(coord)); + + if (f->data) + memcpy(f->data, (char *)first_unit(coord), (size_t)nr_units_ctail(coord)); + + dclust_set_extension(hint); + mark_page_accessed(znode_page(coord->node)); + move_flow_forward(f, nr_units_ctail(coord)); + + return 0; +} + +/* Reads a disk cluster consisting of ctail items, + attaches a transform stream with plain text */ +reiser4_internal int +ctail_read_cluster (reiser4_cluster_t * clust, struct inode * inode, int write) +{ + int result; + compression_plugin * 
cplug; +#if REISER4_DEBUG + reiser4_inode * info; + info = reiser4_inode_data(inode); +#endif + assert("edward-671", clust->hint != NULL); + assert("edward-140", clust->dstat == INVAL_DISK_CLUSTER); + assert("edward-672", crc_inode_ok(inode)); + assert("edward-145", inode_get_flag(inode, REISER4_CLUSTER_KNOWN)); + + /* set input stream */ + result = grab_tfm_stream(inode, &clust->tc, TFM_READ, INPUT_STREAM); + if (result) + return result; + + result = find_cluster(clust, inode, 1 /* read */, write); + if (cbk_errored(result)) + return result; + + if (!write) + set_hint_cluster(inode, clust->hint, + clust->index + 1, ZNODE_READ_LOCK); + + assert("edward-673", znode_is_any_locked(clust->hint->ext_coord.lh->node)); + + if (clust->dstat == FAKE_DISK_CLUSTER || + clust->dstat == UNPR_DISK_CLUSTER) { + tfm_cluster_set_uptodate(&clust->tc); + return 0; + } + cplug = inode_compression_plugin(inode); + if (cplug->alloc && !get_coa(&clust->tc, cplug->h.id)) { + result = alloc_coa(&clust->tc, cplug, TFM_READ); + if (result) + return result; + } + result = inflate_cluster(clust, inode); + if(result) + return result; + tfm_cluster_set_uptodate(&clust->tc); + return 0; +} + +/* read one locked page */ +reiser4_internal int +do_readpage_ctail(reiser4_cluster_t * clust, struct page *page) +{ + int ret; + unsigned cloff; + struct inode * inode; + char * data; + size_t pgcnt; + tfm_cluster_t * tc = &clust->tc; + + assert("edward-212", PageLocked(page)); + + if(PageUptodate(page)) + goto exit; + + inode = page->mapping->host; + + if (!tfm_cluster_is_uptodate(&clust->tc)) { + clust->index = pg_to_clust(page->index, inode); + unlock_page(page); + ret = ctail_read_cluster(clust, inode, 0 /* read only */); + lock_page(page); + if (ret) + return ret; + } + if(PageUptodate(page)) + /* races with another read/write */ + goto exit; + + /* bytes in the page */ + pgcnt = off_to_pgcount(i_size_read(inode), page->index); + + if (pgcnt == 0) { + assert("edward-1290", 0); + return RETERR(-EINVAL); 
+ } + + assert("edward-119", tfm_cluster_is_uptodate(tc)); + + switch (clust->dstat) { + case UNPR_DISK_CLUSTER: + assert("edward-1285", 0); +#if REISER4_DEBUG + warning("edward-1168", + "page %lu is not uptodate and disk cluster %lu (inode %llu) is unprepped\n", + page->index, clust->index, (unsigned long long)get_inode_oid(inode)); +#endif + case FAKE_DISK_CLUSTER: + /* fill the page by zeroes */ + data = kmap_atomic(page, KM_USER0); + + memset(data, 0, PAGE_CACHE_SIZE); + flush_dcache_page(page); + kunmap_atomic(data, KM_USER0); + SetPageUptodate(page); + break; + case PREP_DISK_CLUSTER: + /* fill the page by transformed data */ + assert("edward-1058", !PageUptodate(page)); + assert("edward-120", tc->len <= inode_cluster_size(inode)); + + /* start page offset in the cluster */ + cloff = pg_to_off_to_cloff(page->index, inode); + + data = kmap(page); + memcpy(data, tfm_stream_data(tc, OUTPUT_STREAM) + cloff, pgcnt); + memset(data + pgcnt, 0, (size_t)PAGE_CACHE_SIZE - pgcnt); + flush_dcache_page(page); + kunmap(page); + SetPageUptodate(page); + break; + default: + impossible("edward-1169", "bad disk cluster state"); + } + exit: + return 0; +} + +/* plugin->u.item.s.file.readpage */ +reiser4_internal int readpage_ctail(void * vp, struct page * page) +{ + int result; + hint_t hint; + lock_handle lh; + reiser4_cluster_t * clust = vp; + + assert("edward-114", clust != NULL); + assert("edward-115", PageLocked(page)); + assert("edward-116", !PageUptodate(page)); + assert("edward-117", !jprivate(page) && !PagePrivate(page)); + assert("edward-118", page->mapping && page->mapping->host); + assert("edward-867", !tfm_cluster_is_uptodate(&clust->tc)); + + clust->hint = &hint; + result = load_file_hint(clust->file, &hint); + if (result) + return result; + init_lh(&lh); + hint.ext_coord.lh = &lh; + + result = do_readpage_ctail(clust, page); + + assert("edward-213", PageLocked(page)); + assert("edward-1163", ergo (!result, PageUptodate(page))); + assert("edward-868", ergo 
(!result, tfm_cluster_is_uptodate(&clust->tc))); + + unlock_page(page); + + hint.ext_coord.valid = 0; + save_file_hint(clust->file, &hint); + done_lh(&lh); + tfm_cluster_clr_uptodate(&clust->tc); + + return result; +} + +/* Unconditionally reads a disk cluster. + This is used by ->readpages() */ +static int +ctail_read_page_cluster(reiser4_cluster_t * clust, struct inode * inode) +{ + int i; + int result; + assert("edward-779", clust != NULL); + assert("edward-1059", clust->win == NULL); + assert("edward-780", inode != NULL); + + result = prepare_page_cluster(inode, clust, 0 /* do not capture */); + if (result) + return result; + result = ctail_read_cluster(clust, inode, 0 /* read */); + if (result) + goto out; + /* stream is attached at this point */ + assert("edward-781", tfm_cluster_is_uptodate(&clust->tc)); + + for (i=0; i < clust->nr_pages; i++) { + struct page * page = clust->pages[i]; + lock_page(page); + result = do_readpage_ctail(clust, page); + unlock_page(page); + if (result) + break; + } + tfm_cluster_clr_uptodate(&clust->tc); + out: + release_cluster_pages_nocapture(clust); + assert("edward-1060", !result); + + return result; +} + +#define list_to_page(head) (list_entry((head)->prev, struct page, lru)) +#define list_to_next_page(head) (list_entry((head)->prev->prev, struct page, lru)) + +#if REISER4_DEBUG +#define check_order(pages) \ +assert("edward-214", ergo(!list_empty(pages) && pages->next != pages->prev, \ + list_to_page(pages)->index < list_to_next_page(pages)->index)) +#endif + +/* plugin->s.file.writepage */ + +/* plugin->u.item.s.file.readpages + populate an address space with page clusters, and start reads against them. 
+ FIXME_EDWARD: this function should return errors +*/ +reiser4_internal void +readpages_ctail(void *vp, struct address_space *mapping, struct list_head *pages) +{ + int ret = 0; + hint_t hint; + lock_handle lh; + reiser4_cluster_t clust; + struct page *page; + struct pagevec lru_pvec; + struct inode * inode = mapping->host; + int progress = 0; + + assert("edward-214", ergo(!list_empty(pages) && + pages->next != pages->prev, + list_to_page(pages)->index < list_to_next_page(pages)->index)); + pagevec_init(&lru_pvec, 0); + reiser4_cluster_init(&clust, 0); + clust.file = vp; + clust.hint = &hint; + + init_lh(&lh); + + ret = alloc_cluster_pgset(&clust, cluster_nrpages(inode)); + if (ret) + goto out; + ret = load_file_hint(clust.file, &hint); + if (ret) + goto out; + hint.ext_coord.lh = &lh; + + /* address_space-level file readahead doesn't know about + reiser4 page clustering, so we work around this fact */ + + while (!list_empty(pages)) { + page = list_to_page(pages); + list_del(&page->lru); + if (add_to_page_cache(page, mapping, page->index, GFP_KERNEL)) { + page_cache_release(page); + continue; + } + if (PageUptodate(page)) { + if (!pagevec_add(&lru_pvec, page)) + __pagevec_lru_add(&lru_pvec); + unlock_page(page); + continue; + } + unlock_page(page); + reset_cluster_params(&clust); + + if (progress && + /* hole in the indices */ + pg_to_clust(page->index, inode) != clust.index + 1) + invalidate_hint_cluster(&clust); + progress++; + + clust.index = pg_to_clust(page->index, inode); + ret = ctail_read_page_cluster(&clust, inode); + if (ret) + goto exit; + assert("edward-869", !tfm_cluster_is_uptodate(&clust.tc)); + + lock_page(page); + ret = do_readpage_ctail(&clust, page); + if (!pagevec_add(&lru_pvec, page)) + __pagevec_lru_add(&lru_pvec); + if (ret) { + warning("edward-215", "do_readpage_ctail failed"); + unlock_page(page); + exit: + while (!list_empty(pages)) { + struct page *victim; + + victim = list_to_page(pages); + list_del(&victim->lru); + 
page_cache_release(victim); + } + break; + } + assert("edward-1061", PageUptodate(page)); + + unlock_page(page); + } + assert("edward-870", !tfm_cluster_is_uptodate(&clust.tc)); + save_file_hint(clust.file, &hint); + out: + done_lh(&lh); + hint.ext_coord.valid = 0; + put_cluster_handle(&clust, TFM_READ); + pagevec_lru_add(&lru_pvec); + return; +} + +/* + plugin->u.item.s.file.append_key + key of the first item of the next disk cluster +*/ +reiser4_internal reiser4_key * +append_key_ctail(const coord_t *coord, reiser4_key *key) +{ + assert("edward-1241", item_id_by_coord(coord) == CTAIL_ID); + assert("edward-1242", cluster_shift_by_coord(coord) <= MAX_CLUSTER_SHIFT); + + item_key_by_coord(coord, key); + set_key_offset(key, ((__u64)(clust_by_coord(coord, NULL)) + 1) << cluster_shift_by_coord(coord) << PAGE_CACHE_SHIFT); + return key; +} + + static int +insert_unprepped_ctail(reiser4_cluster_t * clust, struct inode * inode) +{ + int result; + char buf[UCTAIL_NR_UNITS]; + reiser4_item_data data; + reiser4_key key; + int shift = (int)UCTAIL_SHIFT; + + memset(buf, 0, (size_t)UCTAIL_NR_UNITS); + result = key_by_inode_cryptcompress(inode, + clust_to_off(clust->index, inode), + &key); + if (result) + return result; + data.user = 0; + data.iplug = item_plugin_by_id(CTAIL_ID); + data.arg = &shift; + data.length = sizeof(ctail_item_format) + (size_t)UCTAIL_NR_UNITS; + data.data = buf; + + result = insert_by_coord(&clust->hint->ext_coord.coord, + &data, + &key, + clust->hint->ext_coord.lh, 0); + return result; +} + +static int +insert_crc_flow(coord_t * coord, lock_handle * lh, flow_t * f, struct inode * inode) +{ + int result; + carry_pool *pool; + carry_level lowest_level; + carry_op *op; + reiser4_item_data data; + int cluster_shift = inode_cluster_shift(inode); + + pool = init_carry_pool(); + if (IS_ERR(pool)) + return PTR_ERR(pool); + init_carry_level(&lowest_level, pool); + + assert("edward-466", coord->between == AFTER_ITEM || coord->between == AFTER_UNIT || + 
coord->between == BEFORE_ITEM || coord->between == EMPTY_NODE + || coord->between == BEFORE_UNIT); + + if (coord->between == AFTER_UNIT) { + coord->unit_pos = 0; + coord->between = AFTER_ITEM; + } + op = post_carry(&lowest_level, COP_INSERT_FLOW, coord->node, 0 /* operate directly on coord -> node */ ); + if (IS_ERR(op) || (op == NULL)) { + done_carry_pool(pool); + return RETERR(op ? PTR_ERR(op) : -EIO); + } + data.user = 0; + data.iplug = item_plugin_by_id(CTAIL_ID); + data.arg = &cluster_shift; + + data.length = 0; + data.data = 0; + + op->u.insert_flow.flags = COPI_DONT_SHIFT_LEFT | COPI_DONT_SHIFT_RIGHT; + op->u.insert_flow.insert_point = coord; + op->u.insert_flow.flow = f; + op->u.insert_flow.data = &data; + op->u.insert_flow.new_nodes = 0; + + lowest_level.track_type = CARRY_TRACK_CHANGE; + lowest_level.tracked = lh; + + result = carry(&lowest_level, 0); + done_carry_pool(pool); + + return result; +} + +/* Implementation of CRC_APPEND_ITEM mode of ctail conversion */ +static int +insert_crc_flow_in_place(coord_t * coord, lock_handle * lh, flow_t * f, struct inode * inode) +{ + int ret; + coord_t pos; + lock_handle lock; + + assert("edward-674", f->length <= inode_scaled_cluster_size(inode)); + assert("edward-484", coord->between == AT_UNIT || coord->between == AFTER_ITEM); + assert("edward-485", item_id_by_coord(coord) == CTAIL_ID); + + coord_dup (&pos, coord); + pos.unit_pos = 0; + pos.between = AFTER_ITEM; + + init_lh (&lock); + copy_lh(&lock, lh); + + ret = insert_crc_flow(&pos, &lock, f, inode); + done_lh(&lock); + + assert("edward-1228", !ret); + return ret; +} + +/* Implementation of CRC_OVERWRITE_ITEM mode of ctail conversion */ +static int +overwrite_ctail(coord_t * coord, flow_t * f) +{ + unsigned count; + + assert("edward-269", f->user == 0); + assert("edward-270", f->data != NULL); + assert("edward-271", f->length > 0); + assert("edward-272", coord_is_existing_unit(coord)); + assert("edward-273", coord->unit_pos == 0); + assert("edward-274", 
znode_is_write_locked(coord->node)); + assert("edward-275", schedulable()); + assert("edward-467", item_id_by_coord(coord) == CTAIL_ID); + assert("edward-1243", ctail_ok(coord)); + + count = nr_units_ctail(coord); + + if (count > f->length) + count = f->length; + memcpy(first_unit(coord), f->data, count); + move_flow_forward(f, count); + coord->unit_pos += count; + return 0; +} + +/* Implementation of CRC_CUT_ITEM mode of ctail conversion: + cut ctail (part or whole) starting from next unit position */ +static int +cut_ctail(coord_t * coord) +{ + coord_t stop; + + assert("edward-435", coord->between == AT_UNIT && + coord->item_pos < coord_num_items(coord) && + coord->unit_pos <= coord_num_units(coord)); + + if(coord->unit_pos == coord_num_units(coord)) + /* nothing to cut */ + return 0; + coord_dup(&stop, coord); + stop.unit_pos = coord_last_unit_pos(coord); + + return cut_node_content(coord, &stop, NULL, NULL, NULL); +} + +int ctail_insert_unprepped_cluster(reiser4_cluster_t * clust, struct inode * inode) +{ + int result; + + assert("edward-1244", inode != NULL); + assert("edward-1245", clust->hint != NULL); + assert("edward-1246", clust->dstat == FAKE_DISK_CLUSTER); + assert("edward-1247", clust->reserved == 1); + assert("edward-1248", get_current_context()->grabbed_blocks == + estimate_insert_cluster(inode, 1)); + + result = get_disk_cluster_locked(clust, inode, ZNODE_WRITE_LOCK); + if (cbk_errored(result)) + return result; + assert("edward-1249", result == CBK_COORD_NOTFOUND); + assert("edward-1250", znode_is_write_locked(clust->hint->ext_coord.lh->node)); + + assert("edward-1295", + clust->hint->ext_coord.lh->node == clust->hint->ext_coord.coord.node); + + coord_set_between_clusters(&clust->hint->ext_coord.coord); + + result = insert_unprepped_ctail(clust, inode); + all_grabbed2free(); + + assert("edward-1251", !result); + assert("edward-1252", crc_inode_ok(inode)); + assert("edward-1253", znode_is_write_locked(clust->hint->ext_coord.lh->node)); + 
assert("edward-1254", reiser4_clustered_blocks(reiser4_get_current_sb())); + assert("edward-1255", znode_convertible(clust->hint->ext_coord.coord.node)); + + return result; +} + +static int +do_convert_ctail(flush_pos_t * pos, crc_write_mode_t mode) +{ + int result = 0; + convert_item_info_t * info; + + assert("edward-468", pos != NULL); + assert("edward-469", pos->sq != NULL); + assert("edward-845", item_convert_data(pos) != NULL); + + info = item_convert_data(pos); + assert("edward-679", info->flow.data != NULL); + + switch (mode) { + case CRC_APPEND_ITEM: + assert("edward-1229", info->flow.length != 0); + assert("edward-1256", cluster_shift_by_coord(&pos->coord) <= MAX_CLUSTER_SHIFT); + result = insert_crc_flow_in_place(&pos->coord, &pos->lock, &info->flow, info->inode); + break; + case CRC_OVERWRITE_ITEM: + assert("edward-1230", info->flow.length != 0); + overwrite_ctail(&pos->coord, &info->flow); + if (info->flow.length != 0) + break; + case CRC_CUT_ITEM: + assert("edward-1231", info->flow.length == 0); + result = cut_ctail(&pos->coord); + break; + default: + result = RETERR(-EIO); + impossible("edward-244", "bad convert mode"); + } + return result; +} + +/* plugin->u.item.f.scan */ +reiser4_internal int scan_ctail(flush_scan * scan) +{ + int result = 0; + struct page * page; + struct inode * inode; + jnode * node = scan->node; + + assert("edward-227", scan->node != NULL); + assert("edward-228", jnode_is_cluster_page(scan->node)); + assert("edward-639", znode_is_write_locked(scan->parent_lock.node)); + + page = jnode_page(node); + inode = page->mapping->host; + + if (!scanning_left(scan)) + return result; + if (!znode_is_dirty(scan->parent_lock.node)) + znode_make_dirty(scan->parent_lock.node); + + if (!znode_convertible(scan->parent_lock.node)) { + LOCK_JNODE(scan->node); + if (jnode_is_dirty(scan->node)) { + warning("edward-873", "child is dirty but parent not squeezable"); + znode_set_convertible(scan->parent_lock.node); + } else { + warning("edward-681", 
"cluster page is already processed"); + UNLOCK_JNODE(scan->node); + return -EAGAIN; + } + UNLOCK_JNODE(scan->node); + } + return result; +} + +/* If true, this function attaches children */ +static int +should_attach_convert_idata(flush_pos_t * pos) +{ + int result; + assert("edward-431", pos != NULL); + assert("edward-432", pos->child == NULL); + assert("edward-619", znode_is_write_locked(pos->coord.node)); + assert("edward-470", item_plugin_by_coord(&pos->coord) == item_plugin_by_id(CTAIL_ID)); + + /* check for leftmost child */ + utmost_child_ctail(&pos->coord, LEFT_SIDE, &pos->child); + + if (!pos->child) + return 0; + LOCK_JNODE(pos->child); + result = jnode_is_dirty(pos->child) && + pos->child->atom == ZJNODE(pos->coord.node)->atom; + UNLOCK_JNODE(pos->child); + if (!result && pos->child) { + /* existing child isn't to attach, clear up this one */ + jput(pos->child); + pos->child = NULL; + } + return result; +} + +/* plugin->init_convert_data() */ +static int +init_convert_data_ctail(convert_item_info_t * idata, struct inode * inode) +{ + assert("edward-813", idata != NULL); + assert("edward-814", inode != NULL); + + idata->inode = inode; + idata->d_cur = DC_FIRST_ITEM; + idata->d_next = DC_INVALID_STATE; + + return 0; +} + +static int +alloc_item_convert_data(convert_info_t * sq) +{ + assert("edward-816", sq != NULL); + assert("edward-817", sq->itm == NULL); + + sq->itm = reiser4_kmalloc(sizeof(*sq->itm), GFP_KERNEL); + if (sq->itm == NULL) + return RETERR(-ENOMEM); + return 0; +} + +static void +free_item_convert_data(convert_info_t * sq) +{ + assert("edward-818", sq != NULL); + assert("edward-819", sq->itm != NULL); + assert("edward-820", sq->iplug != NULL); + + reiser4_kfree(sq->itm); + sq->itm = NULL; + return; +} + +static int +alloc_convert_data(flush_pos_t * pos) +{ + assert("edward-821", pos != NULL); + assert("edward-822", pos->sq == NULL); + + pos->sq = reiser4_kmalloc(sizeof(*pos->sq), GFP_KERNEL); + if (!pos->sq) + return RETERR(-ENOMEM); + 
memset(pos->sq, 0, sizeof(*pos->sq)); + return 0; +} + +reiser4_internal void +free_convert_data(flush_pos_t * pos) +{ + convert_info_t * sq; + + assert("edward-823", pos != NULL); + assert("edward-824", pos->sq != NULL); + + sq = pos->sq; + if (sq->itm) + free_item_convert_data(sq); + put_cluster_handle(&sq->clust, TFM_WRITE); + reiser4_kfree(pos->sq); + pos->sq = NULL; + return; +} + +static int +init_item_convert_data(flush_pos_t * pos, struct inode * inode) +{ + convert_info_t * sq; + + assert("edward-825", pos != NULL); + assert("edward-826", pos->sq != NULL); + assert("edward-827", item_convert_data(pos) != NULL); + assert("edward-828", inode != NULL); + + sq = pos->sq; + + memset(sq->itm, 0, sizeof(*sq->itm)); + + /* iplug->init_convert_data() */ + return init_convert_data_ctail(sq->itm, inode); +} + +/* create and attach disk cluster info used by 'convert' phase of the flush + squalloc() */ +static int +attach_convert_idata(flush_pos_t * pos, struct inode * inode) +{ + int ret = 0; + convert_item_info_t * info; + reiser4_cluster_t *clust; + file_plugin * fplug = inode_file_plugin(inode); + compression_plugin * cplug = inode_compression_plugin(inode); + + assert("edward-248", pos != NULL); + assert("edward-249", pos->child != NULL); + assert("edward-251", inode != NULL); + assert("edward-682", crc_inode_ok(inode)); + assert("edward-252", fplug == file_plugin_by_id(CRC_FILE_PLUGIN_ID)); + assert("edward-473", item_plugin_by_coord(&pos->coord) == item_plugin_by_id(CTAIL_ID)); + + if (!pos->sq) { + ret = alloc_convert_data(pos); + if (ret) + return ret; + } + clust = &pos->sq->clust; + if (cplug->alloc && !get_coa(&clust->tc, cplug->h.id)) { + ret = alloc_coa(&clust->tc, cplug, TFM_WRITE); + if (ret) + goto err; + } + + if (convert_data(pos)->clust.pages == NULL) { + ret = alloc_cluster_pgset(&convert_data(pos)->clust, + MAX_CLUSTER_NRPAGES); + if (ret) + goto err; + } + reset_cluster_pgset(&convert_data(pos)->clust, + MAX_CLUSTER_NRPAGES); + + 
assert("edward-829", pos->sq != NULL); + assert("edward-250", item_convert_data(pos) == NULL); + + pos->sq->iplug = item_plugin_by_id(CTAIL_ID); + + ret = alloc_item_convert_data(pos->sq); + if (ret) + goto err; + ret = init_item_convert_data(pos, inode); + if (ret) + goto err; + info = item_convert_data(pos); + + clust->index = pg_to_clust(jnode_page(pos->child)->index, inode); + + ret = flush_cluster_pages(clust, pos->child, inode); + if (ret) + goto err; + + assert("edward-830", equi(get_coa(&clust->tc, cplug->h.id), cplug->alloc)); + + ret = deflate_cluster(clust, inode); + if (ret) + goto err; + + inc_item_convert_count(pos); + + /* make flow by transformed stream */ + fplug->flow_by_inode(info->inode, + tfm_stream_data(&clust->tc, OUTPUT_STREAM), + 0/* kernel space */, + clust->tc.len, + clust_to_off(clust->index, inode), + WRITE_OP, + &info->flow); + jput(pos->child); + + assert("edward-683", crc_inode_ok(inode)); + return 0; + err: + jput(pos->child); + free_convert_data(pos); + return ret; +} + +/* clear up disk cluster info */ +static void +detach_convert_idata(convert_info_t * sq) +{ + convert_item_info_t * info; + + assert("edward-253", sq != NULL); + assert("edward-840", sq->itm != NULL); + + info = sq->itm; + assert("edward-255", info->inode != NULL); + assert("edward-1175", + inode_get_flag(info->inode, REISER4_CLUSTER_KNOWN)); + assert("edward-1212", info->flow.length == 0); + + /* the final release of pages */ + forget_cluster_pages(sq->clust.pages, sq->clust.nr_pages); + free_item_convert_data(sq); + return; +} + +/* plugin->u.item.f.utmost_child */ + +/* This function sets leftmost child for a first cluster item, + if the child exists, and NULL in other cases. 
+ NOTE-EDWARD: Do not call this for RIGHT_SIDE */ + +reiser4_internal int +utmost_child_ctail(const coord_t * coord, sideof side, jnode ** child) +{ + reiser4_key key; + + assert("edward-257", coord != NULL); + assert("edward-258", child != NULL); + assert("edward-259", side == LEFT_SIDE); + assert("edward-260", item_plugin_by_coord(coord) == item_plugin_by_id(CTAIL_ID)); + + item_key_by_coord(coord, &key); + + if (!is_disk_cluster_key(&key, coord)) + *child = NULL; + else + *child = jlookup(current_tree, get_key_objectid(item_key_by_coord(coord, &key)), pg_by_coord(coord)); + return 0; +} + +/* Returns true if @p2 is the next item to @p1 + in the _same_ disk cluster. + Disk cluster is a set of items. If ->clustered() != NULL, + with each item the whole disk cluster should be read/modified +*/ +static int +clustered_ctail (const coord_t * p1, const coord_t * p2) +{ + return mergeable_ctail(p1, p2); +} + +/* Go rightward and check for the next disk cluster item; set + d_next to DC_CHAINED_ITEM if such an item exists. + If the current position is the last item, go to the right neighbor. + Skip empty nodes. Note that right neighbors may not be in + the slum because of races. If so, make them dirty and + convertible. 
+*/ +static int +next_item_dc_stat(flush_pos_t * pos) +{ + int ret = 0; + int stop = 0; + znode * cur; + coord_t coord; + lock_handle lh; + lock_handle right_lock; + + assert("edward-1232", !node_is_empty(pos->coord.node)); + assert("edward-1014", pos->coord.item_pos < coord_num_items(&pos->coord)); + assert("edward-1015", chaining_data_present(pos)); + assert("edward-1017", item_convert_data(pos)->d_next == DC_INVALID_STATE); + + item_convert_data(pos)->d_next = DC_AFTER_CLUSTER; + + if (item_convert_data(pos)->d_cur == DC_AFTER_CLUSTER) + return ret; + if (pos->coord.item_pos < coord_num_items(&pos->coord) - 1) + return ret; + + /* check next slum item */ + init_lh(&right_lock); + cur = pos->coord.node; + + while (!stop) { + init_lh(&lh); + ret = reiser4_get_right_neighbor(&lh, + cur, + ZNODE_WRITE_LOCK, + GN_CAN_USE_UPPER_LEVELS); + if (ret) + break; + ret = zload(lh.node); + if (ret) { + done_lh(&lh); + break; + } + coord_init_before_first_item(&coord, lh.node); + + if (node_is_empty(lh.node)) { + znode_make_dirty(lh.node); + znode_set_convertible(lh.node); + stop = 0; + } else if (clustered_ctail(&pos->coord, &coord)) { + + item_convert_data(pos)->d_next = DC_CHAINED_ITEM; + + if (!znode_is_dirty(lh.node)) { + warning("edward-1024", + "next slum item mergeable, " + "but znode %p isn't dirty\n", + lh.node); + znode_make_dirty(lh.node); + } + if (!znode_convertible(lh.node)) { + warning("edward-1272", + "next slum item mergeable, " + "but znode %p isn't convertible\n", + lh.node); + znode_set_convertible(lh.node); + } + stop = 1; + } else + stop = 1; + zrelse(lh.node); + done_lh(&right_lock); + copy_lh(&right_lock, &lh); + done_lh(&lh); + cur = right_lock.node; + } + done_lh(&right_lock); + + if (ret == -E_NO_NEIGHBOR) + ret = 0; + return ret; +} + +static int +assign_convert_mode(convert_item_info_t * idata, + crc_write_mode_t * mode) +{ + int result = 0; + + assert("edward-1025", idata != NULL); + + if (idata->flow.length) { + /* append or overwrite */ + 
switch(idata->d_cur) { + case DC_FIRST_ITEM: + case DC_CHAINED_ITEM: + *mode = CRC_OVERWRITE_ITEM; + break; + case DC_AFTER_CLUSTER: + *mode = CRC_APPEND_ITEM; + break; + default: + impossible("edward-1018", + "wrong current item state"); + } + } else { + /* cut or invalidate */ + switch(idata->d_cur) { + case DC_FIRST_ITEM: + case DC_CHAINED_ITEM: + *mode = CRC_CUT_ITEM; + break; + case DC_AFTER_CLUSTER: + result = 1; + break; + default: + impossible("edward-1019", + "wrong current item state"); + } + } + return result; +} + +/* plugin->u.item.f.convert */ +/* write ctail in guessed mode */ +reiser4_internal int +convert_ctail(flush_pos_t * pos) +{ + int result; + int nr_items; + crc_write_mode_t mode = CRC_OVERWRITE_ITEM; + + assert("edward-1020", pos != NULL); + assert("edward-1213", coord_num_items(&pos->coord) != 0); + assert("edward-1257", item_id_by_coord(&pos->coord) == CTAIL_ID); + assert("edward-1258", ctail_ok(&pos->coord)); + assert("edward-261", pos->coord.node != NULL); + + nr_items = coord_num_items(&pos->coord); + if (!chaining_data_present(pos)) { + if (should_attach_convert_idata(pos)) { + /* attach convert item info */ + struct inode * inode; + + assert("edward-264", pos->child != NULL); + assert("edward-265", jnode_page(pos->child) != NULL); + assert("edward-266", jnode_page(pos->child)->mapping != NULL); + + inode = jnode_page(pos->child)->mapping->host; + + assert("edward-267", inode != NULL); + + /* attach item convert info by child and put the last one */ + result = attach_convert_idata(pos, inode); + pos->child = NULL; + if (result == -E_REPEAT) { + /* jnode became clean, or there is no dirty + pages (nothing to update in disk cluster) */ + warning("edward-1021", + "convert_ctail: nothing to attach"); + return 0; + } + if (result != 0) + return result; + } + else + /* unconvertible */ + return 0; + } + else { + /* use old convert info */ + + convert_item_info_t * idata; + + idata = item_convert_data(pos); + + result = 
assign_convert_mode(idata, &mode); + if (result) { + /* disk cluster is over, + nothing to update anymore */ + detach_convert_idata(pos->sq); + return 0; + } + } + + assert("edward-433", chaining_data_present(pos)); + assert("edward-1022", pos->coord.item_pos < coord_num_items(&pos->coord)); + + result = next_item_dc_stat(pos); + if (result) { + detach_convert_idata(pos->sq); + return result; + } + result = do_convert_ctail(pos, mode); + if (result) { + detach_convert_idata(pos->sq); + return result; + } + switch (mode) { + case CRC_CUT_ITEM: + assert("edward-1214", item_convert_data(pos)->flow.length == 0); + assert("edward-1215", + coord_num_items(&pos->coord) == nr_items || + coord_num_items(&pos->coord) == nr_items - 1); + if (item_convert_data(pos)->d_next == DC_CHAINED_ITEM) + break; + if (coord_num_items(&pos->coord) != nr_items) { + /* the item was killed, no more chained items */ + detach_convert_idata(pos->sq); + if (!node_is_empty(pos->coord.node)) + /* make sure the next item will be scanned */ + coord_init_before_item(&pos->coord); + break; + } + case CRC_APPEND_ITEM: + assert("edward-434", item_convert_data(pos)->flow.length == 0); + detach_convert_idata(pos->sq); + break; + case CRC_OVERWRITE_ITEM: + if (coord_is_unprepped_ctail(&pos->coord)) { + /* convert an unprepped ctail to a prepped one */ + int shift; + shift = inode_cluster_shift(item_convert_data(pos)->inode); + assert("edward-1259", shift <= MAX_CLUSTER_SHIFT); + cputod8(shift, &ctail_formatted_at(&pos->coord)->cluster_shift); + } + break; + } + return result; +} + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/ctail.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/ctail.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,82 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#if !defined( __FS_REISER4_CTAIL_H__ ) +#define __FS_REISER4_CTAIL_H__ + +/* cryptcompress object item. See ctail.c for description. */ + +#define UCTAIL_NR_UNITS 1 +#define UCTAIL_SHIFT 0xff + +typedef struct ctail_item_format { + /* cluster shift */ + d8 cluster_shift; + /* ctail body */ + d8 body[0]; +} __attribute__((packed)) ctail_item_format; + +/* The following is a set of various item states in a disk cluster. + Disk cluster is a set of items whose keys belong to the interval + [dc_key , dc_key + disk_cluster_size - 1] */ +typedef enum { + DC_INVALID_STATE = 0, + DC_FIRST_ITEM = 1, + DC_CHAINED_ITEM = 2, + DC_AFTER_CLUSTER = 3 +} dc_item_stat; + +typedef struct { + int shift; /* we keep here a cpu value of cluster_shift field + of ctail_item_format (see above) */ +} ctail_coord_extension_t; + +struct cut_list; + +/* plugin->item.b.* */ +int can_contain_key_ctail(const coord_t *, const reiser4_key *, const reiser4_item_data *); +int mergeable_ctail(const coord_t * p1, const coord_t * p2); +pos_in_node_t nr_units_ctail(const coord_t * coord); +int estimate_ctail(const coord_t * coord, const reiser4_item_data * data); +void print_ctail(const char *prefix, coord_t * coord); +lookup_result lookup_ctail(const reiser4_key *, lookup_bias, coord_t *); + +int paste_ctail(coord_t * coord, reiser4_item_data * data, carry_plugin_info * info UNUSED_ARG); +int init_ctail(coord_t *, coord_t *, reiser4_item_data *); +int can_shift_ctail(unsigned free_space, coord_t * coord, + znode * target, shift_direction pend, unsigned *size, unsigned want); +void copy_units_ctail(coord_t * target, 
coord_t * source, + unsigned from, unsigned count, shift_direction where_is_free_space, unsigned free_space); +int cut_units_ctail(coord_t *coord, pos_in_node_t from, pos_in_node_t to, + carry_cut_data *, reiser4_key * smallest_removed, reiser4_key *new_first); +int kill_units_ctail(coord_t * coord, pos_in_node_t from, pos_in_node_t to, + carry_kill_data *, reiser4_key * smallest_removed, reiser4_key *new_first); +int ctail_ok(const coord_t * coord); +int check_ctail(const coord_t * coord, const char **error); + +/* plugin->u.item.s.* */ +int read_ctail(struct file *, flow_t *, hint_t *); +int readpage_ctail(void *, struct page *); +void readpages_ctail(void *, struct address_space *, struct list_head *); +reiser4_key *append_key_ctail(const coord_t *, reiser4_key *); +int create_hook_ctail (const coord_t * coord, void * arg); +int kill_hook_ctail(const coord_t *, pos_in_node_t, pos_in_node_t, carry_kill_data *); +int shift_hook_ctail(const coord_t *, unsigned, unsigned, znode *); + +/* plugin->u.item.f */ +int utmost_child_ctail(const coord_t *, sideof, jnode **); +int scan_ctail(flush_scan *); +int convert_ctail(flush_pos_t *); +size_t inode_scaled_cluster_size(struct inode *); +int cluster_shift_by_coord(const coord_t * coord); + +#endif /* __FS_REISER4_CTAIL_H__ */ + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/extent.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/extent.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,181 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "item.h" +#include "../../key.h" +#include "../../super.h" +#include "../../carry.h" +#include "../../inode.h" +#include "../../page_cache.h" +#include "../../emergency_flush.h" +#include "../../flush.h" +#include "../object.h" + + +/* prepare structure reiser4_item_data. 
It is used to put one extent unit into tree */ +/* Audited by: green(2002.06.13) */ +reiser4_internal reiser4_item_data * +init_new_extent(reiser4_item_data *data, void *ext_unit, int nr_extents) +{ + data->data = ext_unit; + /* data->data is kernel space */ + data->user = 0; + data->length = sizeof(reiser4_extent) * nr_extents; + data->arg = 0; + data->iplug = item_plugin_by_id(EXTENT_POINTER_ID); + return data; +} + +/* how many bytes are addressed by @nr first extents of the extent item */ +reiser4_internal reiser4_block_nr +extent_size(const coord_t *coord, pos_in_node_t nr) +{ + pos_in_node_t i; + reiser4_block_nr blocks; + reiser4_extent *ext; + + ext = item_body_by_coord(coord); + assert("vs-263", nr <= nr_units_extent(coord)); + + blocks = 0; + for (i = 0; i < nr; i++, ext++) { + blocks += extent_get_width(ext); + } + + return blocks * current_blocksize; +} + +reiser4_internal extent_state +state_of_extent(reiser4_extent *ext) +{ + switch ((int) extent_get_start(ext)) { + case 0: + return HOLE_EXTENT; + case 1: + return UNALLOCATED_EXTENT; + default: + break; + } + return ALLOCATED_EXTENT; +} + +reiser4_internal int +extent_is_unallocated(const coord_t *item) +{ + assert("jmacd-5133", item_is_extent(item)); + + return state_of_extent(extent_by_coord(item)) == UNALLOCATED_EXTENT; +} + +/* set extent's start and width */ +reiser4_internal void +set_extent(reiser4_extent *ext, reiser4_block_nr start, reiser4_block_nr width) +{ + extent_set_start(ext, start); + extent_set_width(ext, width); +} + +/* used in split_allocated_extent, conv_extent, plug_hole to insert 1 or 2 extent units (@exts_to_add) after the one + @un_extent is set to. 
@un_extent itself is changed to @replace */ +reiser4_internal int +replace_extent(coord_t *un_extent, lock_handle *lh, + reiser4_key *key, reiser4_item_data *exts_to_add, const reiser4_extent *replace, unsigned flags UNUSED_ARG, + int return_inserted_position /* if it is 1 - un_extent and lh are returned set to first of newly inserted + units, if it is 0 - un_extent and lh are returned set to unit which was + replaced */) +{ + int result; + coord_t coord_after; + lock_handle lh_after; + tap_t watch; + znode *orig_znode; + ON_DEBUG(reiser4_extent orig_ext); /* this is for debugging */ + + assert("vs-990", coord_is_existing_unit(un_extent)); + assert("vs-1375", znode_is_write_locked(un_extent->node)); + assert("vs-1426", extent_get_width(replace) != 0); + assert("vs-1427", extent_get_width((reiser4_extent *)exts_to_add->data) != 0); + + coord_dup(&coord_after, un_extent); + init_lh(&lh_after); + copy_lh(&lh_after, lh); + tap_init(&watch, &coord_after, &lh_after, ZNODE_WRITE_LOCK); + tap_monitor(&watch); + + ON_DEBUG(orig_ext = *extent_by_coord(un_extent)); + orig_znode = un_extent->node; + + /* make sure that key is set properly */ + if (REISER4_DEBUG) { + reiser4_key tmp; + + unit_key_by_coord(un_extent, &tmp); + set_key_offset(&tmp, get_key_offset(&tmp) + extent_get_width(replace) * current_blocksize); + assert("vs-1080", keyeq(&tmp, key)); + } + + /* set insert point after unit to be replaced */ + un_extent->between = AFTER_UNIT; + + result = insert_into_item(un_extent, + return_inserted_position ? lh : 0, + /*(flags == COPI_DONT_SHIFT_LEFT) ? 0 : lh,*/ key, exts_to_add, flags); + if (!result) { + /* now we have to replace the unit after which new units were inserted. 
Its position is tracked by + @watch */ + reiser4_extent *ext; + + if (coord_after.node != orig_znode) { + coord_clear_iplug(&coord_after); + result = zload(coord_after.node); + } + + if (likely(!result)) { + ext = extent_by_coord(&coord_after); + + assert("vs-987", znode_is_loaded(coord_after.node)); + assert("vs-988", !memcmp(ext, &orig_ext, sizeof (*ext))); + + memcpy(ext, replace, sizeof(*ext)); + znode_make_dirty(coord_after.node); + + if (coord_after.node != orig_znode) + zrelse(coord_after.node); + + if (return_inserted_position == 0) { + /* return un_extent and lh set to the same */ + assert("vs-1662", WITH_DATA(coord_after.node, !memcmp(replace, extent_by_coord(&coord_after), sizeof(reiser4_extent)))); + + *un_extent = coord_after; + done_lh(lh); + copy_lh(lh, &lh_after); + } else { + /* return un_extent and lh set to first of inserted units */ + assert("vs-1663", WITH_DATA(un_extent->node, !memcmp(exts_to_add->data, extent_by_coord(un_extent), sizeof(reiser4_extent)))); + assert("vs-1664", lh->node == un_extent->node); + } + } + } + tap_done(&watch); + + return result; +} + +reiser4_internal lock_handle * +znode_lh(znode *node) +{ + assert("vs-1371", znode_is_write_locked(node)); + assert("vs-1372", znode_is_wlocked_once(node)); + return owners_list_front(&node->lock.owners); +} + + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/extent_file_ops.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/extent_file_ops.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1542 @@ +/* COPYRIGHT 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "item.h" +#include "../../inode.h" +#include "../../page_cache.h" +#include "../../flush.h" /* just for jnode_tostring */ +#include "../object.h" + +#include + +static inline reiser4_extent * +ext_by_offset(const znode *node, int offset) 
+{ + reiser4_extent *ext; + + ext = (reiser4_extent *)(zdata(node) + offset); + return ext; +} + +static inline reiser4_extent * +ext_by_ext_coord(const uf_coord_t *uf_coord) +{ + reiser4_extent *ext; + + ext = ext_by_offset(uf_coord->coord.node, uf_coord->extension.extent.ext_offset); + assert("vs-1650", extent_get_start(ext) == extent_get_start(&uf_coord->extension.extent.extent)); + assert("vs-1651", extent_get_width(ext) == extent_get_width(&uf_coord->extension.extent.extent)); + return ext; +} + +#if REISER4_DEBUG +static int +coord_extension_is_ok(const uf_coord_t *uf_coord) +{ + const coord_t *coord; + const extent_coord_extension_t *ext_coord; + reiser4_extent *ext; + + coord = &uf_coord->coord; + ext_coord = &uf_coord->extension.extent; + ext = ext_by_ext_coord(uf_coord); + + return WITH_DATA(coord->node, (uf_coord->valid == 1 && + coord_is_iplug_set(coord) && + item_is_extent(coord) && + ext_coord->nr_units == nr_units_extent(coord) && + ext == extent_by_coord(coord) && + ext_coord->width == extent_get_width(ext) && + coord->unit_pos < ext_coord->nr_units && + ext_coord->pos_in_unit < ext_coord->width && + extent_get_start(ext) == extent_get_start(&ext_coord->extent) && + extent_get_width(ext) == extent_get_width(&ext_coord->extent))); +} + +/* return 1 if offset @off is inside of extent unit pointed to by @coord. 
Set + pos_in_unit inside of unit correspondingly */ +static int +offset_is_in_unit(const coord_t *coord, loff_t off) +{ + reiser4_key unit_key; + __u64 unit_off; + reiser4_extent *ext; + + ext = extent_by_coord(coord); + + unit_key_extent(coord, &unit_key); + unit_off = get_key_offset(&unit_key); + if (off < unit_off) + return 0; + if (off >= (unit_off + (current_blocksize * extent_get_width(ext)))) + return 0; + return 1; +} + +static int +coord_matches_key_extent(const coord_t *coord, const reiser4_key *key) +{ + reiser4_key item_key; + + assert("vs-771", coord_is_existing_unit(coord)); + assert("vs-1258", keylt(key, append_key_extent(coord, &item_key))); + assert("vs-1259", keyge(key, item_key_by_coord(coord, &item_key))); + + return offset_is_in_unit(coord, get_key_offset(key)); +} + +static int +coord_extension_is_ok2(const uf_coord_t *uf_coord, const reiser4_key *key) +{ + reiser4_key coord_key; + + unit_key_by_coord(&uf_coord->coord, &coord_key); + set_key_offset(&coord_key, + get_key_offset(&coord_key) + + (uf_coord->extension.extent.pos_in_unit << PAGE_CACHE_SHIFT)); + return keyeq(key, &coord_key); +} + +#endif + +/* @coord is set either to the end of last extent item of a file + (coord->node is a node on the twig level) or to a place where first + item of file has to be inserted to (coord->node is leaf + node). Calculate size of hole to be inserted. If that hole is too + big - only part of it is inserted */ +static int +add_hole(coord_t *coord, lock_handle *lh, const reiser4_key *key /* key of position in a file for write */) +{ + int result; + znode *loaded; + reiser4_extent *ext, new_ext; + reiser4_block_nr hole_width; + reiser4_item_data item; + reiser4_key hole_key; + + /*coord_clear_iplug(coord);*/ + result = zload(coord->node); + if (result) + return result; + loaded = coord->node; + + if (znode_get_level(coord->node) == LEAF_LEVEL) { + /* there are no items of this file yet. 
First item will be + hole extent inserted here */ + + /* @coord must be set for inserting of new item */ + assert("vs-711", coord_is_between_items(coord)); + + hole_key = *key; + set_key_offset(&hole_key, 0ull); + + hole_width = ((get_key_offset(key) + current_blocksize - 1) >> + current_blocksize_bits); + assert("vs-710", hole_width > 0); + + /* compose body of hole extent */ + set_extent(&new_ext, HOLE_EXTENT_START, hole_width); + + result = insert_extent_by_coord(coord, init_new_extent(&item, &new_ext, 1), &hole_key, lh); + zrelse(loaded); + return result; + } + + /* last item of file may have to be appended with hole */ + assert("vs-708", znode_get_level(coord->node) == TWIG_LEVEL); + assert("vs-714", item_id_by_coord(coord) == EXTENT_POINTER_ID); + + /* make sure we are at proper item */ + assert("vs-918", keylt(key, max_key_inside_extent(coord, &hole_key))); + + /* key of first byte which is not addressed by this extent */ + append_key_extent(coord, &hole_key); + + if (keyle(key, &hole_key)) { + /* there is already extent unit which contains position + specified by @key */ + zrelse(loaded); + return 0; + } + + /* extent item has to be appended with hole. Calculate length of that + hole */ + hole_width = ((get_key_offset(key) - get_key_offset(&hole_key) + + current_blocksize - 1) >> current_blocksize_bits); + assert("vs-954", hole_width > 0); + + /* set coord after last unit */ + coord_init_after_item_end(coord); + + /* get last extent in the item */ + ext = extent_by_coord(coord); + if (state_of_extent(ext) == HOLE_EXTENT) { + /* last extent of a file is hole extent. Widen that extent by + @hole_width blocks. 
Note that we do not worry about + overflowing - extent width is 64 bits */ + set_extent(ext, HOLE_EXTENT_START, extent_get_width(ext) + hole_width); + znode_make_dirty(coord->node); + zrelse(loaded); + return 0; + } + + /* append item with hole extent unit */ + assert("vs-713", (state_of_extent(ext) == ALLOCATED_EXTENT || state_of_extent(ext) == UNALLOCATED_EXTENT)); + + /* compose body of hole extent */ + set_extent(&new_ext, HOLE_EXTENT_START, hole_width); + + result = insert_into_item(coord, lh, &hole_key, init_new_extent(&item, &new_ext, 1), 0 /*flags */ ); + zrelse(loaded); + return result; +} + +/* insert extent item (containing one unallocated extent of width 1) to place + set by @coord */ +static int +insert_first_block(uf_coord_t *uf_coord, const reiser4_key *key, reiser4_block_nr *block) +{ + int result; + reiser4_extent ext; + reiser4_item_data unit; + + /* make sure that we really write to first block */ + assert("vs-240", get_key_offset(key) == 0); + + /* extent insertion starts at leaf level */ + assert("vs-719", znode_get_level(uf_coord->coord.node) == LEAF_LEVEL); + + set_extent(&ext, UNALLOCATED_EXTENT_START, 1); + result = insert_extent_by_coord(&uf_coord->coord, init_new_extent(&unit, &ext, 1), key, uf_coord->lh); + if (result) { + /* FIXME-VITALY: this is grabbed at file_write time. */ + /* grabbed2free ((__u64)1); */ + return result; + } + + *block = fake_blocknr_unformatted(); + + /* invalidate coordinate, research must be performed to continue because write will continue on twig level */ + uf_coord->valid = 0; + return 0; +} + +/* @coord is set to the end of extent item. 
Append a pointer to one block to it, either by expanding the last unallocated + extent or by appending a new one of width 1 */ +static int +append_one_block(uf_coord_t *uf_coord, reiser4_key *key, reiser4_block_nr *block) +{ + int result; + reiser4_extent new_ext; + reiser4_item_data unit; + coord_t *coord; + extent_coord_extension_t *ext_coord; + reiser4_extent *ext; + + coord = &uf_coord->coord; + ext_coord = &uf_coord->extension.extent; + ext = ext_by_ext_coord(uf_coord); + + /* check correctness of position in the item */ + assert("vs-228", coord->unit_pos == coord_last_unit_pos(coord)); + assert("vs-1311", coord->between == AFTER_UNIT); + assert("vs-1302", ext_coord->pos_in_unit == ext_coord->width - 1); + assert("vs-883", + ( { + reiser4_key next; + keyeq(key, append_key_extent(coord, &next)); + })); + + switch (state_of_extent(ext)) { + case UNALLOCATED_EXTENT: + set_extent(ext, UNALLOCATED_EXTENT_START, extent_get_width(ext) + 1); + znode_make_dirty(coord->node); + + /* update coord extension */ + ext_coord->width ++; + ON_DEBUG(extent_set_width(&uf_coord->extension.extent.extent, ext_coord->width)); + break; + + case HOLE_EXTENT: + case ALLOCATED_EXTENT: + /* append one unallocated extent of width 1 */ + set_extent(&new_ext, UNALLOCATED_EXTENT_START, 1); + result = insert_into_item(coord, uf_coord->lh, key, init_new_extent(&unit, &new_ext, 1), 0 /* flags */ ); + /* FIXME: for now */ + uf_coord->valid = 0; + if (result) + return result; + break; + default: + assert("", 0); + } + + *block = fake_blocknr_unformatted(); + return 0; +} + +/* @coord is set to a hole unit inside of an extent item. Replace the hole unit with a + unit for an unallocated extent of width 1, and perhaps a hole unit before + the unallocated unit and perhaps a hole unit after the unallocated unit. 
*/ +static int +plug_hole(uf_coord_t *uf_coord, reiser4_key *key) +{ + reiser4_extent *ext, new_exts[2], /* extents which will be added after original + * hole one */ + replace; /* extent original hole extent will be replaced + * with */ + reiser4_block_nr width, pos_in_unit; + reiser4_item_data item; + int count; + coord_t *coord; + extent_coord_extension_t *ext_coord; + reiser4_key tmp_key; + int return_inserted_position; + + coord = &uf_coord->coord; + ext_coord = &uf_coord->extension.extent; + ext = ext_by_ext_coord(uf_coord); + + width = ext_coord->width; + pos_in_unit = ext_coord->pos_in_unit; + + if (width == 1) { + set_extent(ext, UNALLOCATED_EXTENT_START, 1); + znode_make_dirty(coord->node); + /* update uf_coord */ + ON_DEBUG(ext_coord->extent = *ext); + return 0; + } else if (pos_in_unit == 0) { + /* we deal with first element of extent */ + if (coord->unit_pos) { + /* there is an extent to the left */ + if (state_of_extent(ext - 1) == UNALLOCATED_EXTENT) { + /* unit to the left is an unallocated extent. 
Increase its width and decrease width of + * hole */ + extent_set_width(ext - 1, extent_get_width(ext - 1) + 1); + extent_set_width(ext, width - 1); + znode_make_dirty(coord->node); + + /* update coord extension */ + coord->unit_pos --; + ext_coord->width = extent_get_width(ext - 1); + ext_coord->pos_in_unit = ext_coord->width - 1; + ext_coord->ext_offset -= sizeof(reiser4_extent); + ON_DEBUG(ext_coord->extent = *extent_by_coord(coord)); + return 0; + } + } + /* extent for replace */ + set_extent(&replace, UNALLOCATED_EXTENT_START, 1); + /* extent to be inserted */ + set_extent(&new_exts[0], HOLE_EXTENT_START, width - 1); + + /* have replace_extent to return with @coord and @uf_coord->lh set to unit which was replaced */ + return_inserted_position = 0; + count = 1; + } else if (pos_in_unit == width - 1) { + /* we deal with last element of extent */ + if (coord->unit_pos < nr_units_extent(coord) - 1) { + /* there is an extent unit to the right */ + if (state_of_extent(ext + 1) == UNALLOCATED_EXTENT) { + /* unit to the right is an unallocated extent. 
Increase its width and decrease width of + * hole */ + extent_set_width(ext + 1, extent_get_width(ext + 1) + 1); + extent_set_width(ext, width - 1); + znode_make_dirty(coord->node); + + /* update coord extension */ + coord->unit_pos ++; + ext_coord->width = extent_get_width(ext + 1); + ext_coord->pos_in_unit = 0; + ext_coord->ext_offset += sizeof(reiser4_extent); + ON_DEBUG(ext_coord->extent = *extent_by_coord(coord)); + return 0; + } + } + /* extent for replace */ + set_extent(&replace, HOLE_EXTENT_START, width - 1); + /* extent to be inserted */ + set_extent(&new_exts[0], UNALLOCATED_EXTENT_START, 1); + + /* have replace_extent to return with @coord and @uf_coord->lh set to unit which was inserted */ + return_inserted_position = 1; + count = 1; + } else { + /* extent for replace */ + set_extent(&replace, HOLE_EXTENT_START, pos_in_unit); + /* extents to be inserted */ + set_extent(&new_exts[0], UNALLOCATED_EXTENT_START, 1); + set_extent(&new_exts[1], HOLE_EXTENT_START, width - pos_in_unit - 1); + + /* have replace_extent to return with @coord and @uf_coord->lh set to first of units which were + inserted */ + return_inserted_position = 1; + count = 2; + } + + /* insert_into_item will insert new units after the one @coord is set + to. 
So, update key correspondingly */ + unit_key_by_coord(coord, &tmp_key); + set_key_offset(&tmp_key, (get_key_offset(&tmp_key) + extent_get_width(&replace) * current_blocksize)); + + uf_coord->valid = 0; + return replace_extent(coord, uf_coord->lh, &tmp_key, init_new_extent(&item, new_exts, count), &replace, 0 /* flags */, return_inserted_position); +} + +/* make unallocated node pointer in the position @uf_coord is set to */ +static int +overwrite_one_block(uf_coord_t *uf_coord, reiser4_key *key, reiser4_block_nr *block, int *created, + struct inode *inode) +{ + int result; + extent_coord_extension_t *ext_coord; + reiser4_extent *ext; + oid_t oid; + pgoff_t index; + + oid = get_key_objectid(key); + index = get_key_offset(key) >> current_blocksize_bits; + + assert("vs-1312", uf_coord->coord.between == AT_UNIT); + + result = 0; + *created = 0; + ext_coord = &uf_coord->extension.extent; + ext = ext_by_ext_coord(uf_coord); + + switch (state_of_extent(ext)) { + case ALLOCATED_EXTENT: + *block = extent_get_start(ext) + ext_coord->pos_in_unit; + break; + + case HOLE_EXTENT: + if (inode != NULL && DQUOT_ALLOC_BLOCK(inode, 1)) + return RETERR(-EDQUOT); + result = plug_hole(uf_coord, key); + if (!result) { + *block = fake_blocknr_unformatted(); + *created = 1; + } else { + if (inode != NULL) + DQUOT_FREE_BLOCK(inode, 1); + } + break; + + case UNALLOCATED_EXTENT: + break; + + default: + impossible("vs-238", "extent of unknown type found"); + result = RETERR(-EIO); + break; + } + + return result; +} + +#if REISER4_DEBUG + +/* after make extent uf_coord's lock handle must be set to node containing unit which was inserted/found */ +static void +check_make_extent_result(int result, write_mode_t mode, const reiser4_key *key, + const lock_handle *lh, reiser4_block_nr block) +{ + coord_t coord; + + if (result != 0) + return; + + assert("vs-960", znode_is_write_locked(lh->node)); + zload(lh->node); + result = lh->node->nplug->lookup(lh->node, key, FIND_EXACT, &coord); + 
assert("vs-1502", result == NS_FOUND); + assert("vs-1656", coord_is_existing_unit(&coord)); + + if (blocknr_is_fake(&block)) { + assert("vs-1657", state_of_extent(extent_by_coord(&coord)) == UNALLOCATED_EXTENT); + } else if (block == 0) { + assert("vs-1660", mode == OVERWRITE_ITEM); + assert("vs-1657", state_of_extent(extent_by_coord(&coord)) == UNALLOCATED_EXTENT); + } else { + reiser4_key tmp; + reiser4_block_nr pos_in_unit; + + assert("vs-1658", state_of_extent(extent_by_coord(&coord)) == ALLOCATED_EXTENT); + unit_key_by_coord(&coord, &tmp); + pos_in_unit = (get_key_offset(key) - get_key_offset(&tmp)) >> current_blocksize_bits; + assert("vs-1659", block == extent_get_start(extent_by_coord(&coord)) + pos_in_unit); + } + zrelse(lh->node); +} + +#endif + +/* when @inode is not NULL, alloc quota before updating extent item */ +static int +make_extent(reiser4_key *key, uf_coord_t *uf_coord, write_mode_t mode, + reiser4_block_nr *block, int *created, struct inode *inode) +{ + int result; + oid_t oid; + pgoff_t index; + + oid = get_key_objectid(key); + index = get_key_offset(key) >> current_blocksize_bits; + + assert("vs-960", znode_is_write_locked(uf_coord->coord.node)); + assert("vs-1334", znode_is_loaded(uf_coord->coord.node)); + + *block = 0; + switch (mode) { + case FIRST_ITEM: + /* new block will be inserted into file. Check quota */ + if (inode != NULL && DQUOT_ALLOC_BLOCK(inode, 1)) + return RETERR(-EDQUOT); + + /* create first item of the file */ + result = insert_first_block(uf_coord, key, block); + if (result && inode != NULL) + DQUOT_FREE_BLOCK(inode, 1); + *created = 1; + break; + + case APPEND_ITEM: + /* new block will be inserted into file. 
Check quota */ + if (inode != NULL && DQUOT_ALLOC_BLOCK(inode, 1)) + return RETERR(-EDQUOT); + + /* FIXME: item plugin should be initialized + item_plugin_by_coord(&uf_coord->base_coord);*/ + assert("vs-1316", coord_extension_is_ok(uf_coord)); + result = append_one_block(uf_coord, key, block); + if (result && inode != NULL) + DQUOT_FREE_BLOCK(inode, 1); + *created = 1; + break; + + case OVERWRITE_ITEM: + /* FIXME: item plugin should be initialized + item_plugin_by_coord(&uf_coord->base_coord);*/ + assert("vs-1316", coord_extension_is_ok(uf_coord)); + result = overwrite_one_block(uf_coord, key, block, created, inode); + break; + + default: + assert("vs-1346", 0); + result = RETERR(-E_REPEAT); + break; + } + + ON_DEBUG(check_make_extent_result(result, mode, key, uf_coord->lh, *block)); + + return result; +} + +/* estimate and reserve space which may be required for writing one page of file */ +static int +reserve_extent_write_iteration(struct inode *inode, reiser4_tree *tree) +{ + int result; + + grab_space_enable(); + /* one unformatted node and one insertion into tree and one stat data update may be involved */ + result = reiser4_grab_space(1 + /* Hans removed reservation for balancing here. 
*/ + /* if extent items will be ever used by plugins other than unix file plugin - estimate update should instead be taken by + inode_file_plugin(inode)->estimate.update(inode) + */ + estimate_update_common(inode), + 0/* flags */); + return result; +} + +static void +write_move_coord(coord_t *coord, uf_coord_t *uf_coord, write_mode_t mode, int full_page) +{ + extent_coord_extension_t *ext_coord; + + assert("vs-1339", ergo(mode == OVERWRITE_ITEM, coord->between == AT_UNIT)); + assert("vs-1341", ergo(mode == FIRST_ITEM, uf_coord->valid == 0)); + + if (uf_coord->valid == 0) + return; + + ext_coord = &uf_coord->extension.extent; + + if (mode == APPEND_ITEM) { + assert("vs-1340", coord->between == AFTER_UNIT); + assert("vs-1342", coord->unit_pos == ext_coord->nr_units - 1); + assert("vs-1343", ext_coord->pos_in_unit == ext_coord->width - 2); + assert("vs-1344", state_of_extent(ext_by_ext_coord(uf_coord)) == UNALLOCATED_EXTENT); + ON_DEBUG(ext_coord->extent = *ext_by_ext_coord(uf_coord)); + ext_coord->pos_in_unit ++; + if (!full_page) + coord->between = AT_UNIT; + return; + } + + assert("vs-1345", coord->between == AT_UNIT); + + if (!full_page) + return; + if (ext_coord->pos_in_unit == ext_coord->width - 1) { + /* last position in the unit */ + if (coord->unit_pos == ext_coord->nr_units - 1) { + /* last unit in the item */ + uf_coord->valid = 0; + } else { + /* move to the next unit */ + coord->unit_pos ++; + ext_coord->ext_offset += sizeof(reiser4_extent); + ON_DEBUG(ext_coord->extent = *ext_by_offset(coord->node, ext_coord->ext_offset)); + ext_coord->width = extent_get_width(ext_by_offset(coord->node, ext_coord->ext_offset)); + ext_coord->pos_in_unit = 0; + } + } else + ext_coord->pos_in_unit ++; +} + +static int +write_is_partial(struct inode *inode, loff_t file_off, unsigned page_off, unsigned count) +{ + if (count == inode->i_sb->s_blocksize) + return 0; + if (page_off == 0 && file_off + count >= inode->i_size) + return 0; + return 1; +} + +/* this initialize 
content of page not covered by write */ +static void +zero_around(struct page *page, int from, int count) +{ + char *data; + + data = kmap_atomic(page, KM_USER0); + memset(data, 0, from); + memset(data + from + count, 0, PAGE_CACHE_SIZE - from - count); + flush_dcache_page(page); + kunmap_atomic(data, KM_USER0); +} + +static void assign_jnode_blocknr(jnode *j, reiser4_block_nr blocknr, int created) +{ + assert("vs-1737", !JF_ISSET(j, JNODE_EFLUSH)); + if (created) { + /* extent corresponding to this jnode was just created */ + assert("vs-1504", *jnode_get_block(j) == 0); + JF_SET(j, JNODE_CREATED); + /* new block is added to file. Update inode->i_blocks and inode->i_bytes. FIXME: + inode_set/get/add/sub_bytes is used to be called by quota macros */ + /*inode_add_bytes(inode, PAGE_CACHE_SIZE);*/ + } + + if (*jnode_get_block(j) == 0) { + jnode_set_block(j, &blocknr); + } else { + assert("vs-1508", !blocknr_is_fake(&blocknr)); + assert("vs-1507", ergo(blocknr, *jnode_get_block(j) == blocknr)); + } +} + +static int +extent_balance_dirty_pages(struct inode *inode, const flow_t *f, + hint_t *hint) +{ + int result; + + if (hint->ext_coord.valid) + set_hint(hint, &f->key, ZNODE_WRITE_LOCK); + else + unset_hint(hint); + longterm_unlock_znode(hint->ext_coord.lh); + + /* file was appended, update its size */ + if (get_key_offset(&f->key) > inode->i_size) { + assert("vs-1649", f->user == 1); + INODE_SET_FIELD(inode, i_size, get_key_offset(&f->key)); + } + if (f->user != 0) { + /* this was writing data from user space. Update timestamps, + therefore. 
Otherwise, this is tail conversion where we + should not update timestamps */ + inode->i_ctime = inode->i_mtime = CURRENT_TIME; + result = reiser4_update_sd(inode); + if (result) + return result; + } + + reiser4_throttle_write(inode); + return 0; +} + +/* write flow's data into file by pages */ +static int +extent_write_flow(struct inode *inode, flow_t *flow, hint_t *hint, + int grabbed, /* 0 if space for operation is not reserved yet, 1 - otherwise */ + write_mode_t mode) +{ + int result; + loff_t file_off; + unsigned long page_nr; + unsigned long page_off, count; + struct page *page; + jnode *j; + uf_coord_t *uf_coord; + coord_t *coord; + oid_t oid; + reiser4_tree *tree; + reiser4_key page_key; + reiser4_block_nr blocknr; + int created; + int err; + + err = 0; + + assert("nikita-3139", !inode_get_flag(inode, REISER4_NO_SD)); + assert("vs-885", current_blocksize == PAGE_CACHE_SIZE); + assert("vs-700", flow->user == 1); + assert("vs-1352", flow->length > 0); + + tree = tree_by_inode(inode); + oid = get_inode_oid(inode); + uf_coord = &hint->ext_coord; + coord = &uf_coord->coord; + + /* position in a file to start write from */ + file_off = get_key_offset(&flow->key); + /* index of page containing that offset */ + page_nr = (unsigned long)(file_off >> PAGE_CACHE_SHIFT); + /* offset within the page */ + page_off = (unsigned long)(file_off & (PAGE_CACHE_SIZE - 1)); + + /* key of first byte of page */ + page_key = flow->key; + set_key_offset(&page_key, (loff_t)page_nr << PAGE_CACHE_SHIFT); + do { + if (!grabbed) { + result = reserve_extent_write_iteration(inode, tree); + if (result) + goto exit0; + } + /* number of bytes to be written to page */ + count = PAGE_CACHE_SIZE - page_off; + if (count > flow->length) + count = flow->length; + + result = make_extent(&page_key, uf_coord, mode, &blocknr, &created, inode/* check quota */); + if (result) { + err = 2; + goto exit1; + } + + /* look for jnode and create it if it does not exist yet */ + j = find_get_jnode(tree,
inode->i_mapping, oid, page_nr); + if (IS_ERR(j)) { + result = PTR_ERR(j); + err = 3; + goto exit1; + } + + /* get page locked and attached to jnode */ + page = jnode_get_page_locked(j, GFP_KERNEL); + if (IS_ERR(page)) { + result = PTR_ERR(page); + err = 4; + goto exit2; + } + + page_cache_get(page); + + if (!PageUptodate(page)) { + if (mode == OVERWRITE_ITEM) { + int blocknr_set = 0; + /* this page may be either an anonymous page (a + page which was dirtied via mmap, + writepage-ed and for which extent pointer + was just created. In this case jnode is + eflushed) or correspond to a block not in the + page cache (in which case created == 0). In + either case we have to read this page if it + is being overwritten partially */ + if (write_is_partial(inode, file_off, page_off, count) && + (created == 0 || JF_ISSET(j, JNODE_EFLUSH))) { + if (!JF_ISSET(j, JNODE_EFLUSH)) { + /* eflush bit can be neither + set nor cleared by other + process because page + attached to jnode is + locked */ + LOCK_JNODE(j); + assign_jnode_blocknr(j, blocknr, created); + blocknr_set = 1; + UNLOCK_JNODE(j); + } + result = page_io(page, j, READ, GFP_KERNEL); + if (result) { + err = 5; + goto exit3; + } + lock_page(page); + if (!PageUptodate(page)) { + err = 6; + goto exit3; + } + } else { + zero_around(page, page_off, count); + } + + /* assign blocknr to jnode if it is not assigned yet */ + LOCK_JNODE(j); + eflush_del(j, 1); + if (blocknr_set == 0) + assign_jnode_blocknr(j, blocknr, created); + UNLOCK_JNODE(j); + } else { + /* new page added to the file. No need to care + about data it might contain.
Zero content of + new page around write area */ + assert("vs-1681", !JF_ISSET(j, JNODE_EFLUSH)); + zero_around(page, page_off, count); + + /* assign blocknr to jnode if it is not + assigned yet */ + LOCK_JNODE(j); + assign_jnode_blocknr(j, blocknr, created); + UNLOCK_JNODE(j); + } + } else { + LOCK_JNODE(j); + eflush_del(j, 1); + assign_jnode_blocknr(j, blocknr, created); + UNLOCK_JNODE(j); + } + + assert("vs-1503", UNDER_SPIN(jnode, j, (!JF_ISSET(j, JNODE_EFLUSH) && jnode_page(j) == page))); + assert("nikita-3033", schedulable()); + + /* copy user data into page */ + result = __copy_from_user((char *)kmap(page) + page_off, flow->data, count); + kunmap(page); + if (unlikely(result)) { + /* FIXME: write(fd, 0, 10); to empty file will write no + data but file will get increased size. */ + result = RETERR(-EFAULT); + err = 7; + goto exit3; + } + + set_page_dirty_internal(page, 0); + SetPageUptodate(page); + if (!PageReferenced(page)) + SetPageReferenced(page); + + unlock_page(page); + page_cache_release(page); + + /* FIXME: possible optimization: if jnode is not dirty yet - it + gets into clean list in try_capture and then in + jnode_mark_dirty gets moved to dirty list. 
So, it would be + better to put the jnode directly on the dirty list */ + LOCK_JNODE(j); + result = try_capture(j, ZNODE_WRITE_LOCK, 0, 1/* can_coc */); + if (result) { + UNLOCK_JNODE(j); + err = 8; + goto exit2; + } + jnode_make_dirty_locked(j); + UNLOCK_JNODE(j); + + jput(j); + + move_flow_forward(flow, count); + write_move_coord(coord, uf_coord, mode, page_off + count == PAGE_CACHE_SIZE); + + /* set seal, drop long term lock, throttle the writer */ + result = extent_balance_dirty_pages(inode, flow, hint); + if (!grabbed) + all_grabbed2free(); + if (result) { + err = 9; + break; + } + + page_off = 0; + page_nr ++; + file_off += count; + set_key_offset(&page_key, (loff_t)page_nr << PAGE_CACHE_SHIFT); + + if (flow->length && uf_coord->valid == 1) { + /* loop continues - try to obtain lock validating a + seal set in extent_balance_dirty_pages */ + result = hint_validate(hint, &flow->key, 0/* do not check key */, ZNODE_WRITE_LOCK); + if (result == 0) + continue; + } + break; + + /* handle the various error paths */ + exit3: + unlock_page(page); + page_cache_release(page); + exit2: + if (created) + inode_sub_bytes(inode, PAGE_CACHE_SIZE); + jput(j); + exit1: + if (!grabbed) + all_grabbed2free(); + + exit0: + unset_hint(hint); + longterm_unlock_znode(hint->ext_coord.lh); + break; + + } while (1); + + if (err) { + assert("", !hint_is_set(hint)); + } else + assert("", ergo(hint_is_set(hint), + coords_equal(&hint->ext_coord.coord, &hint->seal.coord1) && + keyeq(&flow->key, &hint->seal.key))); + assert("", lock_stack_isclean(get_current_lock_stack())); + return result; +} + +/* estimate and reserve space which may be required for appending file with hole stored in extent */ +static int +extent_hole_reserve(reiser4_tree *tree) +{ + /* adding hole may require adding a hole unit into extent item and stat data update */ + grab_space_enable(); + return reiser4_grab_space(estimate_one_insert_into_item(tree) * 2, 0); +} + +static int +extent_write_hole(struct inode *inode,
flow_t *flow, hint_t *hint, int grabbed) +{ + int result; + loff_t new_size; + coord_t *coord; + lock_handle *lh; + + coord = &hint->ext_coord.coord; + lh = hint->ext_coord.lh; + if (!grabbed) { + result = extent_hole_reserve(znode_get_tree(coord->node)); + if (result) { + unset_hint(hint); + done_lh(lh); + return result; + } + } + + new_size = get_key_offset(&flow->key) + flow->length; + set_key_offset(&flow->key, new_size); + flow->length = 0; + result = add_hole(coord, lh, &flow->key); + hint->ext_coord.valid = 0; + unset_hint(hint); + done_lh(lh); + if (!result) { + INODE_SET_FIELD(inode, i_size, new_size); + inode->i_ctime = inode->i_mtime = CURRENT_TIME; + result = reiser4_update_sd(inode); + } + if (!grabbed) + all_grabbed2free(); + return result; +} + +/* + plugin->s.file.write + It can be called in two modes: + 1. real write - to write data from flow to a file (@flow->data != 0) + 2. expanding truncate (@flow->data == 0) +*/ +reiser4_internal int +write_extent(struct inode *inode, flow_t *flow, hint_t *hint, + int grabbed, /* extent's write may be called from plain unix file write and from tail conversion. In the first + case (grabbed == 0) space is not reserved beforehand, so it must be done here. When it is + being called from tail conversion - space is reserved already for whole operation which may + involve several calls to item write. In this case space reservation will not be done + here */ + write_mode_t mode) +{ + if (flow->data) + /* real write */ + return extent_write_flow(inode, flow, hint, grabbed, mode); + + /* expanding truncate.
add_hole requires f->key to be set to new end of file */ + return extent_write_hole(inode, flow, hint, grabbed); +} + +static inline void +zero_page(struct page *page) +{ + char *kaddr = kmap_atomic(page, KM_USER0); + + memset(kaddr, 0, PAGE_CACHE_SIZE); + flush_dcache_page(page); + kunmap_atomic(kaddr, KM_USER0); + SetPageUptodate(page); + unlock_page(page); +} + +static int +do_readpage_extent(reiser4_extent *ext, reiser4_block_nr pos, struct page *page) +{ + jnode *j; + struct address_space *mapping; + unsigned long index; + oid_t oid; + + mapping = page->mapping; + oid = get_inode_oid(mapping->host); + index = page->index; + + switch (state_of_extent(ext)) { + case HOLE_EXTENT: + /* + * it is possible to have hole page with jnode, if page was + * eflushed previously. + */ + j = jfind(mapping, index); + if (j == NULL) { + zero_page(page); + return 0; + } + LOCK_JNODE(j); + if (!jnode_page(j)) { + jnode_attach_page(j, page); + } else { + BUG_ON(jnode_page(j) != page); + assert("vs-1504", jnode_page(j) == page); + } + + UNLOCK_JNODE(j); + break; + + case ALLOCATED_EXTENT: + j = jnode_of_page(page); + if (IS_ERR(j)) + return PTR_ERR(j); + if (*jnode_get_block(j) == 0) { + reiser4_block_nr blocknr; + + blocknr = extent_get_start(ext) + pos; + jnode_set_block(j, &blocknr); + } else + assert("vs-1403", j->blocknr == extent_get_start(ext) + pos); + break; + + case UNALLOCATED_EXTENT: + j = jfind(mapping, index); + assert("nikita-2688", j); + assert("vs-1426", jnode_page(j) == NULL); + + UNDER_SPIN_VOID(jnode, j, jnode_attach_page(j, page)); + + /* page is locked, it is safe to check JNODE_EFLUSH */ + assert("vs-1668", JF_ISSET(j, JNODE_EFLUSH)); + break; + + default: + warning("vs-957", "wrong extent\n"); + return RETERR(-EIO); + } + + BUG_ON(j == 0); + page_io(page, j, READ, GFP_NOIO); + jput(j); + return 0; +} + +static int +move_coord_pages(coord_t *coord, extent_coord_extension_t *ext_coord, unsigned count) +{ + reiser4_extent *ext; + + ext_coord->expected_page += 
count; + + ext = ext_by_offset(coord->node, ext_coord->ext_offset); + + do { + if (ext_coord->pos_in_unit + count < ext_coord->width) { + ext_coord->pos_in_unit += count; + break; + } + + if (coord->unit_pos == ext_coord->nr_units - 1) { + coord->between = AFTER_UNIT; + return 1; + } + + /* shift to next unit */ + count -= (ext_coord->width - ext_coord->pos_in_unit); + coord->unit_pos ++; + ext_coord->pos_in_unit = 0; + ext_coord->ext_offset += sizeof(reiser4_extent); + ext ++; + ON_DEBUG(ext_coord->extent = *ext); + ext_coord->width = extent_get_width(ext); + } while (1); + + return 0; +} + +static int +readahead_readpage_extent(void *vp, struct page *page) +{ + int result; + uf_coord_t *uf_coord; + coord_t *coord; + extent_coord_extension_t *ext_coord; + + uf_coord = vp; + coord = &uf_coord->coord; + + if (coord->between != AT_UNIT) { + unlock_page(page); + return RETERR(-EINVAL); + } + + ext_coord = &uf_coord->extension.extent; + if (ext_coord->expected_page != page->index) { + /* read_cache_pages skipped few pages. 
Try to adjust coord to page */ + assert("vs-1269", page->index > ext_coord->expected_page); + if (move_coord_pages(coord, ext_coord, page->index - ext_coord->expected_page)) { + /* extent pointing to this page is not here */ + unlock_page(page); + return RETERR(-EINVAL); + } + + assert("vs-1274", offset_is_in_unit(coord, + (loff_t)page->index << PAGE_CACHE_SHIFT)); + ext_coord->expected_page = page->index; + } + + assert("vs-1281", page->index == ext_coord->expected_page); + result = do_readpage_extent(ext_by_ext_coord(uf_coord), ext_coord->pos_in_unit, page); + if (!result) + move_coord_pages(coord, ext_coord, 1); + return result; +} + +static int +move_coord_forward(uf_coord_t *ext_coord) +{ + coord_t *coord; + extent_coord_extension_t *extension; + + assert("", coord_extension_is_ok(ext_coord)); + + extension = &ext_coord->extension.extent; + extension->pos_in_unit ++; + if (extension->pos_in_unit < extension->width) + /* stay within the same extent unit */ + return 0; + + coord = &ext_coord->coord; + + /* try to move to the next extent unit */ + coord->unit_pos ++; + if (coord->unit_pos < extension->nr_units) { + /* went to the next extent unit */ + reiser4_extent *ext; + + extension->pos_in_unit = 0; + extension->ext_offset += sizeof(reiser4_extent); + ext = ext_by_offset(coord->node, extension->ext_offset); + ON_DEBUG(extension->extent = *ext); + extension->width = extent_get_width(ext); + return 0; + } + + /* there are no more units in the item */ + return 1; +} + +/* this is called by read_cache_pages for each of the readahead pages */ +static int +extent_readpage_filler(void *data, struct page *page) +{ + hint_t *hint; + loff_t offset; + reiser4_key key; + uf_coord_t *ext_coord; + int result; + + offset = (loff_t)page->index << PAGE_CACHE_SHIFT; + key_by_inode_unix_file(page->mapping->host, offset, &key); + + hint = (hint_t *)data; + ext_coord = &hint->ext_coord; + if (hint_validate(hint, &key, 1/* check key */, ZNODE_READ_LOCK) != 0) { + result =
coord_by_key(current_tree, &key, &ext_coord->coord, + ext_coord->lh, ZNODE_READ_LOCK, + FIND_EXACT, TWIG_LEVEL, + TWIG_LEVEL, CBK_UNIQUE, NULL); + if (result != CBK_COORD_FOUND) { + unset_hint(hint); + return result; + } + ext_coord->valid = 0; + } + + if (zload(ext_coord->coord.node)) { + unset_hint(hint); + done_lh(ext_coord->lh); + return RETERR(-EIO); + } + + if (ext_coord->valid == 0) + init_coord_extension_extent(ext_coord, offset); + + assert("", (coord_extension_is_ok(ext_coord) && + coord_extension_is_ok2(ext_coord, &key))); + + result = do_readpage_extent(ext_by_ext_coord(ext_coord), + ext_coord->extension.extent.pos_in_unit, page); + if (!result && move_coord_forward(ext_coord) == 0) { + set_key_offset(&key, offset + PAGE_CACHE_SIZE); + set_hint(hint, &key, ZNODE_READ_LOCK); + } else + unset_hint(hint); + zrelse(ext_coord->coord.node); + done_lh(ext_coord->lh); + return result; +} + +/* this is called by reiser4_readpages */ +static void +extent_readpages_hook(struct address_space *mapping, struct list_head *pages, void *data) +{ + /* FIXME: try whether having reiser4_read_cache_pages improves anything */ + read_cache_pages(mapping, pages, extent_readpage_filler, data); +} + +static void +call_page_cache_readahead(struct address_space *mapping, struct file *file, + hint_t *hint, + unsigned long page_nr, + unsigned long ra_pages) +{ + reiser4_file_fsdata *fsdata; + + fsdata = reiser4_get_file_fsdata(file); + if (fsdata == NULL) + return; + fsdata->ra2.data = hint; + fsdata->ra2.readpages = extent_readpages_hook; + + page_cache_readahead(mapping, &file->f_ra, file, page_nr, ra_pages); + fsdata->ra2.readpages = NULL; +} + +/* Implements plugin->u.item.s.file.read operation for extent items. 
*/ +reiser4_internal int +read_extent(struct file *file, flow_t *flow, hint_t *hint) +{ + int result; + struct page *page; + unsigned long page_nr; + unsigned long page_off, count; + struct address_space *mapping; + loff_t file_off; + uf_coord_t *uf_coord; + coord_t *coord; + extent_coord_extension_t *ext_coord; + unsigned long ra_pages; + + assert("vs-1353", current_blocksize == PAGE_CACHE_SIZE); + assert("vs-572", flow->user == 1); + assert("vs-1351", flow->length > 0); + + uf_coord = &hint->ext_coord; + assert("vs-1318", coord_extension_is_ok(uf_coord)); + + coord = &uf_coord->coord; + assert("vs-1119", znode_is_rlocked(coord->node)); + assert("vs-1120", znode_is_loaded(coord->node)); + assert("vs-1256", coord_matches_key_extent(coord, &flow->key)); + + mapping = file->f_dentry->d_inode->i_mapping; + ext_coord = &uf_coord->extension.extent; + + /* offset in a file to start read from */ + file_off = get_key_offset(&flow->key); + /* index of page containing that offset */ + page_nr = (unsigned long)(file_off >> PAGE_CACHE_SHIFT); + /* offset within the page to start read from */ + page_off = (unsigned long)(file_off & (PAGE_CACHE_SIZE - 1)); + /* bytes which can be read from the page which contains file_off */ + count = PAGE_CACHE_SIZE - page_off; + /* number of pages flow spans over */ + ra_pages = (flow->length + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + + /* we start having twig node read locked. However, we do not want to + keep that lock all the time readahead works. So, set a seal and + release twig node.
*/ + set_hint(hint, &flow->key, ZNODE_READ_LOCK); + longterm_unlock_znode(hint->ext_coord.lh); + + do { + call_page_cache_readahead(mapping, file, hint, page_nr, ra_pages); + + /* this will return page if it exists and is uptodate, + otherwise it will allocate page and call readpage_extent to + fill it */ + page = read_cache_page(mapping, page_nr, readpage_unix_file, file); + if (IS_ERR(page)) + return PTR_ERR(page); + + wait_on_page_locked(page); + if (!PageUptodate(page)) { + page_detach_jnode(page, mapping, page_nr); + page_cache_release(page); + warning("jmacd-97178", "extent_read: page is not up to date"); + return RETERR(-EIO); + } + + /* If users can be writing to this page using arbitrary virtual addresses, take care about potential + aliasing before reading the page on the kernel side. + */ + if (mapping_writably_mapped(mapping)) + flush_dcache_page(page); + + assert("nikita-3034", schedulable()); + + /* number of bytes which are to be read from the page */ + if (count > flow->length) + count = flow->length; + /* user area is already get_user_pages-ed in read_unix_file, + which makes major page faults impossible */ + result = __copy_to_user(flow->data, (char *)kmap(page) + page_off, count); + kunmap(page); + + page_cache_release(page); + if (unlikely(result)) + return RETERR(-EFAULT); + + /* increase key (flow->key), update user area pointer (flow->data) */ + move_flow_forward(flow, count); + + page_off = 0; + page_nr ++; + count = PAGE_CACHE_SIZE; + ra_pages --; + } while (flow->length); + + return 0; +} + +/* + plugin->u.item.s.file.readpages +*/ +reiser4_internal void +readpages_extent(void *vp, struct address_space *mapping, struct list_head *pages) +{ + assert("vs-1739", 0); + if (vp) + read_cache_pages(mapping, pages, readahead_readpage_extent, vp); +} + +/* + plugin->s.file.readpage + reiser4_read->unix_file_read->page_cache_readahead->reiser4_readpage->unix_file_readpage->extent_readpage + or + 
filemap_nopage->reiser4_readpage->readpage_unix_file->->readpage_extent + + At the beginning: coord->node is read locked, zloaded, page is + locked, coord is set to existing unit inside of extent item (it is not necessary that coord matches to page->index) +*/ +reiser4_internal int +readpage_extent(void *vp, struct page *page) +{ + uf_coord_t *uf_coord = vp; + ON_DEBUG(coord_t *coord = &uf_coord->coord); + ON_DEBUG(reiser4_key key); + + assert("vs-1040", PageLocked(page)); + assert("vs-1050", !PageUptodate(page)); + assert("vs-757", !jprivate(page) && !PagePrivate(page)); + assert("vs-1039", page->mapping && page->mapping->host); + + assert("vs-1044", znode_is_loaded(coord->node)); + assert("vs-758", item_is_extent(coord)); + assert("vs-1046", coord_is_existing_unit(coord)); + assert("vs-1045", znode_is_rlocked(coord->node)); + assert("vs-1047", page->mapping->host->i_ino == get_key_objectid(item_key_by_coord(coord, &key))); + assert("vs-1320", coord_extension_is_ok(uf_coord)); + + return do_readpage_extent(ext_by_ext_coord(uf_coord), uf_coord->extension.extent.pos_in_unit, page); +} + +/* + plugin->s.file.capture + + At the beginning: coord.node is write locked, zloaded, page is not locked, coord is set to existing unit inside of + extent item +*/ +reiser4_internal int +capture_extent(reiser4_key *key, uf_coord_t *uf_coord, struct page *page, write_mode_t mode) +{ + jnode *j; + int result; + reiser4_block_nr blocknr; + int created; + int check_quota; + + assert("vs-1051", page->mapping && page->mapping->host); + assert("nikita-3139", !inode_get_flag(page->mapping->host, REISER4_NO_SD)); + assert("vs-864", znode_is_wlocked(uf_coord->coord.node)); + assert("vs-1398", get_key_objectid(key) == get_inode_oid(page->mapping->host)); + + /* FIXME: assume for now that quota is only checked on write */ + check_quota = 0; + result = make_extent(key, uf_coord, mode, &blocknr, &created, check_quota ? 
page->mapping->host : NULL); + if (result) { + done_lh(uf_coord->lh); + return result; + } + + lock_page(page); + j = jnode_of_page(page); + if (IS_ERR(j)) { + unlock_page(page); + done_lh(uf_coord->lh); + return PTR_ERR(j); + } + UNDER_SPIN_VOID(jnode, j, eflush_del(j, 1)); + set_page_dirty_internal(page, 0); + unlock_page(page); + + LOCK_JNODE(j); + BUG_ON(JF_ISSET(j, JNODE_EFLUSH)); + if (created) { + /* extent corresponding to this jnode was just created */ + assert("vs-1504", *jnode_get_block(j) == 0); + JF_SET(j, JNODE_CREATED); + /* new block is added to file. Update inode->i_blocks and inode->i_bytes. FIXME: + inode_set/get/add/sub_bytes is used to be called by quota macros */ + inode_add_bytes(page->mapping->host, PAGE_CACHE_SIZE); + } + + if (*jnode_get_block(j) == 0) + jnode_set_block(j, &blocknr); + else { + assert("vs-1508", !blocknr_is_fake(&blocknr)); + assert("vs-1507", ergo(blocknr, *jnode_get_block(j) == blocknr)); + } + UNLOCK_JNODE(j); + + done_lh(uf_coord->lh); + + LOCK_JNODE(j); + result = try_capture(j, ZNODE_WRITE_LOCK, 0, 1/* can_coc */); + if (result != 0) + reiser4_panic("nikita-3324", "Cannot capture jnode: %i", result); + jnode_make_dirty_locked(j); + UNLOCK_JNODE(j); + jput(j); + + if (created) + reiser4_update_sd(page->mapping->host); + /* warning about failure of this is issued already */ + + return 0; +} + +/* + plugin->u.item.s.file.get_block +*/ +reiser4_internal int +get_block_address_extent(const coord_t *coord, sector_t block, struct buffer_head *bh) +{ + reiser4_extent *ext; + + assert("vs-1321", coord_is_existing_unit(coord)); + + ext = extent_by_coord(coord); + + if (state_of_extent(ext) != ALLOCATED_EXTENT) + /* FIXME: bad things may happen if it is unallocated extent */ + bh->b_blocknr = 0; + else { + reiser4_key key; + + unit_key_by_coord(coord, &key); + assert("vs-1645", block >= get_key_offset(&key) >> current_blocksize_bits); + assert("vs-1646", block < (get_key_offset(&key) >> current_blocksize_bits) + 
extent_get_width(ext)); + bh->b_blocknr = extent_get_start(ext) + (block - (get_key_offset(&key) >> current_blocksize_bits)); + } + return 0; +} + +/* + plugin->u.item.s.file.append_key + key of the first byte following the last byte addressed by this extent +*/ +reiser4_internal reiser4_key * +append_key_extent(const coord_t *coord, reiser4_key *key) +{ + item_key_by_coord(coord, key); + set_key_offset(key, get_key_offset(key) + extent_size(coord, nr_units_extent(coord))); + + assert("vs-610", get_key_offset(key) && (get_key_offset(key) & (current_blocksize - 1)) == 0); + return key; +} + +/* plugin->u.item.s.file.init_coord_extension */ +reiser4_internal void +init_coord_extension_extent(uf_coord_t *uf_coord, loff_t lookuped) +{ + coord_t *coord; + extent_coord_extension_t *ext_coord; + reiser4_key key; + loff_t offset; + + assert("vs-1295", uf_coord->valid == 0); + + coord = &uf_coord->coord; + assert("vs-1288", coord_is_iplug_set(coord)); + assert("vs-1327", znode_is_loaded(coord->node)); + + if (coord->between != AFTER_UNIT && coord->between != AT_UNIT) + return; + + ext_coord = &uf_coord->extension.extent; + ext_coord->nr_units = nr_units_extent(coord); + ext_coord->ext_offset = (char *)extent_by_coord(coord) - zdata(coord->node); + ext_coord->width = extent_get_width(extent_by_coord(coord)); + ON_DEBUG(ext_coord->extent = *extent_by_coord(coord)); + uf_coord->valid = 1; + + /* pos_in_unit is the only uninitialized field in extended coord */ + if (coord->between == AFTER_UNIT) { + assert("vs-1330", coord->unit_pos == nr_units_extent(coord) - 1); + + ext_coord->pos_in_unit = ext_coord->width - 1; + } else { + /* AT_UNIT */ + unit_key_by_coord(coord, &key); + offset = get_key_offset(&key); + + assert("vs-1328", offset <= lookuped); + assert("vs-1329", lookuped < offset + ext_coord->width * current_blocksize); + ext_coord->pos_in_unit = ((lookuped - offset) >> current_blocksize_bits); + } +} + +/* + Local variables: + c-indentation-style: "K&R" +
mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/extent_flush_ops.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/extent_flush_ops.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1003 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "item.h" +#include "../../tree.h" +#include "../../jnode.h" +#include "../../super.h" +#include "../../flush.h" +#include "../../carry.h" +#include "../object.h" + +#include + +static reiser4_block_nr extent_unit_start(const coord_t *item); + +/* Return either first or last extent (depending on @side) of the item + @coord is set to. Set @pos_in_unit either to first or to last block + of extent. */ +static reiser4_extent * +extent_utmost_ext(const coord_t *coord, sideof side, reiser4_block_nr *pos_in_unit) +{ + reiser4_extent *ext; + + if (side == LEFT_SIDE) { + /* get first extent of item */ + ext = extent_item(coord); + *pos_in_unit = 0; + } else { + /* get last extent of item and last position within it */ + assert("vs-363", side == RIGHT_SIDE); + ext = extent_item(coord) + coord_last_unit_pos(coord); + *pos_in_unit = extent_get_width(ext) - 1; + } + + return ext; +} + +/* item_plugin->f.utmost_child */ +/* Return the child. Coord is set to extent item. 
Find jnode corresponding + either to first or to last unformatted node pointed by the item */ +reiser4_internal int +utmost_child_extent(const coord_t *coord, sideof side, jnode **childp) +{ + reiser4_extent *ext; + reiser4_block_nr pos_in_unit; + + ext = extent_utmost_ext(coord, side, &pos_in_unit); + + switch (state_of_extent(ext)) { + case HOLE_EXTENT: + *childp = NULL; + return 0; + case ALLOCATED_EXTENT: + case UNALLOCATED_EXTENT: + break; + default: + /* this should never happen */ + assert("vs-1417", 0); + } + + { + reiser4_key key; + reiser4_tree *tree; + unsigned long index; + + if (side == LEFT_SIDE) { + /* get key of first byte addressed by the extent */ + item_key_by_coord(coord, &key); + } else { + /* get key of byte which next after last byte addressed by the extent */ + append_key_extent(coord, &key); + } + + assert("vs-544", (get_key_offset(&key) >> PAGE_CACHE_SHIFT) < ~0ul); + /* index of first or last (depending on @side) page addressed + by the extent */ + index = (unsigned long) (get_key_offset(&key) >> PAGE_CACHE_SHIFT); + if (side == RIGHT_SIDE) + index --; + + tree = coord->node->zjnode.tree; + *childp = jlookup(tree, get_key_objectid(&key), index); + } + + return 0; +} + +/* item_plugin->f.utmost_child_real_block */ +/* Return the child's block, if allocated. */ +reiser4_internal int +utmost_child_real_block_extent(const coord_t *coord, sideof side, reiser4_block_nr *block) +{ + reiser4_extent *ext; + + ext = extent_by_coord(coord); + + switch (state_of_extent(ext)) { + case ALLOCATED_EXTENT: + *block = extent_get_start(ext); + if (side == RIGHT_SIDE) + *block += extent_get_width(ext) - 1; + break; + case HOLE_EXTENT: + case UNALLOCATED_EXTENT: + *block = 0; + break; + default: + /* this should never happen */ + assert("vs-1418", 0); + } + + return 0; +} + +/* item_plugin->f.scan */ +/* Performs leftward scanning starting from an unformatted node and its parent coordinate. 
+ This scan continues, advancing the parent coordinate, until either it encounters a + formatted child or it finishes scanning this node. + + If unallocated, the entire extent must be dirty and in the same atom. (Actually, I'm + not sure this is last property (same atom) is enforced, but it should be the case since + one atom must write the parent and the others must read the parent, thus fusing?). In + any case, the code below asserts this case for unallocated extents. Unallocated + extents are thus optimized because we can skip to the endpoint when scanning. + + It returns control to scan_extent, handles these terminating conditions, e.g., by + loading the next twig. +*/ +reiser4_internal int scan_extent(flush_scan * scan) +{ + coord_t coord; + jnode *neighbor; + unsigned long scan_index, unit_index, unit_width, scan_max, scan_dist; + reiser4_block_nr unit_start; + __u64 oid; + reiser4_key key; + int ret = 0, allocated, incr; + reiser4_tree *tree; + + if (!jnode_check_dirty(scan->node)) { + scan->stop = 1; + return 0; /* Race with truncate, this node is already + * truncated. */ + } + + coord_dup(&coord, &scan->parent_coord); + + assert("jmacd-1404", !scan_finished(scan)); + assert("jmacd-1405", jnode_get_level(scan->node) == LEAF_LEVEL); + assert("jmacd-1406", jnode_is_unformatted(scan->node)); + + /* The scan_index variable corresponds to the current page index of the + unformatted block scan position. 
*/ + scan_index = index_jnode(scan->node); + + assert("jmacd-7889", item_is_extent(&coord)); + +repeat: + /* objectid of file */ + oid = get_key_objectid(item_key_by_coord(&coord, &key)); + + allocated = !extent_is_unallocated(&coord); + /* Get the values of this extent unit: */ + unit_index = extent_unit_index(&coord); + unit_width = extent_unit_width(&coord); + unit_start = extent_unit_start(&coord); + + assert("jmacd-7187", unit_width > 0); + assert("jmacd-7188", scan_index >= unit_index); + assert("jmacd-7189", scan_index <= unit_index + unit_width - 1); + + /* Depending on the scan direction, we set different maximum values for scan_index + (scan_max) and the number of nodes that would be passed if the scan goes the + entire way (scan_dist). Incr is an integer reflecting the incremental + direction of scan_index. */ + if (scanning_left(scan)) { + scan_max = unit_index; + scan_dist = scan_index - unit_index; + incr = -1; + } else { + scan_max = unit_index + unit_width - 1; + scan_dist = scan_max - unit_index; + incr = +1; + } + + tree = coord.node->zjnode.tree; + + /* If the extent is allocated we have to check each of its blocks. If the extent + is unallocated we can skip to the scan_max. */ + if (allocated) { + do { + neighbor = jlookup(tree, oid, scan_index); + if (neighbor == NULL) + goto stop_same_parent; + + if (scan->node != neighbor && !scan_goto(scan, neighbor)) { + /* @neighbor was jput() by scan_goto(). */ + goto stop_same_parent; + } + + ret = scan_set_current(scan, neighbor, 1, &coord); + if (ret != 0) { + goto exit; + } + + /* reference to @neighbor is stored in @scan, no need + to jput(). */ + scan_index += incr; + + } while (incr + scan_max != scan_index); + + } else { + /* Optimized case for unallocated extents, skip to the end. 
*/ + neighbor = jlookup(tree, oid, scan_max/*index*/); + if (neighbor == NULL) { + /* Race with truncate */ + scan->stop = 1; + ret = 0; + goto exit; + } + + assert ("zam-1043", blocknr_is_fake(jnode_get_block(neighbor))); + + /* XXX commented assertion out, because it is inherently + * racy */ + /* assert("jmacd-3551", !jnode_check_flushprepped(neighbor) + && same_slum_check(neighbor, scan->node, 0, 0)); */ + + ret = scan_set_current(scan, neighbor, scan_dist, &coord); + if (ret != 0) { + goto exit; + } + } + + if (coord_sideof_unit(&coord, scan->direction) == 0 && item_is_extent(&coord)) { + /* Continue as long as there are more extent units. */ + + scan_index = + extent_unit_index(&coord) + (scanning_left(scan) ? extent_unit_width(&coord) - 1 : 0); + goto repeat; + } + + if (0) { +stop_same_parent: + + /* If we are scanning left and we stop in the middle of an allocated + extent, we know the preceder immediately.. */ + /* middle of extent is (scan_index - unit_index) != 0. */ + if (scanning_left(scan) && (scan_index - unit_index) != 0) { + /* FIXME(B): Someone should step-through and verify that this preceder + calculation is indeed correct. */ + /* @unit_start is starting block (number) of extent + unit. Flush stopped at the @scan_index block from + the beginning of the file, which is (scan_index - + unit_index) block within extent. + */ + if (unit_start) { + /* skip preceder update when we are at hole */ + scan->preceder_blk = unit_start + scan_index - unit_index; + check_preceder(scan->preceder_blk); + } + } + + /* In this case, we leave coord set to the parent of scan->node. */ + scan->stop = 1; + + } else { + /* In this case, we are still scanning, coord is set to the next item which is + either off-the-end of the node or not an extent. 
*/ + assert("jmacd-8912", scan->stop == 0); + assert("jmacd-7812", (coord_is_after_sideof_unit(&coord, scan->direction) + || !item_is_extent(&coord))); + } + + ret = 0; +exit: + return ret; +} + +/* ask block allocator for some blocks */ +static void +extent_allocate_blocks(reiser4_blocknr_hint *preceder, + reiser4_block_nr wanted_count, reiser4_block_nr *first_allocated, reiser4_block_nr *allocated, block_stage_t block_stage) +{ + *allocated = wanted_count; + preceder->max_dist = 0; /* scan whole disk, if needed */ + + /* that number of blocks (wanted_count) is either in UNALLOCATED or in GRABBED */ + preceder->block_stage = block_stage; + + /* FIXME: we do not handle errors here now */ + check_me("vs-420", reiser4_alloc_blocks (preceder, first_allocated, allocated, BA_PERMANENT) == 0); + /* update flush_pos's preceder to last allocated block number */ + preceder->blk = *first_allocated + *allocated - 1; +} + +/* when on flush time unallocated extent is to be replaced with allocated one it may happen that one unallocated extent + will have to be replaced with set of allocated extents. In this case insert_into_item will be called which may have + to add new nodes into tree. Space for that is taken from inviolable reserve (5%). 
*/ +static reiser4_block_nr +reserve_replace(void) +{ + reiser4_block_nr grabbed, needed; + + grabbed = get_current_context()->grabbed_blocks; + needed = estimate_one_insert_into_item(current_tree); + check_me("vpf-340", !reiser4_grab_space_force(needed, BA_RESERVED)); + return grabbed; +} + +static void +free_replace_reserved(reiser4_block_nr grabbed) +{ + reiser4_context *ctx; + + ctx = get_current_context(); + grabbed2free(ctx, get_super_private(ctx->super), + ctx->grabbed_blocks - grabbed); +} + +/* Block offset of first block addressed by unit */ +reiser4_internal __u64 +extent_unit_index(const coord_t *item) +{ + reiser4_key key; + + assert("vs-648", coord_is_existing_unit(item)); + unit_key_by_coord(item, &key); + return get_key_offset(&key) >> current_blocksize_bits; +} + +/* AUDIT shouldn't return value be of reiser4_block_nr type? + Josh's answer: who knows? Is a "number of blocks" the same type as "block offset"? */ +reiser4_internal __u64 +extent_unit_width(const coord_t *item) +{ + assert("vs-649", coord_is_existing_unit(item)); + return width_by_coord(item); +} + +/* Starting block location of this unit */ +static reiser4_block_nr +extent_unit_start(const coord_t *item) +{ + return extent_get_start(extent_by_coord(item)); +} + +/* replace allocated extent with two allocated extents */ +static int +split_allocated_extent(coord_t *coord, reiser4_block_nr pos_in_unit) +{ + int result; + reiser4_extent *ext; + reiser4_extent replace_ext; + reiser4_extent append_ext; + reiser4_key key; + reiser4_item_data item; + reiser4_block_nr grabbed; + + ext = extent_by_coord(coord); + assert("vs-1410", state_of_extent(ext) == ALLOCATED_EXTENT); + assert("vs-1411", extent_get_width(ext) > pos_in_unit); + + set_extent(&replace_ext, extent_get_start(ext), pos_in_unit); + set_extent(&append_ext, extent_get_start(ext) + pos_in_unit, extent_get_width(ext) - pos_in_unit); + + /* insert_into_item will insert new unit after the one @coord is set to. 
So, update key correspondingly */ + unit_key_by_coord(coord, &key); + set_key_offset(&key, (get_key_offset(&key) + pos_in_unit * current_blocksize)); + + grabbed = reserve_replace(); + result = replace_extent(coord, znode_lh(coord->node), &key, init_new_extent(&item, &append_ext, 1), + &replace_ext, COPI_DONT_SHIFT_LEFT, 0/* return replaced position */); + free_replace_reserved(grabbed); + return result; +} + +/* clear bit preventing node from being written bypassing extent allocation procedure */ +static inline void +junprotect (jnode * node) +{ + assert("zam-837", !JF_ISSET(node, JNODE_EFLUSH)); + assert("zam-838", JF_ISSET(node, JNODE_EPROTECTED)); + + JF_CLR(node, JNODE_EPROTECTED); +} + +/* this is used to unprotect nodes which were protected before allocating but which will not be allocated either because + space allocator allocates less blocks than were protected and/or if allocation of those nodes failed */ +static void +unprotect_extent_nodes(flush_pos_t *flush_pos, __u64 count, capture_list_head *protected_nodes) +{ + jnode *node, *tmp; + capture_list_head unprotected_nodes; + txn_atom *atom; + + capture_list_init(&unprotected_nodes); + + atom = atom_locked_by_fq(pos_fq(flush_pos)); + assert("vs-1468", atom); + + assert("vs-1469", !capture_list_empty(protected_nodes)); + assert("vs-1474", count > 0); + node = capture_list_back(protected_nodes); + do { + count --; + junprotect(node); + ON_DEBUG( + LOCK_JNODE(node); + count_jnode(atom, node, PROTECT_LIST, DIRTY_LIST, 0); + UNLOCK_JNODE(node); + ); + if (count == 0) { + break; + } + tmp = capture_list_prev(node); + node = tmp; + assert("vs-1470", !capture_list_end(protected_nodes, node)); + } while (1); + + /* move back to dirty list */ + capture_list_split(protected_nodes, &unprotected_nodes, node); + capture_list_splice(ATOM_DIRTY_LIST(atom, LEAF_LEVEL), &unprotected_nodes); + + UNLOCK_ATOM(atom); +} + +extern int getjevent(void); + +/* remove node from atom's list and put to the end of list @jnodes */ 
+static void +protect_reloc_node(capture_list_head *jnodes, jnode *node) +{ + assert("zam-836", !JF_ISSET(node, JNODE_EPROTECTED)); + assert("vs-1216", jnode_is_unformatted(node)); + assert("vs-1477", spin_atom_is_locked(node->atom)); + assert("nikita-3390", spin_jnode_is_locked(node)); + + JF_SET(node, JNODE_EPROTECTED); + capture_list_remove_clean(node); + capture_list_push_back(jnodes, node); + ON_DEBUG(count_jnode(node->atom, node, DIRTY_LIST, PROTECT_LIST, 0)); +} + +#define JNODES_TO_UNFLUSH (16) + +/* @count nodes of file (objectid @oid) starting from @index are going to be allocated. Protect those nodes from + e-flushing. Nodes which are eflushed already will be un-eflushed. There will be not more than JNODES_TO_UNFLUSH + un-eflushed nodes. If a node is not found or flushprepped - stop protecting */ +/* FIXME: it is likely that not flushprepped jnodes are on dirty capture list in sequential order.. */ +static int +protect_extent_nodes(flush_pos_t *flush_pos, oid_t oid, unsigned long index, reiser4_block_nr count, + reiser4_block_nr *protected, reiser4_extent *ext, + capture_list_head *protected_nodes) +{ + __u64 i; + __u64 j; + int result; + reiser4_tree *tree; + int eflushed; + jnode *buf[JNODES_TO_UNFLUSH]; + txn_atom *atom; + + assert("nikita-3394", capture_list_empty(protected_nodes)); + + tree = current_tree; + + atom = atom_locked_by_fq(pos_fq(flush_pos)); + assert("vs-1468", atom); + + assert("vs-1470", extent_get_width(ext) == count); + eflushed = 0; + *protected = 0; + for (i = 0; i < count; ++i, ++index) { + jnode *node; + + node = jlookup(tree, oid, index); + if (!node) + break; + + if (jnode_check_flushprepped(node)) { + atomic_dec(&node->x_count); + break; + } + + LOCK_JNODE(node); + assert("vs-1476", atomic_read(&node->x_count) > 1); + assert("nikita-3393", !JF_ISSET(node, JNODE_EPROTECTED)); + + if (JF_ISSET(node, JNODE_EFLUSH)) { + if (eflushed == JNODES_TO_UNFLUSH) { + UNLOCK_JNODE(node); + atomic_dec(&node->x_count); + break; + } + 
buf[eflushed] = node; + eflushed ++; + protect_reloc_node(protected_nodes, node); + UNLOCK_JNODE(node); + } else { + assert("nikita-3384", node->atom == atom); + protect_reloc_node(protected_nodes, node); + assert("nikita-3383", !JF_ISSET(node, JNODE_EFLUSH)); + UNLOCK_JNODE(node); + atomic_dec(&node->x_count); + } + + (*protected) ++; + } + UNLOCK_ATOM(atom); + + /* start io for eflushed nodes */ + for (j = 0; j < eflushed; ++ j) + jstartio(buf[j]); + + result = 0; + for (j = 0; j < eflushed; ++ j) { + if (result == 0) { + result = emergency_unflush(buf[j]); + if (result != 0) { + warning("nikita-3179", + "unflush failed: %i", result); + print_jnode("node", buf[j]); + } + } + jput(buf[j]); + } + if (result != 0) { + /* unprotect all the jnodes we have protected so far */ + unprotect_extent_nodes(flush_pos, i, protected_nodes); + } + return result; +} + +/* replace extent @ext by extent @replace. Try to merge @replace with previous extent of the item (if there is + one). Return 1 if it succeeded, 0 - otherwise */ +static int +try_to_merge_with_left(coord_t *coord, reiser4_extent *ext, reiser4_extent *replace) +{ + assert("vs-1415", extent_by_coord(coord) == ext); + + if (coord->unit_pos == 0 || state_of_extent(ext - 1) != ALLOCATED_EXTENT) + /* @ext either does not exist or is not allocated extent */ + return 0; + if (extent_get_start(ext - 1) + extent_get_width(ext - 1) != extent_get_start(replace)) + return 0; + + /* we can glue, widen previous unit */ + extent_set_width(ext - 1, extent_get_width(ext - 1) + extent_get_width(replace)); + + if (extent_get_width(ext) != extent_get_width(replace)) { + /* make current extent narrower */ + if (state_of_extent(ext) == ALLOCATED_EXTENT) + extent_set_start(ext, extent_get_start(ext) + extent_get_width(replace)); + extent_set_width(ext, extent_get_width(ext) - extent_get_width(replace)); + } else { + /* current extent completely glued with its left neighbor, remove it */ + coord_t from, to; + + coord_dup(&from, coord); + 
from.unit_pos = nr_units_extent(coord) - 1; + coord_dup(&to, &from); + + /* currently cut from extent can cut either from the beginning or from the end. Move place which got + freed after unit removal to end of item */ + memmove(ext, ext + 1, (from.unit_pos - coord->unit_pos) * sizeof(reiser4_extent)); + /* wipe part of item which is going to be cut, so that node_check will not be confused */ + cut_node_content(&from, &to, NULL, NULL, NULL); + } + znode_make_dirty(coord->node); + /* move coord back */ + coord->unit_pos --; + return 1; +} + +/* replace extent (unallocated or allocated) pointed by @coord with extent @replace (allocated). If @replace is shorter + than @coord - add padding extent */ +static int +conv_extent(coord_t *coord, reiser4_extent *replace) +{ + int result; + reiser4_extent *ext; + reiser4_extent padd_ext; + reiser4_block_nr start, width, new_width; + reiser4_block_nr grabbed; + reiser4_item_data item; + reiser4_key key; + extent_state state; + + ext = extent_by_coord(coord); + state = state_of_extent(ext); + start = extent_get_start(ext); + width = extent_get_width(ext); + new_width = extent_get_width(replace); + + assert("vs-1458", state == UNALLOCATED_EXTENT || state == ALLOCATED_EXTENT); + assert("vs-1459", width >= new_width); + + if (try_to_merge_with_left(coord, ext, replace)) { + /* merged @replace with left neighbor. Current unit is either removed or narrowed */ + return 0; + } + + if (width == new_width) { + /* replace current extent with @replace */ + *ext = *replace; + znode_make_dirty(coord->node); + return 0; + } + + /* replace @ext with @replace and padding extent */ + set_extent(&padd_ext, state == ALLOCATED_EXTENT ? (start + new_width) : UNALLOCATED_EXTENT_START, + width - new_width); + + /* insert_into_item will insert new units after the one @coord is set to. 
So, update key correspondingly */ + unit_key_by_coord(coord, &key); + set_key_offset(&key, (get_key_offset(&key) + new_width * current_blocksize)); + + grabbed = reserve_replace(); + result = replace_extent(coord, znode_lh(coord->node), &key, init_new_extent(&item, &padd_ext, 1), + replace, COPI_DONT_SHIFT_LEFT, 0/* return replaced position */); + + free_replace_reserved(grabbed); + return result; +} + +/* for every jnode from @protected_nodes list assign block number and mark it RELOC and FLUSH_QUEUED. Attach whole + @protected_nodes list to flush queue's prepped list */ +static void +assign_real_blocknrs(flush_pos_t *flush_pos, reiser4_block_nr first, reiser4_block_nr count, + extent_state state, capture_list_head *protected_nodes) +{ + jnode *node; + txn_atom *atom; + flush_queue_t *fq; + int i; + + fq = pos_fq(flush_pos); + atom = atom_locked_by_fq(fq); + assert("vs-1468", atom); + + i = 0; + for_all_type_safe_list(capture, protected_nodes, node) { + LOCK_JNODE(node); + assert("vs-1132", ergo(state == UNALLOCATED_EXTENT, blocknr_is_fake(jnode_get_block(node)))); + assert("vs-1475", node->atom == atom); + assert("vs-1476", atomic_read(&node->x_count) > 0); + JF_CLR(node, JNODE_FLUSH_RESERVED); + jnode_set_block(node, &first); + unformatted_make_reloc(node, fq); + /*XXXX*/ON_DEBUG(count_jnode(node->atom, node, PROTECT_LIST, FQ_LIST, 0)); + junprotect(node); + assert("", NODE_LIST(node) == FQ_LIST); + UNLOCK_JNODE(node); + first ++; + i ++; + } + + capture_list_splice(ATOM_FQ_LIST(fq), protected_nodes); + /*XXX*/ + assert("vs-1687", count == i); + if (state == UNALLOCATED_EXTENT) + dec_unalloc_unfm_ptrs(count); + UNLOCK_ATOM(atom); +} + +static void +make_node_ovrwr(capture_list_head *jnodes, jnode *node) +{ + LOCK_JNODE(node); + + assert ("zam-917", !JF_ISSET(node, JNODE_RELOC)); + assert ("zam-918", !JF_ISSET(node, JNODE_OVRWR)); + + JF_SET(node, JNODE_OVRWR); + capture_list_remove_clean(node); + capture_list_push_back(jnodes, node); + 
ON_DEBUG(count_jnode(node->atom, node, DIRTY_LIST, OVRWR_LIST, 0)); + + UNLOCK_JNODE(node); +} + +/* put nodes of one extent (file objectid @oid, extent width @width) to overwrite set. Starting from the one with index + @index. If end of slum is detected (node is not found or flushprepped) - stop iterating and set flush position's + state to POS_INVALID */ +static void +mark_jnodes_overwrite(flush_pos_t *flush_pos, oid_t oid, unsigned long index, reiser4_block_nr width) +{ + unsigned long i; + reiser4_tree *tree; + jnode *node; + txn_atom *atom; + capture_list_head jnodes; + + capture_list_init(&jnodes); + + tree = current_tree; + + atom = atom_locked_by_fq(pos_fq(flush_pos)); + assert("vs-1478", atom); + + for (i = flush_pos->pos_in_unit; i < width; i ++, index ++) { + node = jlookup(tree, oid, index); + if (!node) { + flush_pos->state = POS_INVALID; + break; + } + if (jnode_check_flushprepped(node)) { + flush_pos->state = POS_INVALID; + atomic_dec(&node->x_count); + break; + } + make_node_ovrwr(&jnodes, node); + atomic_dec(&node->x_count); + } + + capture_list_splice(ATOM_OVRWR_LIST(atom), &jnodes); + UNLOCK_ATOM(atom); +} + +/* this is called by handle_pos_on_twig to proceed extent unit flush_pos->coord is set to. It is to prepare for flushing + sequence of not flushprepped nodes (slum). It supposes that slum starts at flush_pos->pos_in_unit position within the + extent. 
Slum gets to relocate set if flush_pos->leaf_relocate is set to 1 and to overwrite set otherwise */ +reiser4_internal int +alloc_extent(flush_pos_t *flush_pos) +{ + coord_t *coord; + reiser4_extent *ext; + reiser4_extent replace_ext; + oid_t oid; + reiser4_block_nr protected; + reiser4_block_nr start; + __u64 index; + __u64 width; + extent_state state; + int result; + reiser4_block_nr first_allocated; + __u64 allocated; + reiser4_key key; + block_stage_t block_stage; + + assert("vs-1468", flush_pos->state == POS_ON_EPOINT); + assert("vs-1469", coord_is_existing_unit(&flush_pos->coord) && item_is_extent(&flush_pos->coord)); + + coord = &flush_pos->coord; + + ext = extent_by_coord(coord); + state = state_of_extent(ext); + if (state == HOLE_EXTENT) { + flush_pos->state = POS_INVALID; + return 0; + } + + item_key_by_coord(coord, &key); + oid = get_key_objectid(&key); + index = extent_unit_index(coord) + flush_pos->pos_in_unit; + start = extent_get_start(ext); + width = extent_get_width(ext); + + assert("vs-1457", width > flush_pos->pos_in_unit); + + if (flush_pos->leaf_relocate || state == UNALLOCATED_EXTENT) { + protected_jnodes jnodes; + + /* relocate */ + if (flush_pos->pos_in_unit) { + /* split extent unit into two */ + result = split_allocated_extent(coord, flush_pos->pos_in_unit); + flush_pos->pos_in_unit = 0; + return result; + } + + /* Prevent nodes from e-flushing before allocating disk space for them. Nodes which were eflushed will be + read from their temporary locations (but not more than certain limit: JNODES_TO_UNFLUSH) and that + disk space will be freed. */ + + protected_jnodes_init(&jnodes); + + result = protect_extent_nodes(flush_pos, oid, index, width, &protected, ext, &jnodes.nodes); + if (result) { + warning("vs-1469", "Failed to protect extent. 
Should not happen\n"); + protected_jnodes_done(&jnodes); + return result; + } + if (protected == 0) { + flush_pos->state = POS_INVALID; + flush_pos->pos_in_unit = 0; + protected_jnodes_done(&jnodes); + return 0; + } + + if (state == ALLOCATED_EXTENT) + /* all protected nodes are not flushprepped, therefore + * they are counted as flush_reserved */ + block_stage = BLOCK_FLUSH_RESERVED; + else + block_stage = BLOCK_UNALLOCATED; + + /* allocate new block numbers for protected nodes */ + extent_allocate_blocks(pos_hint(flush_pos), protected, &first_allocated, &allocated, block_stage); + + if (allocated != protected) + /* unprotect nodes which will not be + * allocated/relocated on this iteration */ + unprotect_extent_nodes(flush_pos, protected - allocated, + &jnodes.nodes); + if (state == ALLOCATED_EXTENT) { + /* on relocating - free nodes which are going to be + * relocated */ + reiser4_dealloc_blocks(&start, &allocated, BLOCK_ALLOCATED, BA_DEFER); + } + + /* assign new block numbers to protected nodes */ + assign_real_blocknrs(flush_pos, first_allocated, allocated, state, &jnodes.nodes); + + protected_jnodes_done(&jnodes); + + /* prepare extent which will replace current one */ + set_extent(&replace_ext, first_allocated, allocated); + + /* adjust extent item */ + result = conv_extent(coord, &replace_ext); + if (result != 0 && result != -ENOMEM) { + warning("vs-1461", "Failed to allocate extent. Should not happen\n"); + return result; + } + } else { + /* overwrite */ + mark_jnodes_overwrite(flush_pos, oid, index, width); + } + flush_pos->pos_in_unit = 0; + return 0; +} + +/* if @key is glueable to the item @coord is set to */ +static int +must_insert(const coord_t *coord, const reiser4_key *key) +{ + reiser4_key last; + + if (item_id_by_coord(coord) == EXTENT_POINTER_ID && keyeq(append_key_extent(coord, &last), key)) + return 0; + return 1; +} + + /* copy extent @copy to the end of @node. 
It may have to either insert new item after the last one, or append last item, + or modify last unit of last item to have greater width */ +static int +put_unit_to_end(znode *node, const reiser4_key *key, reiser4_extent *copy_ext) +{ + int result; + coord_t coord; + cop_insert_flag flags; + reiser4_extent *last_ext; + reiser4_item_data data; + + /* set coord after last unit in an item */ + coord_init_last_unit(&coord, node); + coord.between = AFTER_UNIT; + + flags = COPI_DONT_SHIFT_LEFT | COPI_DONT_SHIFT_RIGHT | COPI_DONT_ALLOCATE; + if (must_insert(&coord, key)) { + result = insert_by_coord(&coord, init_new_extent(&data, copy_ext, 1), key, 0 /*lh */ , flags); + + } else { + /* try to glue with last unit */ + last_ext = extent_by_coord(&coord); + if (state_of_extent(last_ext) && + extent_get_start(last_ext) + extent_get_width(last_ext) == extent_get_start(copy_ext)) { + /* widen last unit of node */ + extent_set_width(last_ext, extent_get_width(last_ext) + extent_get_width(copy_ext)); + znode_make_dirty(node); + return 0; + } + + /* FIXME: put an assertion here that we can not merge last unit in @node and new unit */ + result = insert_into_item(&coord, 0 /*lh */ , key, init_new_extent(&data, copy_ext, 1), flags); + } + + assert("vs-438", result == 0 || result == -E_NODE_FULL); + return result; +} + +/* @coord is set to extent unit */ +reiser4_internal squeeze_result +squalloc_extent(znode *left, const coord_t *coord, flush_pos_t *flush_pos, reiser4_key *stop_key) +{ + reiser4_extent *ext; + __u64 index; + __u64 width; + reiser4_block_nr start; + extent_state state; + oid_t oid; + reiser4_block_nr first_allocated; + __u64 allocated; + __u64 protected; + reiser4_extent copy_extent; + reiser4_key key; + int result; + block_stage_t block_stage; + + assert("vs-1457", flush_pos->pos_in_unit == 0); + assert("vs-1467", coord_is_leftmost_unit(coord)); + assert("vs-1467", item_is_extent(coord)); + + ext = extent_by_coord(coord); + index = extent_unit_index(coord); + start = 
extent_get_start(ext); + width = extent_get_width(ext); + state = state_of_extent(ext); + unit_key_by_coord(coord, &key); + oid = get_key_objectid(&key); + + if (flush_pos->leaf_relocate || state == UNALLOCATED_EXTENT) { + protected_jnodes jnodes; + + /* relocate */ + protected_jnodes_init(&jnodes); + result = protect_extent_nodes(flush_pos, oid, index, width, &protected, ext, &jnodes.nodes); + if (result) { + warning("vs-1469", "Failed to protect extent. Should not happen\n"); + protected_jnodes_done(&jnodes); + return result; + } + if (protected == 0) { + flush_pos->state = POS_INVALID; + protected_jnodes_done(&jnodes); + return 0; + } + + if (state == ALLOCATED_EXTENT) + /* all protected nodes are not flushprepped, therefore + * they are counted as flush_reserved */ + block_stage = BLOCK_FLUSH_RESERVED; + else + block_stage = BLOCK_UNALLOCATED; + + /* allocate new block numbers for protected nodes */ + extent_allocate_blocks(pos_hint(flush_pos), protected, &first_allocated, &allocated, block_stage); + if (allocated != protected) + unprotect_extent_nodes(flush_pos, protected - allocated, + &jnodes.nodes); + + /* prepare extent which will be copied to left */ + set_extent(&copy_extent, first_allocated, allocated); + + result = put_unit_to_end(left, &key, &copy_extent); + if (result == -E_NODE_FULL) { + int target_block_stage; + + /* free blocks which were just allocated */ + target_block_stage = (state == ALLOCATED_EXTENT) ? BLOCK_FLUSH_RESERVED : BLOCK_UNALLOCATED; + reiser4_dealloc_blocks(&first_allocated, &allocated, target_block_stage, BA_PERMANENT); + unprotect_extent_nodes(flush_pos, allocated, &jnodes.nodes); + + /* rewind the preceder. 
*/ + flush_pos->preceder.blk = first_allocated; + check_preceder(flush_pos->preceder.blk); + + protected_jnodes_done(&jnodes); + return SQUEEZE_TARGET_FULL; + } + + if (state == ALLOCATED_EXTENT) { + /* free nodes which were relocated */ + reiser4_dealloc_blocks(&start, &allocated, BLOCK_ALLOCATED, BA_DEFER); + } + + /* assign new block numbers to protected nodes */ + assign_real_blocknrs(flush_pos, first_allocated, allocated, state, &jnodes.nodes); + protected_jnodes_done(&jnodes); + + set_key_offset(&key, get_key_offset(&key) + (allocated << current_blocksize_bits)); + } else { + /* overwrite: try to copy unit as it is to left neighbor and + make all first not flushprepped nodes overwrite nodes */ + set_extent(&copy_extent, start, width); + result = put_unit_to_end(left, &key, &copy_extent); + if (result == -E_NODE_FULL) { + return SQUEEZE_TARGET_FULL; + } + mark_jnodes_overwrite(flush_pos, oid, index, width); + set_key_offset(&key, get_key_offset(&key) + (width << current_blocksize_bits)); + } + *stop_key = key; + return SQUEEZE_CONTINUE; +} + +reiser4_internal int +key_by_offset_extent(struct inode *inode, loff_t off, reiser4_key *key) +{ + return key_by_inode_and_offset_common(inode, off, key); +} + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/extent.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/extent.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,173 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#ifndef __REISER4_EXTENT_H__ +#define __REISER4_EXTENT_H__ + +/* on disk extent */ +typedef struct { + reiser4_dblock_nr start; + reiser4_dblock_nr width; +} reiser4_extent; + +typedef struct extent_stat { + int unallocated_units; + int unallocated_blocks; + int allocated_units; + int allocated_blocks; + int hole_units; + int hole_blocks; +} extent_stat; + +/* 
extents in an extent item can be either holes, or unallocated or allocated + extents */ +typedef enum { + HOLE_EXTENT, + UNALLOCATED_EXTENT, + ALLOCATED_EXTENT +} extent_state; + +#define HOLE_EXTENT_START 0 +#define UNALLOCATED_EXTENT_START 1 +#define UNALLOCATED_EXTENT_START2 2 + +typedef struct { + reiser4_block_nr pos_in_unit; + reiser4_block_nr width; /* width of current unit */ + pos_in_node_t nr_units; /* number of units */ + int ext_offset; /* offset from the beginning of zdata() */ + unsigned long expected_page; +#if REISER4_DEBUG + reiser4_extent extent; +#endif +} extent_coord_extension_t; + +/* macros to set/get fields of on-disk extent */ +static inline reiser4_block_nr +extent_get_start(const reiser4_extent * ext) +{ + return dblock_to_cpu(&ext->start); +} + +static inline reiser4_block_nr +extent_get_width(const reiser4_extent * ext) +{ + return dblock_to_cpu(&ext->width); +} + +extern __u64 reiser4_current_block_count(void); + +static inline void +extent_set_start(reiser4_extent * ext, reiser4_block_nr start) +{ + cassert(sizeof (ext->start) == 8); + assert("nikita-2510", ergo(start > 1, start < reiser4_current_block_count())); + cpu_to_dblock(start, &ext->start); +} + +static inline void +extent_set_width(reiser4_extent *ext, reiser4_block_nr width) +{ + cassert(sizeof (ext->width) == 8); + cpu_to_dblock(width, &ext->width); + assert("nikita-2511", + ergo(extent_get_start(ext) > 1, + extent_get_start(ext) + width <= reiser4_current_block_count())); +} + +#define extent_item(coord) \ +({ \ + assert("nikita-3143", item_is_extent(coord)); \ + ((reiser4_extent *)item_body_by_coord (coord)); \ +}) + +#define extent_by_coord(coord) \ +({ \ + assert("nikita-3144", item_is_extent(coord)); \ + (extent_item (coord) + (coord)->unit_pos); \ +}) + +#define width_by_coord(coord) \ +({ \ + assert("nikita-3145", item_is_extent(coord)); \ + extent_get_width (extent_by_coord(coord)); \ +}) + +struct carry_cut_data; +struct carry_kill_data; + +/* plugin->u.item.b.* 
*/ +reiser4_key *max_key_inside_extent(const coord_t *, reiser4_key *); +int can_contain_key_extent(const coord_t * coord, const reiser4_key * key, const reiser4_item_data *); +int mergeable_extent(const coord_t * p1, const coord_t * p2); +pos_in_node_t nr_units_extent(const coord_t *); +lookup_result lookup_extent(const reiser4_key *, lookup_bias, coord_t *); +void init_coord_extent(coord_t *); +int init_extent(coord_t *, reiser4_item_data *); +int paste_extent(coord_t *, reiser4_item_data *, carry_plugin_info *); +int can_shift_extent(unsigned free_space, + coord_t * source, znode * target, shift_direction, unsigned *size, unsigned want); +void copy_units_extent(coord_t * target, + coord_t * source, + unsigned from, unsigned count, shift_direction where_is_free_space, unsigned free_space); +int kill_hook_extent(const coord_t *, pos_in_node_t from, pos_in_node_t count, struct carry_kill_data *); +int create_hook_extent(const coord_t * coord, void *arg); +int cut_units_extent(coord_t *coord, pos_in_node_t from, pos_in_node_t to, + struct carry_cut_data *, reiser4_key *smallest_removed, reiser4_key *new_first); +int kill_units_extent(coord_t *coord, pos_in_node_t from, pos_in_node_t to, + struct carry_kill_data *, reiser4_key *smallest_removed, reiser4_key *new_first); +reiser4_key *unit_key_extent(const coord_t *, reiser4_key *); +reiser4_key *max_unit_key_extent(const coord_t *, reiser4_key *); +void print_extent(const char *, coord_t *); +int utmost_child_extent(const coord_t * coord, sideof side, jnode ** child); +int utmost_child_real_block_extent(const coord_t * coord, sideof side, reiser4_block_nr * block); +void item_stat_extent(const coord_t * coord, void *vp); +int check_extent(const coord_t * coord, const char **error); + +/* plugin->u.item.s.file.* */ +int write_extent(struct inode *, flow_t *, hint_t *, int grabbed, write_mode_t); +int read_extent(struct file *, flow_t *, hint_t *); +int readpage_extent(void *, struct page *); +void 
readpages_extent(void *, struct address_space *, struct list_head *pages); +int capture_extent(reiser4_key *, uf_coord_t *, struct page *, write_mode_t); +reiser4_key *append_key_extent(const coord_t *, reiser4_key *); +void init_coord_extension_extent(uf_coord_t *, loff_t offset); +int get_block_address_extent(const coord_t *, sector_t block, struct buffer_head *); + + +/* these are used in flush.c + FIXME-VS: should they be somewhere in item_plugin? */ +int allocate_extent_item_in_place(coord_t *, lock_handle *, flush_pos_t * pos); +int allocate_and_copy_extent(znode * left, coord_t * right, flush_pos_t * pos, reiser4_key * stop_key); + +int extent_is_unallocated(const coord_t * item); /* True if this extent is unallocated (i.e., not a hole, not allocated). */ +__u64 extent_unit_index(const coord_t * item); /* Block offset of this unit. */ +__u64 extent_unit_width(const coord_t * item); /* Number of blocks in this unit. */ + +/* plugin->u.item.f. */ +int scan_extent (flush_scan * scan); +extern int key_by_offset_extent(struct inode *, loff_t, reiser4_key *); + +reiser4_item_data *init_new_extent(reiser4_item_data *data, void *ext_unit, int nr_extents); +reiser4_block_nr extent_size(const coord_t *coord, pos_in_node_t nr); +extent_state state_of_extent(reiser4_extent *ext); +void set_extent(reiser4_extent *ext, reiser4_block_nr start, reiser4_block_nr width); +int replace_extent(coord_t *un_extent, lock_handle *lh, + reiser4_key *key, reiser4_item_data *data, const reiser4_extent *new_ext, unsigned flags, int); +lock_handle *znode_lh(znode *); + +/* the reiser4 repacker support */ +struct repacker_cursor; +extern int process_extent_backward_for_repacking (tap_t *, struct repacker_cursor *); +extern int mark_extent_for_repacking (tap_t *, int); + +/* __REISER4_EXTENT_H__ */ +#endif +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null 
fs/reiser4/plugin/item/extent_item_ops.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/extent_item_ops.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,791 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "item.h" +#include "../../inode.h" +#include "../../tree_walk.h" /* check_sibling_list() */ +#include "../../page_cache.h" +#include "../../carry.h" + +#include + +/* item_plugin->b.max_key_inside */ +reiser4_internal reiser4_key * +max_key_inside_extent(const coord_t *coord, reiser4_key *key) +{ + item_key_by_coord(coord, key); + set_key_offset(key, get_key_offset(max_key())); + return key; +} + +/* item_plugin->b.can_contain_key + this checks whether @key of @data is matching to position set by @coord */ +reiser4_internal int +can_contain_key_extent(const coord_t *coord, const reiser4_key *key, const reiser4_item_data *data) +{ + reiser4_key item_key; + + if (item_plugin_by_coord(coord) != data->iplug) + return 0; + + item_key_by_coord(coord, &item_key); + if (get_key_locality(key) != get_key_locality(&item_key) || + get_key_objectid(key) != get_key_objectid(&item_key) || + get_key_ordering(key) != get_key_ordering(&item_key)) return 0; + + return 1; +} + +/* item_plugin->b.mergeable + first item is of extent type */ +/* Audited by: green(2002.06.13) */ +reiser4_internal int +mergeable_extent(const coord_t *p1, const coord_t *p2) +{ + reiser4_key key1, key2; + + assert("vs-299", item_id_by_coord(p1) == EXTENT_POINTER_ID); + /* FIXME-VS: Which is it? 
Assert or return 0 */ + if (item_id_by_coord(p2) != EXTENT_POINTER_ID) { + return 0; + } + + item_key_by_coord(p1, &key1); + item_key_by_coord(p2, &key2); + if (get_key_locality(&key1) != get_key_locality(&key2) || + get_key_objectid(&key1) != get_key_objectid(&key2) || + get_key_ordering(&key1) != get_key_ordering(&key2) || + get_key_type(&key1) != get_key_type(&key2)) + return 0; + if (get_key_offset(&key1) + extent_size(p1, nr_units_extent(p1)) != get_key_offset(&key2)) + return 0; + return 1; +} + +/* item_plugin->b.nr_units */ +reiser4_internal pos_in_node_t +nr_units_extent(const coord_t *coord) +{ + /* length of extent item has to be multiple of extent size */ + assert("vs-1424", (item_length_by_coord(coord) % sizeof (reiser4_extent)) == 0); + return item_length_by_coord(coord) / sizeof (reiser4_extent); +} + +/* item_plugin->b.lookup */ +reiser4_internal lookup_result +lookup_extent(const reiser4_key *key, lookup_bias bias UNUSED_ARG, coord_t *coord) +{ /* znode and item_pos are + set to an extent item to + look through */ + reiser4_key item_key; + reiser4_block_nr lookuped, offset; + unsigned i, nr_units; + reiser4_extent *ext; + unsigned blocksize; + unsigned char blocksize_bits; + + item_key_by_coord(coord, &item_key); + offset = get_key_offset(&item_key); + + /* key we are looking for must be greater than key of item @coord */ + assert("vs-414", keygt(key, &item_key)); + + assert("umka-99945", + !keygt(key, max_key_inside_extent(coord, &item_key))); + + ext = extent_item(coord); + assert("vs-1350", (char *)ext == (zdata(coord->node) + coord->offset)); + + blocksize = current_blocksize; + blocksize_bits = current_blocksize_bits; + + /* offset we are looking for */ + lookuped = get_key_offset(key); + + nr_units = nr_units_extent(coord); + /* go through all extents until the one which address given offset */ + for (i = 0; i < nr_units; i++, ext++) { + offset += (extent_get_width(ext) << blocksize_bits); + if (offset > lookuped) { + /* desired byte is 
somewhere in this extent */ + coord->unit_pos = i; + coord->between = AT_UNIT; + return CBK_COORD_FOUND; + } + } + + /* set coord after last unit */ + coord->unit_pos = nr_units - 1; + coord->between = AFTER_UNIT; + return CBK_COORD_FOUND; +} + +/* item_plugin->b.paste + item @coord is set to has been appended with @data->length of free + space. data->data contains data to be pasted into the item in position + @coord->in_item.unit_pos. It must fit into that free space. + @coord must be set between units. +*/ +reiser4_internal int +paste_extent(coord_t *coord, reiser4_item_data *data, carry_plugin_info *info UNUSED_ARG) +{ + unsigned old_nr_units; + reiser4_extent *ext; + int item_length; + + ext = extent_item(coord); + item_length = item_length_by_coord(coord); + old_nr_units = (item_length - data->length) / sizeof (reiser4_extent); + + /* this is also used to copy extent into newly created item, so + old_nr_units could be 0 */ + assert("vs-260", item_length >= data->length); + + /* make sure that coord is set properly */ + assert("vs-35", ((!coord_is_existing_unit(coord)) || (!old_nr_units && !coord->unit_pos))); + + /* first unit to be moved */ + switch (coord->between) { + case AFTER_UNIT: + coord->unit_pos++; + case BEFORE_UNIT: + coord->between = AT_UNIT; + break; + case AT_UNIT: + assert("vs-331", !old_nr_units && !coord->unit_pos); + break; + default: + impossible("vs-330", "coord is set improperly"); + } + + /* prepare space for new units */ + memmove(ext + coord->unit_pos + data->length / sizeof (reiser4_extent), + ext + coord->unit_pos, (old_nr_units - coord->unit_pos) * sizeof (reiser4_extent)); + + /* copy new data from kernel space */ + assert("vs-556", data->user == 0); + memcpy(ext + coord->unit_pos, data->data, (unsigned) data->length); + + /* after paste @coord is set to first of pasted units */ + assert("vs-332", coord_is_existing_unit(coord)); + assert("vs-333", !memcmp(data->data, extent_by_coord(coord), (unsigned) data->length)); + return 0; +} 
+ +/* item_plugin->b.can_shift */ +reiser4_internal int +can_shift_extent(unsigned free_space, coord_t *source, + znode *target UNUSED_ARG, shift_direction pend UNUSED_ARG, unsigned *size, unsigned want) +{ + *size = item_length_by_coord(source); + if (*size > free_space) + /* never split a unit of extent item */ + *size = free_space - free_space % sizeof (reiser4_extent); + + /* we can shift *size bytes, calculate how many do we want to shift */ + if (*size > want * sizeof (reiser4_extent)) + *size = want * sizeof (reiser4_extent); + + if (*size % sizeof (reiser4_extent) != 0) + impossible("vs-119", "Wrong extent size: %i %i", *size, sizeof (reiser4_extent)); + return *size / sizeof (reiser4_extent); + +} + +/* item_plugin->b.copy_units */ +reiser4_internal void +copy_units_extent(coord_t *target, coord_t *source, + unsigned from, unsigned count, shift_direction where_is_free_space, unsigned free_space) +{ + char *from_ext, *to_ext; + + assert("vs-217", free_space == count * sizeof (reiser4_extent)); + + from_ext = item_body_by_coord(source); + to_ext = item_body_by_coord(target); + + if (where_is_free_space == SHIFT_LEFT) { + assert("vs-215", from == 0); + + /* At this moment, item length was already updated in the item + header by shifting code, hence nr_units_extent() will + return "new" number of units---one we obtain after copying + units. 
+ */ + to_ext += (nr_units_extent(target) - count) * sizeof (reiser4_extent); + } else { + reiser4_key key; + coord_t coord; + + assert("vs-216", from + count == coord_last_unit_pos(source) + 1); + + from_ext += item_length_by_coord(source) - free_space; + + /* new units are inserted before first unit in an item, + therefore, we have to update item key */ + coord = *source; + coord.unit_pos = from; + unit_key_extent(&coord, &key); + + node_plugin_by_node(target->node)->update_item_key(target, &key, 0/*info */); + } + + memcpy(to_ext, from_ext, free_space); +} + +/* item_plugin->b.create_hook + @arg is znode of leaf node for which we need to update right delimiting key */ +reiser4_internal int +create_hook_extent(const coord_t *coord, void *arg) +{ + coord_t *child_coord; + znode *node; + reiser4_key key; + reiser4_tree *tree; + + if (!arg) + return 0; + + child_coord = arg; + tree = znode_get_tree(coord->node); + + assert("nikita-3246", znode_get_level(child_coord->node) == LEAF_LEVEL); + + WLOCK_TREE(tree); + WLOCK_DK(tree); + /* find a node on the left level for which right delimiting key has to + be updated */ + if (coord_wrt(child_coord) == COORD_ON_THE_LEFT) { + assert("vs-411", znode_is_left_connected(child_coord->node)); + node = child_coord->node->left; + } else { + assert("vs-412", coord_wrt(child_coord) == COORD_ON_THE_RIGHT); + node = child_coord->node; + assert("nikita-3314", node != NULL); + } + + if (node != NULL) { + znode_set_rd_key(node, item_key_by_coord(coord, &key)); + + assert("nikita-3282", check_sibling_list(node)); + /* break sibling links */ + if (ZF_ISSET(node, JNODE_RIGHT_CONNECTED) && node->right) { + ON_DEBUG( + node->right->left_version = atomic_inc_return(&delim_key_version); + node->right_version = atomic_inc_return(&delim_key_version); + ); + + node->right->left = NULL; + node->right = NULL; + } + } + WUNLOCK_DK(tree); + WUNLOCK_TREE(tree); + return 0; +} + + +#define ITEM_TAIL_KILLED 0 +#define ITEM_HEAD_KILLED 1 +#define 
ITEM_KILLED 2 + +/* item_plugin->b.kill_hook + this is called when @count units starting from @from-th one are going to be removed + */ +reiser4_internal int +kill_hook_extent(const coord_t *coord, pos_in_node_t from, pos_in_node_t count, struct carry_kill_data *kdata) +{ + reiser4_extent *ext; + reiser4_block_nr start, length; + reiser4_key min_item_key, max_item_key; + reiser4_key from_key, to_key; + const reiser4_key *pfrom_key, *pto_key; + struct inode *inode; + reiser4_tree *tree; + pgoff_t from_off, to_off, offset, skip; + int retval; + + assert ("zam-811", znode_is_write_locked(coord->node)); + assert("nikita-3315", kdata != NULL); + + item_key_by_coord(coord, &min_item_key); + max_item_key_by_coord(coord, &max_item_key); + + if (kdata->params.from_key) { + pfrom_key = kdata->params.from_key; + pto_key = kdata->params.to_key; + } else { + coord_t dup; + + assert("vs-1549", from == coord->unit_pos); + unit_key_by_coord(coord, &from_key); + pfrom_key = &from_key; + + coord_dup(&dup, coord); + dup.unit_pos = from + count - 1; + max_unit_key_by_coord(&dup, &to_key); + pto_key = &to_key; + } + + if (!keylt(pto_key, &max_item_key)) { + if (!keygt(pfrom_key, &min_item_key)) { + znode *left, *right; + + /* item is to be removed completely */ + assert("nikita-3316", kdata->left != NULL && kdata->right != NULL); + + left = kdata->left->node; + right = kdata->right->node; + + tree = current_tree; + /* we have to do two things: + * + * 1. link left and right formatted neighbors of + * extent being removed, and + * + * 2. update their delimiting keys. + * + * atomicity of these operations is protected by + * taking dk-lock and tree-lock. 
+ */ + /* if neighbors of item being removed are znodes - + * link them */ + UNDER_RW_VOID(tree, tree, + write, link_left_and_right(left, right)); + + WLOCK_DK(tree); + if (left) { + /* update right delimiting key of left + * neighbor of extent item */ + coord_t next; + reiser4_key key; + + coord_dup(&next, coord); + + if (coord_next_item(&next)) + key = *znode_get_rd_key(coord->node); + else + item_key_by_coord(&next, &key); + znode_set_rd_key(left, &key); + } + WUNLOCK_DK(tree); + + from_off = get_key_offset(&min_item_key) >> PAGE_CACHE_SHIFT; + to_off = (get_key_offset(&max_item_key) + 1) >> PAGE_CACHE_SHIFT; + retval = ITEM_KILLED; + } else { + /* tail of item is to be removed */ + from_off = (get_key_offset(pfrom_key) + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT; + to_off = (get_key_offset(&max_item_key) + 1) >> PAGE_CACHE_SHIFT; + retval = ITEM_TAIL_KILLED; + } + } else { + /* head of item is to be removed */ + assert("vs-1571", keyeq(pfrom_key, &min_item_key)); + assert("vs-1572", (get_key_offset(pfrom_key) & (PAGE_CACHE_SIZE - 1)) == 0); + assert("vs-1573", ((get_key_offset(pto_key) + 1) & (PAGE_CACHE_SIZE - 1)) == 0); + + if (kdata->left->node) { + /* update right delimiting key of left neighbor of extent item */ + reiser4_key key; + + key = *pto_key; + set_key_offset(&key, get_key_offset(pto_key) + 1); + + UNDER_RW_VOID(dk, current_tree, write, znode_set_rd_key(kdata->left->node, &key)); + } + + from_off = get_key_offset(pfrom_key) >> PAGE_CACHE_SHIFT; + to_off = (get_key_offset(pto_key) + 1) >> PAGE_CACHE_SHIFT; + retval = ITEM_HEAD_KILLED; + } + + inode = kdata->inode; + assert("vs-1545", inode != NULL); + if (inode != NULL) + /* take care of pages and jnodes corresponding to part of item being killed */ + reiser4_invalidate_pages( + inode->i_mapping, from_off, to_off - from_off, + kdata->params.truncate); + + ext = extent_item(coord) + from; + offset = (get_key_offset(&min_item_key) + extent_size(coord, from)) >> PAGE_CACHE_SHIFT; + + assert("vs-1551", 
from_off >= offset); + assert("vs-1552", from_off - offset <= extent_get_width(ext)); + skip = from_off - offset; + offset = from_off; + + while (offset < to_off) { + length = extent_get_width(ext) - skip; + if (state_of_extent(ext) == HOLE_EXTENT) { + skip = 0; + offset += length; + ext ++; + continue; + } + + if (offset + length > to_off) { + length = to_off - offset; + } + + DQUOT_FREE_BLOCK(inode, length); + + if (state_of_extent(ext) == UNALLOCATED_EXTENT) { + /* some jnodes corresponding to this unallocated extent */ + fake_allocated2free(length, + 0 /* unformatted */); + + skip = 0; + offset += length; + ext ++; + continue; + } + + assert("vs-1218", state_of_extent(ext) == ALLOCATED_EXTENT); + + if (length != 0) { + start = extent_get_start(ext) + skip; + + /* BA_DEFER bit parameter is turned on because blocks which get freed are not safe to be freed + immediately */ + reiser4_dealloc_blocks(&start, &length, 0 /* not used */, + BA_DEFER/* unformatted with defer */); + } + skip = 0; + offset += length; + ext ++; + } + return retval; +} + +/* item_plugin->b.kill_units */ +reiser4_internal int +kill_units_extent(coord_t *coord, pos_in_node_t from, pos_in_node_t to, struct carry_kill_data *kdata, + reiser4_key *smallest_removed, reiser4_key *new_first) +{ + reiser4_extent *ext; + reiser4_key item_key; + pos_in_node_t count; + reiser4_key from_key, to_key; + const reiser4_key *pfrom_key, *pto_key; + loff_t off; + int result; + + assert("vs-1541", ((kdata->params.from_key == NULL && kdata->params.to_key == NULL) || + (kdata->params.from_key != NULL && kdata->params.to_key != NULL))); + + if (kdata->params.from_key) { + pfrom_key = kdata->params.from_key; + pto_key = kdata->params.to_key; + } else { + coord_t dup; + + /* calculate key range of kill */ + assert("vs-1549", from == coord->unit_pos); + unit_key_by_coord(coord, &from_key); + pfrom_key = &from_key; + + coord_dup(&dup, coord); + dup.unit_pos = to; + max_unit_key_by_coord(&dup, &to_key); + pto_key = 
&to_key; + } + + item_key_by_coord(coord, &item_key); + +#if REISER4_DEBUG + { + reiser4_key max_item_key; + + max_item_key_by_coord(coord, &max_item_key); + + if (new_first) { + /* head of item is to be cut */ + assert("vs-1542", keyeq(pfrom_key, &item_key)); + assert("vs-1538", keylt(pto_key, &max_item_key)); + } else { + /* tail of item is to be cut */ + assert("vs-1540", keygt(pfrom_key, &item_key)); + assert("vs-1543", !keylt(pto_key, &max_item_key)); + } + } +#endif + + if (smallest_removed) + *smallest_removed = *pfrom_key; + + if (new_first) { + /* item head is cut. Item key will change. This new key is calculated here */ + assert("vs-1556", (get_key_offset(pto_key) & (PAGE_CACHE_SIZE - 1)) == (PAGE_CACHE_SIZE - 1)); + *new_first = *pto_key; + set_key_offset(new_first, get_key_offset(new_first) + 1); + } + + count = to - from + 1; + result = kill_hook_extent(coord, from, count, kdata); + if (result == ITEM_TAIL_KILLED) { + assert("vs-1553", get_key_offset(pfrom_key) >= get_key_offset(&item_key) + extent_size(coord, from)); + off = get_key_offset(pfrom_key) - (get_key_offset(&item_key) + extent_size(coord, from)); + if (off) { + /* unit @from is to be cut partially. 
Its width decreases */ + ext = extent_item(coord) + from; + extent_set_width(ext, (off + PAGE_CACHE_SIZE - 1) >> PAGE_CACHE_SHIFT); + count --; + } + } else { + __u64 max_to_offset; + __u64 rest; + + assert("vs-1575", result == ITEM_HEAD_KILLED); + assert("", from == 0); + assert("", ((get_key_offset(pto_key) + 1) & (PAGE_CACHE_SIZE - 1)) == 0); + assert("", get_key_offset(pto_key) + 1 > get_key_offset(&item_key) + extent_size(coord, to)); + max_to_offset = get_key_offset(&item_key) + extent_size(coord, to + 1) - 1; + assert("", get_key_offset(pto_key) <= max_to_offset); + + rest = (max_to_offset - get_key_offset(pto_key)) >> PAGE_CACHE_SHIFT; + if (rest) { + /* unit @to is to be cut partially */ + ext = extent_item(coord) + to; + + assert("", extent_get_width(ext) > rest); + + if (state_of_extent(ext) == ALLOCATED_EXTENT) + extent_set_start(ext, extent_get_start(ext) + (extent_get_width(ext) - rest)); + + extent_set_width(ext, rest); + count --; + } + } + return count * sizeof(reiser4_extent); +} + +/* item_plugin->b.cut_units + this is too similar to kill_units_extent */ +reiser4_internal int +cut_units_extent(coord_t *coord, pos_in_node_t from, pos_in_node_t to, struct carry_cut_data *cdata, + reiser4_key *smallest_removed, reiser4_key *new_first) +{ + reiser4_extent *ext; + reiser4_key item_key; + pos_in_node_t count; + reiser4_key from_key, to_key; + const reiser4_key *pfrom_key, *pto_key; + loff_t off; + + assert("vs-1541", ((cdata->params.from_key == NULL && cdata->params.to_key == NULL) || + (cdata->params.from_key != NULL && cdata->params.to_key != NULL))); + + if (cdata->params.from_key) { + pfrom_key = cdata->params.from_key; + pto_key = cdata->params.to_key; + } else { + coord_t dup; + + /* calculate key range of kill */ + coord_dup(&dup, coord); + dup.unit_pos = from; + unit_key_by_coord(&dup, &from_key); + + dup.unit_pos = to; + max_unit_key_by_coord(&dup, &to_key); + + pfrom_key = &from_key; + pto_key = &to_key; + } + + assert("vs-1555", 
(get_key_offset(pfrom_key) & (PAGE_CACHE_SIZE - 1)) == 0); + assert("vs-1556", (get_key_offset(pto_key) & (PAGE_CACHE_SIZE - 1)) == (PAGE_CACHE_SIZE - 1)); + + item_key_by_coord(coord, &item_key); + +#if REISER4_DEBUG + { + reiser4_key max_item_key; + + assert("vs-1584", get_key_locality(pfrom_key) == get_key_locality(&item_key)); + assert("vs-1585", get_key_type(pfrom_key) == get_key_type(&item_key)); + assert("vs-1586", get_key_objectid(pfrom_key) == get_key_objectid(&item_key)); + assert("vs-1587", get_key_ordering(pfrom_key) == get_key_ordering(&item_key)); + + max_item_key_by_coord(coord, &max_item_key); + + if (new_first != NULL) { + /* head of item is to be cut */ + assert("vs-1542", keyeq(pfrom_key, &item_key)); + assert("vs-1538", keylt(pto_key, &max_item_key)); + } else { + /* tail of item is to be cut */ + assert("vs-1540", keygt(pfrom_key, &item_key)); + assert("vs-1543", keyeq(pto_key, &max_item_key)); + } + } +#endif + + if (smallest_removed) + *smallest_removed = *pfrom_key; + + if (new_first) { + /* item head is cut. Item key will change. This new key is calculated here */ + *new_first = *pto_key; + set_key_offset(new_first, get_key_offset(new_first) + 1); + } + + count = to - from + 1; + + assert("vs-1553", get_key_offset(pfrom_key) >= get_key_offset(&item_key) + extent_size(coord, from)); + off = get_key_offset(pfrom_key) - (get_key_offset(&item_key) + extent_size(coord, from)); + if (off) { + /* tail of unit @from is to be cut partially. Its width decreases */ + assert("vs-1582", new_first == NULL); + ext = extent_item(coord) + from; + extent_set_width(ext, off >> PAGE_CACHE_SHIFT); + count --; + } + + assert("vs-1554", get_key_offset(pto_key) <= get_key_offset(&item_key) + extent_size(coord, to + 1) - 1); + off = (get_key_offset(&item_key) + extent_size(coord, to + 1) - 1) - get_key_offset(pto_key); + if (off) { + /* @to_key is smaller than max key of unit @to. Unit @to will not be removed. It gets start increased + and width decreased. 
*/ + assert("vs-1583", (off & (PAGE_CACHE_SIZE - 1)) == 0); + ext = extent_item(coord) + to; + if (state_of_extent(ext) == ALLOCATED_EXTENT) + extent_set_start(ext, extent_get_start(ext) + (extent_get_width(ext) - (off >> PAGE_CACHE_SHIFT))); + + extent_set_width(ext, (off >> PAGE_CACHE_SHIFT)); + count --; + } + return count * sizeof(reiser4_extent); +} + +/* item_plugin->b.unit_key */ +reiser4_internal reiser4_key * +unit_key_extent(const coord_t *coord, reiser4_key *key) +{ + assert("vs-300", coord_is_existing_unit(coord)); + + item_key_by_coord(coord, key); + set_key_offset(key, (get_key_offset(key) + extent_size(coord, coord->unit_pos))); + + return key; +} + +/* item_plugin->b.max_unit_key */ +reiser4_internal reiser4_key * +max_unit_key_extent(const coord_t *coord, reiser4_key *key) +{ + assert("vs-300", coord_is_existing_unit(coord)); + + item_key_by_coord(coord, key); + set_key_offset(key, (get_key_offset(key) + extent_size(coord, coord->unit_pos + 1) - 1)); + return key; +} + +/* item_plugin->b.estimate + item_plugin->b.item_data_by_flow */ + +#if REISER4_DEBUG + +/* item_plugin->b.check + used for debugging, every item should have here the most complete + possible check of the consistency of the item that the inventor can + construct +*/ +int +check_extent(const coord_t *coord /* coord of item to check */ , + const char **error /* where to store error message */ ) +{ + reiser4_extent *ext, *first; + unsigned i, j; + reiser4_block_nr start, width, blk_cnt; + unsigned num_units; + reiser4_tree *tree; + oid_t oid; + reiser4_key key; + coord_t scan; + + assert("vs-933", REISER4_DEBUG); + + if (znode_get_level(coord->node) != TWIG_LEVEL) { + *error = "Extent on the wrong level"; + return -1; + } + if (item_length_by_coord(coord) % sizeof (reiser4_extent) != 0) { + *error = "Wrong item size"; + return -1; + } + ext = first = extent_item(coord); + blk_cnt = reiser4_block_count(reiser4_get_current_sb()); + num_units = coord_num_units(coord); + tree = 
znode_get_tree(coord->node); + item_key_by_coord(coord, &key); + oid = get_key_objectid(&key); + coord_dup(&scan, coord); + + for (i = 0; i < num_units; ++i, ++ext) { + __u64 index; + + scan.unit_pos = i; + index = extent_unit_index(&scan); + +#if 0 + /* check that all jnodes are present for the unallocated + * extent */ + if (state_of_extent(ext) == UNALLOCATED_EXTENT) { + for (j = 0; j < extent_get_width(ext); j ++) { + jnode *node; + + node = jlookup(tree, oid, index + j); + if (node == NULL) { + print_coord("scan", &scan, 0); + *error = "Jnode missing"; + return -1; + } + jput(node); + } + } +#endif + + start = extent_get_start(ext); + if (start < 2) + continue; + /* extent is an allocated one */ + width = extent_get_width(ext); + if (start >= blk_cnt) { + *error = "Start too large"; + return -1; + } + if (start + width > blk_cnt) { + *error = "End too large"; + return -1; + } + /* make sure that this extent does not overlap with other + allocated extents */ + for (j = 0; j < i; j++) { + if (state_of_extent(first + j) != ALLOCATED_EXTENT) + continue; + if (!((extent_get_start(ext) >= extent_get_start(first + j) + extent_get_width(first + j)) + || (extent_get_start(ext) + extent_get_width(ext) <= extent_get_start(first + j)))) { + *error = "Extent overlaps with others"; + return -1; + } + } + + } + + return 0; +} + +#endif /* REISER4_DEBUG */ + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/internal.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/internal.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,398 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Implementation of internal-item plugin methods. 
*/ + +#include "../../forward.h" +#include "../../debug.h" +#include "../../dformat.h" +#include "../../key.h" +#include "../../coord.h" +#include "internal.h" +#include "item.h" +#include "../node/node.h" +#include "../plugin.h" +#include "../../jnode.h" +#include "../../znode.h" +#include "../../tree_walk.h" +#include "../../tree_mod.h" +#include "../../tree.h" +#include "../../super.h" +#include "../../block_alloc.h" + +/* see internal.h for explanation */ + +/* plugin->u.item.b.mergeable */ +reiser4_internal int +mergeable_internal(const coord_t * p1 UNUSED_ARG /* first item */ , + const coord_t * p2 UNUSED_ARG /* second item */ ) +{ + /* internal items are not mergeable */ + return 0; +} + +/* ->lookup() method for internal items */ +reiser4_internal lookup_result +lookup_internal(const reiser4_key * key /* key to look up */ , + lookup_bias bias UNUSED_ARG /* lookup bias */ , + coord_t * coord /* coord of item */ ) +{ + reiser4_key ukey; + + switch (keycmp(unit_key_by_coord(coord, &ukey), key)) { + default: + impossible("", "keycmp()?!"); + case LESS_THAN: + /* FIXME-VS: AFTER_ITEM used to be here. 
But with new coord + item plugin can not be taken using coord set this way */ + assert("vs-681", coord->unit_pos == 0); + coord->between = AFTER_UNIT; + case EQUAL_TO: + return CBK_COORD_FOUND; + case GREATER_THAN: + return CBK_COORD_NOTFOUND; + } +} + +/* return body of internal item at @coord */ +static internal_item_layout * +internal_at(const coord_t * coord /* coord of + * item */ ) +{ + assert("nikita-607", coord != NULL); + assert("nikita-1650", item_plugin_by_coord(coord) == item_plugin_by_id(NODE_POINTER_ID)); + return (internal_item_layout *) item_body_by_coord(coord); +} + +reiser4_internal void +update_internal(const coord_t * coord, const reiser4_block_nr * blocknr) +{ + internal_item_layout *item = internal_at(coord); + assert("nikita-2959", reiser4_blocknr_is_sane(blocknr)); + + cpu_to_dblock(*blocknr, &item->pointer); +} + +/* return child block number stored in the internal item at @coord */ +static reiser4_block_nr +pointer_at(const coord_t * coord /* coord of item */ ) +{ + assert("nikita-608", coord != NULL); + return dblock_to_cpu(&internal_at(coord)->pointer); +} + +/* get znode pointed to by internal @item */ +static znode * +znode_at(const coord_t * item /* coord of item */ , + znode * parent /* parent node */) +{ + return child_znode(item, parent, 1, 0); +} + +/* store pointer from internal item into "block". 
Implementation of + ->down_link() method */ +reiser4_internal void +down_link_internal(const coord_t * coord /* coord of item */ , + const reiser4_key * key UNUSED_ARG /* key to get + * pointer for */ , + reiser4_block_nr * block /* resulting block number */ ) +{ + ON_DEBUG(reiser4_key item_key); + + assert("nikita-609", coord != NULL); + assert("nikita-611", block != NULL); + assert("nikita-612", (key == NULL) || + /* twig horrors */ + (znode_get_level(coord->node) == TWIG_LEVEL) || keyle(item_key_by_coord(coord, &item_key), key)); + + *block = pointer_at(coord); + assert("nikita-2960", reiser4_blocknr_is_sane(block)); +} + +/* Get the child's block number, or 0 if the block is unallocated. */ +reiser4_internal int +utmost_child_real_block_internal(const coord_t * coord, sideof side UNUSED_ARG, reiser4_block_nr * block) +{ + assert("jmacd-2059", coord != NULL); + + *block = pointer_at(coord); + assert("nikita-2961", reiser4_blocknr_is_sane(block)); + + if (blocknr_is_fake(block)) { + *block = 0; + } + + return 0; +} + +/* Return the child. 
*/ +reiser4_internal int +utmost_child_internal(const coord_t * coord, sideof side UNUSED_ARG, jnode ** childp) +{ + reiser4_block_nr block = pointer_at(coord); + znode *child; + + assert("jmacd-2059", childp != NULL); + assert("nikita-2962", reiser4_blocknr_is_sane(&block)); + + child = zlook(znode_get_tree(coord->node), &block); + + if (IS_ERR(child)) { + return PTR_ERR(child); + } + + *childp = ZJNODE(child); + + return 0; +} + +static void check_link(znode *left, znode *right) +{ + znode *scan; + + for (scan = left; scan != right; scan = scan->right) { + if (ZF_ISSET(scan, JNODE_RIP)) + break; + if (znode_is_right_connected(scan) && scan->right != NULL) { + if (ZF_ISSET(scan->right, JNODE_RIP)) + break; + assert("nikita-3285", + znode_is_left_connected(scan->right)); + assert("nikita-3265", + ergo(scan != left, + ZF_ISSET(scan, JNODE_HEARD_BANSHEE))); + assert("nikita-3284", scan->right->left == scan); + } else + break; + } +} + +reiser4_internal int check__internal(const coord_t * coord, const char **error) +{ + reiser4_block_nr blk; + znode *child; + coord_t cpy; + + blk = pointer_at(coord); + if (!reiser4_blocknr_is_sane(&blk)) { + *error = "Invalid pointer"; + return -1; + } + coord_dup(&cpy, coord); + child = znode_at(&cpy, cpy.node); + if (child != NULL) { + znode *left_child; + znode *right_child; + + left_child = right_child = NULL; + + assert("nikita-3256", znode_invariant(child)); + if (coord_prev_item(&cpy) == 0 && item_is_internal(&cpy)) { + left_child = znode_at(&cpy, cpy.node); + RLOCK_TREE(znode_get_tree(child)); + if (left_child != NULL) + check_link(left_child, child); + RUNLOCK_TREE(znode_get_tree(child)); + if (left_child != NULL) + zput(left_child); + } + coord_dup(&cpy, coord); + if (coord_next_item(&cpy) == 0 && item_is_internal(&cpy)) { + right_child = znode_at(&cpy, cpy.node); + RLOCK_TREE(znode_get_tree(child)); + if (right_child != NULL) + check_link(child, right_child); + RUNLOCK_TREE(znode_get_tree(child)); + if (right_child != NULL) 
+ zput(right_child); + } + zput(child); + } + return 0; +} + +#if REISER4_DEBUG_OUTPUT +/* debugging aid: print human readable information about internal item at + @coord */ +reiser4_internal void +print_internal(const char *prefix /* prefix to print */ , + coord_t * coord /* coord of item to print */ ) +{ + reiser4_block_nr blk; + + blk = pointer_at(coord); + assert("nikita-2963", reiser4_blocknr_is_sane(&blk)); + printk("%s: internal: %s\n", prefix, sprint_address(&blk)); +} +#endif + +/* return true only if this item really points to "block" */ +/* Audited by: green(2002.06.14) */ +reiser4_internal int +has_pointer_to_internal(const coord_t * coord /* coord of item */ , + const reiser4_block_nr * block /* block number to + * check */ ) +{ + assert("nikita-613", coord != NULL); + assert("nikita-614", block != NULL); + + return pointer_at(coord) == *block; +} + +/* hook called by ->create_item() method of node plugin after new internal + item was just created. + + This is point where pointer to new node is inserted into tree. Initialize + parent pointer in child znode, insert child into sibling list and slum. 
+ +*/ +reiser4_internal int +create_hook_internal(const coord_t * item /* coord of item */ , + void *arg /* child's left neighbor, if any */ ) +{ + znode *child; + + assert("nikita-1252", item != NULL); + assert("nikita-1253", item->node != NULL); + assert("nikita-1181", znode_get_level(item->node) > LEAF_LEVEL); + assert("nikita-1450", item->unit_pos == 0); + + child = znode_at(item, item->node); + if (!IS_ERR(child)) { + znode *left; + int result = 0; + reiser4_tree *tree; + + left = arg; + tree = znode_get_tree(item->node); + WLOCK_TREE(tree); + WLOCK_DK(tree); + assert("nikita-1400", (child->in_parent.node == NULL) || (znode_above_root(child->in_parent.node))); + ++ item->node->c_count; + coord_to_parent_coord(item, &child->in_parent); + sibling_list_insert_nolock(child, left); + + assert("nikita-3297", ZF_ISSET(child, JNODE_ORPHAN)); + ZF_CLR(child, JNODE_ORPHAN); + + if ((left != NULL) && !keyeq(znode_get_rd_key(left), + znode_get_rd_key(child))) { + znode_set_rd_key(child, znode_get_rd_key(left)); + } + WUNLOCK_DK(tree); + WUNLOCK_TREE(tree); + zput(child); + return result; + } else + return PTR_ERR(child); +} + +/* hook called by ->cut_and_kill() method of node plugin just before internal + item is removed. + + This is point where empty node is removed from the tree. Clear parent + pointer in child, and mark node for pending deletion. + + Node will be actually deleted later and in several installments: + + . when last lock on this node will be released, node will be removed from + the sibling list and its lock will be invalidated + + . when last reference to this node will be dropped, bitmap will be updated + and node will be actually removed from the memory.
+ + +*/ +reiser4_internal int +kill_hook_internal(const coord_t * item /* coord of item */ , + pos_in_node_t from UNUSED_ARG /* start unit */ , + pos_in_node_t count UNUSED_ARG /* stop unit */, + struct carry_kill_data *p UNUSED_ARG) +{ + znode *child; + + assert("nikita-1222", item != NULL); + assert("nikita-1224", from == 0); + assert("nikita-1225", count == 1); + + child = znode_at(item, item->node); + if (IS_ERR(child)) + return PTR_ERR(child); + else if (node_is_empty(child)) { + reiser4_tree *tree; + + assert("nikita-1397", znode_is_write_locked(child)); + assert("nikita-1398", child->c_count == 0); + assert("nikita-2546", ZF_ISSET(child, JNODE_HEARD_BANSHEE)); + + tree = znode_get_tree(item->node); + WLOCK_TREE(tree); + init_parent_coord(&child->in_parent, NULL); + -- item->node->c_count; + WUNLOCK_TREE(tree); + zput(child); + return 0; + } else { + warning("nikita-1223", "Cowardly refuse to remove link to non-empty node"); + print_znode("parent", item->node); + print_znode("child", child); + zput(child); + return RETERR(-EIO); + } +} + +/* hook called by ->shift() node plugin method when internal item was just + moved from one node to another.
+ + Update parent pointer in child and c_counts in old and new parent + +*/ +reiser4_internal int +shift_hook_internal(const coord_t * item /* coord of item */ , + unsigned from UNUSED_ARG /* start unit */ , + unsigned count UNUSED_ARG /* stop unit */ , + znode * old_node /* old parent */ ) +{ + znode *child; + znode *new_node; + reiser4_tree *tree; + + assert("nikita-1276", item != NULL); + assert("nikita-1277", from == 0); + assert("nikita-1278", count == 1); + assert("nikita-1451", item->unit_pos == 0); + + new_node = item->node; + assert("nikita-2132", new_node != old_node); + tree = znode_get_tree(item->node); + child = child_znode(item, old_node, 1, 0); + if (child == NULL) + return 0; + if (!IS_ERR(child)) { + WLOCK_TREE(tree); + ++ new_node->c_count; + assert("nikita-1395", znode_parent(child) == old_node); + assert("nikita-1396", old_node->c_count > 0); + coord_to_parent_coord(item, &child->in_parent); + assert("nikita-1781", znode_parent(child) == new_node); + assert("nikita-1782", check_tree_pointer(item, child) == NS_FOUND); + -- old_node->c_count; + WUNLOCK_TREE(tree); + zput(child); + return 0; + } else + return PTR_ERR(child); +} + +/* plugin->u.item.b.max_key_inside - not defined */ + +/* plugin->u.item.b.nr_units - item.c:single_unit */ + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/internal.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/internal.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,51 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ +/* Internal item contains down-link to the child of the internal/twig + node in a tree. It is internal items that are actually used during + tree traversal. 
*/ + +#if !defined( __FS_REISER4_PLUGIN_ITEM_INTERNAL_H__ ) +#define __FS_REISER4_PLUGIN_ITEM_INTERNAL_H__ + +#include "../../forward.h" +#include "../../dformat.h" + +/* on-disk layout of internal item */ +typedef struct internal_item_layout { + /* 0 */ reiser4_dblock_nr pointer; + /* 4 */ +} internal_item_layout; + +struct cut_list; + +int mergeable_internal(const coord_t * p1, const coord_t * p2); +lookup_result lookup_internal(const reiser4_key * key, lookup_bias bias, coord_t * coord); +/* store pointer from internal item into "block". Implementation of + ->down_link() method */ +extern void down_link_internal(const coord_t * coord, const reiser4_key * key, reiser4_block_nr * block); +extern int has_pointer_to_internal(const coord_t * coord, const reiser4_block_nr * block); +extern int create_hook_internal(const coord_t * item, void *arg); +extern int kill_hook_internal(const coord_t * item, pos_in_node_t from, pos_in_node_t count, + struct carry_kill_data *); +extern int shift_hook_internal(const coord_t * item, unsigned from, unsigned count, znode * old_node); +extern void print_internal(const char *prefix, coord_t * coord); + +extern int utmost_child_internal(const coord_t * coord, sideof side, jnode ** child); +int utmost_child_real_block_internal(const coord_t * coord, sideof side, reiser4_block_nr * block); + +extern void update_internal(const coord_t * coord, + const reiser4_block_nr * blocknr); +/* FIXME: reiserfs has check_internal */ +extern int check__internal(const coord_t * coord, const char **error); + +/* __FS_REISER4_PLUGIN_ITEM_INTERNAL_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/item.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/item.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,760 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* definition of item plugins. */ + +#include "../../forward.h" +#include "../../debug.h" +#include "../../key.h" +#include "../../coord.h" +#include "../plugin_header.h" +#include "sde.h" +#include "internal.h" +#include "item.h" +#include "static_stat.h" +#include "../plugin.h" +#include "../cryptcompress.h" +#include "../../znode.h" +#include "../../tree.h" +#include "../../context.h" +#include "ctail.h" + +/* return pointer to item body */ +reiser4_internal void +item_body_by_coord_hard(coord_t * coord /* coord to query */ ) +{ + assert("nikita-324", coord != NULL); + assert("nikita-325", coord->node != NULL); + assert("nikita-326", znode_is_loaded(coord->node)); + assert("nikita-3200", coord->offset == INVALID_OFFSET); + + coord->offset = node_plugin_by_node(coord->node)->item_by_coord(coord) - zdata(coord->node); + ON_DEBUG(coord->body_v = coord->node->times_locked); +} + +reiser4_internal void * +item_body_by_coord_easy(const coord_t * coord /* coord to query */ ) +{ + return zdata(coord->node) + coord->offset; +} + +#if REISER4_DEBUG + +reiser4_internal int +item_body_is_valid(const coord_t * coord) +{ + return + coord->offset == + node_plugin_by_node(coord->node)->item_by_coord(coord) - zdata(coord->node); +} + +#endif + +/* return length of item at @coord */ +reiser4_internal pos_in_node_t +item_length_by_coord(const coord_t * coord /* coord to query */ ) +{ + int len; + + assert("nikita-327", coord != NULL); + assert("nikita-328", coord->node != NULL); + assert("nikita-329", znode_is_loaded(coord->node)); + + len = 
node_plugin_by_node(coord->node)->length_by_coord(coord); + return len; +} + +reiser4_internal void +obtain_item_plugin(const coord_t * coord) +{ + assert("nikita-330", coord != NULL); + assert("nikita-331", coord->node != NULL); + assert("nikita-332", znode_is_loaded(coord->node)); + + coord_set_iplug((coord_t *) coord, + node_plugin_by_node(coord->node)->plugin_by_coord(coord)); + assert("nikita-2479", + coord_iplug(coord) == node_plugin_by_node(coord->node)->plugin_by_coord(coord)); +} + +/* return type of item at @coord */ +reiser4_internal item_type_id +item_type_by_coord(const coord_t * coord /* coord to query */ ) +{ + assert("nikita-333", coord != NULL); + assert("nikita-334", coord->node != NULL); + assert("nikita-335", znode_is_loaded(coord->node)); + assert("nikita-336", item_plugin_by_coord(coord) != NULL); + + return item_plugin_by_coord(coord)->b.item_type; +} + +/* return id of item */ +/* Audited by: green(2002.06.15) */ +reiser4_internal item_id +item_id_by_coord(const coord_t * coord /* coord to query */ ) +{ + assert("vs-539", coord != NULL); + assert("vs-538", coord->node != NULL); + assert("vs-537", znode_is_loaded(coord->node)); + assert("vs-536", item_plugin_by_coord(coord) != NULL); + assert("vs-540", item_id_by_plugin(item_plugin_by_coord(coord)) < LAST_ITEM_ID); + + return item_id_by_plugin(item_plugin_by_coord(coord)); +} + +/* return key of item at @coord */ +/* Audited by: green(2002.06.15) */ +reiser4_internal reiser4_key * +item_key_by_coord(const coord_t * coord /* coord to query */ , + reiser4_key * key /* result */ ) +{ + assert("nikita-338", coord != NULL); + assert("nikita-339", coord->node != NULL); + assert("nikita-340", znode_is_loaded(coord->node)); + + return node_plugin_by_node(coord->node)->key_at(coord, key); +} + +/* this returns max key in the item */ +reiser4_internal reiser4_key * +max_item_key_by_coord(const coord_t *coord /* coord to query */ , + reiser4_key *key /* result */ ) +{ + coord_t last; + + 
assert("nikita-338", coord != NULL); + assert("nikita-339", coord->node != NULL); + assert("nikita-340", znode_is_loaded(coord->node)); + + /* make coord point to the item's last unit */ + coord_dup(&last, coord); + last.unit_pos = coord_num_units(&last) - 1; + assert("vs-1560", coord_is_existing_unit(&last)); + + max_unit_key_by_coord(&last, key); + return key; +} + +/* return key of unit at @coord */ +reiser4_internal reiser4_key * +unit_key_by_coord(const coord_t * coord /* coord to query */ , + reiser4_key * key /* result */ ) +{ + assert("nikita-772", coord != NULL); + assert("nikita-774", coord->node != NULL); + assert("nikita-775", znode_is_loaded(coord->node)); + + if (item_plugin_by_coord(coord)->b.unit_key != NULL) + return item_plugin_by_coord(coord)->b.unit_key(coord, key); + else + return item_key_by_coord(coord, key); +} + +/* return the biggest key contained in the unit @coord */ +reiser4_internal reiser4_key * +max_unit_key_by_coord(const coord_t * coord /* coord to query */ , + reiser4_key * key /* result */ ) +{ + assert("nikita-772", coord != NULL); + assert("nikita-774", coord->node != NULL); + assert("nikita-775", znode_is_loaded(coord->node)); + + if (item_plugin_by_coord(coord)->b.max_unit_key != NULL) + return item_plugin_by_coord(coord)->b.max_unit_key(coord, key); + else + return unit_key_by_coord(coord, key); +} + + +/* ->max_key_inside() method for items consisting of exactly one key (like + stat-data) */ +static reiser4_key * +max_key_inside_single_key(const coord_t * coord /* coord of item */ , + reiser4_key * result /* resulting key */) +{ + assert("nikita-604", coord != NULL); + + /* coord -> key is starting key of this item and it has to be already + filled in */ + return unit_key_by_coord(coord, result); +} + +/* ->nr_units() method for items consisting of exactly one unit always */ +static pos_in_node_t +nr_units_single_unit(const coord_t * coord UNUSED_ARG /* coord of item */ ) +{ + return 1; +} + +static int +paste_no_paste(coord_t
* coord UNUSED_ARG, + reiser4_item_data * data UNUSED_ARG, + carry_plugin_info * info UNUSED_ARG) +{ + return 0; +} + +/* default ->fast_paste() method */ +static int +agree_to_fast_op(const coord_t * coord UNUSED_ARG /* coord of item */ ) +{ + return 1; +} + +reiser4_internal int +item_can_contain_key(const coord_t * item /* coord of item */ , + const reiser4_key * key /* key to check */ , + const reiser4_item_data * data /* parameters of item + * being created */ ) +{ + item_plugin *iplug; + reiser4_key min_key_in_item; + reiser4_key max_key_in_item; + + assert("nikita-1658", item != NULL); + assert("nikita-1659", key != NULL); + + iplug = item_plugin_by_coord(item); + if (iplug->b.can_contain_key != NULL) + return iplug->b.can_contain_key(item, key, data); + else { + assert("nikita-1681", iplug->b.max_key_inside != NULL); + item_key_by_coord(item, &min_key_in_item); + iplug->b.max_key_inside(item, &max_key_in_item); + + /* can contain key if + min_key_in_item <= key && + key <= max_key_in_item + */ + return keyle(&min_key_in_item, key) && keyle(key, &max_key_in_item); + } +} + +/* mergeable method for non mergeable items */ +static int +not_mergeable(const coord_t * i1 UNUSED_ARG, + const coord_t * i2 UNUSED_ARG) +{ + return 0; +} + +/* return 0 if @item1 and @item2 are not mergeable, !0 - otherwise */ +reiser4_internal int +are_items_mergeable(const coord_t * i1 /* coord of first item */ , + const coord_t * i2 /* coord of second item */ ) +{ + item_plugin *iplug; + reiser4_key k1; + reiser4_key k2; + + assert("nikita-1336", i1 != NULL); + assert("nikita-1337", i2 != NULL); + + iplug = item_plugin_by_coord(i1); + assert("nikita-1338", iplug != NULL); + + /* NOTE-NIKITA are_items_mergeable() is also called by assertions in + shifting code when nodes are in "suspended" state. 
*/ + assert("nikita-1663", keyle(item_key_by_coord(i1, &k1), item_key_by_coord(i2, &k2))); + + if (iplug->b.mergeable != NULL) { + return iplug->b.mergeable(i1, i2); + } else if (iplug->b.max_key_inside != NULL) { + iplug->b.max_key_inside(i1, &k1); + item_key_by_coord(i2, &k2); + + /* mergeable if ->max_key_inside() >= key of i2; */ + return keyge(iplug->b.max_key_inside(i1, &k1), item_key_by_coord(i2, &k2)); + } else { + item_key_by_coord(i1, &k1); + item_key_by_coord(i2, &k2); + + return + (get_key_locality(&k1) == get_key_locality(&k2)) && + (get_key_objectid(&k1) == get_key_objectid(&k2)) && (iplug == item_plugin_by_coord(i2)); + } +} + +reiser4_internal int +item_is_extent(const coord_t * item) +{ + assert("vs-482", coord_is_existing_item(item)); + return item_id_by_coord(item) == EXTENT_POINTER_ID; +} + +reiser4_internal int +item_is_tail(const coord_t * item) +{ + assert("vs-482", coord_is_existing_item(item)); + return item_id_by_coord(item) == FORMATTING_ID; +} + +reiser4_internal int +item_is_statdata(const coord_t * item) +{ + assert("vs-516", coord_is_existing_item(item)); + return item_type_by_coord(item) == STAT_DATA_ITEM_TYPE; +} + +reiser4_internal int +item_is_ctail(const coord_t * item) +{ + assert("edward-xx", coord_is_existing_item(item)); + return item_id_by_coord(item) == CTAIL_ID; +} + +static int +change_item(struct inode * inode, reiser4_plugin * plugin) +{ + /* cannot change constituent item (sd, or dir_item) */ + return RETERR(-EINVAL); +} + +static reiser4_plugin_ops item_plugin_ops = { + .init = NULL, + .load = NULL, + .save_len = NULL, + .save = NULL, + .change = change_item +}; + + +item_plugin item_plugins[LAST_ITEM_ID] = { + [STATIC_STAT_DATA_ID] = { + .h = { + .type_id = REISER4_ITEM_PLUGIN_TYPE, + .id = STATIC_STAT_DATA_ID, + .pops = &item_plugin_ops, + .label = "sd", + .desc = "stat-data", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .b = { + .item_type = STAT_DATA_ITEM_TYPE, + .max_key_inside = max_key_inside_single_key, + 
.can_contain_key = NULL, + .mergeable = not_mergeable, + .nr_units = nr_units_single_unit, + .lookup = NULL, + .init = NULL, + .paste = paste_no_paste, + .fast_paste = NULL, + .can_shift = NULL, + .copy_units = NULL, + .create_hook = NULL, + .kill_hook = NULL, + .shift_hook = NULL, + .cut_units = NULL, + .kill_units = NULL, + .unit_key = NULL, + .max_unit_key = NULL, + .estimate = NULL, + .item_data_by_flow = NULL, +#if REISER4_DEBUG_OUTPUT + .print = print_sd, + .item_stat = item_stat_static_sd, +#endif +#if REISER4_DEBUG + .check = NULL +#endif + }, + .f = { + .utmost_child = NULL, + .utmost_child_real_block = NULL, + .update = NULL, + .scan = NULL, + .convert = NULL + }, + .s = { + .sd = { + .init_inode = init_inode_static_sd, + .save_len = save_len_static_sd, + .save = save_static_sd + } + } + }, + [SIMPLE_DIR_ENTRY_ID] = { + .h = { + .type_id = REISER4_ITEM_PLUGIN_TYPE, + .id = SIMPLE_DIR_ENTRY_ID, + .pops = &item_plugin_ops, + .label = "de", + .desc = "directory entry", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .b = { + .item_type = DIR_ENTRY_ITEM_TYPE, + .max_key_inside = max_key_inside_single_key, + .can_contain_key = NULL, + .mergeable = NULL, + .nr_units = nr_units_single_unit, + .lookup = NULL, + .init = NULL, + .paste = NULL, + .fast_paste = NULL, + .can_shift = NULL, + .copy_units = NULL, + .create_hook = NULL, + .kill_hook = NULL, + .shift_hook = NULL, + .cut_units = NULL, + .kill_units = NULL, + .unit_key = NULL, + .max_unit_key = NULL, + .estimate = NULL, + .item_data_by_flow = NULL, +#if REISER4_DEBUG_OUTPUT + .print = print_de, + .item_stat = NULL, +#endif +#if REISER4_DEBUG + .check = NULL +#endif + }, + .f = { + .utmost_child = NULL, + .utmost_child_real_block = NULL, + .update = NULL, + .scan = NULL, + .convert = NULL + }, + .s = { + .dir = { + .extract_key = extract_key_de, + .update_key = update_key_de, + .extract_name = extract_name_de, + .extract_file_type = extract_file_type_de, + .add_entry = add_entry_de, + .rem_entry = rem_entry_de, 
+ .max_name_len = max_name_len_de + } + } + }, + [COMPOUND_DIR_ID] = { + .h = { + .type_id = REISER4_ITEM_PLUGIN_TYPE, + .id = COMPOUND_DIR_ID, + .pops = &item_plugin_ops, + .label = "cde", + .desc = "compressed directory entry", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .b = { + .item_type = DIR_ENTRY_ITEM_TYPE, + .max_key_inside = max_key_inside_cde, + .can_contain_key = can_contain_key_cde, + .mergeable = mergeable_cde, + .nr_units = nr_units_cde, + .lookup = lookup_cde, + .init = init_cde, + .paste = paste_cde, + .fast_paste = agree_to_fast_op, + .can_shift = can_shift_cde, + .copy_units = copy_units_cde, + .create_hook = NULL, + .kill_hook = NULL, + .shift_hook = NULL, + .cut_units = cut_units_cde, + .kill_units = kill_units_cde, + .unit_key = unit_key_cde, + .max_unit_key = unit_key_cde, + .estimate = estimate_cde, + .item_data_by_flow = NULL +#if REISER4_DEBUG_OUTPUT + , .print = print_cde, + .item_stat = NULL +#endif +#if REISER4_DEBUG + , .check = check_cde +#endif + }, + .f = { + .utmost_child = NULL, + .utmost_child_real_block = NULL, + .update = NULL, + .scan = NULL, + .convert = NULL + }, + .s = { + .dir = { + .extract_key = extract_key_cde, + .update_key = update_key_cde, + .extract_name = extract_name_cde, + .extract_file_type = extract_file_type_de, + .add_entry = add_entry_cde, + .rem_entry = rem_entry_cde, + .max_name_len = max_name_len_cde + } + } + }, + [NODE_POINTER_ID] = { + .h = { + .type_id = REISER4_ITEM_PLUGIN_TYPE, + .id = NODE_POINTER_ID, + .pops = NULL, + .label = "internal", + .desc = "internal item", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .b = { + .item_type = INTERNAL_ITEM_TYPE, + .max_key_inside = NULL, + .can_contain_key = NULL, + .mergeable = mergeable_internal, + .nr_units = nr_units_single_unit, + .lookup = lookup_internal, + .init = NULL, + .paste = NULL, + .fast_paste = NULL, + .can_shift = NULL, + .copy_units = NULL, + .create_hook = create_hook_internal, + .kill_hook = kill_hook_internal, + .shift_hook = 
shift_hook_internal, + .cut_units = NULL, + .kill_units = NULL, + .unit_key = NULL, + .max_unit_key = NULL, + .estimate = NULL, + .item_data_by_flow = NULL +#if REISER4_DEBUG_OUTPUT + , .print = print_internal, + .item_stat = NULL +#endif +#if REISER4_DEBUG + , .check = check__internal +#endif + }, + .f = { + .utmost_child = utmost_child_internal, + .utmost_child_real_block = utmost_child_real_block_internal, + .update = update_internal, + .scan = NULL, + .convert = NULL + }, + .s = { + .internal = { + .down_link = down_link_internal, + .has_pointer_to = has_pointer_to_internal + } + } + }, + [EXTENT_POINTER_ID] = { + .h = { + .type_id = REISER4_ITEM_PLUGIN_TYPE, + .id = EXTENT_POINTER_ID, + .pops = NULL, + .label = "extent", + .desc = "extent item", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .b = { + .item_type = UNIX_FILE_METADATA_ITEM_TYPE, + .max_key_inside = max_key_inside_extent, + .can_contain_key = can_contain_key_extent, + .mergeable = mergeable_extent, + .nr_units = nr_units_extent, + .lookup = lookup_extent, + .init = NULL, + .paste = paste_extent, + .fast_paste = agree_to_fast_op, + .can_shift = can_shift_extent, + .create_hook = create_hook_extent, + .copy_units = copy_units_extent, + .kill_hook = kill_hook_extent, + .shift_hook = NULL, + .cut_units = cut_units_extent, + .kill_units = kill_units_extent, + .unit_key = unit_key_extent, + .max_unit_key = max_unit_key_extent, + .estimate = NULL, + .item_data_by_flow = NULL, +#if REISER4_DEBUG + .check = check_extent +#endif + }, + .f = { + .utmost_child = utmost_child_extent, + .utmost_child_real_block = utmost_child_real_block_extent, + .update = NULL, + .scan = scan_extent, + .convert = NULL, + .key_by_offset = key_by_offset_extent + }, + .s = { + .file = { + .write = write_extent, + .read = read_extent, + .readpage = readpage_extent, + .capture = capture_extent, + .get_block = get_block_address_extent, + .readpages = readpages_extent, + .append_key = append_key_extent, + .init_coord_extension = 
init_coord_extension_extent + } + } + }, + [FORMATTING_ID] = { + .h = { + .type_id = REISER4_ITEM_PLUGIN_TYPE, + .id = FORMATTING_ID, + .pops = NULL, + .label = "body", + .desc = "body (or tail?) item", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .b = { + .item_type = UNIX_FILE_METADATA_ITEM_TYPE, + .max_key_inside = max_key_inside_tail, + .can_contain_key = can_contain_key_tail, + .mergeable = mergeable_tail, + .nr_units = nr_units_tail, + .lookup = lookup_tail, + .init = NULL, + .paste = paste_tail, + .fast_paste = agree_to_fast_op, + .can_shift = can_shift_tail, + .create_hook = NULL, + .copy_units = copy_units_tail, + .kill_hook = kill_hook_tail, + .shift_hook = NULL, + .cut_units = cut_units_tail, + .kill_units = kill_units_tail, + .unit_key = unit_key_tail, + .max_unit_key = unit_key_tail, + .estimate = NULL, + .item_data_by_flow = NULL, +#if REISER4_DEBUG + .check = NULL +#endif + }, + .f = { + .utmost_child = NULL, + .utmost_child_real_block = NULL, + .update = NULL, + .scan = NULL, + .convert = NULL + }, + .s = { + .file = { + .write = write_tail, + .read = read_tail, + .readpage = readpage_tail, + .capture = NULL, + .get_block = NULL, + .readpages = NULL, + .append_key = append_key_tail, + .init_coord_extension = init_coord_extension_tail + } + } + }, + [CTAIL_ID] = { + .h = { + .type_id = REISER4_ITEM_PLUGIN_TYPE, + .id = CTAIL_ID, + .pops = NULL, + .label = "ctail", + .desc = "cryptcompress tail item", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .b = { + .item_type = UNIX_FILE_METADATA_ITEM_TYPE, + .max_key_inside = max_key_inside_tail, + .can_contain_key = can_contain_key_ctail, + .mergeable = mergeable_ctail, + .nr_units = nr_units_ctail, + .lookup = NULL, + .init = init_ctail, + .paste = paste_ctail, + .fast_paste = agree_to_fast_op, + .can_shift = can_shift_ctail, + .create_hook = create_hook_ctail, + .copy_units = copy_units_ctail, + .kill_hook = kill_hook_ctail, + .shift_hook = shift_hook_ctail, + .cut_units = cut_units_ctail, + .kill_units = 
kill_units_ctail, + .unit_key = unit_key_tail, + .max_unit_key = unit_key_tail, + .estimate = estimate_ctail, + .item_data_by_flow = NULL +#if REISER4_DEBUG_OUTPUT + , .print = print_ctail, + .item_stat = NULL +#endif +#if REISER4_DEBUG + , .check = check_ctail +#endif + }, + .f = { + .utmost_child = utmost_child_ctail, + /* FIXME-EDWARD: write this */ + .utmost_child_real_block = NULL, + .update = NULL, + .scan = scan_ctail, + .convert = convert_ctail + }, + .s = { + .file = { + .write = NULL, + .read = read_ctail, + .readpage = readpage_ctail, + .capture = NULL, + .get_block = get_block_address_tail, + .readpages = readpages_ctail, + .append_key = append_key_ctail, + .init_coord_extension = init_coord_extension_tail + } + } + }, + [BLACK_BOX_ID] = { + .h = { + .type_id = REISER4_ITEM_PLUGIN_TYPE, + .id = BLACK_BOX_ID, + .pops = NULL, + .label = "blackbox", + .desc = "black box item", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .b = { + .item_type = OTHER_ITEM_TYPE, + .max_key_inside = NULL, + .can_contain_key = NULL, + .mergeable = not_mergeable, + .nr_units = nr_units_single_unit, + /* no need for ->lookup method */ + .lookup = NULL, + .init = NULL, + .paste = NULL, + .fast_paste = NULL, + .can_shift = NULL, + .copy_units = NULL, + .create_hook = NULL, + .kill_hook = NULL, + .shift_hook = NULL, + .cut_units = NULL, + .kill_units = NULL, + .unit_key = NULL, + .max_unit_key = NULL, + .estimate = NULL, + .item_data_by_flow = NULL, +#if REISER4_DEBUG_OUTPUT + .print = NULL, + .item_stat = NULL, +#endif +#if REISER4_DEBUG + .check = NULL +#endif + } + } +}; + +/* Make Linus happy.
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/item.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/item.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,387 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* first read balance.c comments before reading this */ + +/* An item_plugin implements all of the operations required for + balancing that are item specific. */ + +/* an item plugin also implements other operations that are specific to that + item. These go into the item specific operations portion of the item + handler, and all of the item specific portions of the item handler are put + into a union. */ + +#if !defined( __REISER4_ITEM_H__ ) +#define __REISER4_ITEM_H__ + +#include "../../forward.h" +#include "../plugin_header.h" +#include "../../dformat.h" +#include "../../seal.h" +#include "../../plugin/file/file.h" + +#include /* for struct file, struct inode */ +#include /* for struct page */ +#include /* for struct dentry */ + +typedef enum { + STAT_DATA_ITEM_TYPE, + DIR_ENTRY_ITEM_TYPE, + INTERNAL_ITEM_TYPE, + UNIX_FILE_METADATA_ITEM_TYPE, + OTHER_ITEM_TYPE +} item_type_id; + + +/* this is the part of each item plugin that all items are expected to + support or at least explicitly fail to support by setting the + pointer to null. */ +typedef struct { + item_type_id item_type; + + /* operations called by balancing + + It is interesting to consider that some of these item + operations could be given sources or targets that are not + really items in nodes. This could be ok/useful. 
+ + */ + /* maximal key that can _possibly_ be occupied by this item + + When inserting, and the node ->lookup() method (called by + coord_by_key()) reaches an item after binary search, + the ->max_key_inside() item plugin method is used to determine + whether new item should be pasted into existing item + (new_key<=max_key_inside()) or new item has to be created + (new_key>max_key_inside()). + + For items that occupy exactly one key (like stat-data) + this method should return this key. For items that can + grow indefinitely (extent, directory item) this should + return max_key(). + + For example, for an extent with the key + + (LOCALITY,4,OBJID,STARTING-OFFSET), and length BLK blocks, + + ->max_key_inside is (LOCALITY,4,OBJID,0xffffffffffffffff). + */ + reiser4_key *(*max_key_inside) (const coord_t *, reiser4_key *); + + /* true if item @coord can merge data at @key. */ + int (*can_contain_key) (const coord_t *, const reiser4_key *, const reiser4_item_data *); + /* mergeable() - check items for mergeability + + Optional method. Returns true if two items can be merged. + + */ + int (*mergeable) (const coord_t *, const coord_t *); + + /* number of atomic things in an item */ + pos_in_node_t (*nr_units) (const coord_t *); + + /* search within item for a unit within the item, and return a + pointer to it. This can be used to calculate how many + bytes to shrink an item if you use pointer arithmetic and + compare to the start of the item body, provided the item's data + are contiguous in the node. If the item's data are not + contiguous in the node, all sorts of other things are probably + going to break as well.
*/ + lookup_result(*lookup) (const reiser4_key *, lookup_bias, coord_t *); + /* method called by node_plugin->create_item() to initialise new + item */ + int (*init) (coord_t * target, coord_t * from, reiser4_item_data * data); + /* method called (e.g., by resize_item()) to place new data into + item when it grows */ + int (*paste) (coord_t *, reiser4_item_data *, carry_plugin_info *); + /* return true if paste into @coord is allowed to skip + carry. That is, if such paste would not require any changes + at the parent level + */ + int (*fast_paste) (const coord_t *); + /* how many but not more than @want units of @source can be + shifted into @target node. If pend == append - we try to + append last item of @target by first units of @source. If + pend == prepend - we try to "prepend" first item in @target + by last units of @source. @target node has @free_space + bytes of free space. Total size of those units is returned + via @size. + + @target is not NULL if shifting to the mergeable item and + NULL if a new item will be created during shifting. + */ + int (*can_shift) (unsigned free_space, coord_t *, + znode *, shift_direction, unsigned *size, unsigned want); + + /* starting from the @from-th unit of item @source append or + prepend @count units to @target. @target has been already + expanded by @free_space bytes. That must be exactly what is + needed for those items in @target. If @where_is_free_space + == SHIFT_LEFT - free space is at the end of @target item, + otherwise - it is in the beginning of it.
*/ + void (*copy_units) (coord_t *, coord_t *, + unsigned from, unsigned count, shift_direction where_is_free_space, unsigned free_space); + + int (*create_hook) (const coord_t *, void *); + /* do whatever is necessary to do when @count units starting + from @from-th one are removed from the tree */ + /* FIXME-VS: this used to be here for, in particular, + extents and items of internal type to free blocks they point + to at the same time as removing items from a + tree. Problems start, however, when dealloc_block fails for + some reason. Item gets removed, but blocks it pointed to + are not freed. It is not clear how to fix this for items of + internal type because a need to remove internal item may + appear in the middle of balancing, and there is no way to + undo changes made. OTOH, if space allocator involves + balancing to perform dealloc_block - this will probably + break balancing due to deadlock issues + */ + int (*kill_hook) (const coord_t *, pos_in_node_t from, pos_in_node_t count, struct carry_kill_data *); + int (*shift_hook) (const coord_t *, unsigned from, unsigned count, znode *_node); + + /* unit @*from contains @from_key. unit @*to contains @to_key. Cut all keys between @from_key and @to_key + including boundaries. When units are cut from item beginning - move space which gets freed to head of + item. When units are cut from item end - move freed space to item end. When units are cut from the middle of + item - move freed space to item head. Return amount of space which got freed. Save smallest removed key in + @smallest_removed if it is not 0.
Save new first item key in @new_first_key if it is not 0 + */ + int (*cut_units) (coord_t *, pos_in_node_t from, pos_in_node_t to, struct carry_cut_data *, + reiser4_key *smallest_removed, reiser4_key *new_first_key); + + /* like cut_units, except that these units are removed from the + tree, not only from a node */ + int (*kill_units) (coord_t *, pos_in_node_t from, pos_in_node_t to, struct carry_kill_data *, + reiser4_key *smallest_removed, reiser4_key *new_first); + + /* if @key_of_coord == 1 - key of coord is returned, otherwise - + key of unit is returned. If @coord is not set to a certain + unit - ERR_PTR(-ENOENT) is returned */ + reiser4_key *(*unit_key) (const coord_t *, reiser4_key *); + reiser4_key *(*max_unit_key) (const coord_t *, reiser4_key *); + /* estimate how much space is needed to paste @data into item at + @coord. if @coord==0 - estimate insertion, otherwise - estimate + pasting + */ + int (*estimate) (const coord_t *, const reiser4_item_data *); + + /* converts flow @f to item data. @coord == 0 on insert */ + int (*item_data_by_flow) (const coord_t *, const flow_t *, reiser4_item_data *); + + /*void (*show) (struct seq_file *, coord_t *);*/ + +#if REISER4_DEBUG + /* used for debugging, every item should have here the most + complete possible check of the consistency of the item that + the inventor can construct */ + int (*check) (const coord_t *, const char **error); +#endif + +} balance_ops; + +typedef struct { + /* return the right or left child of @coord, only if it is in memory */ + int (*utmost_child) (const coord_t *, sideof side, jnode ** child); + + /* return whether the right or left child of @coord has a non-fake + block number. */ + int (*utmost_child_real_block) (const coord_t *, sideof side, reiser4_block_nr *); + /* relocate child at @coord to the @block */ + void (*update) (const coord_t *, const reiser4_block_nr *); + /* count unformatted nodes per item for leaf relocation policy, etc. 
*/ + int (*scan) (flush_scan * scan); + /* convert item by flush */ + int (*convert) (flush_pos_t * pos); + /* backward mapping from jnode offset to a key. */ + int (*key_by_offset) (struct inode *, loff_t, reiser4_key *); +} flush_ops; + +/* operations specific to the directory item */ +typedef struct { + /* extract stat-data key from directory entry at @coord and place it + into @key. */ + int (*extract_key) (const coord_t *, reiser4_key * key); + /* update object key in item. */ + int (*update_key) (const coord_t *, const reiser4_key *, lock_handle *); + /* extract name from directory entry at @coord and return it */ + char *(*extract_name) (const coord_t *, char *buf); + /* extract file type (DT_* stuff) from directory entry at @coord and + return it */ + unsigned (*extract_file_type) (const coord_t *); + int (*add_entry) (struct inode *dir, + coord_t *, lock_handle *, + const struct dentry *name, reiser4_dir_entry_desc *entry); + int (*rem_entry) (struct inode *dir, const struct qstr *name, + coord_t *, lock_handle *, + reiser4_dir_entry_desc *entry); + int (*max_name_len) (const struct inode *dir); +} dir_entry_ops; + +/* operations specific to items regular (unix) file metadata are built of */ +typedef struct { + int (*write)(struct inode *, flow_t *, hint_t *, int grabbed, write_mode_t); + int (*read)(struct file *, flow_t *, hint_t *); + int (*readpage) (void *, struct page *); + int (*capture) (reiser4_key *, uf_coord_t *, struct page *, write_mode_t); + int (*get_block) (const coord_t *, sector_t, struct buffer_head *); + void (*readpages) (void *, struct address_space *, struct list_head *pages); + /* key of first byte which is not addressed by the item @coord is set to + For example extent with the key + + (LOCALITY,4,OBJID,STARTING-OFFSET), and length BLK blocks, + + ->append_key is + + (LOCALITY,4,OBJID,STARTING-OFFSET + BLK * block_size) */ + /* FIXME: could be uf_coord also */ + reiser4_key *(*append_key) (const coord_t *, reiser4_key *); + + void 
(*init_coord_extension)(uf_coord_t *, loff_t); +} file_ops; + +/* operations specific to items of stat data type */ +typedef struct { + int (*init_inode) (struct inode * inode, char *sd, int len); + int (*save_len) (struct inode * inode); + int (*save) (struct inode * inode, char **area); +} sd_ops; + +/* operations specific to internal item */ +typedef struct { + /* all tree traversal want to know from internal item is where + to go next. */ + void (*down_link) (const coord_t * coord, + const reiser4_key * key, reiser4_block_nr * block); + /* check that given internal item contains given pointer. */ + int (*has_pointer_to) (const coord_t * coord, + const reiser4_block_nr * block); +} internal_item_ops; + +struct item_plugin { + /* generic fields */ + plugin_header h; + + /* methods common for all item types */ + balance_ops b; + /* methods used during flush */ + flush_ops f; + + /* methods specific to particular type of item */ + union { + dir_entry_ops dir; + file_ops file; + sd_ops sd; + internal_item_ops internal; + } s; + +}; + +static inline item_id +item_id_by_plugin(item_plugin * plugin) +{ + return plugin->h.id; +} + +static inline char +get_iplugid(item_plugin *iplug) +{ + assert("nikita-2838", iplug != NULL); + assert("nikita-2839", 0 <= iplug->h.id && iplug->h.id < 0xff); + return (char)item_id_by_plugin(iplug); +} + +extern unsigned long znode_times_locked(const znode *z); + +static inline void +coord_set_iplug(coord_t * coord, item_plugin *iplug) +{ + assert("nikita-2837", coord != NULL); + assert("nikita-2838", iplug != NULL); + coord->iplugid = get_iplugid(iplug); + ON_DEBUG(coord->plug_v = znode_times_locked(coord->node)); +} + +static inline item_plugin * +coord_iplug(const coord_t * coord) +{ + assert("nikita-2833", coord != NULL); + assert("nikita-2834", coord->iplugid != INVALID_PLUGID); + assert("nikita-3549", coord->plug_v == znode_times_locked(coord->node)); + return (item_plugin *)plugin_by_id(REISER4_ITEM_PLUGIN_TYPE, + coord->iplugid); +} 
+ +extern int item_can_contain_key(const coord_t * item, const reiser4_key * key, const reiser4_item_data *); +extern int are_items_mergeable(const coord_t * i1, const coord_t * i2); +extern int item_is_extent(const coord_t *); +extern int item_is_tail(const coord_t *); +extern int item_is_statdata(const coord_t * item); +extern int item_is_ctail(const coord_t *); + +extern pos_in_node_t item_length_by_coord(const coord_t * coord); +extern item_type_id item_type_by_coord(const coord_t * coord); +extern item_id item_id_by_coord(const coord_t * coord /* coord to query */ ); +extern reiser4_key *item_key_by_coord(const coord_t * coord, reiser4_key * key); +extern reiser4_key *max_item_key_by_coord(const coord_t *, reiser4_key *); +extern reiser4_key *unit_key_by_coord(const coord_t * coord, reiser4_key * key); +extern reiser4_key *max_unit_key_by_coord(const coord_t * coord, reiser4_key * key); + +extern void obtain_item_plugin(const coord_t * coord); + +#if defined(REISER4_DEBUG) || defined(REISER4_DEBUG_MODIFY) || defined(REISER4_DEBUG_OUTPUT) +extern int znode_is_loaded(const znode * node); +#endif + +/* return plugin of item at @coord */ +static inline item_plugin * +item_plugin_by_coord(const coord_t * coord /* coord to query */ ) +{ + assert("nikita-330", coord != NULL); + assert("nikita-331", coord->node != NULL); + assert("nikita-332", znode_is_loaded(coord->node)); + + if (unlikely(!coord_is_iplug_set(coord))) + obtain_item_plugin(coord); + return coord_iplug(coord); +} + +/* this returns true if item is of internal type */ +static inline int +item_is_internal(const coord_t * item) +{ + assert("vs-483", coord_is_existing_item(item)); + return item_type_by_coord(item) == INTERNAL_ITEM_TYPE; +} + +extern void item_body_by_coord_hard(coord_t * coord); +extern void *item_body_by_coord_easy(const coord_t * coord); +#if REISER4_DEBUG +extern int item_body_is_valid(const coord_t * coord); +#endif + +/* return pointer to item body */ +static inline void * 
+item_body_by_coord(const coord_t * coord /* coord to query */ ) +{ + assert("nikita-324", coord != NULL); + assert("nikita-325", coord->node != NULL); + assert("nikita-326", znode_is_loaded(coord->node)); + + if (coord->offset == INVALID_OFFSET) + item_body_by_coord_hard((coord_t *)coord); + assert("nikita-3201", item_body_is_valid(coord)); + assert("nikita-3550", coord->body_v == znode_times_locked(coord->node)); + return item_body_by_coord_easy(coord); +} + +/* __REISER4_ITEM_H__ */ +#endif +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/sde.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/sde.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,216 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Directory entry implementation */ +#include "../../forward.h" +#include "../../debug.h" +#include "../../dformat.h" +#include "../../kassign.h" +#include "../../coord.h" +#include "sde.h" +#include "item.h" +#include "../plugin.h" +#include "../../znode.h" +#include "../../carry.h" +#include "../../tree.h" +#include "../../inode.h" + +#include <linux/fs.h> /* for struct inode */ +#include <linux/dcache.h> /* for struct dentry */ +#include <linux/quotaops.h> + +#if REISER4_DEBUG_OUTPUT +reiser4_internal void +print_de(const char *prefix /* prefix to print */ , + coord_t * coord /* item to print */ ) +{ + assert("nikita-1456", prefix != NULL); + assert("nikita-1457", coord != NULL); + + if (item_length_by_coord(coord) < (int) sizeof (directory_entry_format)) { + printk("%s: wrong size: %i < %i\n", prefix, item_length_by_coord(coord), (int) sizeof (directory_entry_format)); + } else { + reiser4_key sdkey; + char *name; + char buf[DE_NAME_BUF_LEN]; + + extract_key_de(coord, &sdkey); + name = extract_name_de(coord, buf); + printk("%s: name: %s\n", prefix, name); + print_key("\tsdkey", &sdkey); + } +} +#endif 
+ +/* ->extract_key() method of simple directory item plugin. */ +reiser4_internal int +extract_key_de(const coord_t * coord /* coord of item */ , + reiser4_key * key /* resulting key */ ) +{ + directory_entry_format *dent; + + assert("nikita-1458", coord != NULL); + assert("nikita-1459", key != NULL); + + dent = (directory_entry_format *) item_body_by_coord(coord); + assert("nikita-1158", item_length_by_coord(coord) >= (int) sizeof *dent); + return extract_key_from_id(&dent->id, key); +} + +reiser4_internal int +update_key_de(const coord_t * coord, const reiser4_key * key, lock_handle * lh UNUSED_ARG) +{ + directory_entry_format *dent; + obj_key_id obj_id; + int result; + + assert("nikita-2342", coord != NULL); + assert("nikita-2343", key != NULL); + + dent = (directory_entry_format *) item_body_by_coord(coord); + result = build_obj_key_id(key, &obj_id); + if (result == 0) { + dent->id = obj_id; + znode_make_dirty(coord->node); + } + return 0; +} + +reiser4_internal char * +extract_dent_name(const coord_t * coord, directory_entry_format *dent, char *buf) +{ + reiser4_key key; + + unit_key_by_coord(coord, &key); + if (get_key_type(&key) != KEY_FILE_NAME_MINOR) + print_address("oops", znode_get_block(coord->node)); + if (!is_longname_key(&key)) { + if (is_dot_key(&key)) + return (char *) "."; + else + return extract_name_from_key(&key, buf); + } else + return (char *) dent->name; +} + +/* ->extract_name() method of simple directory item plugin. */ +reiser4_internal char * +extract_name_de(const coord_t * coord /* coord of item */, char *buf) +{ + directory_entry_format *dent; + + assert("nikita-1460", coord != NULL); + + dent = (directory_entry_format *) item_body_by_coord(coord); + return extract_dent_name(coord, dent, buf); +} + +/* ->extract_file_type() method of simple directory item plugin. 
*/ +reiser4_internal unsigned +extract_file_type_de(const coord_t * coord UNUSED_ARG /* coord of + * item */ ) +{ + assert("nikita-1764", coord != NULL); + /* we don't store file type in the directory entry yet. + + But see comments at kassign.h:obj_key_id + */ + return DT_UNKNOWN; +} + +reiser4_internal int +add_entry_de(struct inode *dir /* directory of item */ , + coord_t * coord /* coord of item */ , + lock_handle * lh /* insertion lock handle */ , + const struct dentry *de /* name to add */ , + reiser4_dir_entry_desc * entry /* parameters of new directory + * entry */ ) +{ + reiser4_item_data data; + directory_entry_format *dent; + int result; + const char *name; + int len; + int longname; + + name = de->d_name.name; + len = de->d_name.len; + assert("nikita-1163", strlen(name) == len); + + longname = is_longname(name, len); + + data.length = sizeof *dent; + if (longname) + data.length += len + 1; + data.data = NULL; + data.user = 0; + data.iplug = item_plugin_by_id(SIMPLE_DIR_ENTRY_ID); + + /* NOTE-NIKITA quota plugin */ + if (DQUOT_ALLOC_SPACE_NODIRTY(dir, data.length)) + return -EDQUOT; + + result = insert_by_coord(coord, &data, &entry->key, lh, 0 /*flags */ ); + if (result != 0) + return result; + + dent = (directory_entry_format *) item_body_by_coord(coord); + build_inode_key_id(entry->obj, &dent->id); + if (longname) { + memcpy(dent->name, name, len); + cputod8(0, &dent->name[len]); + } + return 0; +} + +reiser4_internal int +rem_entry_de(struct inode *dir /* directory of item */ , + const struct qstr * name UNUSED_ARG, + coord_t * coord /* coord of item */ , + lock_handle * lh UNUSED_ARG /* lock handle for + * removal */ , + reiser4_dir_entry_desc * entry UNUSED_ARG /* parameters of + * directory entry + * being removed */ ) +{ + coord_t shadow; + int result; + int length; + + length = item_length_by_coord(coord); + if (inode_get_bytes(dir) < length) { + warning("nikita-2627", "Dir is broke: %llu: %llu", + (unsigned long long)get_inode_oid(dir), + 
inode_get_bytes(dir)); + + return RETERR(-EIO); + } + + /* cut_node() is supposed to take pointers to _different_ + coords, because it will modify them without respect to + possible aliasing. To work around this, create temporary copy + of @coord. + */ + coord_dup(&shadow, coord); + result = kill_node_content(coord, &shadow, NULL, NULL, NULL, NULL, NULL, 0); + if (result == 0) { + /* NOTE-NIKITA quota plugin */ + DQUOT_FREE_SPACE_NODIRTY(dir, length); + } + return result; +} + +reiser4_internal int +max_name_len_de(const struct inode *dir) +{ + return tree_by_inode(dir)->nplug->max_item_size() - sizeof (directory_entry_format) - 2; +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/sde.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/sde.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,64 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Directory entry. */ + +#if !defined( __FS_REISER4_PLUGIN_DIRECTORY_ENTRY_H__ ) +#define __FS_REISER4_PLUGIN_DIRECTORY_ENTRY_H__ + +#include "../../forward.h" +#include "../../dformat.h" +#include "../../kassign.h" +#include "../../key.h" + +#include +#include /* for struct dentry */ + +typedef struct directory_entry_format { + /* key of object stat-data. It's not necessary to store whole + key here, because it's always key of stat-data, so minor + packing locality and offset can be omitted here. But this + relies on particular key allocation scheme for stat-data, so, + for extensibility sake, whole key can be stored here. + + We store key as array of bytes, because we don't want 8-byte + alignment of dir entries. + */ + obj_key_id id; + /* file name. Null terminated string. 
*/ + d8 name[0]; +} directory_entry_format; + +void print_de(const char *prefix, coord_t * coord); +int extract_key_de(const coord_t * coord, reiser4_key * key); +int update_key_de(const coord_t * coord, const reiser4_key * key, lock_handle * lh); +char *extract_name_de(const coord_t * coord, char *buf); +unsigned extract_file_type_de(const coord_t * coord); +int add_entry_de(struct inode *dir, coord_t * coord, + lock_handle * lh, const struct dentry *name, reiser4_dir_entry_desc * entry); +int rem_entry_de(struct inode *dir, const struct qstr * name, coord_t * coord, lock_handle * lh, reiser4_dir_entry_desc * entry); +int max_name_len_de(const struct inode *dir); + + +int de_rem_and_shrink(struct inode *dir, coord_t * coord, int length); + +char *extract_dent_name(const coord_t * coord, + directory_entry_format *dent, char *buf); + +#if REISER4_LARGE_KEY +#define DE_NAME_BUF_LEN (24) +#else +#define DE_NAME_BUF_LEN (16) +#endif + +/* __FS_REISER4_PLUGIN_DIRECTORY_ENTRY_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/static_stat.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/static_stat.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1319 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* stat data manipulation. */ + +#include "../../forward.h" +#include "../../super.h" +#include "../../vfs_ops.h" +#include "../../inode.h" +#include "../../debug.h" +#include "../../dformat.h" +#include "../object.h" +#include "../plugin.h" +#include "../plugin_header.h" +#include "static_stat.h" +#include "item.h" + +#include +#include + +/* see static_stat.h for explanation */ + +/* helper function used while we are dumping/loading inode/plugin state + to/from the stat-data. 
*/ + +static void +move_on(int *length /* space remaining in stat-data */ , + char **area /* current coord in stat data */ , + int size_of /* how many bytes to move forward */ ) +{ + assert("nikita-615", length != NULL); + assert("nikita-616", area != NULL); + + *length -= size_of; + *area += size_of; + + assert("nikita-617", *length >= 0); +} + +#if REISER4_DEBUG_OUTPUT +/* ->print() method of static sd item. Prints human readable information about + sd at @coord */ +reiser4_internal void +print_sd(const char *prefix /* prefix to print */ , + coord_t * coord /* coord of item */ ) +{ + char *sd; + int len; + int bit; + int chunk; + __u16 mask; + reiser4_stat_data_base *sd_base; + + assert("nikita-1254", prefix != NULL); + assert("nikita-1255", coord != NULL); + + sd = item_body_by_coord(coord); + len = item_length_by_coord(coord); + + sd_base = (reiser4_stat_data_base *) sd; + if (len < (int) sizeof *sd_base) { + printk("%s: wrong size: %i < %i\n", prefix, item_length_by_coord(coord), sizeof *sd_base); + return; + } + + mask = d16tocpu(&sd_base->extmask); + printk("%s: extmask: %x\n", prefix, mask); + + move_on(&len, &sd, sizeof *sd_base); + + for (bit = 0, chunk = 0; mask != 0; ++bit, mask >>= 1) { + if (((bit + 1) % 16) != 0) { + /* handle extension */ + sd_ext_plugin *sdplug; + + sdplug = sd_ext_plugin_by_id(bit); + if (sdplug == NULL) { + continue; + } + if ((mask & 1) && sdplug->print != NULL) { + /* alignment is not supported in node layout + plugin yet. 
+ result = align( inode, &len, &sd, + sdplug -> alignment ); + if( result != 0 ) + return result; */ + sdplug->print(prefix, &sd, &len); + } + } else if (mask & 1) { + /* next portion of bitmask */ + if (len < (int) sizeof (d16)) { + warning("nikita-2708", "No space for bitmap"); + break; + } + mask = d16tocpu((d16 *) sd); + move_on(&len, &sd, sizeof (d16)); + ++chunk; + if (chunk == 3) { + if (!(mask & 0x8000)) { + /* clear last bit */ + mask &= ~0x8000; + continue; + } + /* too much */ + warning("nikita-2709", "Too many extensions"); + break; + } + } else + /* bitmask exhausted */ + break; + } +} +#endif + +reiser4_internal void +item_stat_static_sd(const coord_t * coord, void *vp) +{ + reiser4_stat_data_base *sd; + mode_t mode; + sd_stat *stat; + + stat = (sd_stat *) vp; + sd = (reiser4_stat_data_base *) item_body_by_coord(coord); + mode = 0; // d16tocpu( &sd -> mode ); + + if (S_ISREG(mode)) + stat->files++; + else if (S_ISDIR(mode)) + stat->dirs++; + else + stat->others++; +} + +/* helper function used while loading inode/plugin state from stat-data. + Complain if there is less space in stat-data than was expected. + Can only happen on disk corruption. */ +static int +not_enough_space(struct inode *inode /* object being processed */ , + const char *where /* error message */ ) +{ + assert("nikita-618", inode != NULL); + + warning("nikita-619", "Not enough space in %llu while loading %s", + (unsigned long long)get_inode_oid(inode), where); + + return RETERR(-EINVAL); +} + +/* helper function used while loading inode/plugin state from + stat-data. Call it if invalid plugin id was found. 
*/ +static int +unknown_plugin(reiser4_plugin_id id /* invalid id */ , + struct inode *inode /* object being processed */ ) +{ + warning("nikita-620", "Unknown plugin %i in %llu", + id, (unsigned long long)get_inode_oid(inode)); + + return RETERR(-EINVAL); +} + +#if 0 /* Item alignment is not yet supported */ + +/* helper function used while storing/loading inode/plugin data to/from + stat-data. Move current coord in stat-data ("area") to position + aligned up to "alignment" bytes. */ +static int +align(struct inode *inode /* object being processed */ , + int *length /* space remaining in stat-data */ , + char **area /* current coord in stat data */ , + int alignment /* required alignment */ ) +{ + int delta; + + assert("nikita-621", inode != NULL); + assert("nikita-622", length != NULL); + assert("nikita-623", area != NULL); + assert("nikita-624", alignment > 0); + + delta = round_up(*area, alignment) - *area; + if (delta > *length) + return not_enough_space(inode, "padding"); + if (delta > 0) + move_on(length, area, delta); + return 0; +} + +#endif /* 0 */ + +/* this is installed as ->init_inode() method of + item_plugins[ STATIC_STAT_DATA_IT ] (fs/reiser4/plugin/item/item.c). + Copies data from on-disk stat-data format into inode. + Handles stat-data extensions. 
*/ +/* was sd_load */ +reiser4_internal int +init_inode_static_sd(struct inode *inode /* object being processed */ , + char *sd /* stat-data body */ , + int len /* length of stat-data */ ) +{ + int result; + int bit; + int chunk; + __u16 mask; + __u64 bigmask; + reiser4_stat_data_base *sd_base; + reiser4_inode *state; + + assert("nikita-625", inode != NULL); + assert("nikita-626", sd != NULL); + + result = 0; + sd_base = (reiser4_stat_data_base *) sd; + state = reiser4_inode_data(inode); + mask = d16tocpu(&sd_base->extmask); + bigmask = mask; + inode_set_flag(inode, REISER4_SDLEN_KNOWN); + + move_on(&len, &sd, sizeof *sd_base); + for (bit = 0, chunk = 0; mask != 0 || bit <= LAST_IMPORTANT_SD_EXTENSION; ++bit, mask >>= 1) { + if (((bit + 1) % 16) != 0) { + /* handle extension */ + sd_ext_plugin *sdplug; + + sdplug = sd_ext_plugin_by_id(bit); + if (sdplug == NULL) { + warning("nikita-627", "No such extension %i in inode %llu", + bit, (unsigned long long)get_inode_oid(inode)); + + result = RETERR(-EINVAL); + break; + } + if (mask & 1) { + assert("nikita-628", sdplug->present); + /* alignment is not supported in node layout + plugin yet. 
+ result = align( inode, &len, &sd, + sdplug -> alignment ); + if( result != 0 ) + return result; */ + result = sdplug->present(inode, &sd, &len); + } else if (sdplug->absent != NULL) + result = sdplug->absent(inode); + if (result) + break; + /* else, we are looking at the last bit in 16-bit + portion of bitmask */ + } else if (mask & 1) { + /* next portion of bitmask */ + if (len < (int) sizeof (d16)) { + warning("nikita-629", "No space for bitmap in inode %llu", + (unsigned long long)get_inode_oid(inode)); + + result = RETERR(-EINVAL); + break; + } + mask = d16tocpu((d16 *) sd); + bigmask <<= 16; + bigmask |= mask; + move_on(&len, &sd, sizeof (d16)); + ++chunk; + if (chunk == 3) { + if (!(mask & 0x8000)) { + /* clear last bit */ + mask &= ~0x8000; + continue; + } + /* too much */ + warning("nikita-630", "Too many extensions in %llu", + (unsigned long long)get_inode_oid(inode)); + + result = RETERR(-EINVAL); + break; + } + } else + /* bitmask exhausted */ + break; + } + state->extmask = bigmask; + /* common initialisations */ + inode->i_blksize = get_super_private(inode->i_sb)->optimal_io_size; + if (len - (sizeof (d16) * bit / 16) > 0) { + /* alignment in save_len_static_sd() is taken into account + -edward */ + warning("nikita-631", "unused space in inode %llu", + (unsigned long long)get_inode_oid(inode)); + } + + return result; +} + +/* estimates size of stat-data required to store inode. + Installed as ->save_len() method of + item_plugins[ STATIC_STAT_DATA_IT ] (fs/reiser4/plugin/item/item.c). 
*/ +/* was sd_len */ +reiser4_internal int +save_len_static_sd(struct inode *inode /* object being processed */ ) +{ + unsigned int result; + __u64 mask; + int bit; + + assert("nikita-632", inode != NULL); + + result = sizeof (reiser4_stat_data_base); + mask = reiser4_inode_data(inode)->extmask; + for (bit = 0; mask != 0; ++bit, mask >>= 1) { + if (mask & 1) { + sd_ext_plugin *sdplug; + + sdplug = sd_ext_plugin_by_id(bit); + assert("nikita-633", sdplug != NULL); + /* no alignment support + result += + round_up( result, sdplug -> alignment ) - result; */ + result += sdplug->save_len(inode); + } + } + result += sizeof (d16) * bit / 16; + return result; +} + +/* saves inode into stat-data. + Installed as ->save() method of + item_plugins[ STATIC_STAT_DATA_IT ] (fs/reiser4/plugin/item/item.c). */ +/* was sd_save */ +reiser4_internal int +save_static_sd(struct inode *inode /* object being processed */ , + char **area /* where to save stat-data */ ) +{ + int result; + __u64 emask; + int bit; + unsigned int len; + reiser4_stat_data_base *sd_base; + + assert("nikita-634", inode != NULL); + assert("nikita-635", area != NULL); + + result = 0; + emask = reiser4_inode_data(inode)->extmask; + sd_base = (reiser4_stat_data_base *) * area; + cputod16((unsigned) (emask & 0xffff), &sd_base->extmask); + + *area += sizeof *sd_base; + len = 0xffffffffu; + for (bit = 0; emask != 0; ++bit, emask >>= 1) { + if (emask & 1) { + if ((bit + 1) % 16 != 0) { + sd_ext_plugin *sdplug; + sdplug = sd_ext_plugin_by_id(bit); + assert("nikita-636", sdplug != NULL); + /* no alignment support yet + align( inode, &len, area, + sdplug -> alignment ); */ + result = sdplug->save(inode, area); + if (result) + break; + } else { + cputod16((unsigned) (emask & 0xffff), (d16 *) * area); + *area += sizeof (d16); + } + } + } + return result; +} + +/* stat-data extension handling functions. 
*/ + +static int +present_lw_sd(struct inode *inode /* object being processed */ , + char **area /* position in stat-data */ , + int *len /* remaining length */ ) +{ + if (*len >= (int) sizeof (reiser4_light_weight_stat)) { + reiser4_light_weight_stat *sd_lw; + + sd_lw = (reiser4_light_weight_stat *) * area; + + inode->i_mode = d16tocpu(&sd_lw->mode); + inode->i_nlink = d32tocpu(&sd_lw->nlink); + inode->i_size = d64tocpu(&sd_lw->size); + if ((inode->i_mode & S_IFMT) == (S_IFREG | S_IFIFO)) { + inode->i_mode &= ~S_IFIFO; + inode_set_flag(inode, REISER4_PART_CONV); + } + move_on(len, area, sizeof *sd_lw); + return 0; + } else + return not_enough_space(inode, "lw sd"); +} + +static int +save_len_lw_sd(struct inode *inode UNUSED_ARG /* object being + * processed */ ) +{ + return sizeof (reiser4_light_weight_stat); +} + +static int +save_lw_sd(struct inode *inode /* object being processed */ , + char **area /* position in stat-data */ ) +{ + reiser4_light_weight_stat *sd; + mode_t delta; + + assert("nikita-2705", inode != NULL); + assert("nikita-2706", area != NULL); + assert("nikita-2707", *area != NULL); + + sd = (reiser4_light_weight_stat *) * area; + + delta = inode_get_flag(inode, REISER4_PART_CONV) ? 
S_IFIFO : 0; + cputod16(inode->i_mode | delta, &sd->mode); + cputod32(inode->i_nlink, &sd->nlink); + cputod64((__u64) inode->i_size, &sd->size); + *area += sizeof *sd; + return 0; +} + +#if REISER4_DEBUG_OUTPUT +static void +print_lw_sd(const char *prefix, char **area /* position in stat-data */ , + int *len /* remaining length */ ) +{ + reiser4_light_weight_stat *sd; + + sd = (reiser4_light_weight_stat *) * area; + printk("%s: mode: %o, nlink: %i, size: %llu\n", prefix, + d16tocpu(&sd->mode), d32tocpu(&sd->nlink), d64tocpu(&sd->size)); + move_on(len, area, sizeof *sd); +} +#endif + +static int +present_unix_sd(struct inode *inode /* object being processed */ , + char **area /* position in stat-data */ , + int *len /* remaining length */ ) +{ + assert("nikita-637", inode != NULL); + assert("nikita-638", area != NULL); + assert("nikita-639", *area != NULL); + assert("nikita-640", len != NULL); + assert("nikita-641", *len > 0); + + if (*len >= (int) sizeof (reiser4_unix_stat)) { + reiser4_unix_stat *sd; + + sd = (reiser4_unix_stat *) * area; + + inode->i_uid = d32tocpu(&sd->uid); + inode->i_gid = d32tocpu(&sd->gid); + inode->i_atime.tv_sec = d32tocpu(&sd->atime); + inode->i_mtime.tv_sec = d32tocpu(&sd->mtime); + inode->i_ctime.tv_sec = d32tocpu(&sd->ctime); + if (S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode)) + inode->i_rdev = d64tocpu(&sd->u.rdev); + else + inode_set_bytes(inode, (loff_t) d64tocpu(&sd->u.bytes)); + move_on(len, area, sizeof *sd); + return 0; + } else + return not_enough_space(inode, "unix sd"); +} + +static int +absent_unix_sd(struct inode *inode /* object being processed */ ) +{ + inode->i_uid = get_super_private(inode->i_sb)->default_uid; + inode->i_gid = get_super_private(inode->i_sb)->default_gid; + inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME; + inode_set_bytes(inode, inode->i_size); + /* mark inode as lightweight, so that caller (reiser4_lookup) will + complete initialisation by copying [ug]id from a parent. 
*/ + inode_set_flag(inode, REISER4_LIGHT_WEIGHT); + return 0; +} + +/* Audited by: green(2002.06.14) */ +static int +save_len_unix_sd(struct inode *inode UNUSED_ARG /* object being + * processed */ ) +{ + return sizeof (reiser4_unix_stat); +} + +static int +save_unix_sd(struct inode *inode /* object being processed */ , + char **area /* position in stat-data */ ) +{ + reiser4_unix_stat *sd; + + assert("nikita-642", inode != NULL); + assert("nikita-643", area != NULL); + assert("nikita-644", *area != NULL); + + sd = (reiser4_unix_stat *) * area; + cputod32(inode->i_uid, &sd->uid); + cputod32(inode->i_gid, &sd->gid); + cputod32((__u32) inode->i_atime.tv_sec, &sd->atime); + cputod32((__u32) inode->i_ctime.tv_sec, &sd->ctime); + cputod32((__u32) inode->i_mtime.tv_sec, &sd->mtime); + if (S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode)) + cputod64(inode->i_rdev, &sd->u.rdev); + else + cputod64((__u64) inode_get_bytes(inode), &sd->u.bytes); + *area += sizeof *sd; + return 0; +} + +#if REISER4_DEBUG_OUTPUT +static void +print_unix_sd(const char *prefix, char **area /* position in stat-data */ , + int *len /* remaining length */ ) +{ + reiser4_unix_stat *sd; + + sd = (reiser4_unix_stat *) * area; + printk("%s: uid: %i, gid: %i, atime: %i, mtime: %i, ctime: %i, " + "rdev: %llo, bytes: %llu\n", prefix, + d32tocpu(&sd->uid), + d32tocpu(&sd->gid), + d32tocpu(&sd->atime), + d32tocpu(&sd->mtime), d32tocpu(&sd->ctime), d64tocpu(&sd->u.rdev), d64tocpu(&sd->u.bytes)); + move_on(len, area, sizeof *sd); +} +#endif + +static int +present_large_times_sd(struct inode *inode /* object being processed */, + char **area /* position in stat-data */, + int *len /* remaining length */) +{ + if (*len >= (int) sizeof (reiser4_large_times_stat)) { + reiser4_large_times_stat *sd_lt; + + sd_lt = (reiser4_large_times_stat *) * area; + + inode->i_atime.tv_nsec = d32tocpu(&sd_lt->atime); + inode->i_mtime.tv_nsec = d32tocpu(&sd_lt->mtime); + inode->i_ctime.tv_nsec = d32tocpu(&sd_lt->ctime); + + 
move_on(len, area, sizeof *sd_lt); + return 0; + } else + return not_enough_space(inode, "large times sd"); +} + +static int +save_len_large_times_sd(struct inode *inode UNUSED_ARG /* object being processed */ ) +{ + return sizeof (reiser4_large_times_stat); +} + +static int +save_large_times_sd(struct inode *inode /* object being processed */ , + char **area /* position in stat-data */ ) +{ + reiser4_large_times_stat *sd; + + assert("nikita-2817", inode != NULL); + assert("nikita-2818", area != NULL); + assert("nikita-2819", *area != NULL); + + sd = (reiser4_large_times_stat *) * area; + + cputod32((__u32) inode->i_atime.tv_nsec, &sd->atime); + cputod32((__u32) inode->i_ctime.tv_nsec, &sd->ctime); + cputod32((__u32) inode->i_mtime.tv_nsec, &sd->mtime); + + *area += sizeof *sd; + return 0; +} + +#if REISER4_DEBUG_OUTPUT +static void +print_large_times_sd(const char *prefix, char **area /* position in stat-data */, + int *len /* remaining length */ ) +{ + reiser4_large_times_stat *sd; + + sd = (reiser4_large_times_stat *) * area; + printk("%s: nanotimes: a: %i, m: %i, c: %i\n", prefix, + d32tocpu(&sd->atime), d32tocpu(&sd->mtime), d32tocpu(&sd->ctime)); + move_on(len, area, sizeof *sd); +} +#endif + +/* symlink stat data extension */ + +/* allocate memory for symlink target and attach it to inode->u.generic_ip */ +static int +symlink_target_to_inode(struct inode *inode, const char *target, int len) +{ + assert("vs-845", inode->u.generic_ip == 0); + assert("vs-846", !inode_get_flag(inode, REISER4_GENERIC_PTR_USED)); + + /* FIXME-VS: this is prone to deadlock. Not more than other similar + places, though */ + inode->u.generic_ip = reiser4_kmalloc((size_t) len + 1, GFP_KERNEL); + if (!inode->u.generic_ip) + return RETERR(-ENOMEM); + + memcpy((char *) (inode->u.generic_ip), target, (size_t) len); + ((char *) (inode->u.generic_ip))[len] = 0; + inode_set_flag(inode, REISER4_GENERIC_PTR_USED); + return 0; +} + +/* this is called on read_inode. 
There is nothing to do actually, but some + sanity checks */ +static int +present_symlink_sd(struct inode *inode, char **area, int *len) +{ + int result; + int length; + reiser4_symlink_stat *sd; + + length = (int) inode->i_size; + /* + * *len is number of bytes in stat data item from *area to the end of + * item. It must be not less than size of symlink + 1 for ending 0 + */ + if (length > *len) + return not_enough_space(inode, "symlink"); + + if (*(*area + length) != 0) { + warning("vs-840", "Symlink is not zero terminated"); + return RETERR(-EIO); + } + + sd = (reiser4_symlink_stat *) * area; + result = symlink_target_to_inode(inode, sd->body, length); + + move_on(len, area, length + 1); + return result; +} + +static int +save_len_symlink_sd(struct inode *inode) +{ + return inode->i_size + 1; +} + +/* this is called on create and update stat data. Do nothing on update but + update @area */ +static int +save_symlink_sd(struct inode *inode, char **area) +{ + int result; + int length; + reiser4_symlink_stat *sd; + + length = (int) inode->i_size; + /* inode->i_size must be set already */ + assert("vs-841", length); + + result = 0; + sd = (reiser4_symlink_stat *) * area; + if (!inode_get_flag(inode, REISER4_GENERIC_PTR_USED)) { + const char *target; + + target = (const char *) (inode->u.generic_ip); + inode->u.generic_ip = 0; + + result = symlink_target_to_inode(inode, target, length); + + /* copy symlink to stat data */ + memcpy(sd->body, target, (size_t) length); + (*area)[length] = 0; + } else { + /* there is nothing to do in update but move area */ + assert("vs-844", !memcmp(inode->u.generic_ip, sd->body, (size_t) length + 1)); + } + + *area += (length + 1); + return result; +} + +#if REISER4_DEBUG_OUTPUT +static void +print_symlink_sd(const char *prefix, char **area /* position in stat-data */ , + int *len /* remaining length */ ) +{ + reiser4_symlink_stat *sd; + int length; + + sd = (reiser4_symlink_stat *) * area; + length = strlen(sd->body); + printk("%s: 
\"%s\"\n", prefix, sd->body); + move_on(len, area, length + 1); +} +#endif + +static int +present_flags_sd(struct inode *inode /* object being processed */ , + char **area /* position in stat-data */ , + int *len /* remaining length */ ) +{ + assert("nikita-645", inode != NULL); + assert("nikita-646", area != NULL); + assert("nikita-647", *area != NULL); + assert("nikita-648", len != NULL); + assert("nikita-649", *len > 0); + + if (*len >= (int) sizeof (reiser4_flags_stat)) { + reiser4_flags_stat *sd; + + sd = (reiser4_flags_stat *) * area; + inode->i_flags = d32tocpu(&sd->flags); + move_on(len, area, sizeof *sd); + return 0; + } else + return not_enough_space(inode, "generation and attrs"); +} + +/* Audited by: green(2002.06.14) */ +static int +save_len_flags_sd(struct inode *inode UNUSED_ARG /* object being + * processed */ ) +{ + return sizeof (reiser4_flags_stat); +} + +static int +save_flags_sd(struct inode *inode /* object being processed */ , + char **area /* position in stat-data */ ) +{ + reiser4_flags_stat *sd; + + assert("nikita-650", inode != NULL); + assert("nikita-651", area != NULL); + assert("nikita-652", *area != NULL); + + sd = (reiser4_flags_stat *) * area; + cputod32(inode->i_flags, &sd->flags); + *area += sizeof *sd; + return 0; +} + +static int absent_plugin_sd(struct inode *inode); +static int +present_plugin_sd(struct inode *inode /* object being processed */ , + char **area /* position in stat-data */ , + int *len /* remaining length */ ) +{ + reiser4_plugin_stat *sd; + reiser4_plugin *plugin; + int i; + __u16 mask; + int result; + int num_of_plugins; + + assert("nikita-653", inode != NULL); + assert("nikita-654", area != NULL); + assert("nikita-655", *area != NULL); + assert("nikita-656", len != NULL); + assert("nikita-657", *len > 0); + + if (*len < (int) sizeof (reiser4_plugin_stat)) + return not_enough_space(inode, "plugin"); + + sd = (reiser4_plugin_stat *) * area; + + mask = 0; + num_of_plugins = d16tocpu(&sd->plugins_no); + 
move_on(len, area, sizeof *sd); + result = 0; + for (i = 0; i < num_of_plugins; ++i) { + reiser4_plugin_slot *slot; + reiser4_plugin_type type; + pset_member memb; + + slot = (reiser4_plugin_slot *) * area; + if (*len < (int) sizeof *slot) + return not_enough_space(inode, "additional plugin"); + + memb = d16tocpu(&slot->pset_memb); + type = pset_member_to_type_unsafe(memb); + if (type == REISER4_PLUGIN_TYPES) { + warning("nikita-3502", "wrong pset member (%i) for %llu", + memb, (unsigned long long)get_inode_oid(inode)); + return RETERR(-EINVAL); + } + plugin = plugin_by_disk_id(tree_by_inode(inode), + type, &slot->id); + if (plugin == NULL) + return unknown_plugin(d16tocpu(&slot->id), inode); + + /* plugin is loaded into inode, mark this into inode's + bitmask of loaded non-standard plugins */ + if (!(mask & (1 << memb))) { + mask |= (1 << memb); + } else { + warning("nikita-658", "duplicate plugin for %llu", + (unsigned long long)get_inode_oid(inode)); + return RETERR(-EINVAL); + } + move_on(len, area, sizeof *slot); + /* load plugin data, if any */ + if (plugin->h.pops != NULL && plugin->h.pops->load) { + result = plugin->h.pops->load(inode, plugin, area, len); + if (result != 0) + return result; + } else + result = grab_plugin_from(inode, memb, plugin); + } + /* if object plugin wasn't loaded from stat-data, guess it by + mode bits */ + plugin = file_plugin_to_plugin(inode_file_plugin(inode)); + if (plugin == NULL) + result = absent_plugin_sd(inode); + + reiser4_inode_data(inode)->plugin_mask = mask; + return result; +} + +/* Audited by: green(2002.06.14) */ +static int +absent_plugin_sd(struct inode *inode /* object being processed */ ) +{ + int result; + + assert("nikita-659", inode != NULL); + + result = guess_plugin_by_mode(inode); + /* if mode was wrong, guess_plugin_by_mode() returns "regular file", + but setup_inode_ops() will call make_bad_inode(). + Another, more logical but bit more complex solution is to add + "bad-file plugin". 
*/ + /* FIXME-VS: activate was called here */ + return result; +} + +/* helper function for plugin_sd_save_len(): calculate how much space + required to save state of given plugin */ +/* Audited by: green(2002.06.14) */ +static int +len_for(reiser4_plugin * plugin /* plugin to save */ , + struct inode *inode /* object being processed */ , + pset_member memb, int len) +{ + reiser4_inode *info; + assert("nikita-661", inode != NULL); + + info = reiser4_inode_data(inode); + if (plugin != NULL && (info->plugin_mask & (1 << memb))) { + len += sizeof (reiser4_plugin_slot); + if (plugin->h.pops && plugin->h.pops->save_len != NULL) { + /* non-standard plugin, call method */ + /* commented as it is incompatible with alignment + * policy in save_plug() -edward */ + /* len = round_up(len, plugin->h.pops->alignment); */ + len += plugin->h.pops->save_len(inode, plugin); + } + } + return len; +} + +/* calculate how much space is required to save state of all plugins, + associated with inode */ +static int +save_len_plugin_sd(struct inode *inode /* object being processed */ ) +{ + int len; + reiser4_inode *state; + pset_member memb; + + assert("nikita-663", inode != NULL); + + state = reiser4_inode_data(inode); + /* common case: no non-standard plugins */ + if (state->plugin_mask == 0) + return 0; + len = sizeof (reiser4_plugin_stat); + for (memb = 0; memb < PSET_LAST; ++ memb) + len = len_for(pset_get(state->pset, memb), inode, memb, len); + assert("nikita-664", len > (int) sizeof (reiser4_plugin_stat)); + return len; +} + +/* helper function for plugin_sd_save(): save plugin, associated with + inode. */ +static int +save_plug(reiser4_plugin * plugin /* plugin to save */ , + struct inode *inode /* object being processed */ , + pset_member memb /* what element of pset is saved*/, + char **area /* position in stat-data */ , + int *count /* incremented if plugin were actually + * saved. 
*/ ) +{ + reiser4_plugin_slot *slot; + int fake_len; + int result; + + assert("nikita-665", inode != NULL); + assert("nikita-666", area != NULL); + assert("nikita-667", *area != NULL); + + if (plugin == NULL) + return 0; + if (!(reiser4_inode_data(inode)->plugin_mask & (1 << memb))) + return 0; + slot = (reiser4_plugin_slot *) * area; + cputod16(memb, &slot->pset_memb); + cputod16((unsigned) plugin->h.id, &slot->id); + fake_len = (int) 0xffff; + move_on(&fake_len, area, sizeof *slot); + ++*count; + result = 0; + if (plugin->h.pops != NULL) { + if (plugin->h.pops->save != NULL) + result = plugin->h.pops->save(inode, plugin, area); + } + return result; +} + +/* save state of all non-standard plugins associated with inode */ +static int +save_plugin_sd(struct inode *inode /* object being processed */ , + char **area /* position in stat-data */ ) +{ + int result = 0; + int num_of_plugins; + reiser4_plugin_stat *sd; + reiser4_inode *state; + int fake_len; + pset_member memb; + + assert("nikita-669", inode != NULL); + assert("nikita-670", area != NULL); + assert("nikita-671", *area != NULL); + + state = reiser4_inode_data(inode); + if (state->plugin_mask == 0) + return 0; + sd = (reiser4_plugin_stat *) * area; + fake_len = (int) 0xffff; + move_on(&fake_len, area, sizeof *sd); + + num_of_plugins = 0; + for (memb = 0; memb < PSET_LAST; ++ memb) { + result = save_plug(pset_get(state->pset, memb), + inode, memb, area, &num_of_plugins); + if (result != 0) + break; + } + + cputod16((unsigned) num_of_plugins, &sd->plugins_no); + return result; +} + + +/* helper function for crypto_sd_present(), crypto_sd_save. 
+ Allocates memory for crypto stat, keyid and attaches it to the inode */ + +static int crypto_stat_to_inode (struct inode *inode, + crypto_stat_t * tmp, + unsigned int size /* fingerprint size */) +{ + crypto_stat_t * stat; + + assert ("edward-11", (reiser4_inode_data(inode))->crypt == NULL); + assert ("edward-33", !inode_get_flag(inode, REISER4_CRYPTO_STAT_LOADED)); + + stat = reiser4_kmalloc(sizeof(*stat), GFP_KERNEL); + if (!stat) + return RETERR(-ENOMEM); + stat->keyid = reiser4_kmalloc((size_t)size, GFP_KERNEL); + if (!stat->keyid) { + reiser4_kfree(stat); + return RETERR(-ENOMEM); + } + /* load inode crypto-stat */ + stat->keysize = tmp->keysize; + memcpy(stat->keyid, tmp->keyid, (size_t)size); + reiser4_inode_data(inode)->crypt = stat; + + inode_set_flag(inode, REISER4_CRYPTO_STAT_LOADED); + return 0; +} + +/* crypto stat-data extension */ + +static int present_crypto_sd(struct inode *inode, char **area, int *len) +{ + int result; + reiser4_crypto_stat *sd; + crypto_stat_t stat; + digest_plugin * dplug = inode_digest_plugin(inode); + + unsigned int keyid_size; + + assert("edward-06", dplug != NULL); + assert("edward-684", dplug->dsize); + assert("edward-07", area != NULL); + assert("edward-08", *area != NULL); + assert("edward-09", len != NULL); + assert("edward-10", *len > 0); + + if (*len < (int) sizeof (reiser4_crypto_stat)) { + return not_enough_space(inode, "crypto-sd"); + } + keyid_size = dplug->dsize; + /* *len is number of bytes in stat data item from *area to the end of + item. 
It must be not less than size of this extension */ + assert("edward-75", sizeof(*sd) + keyid_size <= *len); + + sd = (reiser4_crypto_stat *) * area; + stat.keysize = d16tocpu(&sd->keysize); + stat.keyid = (__u8 *)sd->keyid; + + result = crypto_stat_to_inode(inode, &stat, keyid_size); + move_on(len, area, sizeof(*sd) + keyid_size); + return result; +} + +static int absent_crypto_sd(struct inode * inode) +{ + return -EIO; +} + +static int save_len_crypto_sd(struct inode *inode) +{ + return (sizeof(reiser4_crypto_stat) + inode_digest_plugin(inode)->dsize); +} + +static int save_crypto_sd(struct inode *inode, char **area) +{ + int result = 0; + reiser4_crypto_stat *sd; + digest_plugin * dplug = inode_digest_plugin(inode); + + assert("edward-12", dplug != NULL); + assert("edward-13", area != NULL); + assert("edward-14", *area != NULL); + assert("edward-76", reiser4_inode_data(inode) != NULL); + + sd = (reiser4_crypto_stat *) *area; + if (!inode_get_flag(inode, REISER4_CRYPTO_STAT_LOADED)) { + /* file is just created */ + crypto_stat_t * stat = reiser4_inode_data(inode)->crypt; + + assert("edward-15", stat != NULL); + + /* copy inode crypto-stat to the disk stat-data */ + cputod16(stat->keysize, &sd->keysize); + memcpy(sd->keyid, stat->keyid, (size_t)dplug->dsize); + inode_set_flag(inode, REISER4_CRYPTO_STAT_LOADED); + } else { + /* do nothing */ + } + *area += (sizeof(*sd) + dplug->dsize); + return result; +} + +#if REISER4_DEBUG_OUTPUT +static void +print_crypto_sd(const char *prefix, char **area /* position in stat-data */ , + int *len /* remaining length */ ) +{ + /* FIXME-EDWARD Make sure we debug only with none digest plugin */ + digest_plugin * dplug = digest_plugin_by_id(NONE_DIGEST_ID); + reiser4_crypto_stat *sd = (reiser4_crypto_stat *) * area; + + printk("%s: keysize: %u keyid: \"%llx\"\n", prefix, d16tocpu(&sd->keysize), *(__u64 *)(sd->keyid)); + move_on(len, area, sizeof(*sd) + dplug->dsize); +} +#endif + +/* cluster stat-data extension */ + +static int 
present_cluster_sd(struct inode *inode, char **area, int *len) +{ + reiser4_inode * info; + + assert("edward-77", inode != NULL); + assert("edward-78", area != NULL); + assert("edward-79", *area != NULL); + assert("edward-80", len != NULL); + assert("edward-81", !inode_get_flag(inode, REISER4_CLUSTER_KNOWN)); + + info = reiser4_inode_data(inode); + + assert("edward-82", info != NULL); + + if (*len >= (int) sizeof (reiser4_cluster_stat)) { + reiser4_cluster_stat *sd; + sd = (reiser4_cluster_stat *) * area; + info->cluster_shift = d8tocpu(&sd->cluster_shift); + inode_set_flag(inode, REISER4_CLUSTER_KNOWN); + move_on(len, area, sizeof *sd); + return 0; + } + else + return not_enough_space(inode, "cluster sd"); +} + +static int absent_cluster_sd(struct inode * inode) +{ + return -EIO; +} + +static int save_len_cluster_sd(struct inode *inode UNUSED_ARG) +{ + return sizeof (reiser4_cluster_stat); +} + +static int save_cluster_sd(struct inode *inode, char **area) +{ + reiser4_cluster_stat *sd; + + assert("edward-106", inode != NULL); + assert("edward-107", area != NULL); + assert("edward-108", *area != NULL); + + sd = (reiser4_cluster_stat *) * area; + if (!inode_get_flag(inode, REISER4_CLUSTER_KNOWN)) { + cputod8(reiser4_inode_data(inode)->cluster_shift, &sd->cluster_shift); + inode_set_flag(inode, REISER4_CLUSTER_KNOWN); + } + else { + /* do nothing */ + } + *area += sizeof *sd; + return 0; +} + +#if REISER4_DEBUG_OUTPUT +static void +print_cluster_sd(const char *prefix, char **area /* position in stat-data */, + int *len /* remaining length */ ) +{ + reiser4_cluster_stat *sd = (reiser4_cluster_stat *) * area; + + printk("%s: %u\n", prefix, d8tocpu(&sd->cluster_shift)); + move_on(len, area, sizeof *sd); +} +#endif + +static int eio(struct inode *inode, char **area, int *len) +{ + return RETERR(-EIO); +} + +sd_ext_plugin sd_ext_plugins[LAST_SD_EXTENSION] = { + [LIGHT_WEIGHT_STAT] = { + .h = { + .type_id = REISER4_SD_EXT_PLUGIN_TYPE, + .id = LIGHT_WEIGHT_STAT, + .pops = 
NULL, + .label = "light-weight sd", + .desc = "sd for light-weight files", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .present = present_lw_sd, + .absent = NULL, + .save_len = save_len_lw_sd, + .save = save_lw_sd, +#if REISER4_DEBUG_OUTPUT + .print = print_lw_sd, +#endif + .alignment = 8 + }, + [UNIX_STAT] = { + .h = { + .type_id = REISER4_SD_EXT_PLUGIN_TYPE, + .id = UNIX_STAT, + .pops = NULL, + .label = "unix-sd", + .desc = "unix stat-data fields", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .present = present_unix_sd, + .absent = absent_unix_sd, + .save_len = save_len_unix_sd, + .save = save_unix_sd, +#if REISER4_DEBUG_OUTPUT + .print = print_unix_sd, +#endif + .alignment = 8 + }, + [LARGE_TIMES_STAT] = { + .h = { + .type_id = REISER4_SD_EXT_PLUGIN_TYPE, + .id = LARGE_TIMES_STAT, + .pops = NULL, + .label = "64time-sd", + .desc = "nanosecond resolution for times", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .present = present_large_times_sd, + .absent = NULL, + .save_len = save_len_large_times_sd, + .save = save_large_times_sd, +#if REISER4_DEBUG_OUTPUT + .print = print_large_times_sd, +#endif + .alignment = 8 + }, + [SYMLINK_STAT] = { + /* stat data of symlink has this extension */ + .h = { + .type_id = REISER4_SD_EXT_PLUGIN_TYPE, + .id = SYMLINK_STAT, + .pops = NULL, + .label = "symlink-sd", + .desc = "stat data is appended with symlink name", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .present = present_symlink_sd, + .absent = NULL, + .save_len = save_len_symlink_sd, + .save = save_symlink_sd, +#if REISER4_DEBUG_OUTPUT + .print = print_symlink_sd, +#endif + .alignment = 8 + }, + [PLUGIN_STAT] = { + .h = { + .type_id = REISER4_SD_EXT_PLUGIN_TYPE, + .id = PLUGIN_STAT, + .pops = NULL, + .label = "plugin-sd", + .desc = "plugin stat-data fields", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .present = present_plugin_sd, + .absent = absent_plugin_sd, + .save_len = save_len_plugin_sd, + .save = save_plugin_sd, +#if REISER4_DEBUG_OUTPUT + .print = NULL, +#endif + 
.alignment = 8 + }, + [FLAGS_STAT] = { + .h = { + .type_id = REISER4_SD_EXT_PLUGIN_TYPE, + .id = FLAGS_STAT, + .pops = NULL, + .label = "flags-sd", + .desc = "inode bit flags", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .present = present_flags_sd, + .absent = NULL, + .save_len = save_len_flags_sd, + .save = save_flags_sd, +#if REISER4_DEBUG_OUTPUT + .print = NULL, +#endif + .alignment = 8 + }, + [CAPABILITIES_STAT] = { + .h = { + .type_id = REISER4_SD_EXT_PLUGIN_TYPE, + .id = CAPABILITIES_STAT, + .pops = NULL, + .label = "capabilities-sd", + .desc = "capabilities", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .present = eio, + .absent = NULL, + .save_len = save_len_flags_sd, + .save = save_flags_sd, +#if REISER4_DEBUG_OUTPUT + .print = NULL, +#endif + .alignment = 8 + }, + [CLUSTER_STAT] = { + .h = { + .type_id = REISER4_SD_EXT_PLUGIN_TYPE, + .id = CLUSTER_STAT, + .pops = NULL, + .label = "cluster-sd", + .desc = "cluster shift", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .present = present_cluster_sd, + .absent = absent_cluster_sd, + /* return IO_ERROR if smthng is wrong */ + .save_len = save_len_cluster_sd, + .save = save_cluster_sd, +#if REISER4_DEBUG_OUTPUT + .print = print_cluster_sd, +#endif + .alignment = 8 + }, + [CRYPTO_STAT] = { + .h = { + .type_id = REISER4_SD_EXT_PLUGIN_TYPE, + .id = CRYPTO_STAT, + .pops = NULL, + .label = "crypto-sd", + .desc = "secret key size and id", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .present = present_crypto_sd, + .absent = absent_crypto_sd, + /* return IO_ERROR if smthng is wrong */ + .save_len = save_len_crypto_sd, + .save = save_crypto_sd, +#if REISER4_DEBUG_OUTPUT + .print = print_crypto_sd, +#endif + .alignment = 8 + } +}; + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/static_stat.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/static_stat.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,220 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* This describes the static_stat item, used to hold all information needed by the stat() syscall. + +In the case where each file has not less than the fields needed by the +stat() syscall, it is more compact to store those fields in this +struct. + +If this item does not exist, then all stats are dynamically resolved. +At the moment, we either resolve all stats dynamically or all of them +statically. If you think this is not fully optimal, and the rest of +reiser4 is working, then fix it...:-) + +*/ + +#if !defined( __FS_REISER4_PLUGIN_ITEM_STATIC_STAT_H__ ) +#define __FS_REISER4_PLUGIN_ITEM_STATIC_STAT_H__ + +#include "../../forward.h" +#include "../../dformat.h" + +#include /* for struct inode */ + +/* Stat data layout: goals and implementation. + +We want to be able to have lightweight files which have complete flexibility in what semantic metadata is attached to +them, including not having semantic metadata attached to them. + +There is one problem with doing that, which is that if in fact you have exactly the same metadata for most files you want to store, then it takes more space to store that metadata in a dynamically sized structure than in a statically sized structure because the statically sized structure knows without recording it what the names and lengths of the attributes are. + +This leads to a natural compromise, which is to special case those files which have simply the standard unix file +attributes, and only employ the full dynamic stat data mechanism for those files that differ from the standard unix file +in their use of file attributes. 
+ +Yet this compromise deserves to be compromised a little. + +We accommodate the case where you have no more than the standard unix file attributes by using an "extension bitmask": +each bit in it indicates presence or absence of a particular stat data extension (see sd_ext_bits enum). + + If the first +bit of the extension bitmask is 0, we have a light-weight file whose attributes are either inherited from parent +directory (as uid, gid) or initialised to some sane values. + + To capitalize on existing code infrastructure, extensions are + implemented as plugins of type REISER4_SD_EXT_PLUGIN_TYPE. + Each stat-data extension plugin implements four methods: + + ->present() called by sd_load() when this extension is found in stat-data + ->absent() called by sd_load() when this extension is not found in stat-data + ->save_len() called by sd_len() to calculate total length of stat-data + ->save() called by sd_save() to store extension data into stat-data + + Implementation is in fs/reiser4/plugin/item/static_stat.c +*/ + +/* stat-data extension. Please order this by presumed frequency of use */ +typedef enum { + /* support for light-weight files */ + LIGHT_WEIGHT_STAT, + /* data required to implement unix stat(2) call. Layout is in + reiser4_unix_stat. If this is not present, file is light-weight */ + UNIX_STAT, + /* this contains additional set of 32bit [amc]time fields to implement + nanosecond resolution. Layout is in reiser4_large_times_stat. Usage + of this extension is governed by 32bittimes mount option. */ + LARGE_TIMES_STAT, + /* stat data has link name included */ + SYMLINK_STAT, + /* if this is present, file is controlled by non-standard + plugin (that is, plugin that cannot be deduced from file + mode bits), for example, aggregation, interpolation etc. */ + PLUGIN_STAT, + /* this extension contains persistent inode flags. These flags are + single bits: immutable, append-only, etc. Layout is in + reiser4_flags_stat. 
*/ + FLAGS_STAT, + /* this extension contains capabilities sets, associated with this + file. Layout is in reiser4_capabilities_stat */ + CAPABILITIES_STAT, + /* this extension contains the information about minimal unit size for + file data processing. Layout is in reiser4_cluster_stat */ + CLUSTER_STAT, + /* this extension contains size and public id of the secret key. + Layout is in reiser4_crypto_stat */ + CRYPTO_STAT, + LAST_SD_EXTENSION, + /* + * init_inode_static_sd() iterates over extension mask until all + * non-zero bits are processed. This means that neither ->present(), + * nor ->absent() methods will be called for stat-data extensions that + * come after the last present extension. But for some basic extensions, we want + * either ->absent() or ->present() method to be called, because these + * extensions set up something in inode even when they are not + * present. This is what LAST_IMPORTANT_SD_EXTENSION is for: for all + * extensions before and including LAST_IMPORTANT_SD_EXTENSION either + * ->present(), or ->absent() method will be called, independently of + * what other extensions are present. + */ + LAST_IMPORTANT_SD_EXTENSION = PLUGIN_STAT, +} sd_ext_bits; + +/* minimal stat-data. This allows support for light-weight files. 
*/ +typedef struct reiser4_stat_data_base { + /* 0 */ d16 extmask; + /* 2 */ +} PACKED reiser4_stat_data_base; + +typedef struct reiser4_light_weight_stat { + /* 0 */ d16 mode; + /* 2 */ d32 nlink; + /* 6 */ d64 size; + /* size in bytes */ + /* 14 */ +} PACKED reiser4_light_weight_stat; + +typedef struct reiser4_unix_stat { + /* owner id */ + /* 0 */ d32 uid; + /* group id */ + /* 4 */ d32 gid; + /* access time */ + /* 8 */ d32 atime; + /* modification time */ + /* 12 */ d32 mtime; + /* change time */ + /* 16 */ d32 ctime; + union { + /* minor:major for device files */ + /* 20 */ d64 rdev; + /* bytes used by file */ + /* 20 */ d64 bytes; + } u; + /* 28 */ +} PACKED reiser4_unix_stat; + +/* symlink stored as part of inode */ +typedef struct reiser4_symlink_stat { + char body[0]; +} PACKED reiser4_symlink_stat; + +typedef struct reiser4_plugin_slot { + /* 0 */ d16 pset_memb; + /* 2 */ d16 id; +/* 4 *//* here plugin stores its persistent state */ +} PACKED reiser4_plugin_slot; + +/* stat-data extension for files with non-standard plugin. */ +typedef struct reiser4_plugin_stat { + /* number of additional plugins, associated with this object */ + /* 0 */ d16 plugins_no; + /* 2 */ reiser4_plugin_slot slot[0]; + /* 2 */ +} PACKED reiser4_plugin_stat; + +/* stat-data extension for inode flags. Currently it is just a fixed-width 32 + * bit mask. If the need arises, this can be replaced with a variable width + * bitmask. 
*/ +typedef struct reiser4_flags_stat { + /* 0 */ d32 flags; + /* 4 */ +} PACKED reiser4_flags_stat; + +typedef struct reiser4_capabilities_stat { + /* 0 */ d32 effective; + /* 4 */ d32 permitted; + /* 8 */ +} PACKED reiser4_capabilities_stat; + +typedef struct reiser4_cluster_stat { +/* this defines cluster size (an attribute of cryptcompress objects) as PAGE_SIZE << cluster shift */ + /* 0 */ d8 cluster_shift; + /* 1 */ +} PACKED reiser4_cluster_stat; + +typedef struct reiser4_crypto_stat { + /* secret key size, bits */ + /* 0 */ d16 keysize; + /* secret key id */ + /* 2 */ d8 keyid[0]; + /* 2 */ +} PACKED reiser4_crypto_stat; + +typedef struct reiser4_large_times_stat { + /* access time */ + /* 0 */ d32 atime; + /* modification time */ + /* 4 */ d32 mtime; + /* change time */ + /* 8 */ d32 ctime; + /* 12 */ +} PACKED reiser4_large_times_stat; + +/* this structure is filled by sd_item_stat */ +typedef struct sd_stat { + int dirs; + int files; + int others; +} sd_stat; + +/* plugin->item.common.* */ +extern void print_sd(const char *prefix, coord_t * coord); +extern void item_stat_static_sd(const coord_t * coord, void *vp); + +/* plugin->item.s.sd.* */ +extern int init_inode_static_sd(struct inode *inode, char *sd, int len); +extern int save_len_static_sd(struct inode *inode); +extern int save_static_sd(struct inode *inode, char **area); + +/* __FS_REISER4_PLUGIN_ITEM_STATIC_STAT_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/tail.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/tail.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,682 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "item.h" +#include "../../inode.h" +#include "../../page_cache.h" +#include "../../carry.h" +#include "../../vfs_ops.h" + +#include +#include +#include +#include + +/* plugin->u.item.b.max_key_inside */ +reiser4_internal reiser4_key * +max_key_inside_tail(const coord_t *coord, reiser4_key *key) +{ + item_key_by_coord(coord, key); + set_key_offset(key, get_key_offset(max_key())); + return key; +} + +/* plugin->u.item.b.can_contain_key */ +reiser4_internal int +can_contain_key_tail(const coord_t *coord, const reiser4_key *key, const reiser4_item_data *data) +{ + reiser4_key item_key; + + if (item_plugin_by_coord(coord) != data->iplug) + return 0; + + item_key_by_coord(coord, &item_key); + if (get_key_locality(key) != get_key_locality(&item_key) || + get_key_objectid(key) != get_key_objectid(&item_key)) return 0; + + return 1; +} + +/* plugin->u.item.b.mergeable + first item is of tail type */ +/* Audited by: green(2002.06.14) */ +reiser4_internal int +mergeable_tail(const coord_t *p1, const coord_t *p2) +{ + reiser4_key key1, key2; + + assert("vs-535", item_type_by_coord(p1) == UNIX_FILE_METADATA_ITEM_TYPE); + assert("vs-365", item_id_by_coord(p1) == FORMATTING_ID); + + if (item_id_by_coord(p2) != FORMATTING_ID) { + /* second item is of another type */ + return 0; + } + + item_key_by_coord(p1, &key1); + item_key_by_coord(p2, &key2); + if (get_key_locality(&key1) != get_key_locality(&key2) || + get_key_objectid(&key1) != get_key_objectid(&key2) || get_key_type(&key1) != get_key_type(&key2)) { + /* items of different objects */ + return 0; + } + if (get_key_offset(&key1) + 
nr_units_tail(p1) != get_key_offset(&key2)) { + /* not adjacent items */ + return 0; + } + return 1; +} + +/* plugin->u.item.b.print + plugin->u.item.b.check */ + +/* plugin->u.item.b.nr_units */ +reiser4_internal pos_in_node_t +nr_units_tail(const coord_t *coord) +{ + return item_length_by_coord(coord); +} + +/* plugin->u.item.b.lookup */ +reiser4_internal lookup_result +lookup_tail(const reiser4_key *key, lookup_bias bias, coord_t *coord) +{ + reiser4_key item_key; + __u64 lookuped, offset; + unsigned nr_units; + + item_key_by_coord(coord, &item_key); + offset = get_key_offset(item_key_by_coord(coord, &item_key)); + nr_units = nr_units_tail(coord); + + /* key we are looking for must be greater than key of item @coord */ + assert("vs-416", keygt(key, &item_key)); + + /* offset we are looking for */ + lookuped = get_key_offset(key); + + if (lookuped >= offset && lookuped < offset + nr_units) { + /* byte we are looking for is in this item */ + coord->unit_pos = lookuped - offset; + coord->between = AT_UNIT; + return CBK_COORD_FOUND; + } + + /* set coord after last unit */ + coord->unit_pos = nr_units - 1; + coord->between = AFTER_UNIT; + return bias == FIND_MAX_NOT_MORE_THAN ? 
CBK_COORD_FOUND : CBK_COORD_NOTFOUND; +} + +/* plugin->u.item.b.paste */ +reiser4_internal int +paste_tail(coord_t *coord, reiser4_item_data *data, carry_plugin_info *info UNUSED_ARG) +{ + unsigned old_item_length; + char *item; + + /* length the item had before resizing has been performed */ + old_item_length = item_length_by_coord(coord) - data->length; + + /* tail items never get pasted in the middle */ + assert("vs-363", + (coord->unit_pos == 0 && coord->between == BEFORE_UNIT) || + (coord->unit_pos == old_item_length - 1 && + coord->between == AFTER_UNIT) || + (coord->unit_pos == 0 && old_item_length == 0 && coord->between == AT_UNIT)); + + item = item_body_by_coord(coord); + if (coord->unit_pos == 0) + /* make space for pasted data when pasting at the beginning of + the item */ + memmove(item + data->length, item, old_item_length); + + if (coord->between == AFTER_UNIT) + coord->unit_pos++; + + if (data->data) { + assert("vs-554", data->user == 0 || data->user == 1); + if (data->user) { + assert("nikita-3035", schedulable()); + /* AUDIT: return result is not checked! */ + /* copy from user space */ + __copy_from_user(item + coord->unit_pos, data->data, (unsigned) data->length); + } else + /* copy from kernel space */ + memcpy(item + coord->unit_pos, data->data, (unsigned) data->length); + } else { + memset(item + coord->unit_pos, 0, (unsigned) data->length); + } + return 0; +} + +/* plugin->u.item.b.fast_paste */ + +/* plugin->u.item.b.can_shift + number of units is returned via return value, number of bytes via @size. 
For + tail items they coincide */ +reiser4_internal int +can_shift_tail(unsigned free_space, coord_t *source UNUSED_ARG, + znode *target UNUSED_ARG, shift_direction direction UNUSED_ARG, unsigned *size, unsigned want) +{ + /* make sure that we do not want to shift more than we have */ + assert("vs-364", want > 0 && want <= (unsigned) item_length_by_coord(source)); + + *size = min(want, free_space); + return *size; +} + +/* plugin->u.item.b.copy_units */ +reiser4_internal void +copy_units_tail(coord_t *target, coord_t *source, + unsigned from, unsigned count, shift_direction where_is_free_space, unsigned free_space UNUSED_ARG) +{ + /* make sure that item @target is expanded already */ + assert("vs-366", (unsigned) item_length_by_coord(target) >= count); + assert("vs-370", free_space >= count); + + if (where_is_free_space == SHIFT_LEFT) { + /* append the first @count bytes of @source to item @target */ + assert("vs-365", from == 0); + + memcpy((char *) item_body_by_coord(target) + + item_length_by_coord(target) - count, (char *) item_body_by_coord(source), count); + } else { + /* target item is moved to right already */ + reiser4_key key; + + assert("vs-367", (unsigned) item_length_by_coord(source) == from + count); + + memcpy((char *) item_body_by_coord(target), (char *) item_body_by_coord(source) + from, count); + + /* new units are inserted before first unit in an item, + therefore, we have to update item key */ + item_key_by_coord(source, &key); + set_key_offset(&key, get_key_offset(&key) + from); + + node_plugin_by_node(target->node)->update_item_key(target, &key, 0 /*info */); + } +} + +/* plugin->u.item.b.create_hook */ + + +/* item_plugin->b.kill_hook + this is called when @count units starting from @from-th one are going to be removed + */ +reiser4_internal int +kill_hook_tail(const coord_t *coord, pos_in_node_t from, + pos_in_node_t count, struct carry_kill_data *kdata) +{ + reiser4_key key; + loff_t start, end; + + assert("vs-1577", kdata); + 
assert("vs-1579", kdata->inode); + + item_key_by_coord(coord, &key); + start = get_key_offset(&key) + from; + end = start + count; + fake_kill_hook_tail(kdata->inode, start, end, kdata->params.truncate); + return 0; +} + +/* plugin->u.item.b.shift_hook */ + +/* helper for kill_units_tail and cut_units_tail */ +static int +do_cut_or_kill(coord_t *coord, pos_in_node_t from, pos_in_node_t to, + reiser4_key *smallest_removed, reiser4_key *new_first) +{ + pos_in_node_t count; + + /* this method is only called to remove part of item */ + assert("vs-374", (to - from + 1) < item_length_by_coord(coord)); + /* tail items are never cut from the middle of an item */ + assert("vs-396", ergo(from != 0, to == coord_last_unit_pos(coord))); + assert("vs-1558", ergo(from == 0, to < coord_last_unit_pos(coord))); + + count = to - from + 1; + + if (smallest_removed) { + /* store smallest key removed */ + item_key_by_coord(coord, smallest_removed); + set_key_offset(smallest_removed, get_key_offset(smallest_removed) + from); + } + if (new_first) { + /* head of item is cut */ + assert("vs-1529", from == 0); + + item_key_by_coord(coord, new_first); + set_key_offset(new_first, get_key_offset(new_first) + from + count); + } + + if (REISER4_DEBUG) + memset((char *) item_body_by_coord(coord) + from, 0, count); + return count; +} + +/* plugin->u.item.b.cut_units */ +reiser4_internal int +cut_units_tail(coord_t *coord, pos_in_node_t from, pos_in_node_t to, + struct carry_cut_data *cdata UNUSED_ARG, reiser4_key *smallest_removed, reiser4_key *new_first) +{ + return do_cut_or_kill(coord, from, to, smallest_removed, new_first); +} + +/* plugin->u.item.b.kill_units */ +reiser4_internal int +kill_units_tail(coord_t *coord, pos_in_node_t from, pos_in_node_t to, + struct carry_kill_data *kdata, reiser4_key *smallest_removed, reiser4_key *new_first) +{ + kill_hook_tail(coord, from, to - from + 1, kdata); + return do_cut_or_kill(coord, from, to, smallest_removed, new_first); +} + +/* 
plugin->u.item.b.unit_key */ +reiser4_internal reiser4_key * +unit_key_tail(const coord_t *coord, reiser4_key *key) +{ + assert("vs-375", coord_is_existing_unit(coord)); + + item_key_by_coord(coord, key); + set_key_offset(key, (get_key_offset(key) + coord->unit_pos)); + + return key; +} + +/* plugin->u.item.b.estimate + plugin->u.item.b.item_data_by_flow */ + +/* overwrite tail item or part of it with user data */ +static int +overwrite_tail(coord_t *coord, flow_t *f) +{ + unsigned count; + + assert("vs-570", f->user == 1); + assert("vs-946", f->data); + assert("vs-947", coord_is_existing_unit(coord)); + assert("vs-948", znode_is_write_locked(coord->node)); + assert("nikita-3036", schedulable()); + + count = item_length_by_coord(coord) - coord->unit_pos; + if (count > f->length) + count = f->length; + + if (__copy_from_user((char *) item_body_by_coord(coord) + coord->unit_pos, f->data, count)) + return RETERR(-EFAULT); + + znode_make_dirty(coord->node); + + move_flow_forward(f, count); + return 0; +} + +/* tail readpage function. It is called from readpage_tail(). */ +static int +do_readpage_tail(uf_coord_t *uf_coord, struct page *page) +{ + tap_t tap; + int result; + coord_t coord; + lock_handle lh; + + int count, mapped; + struct inode *inode; + + /* save the passed coord so that the tap does not move it. */ + init_lh(&lh); + copy_lh(&lh, uf_coord->lh); + inode = page->mapping->host; + coord_dup(&coord, &uf_coord->coord); + + tap_init(&tap, &coord, &lh, ZNODE_READ_LOCK); + + if ((result = tap_load(&tap))) + goto out_tap_done; + + /* lookup until page is filled up. */ + for (mapped = 0; mapped < PAGE_CACHE_SIZE; mapped += count) { + char *pagedata; + + /* number of bytes to be copied to page. */ + count = item_length_by_coord(&coord) - coord.unit_pos; + + if (count > PAGE_CACHE_SIZE - mapped) + count = PAGE_CACHE_SIZE - mapped; + + /* mapping @page into the kernel address space and getting data address. */ + pagedata = kmap_atomic(page, KM_USER0); + + /* copying tail body to page. 
*/ + memcpy(pagedata + mapped, + ((char *)item_body_by_coord(&coord) + coord.unit_pos), count); + + flush_dcache_page(page); + + /* unmapping page from the kernel address space. */ + kunmap_atomic(pagedata, KM_USER0); + + /* Getting next tail item. */ + if (mapped + count < PAGE_CACHE_SIZE) { + + /* unlocking page so that it is not kept locked during tree lookup, + which takes long term locks. */ + unlock_page(page); + + /* getting right neighbour. */ + result = go_dir_el(&tap, RIGHT_SIDE, 0); + + /* lock page back */ + lock_page(page); + + /* another thread made the page up to date in the meantime. Getting + out of here. */ + if (PageUptodate(page)) { + result = 0; + goto out_unlock_page; + } + + if (result) { + /* check if there is no neighbour node. */ + if (result == -E_NO_NEIGHBOR) { + result = 0; + goto out_update_page; + } else { + goto out_tap_relse; + } + } else { + /* check if found coord is not owned by file. */ + if (!inode_file_plugin(inode)->owns_item(inode, &coord)) { + result = 0; + goto out_update_page; + } + } + } + } + + /* making page up to date and releasing it. */ + SetPageUptodate(page); + unlock_page(page); + + /* releasing tap */ + tap_relse(&tap); + tap_done(&tap); + + return 0; + + out_update_page: + SetPageUptodate(page); + out_unlock_page: + unlock_page(page); + out_tap_relse: + tap_relse(&tap); + out_tap_done: + tap_done(&tap); + return result; +} + +/* + plugin->s.file.readpage + reiser4_read->unix_file_read->page_cache_readahead->reiser4_readpage->unix_file_readpage->readpage_tail + or + filemap_nopage->reiser4_readpage->readpage_unix_file->readpage_tail + + At the beginning: coord->node is read locked, zloaded, page is locked, coord is set to existing unit inside of tail + item. 
*/ +reiser4_internal int +readpage_tail(void *vp, struct page *page) +{ + uf_coord_t *uf_coord = vp; + ON_DEBUG(coord_t *coord = &uf_coord->coord); + ON_DEBUG(reiser4_key key); + + assert("umka-2515", PageLocked(page)); + assert("umka-2516", !PageUptodate(page)); + assert("umka-2517", !jprivate(page) && !PagePrivate(page)); + assert("umka-2518", page->mapping && page->mapping->host); + + assert("umka-2519", znode_is_loaded(coord->node)); + assert("umka-2520", item_is_tail(coord)); + assert("umka-2521", coord_is_existing_unit(coord)); + assert("umka-2522", znode_is_rlocked(coord->node)); + assert("umka-2523", page->mapping->host->i_ino == get_key_objectid(item_key_by_coord(coord, &key))); + + return do_readpage_tail(uf_coord, page); +} + +/* drop longterm znode lock before calling + balance_dirty_pages. balance_dirty_pages may cause transaction to close, + therefore we have to update stat data if necessary */ +static int +tail_balance_dirty_pages(struct address_space *mapping, const flow_t *f, + hint_t *hint) +{ + int result; + struct inode *inode; + + if (hint->ext_coord.valid) + set_hint(hint, &f->key, ZNODE_WRITE_LOCK); + else + unset_hint(hint); + longterm_unlock_znode(hint->ext_coord.lh); + + inode = mapping->host; + if (get_key_offset(&f->key) > inode->i_size) { + assert("vs-1649", f->user == 1); + INODE_SET_FIELD(inode, i_size, get_key_offset(&f->key)); + } + if (f->user != 0) { + /* this was writing data from user space. Update timestamps, therefore. 
Otherwise, this is tail + conversion where we should not update timestamps */ + inode->i_ctime = inode->i_mtime = CURRENT_TIME; + result = reiser4_update_sd(inode); + if (result) + return result; + } + + /* FIXME-VS: this is temporary: the problem is that bdp takes inodes + from sb's dirty list and it looks like nobody puts inodes of + files which are built of tails there */ + move_inode_out_from_sync_inodes_loop(mapping); + + reiser4_throttle_write(inode); + return 0; +} + +/* calculate number of blocks which can be dirtied/added when flow is inserted and stat data gets updated and grab them. + FIXME-VS: we may want to call grab_space with BA_CAN_COMMIT flag but that would require all that complexity with + sealing coord, releasing long term lock and validating seal later */ +static int +insert_flow_reserve(reiser4_tree *tree) +{ + grab_space_enable(); + return reiser4_grab_space(estimate_insert_flow(tree->height) + estimate_one_insert_into_item(tree), 0); +} + +/* one block gets overwritten and stat data may get updated */ +static int +overwrite_reserve(reiser4_tree *tree) +{ + grab_space_enable(); + return reiser4_grab_space(1 + estimate_one_insert_into_item(tree), 0); +} + +/* plugin->u.item.s.file.write + access to data stored in tails goes directly through formatted nodes */ +reiser4_internal int +write_tail(struct inode *inode, flow_t *f, hint_t *hint, + int grabbed, /* tail's write may be called from plain unix file write and from tail conversion. In the first + case (grabbed == 0) space is not reserved beforehand, so it must be done here. When it is + being called from tail conversion, space is reserved already for the whole operation, which may + involve several calls to item write. 
In this case space reservation will not be done here */ + write_mode_t mode) +{ + int result; + coord_t *coord; + + assert("vs-1338", hint->ext_coord.valid == 1); + + coord = &hint->ext_coord.coord; + result = 0; + while (f->length && hint->ext_coord.valid == 1) { + switch (mode) { + case FIRST_ITEM: + case APPEND_ITEM: + /* check quota before appending data */ + if (DQUOT_ALLOC_SPACE_NODIRTY(inode, f->length)) { + result = RETERR(-EDQUOT); + break; + } + + if (!grabbed) + result = insert_flow_reserve(znode_get_tree(coord->node)); + if (!result) + result = insert_flow(coord, hint->ext_coord.lh, f); + if (f->length) + DQUOT_FREE_SPACE_NODIRTY(inode, f->length); + break; + + case OVERWRITE_ITEM: + if (!grabbed) + result = overwrite_reserve(znode_get_tree(coord->node)); + if (!result) + result = overwrite_tail(coord, f); + break; + + default: + impossible("vs-1031", "does this ever happen?"); + result = RETERR(-EIO); + break; + + } + + if (result) { + if (!grabbed) + all_grabbed2free(); + unset_hint(hint); + longterm_unlock_znode(hint->ext_coord.lh); + break; + } + + /* FIXME: do not rely on a coord yet */ + unset_hint(hint); + + /* throttle the writer */ + result = tail_balance_dirty_pages(inode->i_mapping, f, hint); + if (!grabbed) + all_grabbed2free(); + if (result) { + // reiser4_stat_tail_add(bdp_caused_repeats); + break; + } + } + + return result; +} + +#if REISER4_DEBUG + +static int +coord_matches_key_tail(const coord_t *coord, const reiser4_key *key) +{ + reiser4_key item_key; + + assert("vs-1356", coord_is_existing_unit(coord)); + assert("vs-1354", keylt(key, append_key_tail(coord, &item_key))); + assert("vs-1355", keyge(key, item_key_by_coord(coord, &item_key))); + return get_key_offset(key) == get_key_offset(&item_key) + coord->unit_pos; + +} + +#endif + +/* plugin->u.item.s.file.read */ +reiser4_internal int +read_tail(struct file *file UNUSED_ARG, flow_t *f, hint_t *hint) +{ + unsigned count; + int item_length; + coord_t *coord; + uf_coord_t *uf_coord; + 
+ uf_coord = &hint->ext_coord; + coord = &uf_coord->coord; + + assert("vs-571", f->user == 1); + assert("vs-571", f->data); + assert("vs-967", coord && coord->node); + assert("vs-1117", znode_is_rlocked(coord->node)); + assert("vs-1118", znode_is_loaded(coord->node)); + + assert("nikita-3037", schedulable()); + assert("vs-1357", coord_matches_key_tail(coord, &f->key)); + + /* calculate number of bytes to read off the item */ + item_length = item_length_by_coord(coord); + count = item_length_by_coord(coord) - coord->unit_pos; + if (count > f->length) + count = f->length; + + + /* FIXME: unlock long term lock ! */ + + if (__copy_to_user(f->data, ((char *) item_body_by_coord(coord) + coord->unit_pos), count)) + return RETERR(-EFAULT); + + /* probably mark_page_accessed() should only be called if + * coord->unit_pos is zero. */ + mark_page_accessed(znode_page(coord->node)); + move_flow_forward(f, count); + + coord->unit_pos += count; + if (item_length == coord->unit_pos) { + coord->unit_pos --; + coord->between = AFTER_UNIT; + } + + return 0; +} + +/* + plugin->u.item.s.file.append_key + key of the first byte following the last byte addressed by this item +*/ +reiser4_internal reiser4_key * +append_key_tail(const coord_t *coord, reiser4_key *key) +{ + item_key_by_coord(coord, key); + set_key_offset(key, get_key_offset(key) + item_length_by_coord(coord)); + return key; +} + +/* plugin->u.item.s.file.init_coord_extension */ +reiser4_internal void +init_coord_extension_tail(uf_coord_t *uf_coord, loff_t lookuped) +{ + uf_coord->valid = 1; +} + +/* + plugin->u.item.s.file.get_block +*/ +reiser4_internal int +get_block_address_tail(const coord_t *coord, sector_t block, struct buffer_head *bh) +{ + assert("nikita-3252", + znode_get_level(coord->node) == LEAF_LEVEL); + + bh->b_blocknr = *znode_get_block(coord->node); + return 0; +} + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 
1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/item/tail.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/item/tail.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,54 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#if !defined( __REISER4_TAIL_H__ ) +#define __REISER4_TAIL_H__ + +typedef struct { + int not_used; +} tail_coord_extension_t; + +struct cut_list; + + +/* plugin->u.item.b.* */ +reiser4_key *max_key_inside_tail(const coord_t *, reiser4_key *); +int can_contain_key_tail(const coord_t * coord, const reiser4_key * key, const reiser4_item_data *); +int mergeable_tail(const coord_t * p1, const coord_t * p2); +pos_in_node_t nr_units_tail(const coord_t *); +lookup_result lookup_tail(const reiser4_key *, lookup_bias, coord_t *); +int paste_tail(coord_t *, reiser4_item_data *, carry_plugin_info *); +int can_shift_tail(unsigned free_space, coord_t * source, + znode * target, shift_direction, unsigned *size, unsigned want); +void copy_units_tail(coord_t * target, coord_t * source, + unsigned from, unsigned count, shift_direction, unsigned free_space); +int kill_hook_tail(const coord_t *, pos_in_node_t from, pos_in_node_t count, struct carry_kill_data *); +int cut_units_tail(coord_t *, pos_in_node_t from, pos_in_node_t to, + struct carry_cut_data *, reiser4_key *smallest_removed, reiser4_key *new_first); +int kill_units_tail(coord_t *, pos_in_node_t from, pos_in_node_t to, + struct carry_kill_data *, reiser4_key *smallest_removed, reiser4_key *new_first); +reiser4_key *unit_key_tail(const coord_t *, reiser4_key *); + +/* plugin->u.item.s.* */ +int write_tail(struct inode *, flow_t *, hint_t *, int grabbed, write_mode_t); +int read_tail(struct file *, flow_t *, hint_t *); +int readpage_tail(void *vp, struct page *page); +reiser4_key *append_key_tail(const coord_t *, reiser4_key *); +void init_coord_extension_tail(uf_coord_t *, loff_t offset); +int get_block_address_tail(const coord_t *coord, + sector_t block, 
struct buffer_head *bh); +int item_balance_dirty_pages(struct address_space *mapping, const flow_t *f, + hint_t *hint, int back_to_dirty, int set_hint); + +/* __REISER4_TAIL_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/node/node40.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/node/node40.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,2783 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "../../debug.h" +#include "../../key.h" +#include "../../coord.h" +#include "../plugin_header.h" +#include "../item/item.h" +#include "node.h" +#include "node40.h" +#include "../plugin.h" +#include "../../jnode.h" +#include "../../znode.h" +#include "../../pool.h" +#include "../../carry.h" +#include "../../tap.h" +#include "../../tree.h" +#include "../../super.h" +#include "../../reiser4.h" + +#include +#include +#include + +/* leaf 40 format: + + [node header | item 0, item 1, .., item N-1 | free space | item_head N-1, .. item_head 1, item head 0 ] + plugin_id (16) key + free_space (16) pluginid (16) + free_space_start (16) offset (16) + level (8) + num_items (16) + magic (32) + flush_time (32) +*/ +/* NIKITA-FIXME-HANS: I told you guys not less than 10 times to not call it r4fs. Change to "ReIs". 
*/ +/* magic number that is stored in ->magic field of node header */ +static const __u32 REISER4_NODE_MAGIC = 0x52344653; /* (*(__u32 *)"R4FS"); */ + +static int prepare_for_update(znode * left, znode * right, carry_plugin_info * info); + +/* header of node of reiser40 format is at the beginning of node */ +static inline node40_header * +node40_node_header(const znode * node /* node to + * query */ ) +{ + assert("nikita-567", node != NULL); + assert("nikita-568", znode_page(node) != NULL); + assert("nikita-569", zdata(node) != NULL); + return (node40_header *) zdata(node); +} + +/* functions to get/set fields of node40_header */ + +static __u32 +nh40_get_magic(node40_header * nh) +{ + return d32tocpu(&nh->magic); +} + +static void +nh40_set_magic(node40_header * nh, __u32 magic) +{ + cputod32(magic, &nh->magic); +} + +static void +nh40_set_free_space(node40_header * nh, unsigned value) +{ + cputod16(value, &nh->free_space); + /*node->free_space = value; */ +} + +static inline unsigned +nh40_get_free_space(node40_header * nh) +{ + return d16tocpu(&nh->free_space); +} + +static void +nh40_set_free_space_start(node40_header * nh, unsigned value) +{ + cputod16(value, &nh->free_space_start); +} + +static inline unsigned +nh40_get_free_space_start(node40_header * nh) +{ + return d16tocpu(&nh->free_space_start); +} + +static inline void +nh40_set_level(node40_header * nh, unsigned value) +{ + cputod8(value, &nh->level); +} + +static unsigned +nh40_get_level(node40_header * nh) +{ + return d8tocpu(&nh->level); +} + +static void +nh40_set_num_items(node40_header * nh, unsigned value) +{ + cputod16(value, &nh->nr_items); +} + +static inline unsigned +nh40_get_num_items(node40_header * nh) +{ + return d16tocpu(&nh->nr_items); +} + +static void +nh40_set_mkfs_id(node40_header * nh, __u32 id) +{ + cputod32(id, &nh->mkfs_id); +} + +static inline __u64 +nh40_get_flush_id(node40_header * nh) +{ + return d64tocpu(&nh->flush_id); +} + +/* plugin field of node header should be 
read/set by + plugin_by_disk_id/save_disk_plugin */ + +/* array of item headers is at the end of node */ +static inline item_header40 * +node40_ih_at(const znode * node, unsigned pos) +{ + return (item_header40 *) (zdata(node) + znode_size(node)) - pos - 1; +} + +/* ( page_address( node -> pg ) + PAGE_CACHE_SIZE ) - pos - 1 + */ +static inline item_header40 * +node40_ih_at_coord(const coord_t * coord) +{ + return (item_header40 *) (zdata(coord->node) + znode_size(coord->node)) - (coord->item_pos) - 1; +} + +/* functions to get/set fields of item_header40 */ +static void +ih40_set_offset(item_header40 * ih, unsigned offset) +{ + cputod16(offset, &ih->offset); +} + +static inline unsigned +ih40_get_offset(item_header40 * ih) +{ + return d16tocpu(&ih->offset); +} + +/* plugin field of item header should be read/set by + plugin_by_disk_id/save_disk_plugin */ + +/* plugin methods */ + +/* plugin->u.node.item_overhead + look for description of this method in plugin/node/node.h */ +reiser4_internal size_t +item_overhead_node40(const znode * node UNUSED_ARG, flow_t * f UNUSED_ARG) +{ + return sizeof (item_header40); +} + +/* plugin->u.node.free_space + look for description of this method in plugin/node/node.h */ +reiser4_internal size_t free_space_node40(znode * node) +{ + assert("nikita-577", node != NULL); + assert("nikita-578", znode_is_loaded(node)); + assert("nikita-579", zdata(node) != NULL); + + return nh40_get_free_space(node40_node_header(node)); +} + +/* private inline version of node40_num_of_items() for use in this file. This + is necessary, because address of node40_num_of_items() is taken and it is + never inlined as a result. 
*/ +static inline short +node40_num_of_items_internal(const znode * node) +{ + return nh40_get_num_items(node40_node_header(node)); +} + +#if REISER4_DEBUG +static inline void check_num_items(const znode *node) +{ + assert("nikita-2749", + node40_num_of_items_internal(node) == node->nr_items); + assert("nikita-2746", znode_is_write_locked(node)); +} +#else +#define check_num_items(node) noop +#endif + +/* plugin->u.node.num_of_items + look for description of this method in plugin/node/node.h */ +reiser4_internal int +num_of_items_node40(const znode * node) +{ + return node40_num_of_items_internal(node); +} + +static void +node40_set_num_items(znode * node, node40_header * nh, unsigned value) +{ + assert("nikita-2751", node != NULL); + assert("nikita-2750", nh == node40_node_header(node)); + + check_num_items(node); + nh40_set_num_items(nh, value); + node->nr_items = value; + check_num_items(node); +} + +/* plugin->u.node.item_by_coord + look for description of this method in plugin/node/node.h */ +reiser4_internal char * +item_by_coord_node40(const coord_t * coord) +{ + item_header40 *ih; + char *p; + + /* @coord is set to existing item */ + assert("nikita-596", coord != NULL); + assert("vs-255", coord_is_existing_item(coord)); + + ih = node40_ih_at_coord(coord); + p = zdata(coord->node) + ih40_get_offset(ih); + return p; +} + +/* plugin->u.node.length_by_coord + look for description of this method in plugin/node/node.h */ +reiser4_internal int +length_by_coord_node40(const coord_t * coord) +{ + item_header40 *ih; + int result; + + /* @coord is set to existing item */ + assert("vs-256", coord != NULL); + assert("vs-257", coord_is_existing_item(coord)); + + ih = node40_ih_at_coord(coord); + if ((int) coord->item_pos == node40_num_of_items_internal(coord->node) - 1) + result = nh40_get_free_space_start(node40_node_header(coord->node)) - ih40_get_offset(ih); + else + result = ih40_get_offset(ih - 1) - ih40_get_offset(ih); + + return result; +} + +static pos_in_node_t 
+node40_item_length(const znode *node, pos_in_node_t item_pos) +{ + item_header40 *ih; + pos_in_node_t result; + + /* @item_pos refers to an existing item of @node */ + assert("vs-256", node != NULL); + assert("vs-257", node40_num_of_items_internal(node) > item_pos); + + ih = node40_ih_at(node, item_pos); + if (item_pos == node40_num_of_items_internal(node) - 1) + result = nh40_get_free_space_start(node40_node_header(node)) - ih40_get_offset(ih); + else + result = ih40_get_offset(ih - 1) - ih40_get_offset(ih); + + return result; +} + +/* plugin->u.node.plugin_by_coord + look for description of this method in plugin/node/node.h */ +reiser4_internal item_plugin * +plugin_by_coord_node40(const coord_t * coord) +{ + item_header40 *ih; + item_plugin *result; + + /* @coord is set to existing item */ + assert("vs-258", coord != NULL); + assert("vs-259", coord_is_existing_item(coord)); + + ih = node40_ih_at_coord(coord); + /* pass NULL instead of current tree. This is a time critical call. */ + result = item_plugin_by_disk_id(NULL, &ih->plugin_id); + return result; +} + +/* plugin->u.node.key_at + look for description of this method in plugin/node/node.h */ +reiser4_internal reiser4_key * +key_at_node40(const coord_t * coord, reiser4_key * key) +{ + item_header40 *ih; + + assert("nikita-1765", coord_is_existing_item(coord)); + + /* @coord is set to existing item */ + ih = node40_ih_at_coord(coord); + memcpy(key, &ih->key, sizeof (reiser4_key)); + return key; +} + +/* VS-FIXME-HANS: please review whether the below are properly disabled when debugging is disabled */ + +#define NODE_INCSTAT(n, counter) \ + reiser4_stat_inc_at_level(znode_get_level(n), node.lookup.counter) + +#define NODE_ADDSTAT(n, counter, val) \ + reiser4_stat_add_at_level(znode_get_level(n), node.lookup.counter, val) + +/* plugin->u.node.lookup + look for description of this method in plugin/node/node.h */ +reiser4_internal node_search_result +lookup_node40(znode * node /* node to query */ , + const reiser4_key * key /* 
key to look for */ , + lookup_bias bias /* search bias */ , + coord_t * coord /* resulting coord */ ) +{ + int left; + int right; + int found; + int items; + + item_header40 *lefth; + item_header40 *righth; + + item_plugin *iplug; + item_header40 *bstop; + item_header40 *ih; + cmp_t order; + + assert("nikita-583", node != NULL); + assert("nikita-584", key != NULL); + assert("nikita-585", coord != NULL); + assert("nikita-2693", znode_is_any_locked(node)); + cassert(REISER4_SEQ_SEARCH_BREAK > 2); + + items = node_num_items(node); + + if (unlikely(items == 0)) { + coord_init_first_unit(coord, node); + return NS_NOT_FOUND; + } + + /* binary search for item that can contain given key */ + left = 0; + right = items - 1; + coord->node = node; + coord_clear_iplug(coord); + found = 0; + + lefth = node40_ih_at(node, left); + righth = node40_ih_at(node, right); + + /* It is known that for small arrays sequential search is on average + more efficient than binary. This is because sequential search is + coded as tight loop that can be better optimized by compilers and + for small array size gain from this optimization makes sequential + search the winner. Another, maybe more important, reason for this, + is that sequential array is more CPU cache friendly, whereas binary + search effectively destroys CPU caching. + + Critical here is the notion of "smallness". Reasonable value of + REISER4_SEQ_SEARCH_BREAK can be found by playing with code in + fs/reiser4/ulevel/ulevel.c:test_search(). + + Don't try to further optimize sequential search by scanning from + right to left in attempt to use more efficient loop termination + condition (comparison with 0). This doesn't work. 
+ + */ + + while (right - left >= REISER4_SEQ_SEARCH_BREAK) { + int median; + item_header40 *medianh; + + median = (left + right) / 2; + medianh = node40_ih_at(node, median); + + assert("nikita-1084", median >= 0); + assert("nikita-1085", median < items); + switch (keycmp(key, &medianh->key)) { + case LESS_THAN: + right = median; + righth = medianh; + break; + default: + wrong_return_value("nikita-586", "keycmp"); + case GREATER_THAN: + left = median; + lefth = medianh; + break; + case EQUAL_TO: + do { + -- median; + /* headers are ordered from right to left */ + ++ medianh; + } while (median >= 0 && keyeq(key, &medianh->key)); + right = left = median + 1; + ih = lefth = righth = medianh - 1; + found = 1; + break; + } + } + /* sequential scan. Item headers, and, therefore, keys are stored at + the rightmost part of a node from right to left. We are trying to + access memory from left to right, and hence, scan in _descending_ + order of item numbers. + */ + if (!found) { + for (left = right, ih = righth; left >= 0; ++ ih, -- left) { + cmp_t comparison; + + prefetchkey(&(ih + 1)->key); + comparison = keycmp(&ih->key, key); + if (comparison == GREATER_THAN) + continue; + if (comparison == EQUAL_TO) { + found = 1; + do { + -- left; + ++ ih; + } while (left >= 0 && keyeq(&ih->key, key)); + ++ left; + -- ih; + } else { + assert("nikita-1256", comparison == LESS_THAN); + } + break; + } + if (unlikely(left < 0)) + left = 0; + } + + assert("nikita-3212", right >= left); + assert("nikita-3214", + equi(found, keyeq(&node40_ih_at(node, left)->key, key))); + +#if REISER4_STATS + NODE_ADDSTAT(node, found, !!found); + NODE_ADDSTAT(node, pos, left); + if (items > 1) + NODE_ADDSTAT(node, posrelative, (left << 10) / (items - 1)); + else + NODE_ADDSTAT(node, posrelative, 1 << 10); + if (left == node->last_lookup_pos) + NODE_INCSTAT(node, samepos); + if (left == node->last_lookup_pos + 1) + NODE_INCSTAT(node, nextpos); + node->last_lookup_pos = left; +#endif + + 
coord_set_item_pos(coord, left); + coord->unit_pos = 0; + coord->between = AT_UNIT; + + /* either key < leftmost key in the node, or the node is corrupted and keys + are not sorted */ + bstop = node40_ih_at(node, (unsigned) left); + order = keycmp(&bstop->key, key); + if (unlikely(order == GREATER_THAN)) { + if (unlikely(left != 0)) { + /* screw up */ + warning("nikita-587", "Key less than %i key in a node", left); + print_key("key", key); + print_key("min", &bstop->key); + print_znode("node", node); + print_coord_content("coord", coord); + return RETERR(-EIO); + } else { + coord->between = BEFORE_UNIT; + return NS_NOT_FOUND; + } + } + /* left <= key, ok */ + iplug = item_plugin_by_disk_id(znode_get_tree(node), &bstop->plugin_id); + + if (unlikely(iplug == NULL)) { + warning("nikita-588", "Unknown plugin %i", d16tocpu(&bstop->plugin_id)); + print_key("key", key); + print_znode("node", node); + print_coord_content("coord", coord); + return RETERR(-EIO); + } + + coord_set_iplug(coord, iplug); + + /* if exact key from item header was found by binary search, no + further checks are necessary. */ + if (found) { + assert("nikita-1259", order == EQUAL_TO); + return NS_FOUND; + } + if (iplug->b.max_key_inside != NULL) { + reiser4_key max_item_key; + + /* key > max_item_key --- outside of an item */ + if (keygt(key, iplug->b.max_key_inside(coord, &max_item_key))) { + coord->unit_pos = 0; + coord->between = AFTER_ITEM; + /* FIXME-VS: key we are looking for does not fit into + found item. Return NS_NOT_FOUND then. Without that + the following case does not work: there is extent of + file 10000, 10001. File 10000, 10002 has been just + created. When writing to position 0 in that file - + traverse_tree will stop here on twig level. 
When we + want it to go down to leaf level + */ + return NS_NOT_FOUND; + } + } + + if (iplug->b.lookup != NULL) { + return iplug->b.lookup(key, bias, coord); + } else { + assert("nikita-1260", order == LESS_THAN); + coord->between = AFTER_UNIT; + return (bias == FIND_EXACT) ? NS_NOT_FOUND : NS_FOUND; + } +} + +#undef NODE_ADDSTAT +#undef NODE_INCSTAT + +/* plugin->u.node.estimate + look for description of this method in plugin/node/node.h */ +reiser4_internal size_t estimate_node40(znode * node) +{ + size_t result; + + assert("nikita-597", node != NULL); + + result = free_space_node40(node); + + /* size_t is unsigned: compare before subtracting so the result cannot wrap around */ + return (result > sizeof(item_header40)) ? result - sizeof(item_header40) : 0; +} + +/* plugin->u.node.check + look for description of this method in plugin/node/node.h */ +reiser4_internal int +check_node40(const znode * node /* node to check */ , + __u32 flags /* check flags */ , + const char **error /* where to store error message */ ) +{ + int nr_items; + int i; + reiser4_key prev; + unsigned old_offset; + tree_level level; + coord_t coord; + + assert("nikita-580", node != NULL); + assert("nikita-581", error != NULL); + assert("nikita-2948", znode_is_loaded(node)); + + if (ZF_ISSET(node, JNODE_HEARD_BANSHEE)) + return 0; + + assert("nikita-582", zdata(node) != NULL); + + nr_items = node40_num_of_items_internal(node); + if (nr_items < 0) { + *error = "Negative number of items"; + return -1; + } + + if (flags & REISER4_NODE_DKEYS) + prev = *znode_get_ld_key((znode *)node); + else + prev = *min_key(); + + old_offset = 0; + coord_init_zero(&coord); + coord.node = (znode *) node; + coord.unit_pos = 0; + coord.between = AT_UNIT; + level = znode_get_level(node); + for (i = 0; i < nr_items; i++) { + item_header40 *ih; + reiser4_key unit_key; + unsigned j; + + ih = node40_ih_at(node, (unsigned) i); + coord_set_item_pos(&coord, i); + if ((ih40_get_offset(ih) >= + znode_size(node) - nr_items * sizeof (item_header40)) || + (ih40_get_offset(ih) < sizeof (node40_header))) { + *error = "Offset is out of 
bounds"; + return -1; + } + if (ih40_get_offset(ih) <= old_offset) { + *error = "Offsets are in wrong order"; + return -1; + } + if ((i == 0) && (ih40_get_offset(ih) != sizeof(node40_header))) { + *error = "Wrong offset of first item"; + return -1; + } + old_offset = ih40_get_offset(ih); + + if (keygt(&prev, &ih->key)) { + *error = "Keys are in wrong order"; + return -1; + } + if (!keyeq(&ih->key, unit_key_by_coord(&coord, &unit_key))) { + *error = "Wrong key of first unit"; + return -1; + } + prev = ih->key; + for (j = 0; j < coord_num_units(&coord); ++j) { + coord.unit_pos = j; + unit_key_by_coord(&coord, &unit_key); + if (keygt(&prev, &unit_key)) { + *error = "Unit keys are in wrong order"; + return -1; + } + prev = unit_key; + } + coord.unit_pos = 0; + if (level != TWIG_LEVEL && + item_is_extent(&coord)) { + *error = "extent on the wrong level"; + return -1; + } + if (level == LEAF_LEVEL && + item_is_internal(&coord)) { + *error = "internal item on the wrong level"; + return -1; + } + if (level != LEAF_LEVEL && + !item_is_internal(&coord) && !item_is_extent(&coord)) { + *error = "wrong item on the internal level"; + return -1; + } + if (level > TWIG_LEVEL && + !item_is_internal(&coord)) { + *error = "non-internal item on the internal level"; + return -1; + } +#if REISER4_DEBUG + if (item_plugin_by_coord(&coord)->b.check && item_plugin_by_coord(&coord)->b.check(&coord, error)) + return -1; +#endif + if (i) { + coord_t prev_coord; + /* two neighboring items can not be mergeable */ + coord_dup(&prev_coord, &coord); + coord_prev_item(&prev_coord); + if (are_items_mergeable(&prev_coord, &coord)) { + *error = "mergeable items in one node"; + return -1; + } + + } + } + + if ((flags & REISER4_NODE_DKEYS) && !node_is_empty(node)) { + coord_t coord; + item_plugin *iplug; + + coord_init_last_unit(&coord, node); + iplug = item_plugin_by_coord(&coord); + if ((item_is_extent(&coord) || item_is_tail(&coord)) && + iplug->s.file.append_key != NULL) { + reiser4_key mkey; + + 
iplug->s.file.append_key(&coord, &mkey); + set_key_offset(&mkey, get_key_offset(&mkey) - 1); + if (UNDER_RW(dk, current_tree, read, keygt(&mkey, znode_get_rd_key((znode *) node)))) { + *error = "key of rightmost item is too large"; + return -1; + } + } + } + if (flags & REISER4_NODE_DKEYS) { + RLOCK_TREE(current_tree); + RLOCK_DK(current_tree); + + flags |= REISER4_NODE_TREE_STABLE; + + if (keygt(&prev, znode_get_rd_key((znode *)node))) { + if (flags & REISER4_NODE_TREE_STABLE) { + *error = "Last key is greater than rdkey"; + RUNLOCK_DK(current_tree); + RUNLOCK_TREE(current_tree); + return -1; + } + } + if (keygt(znode_get_ld_key((znode *)node), znode_get_rd_key((znode *)node))) { + *error = "ldkey is greater than rdkey"; + RUNLOCK_DK(current_tree); + RUNLOCK_TREE(current_tree); + return -1; + } + if (ZF_ISSET(node, JNODE_LEFT_CONNECTED) && + (node->left != NULL) && + !ZF_ISSET(node->left, JNODE_HEARD_BANSHEE) && + ergo(flags & REISER4_NODE_TREE_STABLE, + !keyeq(znode_get_rd_key(node->left), znode_get_ld_key((znode *)node))) && + ergo(!(flags & REISER4_NODE_TREE_STABLE), keygt(znode_get_rd_key(node->left), znode_get_ld_key((znode *)node)))) { + *error = "left rdkey or ldkey is wrong"; + RUNLOCK_DK(current_tree); + RUNLOCK_TREE(current_tree); + return -1; + } + if (ZF_ISSET(node, JNODE_RIGHT_CONNECTED) && + (node->right != NULL) && + !ZF_ISSET(node->right, JNODE_HEARD_BANSHEE) && + ergo(flags & REISER4_NODE_TREE_STABLE, + !keyeq(znode_get_rd_key((znode *)node), znode_get_ld_key(node->right))) && + ergo(!(flags & REISER4_NODE_TREE_STABLE), keygt(znode_get_rd_key((znode *)node), znode_get_ld_key(node->right)))) { + *error = "rdkey or right ldkey is wrong"; + RUNLOCK_DK(current_tree); + RUNLOCK_TREE(current_tree); + return -1; + } + + RUNLOCK_DK(current_tree); + RUNLOCK_TREE(current_tree); + } + + return 0; +} + +/* plugin->u.node.parse + look for description of this method in plugin/node/node.h */ +reiser4_internal int +parse_node40(znode * node /* node to parse */ ) 
+{ + node40_header *header; + int result; + + header = node40_node_header((znode *) node); + result = -EIO; + if (unlikely(((__u8) znode_get_level(node)) != nh40_get_level(header))) + warning("nikita-494", "Wrong level found in node: %i != %i", + znode_get_level(node), nh40_get_level(header)); + else if (unlikely(nh40_get_magic(header) != REISER4_NODE_MAGIC)) + warning("nikita-495", + "Wrong magic in tree node: want %x, got %x", + REISER4_NODE_MAGIC, nh40_get_magic(header)); + else { + node->nr_items = node40_num_of_items_internal(node); + result = 0; + } + if (unlikely(result != 0)) + /* print_znode("node", node)*/; + return RETERR(result); +} + +/* plugin->u.node.init + look for description of this method in plugin/node/node.h */ +reiser4_internal int +init_node40(znode * node /* node to initialise */ ) +{ + node40_header *header; + + assert("nikita-570", node != NULL); + assert("nikita-572", zdata(node) != NULL); + + header = node40_node_header(node); + memset(header, 0, sizeof (node40_header)); + nh40_set_free_space(header, znode_size(node) - sizeof (node40_header)); + nh40_set_free_space_start(header, sizeof (node40_header)); + /* sane hypothesis: 0 in CPU format is 0 in disk format */ + /* items: 0 */ + save_plugin_id(node_plugin_to_plugin(node->nplug), &header->common_header.plugin_id); + nh40_set_level(header, znode_get_level(node)); + nh40_set_magic(header, REISER4_NODE_MAGIC); + node->nr_items = 0; + nh40_set_mkfs_id(header, reiser4_mkfs_id(reiser4_get_current_sb())); + + /* flags: 0 */ + return 0; +} + +#ifdef GUESS_EXISTS +reiser4_internal int +guess_node40(const znode * node /* node to guess plugin of */ ) +{ + node40_header *nethack; + + assert("nikita-1058", node != NULL); + nethack = node40_node_header(node); + return + (nh40_get_magic(nethack) == REISER4_NODE_MAGIC) && + (plugin_by_disk_id(znode_get_tree(node), + REISER4_NODE_PLUGIN_TYPE, &nethack->common_header.plugin_id)->h.id == NODE40_ID); +} +#endif + +/* plugin->u.node.change_item_size + look
for description of this method in plugin/node/node.h */ +reiser4_internal void +change_item_size_node40(coord_t * coord, int by) +{ + node40_header *nh; + item_header40 *ih; + char *item_data; + int item_length; + unsigned i; + + /* make sure that @item is coord of existing item */ + assert("vs-210", coord_is_existing_item(coord)); + + nh = node40_node_header(coord->node); + + item_data = item_by_coord_node40(coord); + item_length = length_by_coord_node40(coord); + + /* move item bodies */ + ih = node40_ih_at_coord(coord); + memmove(item_data + item_length + by, item_data + item_length, + nh40_get_free_space_start(node40_node_header(coord->node)) - (ih40_get_offset(ih) + item_length)); + + /* update offsets of moved items */ + for (i = coord->item_pos + 1; i < nh40_get_num_items(nh); i++) { + ih = node40_ih_at(coord->node, i); + ih40_set_offset(ih, ih40_get_offset(ih) + by); + } + + /* update node header */ + nh40_set_free_space(nh, nh40_get_free_space(nh) - by); + nh40_set_free_space_start(nh, nh40_get_free_space_start(nh) + by); +} + +static int +should_notify_parent(const znode * node) +{ + /* FIXME_JMACD This looks equivalent to znode_is_root(), right? 
-josh */ + return !disk_addr_eq(znode_get_block(node), &znode_get_tree(node)->root_block); +} + +/* plugin->u.node.create_item + look for description of this method in plugin/node/node.h */ +reiser4_internal int +create_item_node40(coord_t * target, const reiser4_key * key, reiser4_item_data * data, carry_plugin_info * info) +{ + node40_header *nh; + item_header40 *ih; + unsigned offset; + unsigned i; + + nh = node40_node_header(target->node); + + assert("vs-212", coord_is_between_items(target)); + /* node must have enough free space */ + assert("vs-254", free_space_node40(target->node) >= data->length + sizeof(item_header40)); + assert("vs-1410", data->length >= 0); + + if (coord_set_to_right(target)) + /* there are not items to the right of @target, so, new item + will be inserted after last one */ + coord_set_item_pos(target, nh40_get_num_items(nh)); + + if (target->item_pos < nh40_get_num_items(nh)) { + /* there are items to be moved to prepare space for new + item */ + ih = node40_ih_at_coord(target); + /* new item will start at this offset */ + offset = ih40_get_offset(ih); + + memmove(zdata(target->node) + offset + data->length, + zdata(target->node) + offset, nh40_get_free_space_start(nh) - offset); + /* update headers of moved items */ + for (i = target->item_pos; i < nh40_get_num_items(nh); i++) { + ih = node40_ih_at(target->node, i); + ih40_set_offset(ih, ih40_get_offset(ih) + data->length); + } + + /* @ih is set to item header of the last item, move item headers */ + memmove(ih - 1, ih, sizeof (item_header40) * (nh40_get_num_items(nh) - target->item_pos)); + } else { + /* new item will start at this offset */ + offset = nh40_get_free_space_start(nh); + } + + /* make item header for the new item */ + ih = node40_ih_at_coord(target); + memcpy(&ih->key, key, sizeof (reiser4_key)); + ih40_set_offset(ih, offset); + save_plugin_id(item_plugin_to_plugin(data->iplug), &ih->plugin_id); + + /* update node header */ + nh40_set_free_space(nh, 
nh40_get_free_space(nh) - data->length - sizeof (item_header40)); + nh40_set_free_space_start(nh, nh40_get_free_space_start(nh) + data->length); + node40_set_num_items(target->node, nh, nh40_get_num_items(nh) + 1); + + /* FIXME: check how does create_item work when between is set to BEFORE_UNIT */ + target->unit_pos = 0; + target->between = AT_UNIT; + coord_clear_iplug(target); + + /* initialise item */ + if (data->iplug->b.init != NULL) { + data->iplug->b.init(target, NULL, data); + } + /* copy item body */ + if (data->iplug->b.paste != NULL) { + data->iplug->b.paste(target, data, info); + } else if (data->data != NULL) { + if (data->user) { + /* AUDIT: Are we really should not check that pointer + from userspace was valid and data bytes were + available? How will we return -EFAULT of some kind + without this check? */ + assert("nikita-3038", schedulable()); + /* copy data from user space */ + __copy_from_user(zdata(target->node) + offset, data->data, (unsigned) data->length); + } else + /* copy from kernel space */ + memcpy(zdata(target->node) + offset, data->data, (unsigned) data->length); + } + + if (target->item_pos == 0) { + /* left delimiting key has to be updated */ + prepare_for_update(NULL, target->node, info); + } + + if (item_plugin_by_coord(target)->b.create_hook != NULL) { + item_plugin_by_coord(target)->b.create_hook(target, data->arg); + } + + return 0; +} + +/* plugin->u.node.update_item_key + look for description of this method in plugin/node/node.h */ +reiser4_internal void +update_item_key_node40(coord_t * target, const reiser4_key * key, carry_plugin_info * info) +{ + item_header40 *ih; + + ih = node40_ih_at_coord(target); + memcpy(&ih->key, key, sizeof (reiser4_key)); + + if (target->item_pos == 0) { + prepare_for_update(NULL, target->node, info); + } +} + +/* this bits encode cut mode */ +#define CMODE_TAIL 1 +#define CMODE_WHOLE 2 +#define CMODE_HEAD 4 + +struct cut40_info { + int mode; + pos_in_node_t tail_removed; /* position of item which 
gets tail removed */ + pos_in_node_t first_removed; /* position of the leftmost item among items removed completely */ + pos_in_node_t removed_count; /* number of items removed completely */ + pos_in_node_t head_removed; /* position of item which gets head removed */ + + pos_in_node_t freed_space_start; + pos_in_node_t freed_space_end; + pos_in_node_t first_moved; + pos_in_node_t head_removed_location; +}; + +static void +init_cinfo(struct cut40_info *cinfo) +{ + cinfo->mode = 0; + cinfo->tail_removed = MAX_POS_IN_NODE; + cinfo->first_removed = MAX_POS_IN_NODE; + cinfo->removed_count = MAX_POS_IN_NODE; + cinfo->head_removed = MAX_POS_IN_NODE; + cinfo->freed_space_start = MAX_POS_IN_NODE; + cinfo->freed_space_end = MAX_POS_IN_NODE; + cinfo->first_moved = MAX_POS_IN_NODE; + cinfo->head_removed_location = MAX_POS_IN_NODE; +} + +/* complete cut_node40/kill_node40 by removing the gap created by the removal */ +static void +compact(znode *node, struct cut40_info *cinfo) +{ + node40_header *nh; + item_header40 *ih; + pos_in_node_t freed; + pos_in_node_t pos, nr_items; + + assert("vs-1526", (cinfo->freed_space_start != MAX_POS_IN_NODE && + cinfo->freed_space_end != MAX_POS_IN_NODE && + cinfo->first_moved != MAX_POS_IN_NODE)); + assert("vs-1523", cinfo->freed_space_end >= cinfo->freed_space_start); + + nh = node40_node_header(node); + nr_items = nh40_get_num_items(nh); + + /* remove gap made up by removal */ + memmove(zdata(node) + cinfo->freed_space_start, zdata(node) + cinfo->freed_space_end, + nh40_get_free_space_start(nh) - cinfo->freed_space_end); + + /* update item headers of moved items - change their locations */ + pos = cinfo->first_moved; + ih = node40_ih_at(node, pos); + if (cinfo->head_removed_location != MAX_POS_IN_NODE) { + assert("vs-1580", pos == cinfo->head_removed); + ih40_set_offset(ih, cinfo->head_removed_location); + pos ++; + ih --; + } + + freed = cinfo->freed_space_end - cinfo->freed_space_start; + for (; pos < nr_items; pos ++, ih --) { +
assert("vs-1581", ih == node40_ih_at(node, pos)); + ih40_set_offset(ih, ih40_get_offset(ih) - freed); + } + + /* free space start moved to left */ + nh40_set_free_space_start(nh, nh40_get_free_space_start(nh) - freed); + + if (cinfo->removed_count != MAX_POS_IN_NODE) { + /* number of items changed. Remove item headers of those items */ + ih = node40_ih_at(node, nr_items - 1); + memmove(ih + cinfo->removed_count, ih, + sizeof (item_header40) * (nr_items - cinfo->removed_count - cinfo->first_removed)); + freed += sizeof (item_header40) * cinfo->removed_count; + node40_set_num_items(node, nh, nr_items - cinfo->removed_count); + } + + /* total amount of free space increased */ + nh40_set_free_space(nh, nh40_get_free_space(nh) + freed); +} + +reiser4_internal int +shrink_item_node40(coord_t *coord, int delta) +{ + node40_header *nh; + item_header40 *ih; + pos_in_node_t pos; + pos_in_node_t nr_items; + char *end; + znode *node; + + assert("nikita-3487", coord != NULL); + assert("nikita-3488", delta >= 0); + + node = coord->node; + nh = node40_node_header(node); + nr_items = nh40_get_num_items(nh); + + ih = node40_ih_at_coord(coord); + assert("nikita-3489", delta <= length_by_coord_node40(coord)); + end = zdata(node) + ih40_get_offset(ih) + length_by_coord_node40(coord); + + /* remove gap made up by removal */ + memmove(end - delta, end, nh40_get_free_space_start(nh) - delta); + + /* update item headers of moved items - change their locations */ + pos = coord->item_pos + 1; + ih = node40_ih_at(node, pos); + for (; pos < nr_items; pos ++, ih --) { + assert("nikita-3490", ih == node40_ih_at(node, pos)); + ih40_set_offset(ih, ih40_get_offset(ih) - delta); + } + + /* free space start moved to left */ + nh40_set_free_space_start(nh, nh40_get_free_space_start(nh) - delta); + /* total amount of free space increased */ + nh40_set_free_space(nh, nh40_get_free_space(nh) + delta); + /* + * This method does _not_ change the number of items. Hence, it cannot + * make the node empty.
Also it doesn't remove items at all, which means + * that no keys have to be updated either. + */ + return 0; +} + + +/* this is used by cut_node40 and kill_node40. It analyses input parameters and calculates cut mode. There are 2 types + of cut. First is when a unit is removed from the middle of an item. In this case this function returns 1. All the + rest fits into the second case: 0 or 1 item getting its tail cut, 0 or more items removed completely and 0 or 1 item + getting its head cut. Function returns 0 in this case */ +static int +parse_cut(struct cut40_info *cinfo, const struct cut_kill_params *params) +{ + reiser4_key left_key, right_key; + reiser4_key min_from_key, max_to_key; + const reiser4_key *from_key, *to_key; + + init_cinfo(cinfo); + + /* calculate minimal key stored in first item of items to be cut (params->from) */ + item_key_by_coord(params->from, &min_from_key); + /* and max key stored in last item of items to be cut (params->to) */ + max_item_key_by_coord(params->to, &max_to_key); + + /* if cut key range is not defined in input parameters - define it using cut coord range */ + if (params->from_key == NULL) { + assert("vs-1513", params->to_key == NULL); + unit_key_by_coord(params->from, &left_key); + from_key = &left_key; + max_unit_key_by_coord(params->to, &right_key); + to_key = &right_key; + } else { + from_key = params->from_key; + to_key = params->to_key; + } + + if (params->from->item_pos == params->to->item_pos) { + if (keylt(&min_from_key, from_key) && keylt(to_key, &max_to_key)) + return 1; + + if (keygt(from_key, &min_from_key)) { + /* tail of item is to be cut */ + cinfo->tail_removed = params->from->item_pos; + cinfo->mode |= CMODE_TAIL; + } else if (keylt(to_key, &max_to_key)) { + /* head of item is to be cut */ + cinfo->head_removed = params->from->item_pos; + cinfo->mode |= CMODE_HEAD; + } else { + /* item is removed completely */ + cinfo->first_removed = params->from->item_pos; + cinfo->removed_count = 1; + cinfo->mode |=
CMODE_WHOLE; + } + } else { + cinfo->first_removed = params->from->item_pos + 1; + cinfo->removed_count = params->to->item_pos - params->from->item_pos - 1; + + if (keygt(from_key, &min_from_key)) { + /* first item is not cut completely */ + cinfo->tail_removed = params->from->item_pos; + cinfo->mode |= CMODE_TAIL; + } else { + cinfo->first_removed --; + cinfo->removed_count ++; + } + if (keylt(to_key, &max_to_key)) { + /* last item is not cut completely */ + cinfo->head_removed = params->to->item_pos; + cinfo->mode |= CMODE_HEAD; + } else { + cinfo->removed_count ++; + } + if (cinfo->removed_count) + cinfo->mode |= CMODE_WHOLE; + } + + return 0; +} + +static void +call_kill_hooks(znode *node, pos_in_node_t from, pos_in_node_t count, carry_kill_data *kdata) +{ + coord_t coord; + item_plugin *iplug; + pos_in_node_t pos; + + coord.node = node; + coord.unit_pos = 0; + coord.between = AT_UNIT; + for (pos = 0; pos < count; pos ++) { + coord_set_item_pos(&coord, from + pos); + coord.unit_pos = 0; + coord.between = AT_UNIT; + iplug = item_plugin_by_coord(&coord); + if (iplug->b.kill_hook) { + iplug->b.kill_hook(&coord, 0, coord_num_units(&coord), kdata); + } + } +} + +/* this is used to kill item partially */ +static pos_in_node_t +kill_units(coord_t *coord, pos_in_node_t from, pos_in_node_t to, void *data, reiser4_key *smallest_removed, + reiser4_key *new_first_key) +{ + struct carry_kill_data *kdata; + item_plugin *iplug; + + kdata = data; + iplug = item_plugin_by_coord(coord); + + assert("vs-1524", iplug->b.kill_units); + return iplug->b.kill_units(coord, from, to, kdata, smallest_removed, new_first_key); +} + +/* call item plugin to cut tail of file */ +static pos_in_node_t +kill_tail(coord_t *coord, void *data, reiser4_key *smallest_removed) +{ + struct carry_kill_data *kdata; + pos_in_node_t to; + + kdata = data; + to = coord_last_unit_pos(coord); + return kill_units(coord, coord->unit_pos, to, kdata, smallest_removed, 0); +} + +/* call item plugin to cut head of 
item */ +static pos_in_node_t +kill_head(coord_t *coord, void *data, reiser4_key *smallest_removed, reiser4_key *new_first_key) +{ + return kill_units(coord, 0, coord->unit_pos, data, smallest_removed, new_first_key); +} + +/* this is used to cut item partially */ +static pos_in_node_t +cut_units(coord_t *coord, pos_in_node_t from, pos_in_node_t to, void *data, + reiser4_key *smallest_removed, reiser4_key *new_first_key) +{ + carry_cut_data *cdata; + item_plugin *iplug; + + cdata = data; + iplug = item_plugin_by_coord(coord); + assert("vs-302", iplug->b.cut_units); + return iplug->b.cut_units(coord, from, to, cdata, smallest_removed, new_first_key); +} + +/* call item plugin to cut tail of item */ +static pos_in_node_t +cut_tail(coord_t *coord, void *data, reiser4_key *smallest_removed) +{ + carry_cut_data *cdata; + pos_in_node_t to; + + cdata = data; + to = coord_last_unit_pos(cdata->params.from); + return cut_units(coord, coord->unit_pos, to, data, smallest_removed, 0); +} + +/* call item plugin to cut head of item */ +static pos_in_node_t +cut_head(coord_t *coord, void *data, reiser4_key *smallest_removed, reiser4_key *new_first_key) +{ + return cut_units(coord, 0, coord->unit_pos, data, smallest_removed, new_first_key); +} + +/* this returns 1 if the key of the first item changed, 0 if it did not */ +static int +prepare_for_compact(struct cut40_info *cinfo, const struct cut_kill_params *params, int is_cut, + void *data, carry_plugin_info *info) +{ + znode *node; + item_header40 *ih; + pos_in_node_t freed; + pos_in_node_t item_pos; + coord_t coord; + reiser4_key new_first_key; + pos_in_node_t (*kill_units_f)(coord_t *, pos_in_node_t, pos_in_node_t, void *, reiser4_key *, reiser4_key *); + pos_in_node_t (*kill_tail_f)(coord_t *, void *, reiser4_key *); + pos_in_node_t (*kill_head_f)(coord_t *, void *, reiser4_key *, reiser4_key *); + int retval; + + retval = 0; + + node = params->from->node; + + assert("vs-184", node == params->to->node); + assert("vs-312",
!node_is_empty(node)); + assert("vs-297", coord_compare(params->from, params->to) != COORD_CMP_ON_RIGHT); + + if (is_cut) { + kill_units_f = cut_units; + kill_tail_f = cut_tail; + kill_head_f = cut_head; + } else { + kill_units_f = kill_units; + kill_tail_f = kill_tail; + kill_head_f = kill_head; + } + + if (parse_cut(cinfo, params) == 1) { + /* cut from the middle of item */ + freed = kill_units_f(params->from, params->from->unit_pos, params->to->unit_pos, data, params->smallest_removed, NULL); + + item_pos = params->from->item_pos; + ih = node40_ih_at(node, item_pos); + cinfo->freed_space_start = ih40_get_offset(ih) + node40_item_length(node, item_pos) - freed; + cinfo->freed_space_end = cinfo->freed_space_start + freed; + cinfo->first_moved = item_pos + 1; + } else { + assert("vs-1521", (cinfo->tail_removed != MAX_POS_IN_NODE || + cinfo->first_removed != MAX_POS_IN_NODE || + cinfo->head_removed != MAX_POS_IN_NODE)); + + switch (cinfo->mode) { + case CMODE_TAIL: + /* one item gets cut partially from its end */ + assert("vs-1562", cinfo->tail_removed == params->from->item_pos); + + freed = kill_tail_f(params->from, data, params->smallest_removed); + + item_pos = cinfo->tail_removed; + ih = node40_ih_at(node, item_pos); + cinfo->freed_space_start = ih40_get_offset(ih) + node40_item_length(node, item_pos) - freed; + cinfo->freed_space_end = cinfo->freed_space_start + freed; + cinfo->first_moved = cinfo->tail_removed + 1; + break; + + case CMODE_WHOLE: + /* one or more items get removed completely */ + assert("vs-1563", cinfo->first_removed == params->from->item_pos); + assert("vs-1564", cinfo->removed_count > 0 && cinfo->removed_count != MAX_POS_IN_NODE); + + /* call kill hook for all items removed completely */ + if (is_cut == 0) + call_kill_hooks(node, cinfo->first_removed, cinfo->removed_count, data); + + item_pos = cinfo->first_removed; + ih = node40_ih_at(node, item_pos); + + if (params->smallest_removed) + memcpy(params->smallest_removed, &ih->key, sizeof 
(reiser4_key)); + + cinfo->freed_space_start = ih40_get_offset(ih); + + item_pos += (cinfo->removed_count - 1); + ih -= (cinfo->removed_count - 1); + cinfo->freed_space_end = ih40_get_offset(ih) + node40_item_length(node, item_pos); + cinfo->first_moved = item_pos + 1; + if (cinfo->first_removed == 0) + /* key of first item of the node changes */ + retval = 1; + break; + + case CMODE_HEAD: + /* one item gets cut partially from its head */ + assert("vs-1565", cinfo->head_removed == params->from->item_pos); + + freed = kill_head_f(params->to, data, params->smallest_removed, &new_first_key); + + item_pos = cinfo->head_removed; + ih = node40_ih_at(node, item_pos); + cinfo->freed_space_start = ih40_get_offset(ih); + cinfo->freed_space_end = ih40_get_offset(ih) + freed; + cinfo->first_moved = cinfo->head_removed + 1; + + /* item head is removed, therefore, item key changed */ + coord.node = node; + coord_set_item_pos(&coord, item_pos); + coord.unit_pos = 0; + coord.between = AT_UNIT; + update_item_key_node40(&coord, &new_first_key, 0); + if (item_pos == 0) + /* key of first item of the node changes */ + retval = 1; + break; + + case CMODE_TAIL | CMODE_WHOLE: + /* one item gets cut from its end and one or more items get removed completely */ + assert("vs-1566", cinfo->tail_removed == params->from->item_pos); + assert("vs-1567", cinfo->first_removed == cinfo->tail_removed + 1); + assert("vs-1564", cinfo->removed_count > 0 && cinfo->removed_count != MAX_POS_IN_NODE); + + freed = kill_tail_f(params->from, data, params->smallest_removed); + + item_pos = cinfo->tail_removed; + ih = node40_ih_at(node, item_pos); + cinfo->freed_space_start = ih40_get_offset(ih) + node40_item_length(node, item_pos) - freed; + + /* call kill hook for all items removed completely */ + if (is_cut == 0) + call_kill_hooks(node, cinfo->first_removed, cinfo->removed_count, data); + + item_pos += cinfo->removed_count; + ih -= cinfo->removed_count; + cinfo->freed_space_end = ih40_get_offset(ih) + 
node40_item_length(node, item_pos); + cinfo->first_moved = item_pos + 1; + break; + + case CMODE_WHOLE | CMODE_HEAD: + /* one or more items get removed completely and one item gets cut partially from its head */ + assert("vs-1568", cinfo->first_removed == params->from->item_pos); + assert("vs-1564", cinfo->removed_count > 0 && cinfo->removed_count != MAX_POS_IN_NODE); + assert("vs-1569", cinfo->head_removed == cinfo->first_removed + cinfo->removed_count); + + /* call kill hook for all items removed completely */ + if (is_cut == 0) + call_kill_hooks(node, cinfo->first_removed, cinfo->removed_count, data); + + item_pos = cinfo->first_removed; + ih = node40_ih_at(node, item_pos); + + if (params->smallest_removed) + memcpy(params->smallest_removed, &ih->key, sizeof (reiser4_key)); + + freed = kill_head_f(params->to, data, 0, &new_first_key); + + cinfo->freed_space_start = ih40_get_offset(ih); + + ih = node40_ih_at(node, cinfo->head_removed); + /* this is the most complex case. Item which got head removed and items which are to be moved + intact change their location differently. 
*/ + cinfo->freed_space_end = ih40_get_offset(ih) + freed; + cinfo->first_moved = cinfo->head_removed; + cinfo->head_removed_location = cinfo->freed_space_start; + + /* item head is removed, therefore, item key changed */ + coord.node = node; + coord_set_item_pos(&coord, cinfo->head_removed); + coord.unit_pos = 0; + coord.between = AT_UNIT; + update_item_key_node40(&coord, &new_first_key, 0); + + assert("vs-1579", cinfo->first_removed == 0); + /* key of first item of the node changes */ + retval = 1; + break; + + case CMODE_TAIL | CMODE_HEAD: + /* one item gets cut from its end and its neighbor gets cut from its head */ + impossible("vs-1576", "this can not happen currently"); + break; + + case CMODE_TAIL | CMODE_WHOLE | CMODE_HEAD: + impossible("vs-1577", "this can not happen currently"); + break; + default: + impossible("vs-1578", "unexpected cut mode"); + break; + } + } + return retval; +} + + +/* plugin->u.node.kill + return value is number of items removed completely */ +int +kill_node40(struct carry_kill_data *kdata, carry_plugin_info *info) +{ + znode *node; + struct cut40_info cinfo; + int first_key_changed; + + node = kdata->params.from->node; + + first_key_changed = prepare_for_compact(&cinfo, &kdata->params, 0/* not cut */, kdata, info); + compact(node, &cinfo); + + if (info) { + /* it is not called by node40_shift, so we have to take care + of changes on upper levels */ + if (node_is_empty(node) && !(kdata->flags & DELETE_RETAIN_EMPTY)) + /* all contents of the node are deleted */ + prepare_removal_node40(node, info); + else if (first_key_changed) { + prepare_for_update(NULL, node, info); + } + } + + coord_clear_iplug(kdata->params.from); + coord_clear_iplug(kdata->params.to); + + znode_make_dirty(node); + return cinfo.removed_count == MAX_POS_IN_NODE ?
0 : cinfo.removed_count; +} + +/* plugin->u.node.cut + return value is number of items removed completely */ +int +cut_node40(struct carry_cut_data *cdata, carry_plugin_info *info) +{ + znode *node; + struct cut40_info cinfo; + int first_key_changed; + + node = cdata->params.from->node; + + first_key_changed = prepare_for_compact(&cinfo, &cdata->params, 1/* cut */, cdata, info); + compact(node, &cinfo); + + if (info) { + /* it is not called by node40_shift, so we have to take care + of changes on upper levels */ + if (node_is_empty(node)) + /* all contents of the node are deleted */ + prepare_removal_node40(node, info); + else if (first_key_changed) { + prepare_for_update(NULL, node, info); + } + } + + coord_clear_iplug(cdata->params.from); + coord_clear_iplug(cdata->params.to); + + znode_make_dirty(node); + return cinfo.removed_count == MAX_POS_IN_NODE ? 0 : cinfo.removed_count ; +} + + +/* this structure is used by shift method of node40 plugin */ +struct shift_params { + shift_direction pend; /* when @pend == append - we are shifting to + left, when @pend == prepend - to right */ + coord_t wish_stop; /* when shifting to left this is last unit we + want shifted, when shifting to right - this + is set to unit we want to start shifting + from */ + znode *target; + int everything; /* it is set to 1 if everything we have to shift is + shifted, 0 - otherwise */ + + /* FIXME-VS: get rid of read_stop */ + + /* these are set by estimate_shift */ + coord_t real_stop; /* this will be set to last unit which will be + really shifted */ + + /* coordinate in source node before operation of unit which becomes + first after shift to left or last after shift to right */ + union { + coord_t future_first; + coord_t future_last; + } u; + + unsigned merging_units; /* number of units of first item which have to + be merged with last item of target node */ + unsigned merging_bytes; /* number of bytes in those units */ + + unsigned entire; /* items shifted in their entirety */ + unsigned
entire_bytes; /* number of bytes in those items */ + + unsigned part_units; /* number of units of partially copied item */ + unsigned part_bytes; /* number of bytes in those units */ + + unsigned shift_bytes; /* total number of bytes in items shifted (item + headers not included) */ + +}; + +static int +item_creation_overhead(coord_t * item) +{ + return node_plugin_by_coord(item)->item_overhead(item->node, 0); +} + +/* how many units are there in @source starting from source->unit_pos + but not further than @stop_coord */ +static int +wanted_units(coord_t * source, coord_t * stop_coord, shift_direction pend) +{ + if (pend == SHIFT_LEFT) { + assert("vs-181", source->unit_pos == 0); + } else { + assert("vs-182", source->unit_pos == coord_last_unit_pos(source)); + } + + if (source->item_pos != stop_coord->item_pos) { + /* @source and @stop_coord are different items */ + return coord_last_unit_pos(source) + 1; + } + + if (pend == SHIFT_LEFT) { + return stop_coord->unit_pos + 1; + } else { + return source->unit_pos - stop_coord->unit_pos + 1; + } +} + +/* this calculates what can be copied from @shift->wish_stop.node to + @shift->target */ +static void +estimate_shift(struct shift_params *shift, const reiser4_context *ctx) +{ + unsigned target_free_space, size; + pos_in_node_t stop_item; /* item which estimating should not consider */ + unsigned want; /* number of units of item we want shifted */ + coord_t source; /* item being estimated */ + item_plugin *iplug; + + /* shifting to left/right starts from first/last units of + @shift->wish_stop.node */ + if (shift->pend == SHIFT_LEFT) { + coord_init_first_unit(&source, shift->wish_stop.node); + } else { + coord_init_last_unit(&source, shift->wish_stop.node); + } + shift->real_stop = source; + + /* free space in target node and number of items in source */ + target_free_space = znode_free_space(shift->target); + + shift->everything = 0; + if (!node_is_empty(shift->target)) { + /* target node is not empty, check for 
boundary items + mergeability */ + coord_t to; + + /* item we try to merge @source with */ + if (shift->pend == SHIFT_LEFT) { + coord_init_last_unit(&to, shift->target); + } else { + coord_init_first_unit(&to, shift->target); + } + + if ((shift->pend == SHIFT_LEFT) ? are_items_mergeable(&to, &source) : are_items_mergeable(&source, &to)) { + /* how many units of @source do we want to merge to + item @to */ + want = wanted_units(&source, &shift->wish_stop, shift->pend); + + /* how many units of @source we can merge to item + @to */ + iplug = item_plugin_by_coord(&source); + if (iplug->b.can_shift != NULL) + shift->merging_units = + iplug->b.can_shift(target_free_space, + &source, shift->target, shift->pend, &size, want); + else { + shift->merging_units = 0; + size = 0; + } + shift->merging_bytes = size; + shift->shift_bytes += size; + /* update stop coord to be set to last unit of @source + we can merge to @target */ + if (shift->merging_units) + /* at least one unit can be shifted */ + shift->real_stop.unit_pos = (shift->merging_units - source.unit_pos - 1) * shift->pend; + else { + /* nothing can be shifted */ + if (shift->pend == SHIFT_LEFT) + coord_init_before_first_item(&shift->real_stop, source.node); + else + coord_init_after_last_item(&shift->real_stop, source.node); + } + assert("nikita-2081", shift->real_stop.unit_pos + 1); + + if (shift->merging_units != want) { + /* we could not copy as many as we want, so, + there is no reason for estimating any + longer */ + return; + } + + target_free_space -= size; + coord_add_item_pos(&source, shift->pend); + } + } + + /* number of item nothing of which we want to shift */ + stop_item = shift->wish_stop.item_pos + shift->pend; + + /* calculate how many items can be copied into given free + space as whole */ + for (; source.item_pos != stop_item; coord_add_item_pos(&source, shift->pend)) { + if (shift->pend == SHIFT_RIGHT) + source.unit_pos = coord_last_unit_pos(&source); + + /* how many units of @source do we want to 
copy */ + want = wanted_units(&source, &shift->wish_stop, shift->pend); + + if (want == coord_last_unit_pos(&source) + 1) { + /* we want this item to be copied entirely */ + size = item_length_by_coord(&source) + item_creation_overhead(&source); + if (size <= target_free_space) { + /* item fits into target node as whole */ + target_free_space -= size; + shift->shift_bytes += size - item_creation_overhead(&source); + shift->entire_bytes += size - item_creation_overhead(&source); + shift->entire++; + + /* update shift->real_stop coord to be set to + last unit of @source we can merge to + @target */ + shift->real_stop = source; + if (shift->pend == SHIFT_LEFT) + shift->real_stop.unit_pos = coord_last_unit_pos(&shift->real_stop); + else + shift->real_stop.unit_pos = 0; + continue; + } + } + + /* we reach here only for an item which does not fit into + target node in its entirety. This item may be either + partially shifted, or not shifted at all. We will have to + create new item in target node, so decrease amount of free + space by an item creation overhead.
We can also reach here + if the stop coord is in this item */ + if (target_free_space >= (unsigned) item_creation_overhead(&source)) { + target_free_space -= item_creation_overhead(&source); + iplug = item_plugin_by_coord(&source); + if (iplug->b.can_shift) { + shift->part_units = iplug->b.can_shift(target_free_space, &source, 0 /*target */ + , shift->pend, &size, want); + } else { + target_free_space = 0; + shift->part_units = 0; + size = 0; + } + } else { + target_free_space = 0; + shift->part_units = 0; + size = 0; + } + shift->part_bytes = size; + shift->shift_bytes += size; + + /* set @shift->real_stop to last unit of @source we can merge + to @shift->target */ + if (shift->part_units) { + shift->real_stop = source; + shift->real_stop.unit_pos = (shift->part_units - source.unit_pos - 1) * shift->pend; + assert("nikita-2082", shift->real_stop.unit_pos + 1); + } + + if (want != shift->part_units) + /* not everything wanted was shifted */ + return; + break; + } + + shift->everything = 1; +} + +static void +copy_units(coord_t * target, coord_t * source, unsigned from, unsigned count, shift_direction dir, unsigned free_space) +{ + item_plugin *iplug; + + assert("nikita-1463", target != NULL); + assert("nikita-1464", source != NULL); + assert("nikita-1465", from + count <= coord_num_units(source)); + + iplug = item_plugin_by_coord(source); + assert("nikita-1468", iplug == item_plugin_by_coord(target)); + iplug->b.copy_units(target, source, from, count, dir, free_space); + + if (dir == SHIFT_RIGHT) { + /* FIXME-VS: this does not look necessary. 
update_item_key was + called already by copy_units method */ + reiser4_key split_key; + + assert("nikita-1469", target->unit_pos == 0); + + unit_key_by_coord(target, &split_key); + node_plugin_by_coord(target)->update_item_key(target, &split_key, 0); + } +} + +/* copy part of @shift->real_stop.node starting either from its beginning or + from its end and ending at @shift->real_stop to either the end or the + beginning of @shift->target */ +static void +copy(struct shift_params *shift) +{ + node40_header *nh; + coord_t from; + coord_t to; + item_header40 *from_ih, *to_ih; + int free_space_start; + int new_items; + unsigned old_items; + int old_offset; + unsigned i; + + nh = node40_node_header(shift->target); + free_space_start = nh40_get_free_space_start(nh); + old_items = nh40_get_num_items(nh); + new_items = shift->entire + (shift->part_units ? 1 : 0); + assert("vs-185", shift->shift_bytes == shift->merging_bytes + shift->entire_bytes + shift->part_bytes); + + from = shift->wish_stop; + + coord_init_first_unit(&to, shift->target); + + /* NOTE:NIKITA->VS not sure what I am doing: shift->target is empty, + hence to.between is set to EMPTY_NODE above. Looks like we want it + to be AT_UNIT. + + Oh, wonders of ->betweeness... 
+ + */ + to.between = AT_UNIT; + + if (shift->pend == SHIFT_LEFT) { + /* copying to left */ + + coord_set_item_pos(&from, 0); + from_ih = node40_ih_at(from.node, 0); + + coord_set_item_pos(&to, node40_num_of_items_internal(to.node) - 1); + if (shift->merging_units) { + /* expand last item, so that plugin methods will see + correct data */ + free_space_start += shift->merging_bytes; + nh40_set_free_space_start(nh, (unsigned) free_space_start); + nh40_set_free_space(nh, nh40_get_free_space(nh) - shift->merging_bytes); + + /* appending last item of @target */ + copy_units(&to, &from, 0, /* starting from 0-th unit */ + shift->merging_units, SHIFT_LEFT, shift->merging_bytes); + coord_inc_item_pos(&from); + from_ih--; + coord_inc_item_pos(&to); + } + + to_ih = node40_ih_at(shift->target, old_items); + if (shift->entire) { + /* copy @entire items entirely */ + + /* copy item headers */ + memcpy(to_ih - shift->entire + 1, + from_ih - shift->entire + 1, shift->entire * sizeof (item_header40)); + /* update item header offset */ + old_offset = ih40_get_offset(from_ih); + /* AUDIT: Looks like if we calculate old_offset + free_space_start here instead of just old_offset, we can perform one "add" operation less per each iteration */ + for (i = 0; i < shift->entire; i++, to_ih--, from_ih--) + ih40_set_offset(to_ih, ih40_get_offset(from_ih) - old_offset + free_space_start); + + /* copy item bodies */ + memcpy(zdata(shift->target) + free_space_start, zdata(from.node) + old_offset, /*ih40_get_offset (from_ih), */ + shift->entire_bytes); + + coord_add_item_pos(&from, (int) shift->entire); + coord_add_item_pos(&to, (int) shift->entire); + } + + nh40_set_free_space_start(nh, free_space_start + shift->shift_bytes - shift->merging_bytes); + nh40_set_free_space(nh, + nh40_get_free_space(nh) - + (shift->shift_bytes - shift->merging_bytes + sizeof (item_header40) * new_items)); + + /* update node header */ + node40_set_num_items(shift->target, nh, old_items + new_items); + assert("vs-170", 
nh40_get_free_space(nh) < znode_size(shift->target)); + + if (shift->part_units) { + /* copy heading part (@part units) of @source item as + a new item into @target->node */ + + /* copy item header of partially copied item */ + coord_set_item_pos(&to, node40_num_of_items_internal(to.node) + - 1); + memcpy(to_ih, from_ih, sizeof (item_header40)); + ih40_set_offset(to_ih, nh40_get_free_space_start(nh) - shift->part_bytes); + if (item_plugin_by_coord(&to)->b.init) + item_plugin_by_coord(&to)->b.init(&to, &from, 0); + copy_units(&to, &from, 0, shift->part_units, SHIFT_LEFT, shift->part_bytes); + } + + } else { + /* copying to right */ + + coord_set_item_pos(&from, node40_num_of_items_internal(from.node) - 1); + from_ih = node40_ih_at_coord(&from); + + coord_set_item_pos(&to, 0); + + /* prepare space for new items */ + memmove(zdata(to.node) + sizeof (node40_header) + + shift->shift_bytes, + zdata(to.node) + sizeof (node40_header), free_space_start - sizeof (node40_header)); + /* update item headers of moved items */ + to_ih = node40_ih_at(to.node, 0); + /* first item gets @merging_bytes longer. 
free space appears + at its beginning */ + if (!node_is_empty(to.node)) + ih40_set_offset(to_ih, ih40_get_offset(to_ih) + shift->shift_bytes - shift->merging_bytes); + + for (i = 1; i < old_items; i++) + ih40_set_offset(to_ih - i, ih40_get_offset(to_ih - i) + shift->shift_bytes); + + /* move item headers to make space for new items */ + memmove(to_ih - old_items + 1 - new_items, to_ih - old_items + 1, sizeof (item_header40) * old_items); + to_ih -= (new_items - 1); + + nh40_set_free_space_start(nh, free_space_start + shift->shift_bytes); + nh40_set_free_space(nh, + nh40_get_free_space(nh) - + (shift->shift_bytes + sizeof (item_header40) * new_items)); + + /* update node header */ + node40_set_num_items(shift->target, nh, old_items + new_items); + assert("vs-170", nh40_get_free_space(nh) < znode_size(shift->target)); + + if (shift->merging_units) { + coord_add_item_pos(&to, new_items); + to.unit_pos = 0; + to.between = AT_UNIT; + /* prepend first item of @to */ + copy_units(&to, &from, + coord_last_unit_pos(&from) - + shift->merging_units + 1, shift->merging_units, SHIFT_RIGHT, shift->merging_bytes); + coord_dec_item_pos(&from); + from_ih++; + } + + if (shift->entire) { + /* copy @entire items entirely */ + + /* copy item headers */ + memcpy(to_ih, from_ih, shift->entire * sizeof (item_header40)); + + /* update item header offset */ + old_offset = ih40_get_offset(from_ih + shift->entire - 1); + /* AUDIT: old_offset + sizeof (node40_header) + shift->part_bytes calculation can be taken off the loop. 
*/ + for (i = 0; i < shift->entire; i++, to_ih++, from_ih++) + ih40_set_offset(to_ih, + ih40_get_offset(from_ih) - + old_offset + sizeof (node40_header) + shift->part_bytes); + /* copy item bodies */ + coord_add_item_pos(&from, -(int) (shift->entire - 1)); + memcpy(zdata(to.node) + sizeof (node40_header) + + shift->part_bytes, item_by_coord_node40(&from), + shift->entire_bytes); + coord_dec_item_pos(&from); + } + + if (shift->part_units) { + coord_set_item_pos(&to, 0); + to.unit_pos = 0; + to.between = AT_UNIT; + /* copy heading part (@part units) of @source item as + a new item into @target->node */ + + /* copy item header of partially copied item */ + memcpy(to_ih, from_ih, sizeof (item_header40)); + ih40_set_offset(to_ih, sizeof (node40_header)); + if (item_plugin_by_coord(&to)->b.init) + item_plugin_by_coord(&to)->b.init(&to, &from, 0); + copy_units(&to, &from, + coord_last_unit_pos(&from) - + shift->part_units + 1, shift->part_units, SHIFT_RIGHT, shift->part_bytes); + } + } +} + +/* remove everything either before or after @fact_stop. 
Number of items + removed completely is returned */ +static int +delete_copied(struct shift_params *shift) +{ + coord_t from; + coord_t to; + struct carry_cut_data cdata; + + if (shift->pend == SHIFT_LEFT) { + /* we were shifting to left, remove everything from the + beginning of @shift->real_stop.node up to + @shift->real_stop */ + coord_init_first_unit(&from, shift->real_stop.node); + to = shift->real_stop; + + /* store old coordinate of unit which will be first after + shift to left */ + shift->u.future_first = to; + coord_next_unit(&shift->u.future_first); + } else { + /* we were shifting to right, remove everything from + @shift->real_stop up to the end of + @shift->real_stop.node */ + from = shift->real_stop; + coord_init_last_unit(&to, from.node); + + /* store old coordinate of unit which will be last after + shift to right */ + shift->u.future_last = from; + coord_prev_unit(&shift->u.future_last); + } + + cdata.params.from = &from; + cdata.params.to = &to; + cdata.params.from_key = 0; + cdata.params.to_key = 0; + cdata.params.smallest_removed = 0; + return cut_node40(&cdata, 0); +} + +/* something was moved between @left and @right. Add a carry operation to the + @info list so that carry updates the delimiting key between them */ +static int +prepare_for_update(znode * left, znode * right, carry_plugin_info * info) +{ + carry_op *op; + carry_node *cn; + + if (info == NULL) + /* nowhere to send operation to. */ + return 0; + + if (!should_notify_parent(right)) + return 0; + + op = node_post_carry(info, COP_UPDATE, right, 1); + if (IS_ERR(op) || op == NULL) + return op ? 
PTR_ERR(op) : -EIO; + + if (left != NULL) { + carry_node *reference; + + if (info->doing) + reference = insert_carry_node(info->doing, + info->todo, left); + else + reference = op->node; + assert("nikita-2992", reference != NULL); + cn = add_carry(info->todo, POOLO_BEFORE, reference); + if (IS_ERR(cn)) + return PTR_ERR(cn); + cn->parent = 1; + cn->node = left; + if (ZF_ISSET(left, JNODE_ORPHAN)) + cn->left_before = 1; + op->u.update.left = cn; + } else + op->u.update.left = NULL; + return 0; +} + +/* plugin->u.node.prepare_removal + to delete a pointer to @empty from the tree add corresponding carry + operation (delete) to @info list */ +reiser4_internal int +prepare_removal_node40(znode * empty, carry_plugin_info * info) +{ + carry_op *op; + + if (!should_notify_parent(empty)) + return 0; + /* already on a road to Styx */ + if (ZF_ISSET(empty, JNODE_HEARD_BANSHEE)) + return 0; + op = node_post_carry(info, COP_DELETE, empty, 1); + if (IS_ERR(op) || op == NULL) + return RETERR(op ? PTR_ERR(op) : -EIO); + + op->u.delete.child = 0; + op->u.delete.flags = 0; + + /* fare thee well */ + + RLOCK_TREE(current_tree); + WLOCK_DK(current_tree); + znode_set_ld_key(empty, znode_get_rd_key(empty)); + if (znode_is_left_connected(empty) && empty->left) + znode_set_rd_key(empty->left, znode_get_rd_key(empty)); + WUNLOCK_DK(current_tree); + RUNLOCK_TREE(current_tree); + + ZF_SET(empty, JNODE_HEARD_BANSHEE); + return 0; +} + +/* something was shifted from @insert_coord->node to @shift->target, update + @insert_coord accordingly */ +static void +adjust_coord(coord_t * insert_coord, struct shift_params *shift, int removed, int including_insert_coord) +{ + /* item plugin was invalidated by shifting */ + coord_clear_iplug(insert_coord); + + if (node_is_empty(shift->wish_stop.node)) { + assert("vs-242", shift->everything); + if (including_insert_coord) { + if (shift->pend == SHIFT_RIGHT) { + /* set @insert_coord before first unit of + @shift->target node */ + 
coord_init_before_first_item(insert_coord, shift->target); + } else { + /* set @insert_coord after last in target node */ + coord_init_after_last_item(insert_coord, shift->target); + } + } else { + /* set @insert_coord inside of empty node. There is + only one possible coord within an empty + node. init_first_unit will set that coord */ + coord_init_first_unit(insert_coord, shift->wish_stop.node); + } + return; + } + + if (shift->pend == SHIFT_RIGHT) { + /* there was shifting to right */ + if (shift->everything) { + /* everything wanted was shifted */ + if (including_insert_coord) { + /* @insert_coord is set before first unit of + @to node */ + coord_init_before_first_item(insert_coord, shift->target); + insert_coord->between = BEFORE_UNIT; + } else { + /* @insert_coord is set after last unit of + @insert->node */ + coord_init_last_unit(insert_coord, shift->wish_stop.node); + insert_coord->between = AFTER_UNIT; + } + } + return; + } + + /* there was shifting to left */ + if (shift->everything) { + /* everything wanted was shifted */ + if (including_insert_coord) { + /* @insert_coord is set after last unit in @to node */ + coord_init_after_last_item(insert_coord, shift->target); + } else { + /* @insert_coord is set before first unit in the same + node */ + coord_init_before_first_item(insert_coord, shift->wish_stop.node); + } + return; + } + + /* FIXME-VS: the code below is complicated because with between == + AFTER_ITEM unit_pos is set to 0 */ + + if (!removed) { + /* no items were shifted entirely */ + assert("vs-195", shift->merging_units == 0 || shift->part_units == 0); + + if (shift->real_stop.item_pos == insert_coord->item_pos) { + if (shift->merging_units) { + if (insert_coord->between == AFTER_UNIT) { + assert("nikita-1441", insert_coord->unit_pos >= shift->merging_units); + insert_coord->unit_pos -= shift->merging_units; + } else if (insert_coord->between == BEFORE_UNIT) { + assert("nikita-2090", insert_coord->unit_pos > shift->merging_units); + 
insert_coord->unit_pos -= shift->merging_units; + } + + assert("nikita-2083", insert_coord->unit_pos + 1); + } else { + if (insert_coord->between == AFTER_UNIT) { + assert("nikita-1442", insert_coord->unit_pos >= shift->part_units); + insert_coord->unit_pos -= shift->part_units; + } else if (insert_coord->between == BEFORE_UNIT) { + assert("nikita-2089", insert_coord->unit_pos > shift->part_units); + insert_coord->unit_pos -= shift->part_units; + } + + assert("nikita-2084", insert_coord->unit_pos + 1); + } + } + return; + } + + /* we shifted to left and there was not enough space for everything */ + switch (insert_coord->between) { + case AFTER_UNIT: + case BEFORE_UNIT: + if (shift->real_stop.item_pos == insert_coord->item_pos) + insert_coord->unit_pos -= shift->part_units; + /* fall through */ + case AFTER_ITEM: + coord_add_item_pos(insert_coord, -removed); + break; + default: + impossible("nikita-2087", "not ready"); + } + assert("nikita-2085", insert_coord->unit_pos + 1); +} + +static int +call_shift_hooks(struct shift_params *shift) +{ + unsigned i, shifted; + coord_t coord; + item_plugin *iplug; + + assert("vs-275", !node_is_empty(shift->target)); + + /* number of items shift touches */ + shifted = shift->entire + (shift->merging_units ? 1 : 0) + (shift->part_units ? 
1 : 0); + + if (shift->pend == SHIFT_LEFT) { + /* moved items are at the end */ + coord_init_last_unit(&coord, shift->target); + coord.unit_pos = 0; + + assert("vs-279", shift->pend == 1); + for (i = 0; i < shifted; i++) { + unsigned from, count; + + iplug = item_plugin_by_coord(&coord); + if (i == 0 && shift->part_units) { + assert("vs-277", coord_num_units(&coord) == shift->part_units); + count = shift->part_units; + from = 0; + } else if (i == shifted - 1 && shift->merging_units) { + count = shift->merging_units; + from = coord_num_units(&coord) - count; + } else { + count = coord_num_units(&coord); + from = 0; + } + + if (iplug->b.shift_hook) { + iplug->b.shift_hook(&coord, from, count, shift->wish_stop.node); + } + coord_add_item_pos(&coord, -shift->pend); + } + } else { + /* moved items are at the beginning */ + coord_init_first_unit(&coord, shift->target); + + assert("vs-278", shift->pend == -1); + for (i = 0; i < shifted; i++) { + unsigned from, count; + + iplug = item_plugin_by_coord(&coord); + if (i == 0 && shift->part_units) { + assert("vs-277", coord_num_units(&coord) == shift->part_units); + count = coord_num_units(&coord); + from = 0; + } else if (i == shifted - 1 && shift->merging_units) { + count = shift->merging_units; + from = 0; + } else { + count = coord_num_units(&coord); + from = 0; + } + + if (iplug->b.shift_hook) { + iplug->b.shift_hook(&coord, from, count, shift->wish_stop.node); + } + coord_add_item_pos(&coord, -shift->pend); + } + } + + return 0; +} + +/* shift to left is completed. Return 1 if unit @old was moved to left neighbor */ +static int +unit_moved_left(const struct shift_params *shift, const coord_t * old) +{ + assert("vs-944", shift->real_stop.node == old->node); + + if (shift->real_stop.item_pos < old->item_pos) + return 0; + if (shift->real_stop.item_pos == old->item_pos) { + if (shift->real_stop.unit_pos < old->unit_pos) + return 0; + } + return 1; +} + +/* shift to right is completed. 
Return 1 if unit @old was moved to right + neighbor */ +static int +unit_moved_right(const struct shift_params *shift, const coord_t * old) +{ + assert("vs-944", shift->real_stop.node == old->node); + + if (shift->real_stop.item_pos > old->item_pos) + return 0; + if (shift->real_stop.item_pos == old->item_pos) { + if (shift->real_stop.unit_pos > old->unit_pos) + return 0; + } + return 1; +} + +/* coord @old was set in node from which shift was performed. What was shifted + is stored in @shift. Update @old correspondingly to performed shift */ +static coord_t * +adjust_coord2(const struct shift_params *shift, const coord_t * old, coord_t * new) +{ + coord_clear_iplug(new); + new->between = old->between; + + coord_clear_iplug(new); + if (old->node == shift->target) { + if (shift->pend == SHIFT_LEFT) { + /* coord which is set inside of left neighbor does not + change during shift to left */ + coord_dup(new, old); + return new; + } + new->node = old->node; + coord_set_item_pos(new, + old->item_pos + shift->entire + + (shift->part_units ? 1 : 0)); + new->unit_pos = old->unit_pos; + if (old->item_pos == 0 && shift->merging_units) + new->unit_pos += shift->merging_units; + return new; + } + + assert("vs-977", old->node == shift->wish_stop.node); + if (shift->pend == SHIFT_LEFT) { + if (unit_moved_left(shift, old)) { + /* unit @old moved to left neighbor. Calculate its + coordinate there */ + new->node = shift->target; + coord_set_item_pos(new, + node_num_items(shift->target) - + shift->entire - + (shift->part_units ? 1 : 0) + + old->item_pos); + + new->unit_pos = old->unit_pos; + if (shift->merging_units) { + coord_dec_item_pos(new); + if (old->item_pos == 0) { + /* unit_pos only changes if item got + merged */ + new->unit_pos = coord_num_units(new) - (shift->merging_units - old->unit_pos); + } + } + } else { + /* unit @old did not move to left neighbor. + + Use _nocheck, because @old is outside of its node. 
+ */ + coord_dup_nocheck(new, old); + coord_add_item_pos(new, -shift->u.future_first.item_pos); + if (new->item_pos == 0) + new->unit_pos -= shift->u.future_first.unit_pos; + } + } else { + if (unit_moved_right(shift, old)) { + /* unit @old moved to right neighbor */ + new->node = shift->target; + coord_set_item_pos(new, + old->item_pos - + shift->real_stop.item_pos); + if (new->item_pos == 0) { + /* unit @old might change unit pos */ + coord_set_item_pos(new, + old->unit_pos - + shift->real_stop.unit_pos); + } + } else { + /* unit @old did not move to right neighbor, therefore + it did not change */ + coord_dup(new, old); + } + } + coord_set_iplug(new, item_plugin_by_coord(new)); + return new; +} + +/* this is called when shift is completed (something of source node is copied + to target and deleted in source) to update all taps set in current + context */ +static void +update_taps(const struct shift_params *shift) +{ + tap_t *tap; + coord_t new; + + for_all_taps(tap) { + /* update only taps set to nodes participating in shift */ + if (tap->coord->node == shift->wish_stop.node || tap->coord->node == shift->target) + tap_to_coord(tap, adjust_coord2(shift, tap->coord, &new)); + } +} + +#if REISER4_DEBUG + +struct shift_check { + reiser4_key key; + __u16 plugin_id; + union { + __u64 bytes; + __u64 entries; + void *unused; + } u; +}; + +void * +shift_check_prepare(const znode *left, const znode *right) +{ + pos_in_node_t i, nr_items; + int mergeable; + struct shift_check *data; + item_header40 *ih; + + + if (node_is_empty(left) || node_is_empty(right)) + mergeable = 0; + else { + coord_t l, r; + + coord_init_last_unit(&l, left); + coord_init_first_unit(&r, right); + mergeable = are_items_mergeable(&l, &r); + } + nr_items = node40_num_of_items_internal(left) + node40_num_of_items_internal(right) - (mergeable ? 
1 : 0); + data = reiser4_kmalloc(sizeof(struct shift_check) * nr_items, GFP_KERNEL); + if (data != NULL) { + coord_t coord; + pos_in_node_t item_pos; + + coord_init_first_unit(&coord, left); + i = 0; + + for (item_pos = 0; item_pos < node40_num_of_items_internal(left); item_pos ++) { + + coord_set_item_pos(&coord, item_pos); + ih = node40_ih_at_coord(&coord); + + data[i].key = ih->key; + data[i].plugin_id = d16tocpu(&ih->plugin_id); + switch(data[i].plugin_id) { + case CTAIL_ID: + case FORMATTING_ID: + data[i].u.bytes = coord_num_units(&coord); + break; + case EXTENT_POINTER_ID: + data[i].u.bytes = extent_size(&coord, coord_num_units(&coord)); + break; + case COMPOUND_DIR_ID: + data[i].u.entries = coord_num_units(&coord); + break; + default: + data[i].u.unused = NULL; + break; + } + i ++; + } + + coord_init_first_unit(&coord, right); + + if (mergeable) { + assert("vs-1609", i != 0); + + ih = node40_ih_at_coord(&coord); + + assert("vs-1589", data[i - 1].plugin_id == d16tocpu(&ih->plugin_id)); + switch(data[i - 1].plugin_id) { + case CTAIL_ID: + case FORMATTING_ID: + data[i - 1].u.bytes += coord_num_units(&coord); + break; + case EXTENT_POINTER_ID: + data[i - 1].u.bytes += extent_size(&coord, coord_num_units(&coord)); + break; + case COMPOUND_DIR_ID: + data[i - 1].u.entries += coord_num_units(&coord); + break; + default: + impossible("vs-1605", "wrong mergeable item"); + break; + } + item_pos = 1; + } else + item_pos = 0; + for (; item_pos < node40_num_of_items_internal(right); item_pos ++) { + + assert("vs-1604", i < nr_items); + coord_set_item_pos(&coord, item_pos); + ih = node40_ih_at_coord(&coord); + + data[i].key = ih->key; + data[i].plugin_id = d16tocpu(&ih->plugin_id); + switch(data[i].plugin_id) { + case CTAIL_ID: + case FORMATTING_ID: + data[i].u.bytes = coord_num_units(&coord); + break; + case EXTENT_POINTER_ID: + data[i].u.bytes = extent_size(&coord, coord_num_units(&coord)); + break; + case COMPOUND_DIR_ID: + data[i].u.entries = coord_num_units(&coord); + 
break; + default: + data[i].u.unused = NULL; + break; + } + i ++; + } + assert("vs-1606", i == nr_items); + } + return data; +} + +void +shift_check(void *vp, const znode *left, const znode *right) +{ + pos_in_node_t i, nr_items; + coord_t coord; + __u64 last_bytes; + int mergeable; + item_header40 *ih; + pos_in_node_t item_pos; + struct shift_check *data; + + data = (struct shift_check *)vp; + + if (data == NULL) + return; + + if (node_is_empty(left) || node_is_empty(right)) + mergeable = 0; + else { + coord_t l, r; + + coord_init_last_unit(&l, left); + coord_init_first_unit(&r, right); + mergeable = are_items_mergeable(&l, &r); + } + + nr_items = node40_num_of_items_internal(left) + node40_num_of_items_internal(right) - (mergeable ? 1 : 0); + + i = 0; + last_bytes = 0; + + coord_init_first_unit(&coord, left); + + for (item_pos = 0; item_pos < node40_num_of_items_internal(left); item_pos ++) { + + coord_set_item_pos(&coord, item_pos); + ih = node40_ih_at_coord(&coord); + + assert("vs-1611", i == item_pos); + assert("vs-1590", keyeq(&ih->key, &data[i].key)); + assert("vs-1591", d16tocpu(&ih->plugin_id) == data[i].plugin_id); + if ((i < (node40_num_of_items_internal(left) - 1)) || !mergeable) { + switch(data[i].plugin_id) { + case CTAIL_ID: + case FORMATTING_ID: + assert("vs-1592", data[i].u.bytes == coord_num_units(&coord)); + break; + case EXTENT_POINTER_ID: + assert("vs-1593", data[i].u.bytes == extent_size(&coord, coord_num_units(&coord))); + break; + case COMPOUND_DIR_ID: + assert("vs-1594", data[i].u.entries == coord_num_units(&coord)); + break; + default: + break; + } + } + if (item_pos == (node40_num_of_items_internal(left) - 1) && mergeable) { + switch(data[i].plugin_id) { + case CTAIL_ID: + case FORMATTING_ID: + last_bytes = coord_num_units(&coord); + break; + case EXTENT_POINTER_ID: + last_bytes = extent_size(&coord, coord_num_units(&coord)); + break; + case COMPOUND_DIR_ID: + last_bytes = coord_num_units(&coord); + break; + default: + 
impossible("vs-1595", "wrong mergeable item"); + break; + } + } + i ++; + } + + coord_init_first_unit(&coord, right); + if (mergeable) { + ih = node40_ih_at_coord(&coord); + + assert("vs-1589", data[i - 1].plugin_id == d16tocpu(&ih->plugin_id)); + assert("vs-1608", last_bytes != 0); + switch(data[i - 1].plugin_id) { + case CTAIL_ID: + case FORMATTING_ID: + assert("vs-1596", data[i - 1].u.bytes == last_bytes + coord_num_units(&coord)); + break; + + case EXTENT_POINTER_ID: + assert("vs-1597", data[i - 1].u.bytes == last_bytes + extent_size(&coord, coord_num_units(&coord))); + break; + + case COMPOUND_DIR_ID: + assert("vs-1598", data[i - 1].u.bytes == last_bytes + coord_num_units(&coord)); + break; + default: + impossible("vs-1599", "wrong mergeable item"); + break; + } + item_pos = 1; + } else + item_pos = 0; + + for (; item_pos < node40_num_of_items_internal(right); item_pos ++) { + + coord_set_item_pos(&coord, item_pos); + ih = node40_ih_at_coord(&coord); + + assert("vs-1612", keyeq(&ih->key, &data[i].key)); + assert("vs-1613", d16tocpu(&ih->plugin_id) == data[i].plugin_id); + switch(data[i].plugin_id) { + case CTAIL_ID: + case FORMATTING_ID: + assert("vs-1600", data[i].u.bytes == coord_num_units(&coord)); + break; + case EXTENT_POINTER_ID: + assert("vs-1601", data[i].u.bytes == extent_size(&coord, coord_num_units(&coord))); + break; + case COMPOUND_DIR_ID: + assert("vs-1602", data[i].u.entries == coord_num_units(&coord)); + break; + default: + break; + } + i ++; + } + + assert("vs-1603", i == nr_items); + reiser4_kfree(data); +} + +#endif + +/* plugin->u.node.shift + look for description of this method in plugin/node/node.h */ +reiser4_internal int +shift_node40(coord_t *from, znode *to, shift_direction pend, + int delete_child, /* if @from->node becomes empty - it will be + deleted from the tree if this is set to 1 */ + int including_stop_coord, + carry_plugin_info *info) +{ + struct shift_params shift; + int result; + znode *left, *right; + znode *source; + int 
target_empty; + + assert("nikita-2161", coord_check(from)); + + memset(&shift, 0, sizeof (shift)); + shift.pend = pend; + shift.wish_stop = *from; + shift.target = to; + + assert("nikita-1473", znode_is_write_locked(from->node)); + assert("nikita-1474", znode_is_write_locked(to)); + + source = from->node; + + /* set @shift.wish_stop to rightmost/leftmost unit among units we want + shifted */ + if (pend == SHIFT_LEFT) { + result = coord_set_to_left(&shift.wish_stop); + left = to; + right = from->node; + } else { + result = coord_set_to_right(&shift.wish_stop); + left = from->node; + right = to; + } + + if (result) { + /* move insertion coord even if there is nothing to move */ + if (including_stop_coord) { + /* move insertion coord (@from) */ + if (pend == SHIFT_LEFT) { + /* after last item in target node */ + coord_init_after_last_item(from, to); + } else { + /* before first item in target node */ + coord_init_before_first_item(from, to); + } + } + + if (delete_child && node_is_empty(shift.wish_stop.node)) + result = prepare_removal_node40(shift.wish_stop.node, info); + else + result = 0; + /* there is nothing to shift */ + assert("nikita-2078", coord_check(from)); + return result; + } + + target_empty = node_is_empty(to); + + /* when first node plugin with item body compression is implemented, + this must be changed to call node specific plugin */ + + /* shift->stop_coord is updated to last unit which really will be + shifted */ + estimate_shift(&shift, get_current_context()); + if (!shift.shift_bytes) { + /* we could not shift anything */ + assert("nikita-2079", coord_check(from)); + return 0; + } + + copy(&shift); + + /* result value of this is important. It is used by adjust_coord below */ + result = delete_copied(&shift); + + assert("vs-1610", result >= 0); + assert("vs-1471", ((reiser4_context *) current->journal_info)->magic == context_magic); + + /* item which has been moved from one node to another might want to do + something on that event. 
This can be done by item's shift_hook + method, which is now called for every moved item */ + call_shift_hooks(&shift); + + assert("vs-1472", ((reiser4_context *) current->journal_info)->magic == context_magic); + + update_taps(&shift); + + assert("vs-1473", ((reiser4_context *) current->journal_info)->magic == context_magic); + + /* adjust @from pointer in accordance with @including_stop_coord flag + and amount of data which was really shifted */ + adjust_coord(from, &shift, result, including_stop_coord); + + if (target_empty) + /* + * items were shifted into empty node. Update delimiting key. + */ + result = prepare_for_update(NULL, left, info); + + /* add update operation to @info, which is the list of operations to + be performed on a higher level */ + result = prepare_for_update(left, right, info); + if (!result && node_is_empty(source) && delete_child) { + /* all contents of @from->node were moved to @to and @from->node + has to be removed from the tree, so, on a higher level we + will be removing the pointer to node @from->node */ + result = prepare_removal_node40(source, info); + } + assert("nikita-2080", coord_check(from)); + return result ? 
result : (int) shift.shift_bytes; +} + +/* plugin->u.node.fast_insert() + look for description of this method in plugin/node/node.h */ +reiser4_internal int +fast_insert_node40(const coord_t * coord UNUSED_ARG /* node to query */ ) +{ + return 1; +} + +/* plugin->u.node.fast_paste() + look for description of this method in plugin/node/node.h */ +reiser4_internal int +fast_paste_node40(const coord_t * coord UNUSED_ARG /* node to query */ ) +{ + return 1; +} + +/* plugin->u.node.fast_cut() + look for description of this method in plugin/node/node.h */ +reiser4_internal int +fast_cut_node40(const coord_t * coord UNUSED_ARG /* node to query */ ) +{ + return 1; +} + +/* plugin->u.node.modify - not defined */ + +/* plugin->u.node.max_item_size */ +reiser4_internal int +max_item_size_node40(void) +{ + return reiser4_get_current_sb()->s_blocksize - sizeof (node40_header) - sizeof (item_header40); +} + +/* plugin->u.node.set_item_plugin */ +reiser4_internal int +set_item_plugin_node40(coord_t *coord, item_id id) +{ + item_header40 *ih; + + ih = node40_ih_at_coord(coord); + cputod16(id, &ih->plugin_id); + coord->iplugid = id; + return 0; +} + + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/node/node40.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/node/node40.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,117 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#if !defined( __REISER4_NODE40_H__ ) +#define __REISER4_NODE40_H__ + +#include "../../forward.h" +#include "../../dformat.h" +#include "node.h" + +#include + + +/* format of node header for 40 node layouts. Keep bloat out of this struct. */ +typedef struct node40_header { + /* identifier of node plugin. Must be located at the very beginning + of a node. 
*/ + common_node_header common_header; /* this is 16 bits */ + /* number of items. Should be first element in the node header, + because we haven't yet finally decided whether it shouldn't go into + common_header. + */ +/* NIKITA-FIXME-HANS: Create a macro such that if there is only one + * node format at compile time, and it is this one, accesses do not function dereference when + * accessing these fields (and otherwise they do). Probably 80% of users will only have one node format at a time throughout the life of reiser4. */ + d16 nr_items; + /* free space in node measured in bytes */ + d16 free_space; + /* offset to start of free space in node */ + d16 free_space_start; + /* for reiser4_fsck. When information about what is a free + block is corrupted, and we try to recover everything even + if marked as freed, then old versions of data may + duplicate newer versions, and this field allows us to + restore the newer version. Also useful for when users + who don't have the new trashcan installed on their linux distro + delete the wrong files and send us desperate emails + offering $25 for them back. */ + + /* magic field we need to tell formatted nodes NIKITA-FIXME-HANS: improve this comment*/ + d32 magic; + /* flushstamp is made of mk_id and write_counter. mk_id is an + id generated randomly at mkreiserfs time. So we can just + skip all nodes with different mk_id. write_counter is d64 + incrementing counter of writes on disk. It is used for + choosing the newest data at fsck time. NIKITA-FIXME-HANS: why was field name changed but not comment? */ + + d32 mkfs_id; + d64 flush_id; + /* node flags to be used by fsck (reiser4ck or reiser4fsck?) 
+ and repacker NIKITA-FIXME-HANS: say more or reference elsewhere that says more */ + d16 flags; + + /* 1 is leaf level, 2 is twig level, root is the numerically + largest level */ + d8 level; + + d8 pad; +} PACKED node40_header; + +/* item headers are not standard across all node layouts, pass + pos_in_node to functions instead */ +typedef struct item_header40 { + /* key of item */ + /* 0 */ reiser4_key key; + /* offset from start of a node measured in 8-byte chunks */ + /* 24 */ d16 offset; + /* 26 */ d16 flags; + /* 28 */ d16 plugin_id; +} PACKED item_header40; + +size_t item_overhead_node40(const znode * node, flow_t * aflow); +size_t free_space_node40(znode * node); +node_search_result lookup_node40(znode * node, const reiser4_key * key, lookup_bias bias, coord_t * coord); +int num_of_items_node40(const znode * node); +char *item_by_coord_node40(const coord_t * coord); +int length_by_coord_node40(const coord_t * coord); +item_plugin *plugin_by_coord_node40(const coord_t * coord); +reiser4_key *key_at_node40(const coord_t * coord, reiser4_key * key); +size_t estimate_node40(znode * node); +int check_node40(const znode * node, __u32 flags, const char **error); +int parse_node40(znode * node); +int init_node40(znode * node); +#if GUESS_EXISTS +int guess_node40(const znode * node); +#endif +void change_item_size_node40(coord_t * coord, int by); +int create_item_node40(coord_t * target, const reiser4_key * key, reiser4_item_data * data, carry_plugin_info * info); +void update_item_key_node40(coord_t * target, const reiser4_key * key, carry_plugin_info * info); +int kill_node40(struct carry_kill_data *, carry_plugin_info *); +int cut_node40(struct carry_cut_data *, carry_plugin_info *); +int shift_node40(coord_t * from, znode * to, shift_direction pend, + /* if @from->node becomes + empty - it will be deleted from + the tree if this is set to 1 + */ + int delete_child, int including_stop_coord, carry_plugin_info * info); + +int fast_insert_node40(const coord_t * 
coord); +int fast_paste_node40(const coord_t * coord); +int fast_cut_node40(const coord_t * coord); +int max_item_size_node40(void); +int prepare_removal_node40(znode * empty, carry_plugin_info * info); +int set_item_plugin_node40(coord_t * coord, item_id id); +int shrink_item_node40(coord_t *coord, int delta); + +/* __REISER4_NODE40_H__ */ +#endif +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/node/node.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/node/node.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,127 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Node plugin interface. + + Description: The tree provides the abstraction of flows, which it + internally fragments into items which it stores in nodes. + + A key_atom is a piece of data bound to a single key. + + For reasonable space efficiency to be achieved it is often + necessary to store key_atoms in the nodes in the form of items, where + an item is a sequence of key_atoms of the same or similar type. It is + more space-efficient, because the item can implement (very) + efficient compression of key_atom's bodies using internal knowledge + about their semantics, and it can often avoid having a key for each + key_atom. Each type of item has specific operations implemented by its + item handler (see balance.c). + + Rationale: the rest of the code (specifically balancing routines) + accesses leaf level nodes through this interface. This way we can + implement various block layouts and even combine various layouts + within the same tree. Balancing/allocating algorithms should not + care about peculiarities of splitting/merging specific item types, + but rather should leave that to the item's item handler. 
+ + Items, including those that provide the abstraction of flows, have + the property that if you move them in part or in whole to another + node, the balancing code invokes their is_left_mergeable() + item_operation to determine if they are mergeable with their new + neighbor in the node you have moved them to. For some items the + is_left_mergeable() function always returns null. + + When moving the bodies of items from one node to another: + + if a partial item is shifted to another node the balancing code invokes + an item handler method to handle the item splitting. + + if the balancing code needs to merge with an item in the node it + is shifting to, it will invoke an item handler method to handle + the item merging. + + if it needs to move whole item bodies unchanged, the balancing code uses xmemcpy() + adjusting the item headers after the move is done using the node handler. +*/ + +#include "../../forward.h" +#include "../../debug.h" +#include "../../key.h" +#include "../../coord.h" +#include "../plugin_header.h" +#include "../item/item.h" +#include "node.h" +#include "../plugin.h" +#include "../../znode.h" +#include "../../tree.h" +#include "../../super.h" +#include "../../reiser4.h" + +/* return starting key of the leftmost item in the @node */ +reiser4_internal reiser4_key * +leftmost_key_in_node(const znode * node /* node to query */ , + reiser4_key * key /* resulting key */ ) +{ + assert("nikita-1634", node != NULL); + assert("nikita-1635", key != NULL); + + if (!node_is_empty(node)) { + coord_t first_item; + + coord_init_first_unit(&first_item, (znode *) node); + item_key_by_coord(&first_item, key); + } else + *key = *max_key(); + return key; +} + +node_plugin node_plugins[LAST_NODE_ID] = { + [NODE40_ID] = { + .h = { + .type_id = REISER4_NODE_PLUGIN_TYPE, + .id = NODE40_ID, + .pops = NULL, + .label = "unified", + .desc = "unified node layout", + .linkage = TYPE_SAFE_LIST_LINK_ZERO, + }, + .item_overhead = item_overhead_node40, + .free_space = 
free_space_node40, + .lookup = lookup_node40, + .num_of_items = num_of_items_node40, + .item_by_coord = item_by_coord_node40, + .length_by_coord = length_by_coord_node40, + .plugin_by_coord = plugin_by_coord_node40, + .key_at = key_at_node40, + .estimate = estimate_node40, + .check = check_node40, + .parse = parse_node40, + .init = init_node40, +#ifdef GUESS_EXISTS + .guess = guess_node40, +#endif + .change_item_size = change_item_size_node40, + .create_item = create_item_node40, + .update_item_key = update_item_key_node40, + .cut_and_kill = kill_node40, + .cut = cut_node40, + .shift = shift_node40, + .shrink_item = shrink_item_node40, + .fast_insert = fast_insert_node40, + .fast_paste = fast_paste_node40, + .fast_cut = fast_cut_node40, + .max_item_size = max_item_size_node40, + .prepare_removal = prepare_removal_node40, + .set_item_plugin = set_item_plugin_node40 + } +}; + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/node/node.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/node/node.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,266 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* We need a definition of the default node layout here. */ + +/* Generally speaking, it is best to have free space in the middle of the + node so that two sets of things can grow towards it, and to have the + item bodies on the left so that the last one of them grows into free + space. We optimize for the case where we append new items to the end + of the node, or grow the last item, because it hurts nothing to so + optimize and it is a common special case to do massive insertions in + increasing key order (and one of the cases more likely to have a real user + notice the delay time for). 
+ + formatted leaf default layout: (leaf1) + + |node header:item bodies:free space:key + pluginid + item offset| + + We grow towards the middle, optimizing layout for the case where we + append new items to the end of the node. The node header is fixed + length. Keys, and item offsets plus pluginids for the items + corresponding to them are in increasing key order, and are fixed + length. Item offsets are relative to start of node (16 bits creating + a node size limit of 64k, 12 bits might be a better choice....). Item + bodies are in decreasing key order. Item bodies have a variable size. + There is a one to one to one mapping of keys to item offsets to item + bodies. Item offsets consist of pointers to the zeroth byte of the + item body. Item length equals the start of the next item minus the + start of this item, except the zeroth item whose length equals the end + of the node minus the start of that item (plus a byte). In other + words, the item length is not recorded anywhere, and it does not need + to be since it is computable. + + Leaf variable length items and keys layout : (lvar) + + |node header:key offset + item offset + pluginid triplets:free space:key bodies:item bodies| + + We grow towards the middle, optimizing layout for the case where we + append new items to the end of the node. The node header is fixed + length. Keys and item offsets for the items corresponding to them are + in increasing key order, and keys are variable length. Item offsets + are relative to start of node (16 bits). Item bodies are in + decreasing key order. Item bodies have a variable size. There is a + one to one to one mapping of keys to item offsets to item bodies. + Item offsets consist of pointers to the zeroth byte of the item body. + Item length equals the start of the next item's key minus the start of + this item, except the zeroth item whose length equals the end of the + node minus the start of that item (plus a byte). 
+ + leaf compressed keys layout: (lcomp) + + |node header:key offset + key inherit + item offset pairs:free space:key bodies:item bodies| + + We grow towards the middle, optimizing layout for the case where we + append new items to the end of the node. The node header is fixed + length. Keys and item offsets for the items corresponding to them are + in increasing key order, and keys are variable length. The "key + inherit" field indicates how much of the key prefix is identical to + the previous key (stem compression as described in "Managing + Gigabytes" is used). key_inherit is a one byte integer. The + intra-node searches performed through this layout are linear searches, + and this is theorized to not hurt performance much due to the high + cost of processor stalls on modern CPUs, and the small number of keys + in a single node. Item offsets are relative to start of node (16 + bits). Item bodies are in decreasing key order. Item bodies have a + variable size. There is a one to one to one mapping of keys to item + offsets to item bodies. Item offsets consist of pointers to the + zeroth byte of the item body. Item length equals the start of the + next item minus the start of this item, except the zeroth item whose + length equals the end of the node minus the start of that item (plus a + byte). In other words, item length and key length are not recorded + anywhere, and they do not need to be since they are computable. + + internal node default layout: (idef1) + + just like ldef1 except that item bodies are either blocknrs of + children or extents, and moving them may require updating parent + pointers in the nodes that they point to. +*/ + +/* There is an inherent 3-way tradeoff between optimizing and + exchanging disks between different architectures and code + complexity. This is optimal and simple and inexchangeable. + Someone else can do the code for exchanging disks and make it + complex. It would not be that hard. 
Using other than the PAGE_SIZE + might be suboptimal. +*/ + +#if !defined( __REISER4_NODE_H__ ) +#define __REISER4_NODE_H__ + +#define LEAF40_NODE_SIZE PAGE_CACHE_SIZE + +#include "../../dformat.h" +#include "../plugin_header.h" + +#include + +typedef enum { + NS_FOUND = 0, + NS_NOT_FOUND = -ENOENT +} node_search_result; + +/* Maximal possible space overhead for creation of new item in a node */ +#define REISER4_NODE_MAX_OVERHEAD ( sizeof( reiser4_key ) + 32 ) + +typedef enum { + REISER4_NODE_DKEYS = (1 << 0), + REISER4_NODE_TREE_STABLE = (1 << 1) +} reiser4_node_check_flag; + +/* cut and cut_and_kill have too long a list of parameters. This structure is just to save some space on stack */ +struct cut_list { + coord_t * from; + coord_t * to; + const reiser4_key * from_key; + const reiser4_key * to_key; + reiser4_key * smallest_removed; + carry_plugin_info * info; + __u32 flags; + struct inode *inode; /* this is to pass list of eflushed jnodes down to extent_kill_hook */ + lock_handle *left; + lock_handle *right; +}; + +struct carry_cut_data; +struct carry_kill_data; + +/* The responsibility of the node plugin is to store and give access + to the sequence of items within the node. */ +typedef struct node_plugin { + /* generic plugin fields */ + plugin_header h; + + /* calculates the amount of space that will be required to store an + item which is in addition to the space consumed by the item body. + (the space consumed by the item body can be gotten by calling + item->estimate) */ + size_t(*item_overhead) (const znode * node, flow_t * f); + + /* returns free space by looking into node (i.e., without using + znode->free_space). 
*/ + size_t(*free_space) (znode * node); + /* search within the node for the one item which might + contain the key, invoking item->search_within to search within + that item to see if it is in there */ + node_search_result(*lookup) (znode * node, const reiser4_key * key, lookup_bias bias, coord_t * coord); + /* number of items in node */ + int (*num_of_items) (const znode * node); + + /* store information about item in @coord in @data */ + /* break into several node ops, don't add any more uses of this before doing so */ + /*int ( *item_at )( const coord_t *coord, reiser4_item_data *data ); */ + char *(*item_by_coord) (const coord_t * coord); + int (*length_by_coord) (const coord_t * coord); + item_plugin *(*plugin_by_coord) (const coord_t * coord); + + /* store item key in @key */ + reiser4_key *(*key_at) (const coord_t * coord, reiser4_key * key); + /* conservatively estimate whether unit of what size can fit + into node. This estimation should be performed without + actually looking into the node's content (free space is saved in + znode). */ + size_t(*estimate) (znode * node); + + /* performs every consistency check the node plugin author could + imagine. Optional. */ + int (*check) (const znode * node, __u32 flags, const char **error); + + /* Called when node is read into memory and node plugin is + already detected. This should read some data into znode (like free + space counter) and, optionally, check data consistency. + */ + int (*parse) (znode * node); + /* This method is called on a new node to initialise plugin specific + data (header, etc.) */ + int (*init) (znode * node); + /* Check whether @node content conforms to this plugin format. + Probably only useful after support for old V3.x formats is added. + Uncomment after 4.0 only. + */ + /* int ( *guess )( const znode *node ); */ +#if REISER4_DEBUG + void (*print) (const char *prefix, const znode * node, __u32 flags); +#endif + /* change size of @item by @by bytes. @item->node has enough free + space. 
When @by > 0 - free space is appended to end of item. When + @by < 0 - item is truncated - it is assumed that last @by bytes of + the item are freed already */ + void (*change_item_size) (coord_t * item, int by); + + /* create new item @length bytes long in coord @target */ + int (*create_item) (coord_t * target, const reiser4_key * key, + reiser4_item_data * data, carry_plugin_info * info); + + /* update key of item. */ + void (*update_item_key) (coord_t * target, const reiser4_key * key, carry_plugin_info * info); + + int (*cut_and_kill) (struct carry_kill_data *, carry_plugin_info *); + int (*cut) (struct carry_cut_data *, carry_plugin_info *); + + /* + * shrink item pointed to by @coord by @delta bytes. + */ + int (*shrink_item) (coord_t *coord, int delta); + + /* copy as much as possible but not more than up to @stop from + @stop->node to @target. If (pend == append) then data from beginning of + @stop->node are copied to the end of @target. If (pend == prepend) then + data from the end of @stop->node are copied to the beginning of + @target. Copied data are removed from @stop->node. Information + about what to do on upper level is stored in @info */ + int (*shift) (coord_t * stop, znode * target, shift_direction pend, + int delete_node, int including_insert_coord, carry_plugin_info * info); + /* return true if this node allows skipping carry() in some situations + (see fs/reiser4/tree.c:insert_by_coord()). Reiser3.x format + emulation doesn't. + + This will speed up insertions that don't require updates to the + parent, by bypassing initialisation of carry() structures. It's + believed that the majority of insertions will fit there. 
+ + */ + int (*fast_insert) (const coord_t * coord); + int (*fast_paste) (const coord_t * coord); + int (*fast_cut) (const coord_t * coord); + /* this limits max size of item which can be inserted into a node and + number of bytes by which an item in a node may be appended */ + int (*max_item_size) (void); + int (*prepare_removal) (znode * empty, carry_plugin_info * info); + /* change plugin id of items which are in a node already. Currently it is used in tail conversion for regular + * files */ + int (*set_item_plugin) (coord_t * coord, item_id); +} node_plugin; + +typedef enum { + /* standard unified node layout used for both leaf and internal + nodes */ + NODE40_ID, + LAST_NODE_ID +} reiser4_node_id; + +extern reiser4_key *leftmost_key_in_node(const znode * node, reiser4_key * key); +#if REISER4_DEBUG +extern void print_node_content(const char *prefix, const znode * node, __u32 flags); +#endif + +extern void indent_znode(const znode * node); + +typedef struct common_node_header { + /* identifier of node plugin. Must be located at the very beginning + of a node. */ + d16 plugin_id; +} common_node_header; + +/* __REISER4_NODE_H__ */ +#endif +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/object.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/object.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1640 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Examples of object plugins: file, directory, symlink, special file */ +/* Plugins associated with inode: + + Plugin of inode is plugin referenced by plugin-id field of on-disk + stat-data. How we store this plugin in in-core inode is not + important. Currently pointers are used, another variant is to store + offsets and do array lookup on each access. 
+ + Now, each inode has one selected plugin: object plugin that + determines what type of file this object is: directory, regular etc. + + This main plugin can use other plugins that are thus subordinated to + it. Directory instance of object plugin uses hash; regular file + instance uses tail policy plugin. + + Object plugin is either taken from id in stat-data or guessed from + i_mode bits. Once it is established we ask it to install its + subordinate plugins, by looking again in stat-data or inheriting them + from parent. +*/ +/* How new inode is initialized during ->read_inode(): + 1 read stat-data and initialize inode fields: i_size, i_mode, + i_generation, capabilities etc. + 2 read plugin id from stat data or try to guess plugin id + from inode->i_mode bits if plugin id is missing. + 3 Call ->init_inode() method of stat-data plugin to initialise inode fields. + +NIKITA-FIXME-HANS: can you say a little about 1 being done before 3? What if stat data does contain i_size, etc., due to it being an unusual plugin? + 4 Call ->activate() method of object's plugin. Plugin is either read from + stat-data or guessed from mode bits + 5 Call ->inherit() method of object plugin to inherit as yet +NIKITA-FIXME-HANS: are you missing an "un" here? +initialized + plugins from parent. + + Easy induction proves that on last step all plugins of inode would be + initialized. + + When creating new object: + 1 obtain object plugin id (see next period) +NIKITA-FIXME-HANS: period? + 2 ->install() this plugin + 3 ->inherit() the rest from the parent + +*/ +/* We need some examples of creating an object with default and + non-default plugin ids. Nikita, please create them. 
+*/ + +#include "../forward.h" +#include "../debug.h" +#include "../key.h" +#include "../kassign.h" +#include "../coord.h" +#include "../seal.h" +#include "plugin_header.h" +#include "item/static_stat.h" +#include "file/file.h" +#include "file/pseudo.h" +#include "symlink.h" +#include "dir/dir.h" +#include "item/item.h" +#include "plugin.h" +#include "object.h" +#include "../znode.h" +#include "../tap.h" +#include "../tree.h" +#include "../vfs_ops.h" +#include "../inode.h" +#include "../super.h" +#include "../reiser4.h" +#include "../safe_link.h" + +#include +#include +#include +#include +#include /* security_inode_delete() */ +#include /* wake_up_inode() */ +#include +#include + +/* helper function to print errors */ +static void +key_warning(const reiser4_key * key /* key to print */, + const struct inode *inode, + int code /* error code to print */) +{ + assert("nikita-716", key != NULL); + + if (code != -ENOMEM) { + warning("nikita-717", "Error for inode %llu (%i)", + (unsigned long long)get_key_objectid(key), code); + print_key("for key", key); + } +} + +/* NIKITA-FIXME-HANS: perhaps this function belongs in another file? 
*/ +#if REISER4_DEBUG +static void +check_inode_seal(const struct inode *inode, + const coord_t *coord, const reiser4_key *key) +{ + reiser4_key unit_key; + + unit_key_by_coord(coord, &unit_key); + assert("nikita-2752", + WITH_DATA_RET(coord->node, 1, keyeq(key, &unit_key))); + assert("nikita-2753", get_inode_oid(inode) == get_key_objectid(key)); +} + +static void +check_sd_coord(coord_t *coord, const reiser4_key *key) +{ + reiser4_key ukey; + + coord_clear_iplug(coord); + if (zload(coord->node)) + return; + + if (!coord_is_existing_unit(coord) || + !item_plugin_by_coord(coord) || + !keyeq(unit_key_by_coord(coord, &ukey), key) || + (znode_get_level(coord->node) != LEAF_LEVEL) || + !item_is_statdata(coord)) { + warning("nikita-1901", "Conspicuous seal"); + print_key("key", key); + print_coord("coord", coord, 1); + impossible("nikita-2877", "no way"); + } + zrelse(coord->node); +} + +#else +#define check_inode_seal(inode, coord, key) noop +#define check_sd_coord(coord, key) noop +#endif + +/* find sd of inode in a tree, deal with errors */ +reiser4_internal int +lookup_sd(struct inode *inode /* inode to look sd for */ , + znode_lock_mode lock_mode /* lock mode */ , + coord_t * coord /* resulting coord */ , + lock_handle * lh /* resulting lock handle */ , + const reiser4_key * key /* resulting key */, + int silent) +{ + int result; + __u32 flags; + + assert("nikita-1692", inode != NULL); + assert("nikita-1693", coord != NULL); + assert("nikita-1694", key != NULL); + + /* look for the object's stat data in a tree. + This returns in "node" pointer to a locked znode and in "pos" + position of an item found in node. Both are only valid if + coord_found is returned. */ + flags = (lock_mode == ZNODE_WRITE_LOCK) ? CBK_FOR_INSERT : 0; + flags |= CBK_UNIQUE; + /* + * traverse tree to find stat data. We cannot use vroot here, because + * it only covers _body_ of the file, and stat data don't belong + * there. 
+ */ + result = coord_by_key(tree_by_inode(inode), + key, + coord, + lh, + lock_mode, + FIND_EXACT, + LEAF_LEVEL, + LEAF_LEVEL, + flags, + 0); + if (REISER4_DEBUG && result == 0) + check_sd_coord(coord, key); + + if (result != 0 && !silent) + key_warning(key, inode, result); + return result; +} + +/* insert new stat-data into tree. Called with inode state + locked. Return inode state locked. */ +static int +insert_new_sd(struct inode *inode /* inode to create sd for */ ) +{ + int result; + reiser4_key key; + coord_t coord; + reiser4_item_data data; + char *area; + reiser4_inode *ref; + lock_handle lh; + oid_t oid; + + assert("nikita-723", inode != NULL); + assert("nikita-3406", inode_get_flag(inode, REISER4_NO_SD)); + + ref = reiser4_inode_data(inode); + spin_lock_inode(inode); + + /* + * prepare specification of new item to be inserted + */ + + data.iplug = inode_sd_plugin(inode); + data.length = data.iplug->s.sd.save_len(inode); + spin_unlock_inode(inode); + + data.data = NULL; + data.user = 0; +/* could be optimized for case where there is only one node format in + * use in the filesystem, probably there are lots of such + * places we could optimize for only one node layout.... -Hans */ + if (data.length > tree_by_inode(inode)->nplug->max_item_size()) { + /* This is silly check, but we don't know actual node where + insertion will go into. */ + return RETERR(-ENAMETOOLONG); + } + oid = oid_allocate(inode->i_sb); +/* NIKITA-FIXME-HANS: what is your opinion on whether this error check should be encapsulated into oid_allocate? 
*/ + if (oid == ABSOLUTE_MAX_OID) + return RETERR(-EOVERFLOW); + + set_inode_oid(inode, oid); + + coord_init_zero(&coord); + init_lh(&lh); + + result = insert_by_key(tree_by_inode(inode), + build_sd_key(inode, &key), + &data, + &coord, + &lh, + /* stat data lives on a leaf level */ + LEAF_LEVEL, + CBK_UNIQUE); + + /* we don't want to re-check that somebody didn't insert + stat-data while we were doing io, because if it did, + insert_by_key() returned error. */ + /* but what _is_ possible is that plugin for inode's stat-data, + list of non-standard plugins or their state would change + during io, so that stat-data wouldn't fit into sd. To avoid + this race we keep inode_state lock. This lock has to be + taken each time you access inode in a way that would cause + changes in sd size: changing plugins etc. + */ + + if (result == IBK_INSERT_OK) { + coord_clear_iplug(&coord); + result = zload(coord.node); + if (result == 0) { + /* have we really inserted stat data? */ + assert("nikita-725", item_is_statdata(&coord)); + + /* inode was just created. It is inserted into hash + table, but no directory entry was yet inserted into + parent. So, inode is inaccessible through + ->lookup(). All places that directly grab inode + from hash-table (like old knfsd), should check + IMMUTABLE flag that is set by common_create_child. + */ + assert("nikita-3240", data.iplug != NULL); + assert("nikita-3241", data.iplug->s.sd.save != NULL); + area = item_body_by_coord(&coord); + result = data.iplug->s.sd.save(inode, &area); + znode_make_dirty(coord.node); + if (result == 0) { + /* object has stat-data now */ + inode_clr_flag(inode, REISER4_NO_SD); + inode_set_flag(inode, REISER4_SDLEN_KNOWN); + /* initialise stat-data seal */ + seal_init(&ref->sd_seal, &coord, &key); + ref->sd_coord = coord; + check_inode_seal(inode, &coord, &key); + } else if (result != -ENOMEM) + /* + * convert any other error code to -EIO to + * avoid confusing user level with unexpected + * errors. 
+ */ + result = RETERR(-EIO); + zrelse(coord.node); + } + } + done_lh(&lh); + + if (result != 0) + key_warning(&key, inode, result); + else + oid_count_allocated(); + + return result; +} + + +/* update stat-data at @coord */ +static int +update_sd_at(struct inode * inode, coord_t * coord, reiser4_key * key, + lock_handle * lh) +{ + int result; + reiser4_item_data data; + char *area; + reiser4_inode *state; + znode *loaded; + + state = reiser4_inode_data(inode); + + coord_clear_iplug(coord); + result = zload(coord->node); + if (result != 0) + return result; + loaded = coord->node; + + spin_lock_inode(inode); + assert("nikita-728", inode_sd_plugin(inode) != NULL); + data.iplug = inode_sd_plugin(inode); + + /* if inode has non-standard plugins, add appropriate stat data + * extension */ + if (state->plugin_mask != 0) + inode_set_extension(inode, PLUGIN_STAT); + + /* data.length is how much space to add to (or remove + from if negative) sd */ + if (!inode_get_flag(inode, REISER4_SDLEN_KNOWN)) { + /* recalculate stat-data length */ + data.length = + data.iplug->s.sd.save_len(inode) - + item_length_by_coord(coord); + inode_set_flag(inode, REISER4_SDLEN_KNOWN); + } else + data.length = 0; + spin_unlock_inode(inode); + + /* if on-disk stat data is of different length than required + for this inode, resize it */ + if (data.length != 0) { + data.data = NULL; + data.user = 0; + + /* insertion code requires that insertion point (coord) was + * between units. */ + coord->between = AFTER_UNIT; + result = resize_item(coord, + &data, key, lh, COPI_DONT_SHIFT_LEFT); + if (result != 0) { + key_warning(key, inode, result); + zrelse(loaded); + return result; + } + if (loaded != coord->node) { + /* resize_item moved coord to another node. 
Zload it */ + zrelse(loaded); + coord_clear_iplug(coord); + result = zload(coord->node); + if (result != 0) + return result; + loaded = coord->node; + } + } + + area = item_body_by_coord(coord); + spin_lock_inode(inode); + result = data.iplug->s.sd.save(inode, &area); + znode_make_dirty(coord->node); + + /* re-initialise stat-data seal */ + + /* + * coord.between was possibly skewed from AT_UNIT when stat-data size + * was changed and new extensions were pasted into item. + */ + coord->between = AT_UNIT; + seal_init(&state->sd_seal, coord, key); + state->sd_coord = *coord; + spin_unlock_inode(inode); + check_inode_seal(inode, coord, key); + zrelse(loaded); + return result; +} + +reiser4_internal int +locate_inode_sd(struct inode *inode, + reiser4_key *key, + coord_t *coord, + lock_handle *lh) +{ + reiser4_inode *state; + seal_t seal; + int result; + + assert("nikita-3483", inode != NULL); + + state = reiser4_inode_data(inode); + spin_lock_inode(inode); + *coord = state->sd_coord; + coord_clear_iplug(coord); + seal = state->sd_seal; + spin_unlock_inode(inode); + + build_sd_key(inode, key); + if (seal_is_set(&seal)) { + /* first, try to use seal */ + result = seal_validate(&seal, + coord, + key, + lh, + ZNODE_WRITE_LOCK, + ZNODE_LOCK_LOPRI); + if (result == 0) + check_sd_coord(coord, key); + } else + result = -E_REPEAT; + + if (result != 0) { + coord_init_zero(coord); + result = lookup_sd(inode, ZNODE_WRITE_LOCK, coord, lh, key, 0); + } + return result; +} + +/* Update existing stat-data in a tree. Called with inode state locked. Return + inode state locked. */ +static int +update_sd(struct inode *inode /* inode to update sd for */ ) +{ + int result; + reiser4_key key; + coord_t coord; + lock_handle lh; + + assert("nikita-726", inode != NULL); + + /* no stat-data, nothing to update?! 
*/ + assert("nikita-3482", !inode_get_flag(inode, REISER4_NO_SD)); + + init_lh(&lh); + + result = locate_inode_sd(inode, &key, &coord, &lh); + if (result == 0) + result = update_sd_at(inode, &coord, &key, &lh); + done_lh(&lh); + + return result; +} +/* NIKITA-FIXME-HANS: the distinction between writing and updating made in the function names seems muddled, please adopt a better function naming strategy */ +/* save object's stat-data to disk */ +reiser4_internal int +write_sd_by_inode_common(struct inode *inode /* object to save */) +{ + int result; + + assert("nikita-730", inode != NULL); + + mark_inode_update(inode, 1); + + if (inode_get_flag(inode, REISER4_NO_SD)) + /* object doesn't have stat-data yet */ + result = insert_new_sd(inode); + else + result = update_sd(inode); + if (result != 0 && result != -ENAMETOOLONG && result != -ENOMEM) + /* Don't issue warnings about "name is too long" or out-of-memory failures */ + warning("nikita-2221", "Failed to save sd for %llu: %i", + (unsigned long long)get_inode_oid(inode), result); + return result; +} + +/* checks whether yet another hard link to this object can be added */ +static int +can_add_link_common(const struct inode *object /* object to check */ ) +{ + assert("nikita-732", object != NULL); + + /* inode->i_nlink is unsigned int, so just check for integer + * overflow */ + return object->i_nlink + 1 != 0; +} + +/* remove object stat data.
Space for it must be reserved by caller before */ +static int +common_object_delete_no_reserve(struct inode *inode /* object to remove */) +{ + int result; + + assert("nikita-1477", inode != NULL); + + if (!inode_get_flag(inode, REISER4_NO_SD)) { + reiser4_key sd_key; + + DQUOT_FREE_INODE(inode); + DQUOT_DROP(inode); + + build_sd_key(inode, &sd_key); + result = cut_tree(tree_by_inode(inode), &sd_key, &sd_key, NULL, 0); + if (result == 0) { + inode_set_flag(inode, REISER4_NO_SD); + result = oid_release(inode->i_sb, get_inode_oid(inode)); + if (result == 0) { + oid_count_released(); + + result = safe_link_del(inode, SAFE_UNLINK); + } + } + } else + result = 0; + return result; +} + +/* delete object stat-data. This is to be used when file deletion turns into stat data removal */ +reiser4_internal int +delete_object(struct inode *inode /* object to remove */) +{ + int result; + + assert("nikita-1477", inode != NULL); + /* FIXME: if file body deletion failed (i/o error, for instance), + inode->i_size can be != 0 here */ + assert("nikita-3420", inode->i_size == 0 || S_ISLNK(inode->i_mode)); + assert("nikita-3421", inode->i_nlink == 0); + + if (!inode_get_flag(inode, REISER4_NO_SD)) { + reiser4_block_nr reserve; + + /* grab space which is needed to remove 2 items from the tree: + stat data and safe-link */ + reserve = 2 * estimate_one_item_removal(tree_by_inode(inode)); + if (reiser4_grab_space_force(reserve, + BA_RESERVED | BA_CAN_COMMIT)) + return RETERR(-ENOSPC); + result = common_object_delete_no_reserve(inode); + } else + result = 0; + return result; +} + +/* common directory consists of two items: stat data and one item containing "." and ".." 
*/ +static int delete_directory_common(struct inode *inode) +{ + int result; + dir_plugin *dplug; + + dplug = inode_dir_plugin(inode); + assert("vs-1101", dplug && dplug->done); + + /* grab space enough for removing two items */ + if (reiser4_grab_space(2 * estimate_one_item_removal(tree_by_inode(inode)), BA_RESERVED | BA_CAN_COMMIT)) + return RETERR(-ENOSPC); + + result = dplug->done(inode); + if (!result) + result = common_object_delete_no_reserve(inode); + all_grabbed2free(); + return result; +} + +/* ->set_plug_in_inode() default method. */ +static int +set_plug_in_inode_common(struct inode *object /* inode to set plugin on */ , + struct inode *parent /* parent object */ , + reiser4_object_create_data * data /* creational + * data */ ) +{ + __u64 mask; + + object->i_mode = data->mode; + /* this should be plugin decision */ + object->i_uid = current->fsuid; + object->i_mtime = object->i_atime = object->i_ctime = CURRENT_TIME; +/* NIKITA-FIXME-HANS: which is defined as what where? */ + /* support for BSD style group-id assignment. */ + if (reiser4_is_set(object->i_sb, REISER4_BSD_GID)) + object->i_gid = parent->i_gid; + else if (parent->i_mode & S_ISGID) { + /* parent directory has sguid bit */ + object->i_gid = parent->i_gid; + if (S_ISDIR(object->i_mode)) + /* sguid is inherited by sub-directories */ + object->i_mode |= S_ISGID; + } else + object->i_gid = current->fsgid; + + /* this object doesn't have stat-data yet */ + inode_set_flag(object, REISER4_NO_SD); + /* setup inode and file-operations for this inode */ + setup_inode_ops(object, data); + object->i_nlink = 0; + seal_init(&reiser4_inode_data(object)->sd_seal, NULL, NULL); + mask = (1 << UNIX_STAT) | (1 << LIGHT_WEIGHT_STAT); + if (!reiser4_is_set(object->i_sb, REISER4_32_BIT_TIMES)) + mask |= (1 << LARGE_TIMES_STAT); + + reiser4_inode_data(object)->extmask = mask; + return 0; +} + +/* Determine object plugin for @inode based on i_mode. 
+ + Many objects in reiser4 file system are controlled by standard object + plugins that emulate traditional unix objects: unix file, directory, symlink, fifo, and so on. + + For such files we don't explicitly store plugin id in object stat + data. Rather required plugin is guessed from mode bits, where file "type" + is encoded (see stat(2)). +*/ +reiser4_internal int +guess_plugin_by_mode(struct inode *inode /* object to guess plugins + * for */ ) +{ + int fplug_id; + int dplug_id; + reiser4_inode *info; + + assert("nikita-736", inode != NULL); + + dplug_id = fplug_id = -1; + + switch (inode->i_mode & S_IFMT) { + case S_IFSOCK: + case S_IFBLK: + case S_IFCHR: + case S_IFIFO: + fplug_id = SPECIAL_FILE_PLUGIN_ID; + break; + case S_IFLNK: + fplug_id = SYMLINK_FILE_PLUGIN_ID; + break; + case S_IFDIR: + fplug_id = DIRECTORY_FILE_PLUGIN_ID; + dplug_id = HASHED_DIR_PLUGIN_ID; + break; + default: + warning("nikita-737", "wrong file mode: %o", inode->i_mode); + return RETERR(-EIO); + case S_IFREG: + fplug_id = UNIX_FILE_PLUGIN_ID; + break; + } + info = reiser4_inode_data(inode); + plugin_set_file(&info->pset, + (fplug_id >= 0) ? file_plugin_by_id(fplug_id) : NULL); + plugin_set_dir(&info->pset, + (dplug_id >= 0) ? dir_plugin_by_id(dplug_id) : NULL); + return 0; +} + +/* this common implementation of create estimation function may be used when object creation involves insertion of one item + (usually stat data) into tree */ +static reiser4_block_nr estimate_create_file_common(struct inode *object) +{ + return estimate_one_insert_item(tree_by_inode(object)); +} + +/* this common implementation of create directory estimation function may be used when directory creation involves + insertion of two items (usually stat data and item containing "."
and "..") into tree */ +static reiser4_block_nr estimate_create_dir_common(struct inode *object) +{ + return 2 * estimate_one_insert_item(tree_by_inode(object)); +} + +/* ->create method of object plugin */ +static int +create_common(struct inode *object, struct inode *parent UNUSED_ARG, + reiser4_object_create_data * data UNUSED_ARG) +{ + reiser4_block_nr reserve; + assert("nikita-744", object != NULL); + assert("nikita-745", parent != NULL); + assert("nikita-747", data != NULL); + assert("nikita-748", inode_get_flag(object, REISER4_NO_SD)); + + reserve = estimate_create_file_common(object); + if (reiser4_grab_space(reserve, BA_CAN_COMMIT)) + return RETERR(-ENOSPC); + return write_sd_by_inode_common(object); +} + +/* standard implementation of ->owns_item() plugin method: compare objectids + of keys in inode and coord */ +reiser4_internal int +owns_item_common(const struct inode *inode /* object to check + * against */ , + const coord_t * coord /* coord to check */ ) +{ + reiser4_key item_key; + reiser4_key file_key; + + assert("nikita-760", inode != NULL); + assert("nikita-761", coord != NULL); + + return /*coord_is_in_node( coord ) && */ + coord_is_existing_item(coord) && + (get_key_objectid(build_sd_key(inode, &file_key)) == get_key_objectid(item_key_by_coord(coord, &item_key))); +} + +/* @count bytes of flow @f got written, update correspondingly f->length, + f->data and f->key */ +reiser4_internal void +move_flow_forward(flow_t * f, unsigned count) +{ + if (f->data) + f->data += count; + f->length -= count; + set_key_offset(&f->key, get_key_offset(&f->key) + count); +} + +/* default ->add_link() method of file plugin */ +static int +add_link_common(struct inode *object, struct inode *parent UNUSED_ARG) +{ + /* + * increment ->i_nlink and update ->i_ctime + */ + + INODE_INC_FIELD(object, i_nlink); + object->i_ctime = CURRENT_TIME; + return 0; +} + +/* default ->rem_link() method of file plugin */ +static int +rem_link_common(struct inode *object, struct inode 
*parent UNUSED_ARG) +{ + assert("nikita-2021", object != NULL); + assert("nikita-2163", object->i_nlink > 0); + + /* + * decrement ->i_nlink and update ->i_ctime + */ + + INODE_DEC_FIELD(object, i_nlink); + object->i_ctime = CURRENT_TIME; + return 0; +} + +/* ->not_linked() method for file plugins */ +static int +not_linked_common(const struct inode *inode) +{ + assert("nikita-2007", inode != NULL); + return (inode->i_nlink == 0); +} + +/* ->not_linked() method the for directory file plugin */ +static int +not_linked_dir(const struct inode *inode) +{ + assert("nikita-2008", inode != NULL); + /* one link from dot */ + return (inode->i_nlink == 1); +} + +/* ->adjust_to_parent() method for regular files */ +static int +adjust_to_parent_common(struct inode *object /* new object */ , + struct inode *parent /* parent directory */ , + struct inode *root /* root directory */ ) +{ + assert("nikita-2165", object != NULL); + if (parent == NULL) + parent = root; + assert("nikita-2069", parent != NULL); + + /* + * inherit missing plugins from parent + */ + + grab_plugin(object, parent, PSET_FILE); + grab_plugin(object, parent, PSET_SD); + grab_plugin(object, parent, PSET_FORMATTING); + grab_plugin(object, parent, PSET_PERM); + return 0; +} + +/* ->adjust_to_parent() method for directory files */ +static int +adjust_to_parent_dir(struct inode *object /* new object */ , + struct inode *parent /* parent directory */ , + struct inode *root /* root directory */ ) +{ + int result = 0; + pset_member memb; + + assert("nikita-2166", object != NULL); + if (parent == NULL) + parent = root; + assert("nikita-2167", parent != NULL); + + /* + * inherit missing plugins from parent + */ + for (memb = 0; memb < PSET_LAST; ++ memb) { + result = grab_plugin(object, parent, memb); + if (result != 0) + break; + } + return result; +} + +/* simplest implementation of ->getattr() method. Completely static. 
*/ +static int +getattr_common(struct vfsmount *mnt UNUSED_ARG, struct dentry *dentry, struct kstat *stat) +{ + struct inode *obj; + + assert("nikita-2298", dentry != NULL); + assert("nikita-2299", stat != NULL); + assert("nikita-2300", dentry->d_inode != NULL); + + obj = dentry->d_inode; + + stat->dev = obj->i_sb->s_dev; + stat->ino = oid_to_uino(get_inode_oid(obj)); + stat->mode = obj->i_mode; + /* don't confuse userland with huge nlink. This is not entirely + * correct, because nlink_t is not necessarily 16 bit signed. */ + stat->nlink = min(obj->i_nlink, (typeof(obj->i_nlink))0x7fff); + stat->uid = obj->i_uid; + stat->gid = obj->i_gid; + stat->rdev = obj->i_rdev; + stat->atime = obj->i_atime; + stat->mtime = obj->i_mtime; + stat->ctime = obj->i_ctime; + stat->size = obj->i_size; + stat->blocks = (inode_get_bytes(obj) + VFS_BLKSIZE - 1) >> VFS_BLKSIZE_BITS; + /* "preferred" blocksize for efficient file system I/O */ + stat->blksize = get_super_private(obj->i_sb)->optimal_io_size; + + return 0; +} + +/* plugin->u.file.release */ +static int +release_dir(struct inode *inode, struct file *file) +{ + /* this is called when directory file descriptor is closed. */ + spin_lock_inode(inode); + /* remove directory from readdir list. See comment before + * readdir_common() for details.
*/ + if (file->private_data != NULL) + readdir_list_remove_clean(reiser4_get_file_fsdata(file)); + spin_unlock_inode(inode); + return 0; +} + +/* default implementation of ->bind() method of file plugin */ +static int +bind_common(struct inode *child UNUSED_ARG, struct inode *parent UNUSED_ARG) +{ + return 0; +} + +#define detach_common bind_common +#define cannot ((void *)bind_common) + +static int +detach_dir(struct inode *child, struct inode *parent) +{ + dir_plugin *dplug; + + dplug = inode_dir_plugin(child); + assert("nikita-2883", dplug != NULL); + assert("nikita-2884", dplug->detach != NULL); + return dplug->detach(child, parent); +} + + +/* this common implementation of update estimation function may be used when stat data update does not do more than + inserting a unit into a stat data item which is probably true for most cases */ +reiser4_internal reiser4_block_nr +estimate_update_common(const struct inode *inode) +{ + return estimate_one_insert_into_item(tree_by_inode(inode)); +} + +static reiser4_block_nr +estimate_unlink_common(struct inode *object UNUSED_ARG, + struct inode *parent UNUSED_ARG) +{ + return 0; +} + +static reiser4_block_nr +estimate_unlink_dir_common(struct inode *object, struct inode *parent) +{ + dir_plugin *dplug; + + dplug = inode_dir_plugin(object); + assert("nikita-2888", dplug != NULL); + assert("nikita-2887", dplug->estimate.unlink != NULL); + return dplug->estimate.unlink(object, parent); +} + +/* implementation of ->bind() method for file plugin of directory file */ +static int +bind_dir(struct inode *child, struct inode *parent) +{ + dir_plugin *dplug; + + dplug = inode_dir_plugin(child); + assert("nikita-2646", dplug != NULL); + return dplug->attach(child, parent); +} + +static int +setattr_reserve_common(reiser4_tree *tree) +{ + assert("vs-1096", is_grab_enabled(get_current_context())); + return reiser4_grab_space(estimate_one_insert_into_item(tree), + BA_CAN_COMMIT); +} + +/* ->setattr() method. 
This is called when inode attribute (including + * ->i_size) is modified. */ +reiser4_internal int +setattr_common(struct inode *inode /* Object to change attributes */, + struct iattr *attr /* change description */) +{ + int result; + + assert("nikita-3119", !(attr->ia_valid & ATTR_SIZE)); + + result = inode_change_ok(inode, attr); + if (result) + return result; + + /* + * grab disk space and call standard inode_setattr(). + */ + result = setattr_reserve_common(tree_by_inode(inode)); + if (!result) { + if ((attr->ia_valid & ATTR_UID && attr->ia_uid != inode->i_uid) || + (attr->ia_valid & ATTR_GID && attr->ia_gid != inode->i_gid)) { + result = DQUOT_TRANSFER(inode, attr) ? -EDQUOT : 0; + if (result) { + all_grabbed2free(); + return result; + } + } + result = inode_setattr(inode, attr); + if (!result) + reiser4_update_sd(inode); + } + + all_grabbed2free(); + return result; +} + +/* doesn't seem to be exported in headers. */ +extern spinlock_t inode_lock; + +/* ->delete_inode() method. This is called by + * iput()->iput_final()->drop_inode() when last reference to inode is released + * and inode has no names. */ +static void delete_inode_common(struct inode *object) +{ + /* create context here. + * + * removal of inode from the hash table (done at the very beginning of + * generic_delete_inode(), truncate of pages, and removal of file's + * extents has to be performed in the same atom. Otherwise, it may so + * happen, that twig node with unallocated extent will be flushed to + * the disk. 
+ */ + reiser4_context ctx; + + /* + * FIXME: this resembles generic_delete_inode + */ + list_del_init(&object->i_list); + list_del_init(&object->i_sb_list); + object->i_state |= I_FREEING; + inodes_stat.nr_inodes--; + spin_unlock(&inode_lock); + + init_context(&ctx, object->i_sb); + + kill_cursors(object); + + if (!is_bad_inode(object)) { + file_plugin *fplug; + + /* truncate object body */ + fplug = inode_file_plugin(object); + if (fplug->pre_delete != NULL && fplug->pre_delete(object) != 0) + warning("vs-1216", "Failed to delete file body %llu", + (unsigned long long)get_inode_oid(object)); + else + assert("vs-1430", + reiser4_inode_data(object)->anonymous_eflushed == 0 && + reiser4_inode_data(object)->captured_eflushed == 0); + } + + if (object->i_data.nrpages) { + warning("vs-1434", "nrpages %ld\n", object->i_data.nrpages); + truncate_inode_pages(&object->i_data, 0); + } + security_inode_delete(object); + if (!is_bad_inode(object)) + DQUOT_INIT(object); + + object->i_sb->s_op->delete_inode(object); + + spin_lock(&inode_lock); + hlist_del_init(&object->i_hash); + spin_unlock(&inode_lock); + wake_up_inode(object); + if (object->i_state != I_CLEAR) + BUG(); + destroy_inode(object); + reiser4_exit_context(&ctx); +} + +/* + * ->forget_inode() method. Called by iput()->iput_final()->drop_inode() when + * last reference to inode with names is released + */ +static void forget_inode_common(struct inode *object) +{ + generic_forget_inode(object); +} + +/* ->drop_inode() method. Called by iput()->iput_final() when last reference + * to inode is released */ +static void drop_common(struct inode * object) +{ + file_plugin *fplug; + + assert("nikita-2643", object != NULL); + + /* -not- creating context in this method, because it is frequently + called and all existing ->not_linked() methods are one liners. 
*/ + + fplug = inode_file_plugin(object); + /* fplug is NULL for fake inode */ + if (fplug != NULL && fplug->not_linked(object)) { + assert("nikita-3231", fplug->delete_inode != NULL); + fplug->delete_inode(object); + } else { + assert("nikita-3232", fplug->forget_inode != NULL); + fplug->forget_inode(object); + } +} + +static ssize_t +isdir(void) +{ + return RETERR(-EISDIR); +} + +#define eisdir ((void *)isdir) + +static ssize_t +perm(void) +{ + return RETERR(-EPERM); +} + +#define eperm ((void *)perm) + +static int +can_rem_dir(const struct inode * inode) +{ + /* is_dir_empty() returns 0 if dir is empty */ + return !is_dir_empty(inode); +} + +static int +process_truncate(struct inode *inode, __u64 size) +{ + int result; + struct iattr attr; + file_plugin *fplug; + reiser4_context ctx; + + init_context(&ctx, inode->i_sb); + + attr.ia_size = size; + attr.ia_valid = ATTR_SIZE | ATTR_CTIME; + fplug = inode_file_plugin(inode); + + down(&inode->i_sem); + assert("vs-1704", get_current_context()->trans->atom == NULL); + result = fplug->setattr(inode, &attr); + up(&inode->i_sem); + + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); + + return result; +} + +static int +safelink_common(struct inode *object, reiser4_safe_link_t link, __u64 value) +{ + int result; + + assert("vs-1705", get_current_context()->trans->atom == NULL); + if (link == SAFE_UNLINK) + /* nothing to do.
iput() in the caller (process_safelink) will + * finish with file */ + result = 0; + else if (link == SAFE_TRUNCATE) + result = process_truncate(object, value); + else { + warning("nikita-3438", "Unrecognized safe-link type: %i", link); + result = RETERR(-EIO); + } + return result; +} + +reiser4_internal int prepare_write_common ( + struct file * file, struct page * page, unsigned from, unsigned to) +{ + int result; + file_plugin *fplug; + struct inode *inode; + + assert("umka-3099", file != NULL); + assert("umka-3100", page != NULL); + assert("umka-3095", PageLocked(page)); + + if (to - from == PAGE_CACHE_SIZE || PageUptodate(page)) + return 0; + + inode = page->mapping->host; + fplug = inode_file_plugin(inode); + + if (fplug->readpage == NULL) + return RETERR(-EINVAL); + + result = fplug->readpage(file, page); + if (result != 0) { + SetPageError(page); + ClearPageUptodate(page); + /* All reiser4 readpage() implementations should return the + * page locked in case of error. */ + assert("nikita-3472", PageLocked(page)); + } else { + /* + * ->readpage() either: + * + * 1. starts IO against @page. @page is locked for IO in + * this case. + * + * 2. doesn't start IO. @page is unlocked. + * + * In either case, page should be locked. + */ + lock_page(page); + /* + * IO (if any) is completed at this point. Check for IO + * errors. 
+ */ + if (!PageUptodate(page)) + result = RETERR(-EIO); + } + assert("umka-3098", PageLocked(page)); + return result; +} + +reiser4_internal int +key_by_inode_and_offset_common(struct inode *inode, loff_t off, reiser4_key *key) +{ + reiser4_key_init(key); + set_key_locality(key, reiser4_inode_data(inode)->locality_id); + set_key_ordering(key, get_inode_ordering(inode)); + set_key_objectid(key, get_inode_oid(inode));/*FIXME: inode->i_ino */ + set_key_type(key, KEY_BODY_MINOR); + set_key_offset(key, (__u64) off); + return 0; +} + +/* default implementation of ->sync() method: commit all transactions */ +static int +sync_common(struct inode *inode, int datasync) +{ + return txnmgr_force_commit_all(inode->i_sb, 0); +} + +static int +wire_size_common(struct inode *inode) +{ + return inode_onwire_size(inode); +} + +static char * +wire_write_common(struct inode *inode, char *start) +{ + return build_inode_onwire(inode, start); +} + +static char * +wire_read_common(char *addr, reiser4_object_on_wire *obj) +{ + return extract_obj_key_id_from_onwire(addr, &obj->u.std.key_id); +} + +static void +wire_done_common(reiser4_object_on_wire *obj) +{ + /* nothing to do */ +} + +static struct dentry * +wire_get_common(struct super_block *sb, reiser4_object_on_wire *obj) +{ + struct inode *inode; + struct dentry *dentry; + reiser4_key key; + + extract_key_from_id(&obj->u.std.key_id, &key); + inode = reiser4_iget(sb, &key, 1); + if (!IS_ERR(inode)) { + reiser4_iget_complete(inode); + dentry = d_alloc_anon(inode); + if (dentry == NULL) { + iput(inode); + dentry = ERR_PTR(-ENOMEM); + } else + dentry->d_op = &get_super_private(sb)->ops.dentry; + } else if (PTR_ERR(inode) == -ENOENT) + /* + * inode wasn't found at the key encoded in the file + * handle. Hence, file handle is stale. 
+ */ + dentry = ERR_PTR(RETERR(-ESTALE)); + else + dentry = (void *)inode; + return dentry; +} + + +static int +change_file(struct inode * inode, reiser4_plugin * plugin) +{ + /* cannot change object plugin of already existing object */ + return RETERR(-EINVAL); +} + +static reiser4_plugin_ops file_plugin_ops = { + .init = NULL, + .load = NULL, + .save_len = NULL, + .save = NULL, + .change = change_file +}; + + +/* + * Definitions of object plugins. + */ + +file_plugin file_plugins[LAST_FILE_PLUGIN_ID] = { + [UNIX_FILE_PLUGIN_ID] = { + .h = { + .type_id = REISER4_FILE_PLUGIN_TYPE, + .id = UNIX_FILE_PLUGIN_ID, + .pops = &file_plugin_ops, + .label = "reg", + .desc = "regular file", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .open = NULL, + .truncate = truncate_unix_file, + .write_sd_by_inode = write_sd_by_inode_common, + .capturepage = capturepage_unix_file, + .readpage = readpage_unix_file, + .capture = capture_unix_file, + .read = read_unix_file, + .write = write_unix_file, + .release = release_unix_file, + .ioctl = ioctl_unix_file, + .mmap = mmap_unix_file, + .get_block = get_block_unix_file, + .flow_by_inode = flow_by_inode_unix_file, + .key_by_inode = key_by_inode_unix_file, + .set_plug_in_inode = set_plug_in_inode_common, + .adjust_to_parent = adjust_to_parent_common, + .create = create_common, + .delete = delete_object, + .sync = sync_unix_file, + .add_link = add_link_common, + .rem_link = rem_link_common, + .owns_item = owns_item_unix_file, + .can_add_link = can_add_link_common, + .can_rem_link = NULL, + .not_linked = not_linked_common, + .setattr = setattr_unix_file, + .getattr = getattr_common, + .seek = NULL, + .detach = detach_common, + .bind = bind_common, + .safelink = safelink_common, + .estimate = { + .create = estimate_create_file_common, + .update = estimate_update_common, + .unlink = estimate_unlink_common + }, + .wire = { + .write = wire_write_common, + .read = wire_read_common, + .get = wire_get_common, + .size = wire_size_common, + .done = 
wire_done_common + }, + .init_inode_data = init_inode_data_unix_file, + .pre_delete = pre_delete_unix_file, + .cut_tree_worker = cut_tree_worker_common, + .drop = drop_common, + .delete_inode = delete_inode_common, + .destroy_inode = NULL, + .forget_inode = forget_inode_common, + .sendfile = sendfile_unix_file, + .prepare_write = prepare_write_unix_file + }, + [DIRECTORY_FILE_PLUGIN_ID] = { + .h = { + .type_id = REISER4_FILE_PLUGIN_TYPE, + .id = DIRECTORY_FILE_PLUGIN_ID, + .pops = &file_plugin_ops, + .label = "dir", + .desc = "directory", + .linkage = TYPE_SAFE_LIST_LINK_ZERO}, + .open = NULL, + .truncate = eisdir, + .write_sd_by_inode = write_sd_by_inode_common, + .capturepage = NULL, + .readpage = eisdir, + .capture = NULL, + .read = eisdir, + .write = eisdir, + .release = release_dir, + .ioctl = eisdir, + .mmap = eisdir, + .get_block = NULL, + .flow_by_inode = NULL, + .key_by_inode = NULL, + .set_plug_in_inode = set_plug_in_inode_common, + .adjust_to_parent = adjust_to_parent_dir, + .create = create_common, + .delete = delete_directory_common, + .sync = sync_common, + .add_link = add_link_common, + .rem_link = rem_link_common, + .owns_item = owns_item_hashed, + .can_add_link = can_add_link_common, + .can_rem_link = can_rem_dir, + .not_linked = not_linked_dir, + .setattr = setattr_common, + .getattr = getattr_common, + .seek = seek_dir, + .detach = detach_dir, + .bind = bind_dir, + .safelink = safelink_common, + .estimate = { + .create = estimate_create_dir_common, + .update = estimate_update_common, + .unlink = estimate_unlink_dir_common + }, + .wire = { + .write = wire_write_common, + .read = wire_read_common, + .get = wire_get_common, + .size = wire_size_common, + .done = wire_done_common + }, + .init_inode_data = init_inode_ordering, + .pre_delete = NULL, + .cut_tree_worker = cut_tree_worker_common, + .drop = drop_common, + .delete_inode = delete_inode_common, + .destroy_inode = NULL, + .forget_inode = forget_inode_common, + }, + [SYMLINK_FILE_PLUGIN_ID] = { 
+ .h = { + .type_id = REISER4_FILE_PLUGIN_TYPE, + .id = SYMLINK_FILE_PLUGIN_ID, + .pops = &file_plugin_ops, + .label = "symlink", + .desc = "symbolic link", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .open = NULL, + .truncate = eperm, + .write_sd_by_inode = write_sd_by_inode_common, + .capturepage = NULL, + .readpage = eperm, + .capture = NULL, + .read = eperm, + .write = eperm, + .release = NULL, + .ioctl = eperm, + .mmap = eperm, + .sync = sync_common, + .get_block = NULL, + .flow_by_inode = NULL, + .key_by_inode = NULL, + .set_plug_in_inode = set_plug_in_inode_common, + .adjust_to_parent = adjust_to_parent_common, + .create = create_symlink, + /* FIXME-VS: symlink should probably have its own destroy + * method */ + .delete = delete_object, + .add_link = add_link_common, + .rem_link = rem_link_common, + .owns_item = NULL, + .can_add_link = can_add_link_common, + .can_rem_link = NULL, + .not_linked = not_linked_common, + .setattr = setattr_common, + .getattr = getattr_common, + .seek = NULL, + .detach = detach_common, + .bind = bind_common, + .safelink = safelink_common, + .estimate = { + .create = estimate_create_file_common, + .update = estimate_update_common, + .unlink = estimate_unlink_common + }, + .wire = { + .write = wire_write_common, + .read = wire_read_common, + .get = wire_get_common, + .size = wire_size_common, + .done = wire_done_common + }, + .init_inode_data = init_inode_ordering, + .pre_delete = NULL, + .cut_tree_worker = cut_tree_worker_common, + .drop = drop_common, + .delete_inode = delete_inode_common, + .destroy_inode = destroy_inode_symlink, + .forget_inode = forget_inode_common, + }, + [SPECIAL_FILE_PLUGIN_ID] = { + .h = { + .type_id = REISER4_FILE_PLUGIN_TYPE, + .id = SPECIAL_FILE_PLUGIN_ID, + .pops = &file_plugin_ops, + .label = "special", + .desc = "special: fifo, device or socket", + .linkage = TYPE_SAFE_LIST_LINK_ZERO} + , + .open = NULL, + .truncate = eperm, + .create = create_common, + .write_sd_by_inode = write_sd_by_inode_common, 
+ .capturepage = NULL, + .readpage = eperm, + .capture = NULL, + .read = eperm, + .write = eperm, + .release = NULL, + .ioctl = eperm, + .mmap = eperm, + .sync = sync_common, + .get_block = NULL, + .flow_by_inode = NULL, + .key_by_inode = NULL, + .set_plug_in_inode = set_plug_in_inode_common, + .adjust_to_parent = adjust_to_parent_common, + .delete = delete_object, + .add_link = add_link_common, + .rem_link = rem_link_common, + .owns_item = owns_item_common, + .can_add_link = can_add_link_common, + .can_rem_link = NULL, + .not_linked = not_linked_common, + .setattr = setattr_common, + .getattr = getattr_common, + .seek = NULL, + .detach = detach_common, + .bind = bind_common, + .safelink = safelink_common, + .estimate = { + .create = estimate_create_file_common, + .update = estimate_update_common, + .unlink = estimate_unlink_common + }, + .wire = { + .write = wire_write_common, + .read = wire_read_common, + .get = wire_get_common, + .size = wire_size_common, + .done = wire_done_common + }, + .init_inode_data = init_inode_ordering, + .pre_delete = NULL, + .cut_tree_worker = cut_tree_worker_common, + .drop = drop_common, + .delete_inode = delete_inode_common, + .destroy_inode = NULL, + .forget_inode = forget_inode_common, + }, + [PSEUDO_FILE_PLUGIN_ID] = { + .h = { + .type_id = REISER4_FILE_PLUGIN_TYPE, + .id = PSEUDO_FILE_PLUGIN_ID, + .pops = &file_plugin_ops, + .label = "pseudo", + .desc = "pseudo file", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .open = open_pseudo, + .truncate = eperm, + .write_sd_by_inode = eperm, + .readpage = eperm, + .capturepage = NULL, + .capture = NULL, + .read = read_pseudo, + .write = write_pseudo, + .release = release_pseudo, + .ioctl = eperm, + .mmap = eperm, + .sync = sync_common, + .get_block = eperm, + .flow_by_inode = NULL, + .key_by_inode = NULL, + .set_plug_in_inode = set_plug_in_inode_common, + .adjust_to_parent = NULL, + .create = NULL, + .delete = eperm, + .add_link = NULL, + .rem_link = NULL, + .owns_item = NULL, + 
.can_add_link = cannot, + .can_rem_link = cannot, + .not_linked = NULL, + .setattr = inode_setattr, + .getattr = getattr_common, + .seek = seek_pseudo, + .detach = detach_common, + .bind = bind_common, + .safelink = NULL, + .estimate = { + .create = NULL, + .update = NULL, + .unlink = NULL + }, + .wire = { + .write = wire_write_pseudo, + .read = wire_read_pseudo, + .get = wire_get_pseudo, + .size = wire_size_pseudo, + .done = wire_done_pseudo + }, + .init_inode_data = NULL, + .pre_delete = NULL, + .cut_tree_worker = cut_tree_worker_common, + .drop = drop_pseudo, + .delete_inode = NULL, + .destroy_inode = NULL, + .forget_inode = NULL, + }, + [CRC_FILE_PLUGIN_ID] = { + .h = { + .type_id = REISER4_FILE_PLUGIN_TYPE, + .id = CRC_FILE_PLUGIN_ID, + .pops = &cryptcompress_plugin_ops, + .label = "cryptcompress", + .desc = "cryptcompress file", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + /* FIXME: check which of these are relly needed */ + .open = open_cryptcompress, + .truncate = truncate_cryptcompress, + .write_sd_by_inode = write_sd_by_inode_common, + .readpage = readpage_cryptcompress, + .capturepage = NULL, + .capture = capture_cryptcompress, + .read = read_cryptcompress, + .write = write_cryptcompress, + .release = NULL, + .ioctl = NULL, + .mmap = mmap_cryptcompress, + .get_block = get_block_cryptcompress, + .sync = sync_common, + .flow_by_inode = flow_by_inode_cryptcompress, + .key_by_inode = key_by_inode_cryptcompress, + .set_plug_in_inode = set_plug_in_inode_common, + .adjust_to_parent = adjust_to_parent_common, + .create = create_cryptcompress, + .delete = delete_cryptcompress, + .add_link = add_link_common, + .rem_link = rem_link_common, + .owns_item = owns_item_common, + .can_add_link = can_add_link_common, + .can_rem_link = NULL, + .not_linked = not_linked_common, + .setattr = setattr_cryptcompress, + .getattr = getattr_common, + .seek = NULL, + .detach = detach_common, + .bind = bind_common, + .safelink = safelink_common, + .estimate = { + .create = 
estimate_create_file_common, + .update = estimate_update_common, + .unlink = estimate_unlink_common + }, + .wire = { + .write = wire_write_common, + .read = wire_read_common, + .get = wire_get_common, + .size = wire_size_common, + .done = wire_done_common + }, + /*.readpages = readpages_cryptcompress,*/ + .init_inode_data = init_inode_data_cryptcompress, + .pre_delete = pre_delete_cryptcompress, + .cut_tree_worker = cut_tree_worker_cryptcompress, + .drop = drop_common, + .delete_inode = delete_inode_common, + .destroy_inode = destroy_inode_cryptcompress, + .forget_inode = forget_inode_common, + .sendfile = sendfile_common, + .prepare_write = prepare_write_common + } +}; + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/object.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/object.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,42 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Declaration of object plugin functions. 
*/ + +#if !defined( __FS_REISER4_PLUGIN_OBJECT_H__ ) +#define __FS_REISER4_PLUGIN_OBJECT_H__ + +#include "../forward.h" + +#include /* for struct inode */ +#include + +extern int locate_inode_sd(struct inode *inode, + reiser4_key *key, coord_t *coord, lock_handle *lh); +extern int lookup_sd(struct inode *inode, znode_lock_mode lock_mode, + coord_t * coord, lock_handle * lh, const reiser4_key * key, + int silent); +extern int guess_plugin_by_mode(struct inode *inode); + +extern int write_sd_by_inode_common(struct inode *inode); +extern int owns_item_common(const struct inode *inode, + const coord_t * coord); +extern reiser4_block_nr estimate_update_common(const struct inode *inode); +extern int prepare_write_common (struct file *, struct page *, unsigned, unsigned); +extern int key_by_inode_and_offset_common(struct inode *, loff_t, reiser4_key *); +extern int setattr_common(struct inode *, struct iattr *); + +extern reiser4_plugin_ops cryptcompress_plugin_ops; + +/* __FS_REISER4_PLUGIN_OBJECT_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/plugin.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/plugin.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,623 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Basic plugin infrastructure, lookup etc. */ + +/* PLUGINS: + + Plugins are internal Reiser4 "modules" or "objects" used to increase + extensibility and allow external users to easily adapt reiser4 to + their needs. + + Plugins are classified into several disjoint "types". Plugins + belonging to the particular plugin type are termed "instances" of + this type. Currently the following types are present: + + . object plugin + . hash plugin + . tail plugin + . perm plugin + . item plugin + . 
node layout plugin + +NIKITA-FIXME-HANS: update this list, and review this entire comment for currency + + Object (file) plugin determines how given file-system object serves + standard VFS requests for read, write, seek, mmap etc. Instances of + file plugins are: regular file, directory, symlink. Another example + of file plugin is audit plugin, that optionally records accesses to + underlying object and forwards requests to it. + + Hash plugins compute hashes used by reiser4 to store and locate + files within directories. Instances of hash plugin type are: r5, + tea, rupasov. + + Tail plugins (or, more precisely, tail policy plugins) determine + when last part of the file should be stored in a formatted item. + + Perm plugins control permissions granted for a process accessing a file. + + Scope and lookup: + + label such that pair ( type_label, plugin_label ) is unique. This + pair is a globally persistent and user-visible plugin + identifier. Internally kernel maintains plugins and plugin types in + arrays using an index into those arrays as plugin and plugin type + identifiers. File-system in turn, also maintains persistent + "dictionary" which is mapping from plugin label to numerical + identifier which is stored in file-system objects. That is, we + store the offset into the plugin array for that plugin type as the + plugin id in the stat data of the filesystem object. + + plugin_labels have meaning for the user interface that assigns + plugins to files, and may someday have meaning for dynamic loading of + plugins and for copying of plugins from one fs instance to + another by utilities like cp and tar. + + Internal kernel plugin type identifier (index in plugins[] array) is + of type reiser4_plugin_type. Set of available plugin types is + currently static, but dynamic loading doesn't seem to pose + insurmountable problems. 
+ + Within each type plugins are addressed by the identifiers of type + reiser4_plugin_id (indices in + reiser4_plugin_type_data.builtin[]). Such identifiers are only + required to be unique within one type, not globally. + + Thus, plugin in memory is uniquely identified by the pair (type_id, + id). + + Usage: + + There exists only one copy of each plugin instance, but this + single copy can be associated with many entities (file-system + objects, items, nodes, transactions, file-descriptors etc.). Entity + to which plugin of given type is attached is termed (due to the lack of + imagination) "subject" of this plugin type and, by abuse of + terminology, subject of particular instance of this type to which + it's attached currently. For example, inode is subject of object + plugin type. Inode representing directory is subject of directory + plugin, hash plugin type and some particular instance of hash plugin + type. Inode representing regular file is subject of "regular file" + plugin, tail-policy plugin type etc. + + With each subject the plugin possibly stores some state. For example, + the state of a directory plugin (instance of object plugin type) is pointer + to hash plugin (if directories always use hashing that is). State of + audit plugin is file descriptor (struct file) of log file or some + magic value to do logging through printk(). + + Interface: + + In addition to a scalar identifier, each plugin type and plugin + proper has a "label": short string and a "description"---longer + descriptive string. Labels and descriptions of plugin types are + hard-coded into plugins[] array, declared and defined in + plugin.c. Label and description of plugin are stored in .label and + .desc fields of reiser4_plugin_header respectively. It's possible to + locate plugin by the pair of labels. + + Features: + + . user-level plugin manipulations: + + reiser4("filename/..file_plugin<='audit'"); + + write(open("filename/..file_plugin"), "audit", 8); + + . 
user level utilities lsplug and chplug to manipulate plugins. + Utilities are not of primary priority. Possibly they will not + be working in v4.0 + +NIKITA-FIXME-HANS: this should be a mkreiserfs option not a mount option, do you agree? I don't think that specifying it at mount time, and then changing it with each mount, is a good model for usage. + + . mount option "plug" to set-up plugins of root-directory. + "plug=foo:bar" will set "bar" as default plugin of type "foo". + + Limitations: + + . each plugin type has to provide at least one builtin + plugin. This is a technical limitation and it can be lifted in the + future. + + TODO: + + New plugin types/plugins: + Things we should be able to separately choose to inherit: + + security plugins + + stat data + + file bodies + + file plugins + + dir plugins + + . perm:acl + + d audi---audit plugin intercepting and possibly logging all + accesses to object. Requires putting stub functions in file_operations + instead of generic_file_*. + +NIKITA-FIXME-HANS: why make overflows a plugin? + . over---handle hash overflows + + . sqnt---handle different access patterns and instruments read-ahead + +NIKITA-FIXME-HANS: describe the line below in more detail. + + . hier---handle inheritance of plugins along file-system hierarchy + + Different kinds of inheritance: on creation vs. on access. + Compatible/incompatible plugins. + Inheritance for multi-linked files. + Layered plugins. + Notion of plugin context is abandoned. + +Each file is associated + with one plugin and dependent plugins (hash, etc.) are stored as + main plugin state. Now, if we have plugins used for regular files + but not for directories, how would such plugins be inherited? + . always store them with directories also + +NIKITA-FIXME-HANS: Do the line above. It is not exclusive of doing the line below which is also useful. + + . 
use inheritance hierarchy, independent of file-system namespace + +*/ + +#include "../debug.h" +#include "../dformat.h" +#include "plugin_header.h" +#include "item/static_stat.h" +#include "node/node.h" +#include "security/perm.h" +#include "space/space_allocator.h" +#include "disk_format/disk_format.h" +#include "plugin.h" +#include "../reiser4.h" +#include "../jnode.h" +#include "../inode.h" + +#include /* for struct super_block */ + +/* public interface */ + +/* initialise plugin sub-system. Just call this once on reiser4 startup. */ +int init_plugins(void); +int setup_plugins(struct super_block *super, reiser4_plugin ** area); +reiser4_plugin *lookup_plugin(const char *type_label, const char *plug_label); +int locate_plugin(struct inode *inode, plugin_locator * loc); + +/* internal functions. */ + +static reiser4_plugin_type find_type(const char *label); +static reiser4_plugin *find_plugin(reiser4_plugin_type_data * ptype, const char *label); + +/* initialise plugin sub-system. Just call this once on reiser4 startup. */ +reiser4_internal int +init_plugins(void) +{ + reiser4_plugin_type type_id; + + for (type_id = 0; type_id < REISER4_PLUGIN_TYPES; ++type_id) { + reiser4_plugin_type_data *ptype; + int i; + + ptype = &plugins[type_id]; + assert("nikita-3508", ptype->label != NULL); + assert("nikita-3509", ptype->type_id == type_id); + + plugin_list_init(&ptype->plugins_list); +/* NIKITA-FIXME-HANS: change builtin_num to some other name lacking the term builtin. 
*/ + for (i = 0; i < ptype->builtin_num; ++i) { + reiser4_plugin *plugin; + + plugin = plugin_at(ptype, i); + + if (plugin->h.label == NULL) + /* uninitialized slot encountered */ + continue; + assert("nikita-3445", plugin->h.type_id == type_id); + plugin->h.id = i; + if (plugin->h.pops != NULL && + plugin->h.pops->init != NULL) { + int result; + + result = plugin->h.pops->init(plugin); + if (result != 0) + return result; + } + plugin_list_clean(plugin); + plugin_list_push_back(&ptype->plugins_list, plugin); + } + } + return 0; +} + +/* true if plugin type id is valid */ +reiser4_internal int +is_type_id_valid(reiser4_plugin_type type_id /* plugin type id */) +{ + /* "type_id" is unsigned, so no comparison with 0 is + necessary */ + return (type_id < REISER4_PLUGIN_TYPES); +} + +/* true if plugin id is valid */ +reiser4_internal int +is_plugin_id_valid(reiser4_plugin_type type_id /* plugin type id */ , + reiser4_plugin_id id /* plugin id */) +{ + assert("nikita-1653", is_type_id_valid(type_id)); + return ((id < plugins[type_id].builtin_num) && (id >= 0)); +} + +/* lookup plugin by scanning tables */ +reiser4_internal reiser4_plugin * +lookup_plugin(const char *type_label /* plugin type label */ , + const char *plug_label /* plugin label */ ) +{ + reiser4_plugin *result; + reiser4_plugin_type type_id; + + assert("nikita-546", type_label != NULL); + assert("nikita-547", plug_label != NULL); + + type_id = find_type(type_label); + if (is_type_id_valid(type_id)) + result = find_plugin(&plugins[type_id], plug_label); + else + result = NULL; + return result; +} + +/* return plugin by its @type_id and @id. + + Both arguments are checked for validness: this is supposed to be called + from user-level. + +NIKITA-FIXME-HANS: Do you instead mean that this checks ids created in +user space, and passed to the filesystem by use of method files? Your +comment really confused me on the first reading.... 
+ +*/ +reiser4_internal reiser4_plugin * +plugin_by_unsafe_id(reiser4_plugin_type type_id /* plugin + * type id, + * unchecked */ , + reiser4_plugin_id id /* plugin id, + * unchecked */ ) +{ + if (is_type_id_valid(type_id)) { + if (is_plugin_id_valid(type_id, id)) + return plugin_at(&plugins[type_id], id); + else + /* id out of bounds */ + warning("nikita-2913", + "Invalid plugin id: [%i:%i]", type_id, id); + } else + /* type_id out of bounds */ + warning("nikita-2914", "Invalid type_id: %i", type_id); + return NULL; +} + +/* convert plugin id to the disk format */ +reiser4_internal int +save_plugin_id(reiser4_plugin * plugin /* plugin to convert */ , + d16 * area /* where to store result */ ) +{ + assert("nikita-1261", plugin != NULL); + assert("nikita-1262", area != NULL); + + cputod16((__u16) plugin->h.id, area); + return 0; +} + +/* list of all plugins of given type */ +reiser4_internal plugin_list_head * +get_plugin_list(reiser4_plugin_type type_id /* plugin type + * id */ ) +{ + assert("nikita-1056", is_type_id_valid(type_id)); + return &plugins[type_id].plugins_list; +} + +#if REISER4_DEBUG_OUTPUT +/* print human readable plugin information */ +reiser4_internal void +print_plugin(const char *prefix /* prefix to print */ , + reiser4_plugin * plugin /* plugin to print */ ) +{ + if (plugin != NULL) { + printk("%s: %s (%s:%i)\n", prefix, plugin->h.desc, plugin->h.label, plugin->h.id); + } else + printk("%s: (nil)\n", prefix); +} + +#endif + +/* find plugin type by label */ +static reiser4_plugin_type +find_type(const char *label /* plugin type + * label */ ) +{ + reiser4_plugin_type type_id; + + assert("nikita-550", label != NULL); + + for (type_id = 0; type_id < REISER4_PLUGIN_TYPES && + strcmp(label, plugins[type_id].label); ++type_id) { + ; + } + return type_id; +} + +/* given plugin label find it within given plugin type by scanning + array. 
Used to map user-visible symbolic name to internal kernel + id */ +static reiser4_plugin * +find_plugin(reiser4_plugin_type_data * ptype /* plugin + * type to + * find + * plugin + * within */ , + const char *label /* plugin label */ ) +{ + int i; + reiser4_plugin *result; + + assert("nikita-551", ptype != NULL); + assert("nikita-552", label != NULL); + + for (i = 0; i < ptype->builtin_num; ++i) { + result = plugin_at(ptype, i); + if (result->h.label == NULL) + continue; + if (!strcmp(result->h.label, label)) + return result; + } + return NULL; +} + +int +grab_plugin(struct inode *self, struct inode *ancestor, pset_member memb) +{ + reiser4_plugin *plug; + reiser4_inode *parent; + + parent = reiser4_inode_data(ancestor); + plug = pset_get(parent->hset, memb) ? : pset_get(parent->pset, memb); + return grab_plugin_from(self, memb, plug); +} + +static void +update_plugin_mask(reiser4_inode *info, pset_member memb) +{ + struct dentry *rootdir; + reiser4_inode *root; + + rootdir = inode_by_reiser4_inode(info)->i_sb->s_root; + if (rootdir != NULL) { + root = reiser4_inode_data(rootdir->d_inode); + /* + * if inode is different from the default one, or we are + * changing plugin of root directory, update plugin_mask + */ + if (pset_get(info->pset, memb) != pset_get(root->pset, memb) || + info == root) + info->plugin_mask |= (1 << memb); + } +} + +int +grab_plugin_from(struct inode *self, pset_member memb, reiser4_plugin *plug) +{ + reiser4_inode *info; + int result = 0; + + info = reiser4_inode_data(self); + if (pset_get(info->pset, memb) == NULL) { + result = pset_set(&info->pset, memb, plug); + if (result == 0) + update_plugin_mask(info, memb); + } + return result; +} + +int +force_plugin(struct inode *self, pset_member memb, reiser4_plugin *plug) +{ + reiser4_inode *info; + int result = 0; + + info = reiser4_inode_data(self); + if (plug->h.pops != NULL && plug->h.pops->change != NULL) + result = plug->h.pops->change(self, plug); + else + result = pset_set(&info->pset, 
memb, plug); + if (result == 0) + update_plugin_mask(info, memb); + return result; +} + +/* defined in fs/reiser4/plugin/file.c */ +extern file_plugin file_plugins[LAST_FILE_PLUGIN_ID]; +/* defined in fs/reiser4/plugin/dir.c */ +extern dir_plugin dir_plugins[LAST_DIR_ID]; +/* defined in fs/reiser4/plugin/item/static_stat.c */ +extern sd_ext_plugin sd_ext_plugins[LAST_SD_EXTENSION]; +/* defined in fs/reiser4/plugin/hash.c */ +extern hash_plugin hash_plugins[LAST_HASH_ID]; +/* defined in fs/reiser4/plugin/fibration.c */ +extern fibration_plugin fibration_plugins[LAST_FIBRATION_ID]; +/* defined in fs/reiser4/plugin/crypt.c */ +extern crypto_plugin crypto_plugins[LAST_CRYPTO_ID]; +/* defined in fs/reiser4/plugin/digest.c */ +extern digest_plugin digest_plugins[LAST_DIGEST_ID]; +/* defined in fs/reiser4/plugin/compress.c */ +extern compression_plugin compression_plugins[LAST_COMPRESSION_ID]; +/* defined in fs/reiser4/plugin/tail.c */ +extern formatting_plugin formatting_plugins[LAST_TAIL_FORMATTING_ID]; +/* defined in fs/reiser4/plugin/security/security.c */ +extern perm_plugin perm_plugins[LAST_PERM_ID]; +/* defined in fs/reiser4/plugin/item/item.c */ +extern item_plugin item_plugins[LAST_ITEM_ID]; +/* defined in fs/reiser4/plugin/node/node.c */ +extern node_plugin node_plugins[LAST_NODE_ID]; +/* defined in fs/reiser4/plugin/disk_format/disk_format.c */ +extern disk_format_plugin format_plugins[LAST_FORMAT_ID]; +/* defined in jnode.c */ +extern jnode_plugin jnode_plugins[LAST_JNODE_TYPE]; +/* defined in plugin/pseudo.c */ +extern pseudo_plugin pseudo_plugins[LAST_PSEUDO_ID]; + +reiser4_plugin_type_data plugins[REISER4_PLUGIN_TYPES] = { + /* C90 initializers */ + [REISER4_FILE_PLUGIN_TYPE] = { + .type_id = REISER4_FILE_PLUGIN_TYPE, + .label = "file", + .desc = "Object plugins", + .builtin_num = sizeof_array(file_plugins), + .builtin = file_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (file_plugin) + }, + [REISER4_DIR_PLUGIN_TYPE] = { + .type_id 
= REISER4_DIR_PLUGIN_TYPE, + .label = "dir", + .desc = "Directory plugins", + .builtin_num = sizeof_array(dir_plugins), + .builtin = dir_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (dir_plugin) + }, + [REISER4_HASH_PLUGIN_TYPE] = { + .type_id = REISER4_HASH_PLUGIN_TYPE, + .label = "hash", + .desc = "Directory hashes", + .builtin_num = sizeof_array(hash_plugins), + .builtin = hash_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (hash_plugin) + }, + [REISER4_FIBRATION_PLUGIN_TYPE] = { + .type_id = REISER4_FIBRATION_PLUGIN_TYPE, + .label = "fibration", + .desc = "Directory fibrations", + .builtin_num = sizeof_array(fibration_plugins), + .builtin = fibration_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (fibration_plugin) + }, + [REISER4_CRYPTO_PLUGIN_TYPE] = { + .type_id = REISER4_CRYPTO_PLUGIN_TYPE, + .label = "crypto", + .desc = "Crypto plugins", + .builtin_num = sizeof_array(crypto_plugins), + .builtin = crypto_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (crypto_plugin) + }, + [REISER4_DIGEST_PLUGIN_TYPE] = { + .type_id = REISER4_DIGEST_PLUGIN_TYPE, + .label = "digest", + .desc = "Digest plugins", + .builtin_num = sizeof_array(digest_plugins), + .builtin = digest_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (digest_plugin) + }, + [REISER4_COMPRESSION_PLUGIN_TYPE] = { + .type_id = REISER4_COMPRESSION_PLUGIN_TYPE, + .label = "compression", + .desc = "Compression plugins", + .builtin_num = sizeof_array(compression_plugins), + .builtin = compression_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (compression_plugin) + }, + + [REISER4_FORMATTING_PLUGIN_TYPE] = { + .type_id = REISER4_FORMATTING_PLUGIN_TYPE, + .label = "formatting", + .desc = "Tail inlining policies", + .builtin_num = sizeof_array(formatting_plugins), + .builtin = formatting_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof 
(formatting_plugin) + }, + [REISER4_PERM_PLUGIN_TYPE] = { + .type_id = REISER4_PERM_PLUGIN_TYPE, + .label = "perm", + .desc = "Permission checks", + .builtin_num = sizeof_array(perm_plugins), + .builtin = perm_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (perm_plugin) + }, + [REISER4_ITEM_PLUGIN_TYPE] = { + .type_id = REISER4_ITEM_PLUGIN_TYPE, + .label = "item", + .desc = "Item handlers", + .builtin_num = sizeof_array(item_plugins), + .builtin = item_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (item_plugin) + }, + [REISER4_NODE_PLUGIN_TYPE] = { + .type_id = REISER4_NODE_PLUGIN_TYPE, + .label = "node", + .desc = "node layout handlers", + .builtin_num = sizeof_array(node_plugins), + .builtin = node_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (node_plugin) + }, + [REISER4_SD_EXT_PLUGIN_TYPE] = { + .type_id = REISER4_SD_EXT_PLUGIN_TYPE, + .label = "sd_ext", + .desc = "Parts of stat-data", + .builtin_num = sizeof_array(sd_ext_plugins), + .builtin = sd_ext_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (sd_ext_plugin) + }, + [REISER4_FORMAT_PLUGIN_TYPE] = { + .type_id = REISER4_FORMAT_PLUGIN_TYPE, + .label = "disk_layout", + .desc = "defines filesystem on disk layout", + .builtin_num = sizeof_array(format_plugins), + .builtin = format_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (disk_format_plugin) + }, + [REISER4_JNODE_PLUGIN_TYPE] = { + .type_id = REISER4_JNODE_PLUGIN_TYPE, + .label = "jnode", + .desc = "defines kind of jnode", + .builtin_num = sizeof_array(jnode_plugins), + .builtin = jnode_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof (jnode_plugin) + }, + [REISER4_PSEUDO_PLUGIN_TYPE] = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .label = "pseudo_file", + .desc = "pseudo file", + .builtin_num = sizeof_array(pseudo_plugins), + .builtin = pseudo_plugins, + .plugins_list = TYPE_SAFE_LIST_HEAD_ZERO, + .size = sizeof 
(pseudo_plugin) + } +}; + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/plugin.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/plugin.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,832 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Basic plugin data-types. + see fs/reiser4/plugin/plugin.c for details */ + +#if !defined( __FS_REISER4_PLUGIN_TYPES_H__ ) +#define __FS_REISER4_PLUGIN_TYPES_H__ + +#include "../forward.h" +#include "../debug.h" +#include "../dformat.h" +#include "../key.h" +#include "../type_safe_list.h" +#include "compress/compress.h" +#include "plugin_header.h" +#include "item/static_stat.h" +#include "item/internal.h" +#include "item/sde.h" +#include "item/cde.h" +#include "pseudo/pseudo.h" +#include "symlink.h" +#include "dir/hashed_dir.h" +#include "dir/dir.h" +#include "item/item.h" +#include "node/node.h" +#include "node/node40.h" +#include "security/perm.h" +#include "fibration.h" + +#include "space/bitmap.h" +#include "space/space_allocator.h" + +#include "disk_format/disk_format40.h" +#include "disk_format/disk_format.h" + +#include /* for struct super_block, address_space */ +#include /* for struct page */ +#include /* for struct buffer_head */ +#include /* for struct dentry */ +#include +#include + +/* a flow is a sequence of bytes being written to or read from the tree. The + tree will slice the flow into items while storing it into nodes, but all of + that is hidden from anything outside the tree. */ + +struct flow { + reiser4_key key; /* key of start of flow's sequence of bytes */ + loff_t length; /* length of flow's sequence of bytes */ + char *data; /* start of flow's sequence of bytes */ + int user; /* if 1 data is user space, 0 - kernel space */ + rw_op op; /* NIKITA-FIXME-HANS: comment is where? 
*/ +}; + +typedef ssize_t(*rw_f_type) (struct file * file, flow_t * a_flow, loff_t * off); + +typedef struct reiser4_object_on_wire reiser4_object_on_wire; + +/* File plugin. Defines the set of methods that file plugins implement, some of which are optional. + + A file plugin offers to the caller an interface for IO ( writing to and/or reading from) to what the caller sees as one + sequence of bytes. An IO to it may affect more than one physical sequence of bytes, or no physical sequence of bytes, + it may affect sequences of bytes offered by other file plugins to the semantic layer, and the file plugin may invoke + other plugins and delegate work to them, but its interface is structured for offering the caller the ability to read + and/or write what the caller sees as being a single sequence of bytes. + + The file plugin must present a sequence of bytes to the caller, but it does not necessarily have to store a sequence of + bytes, it does not necessarily have to support efficient tree traversal to any offset in the sequence of bytes (tail + and extent items, whose keys contain offsets, do however provide efficient non-sequential lookup of any offset in the + sequence of bytes). + + Directory plugins provide methods for selecting file plugins by resolving a name for them. + + The functionality other filesystems call an attribute, and rigidly tie together, we decompose into orthogonal + selectable features of files. Using the terminology we will define next, an attribute is a perhaps constrained, + perhaps static length, file whose parent has a uni-count-intra-link to it, which might be grandparent-major-packed, and + whose parent has a deletion method that deletes it. + + File plugins can implement constraints. + + Files can be of variable length (e.g. regular unix files), or of static length (e.g. static sized attributes). + + An object may have many sequences of bytes, and many file plugins, but, it has exactly one objectid. 
It is usually + desirable that an object has a deletion method which deletes every item with that objectid. Items cannot in general be + found by just their objectids. This means that an object must have either a method built into its deletion plugin + method for knowing what items need to be deleted, or links stored with the object that provide the plugin with a method + for finding those items. Deleting a file within an object may or may not have the effect of deleting the entire + object, depending on the file plugin's deletion method. + + LINK TAXONOMY: + + Many objects have a reference count, and when the reference count reaches 0 the object's deletion method is invoked. + Some links embody a reference count increase ("countlinks"), and others do not ("nocountlinks"). + + Some links are bi-directional links ("bilinks"), and some are uni-directional("unilinks"). + + Some links are between parts of the same object ("intralinks"), and some are between different objects ("interlinks"). + + PACKING TAXONOMY: + + Some items of an object are stored with a major packing locality based on their object's objectid (e.g. unix directory + items in plan A), and these are called "self-major-packed". + + Some items of an object are stored with a major packing locality based on their semantic parent object's objectid + (e.g. unix file bodies in plan A), and these are called "parent-major-packed". + + Some items of an object are stored with a major packing locality based on their semantic grandparent, and these are + called "grandparent-major-packed". Now carefully notice that we run into trouble with key length if we have to store a + 8 byte major+minor grandparent based packing locality, an 8 byte parent objectid, an 8 byte attribute objectid, and an + 8 byte offset, all in a 24 byte key. One of these fields must be sacrificed if an item is to be + grandparent-major-packed, and which to sacrifice is left to the item author choosing to make the item + grandparent-major-packed. 
You cannot make tail items and extent items grandparent-major-packed, though you could make + them self-major-packed (usually they are parent-major-packed). + + In the case of ACLs (which are composed of fixed length ACEs which consist of {subject-type, + subject, and permission bitmask} triples), it makes sense to not have an offset field in the ACE item key, and to allow + duplicate keys for ACEs. Thus, the set of ACES for a given file is found by looking for a key consisting of the + objectid of the grandparent (thus grouping all ACLs in a directory together), the minor packing locality of ACE, the + objectid of the file, and 0. + + IO involves moving data from one location to another, which means that two locations must be specified, source and + destination. + + This source and destination can be in the filesystem, or they can be a pointer in the user process address space plus a byte count. + + If both source and destination are in the filesystem, then at least one of them must be representable as a pure stream + of bytes (which we call a flow, and define as a struct containing a key, a data pointer, and a length). This may mean + converting one of them into a flow. We provide a generic cast_into_flow() method, which will work for any plugin + supporting read_flow(), though it is inefficiently implemented in that it temporarily stores the flow in a buffer + (Question: what to do with huge flows that cannot fit into memory? Answer: we must not convert them all at once. ) + + Performing a write requires resolving the write request into a flow defining the source, and a method that performs the write, and + a key that defines where in the tree the write is to go. + + Performing a read requires resolving the read request into a flow defining the target, and a method that performs the + read, and a key that defines where in the tree the read is to come from. 
+ + There will exist file plugins which have no pluginid stored on the disk for them, and which are only invoked by other + plugins. + +*/ +typedef struct file_plugin { + + /* generic fields */ + plugin_header h; + + /* file_operations->open is dispatched here */ + int (*open) (struct inode * inode, struct file * file); + /* NIKITA-FIXME-HANS: comment all fields, even the ones every non-beginner FS developer knows.... */ + int (*truncate) (struct inode * inode, loff_t size); + + /* save inode cached stat-data onto disk. It was called + reiserfs_update_sd() in 3.x */ + int (*write_sd_by_inode) (struct inode * inode); + int (*readpage) (void *, struct page *); + int (*prepare_write) (struct file *, struct page *, unsigned, unsigned); + + /* captures the passed page into the current atom and takes care of extent handling. + This is needed for loopback device support and is used from ->commit_write() + +*/ /* ZAM-FIXME-HANS: are you writing to yourself or the reader? Bigger comment please. */ + int (*capturepage) (struct page *); + /* + * add pages created through mmap into object. + */ + int (*capture) (struct inode *inode, struct writeback_control *wbc); + /* these should be implemented using body_read_flow and body_write_flow + builtins */ + ssize_t(*read) (struct file * file, char *buf, size_t size, loff_t * off); + ssize_t(*write) (struct file * file, const char *buf, size_t size, loff_t * off); + + int (*release) (struct inode *inode, struct file * file); + int (*ioctl) (struct inode *, struct file *, unsigned int cmd, unsigned long arg); + int (*mmap) (struct file * file, struct vm_area_struct * vma); + int (*get_block) (struct inode * inode, sector_t block, struct buffer_head * bh_result, int create); +/* private methods: These are optional. If used they will allow you to + minimize the amount of code needed to implement a deviation from some other + method that also uses them. */ + + /* Construct flow into @flow according to user-supplied data. 
+ + This is used by read/write methods to construct a flow to + write/read. ->flow_by_inode() is plugin method, rather than single + global implementation, because key in a flow used by plugin may + depend on data in a @buf. + +NIKITA-FIXME-HANS: please create statistics on what functions are +dereferenced how often for the mongo benchmark. You can supervise +Elena doing this for you if that helps. Email me the list of the top 10, with their counts, and an estimate of the total number of CPU cycles spent dereferencing as a percentage of CPU cycles spent processing (non-idle processing). If the total percent is, say, less than 1%, it will make our coding discussions much easier, and keep me from questioning whether functions like the below are too frequently called to be dereferenced. If the total percent is more than 1%, perhaps private methods should be listed in a "required" comment at the top of each plugin (with stern language about how if the comment is missing it will not be accepted by the maintainer), and implemented using macros not dereferenced functions. How about replacing this whole private methods part of the struct with a thorough documentation of what the standard helper functions are for use in constructing plugins? I think users have been asking for that, though not in so many words. + */ + int (*flow_by_inode) (struct inode *, char *buf, int user, loff_t size, loff_t off, rw_op op, flow_t *); + + /* Return the key used to retrieve an offset of a file. It is used by + default implementation of ->flow_by_inode() method + (common_build_flow()) and, among other things, to get to the extent + from jnode of unformatted node. + */ + int (*key_by_inode) (struct inode * inode, loff_t off, reiser4_key * key); + +/* NIKITA-FIXME-HANS: this comment is not as clear to others as you think.... */ + /* set the plugin for a file. Called during file creation in creat() + but not reiser4() unless an inode already exists for the file. 
*/ + int (*set_plug_in_inode) (struct inode * inode, struct inode * parent, reiser4_object_create_data * data); + +/* NIKITA-FIXME-HANS: comment and name seem to say different things, are you setting up the object itself also or just adjusting the parent?.... */ + /* set up plugins for new @object created in @parent. @root is root + directory. */ + int (*adjust_to_parent) (struct inode * object, struct inode * parent, struct inode * root); + /* this does whatever is necessary to do when object is created. For + instance, for unix files stat data is inserted */ + int (*create) (struct inode * object, struct inode * parent, + reiser4_object_create_data * data); + /* delete empty object. This method should check REISER4_NO_SD + and set REISER4_NO_SD on success. Deletion of empty object + at least includes removal of stat-data if any. For directories this + also includes removal of dot and dot-dot. + */ + int (*delete) (struct inode * object); + + /* method implementing f_op->fsync() */ + int (*sync)(struct inode *, int datasync); + + /* add link from @parent to @object */ + int (*add_link) (struct inode * object, struct inode * parent); + + /* remove link from @parent to @object */ + int (*rem_link) (struct inode * object, struct inode * parent); + + /* return true if item addressed by @coord belongs to @inode. + This is used by read/write to properly slice flow into items + in presence of multiple key assignment policies, because + items of a file are not necessarily contiguous in a key space, + for example, in a plan-b. 
*/ + int (*owns_item) (const struct inode * inode, const coord_t * coord); + + /* checks whether yet another hard link to this object can be + added */ + int (*can_add_link) (const struct inode * inode); + /* checks whether hard links to this object can be removed */ + int (*can_rem_link) (const struct inode * inode); + /* true if there is only one link (aka name) for this file */ + int (*not_linked) (const struct inode * inode); + + /* change inode attributes. */ + int (*setattr) (struct inode * inode, struct iattr * attr); + + /* obtain inode attributes */ + int (*getattr) (struct vfsmount * mnt UNUSED_ARG, struct dentry * dentry, struct kstat * stat); + + /* seek */ + loff_t(*seek) (struct file * f, loff_t offset, int origin); + + int (*detach)(struct inode *child, struct inode *parent); + + /* called when @child was just looked up in the @parent */ + int (*bind) (struct inode * child, struct inode * parent); + + /* process safe-link during mount */ + int (*safelink)(struct inode *object, reiser4_safe_link_t link, + __u64 value); + + /* The couple of estimate methods for all file operations */ + struct { + reiser4_block_nr (*create) (struct inode *); + reiser4_block_nr (*update) (const struct inode *); + reiser4_block_nr (*unlink) (struct inode *, struct inode *); + } estimate; +/* + void (*readpages)(struct file *file, struct address_space *mapping, + struct list_head *pages); +*/ + /* reiser4 specific part of inode has a union of structures which are + specific to a plugin. This method is called when inode is read + (read_inode) and when file is created (common_create_child) so that + file plugin could initialize its inode data */ + void (*init_inode_data)(struct inode *, reiser4_object_create_data *, int); + + /** + * This method performs progressive deletion of items and whole nodes + from right to left.
+ * + * @tap: the point deletion process begins from, + * @from_key: the beginning of the deleted key range, + * @to_key: the end of the deleted key range, + * @smallest_removed: the smallest removed key, + * + * @return: 0 on success, error code otherwise, -E_REPEAT means that long cut_tree + * operation was interrupted to allow atom commit. + */ + int (*cut_tree_worker)(tap_t * tap, const reiser4_key * from_key, const reiser4_key * to_key, + reiser4_key * smallest_removed, struct inode * object, int, int*); + + /* truncate file to zero size. called by reiser4_drop_inode before truncate_inode_pages */ + int (*pre_delete)(struct inode *); + + /* called from reiser4_drop_inode() */ + void (*drop)(struct inode *); + + /* called from ->drop() when there are no links, and object should be + * garbage collected. */ + void (*delete_inode)(struct inode *); + + /* called from ->destroy_inode() */ + void (*destroy_inode)(struct inode *); + void (*forget_inode)(struct inode *); + ssize_t (*sendfile)(struct file *, loff_t *, size_t, read_actor_t, void __user *); + /* + * methods to serialize object identity. This is used, for example, by + * reiser4_{en,de}code_fh().
+ */ + struct { + /* store object's identity at @area */ + char *(*write)(struct inode *inode, char *area); + /* parse object from wire to the @obj */ + char *(*read)(char *area, reiser4_object_on_wire *obj); + /* given object identity in @obj, find or create its dentry */ + struct dentry *(*get)(struct super_block *s, + reiser4_object_on_wire *obj); + /* how many bytes ->wire.write() consumes */ + int (*size)(struct inode *inode); + /* finish with object identity */ + void (*done)(reiser4_object_on_wire *obj); + } wire; +} file_plugin; + +struct reiser4_object_on_wire { + file_plugin *plugin; + union { + struct { + obj_key_id key_id; + } std; + void *generic; + } u; +}; + +typedef struct dir_plugin { + /* generic fields */ + plugin_header h; + /* for use by open call, based on name supplied will install + appropriate plugin and state information, into the inode such that + subsequent VFS operations that supply a pointer to that inode + operate in a manner appropriate. Note that this may require storing + some state for the plugin, and that this state might even include + the name used by open. */ + int (*lookup) (struct inode * parent_inode, struct dentry **dentry); + /* VFS required/defined operations below this line */ + int (*unlink) (struct inode * parent, struct dentry * victim); + int (*link) (struct inode * parent, struct dentry * existing, struct dentry * where); + /* rename object named by @old entry in @old_dir to be named by @new + entry in @new_dir */ + int (*rename) (struct inode * old_dir, struct dentry * old, struct inode * new_dir, struct dentry * new); + + /* create new object described by @data and add it to the @parent + directory under the name described by @dentry */ + int (*create_child) (reiser4_object_create_data * data, + struct inode ** retobj); + + /* readdir implementation */ + int (*readdir) (struct file * f, void *cookie, filldir_t filldir); + + /* private methods: These are optional.
If used they will allow you to + minimize the amount of code needed to implement a deviation from + some other method that uses them. You could logically argue that + they should be a separate type of plugin. */ + + /* check whether "name" is acceptable name to be inserted into + this object. Optionally implemented by directory-like objects. + Can check for maximal length, reserved symbols etc */ + int (*is_name_acceptable) (const struct inode * inode, const char *name, int len); + + void (*build_entry_key) (const struct inode * dir /* directory where + * entry is (or will + * be) in.*/ , + const struct qstr * name /* name of file referenced + * by this entry */ , + reiser4_key * result /* resulting key of directory + * entry */ ); + int (*build_readdir_key) (struct file * dir, reiser4_key * result); + int (*add_entry) (struct inode * object, struct dentry * where, + reiser4_object_create_data * data, reiser4_dir_entry_desc * entry); + + int (*rem_entry) (struct inode * object, struct dentry * where, reiser4_dir_entry_desc * entry); + + /* initialize directory structure for newly created object. For normal + unix directories, insert dot and dotdot. */ + int (*init) (struct inode * object, struct inode * parent, reiser4_object_create_data * data); + /* destroy directory */ + int (*done) (struct inode * child); + + /* called when @subdir was just looked up in the @dir */ + int (*attach) (struct inode * subdir, struct inode * dir); + int (*detach)(struct inode * subdir, struct inode * dir); + + struct dentry *(*get_parent)(struct inode *childdir); + + struct { + reiser4_block_nr (*add_entry) (struct inode *node); + reiser4_block_nr (*rem_entry) (struct inode *node); + reiser4_block_nr (*unlink) (struct inode *, struct inode *); + } estimate; +} dir_plugin; + +typedef struct formatting_plugin { + /* generic fields */ + plugin_header h; + /* returns non-zero iff file's tail has to be stored + in a direct item. 
*/ + int (*have_tail) (const struct inode * inode, loff_t size); +} formatting_plugin; + +typedef struct hash_plugin { + /* generic fields */ + plugin_header h; + /* computes hash of the given name */ + __u64(*hash) (const unsigned char *name, int len); +} hash_plugin; + +typedef struct crypto_plugin { + /* generic fields */ + plugin_header h; + int (*alloc) (struct inode * inode); + void (*free) (struct inode * inode); + /* number of cpu expkey words */ + unsigned nr_keywords; + /* Offset translator. For each offset this returns (k * offset), where + k (k >= 1) is a coefficient of expansion of the crypto algorithm. + For all symmetric algorithms k == 1. For asymmetric algorithms (which + inflate data) offset translation guarantees that all disk cluster's + units will have keys smaller than next cluster's one. + */ + loff_t (*scale)(struct inode * inode, size_t blocksize, loff_t src); + /* Crypto algorithms can accept data only by chunks of crypto block + size. This method is to align any flow up to crypto block size when + we pass it to crypto algorithm. To align means to append padding of + special format specific to the crypto algorithm */ + int (*align_cluster)(__u8 *tail, int clust_size, int blocksize); + /* low-level key manager (check, install, etc..) */ + int (*setkey) (struct crypto_tfm *tfm, const __u8 *key, unsigned int keylen); + /* main text processing procedures */ + void (*encrypt) (__u32 *expkey, __u8 *dst, const __u8 *src); + void (*decrypt) (__u32 *expkey, __u8 *dst, const __u8 *src); +} crypto_plugin; + +typedef struct digest_plugin { + /* generic fields */ + plugin_header h; + /* digest size */ + int dsize; + int (*alloc) (struct inode * inode); + void (*free) (struct inode * inode); +} digest_plugin; + +typedef struct compression_plugin { + /* generic fields */ + plugin_header h; + /* the maximum number of bytes the size of the "compressed" data can + * exceed the uncompressed data.
*/ + int (*overrun) (unsigned src_len); + coa_t (*alloc) (tfm_action act); + void (*free) (coa_t coa, tfm_action act); + /* minimal size of the flow we still try to compress */ + int (*min_tfm_size) (void); + /* main transform procedures */ + void (*compress) (coa_t coa, __u8 *src_first, unsigned src_len, + __u8 *dst_first, unsigned *dst_len); + void (*decompress) (coa_t coa, __u8 *src_first, unsigned src_len, + __u8 *dst_first, unsigned *dst_len); +}compression_plugin; + +typedef struct sd_ext_plugin { + /* generic fields */ + plugin_header h; + int (*present) (struct inode * inode, char **area, int *len); + int (*absent) (struct inode * inode); + int (*save_len) (struct inode * inode); + int (*save) (struct inode * inode, char **area); + /* alignment requirement for this stat-data part */ + int alignment; +} sd_ext_plugin; + +/* this plugin contains methods to allocate objectid for newly created files, + to deallocate objectid when file gets removed, to report number of used and + free objectids */ +typedef struct oid_allocator_plugin { + /* generic fields */ + plugin_header h; + int (*init_oid_allocator) (reiser4_oid_allocator * map, __u64 nr_files, __u64 oids); + /* used to report statfs->f_files */ + __u64(*oids_used) (reiser4_oid_allocator * map); + /* get next oid to use */ + __u64(*next_oid) (reiser4_oid_allocator * map); + /* used to report statfs->f_ffree */ + __u64(*oids_free) (reiser4_oid_allocator * map); + /* allocate new objectid */ + int (*allocate_oid) (reiser4_oid_allocator * map, oid_t *); + /* release objectid */ + int (*release_oid) (reiser4_oid_allocator * map, oid_t); + /* how many pages to reserve in transaction for allocation of new + objectid */ + int (*oid_reserve_allocate) (reiser4_oid_allocator * map); + /* how many pages to reserve in transaction for freeing of an + objectid */ + int (*oid_reserve_release) (reiser4_oid_allocator * map); + void (*print_info) (const char *, reiser4_oid_allocator *); +} oid_allocator_plugin; + +/* disk 
layout plugin: this specifies super block, journal, bitmap (if there + are any) locations, etc */ +typedef struct disk_format_plugin { + /* generic fields */ + plugin_header h; + /* replay journal, initialize super_info_data, etc */ + int (*get_ready) (struct super_block *, void *data); + + /* key of root directory stat data */ + const reiser4_key *(*root_dir_key) (const struct super_block *); + + int (*release) (struct super_block *); + jnode *(*log_super) (struct super_block *); + void (*print_info) (const struct super_block *); + int (*check_open) (const struct inode *object); +} disk_format_plugin; + +struct jnode_plugin { + /* generic fields */ + plugin_header h; + int (*init) (jnode * node); + int (*parse) (jnode * node); + struct address_space *(*mapping) (const jnode * node); + unsigned long (*index) (const jnode * node); + jnode *(*clone) (jnode * node); +}; + +/* plugin instance. */ +/* */ +/* This is "wrapper" union for all types of plugins. Most of the code uses */ +/* plugins of particular type (file_plugin, dir_plugin, etc.) rather than */ +/* operates with pointers to reiser4_plugin. This union is only used in */ +/* some generic code in plugin/plugin.c that operates on all */ +/* plugins. Technically speaking purpose of this union is to add type */ +/* safety to said generic code: each plugin type (file_plugin, for */ +/* example), contains plugin_header as its first member. This first member */ +/* is located at the same place in memory as .h member of */ +/* reiser4_plugin. Generic code obtains pointer to reiser4_plugin and */ +/* looks in the .h which is header of plugin type located in union. This */ +/* allows to avoid type-casts.
*/ +union reiser4_plugin { + /* generic fields */ + plugin_header h; + /* file plugin */ + file_plugin file; + /* directory plugin */ + dir_plugin dir; + /* hash plugin, used by directory plugin */ + hash_plugin hash; + /* fibration plugin used by directory plugin */ + fibration_plugin fibration; + /* crypto plugin, used by file plugin */ + crypto_plugin crypto; + /* digest plugin, used by file plugin */ + digest_plugin digest; + /* compression plugin, used by file plugin */ + compression_plugin compression; + /* tail plugin, used by file plugin */ + formatting_plugin formatting; + /* permission plugin */ + perm_plugin perm; + /* node plugin */ + node_plugin node; + /* item plugin */ + item_plugin item; + /* stat-data extension plugin */ + sd_ext_plugin sd_ext; + /* disk layout plugin */ + disk_format_plugin format; + /* object id allocator plugin */ + oid_allocator_plugin oid_allocator; + /* plugin for different jnode types */ + jnode_plugin jnode; + /* plugin for pseudo files */ + pseudo_plugin pseudo; + /* place-holder for new plugin types that can be registered + dynamically, and used by other dynamically loaded plugins. */ + void *generic; +}; + +struct reiser4_plugin_ops { + /* called when plugin is initialized */ + int (*init) (reiser4_plugin * plugin); + /* called when plugin is unloaded */ + int (*done) (reiser4_plugin * plugin); + /* load given plugin from disk */ + int (*load) (struct inode * inode, + reiser4_plugin * plugin, char **area, int *len); + /* how much space is required to store this plugin's state + in stat-data */ + int (*save_len) (struct inode * inode, reiser4_plugin * plugin); + /* save persistent plugin-data to disk */ + int (*save) (struct inode * inode, reiser4_plugin * plugin, char **area); + /* alignment requirement for on-disk state of this plugin + in number of bytes */ + int alignment; + /* install itself into given inode. This can return error + (e.g., you cannot change hash of non-empty directory).
*/ + int (*change) (struct inode * inode, reiser4_plugin * plugin); + /* install itself into given inode. This can return error + (e.g., you cannot change hash of non-empty directory). */ + int (*inherit) (struct inode * inode, struct inode * parent, + reiser4_plugin * plugin); +}; + +/* functions implemented in fs/reiser4/plugin/plugin.c */ + +/* stores plugin reference in reiser4-specific part of inode */ +extern int set_object_plugin(struct inode *inode, reiser4_plugin_id id); +extern int setup_plugins(struct super_block *super, reiser4_plugin ** area); +extern reiser4_plugin *lookup_plugin(const char *type_label, const char *plug_label); +extern int init_plugins(void); + +/* functions implemented in fs/reiser4/plugin/object.c */ +void move_flow_forward(flow_t * f, unsigned count); + +/* builtin plugins */ + +/* builtin file-plugins */ +typedef enum { + /* regular file */ + UNIX_FILE_PLUGIN_ID, + /* directory */ + DIRECTORY_FILE_PLUGIN_ID, + /* symlink */ + SYMLINK_FILE_PLUGIN_ID, + /* for objects completely handled by the VFS: fifos, devices, + sockets */ + SPECIAL_FILE_PLUGIN_ID, + /* Plugin id for crypto-compression objects */ + CRC_FILE_PLUGIN_ID, + /* pseudo file */ + PSEUDO_FILE_PLUGIN_ID, + /* number of file plugins. Used as size of arrays to hold + file plugins. 
*/ + LAST_FILE_PLUGIN_ID +} reiser4_file_id; + +/* builtin dir-plugins */ +typedef enum { + HASHED_DIR_PLUGIN_ID, + SEEKABLE_HASHED_DIR_PLUGIN_ID, + PSEUDO_DIR_PLUGIN_ID, + LAST_DIR_ID +} reiser4_dir_id; + +/* builtin hash-plugins */ + +typedef enum { + RUPASOV_HASH_ID, + R5_HASH_ID, + TEA_HASH_ID, + FNV1_HASH_ID, + DEGENERATE_HASH_ID, + LAST_HASH_ID +} reiser4_hash_id; + +/* builtin crypto-plugins */ + +typedef enum { + NONE_CRYPTO_ID, + LAST_CRYPTO_ID +} reiser4_crypto_id; + +/* builtin digest plugins */ + +typedef enum { + NONE_DIGEST_ID, + LAST_DIGEST_ID +} reiser4_digest_id; + +/* builtin tail-plugins */ + +typedef enum { + NEVER_TAILS_FORMATTING_ID, + ALWAYS_TAILS_FORMATTING_ID, + SMALL_FILE_FORMATTING_ID, + LAST_TAIL_FORMATTING_ID +} reiser4_formatting_id; + +/* Encapsulations of crypto specific data */ +typedef struct crypto_data { + reiser4_crypto_id cra; /* id of the crypto algorithm */ + reiser4_digest_id dia; /* id of the digest algorithm */ + __u8 * key; /* secret key */ + __u16 keysize; /* key size, bits */ + __u8 * keyid; /* keyid */ + __u16 keyid_size; /* keyid size, bytes */ +} crypto_data_t; + +/* compression/clustering specific data */ +typedef struct compression_data { + reiser4_compression_id coa; /* id of the compression algorithm */ +} compression_data_t; + +typedef __u8 cluster_data_t; /* cluster info */ + +/* data type used to pack parameters that we pass to vfs + object creation function create_object() */ +struct reiser4_object_create_data { + /* plugin to control created object */ + reiser4_file_id id; + /* mode of regular file, directory or special file */ +/* what happens if some other sort of perm plugin is in use? */ + int mode; + /* rdev of special file */ + dev_t rdev; + /* symlink target */ + const char *name; + /* add here something for non-standard objects you invent, like + query for interpolation file etc. 
*/ + crypto_data_t * crypto; + compression_data_t * compression; + cluster_data_t * cluster; + + struct inode *parent; + struct dentry *dentry; +}; + +#define MAX_PLUGIN_TYPE_LABEL_LEN 32 +#define MAX_PLUGIN_PLUG_LABEL_LEN 32 + +/* used for interface with user-land: table-driven parsing in + reiser4(). */ +typedef struct plugin_locator { + reiser4_plugin_type type_id; + reiser4_plugin_id id; + char type_label[MAX_PLUGIN_TYPE_LABEL_LEN]; + char plug_label[MAX_PLUGIN_PLUG_LABEL_LEN]; +} plugin_locator; + +extern int locate_plugin(struct inode *inode, plugin_locator * loc); + +static inline reiser4_plugin * +plugin_by_id(reiser4_plugin_type type_id, reiser4_plugin_id id); + +static inline reiser4_plugin * +plugin_by_disk_id(reiser4_tree * tree, reiser4_plugin_type type_id, d16 * did); + +#define PLUGIN_BY_ID(TYPE,ID,FIELD) \ +static inline TYPE *TYPE ## _by_id( reiser4_plugin_id id ) \ +{ \ + reiser4_plugin *plugin = plugin_by_id ( ID, id ); \ + return plugin ? & plugin -> FIELD : NULL; \ +} \ +static inline TYPE *TYPE ## _by_disk_id( reiser4_tree *tree, d16 *id ) \ +{ \ + reiser4_plugin *plugin = plugin_by_disk_id ( tree, ID, id ); \ + return plugin ? & plugin -> FIELD : NULL; \ +} \ +static inline TYPE *TYPE ## _by_unsafe_id( reiser4_plugin_id id ) \ +{ \ + reiser4_plugin *plugin = plugin_by_unsafe_id ( ID, id ); \ + return plugin ? 
& plugin -> FIELD : NULL; \ +} \ +static inline reiser4_plugin* TYPE ## _to_plugin( TYPE* plugin ) \ +{ \ + return ( reiser4_plugin * ) plugin; \ +} \ +static inline reiser4_plugin_id TYPE ## _id( TYPE* plugin ) \ +{ \ + return TYPE ## _to_plugin (plugin) -> h.id; \ +} \ +typedef struct { int foo; } TYPE ## _plugin_dummy + +PLUGIN_BY_ID(item_plugin, REISER4_ITEM_PLUGIN_TYPE, item); +PLUGIN_BY_ID(file_plugin, REISER4_FILE_PLUGIN_TYPE, file); +PLUGIN_BY_ID(dir_plugin, REISER4_DIR_PLUGIN_TYPE, dir); +PLUGIN_BY_ID(node_plugin, REISER4_NODE_PLUGIN_TYPE, node); +PLUGIN_BY_ID(sd_ext_plugin, REISER4_SD_EXT_PLUGIN_TYPE, sd_ext); +PLUGIN_BY_ID(perm_plugin, REISER4_PERM_PLUGIN_TYPE, perm); +PLUGIN_BY_ID(hash_plugin, REISER4_HASH_PLUGIN_TYPE, hash); +PLUGIN_BY_ID(fibration_plugin, REISER4_FIBRATION_PLUGIN_TYPE, fibration); +PLUGIN_BY_ID(crypto_plugin, REISER4_CRYPTO_PLUGIN_TYPE, crypto); +PLUGIN_BY_ID(digest_plugin, REISER4_DIGEST_PLUGIN_TYPE, digest); +PLUGIN_BY_ID(compression_plugin, REISER4_COMPRESSION_PLUGIN_TYPE, compression); +PLUGIN_BY_ID(formatting_plugin, REISER4_FORMATTING_PLUGIN_TYPE, formatting); +PLUGIN_BY_ID(disk_format_plugin, REISER4_FORMAT_PLUGIN_TYPE, format); +PLUGIN_BY_ID(jnode_plugin, REISER4_JNODE_PLUGIN_TYPE, jnode); +PLUGIN_BY_ID(pseudo_plugin, REISER4_PSEUDO_PLUGIN_TYPE, pseudo); + +extern int save_plugin_id(reiser4_plugin * plugin, d16 * area); + + +TYPE_SAFE_LIST_DEFINE(plugin, reiser4_plugin, h.linkage); + +extern plugin_list_head *get_plugin_list(reiser4_plugin_type type_id); + +#define for_all_plugins( ptype, plugin ) \ +for( plugin = plugin_list_front( get_plugin_list( ptype ) ) ; \ + ! plugin_list_end( get_plugin_list( ptype ), plugin ) ; \ + plugin = plugin_list_next( plugin ) ) + +/* enumeration of fields within plugin_set */ +typedef enum { + PSET_FILE, + PSET_DIR, /* PSET_FILE and PSET_DIR should be first elements: + * inode.c:read_inode() depends on this. 
*/ + PSET_PERM, + PSET_FORMATTING, + PSET_HASH, + PSET_FIBRATION, + PSET_SD, + PSET_DIR_ITEM, + PSET_CRYPTO, + PSET_DIGEST, + PSET_COMPRESSION, + PSET_LAST +} pset_member; + +int grab_plugin(struct inode *self, struct inode *ancestor, pset_member memb); +int grab_plugin_from(struct inode *self, pset_member memb, reiser4_plugin *plug); +int force_plugin(struct inode *self, pset_member memb, reiser4_plugin *plug); + +/* __FS_REISER4_PLUGIN_TYPES_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/plugin_header.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/plugin_header.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,136 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* plugin header. Data structures required by all plugin types. */ + +#if !defined( __PLUGIN_HEADER_H__ ) +#define __PLUGIN_HEADER_H__ + +/* plugin data-types and constants */ + +#include "../type_safe_list.h" +#include "../dformat.h" + +typedef enum { + REISER4_FILE_PLUGIN_TYPE, + REISER4_DIR_PLUGIN_TYPE, + REISER4_ITEM_PLUGIN_TYPE, + REISER4_NODE_PLUGIN_TYPE, + REISER4_HASH_PLUGIN_TYPE, + REISER4_FIBRATION_PLUGIN_TYPE, + REISER4_FORMATTING_PLUGIN_TYPE, + REISER4_PERM_PLUGIN_TYPE, + REISER4_SD_EXT_PLUGIN_TYPE, + REISER4_FORMAT_PLUGIN_TYPE, + REISER4_JNODE_PLUGIN_TYPE, + REISER4_CRYPTO_PLUGIN_TYPE, + REISER4_DIGEST_PLUGIN_TYPE, + REISER4_COMPRESSION_PLUGIN_TYPE, + REISER4_PSEUDO_PLUGIN_TYPE, + REISER4_PLUGIN_TYPES +} reiser4_plugin_type; + +struct reiser4_plugin_ops; +/* generic plugin operations, supported by each + plugin type. */ +typedef struct reiser4_plugin_ops reiser4_plugin_ops; + +TYPE_SAFE_LIST_DECLARE(plugin); + +/* the common part of all plugin instances. 
*/ +typedef struct plugin_header { + /* plugin type */ + reiser4_plugin_type type_id; + /* id of this plugin */ + reiser4_plugin_id id; + /* plugin operations */ + reiser4_plugin_ops *pops; +/* NIKITA-FIXME-HANS: usage of and access to label and desc is not commented and defined. */ + /* short label of this plugin */ + const char *label; + /* descriptive string.. */ + const char *desc; + /* list linkage */ + plugin_list_link linkage; +} plugin_header; + + +/* PRIVATE INTERFACES */ +/* NIKITA-FIXME-HANS: what is this for and why does it duplicate what is in plugin_header? */ +/* plugin type representation. */ +typedef struct reiser4_plugin_type_data { + /* internal plugin type identifier. Should coincide with + index of this item in plugins[] array. */ + reiser4_plugin_type type_id; + /* short symbolic label of this plugin type. Should be no longer + than MAX_PLUGIN_TYPE_LABEL_LEN characters including '\0'. */ + const char *label; + /* plugin type description longer than .label */ + const char *desc; + +/* NIKITA-FIXME-HANS: define built-in */ + /* number of built-in plugin instances of this type */ + int builtin_num; + /* array of built-in plugins */ + void *builtin; + plugin_list_head plugins_list; + size_t size; +} reiser4_plugin_type_data; + +extern reiser4_plugin_type_data plugins[REISER4_PLUGIN_TYPES]; + +int is_type_id_valid(reiser4_plugin_type type_id); +int is_plugin_id_valid(reiser4_plugin_type type_id, reiser4_plugin_id id); + +static inline reiser4_plugin * +plugin_at(reiser4_plugin_type_data * ptype, int i) +{ + char *builtin; + + builtin = ptype->builtin; + return (reiser4_plugin *) (builtin + i * ptype->size); +} + + +/* return plugin by its @type_id and @id */ +static inline reiser4_plugin * +plugin_by_id(reiser4_plugin_type type_id /* plugin type id */ , + reiser4_plugin_id id /* plugin id */ ) +{ + assert("nikita-1651", is_type_id_valid(type_id)); + assert("nikita-1652", is_plugin_id_valid(type_id, id)); + return plugin_at(&plugins[type_id], id); 
+} + +extern reiser4_plugin * +plugin_by_unsafe_id(reiser4_plugin_type type_id, reiser4_plugin_id id); + +/* get plugin whose id is stored in disk format */ +static inline reiser4_plugin * +plugin_by_disk_id(reiser4_tree * tree UNUSED_ARG /* tree, + * plugin + * belongs + * to */ , + reiser4_plugin_type type_id /* plugin type + * id */ , + d16 * did /* plugin id in disk format */ ) +{ + /* what we should do properly is to maintain within each + file-system a dictionary that maps on-disk plugin ids to + "universal" ids. This dictionary will be resolved on mount + time, so that this function will perform just one additional + array lookup. */ + return plugin_by_unsafe_id(type_id, d16tocpu(did)); +} + +/* __PLUGIN_HEADER_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/plugin_set.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/plugin_set.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,347 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ +/* NIKITA-FIXME-HANS: you didn't discuss this with me before coding it did you? Remove plugin-sets from code by March 15th, 2004 */ +/* plugin-sets */ + +/* + * Each inode comes with a whole set of plugins: file plugin, directory + * plugin, hash plugin, tail policy plugin, security plugin, etc. + * + * Storing them (pointers to them, that is) in inode is a waste of + * space. Especially, given that on average file system plugins of vast + * majority of files will belong to few sets (e.g., one set for regular files, + * another set for standard directory, etc.) + * + * Plugin set (pset) is an object containing pointers to all plugins required + * by inode. Inode only stores a pointer to pset. psets are "interned", that + * is, different inodes with the same set of plugins point to the same + * pset. 
This is achieved by storing psets in global hash table. Races are + * avoided by simple (and efficient so far) solution of never recycling psets, + * even when last inode pointing to it is destroyed. + * + */ + +#include "../debug.h" + +#include "plugin_set.h" + +#include +#include + +/* slab for plugin sets */ +static kmem_cache_t *plugin_set_slab; + +static spinlock_t plugin_set_lock[8] __cacheline_aligned_in_smp = { + [0 ... 7] = SPIN_LOCK_UNLOCKED +}; + +/* hash table support */ + +#define PS_TABLE_SIZE (32) + +static inline plugin_set * +cast_to(const unsigned long * a) +{ + return container_of(a, plugin_set, hashval); +} + +static inline int +pseq(const unsigned long * a1, const unsigned long * a2) +{ + plugin_set *set1; + plugin_set *set2; + + /* make sure fields are not missed in the code below */ + cassert(sizeof *set1 == + + sizeof set1->hashval + + sizeof set1->link + + + sizeof set1->file + + sizeof set1->dir + + sizeof set1->perm + + sizeof set1->formatting + + sizeof set1->hash + + sizeof set1->fibration + + sizeof set1->sd + + sizeof set1->dir_item + + sizeof set1->crypto + + sizeof set1->digest + + sizeof set1->compression); + + set1 = cast_to(a1); + set2 = cast_to(a2); + return + set1->hashval == set2->hashval && + + set1->file == set2->file && + set1->dir == set2->dir && + set1->perm == set2->perm && + set1->formatting == set2->formatting && + set1->hash == set2->hash && + set1->fibration == set2->fibration && + set1->sd == set2->sd && + set1->dir_item == set2->dir_item && + set1->crypto == set2->crypto && + set1->digest == set2->digest && + set1->compression == set2->compression; +} + +#define HASH_FIELD(hash, set, field) \ +({ \ + (hash) += (unsigned long)(set)->field >> 2; \ +}) + +static inline unsigned long calculate_hash(const plugin_set *set) +{ + unsigned long result; + + result = 0; + HASH_FIELD(result, set, file); + HASH_FIELD(result, set, dir); + HASH_FIELD(result, set, perm); + HASH_FIELD(result, set, formatting); + HASH_FIELD(result,
set, hash); + HASH_FIELD(result, set, fibration); + HASH_FIELD(result, set, sd); + HASH_FIELD(result, set, dir_item); + HASH_FIELD(result, set, crypto); + HASH_FIELD(result, set, digest); + HASH_FIELD(result, set, compression); + return result & (PS_TABLE_SIZE - 1); +} + +static inline unsigned long +pshash(ps_hash_table *table, const unsigned long * a) +{ + return *a; +} + +/* The hash table definition */ +#define KMALLOC(size) kmalloc((size), GFP_KERNEL) +#define KFREE(ptr, size) kfree(ptr) +TYPE_SAFE_HASH_DEFINE(ps, plugin_set, unsigned long, hashval, link, pshash, pseq); +#undef KFREE +#undef KMALLOC + +static ps_hash_table ps_table; +static plugin_set empty_set = { + .hashval = 0, + .file = NULL, + .dir = NULL, + .perm = NULL, + .formatting = NULL, + .hash = NULL, + .fibration = NULL, + .sd = NULL, + .dir_item = NULL, + .crypto = NULL, + .digest = NULL, + .compression = NULL, + .link = { NULL } +}; + +reiser4_internal plugin_set *plugin_set_get_empty(void) +{ + return &empty_set; +} + +reiser4_internal void plugin_set_put(plugin_set *set) +{ +} + +static inline unsigned long * +pset_field(plugin_set *set, int offset) +{ + return (unsigned long *)(((char *)set) + offset); +} + +static int plugin_set_field(plugin_set **set, const unsigned long val, const int offset) +{ + unsigned long *spot; + spinlock_t *lock; + plugin_set replica; + plugin_set *twin; + plugin_set *psal; + plugin_set *orig; + + assert("nikita-2902", set != NULL); + assert("nikita-2904", *set != NULL); + + spot = pset_field(*set, offset); + if (unlikely(*spot == val)) + return 0; + + replica = *(orig = *set); + *pset_field(&replica, offset) = val; + replica.hashval = calculate_hash(&replica); + rcu_read_lock(); + twin = ps_hash_find(&ps_table, &replica.hashval); + if (unlikely(twin == NULL)) { + rcu_read_unlock(); + psal = kmem_cache_alloc(plugin_set_slab, GFP_KERNEL); + if (psal == NULL) + return RETERR(-ENOMEM); + *psal = replica; + lock = &plugin_set_lock[replica.hashval & 7]; + 
spin_lock(lock); + twin = ps_hash_find(&ps_table, &replica.hashval); + if (likely(twin == NULL)) { + *set = psal; + ps_hash_insert_rcu(&ps_table, psal); + } else { + *set = twin; + kmem_cache_free(plugin_set_slab, psal); + } + spin_unlock(lock); + } else { + rcu_read_unlock(); + *set = twin; + } + return 0; +} + +static struct { + int offset; + reiser4_plugin_type type; +} pset_descr[PSET_LAST] = { + [PSET_FILE] = { + .offset = offsetof(plugin_set, file), + .type = REISER4_FILE_PLUGIN_TYPE + }, + [PSET_DIR] = { + .offset = offsetof(plugin_set, dir), + .type = REISER4_DIR_PLUGIN_TYPE + }, + [PSET_PERM] = { + .offset = offsetof(plugin_set, perm), + .type = REISER4_PERM_PLUGIN_TYPE + }, + [PSET_FORMATTING] = { + .offset = offsetof(plugin_set, formatting), + .type = REISER4_FORMATTING_PLUGIN_TYPE + }, + [PSET_HASH] = { + .offset = offsetof(plugin_set, hash), + .type = REISER4_HASH_PLUGIN_TYPE + }, + [PSET_FIBRATION] = { + .offset = offsetof(plugin_set, fibration), + .type = REISER4_FIBRATION_PLUGIN_TYPE + }, + [PSET_SD] = { + .offset = offsetof(plugin_set, sd), + .type = REISER4_ITEM_PLUGIN_TYPE + }, + [PSET_DIR_ITEM] = { + .offset = offsetof(plugin_set, dir_item), + .type = REISER4_ITEM_PLUGIN_TYPE + }, + [PSET_CRYPTO] = { + .offset = offsetof(plugin_set, crypto), + .type = REISER4_CRYPTO_PLUGIN_TYPE + }, + [PSET_DIGEST] = { + .offset = offsetof(plugin_set, digest), + .type = REISER4_DIGEST_PLUGIN_TYPE + }, + [PSET_COMPRESSION] = { + .offset = offsetof(plugin_set, compression), + .type = REISER4_COMPRESSION_PLUGIN_TYPE + } +}; + +#if REISER4_DEBUG +static reiser4_plugin_type +pset_member_to_type(pset_member memb) +{ + assert("nikita-3501", 0 <= memb && memb < PSET_LAST); + return pset_descr[memb].type; +} +#endif + +reiser4_plugin_type +pset_member_to_type_unsafe(pset_member memb) +{ + if (0 <= memb && memb < PSET_LAST) + return pset_descr[memb].type; + else + return REISER4_PLUGIN_TYPES; +} + +int pset_set(plugin_set **set, pset_member memb, reiser4_plugin *plugin) 
+{ + assert("nikita-3492", set != NULL); + assert("nikita-3493", *set != NULL); + assert("nikita-3494", plugin != NULL); + assert("nikita-3495", 0 <= memb && memb < PSET_LAST); + assert("nikita-3496", plugin->h.type_id == pset_member_to_type(memb)); + + return plugin_set_field(set, + (unsigned long)plugin, pset_descr[memb].offset); +} + +reiser4_plugin *pset_get(plugin_set *set, pset_member memb) +{ + assert("nikita-3497", set != NULL); + assert("nikita-3498", 0 <= memb && memb < PSET_LAST); + + return *(reiser4_plugin **)(((char *)set) + pset_descr[memb].offset); +} + +#define DEFINE_PLUGIN_SET(type, field) \ +reiser4_internal int plugin_set_ ## field(plugin_set **set, type *val) \ +{ \ + cassert(sizeof val == sizeof(unsigned long)); \ + return plugin_set_field(set, (unsigned long)val, \ + offsetof(plugin_set, field)); \ +} + +DEFINE_PLUGIN_SET(file_plugin, file) +DEFINE_PLUGIN_SET(dir_plugin, dir) +DEFINE_PLUGIN_SET(formatting_plugin, formatting) +DEFINE_PLUGIN_SET(hash_plugin, hash) +DEFINE_PLUGIN_SET(fibration_plugin, fibration) +DEFINE_PLUGIN_SET(item_plugin, sd) +DEFINE_PLUGIN_SET(crypto_plugin, crypto) +DEFINE_PLUGIN_SET(digest_plugin, digest) +DEFINE_PLUGIN_SET(compression_plugin, compression) + +reiser4_internal int plugin_set_init(void) +{ + int result; + + result = ps_hash_init(&ps_table, PS_TABLE_SIZE); + if (result == 0) { + plugin_set_slab = kmem_cache_create("plugin_set", + sizeof (plugin_set), 0, + SLAB_HWCACHE_ALIGN, + NULL, NULL); + if (plugin_set_slab == NULL) + result = RETERR(-ENOMEM); + } + return result; +} + +reiser4_internal void plugin_set_done(void) +{ + plugin_set * cur, * next; + + for_all_in_htable(&ps_table, ps, cur, next) { + ps_hash_remove(&ps_table, cur); + kmem_cache_free(plugin_set_slab, cur); + } + kmem_cache_destroy(plugin_set_slab); + ps_hash_done(&ps_table); +} + + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/plugin_set.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/plugin_set.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,77 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* plugin-sets. see fs/reiser4/plugin/plugin_set.c for details */ + +#if !defined( __PLUGIN_SET_H__ ) +#define __PLUGIN_SET_H__ + +#include "../type_safe_hash.h" +#include "plugin.h" + +#include + +struct plugin_set; +typedef struct plugin_set plugin_set; + +TYPE_SAFE_HASH_DECLARE(ps, plugin_set); + +struct plugin_set { + unsigned long hashval; + /* plugin of file */ + file_plugin *file; + /* plugin of dir */ + dir_plugin *dir; + /* perm plugin for this file */ + perm_plugin *perm; + /* tail policy plugin. Only meaningful for regular files */ + formatting_plugin *formatting; + /* hash plugin. Only meaningful for directories. */ + hash_plugin *hash; + /* fibration plugin. Only meaningful for directories. 
*/ + fibration_plugin *fibration; + /* plugin of stat-data */ + item_plugin *sd; + /* plugin of items a directory is built of */ + item_plugin *dir_item; + /* crypto plugin */ + crypto_plugin *crypto; + /* digest plugin */ + digest_plugin *digest; + /* compression plugin */ + compression_plugin *compression; + ps_hash_link link; +}; + +extern plugin_set *plugin_set_get_empty(void); +extern void plugin_set_put(plugin_set *set); + +extern int plugin_set_file (plugin_set **set, file_plugin *file); +extern int plugin_set_dir (plugin_set **set, dir_plugin *file); +extern int plugin_set_formatting (plugin_set **set, formatting_plugin *file); +extern int plugin_set_hash (plugin_set **set, hash_plugin *file); +extern int plugin_set_fibration (plugin_set **set, fibration_plugin *file); +extern int plugin_set_sd (plugin_set **set, item_plugin *file); +extern int plugin_set_crypto (plugin_set **set, crypto_plugin *file); +extern int plugin_set_digest (plugin_set **set, digest_plugin *file); +extern int plugin_set_compression(plugin_set **set, compression_plugin *file); + +extern int plugin_set_init(void); +extern void plugin_set_done(void); + +extern int pset_set(plugin_set **set, pset_member memb, reiser4_plugin *plugin); +extern reiser4_plugin *pset_get(plugin_set *set, pset_member memb); + +extern reiser4_plugin_type pset_member_to_type_unsafe(pset_member memb); + +/* __PLUGIN_SET_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/pseudo/pseudo.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/pseudo/pseudo.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1801 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Handling of "pseudo" files representing unified access to meta data in + reiser4. 
*/ + +/* + * See http://namesys.com/v4/v4.html, and especially + * http://namesys.com/v4/pseudo.html for basic information about reiser4 + * pseudo files, access to meta-data, reiser4() system call, etc. + * + * Pseudo files should be accessible from both reiser4() system call and + * normal POSIX calls. + * + * OVERVIEW + * + * Pseudo files provide access to various functionality through file + * system name space. As such they are similar to pseudo file systems + * already existing in UNIX and Linux: procfs, sysfs, etc. But pseudo + * files are embedded into name space of Reiser4---real block device based + * file system, and are more tightly integrated with it. In particular, + * some pseudo files are "attached" to other files (either "real" or also + * pseudo), by being accessible through path names of the form + * + * "a/b/c/something" + * + * Here object accessible through "a/b/c/something" is attached to the + * object accessible through "a/b/c" , and the latter is said to be the + * "host" object of the former. + * + * Object can have multiple pseudo files attached to it, distinguished by + * the last component of their names "something", "somethingelse", + * etc. + * + * (Note however, that currently "real" files have only one single pseudo + * file attached to them, viz. pseudo directory "....". This directory in + * turn contains all other pseudo files pertaining to the real file that + * "...." is attached to. To avoid referencing "...." all the time + * "a/b/c" is called a host of "/a/b/c/..../something". This violates + * definition above, but is convenient.) + * + * Moreover, in addition to the purely pseudo files (that is, file system + * objects whose content (as available through read(2) system call) is not + * backed by any kind of persistent storage), extended file attributes + * (see attr(5) on Linux, and http://acl.bestbits.at/) including security + * attributes such as ACLs are also available through file system name + * space. 
+ * + * As a result each file system object has a sub-name-space rooted at it, + * which is in striking contrast with the traditional UNIX file system, where + * only directories have sub-objects and all other types of files (regular, + * FIFO-s, devices, and symlinks) are leaves. + * + * For the sake of objectivity it should be mentioned that this is not a + * _completely_ new development in file system design, see + * http://docs.sun.com/db/doc/816-0220/6m6nkorp9?a=view + * + * In particular, as each object has sub-objects, name space tree is + * infinite in both extent (number of reachable objects) and depth. + * + * Some pseudo files are "built-in". They are present as sub-objects in + * each file system object, unless specifically disabled. + * + * Built-in pseudo files are implemented in this file and described at + * http://namesys.com/v4/pseudo.html + * + * IMPLEMENTATION + * + * Pseudo files are implemented as normal inodes, living in the same super + * block as other inodes for reiser4 file system. Their inode numbers are + * generated by fs/inode.c:new_inode() function and are not persistent (in + * the sense that they are not guaranteed to be the same after + * remount). To avoid clashes with "normal" inodes, all pseudo inodes are + * placed into otherwise unused locality (for example, 0), hence allowing + * reiser4_inode_find_actor() to tell them from normal inodes. + * + * All pseudo inodes share the same object plugin + * PSEUDO_FILE_PLUGIN_ID. In pseudo-inode specific part of reiser4_inode + * (pseudo_info), two things are stored: + * + * 1. pointer to the inode of the "host object" (for /a/b/c/..../acl, + * /a/b/c is the host object) + * + * 2. pointer to pseudo plugin, used by PSEUDO_FILE_PLUGIN_ID to + * implement VFS operations. + * + * This design has the following advantages: + * + * 1. provides for easy addition of new pseudo files without going + * through writing a whole new object plugin. + * + * 2. 
allows sys_reiser4() to be implemented by directly invoking + * pseudo plugin methods. + * + */ + +#include "../../inode.h" +#include "../../debug.h" +#include "../plugin.h" + +#include "pseudo.h" + +static int init_pseudo(struct inode *parent, struct inode *pseudo, + pseudo_plugin *pplug, const char *name); + +static struct inode *add_pseudo(struct inode *parent, + pseudo_plugin *pplug, struct dentry **d); + +/* + * helper method: set ->datum field in the pseudo file specific portion of + * reiser4 inode. + */ +static void pseudo_set_datum(struct inode *pseudo, unsigned long datum) +{ + reiser4_inode_data(pseudo)->file_plugin_data.pseudo_info.datum = datum; +} + +/* + * return id of pseudo file plugin for this inode @p + */ +static int pseudo_id(struct inode *p) +{ + return reiser4_inode_data(p)->file_plugin_data.pseudo_info.plugin->h.id; +} + +/* + * helper method used to implement ->lookup() method of pseudo files. + * + * Try to find a pseudo plugin that matches given name (stored in @dentry) and + * has ->parent field equal to @id. + * + * Convention is that ->parent field is set to the id of the pseudo plugin of + * the parent pseudo file in the hierarchy (that is, plugin for + * "a/..../foo/bar" has ->parent set to the plugin id of "a/..../foo"), with + * the exception of "a/...." that uses special reserved value TOP_LEVEL for + * ->parent. 
+ */ +static int +lookup_of_plugin(struct inode *parent, int id, struct dentry **dentry) +{ + const char *name; + struct inode *pseudo; + reiser4_plugin *plugin; + int result; + + name = (*dentry)->d_name.name; + pseudo = ERR_PTR(-ENOENT); + + /* scan all pseudo file plugins and check each */ + for_all_plugins(REISER4_PSEUDO_PLUGIN_TYPE, plugin) { + pseudo_plugin *pplug; + + pplug = &plugin->pseudo; + if (pplug->parent == id && + pplug->try != NULL && pplug->try(pplug, parent, name)) { + pseudo = add_pseudo(parent, pplug, dentry); + break; + } + } + if (!IS_ERR(pseudo)) + result = 0; + else + result = PTR_ERR(pseudo); + return result; +} + +/* + * implement ->lookup() method using convention described in the comment for + * lookup_of_plugin() function. + */ +static int lookup_table(struct inode *parent, struct dentry ** dentry) +{ + assert("nikita-3511", parent != NULL); + assert("nikita-3512", dentry != NULL); + assert("nikita-3513", + inode_file_plugin(parent)->h.id == PSEUDO_FILE_PLUGIN_ID); + + /* + * call lookup_of_plugin() passing id of pseudo plugin for @parent as + * "id" parameter. + */ + return lookup_of_plugin(parent, pseudo_id(parent), dentry); +} + +/* + * helper to implement ->readdir() method for the pseudo files. It uses the + * same convention as lookup_of_plugin() function. 
+ */ +static int +readdir_table(struct file *f, void *dirent, filldir_t filld) +{ + loff_t off; + ino_t ino; + int skip; + int id; + + struct inode *inode; + reiser4_plugin *plugin; + + off = f->f_pos; + if (off < 0) + return 0; + + inode = f->f_dentry->d_inode; + switch ((int)off) { + /* + * first, return dot and dotdot + */ + case 0: + ino = inode->i_ino; + if (filld(dirent, ".", 1, off, ino, DT_DIR) < 0) + break; + ++ off; + /* fallthrough */ + case 1: + ino = parent_ino(f->f_dentry); + if (filld(dirent, "..", 2, off, ino, DT_DIR) < 0) + break; + ++ off; + /* fallthrough */ + default: + skip = off - 2; + id = pseudo_id(inode); + /* then, scan all pseudo plugins, looking for the ones with + * matching ->parent */ + for_all_plugins(REISER4_PSEUDO_PLUGIN_TYPE, plugin) { + pseudo_plugin *pplug; + const char *name; + + pplug = &plugin->pseudo; + if (pplug->parent == id && pplug->readdirable) { + if (skip == 0) { + name = pplug->h.label; + /* + * if match is found---feed @filld with + * it + */ + if (filld(dirent, name, strlen(name), + off, + off + (long)f, DT_REG) < 0) + break; + ++ off; + } else + -- skip; + } + } + } + f->f_pos = off; + return 0; +} + +/* + * special value of ->parent field in pseudo file plugin used by "...." top + * level pseudo directory. + */ +#define TOP_LEVEL (-1) + +/* + * try to look up built-in pseudo file by its name. 
+ */ +reiser4_internal int +lookup_pseudo_file(struct inode *parent, struct dentry **dentry) +{ + assert("nikita-2999", parent != NULL); + assert("nikita-3000", dentry != NULL); + +#if !ENABLE_REISER4_PSEUDO + return RETERR(-ENOENT); +#endif /* ENABLE_REISER4_PSEUDO */ + /* if pseudo files are disabled for this file system bail out */ + if (reiser4_is_set(parent->i_sb, REISER4_NO_PSEUDO)) + return RETERR(-ENOENT); + else + return lookup_of_plugin(parent, TOP_LEVEL, dentry); +} + +/* create inode for pseudo file with plugin @pplug, and add it to the @parent + * under name @d */ +static struct inode *add_pseudo(struct inode *parent, + pseudo_plugin *pplug, struct dentry **d) +{ + struct inode *pseudo; + + pseudo = new_inode(parent->i_sb); + if (pseudo != NULL) { + int result; + + result = init_pseudo(parent, pseudo, pplug, (*d)->d_name.name); + if (result != 0) + pseudo = ERR_PTR(result); + else + *d = d_splice_alias(pseudo, *d); + } else + pseudo = ERR_PTR(RETERR(-ENOMEM)); + return pseudo; +} + + +/* helper function: return host object of @inode pseudo file */ +static struct inode *get_inode_host(struct inode *inode) +{ + assert("nikita-3510", + inode_file_plugin(inode)->h.id == PSEUDO_FILE_PLUGIN_ID); + return reiser4_inode_data(inode)->file_plugin_data.pseudo_info.host; +} + +/* helper function: return parent object of @inode pseudo file */ +static struct inode *get_inode_parent(struct inode *inode) +{ + assert("nikita-3510", + inode_file_plugin(inode)->h.id == PSEUDO_FILE_PLUGIN_ID); + return reiser4_inode_data(inode)->file_plugin_data.pseudo_info.parent; +} + +/* + * initialize pseudo file @pseudo to be child of @parent, with plugin @pplug + * and name @name. 
+ */ +static int +init_pseudo(struct inode *parent, struct inode *pseudo, + pseudo_plugin *pplug, const char *name) +{ + int result; + struct inode *host; + reiser4_inode *idata; + reiser4_object_create_data data; + static const oid_t pseudo_locality = 0x0ull; + + idata = reiser4_inode_data(pseudo); + /* all pseudo files live in special reserved locality */ + idata->locality_id = pseudo_locality; + + /* + * setup ->parent and ->host fields + */ + if (pplug->parent != TOP_LEVEL) + /* host of "a/..../b/c" is "a" */ + host = get_inode_host(parent); + else + /* host of "a/...." is "a" */ + host = parent; + + idata->file_plugin_data.pseudo_info.host = host; + idata->file_plugin_data.pseudo_info.parent = parent; + idata->file_plugin_data.pseudo_info.plugin = pplug; + + data.id = PSEUDO_FILE_PLUGIN_ID; + data.mode = pplug->lookup_mode; + + plugin_set_file(&idata->pset, file_plugin_by_id(PSEUDO_FILE_PLUGIN_ID)); + /* if plugin has a ->lookup method, it means that @pseudo should + * behave like directory. */ + if (pplug->lookup != NULL) + plugin_set_dir(&idata->pset, + dir_plugin_by_id(PSEUDO_DIR_PLUGIN_ID)); + + /* perform standard plugin initialization */ + result = inode_file_plugin(pseudo)->set_plug_in_inode(pseudo, + parent, &data); + if (result != 0) { + warning("nikita-3203", "Cannot install pseudo plugin"); + return result; + } + + /* inherit permission plugin from parent, */ + grab_plugin(pseudo, parent, PSET_PERM); + /* and credentials... 
*/ + pseudo->i_uid = parent->i_uid; + pseudo->i_gid = parent->i_gid; + + pseudo->i_nlink = 1; + /* insert inode into VFS hash table */ + insert_inode_hash(pseudo); + return 0; +} + +/* helper function: return host object by file descriptor */ +static struct inode *get_pseudo_host(struct file *file) +{ + struct inode *inode; + + inode = file->f_dentry->d_inode; + return get_inode_host(inode); +} + +/* helper function: return host object by seq_file */ +static struct inode *get_seq_pseudo_host(struct seq_file *seq) +{ + struct file *file; + + file = seq->private; + return get_pseudo_host(file); +} + +/* + * implementation of ->try method for pseudo files with fixed names. + */ +static int try_by_label(pseudo_plugin *pplug, + const struct inode *parent, const char *name) +{ + return !strcmp(name, pplug->h.label); +} + +/* + * read method for the "..../uid" pseudo file. + */ +static int show_uid(struct seq_file *seq, void *cookie) +{ + seq_printf(seq, "%lu", (long unsigned)get_seq_pseudo_host(seq)->i_uid); + return 0; +} + +/* helper: check permissions required to modify ..../[ug]id */ +static int check_perm(struct inode *inode) +{ + if (IS_RDONLY(inode)) + return RETERR(-EROFS); + if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) + return RETERR(-EPERM); + return 0; +} + +/* + * helper function to update [ug]id of @inode. 
Called by "..../[ug]id" write + * methods + */ +static int update_ugid(struct dentry *dentry, struct inode *inode, + uid_t uid, gid_t gid) +{ + int result; + + /* logic COPIED from fs/open.c:chown_common() */ + result = check_perm(inode); + if (result == 0) { + struct iattr newattrs; + + newattrs.ia_valid = ATTR_CTIME; + if (uid != (uid_t) -1) { + newattrs.ia_valid |= ATTR_UID; + newattrs.ia_uid = uid; + } + if (gid != (gid_t) -1) { + newattrs.ia_valid |= ATTR_GID; + newattrs.ia_gid = gid; + } + if (!S_ISDIR(inode->i_mode)) + newattrs.ia_valid |= ATTR_KILL_SUID|ATTR_KILL_SGID; + down(&inode->i_sem); + result = notify_change(dentry, &newattrs); + up(&inode->i_sem); + } + return result; +} + +/* + * write method for the "..../uid": extract uid from user-supplied buffer, + * and update uid + */ +static int store_uid(struct file *file, const char *buf) +{ + uid_t uid; + int result; + + if (sscanf(buf, "%i", &uid) == 1) { + struct inode *host; + + host = get_pseudo_host(file); + result = update_ugid(file->f_dentry->d_parent->d_parent, + host, uid, -1); + } else + result = RETERR(-EINVAL); + return result; +} + +/* + * read method for the "..../gid" pseudo file. 
+ */ +static int show_gid(struct seq_file *seq, void *cookie) +{ + seq_printf(seq, "%lu", (long unsigned)get_seq_pseudo_host(seq)->i_gid); + return 0; +} + +/* + * write method for the "..../gid": extract gid from user-supplied buffer, + * and update gid + */ +static int get_gid(struct file *file, const char *buf) +{ + gid_t gid; + int result; + + if (sscanf(buf, "%i", &gid) == 1) { + struct inode *host; + + host = get_pseudo_host(file); + result = update_ugid(file->f_dentry->d_parent->d_parent, + host, -1, gid); + } else + result = RETERR(-EINVAL); + return result; +} + +/* + * read method for the "..../oid" pseudo file + */ +static int show_oid(struct seq_file *seq, void *cookie) +{ + seq_printf(seq, "%llu", (unsigned long long)get_inode_oid(get_seq_pseudo_host(seq))); + return 0; +} + +/* + * read method for the "..../key" pseudo file + */ +static int show_key(struct seq_file *seq, void *cookie) +{ + char buf[KEY_BUF_LEN]; + reiser4_key key; + + sprintf_key(buf, build_sd_key(get_seq_pseudo_host(seq), &key)); + seq_printf(seq, "%s", buf); + return 0; +} + +/* + * read method for the "..../size" pseudo file + */ +static int show_size(struct seq_file *seq, void *cookie) +{ + seq_printf(seq, "%lli", get_seq_pseudo_host(seq)->i_size); + return 0; +} + +/* + * read method for the "..../nlink" pseudo file + */ +static int show_nlink(struct seq_file *seq, void *cookie) +{ + seq_printf(seq, "%u", get_seq_pseudo_host(seq)->i_nlink); + return 0; +} + +/* + * read method for the "..../locality" pseudo file + */ +static int show_locality(struct seq_file *seq, void *cookie) +{ + seq_printf(seq, "%llu", + (unsigned long long)reiser4_inode_data(get_seq_pseudo_host(seq))->locality_id); + return 0; +} + +/* + * read method for the "..../rwx" pseudo file + */ +static int show_rwx(struct seq_file *seq, void *cookie) +{ + umode_t m; + + m = get_seq_pseudo_host(seq)->i_mode; + seq_printf(seq, "%#ho %c%c%c%c%c%c%c%c%c%c", + m, + + S_ISREG(m) ? '-' : + S_ISDIR(m) ? 'd' : + S_ISCHR(m) ? 
'c' : + S_ISBLK(m) ? 'b' : + S_ISFIFO(m) ? 'p' : + S_ISLNK(m) ? 'l' : + S_ISSOCK(m) ? 's' : '?', + + m & S_IRUSR ? 'r' : '-', + m & S_IWUSR ? 'w' : '-', + m & S_IXUSR ? 'x' : '-', + + m & S_IRGRP ? 'r' : '-', + m & S_IWGRP ? 'w' : '-', + m & S_IXGRP ? 'x' : '-', + + m & S_IROTH ? 'r' : '-', + m & S_IWOTH ? 'w' : '-', + m & S_IXOTH ? 'x' : '-'); + return 0; +} + +/* + * write method for the "..../rwx" file. Extract permission bits from the + * user supplied buffer and update ->i_mode. + */ +static int get_rwx(struct file *file, const char *buf) +{ + umode_t rwx; + int result; + + if (sscanf(buf, "%hi", &rwx) == 1) { + struct inode *host; + + host = get_pseudo_host(file); + result = check_perm(host); + if (result == 0) { + struct iattr newattrs; + + down(&host->i_sem); + if (rwx == (umode_t)~0) + rwx = host->i_mode; + newattrs.ia_mode = + (rwx & S_IALLUGO) | (host->i_mode & ~S_IALLUGO); + newattrs.ia_valid = ATTR_MODE | ATTR_CTIME; + result = notify_change(file->f_dentry->d_parent->d_parent, + &newattrs); + up(&host->i_sem); + } + } else + result = RETERR(-EINVAL); + return result; +} + +/* + * seq-methods for "..../pseudo" + */ + +/* + * start iteration over all pseudo files + */ +static void * pseudos_start(struct seq_file *m, loff_t *pos) +{ + if (*pos >= LAST_PSEUDO_ID) + return NULL; + return pseudo_plugin_by_id(*pos); +} + +/* + * stop iteration over all pseudo files + */ +static void pseudos_stop(struct seq_file *m, void *v) +{ +} + +/* + * go to next pseudo file in the sequence + */ +static void * pseudos_next(struct seq_file *m, void *v, loff_t *pos) +{ + ++ (*pos); + return pseudos_start(m, pos); +} + +/* + * output information about particular pseudo file. 
+ */ +static int pseudos_show(struct seq_file *m, void *v) +{ + pseudo_plugin *pplug; + + pplug = v; + if (pplug->try != NULL) + seq_printf(m, "%s\n", pplug->h.label); + return 0; +} + +/* + * seq-methods for "..../bmap" + */ + +/* + * start iteration over all blocks allocated for the host file + */ +static void * bmap_start(struct seq_file *m, loff_t *pos) +{ + struct inode *host; + + host = get_seq_pseudo_host(m); + if (*pos << host->i_blkbits >= host->i_size) + return NULL; + else + return (void *)((unsigned long)*pos + 1); +} + +/* + * stop iteration over all blocks allocated for the host file + */ +static void bmap_stop(struct seq_file *m, void *v) +{ +} + +/* + * go to the next block in the sequence of blocks allocated for the host + * file. + */ +static void * bmap_next(struct seq_file *m, void *v, loff_t *pos) +{ + ++ (*pos); + return bmap_start(m, pos); +} + +extern int reiser4_lblock_to_blocknr(struct address_space *mapping, + sector_t lblock, reiser4_block_nr *blocknr); + +/* + * output information about single block number allocated for the host file + * into user supplied buffer + */ +static int bmap_show(struct seq_file *m, void *v) +{ + sector_t lblock; + int result; + reiser4_block_nr blocknr; + + lblock = ((sector_t)(unsigned long)v) - 1; + result = reiser4_lblock_to_blocknr(get_seq_pseudo_host(m)->i_mapping, + lblock, &blocknr); + if (result == 0) { + if (blocknr_is_fake(&blocknr)) + seq_printf(m, "%#llx\n", (unsigned long long)blocknr); + else + seq_printf(m, "%llu\n", (unsigned long long)blocknr); + } + return result; +} + +/* + * seq-methods for the "..../readdir" + */ + +/* "cursor" used to iterate over all directory entries for the host file */ +typedef struct readdir_cookie { + /* position within the tree */ + tap_t tap; + /* coord used by ->tap */ + coord_t coord; + /* lock handle used by ->tap */ + lock_handle lh; +} readdir_cookie; + +/* true if @coord stores directory entries for @host */ +static int is_host_item(struct inode *host, 
coord_t *coord) +{ + if (item_type_by_coord(coord) != DIR_ENTRY_ITEM_TYPE) + return 0; + if (!inode_file_plugin(host)->owns_item(host, coord)) + return 0; + return 1; +} + +/* helper function to release resources allocated to iterate over directory + * entries for the host file */ +static void finish(readdir_cookie *c) +{ + if (c != NULL && !IS_ERR(c)) { + /* release c->tap->lh long term lock... */ + tap_done(&c->tap); + /* ... and free cursor itself */ + kfree(c); + } +} + +/* + * start iterating over directory entries for the host file + */ +static void * readdir_start(struct seq_file *m, loff_t *pos) +{ + struct inode *host; + readdir_cookie *c; + dir_plugin *dplug; + reiser4_key dotkey; + struct qstr dotname; + int result; + loff_t entryno; + + /* + * first, lookup item containing dot of the host + */ + + host = get_seq_pseudo_host(m); + dplug = inode_dir_plugin(host); + + dotname.name = "."; + dotname.len = 1; + + down(&host->i_sem); + if (dplug == NULL || dplug->build_entry_key == NULL) { + finish(NULL); + return NULL; + } + + /* build key of dot */ + dplug->build_entry_key(host, &dotname, &dotkey); + + /* allocate cursor */ + c = kmalloc(sizeof *c, GFP_KERNEL); + if (c == NULL) { + finish(NULL); + return ERR_PTR(RETERR(-ENOMEM)); + } + + /* tree lookup */ + result = object_lookup(host, + &dotkey, + &c->coord, + &c->lh, + ZNODE_READ_LOCK, + FIND_EXACT, + LEAF_LEVEL, + LEAF_LEVEL, + CBK_READDIR_RA, + NULL); + + tap_init(&c->tap, &c->coord, &c->lh, ZNODE_READ_LOCK); + if (result == 0) + /* + * ok, now c->tap is positioned at the dot. We are requested + * to start readdir from the offset *pos. Skip that number of + * entries. That's not very efficient for the large + * directories. 
+ */ + result = tap_load(&c->tap); { + if (result == 0) { + for (entryno = 0; entryno != *pos; ++ entryno) { + result = go_next_unit(&c->tap); + if (result == -E_NO_NEIGHBOR) { + finish(c); + return NULL; + } + if (result != 0) + break; + if (!is_host_item(host, c->tap.coord)) { + finish(c); + return NULL; + } + } + } + } + if (result != 0) { + finish(c); + return ERR_PTR(result); + } else + return c; +} + +/* + * stop iterating over directory entries for the host file + */ +static void readdir_stop(struct seq_file *m, void *v) +{ + up(&get_seq_pseudo_host(m)->i_sem); + finish(v); +} + +/* + * go to the next entry in the host directory + */ +static void * readdir_next(struct seq_file *m, void *v, loff_t *pos) +{ + readdir_cookie *c; + struct inode *host; + int result; + + c = v; + ++ (*pos); + host = get_seq_pseudo_host(m); + /* next entry is in the next unit */ + result = go_next_unit(&c->tap); + if (result == 0) { + /* check whether end of the directory was reached. */ + if (!is_host_item(host, c->tap.coord)) { + finish(c); + return NULL; + } else + return v; + } else { + finish(c); + return ERR_PTR(result); + } +} + +/* + * output information about single directory entry in the host directory + */ +static int readdir_show(struct seq_file *m, void *v) +{ + readdir_cookie *c; + item_plugin *iplug; + char *name; + char buf[DE_NAME_BUF_LEN]; + + c = v; + iplug = item_plugin_by_coord(&c->coord); + + name = iplug->s.dir.extract_name(&c->coord, buf); + assert("nikita-3221", name != NULL); + /* entries are separated by the "/" in the user buffer, because this + * is the only symbol (besides NUL) that is not allowed in file + * names. */ + seq_printf(m, "%s/", name); + return 0; +} + +/* + * methods for "..../plugin" + */ + +/* + * entry in the table mapping plugin pseudo file name to the corresponding + * pset member. 
+ */ +typedef struct plugin_entry { + const char *name; + pset_member memb; +} plugin_entry; + +/* initializer for plugin_entry */ +#define PLUGIN_ENTRY(field, ind) \ +{ \ + .name = #field, \ + .memb = ind \ +} + +#define PSEUDO_ARRAY_ENTRY(idx, aname) \ +[idx] = { \ + .name = aname \ +} + +/* + * initialize array defining files available under "..../plugin". + */ +static plugin_entry pentry[] = { + /* "a/..../plugin/file" corresponds to the PSET_FILE plugin of its + * host file (that is, "a"), etc. */ + PLUGIN_ENTRY(file, PSET_FILE), + PLUGIN_ENTRY(dir, PSET_DIR), + PLUGIN_ENTRY(perm, PSET_PERM), + PLUGIN_ENTRY(formatting, PSET_FORMATTING), + PLUGIN_ENTRY(hash, PSET_HASH), + PLUGIN_ENTRY(fibration, PSET_FIBRATION), + PLUGIN_ENTRY(sd, PSET_SD), + PLUGIN_ENTRY(dir_item, PSET_DIR_ITEM), + PLUGIN_ENTRY(crypto, PSET_CRYPTO), + PLUGIN_ENTRY(digest, PSET_DIGEST), + PLUGIN_ENTRY(compression, PSET_COMPRESSION), + { + .name = NULL, + } +}; + +/* + * enumeration of files available under "a/..../plugin/foo" + */ +typedef enum { + PFIELD_TYPEID, /* "a/..../plugin/foo/type_id" contains type id of the + * plugin foo */ + PFIELD_ID, /* "a/..../plugin/foo/id" contains id of the plugin + * foo */ + PFIELD_LABEL, /* "a/..../plugin/foo/label" contains label of the + * plugin foo */ + PFIELD_DESC /* "a/..../plugin/foo/desc" contains description of + * the plugin foo */ +} plugin_field; + +/* map pseudo files under "a/..../plugin/foo" to their names */ +static plugin_entry fentry[] = { + PSEUDO_ARRAY_ENTRY(PFIELD_TYPEID, "type_id"), + PSEUDO_ARRAY_ENTRY(PFIELD_ID, "id"), + PSEUDO_ARRAY_ENTRY(PFIELD_LABEL, "label"), + PSEUDO_ARRAY_ENTRY(PFIELD_DESC, "desc"), + { + .name = NULL + }, +}; + +/* read method for "a/..../plugin/foo" */ +static int show_plugin(struct seq_file *seq, void *cookie) +{ + struct inode *host; + struct file *file; + struct inode *inode; + reiser4_plugin *plug; + plugin_entry *entry; + int idx; + plugin_set *pset; + + file = seq->private; + inode = 
file->f_dentry->d_inode; + + host = get_inode_host(inode); + idx = reiser4_inode_data(inode)->file_plugin_data.pseudo_info.datum; + entry = &pentry[idx]; + pset = reiser4_inode_data(host)->pset; + plug = pset_get(pset, entry->memb); + + if (plug != NULL) + seq_printf(seq, "%i %s %s", + plug->h.id, plug->h.label, plug->h.desc); + return 0; +} + +/* + * write method for "a/..../plugin/foo": extract plugin label from the user + * supplied buffer @buf and update plugin foo, if possible. + */ +static int set_plugin(struct file *file, const char *buf) +{ + struct inode *host; + struct inode *inode; + reiser4_plugin *plug; + plugin_entry *entry; + int idx; + plugin_set *pset; + int result; + reiser4_context ctx; + + inode = file->f_dentry->d_inode; + init_context(&ctx, inode->i_sb); + + host = get_inode_host(inode); + idx = reiser4_inode_data(inode)->file_plugin_data.pseudo_info.datum; + entry = &pentry[idx]; + pset = reiser4_inode_data(host)->pset; + + plug = lookup_plugin(entry->name, buf); + if (plug != NULL) { + result = force_plugin(host, entry->memb, plug); + if (result == 0) { + __u64 tograb; + + /* + * if plugin was updated successfully, save changes in + * the stat-data + */ + tograb = inode_file_plugin(host)->estimate.update(host); + result = reiser4_grab_space(tograb, BA_CAN_COMMIT); + if (result == 0) + result = reiser4_mark_inode_dirty(host); + } + } else + result = RETERR(-ENOENT); + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); + return result; +} + +/* + * helper function to implement ->lookup() method of pseudo directory plugin + * for the file that contains multiple similar children pseudo files. + * + * For example, "a/..../plugin/" directory contains files for each plugin + * associated with the host file "a". Handling of read/write for these file is + * exactly the same, the only difference being the pset member id for the + * corresponding plugin. 
Similarly, "a/..../plugin/foo/" itself contains + * files that are used to provide user access to the corresponding fields of + * the "foo" plugin, and all such fields can be handled similarly (see + * show_plugin_field()) + * + * To avoid code duplication in such situations, an array is constructed that + * is used as a map from the name of a "child" object to the corresponding + * "datum". All child objects are handled by the same pseudo plugin, and are + * differentiated by the datum installed into the pseudo file inode. + */ +static int array_lookup_pseudo(struct inode *parent, struct dentry ** dentry, + plugin_entry *array, pseudo_plugin *pplug) +{ + int result; + int idx; + struct inode *pseudo; + + pseudo = ERR_PTR(-ENOENT); + /* search for the given name in the array */ + for (idx = 0; array[idx].name != NULL; ++ idx) { + if (!strcmp((*dentry)->d_name.name, array[idx].name)) { + pseudo = add_pseudo(parent, pplug, dentry); + break; + } + } + if (IS_ERR(pseudo)) + result = PTR_ERR(pseudo); + else { + result = 0; + /* if name was found, set datum in the inode */ + pseudo_set_datum(pseudo, idx); + } + return result; +} + +/* + * helper method to implement ->readdir() for the situation when we have multiple + * child pseudo files with similar functionality. See comment for + * array_lookup_pseudo(). + */ +static int array_readdir_pseudo(struct file *f, void *dirent, filldir_t filld, + plugin_entry *array, int size) +{ + loff_t off; + ino_t ino; + + off = f->f_pos; + if (off < 0) + return 0; + + /* for god's sake, why switch(loff_t) requires __cmpdi2? 
*/ + switch ((int)off) { + case 0: + ino = f->f_dentry->d_inode->i_ino; + if (filld(dirent, ".", 1, off, ino, DT_DIR) < 0) + break; + ++ off; + /* fallthrough */ + case 1: + ino = parent_ino(f->f_dentry); + if (filld(dirent, "..", 2, off, ino, DT_DIR) < 0) + break; + ++ off; + /* fallthrough */ + default: + /* scan array for the names */ + for (; off < size + 1; ++ off) { + const char *name; + + name = array[off - 2].name; + if (filld(dirent, name, strlen(name), + off, off + (long)f, DT_REG) < 0) + break; + } + } + f->f_pos = off; + return 0; +} + + +/* + * ->lookup() method for the "a/..../plugin/foo/" directory. It uses array + * representation of child objects, described in the comment for + * array_lookup_pseudo(). + */ +static int lookup_plugin_field(struct inode *parent, struct dentry ** dentry) +{ + return array_lookup_pseudo(parent, dentry, fentry, + pseudo_plugin_by_id(PSEUDO_PLUGIN_FIELD_ID)); +} + +/* + * read method for "a/..../plugin/foo/field" + */ +static int show_plugin_field(struct seq_file *seq, void *cookie) +{ + struct inode *parent; + struct inode *host; + struct file *file; + struct inode *inode; + reiser4_plugin *plug; + plugin_entry *entry; + int pidx; + int idx; + plugin_set *pset; + + file = seq->private; + inode = file->f_dentry->d_inode; + + parent = get_inode_parent(inode); + host = get_inode_host(inode); + pidx = reiser4_inode_data(parent)->file_plugin_data.pseudo_info.datum; + idx = reiser4_inode_data(inode)->file_plugin_data.pseudo_info.datum; + entry = &pentry[pidx]; + pset = reiser4_inode_data(host)->pset; + plug = pset_get(pset, entry->memb); + + if (plug != NULL) { + switch (idx) { + case PFIELD_TYPEID: + seq_printf(seq, "%i", plug->h.type_id); + break; + case PFIELD_ID: + seq_printf(seq, "%i", plug->h.id); + break; + case PFIELD_LABEL: + seq_printf(seq, "%s", plug->h.label); + break; + case PFIELD_DESC: + seq_printf(seq, "%s", plug->h.desc); + break; + } + } + + return 0; +} + +/* + * ->readdir() method for "a/..../plugin/foo/". 
It uses array representation of + * child objects, described in the comment for array_lookup_pseudo(). + */ +static int readdir_plugin_field(struct file *f, void *dirent, filldir_t filld) +{ + return array_readdir_pseudo(f, dirent, filld, + fentry, sizeof_array(fentry)); +} + +/* + * ->lookup() method for the "a/..../plugin/" directory. It uses array + * representation of child objects, described in the comment for + * array_lookup_pseudo(). + */ +static int lookup_plugins(struct inode *parent, struct dentry ** dentry) +{ + return array_lookup_pseudo(parent, dentry, pentry, + pseudo_plugin_by_id(PSEUDO_PLUGIN_ID)); +} + +/* + * ->readdir() method for "a/..../plugin/". It uses array representation of + * child objects, described in the comment for array_lookup_pseudo(). + */ +static int readdir_plugins(struct file *f, void *dirent, filldir_t filld) +{ + return array_readdir_pseudo(f, dirent, filld, + pentry, sizeof_array(pentry)); +} + +/* + * seq-methods for the "a/..../items" + */ + +/* + * start iteration over a sequence of items for the host file. This iterator + * uses the same cursor as a readdir iterator above. + */ +static void * items_start(struct seq_file *m, loff_t *pos) +{ + struct inode *host; + readdir_cookie *c; + file_plugin *fplug; + reiser4_key headkey; + int result; + loff_t entryno; + + /* + * first, find first item in the file, then, scan to the *pos-th one. 
+ */ + + host = get_seq_pseudo_host(m); + fplug = inode_file_plugin(host); + + down(&host->i_sem); + if (fplug->key_by_inode == NULL) { + finish(NULL); + return NULL; + } + + /* construct a key of the first item */ + fplug->key_by_inode(host, 0, &headkey); + + c = kmalloc(sizeof *c, GFP_KERNEL); + if (c == NULL) { + finish(NULL); + return ERR_PTR(RETERR(-ENOMEM)); + } + + /* find first item */ + result = object_lookup(host, + &headkey, + &c->coord, + &c->lh, + ZNODE_READ_LOCK, + FIND_MAX_NOT_MORE_THAN, + TWIG_LEVEL, + LEAF_LEVEL, + 0, + NULL); + + tap_init(&c->tap, &c->coord, &c->lh, ZNODE_READ_LOCK); + if (result == 0) + result = tap_load(&c->tap); { + if (result == 0) { + /* + * skip @pos items + */ + for (entryno = 0; entryno != *pos; ++ entryno) { + result = go_next_unit(&c->tap); + if (result == -E_NO_NEIGHBOR) { + finish(c); + return NULL; + } + if (result != 0) + break; + if (!fplug->owns_item(host, c->tap.coord)) { + finish(c); + return NULL; + } + } + } + } + if (result != 0) { + finish(c); + return ERR_PTR(result); + } else + return c; +} + +/* + * stop iteration over a sequence of items for the host file + */ +static void items_stop(struct seq_file *m, void *v) +{ + up(&get_seq_pseudo_host(m)->i_sem); + finish(v); +} + +/* go to the next item in the host file */ +static void * items_next(struct seq_file *m, void *v, loff_t *pos) +{ + readdir_cookie *c; + struct inode *host; + int result; + + c = v; + ++ (*pos); + host = get_seq_pseudo_host(m); + result = go_next_unit(&c->tap); + if (result == 0) { + if (!inode_file_plugin(host)->owns_item(host, c->tap.coord)) { + finish(c); + return NULL; + } else + return v; + } else { + finish(c); + return ERR_PTR(result); + } +} + +/* output information about single item of the host file */ +static int items_show(struct seq_file *m, void *v) +{ + readdir_cookie *c; + item_plugin *iplug; + char buf[KEY_BUF_LEN]; + reiser4_key key; + + + c = v; + iplug = item_plugin_by_coord(&c->coord); + + /* output key... 
*/ + sprintf_key(buf, unit_key_by_coord(&c->coord, &key)); + /* ... and item plugin label... */ + seq_printf(m, "%s %s\n", buf, iplug->h.label); + return 0; +} + +extern int +invoke_create_method(struct inode *, struct dentry *, + reiser4_object_create_data *); + +/* + * write method for the "a/..../new" file. Extract file name from the user + * supplied buffer @buf, and create regular file with that name within host + * file (that is better to be a directory). + */ +static int get_new(struct file *file, const char *buf) +{ + int result; + + /* check that @buf contains no slashes */ + if (strchr(buf, '/') == NULL) { + struct dentry *d; + struct qstr name; + unsigned int c; + unsigned long hash; + + reiser4_object_create_data data; + memset(&data, 0, sizeof data); + + data.mode = S_IFREG | 0 /* mode */; + data.id = UNIX_FILE_PLUGIN_ID; + + name.name = buf; + c = *(const unsigned char *)buf; + + /* build hash of the name */ + hash = init_name_hash(); + do { + buf++; + hash = partial_name_hash(c, hash); + c = *(const unsigned char *)buf; + } while (c); + name.len = buf - (const char *) name.name; + name.hash = end_name_hash(hash); + + /* allocate dentry */ + d = d_alloc(file->f_dentry->d_parent->d_parent, &name); + if (d == NULL) + result = RETERR(-ENOMEM); + else { + /* call ->create() method of the host directory */ + result = invoke_create_method(get_pseudo_host(file), + d, &data); + reiser4_free_dentry_fsdata(d); + } + } else + result = RETERR(-EINVAL); + return result; +} + +/* + * initialize pseudo plugins. 
+ */ +pseudo_plugin pseudo_plugins[LAST_PSEUDO_ID] = { + [PSEUDO_METAS_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_METAS_ID, + .pops = NULL, + .label = "....", + .desc = "meta-files", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = TOP_LEVEL, + .try = try_by_label, + .readdirable = 0, + .lookup = lookup_table, + .lookup_mode = S_IFDIR | S_IRUGO | S_IXUGO, + .read_type = PSEUDO_READ_NONE, + .write_type = PSEUDO_WRITE_NONE, + .readdir = readdir_table + }, + [PSEUDO_UID_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_UID_ID, + .pops = NULL, + .label = "uid", + .desc = "returns owner", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO | S_IWUSR, + .read_type = PSEUDO_READ_SINGLE, + .read = { + .single_show = show_uid + }, + .write_type = PSEUDO_WRITE_STRING, + .write = { + .gets = store_uid + } + }, + [PSEUDO_GID_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_GID_ID, + .pops = NULL, + .label = "gid", + .desc = "returns group", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO | S_IWUSR, + .read_type = PSEUDO_READ_SINGLE, + .read = { + .single_show = show_gid + }, + .write_type = PSEUDO_WRITE_STRING, + .write = { + .gets = get_gid + } + }, + [PSEUDO_RWX_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_RWX_ID, + .pops = NULL, + .label = "rwx", + .desc = "returns rwx permissions", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO | S_IWUSR, + .read_type = PSEUDO_READ_SINGLE, + .read = { + .single_show = show_rwx + }, + .write_type = PSEUDO_WRITE_STRING, + .write = { + .gets = get_rwx + } + }, + [PSEUDO_OID_ID] = { + 
.h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_OID_ID, + .pops = NULL, + .label = "oid", + .desc = "returns object id", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO, + .read_type = PSEUDO_READ_SINGLE, + .read = { + .single_show = show_oid + }, + .write_type = PSEUDO_WRITE_NONE + }, + [PSEUDO_KEY_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_KEY_ID, + .pops = NULL, + .label = "key", + .desc = "returns object's key", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO, + .read_type = PSEUDO_READ_SINGLE, + .read = { + .single_show = show_key + }, + .write_type = PSEUDO_WRITE_NONE + }, + [PSEUDO_SIZE_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_SIZE_ID, + .pops = NULL, + .label = "size", + .desc = "returns object's size", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO, + .read_type = PSEUDO_READ_SINGLE, + .read = { + .single_show = show_size + }, + .write_type = PSEUDO_WRITE_NONE + }, + [PSEUDO_NLINK_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_NLINK_ID, + .pops = NULL, + .label = "nlink", + .desc = "returns nlink count", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO, + .read_type = PSEUDO_READ_SINGLE, + .read = { + .single_show = show_nlink + }, + .write_type = PSEUDO_WRITE_NONE + }, + [PSEUDO_LOCALITY_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_LOCALITY_ID, + .pops = NULL, + .label = "locality", + .desc = "returns object's locality", + .linkage = 
TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO, + .read_type = PSEUDO_READ_SINGLE, + .read = { + .single_show = show_locality + }, + .write_type = PSEUDO_WRITE_NONE + }, + [PSEUDO_PSEUDOS_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_PSEUDOS_ID, + .pops = NULL, + .label = "pseudo", + .desc = "returns a list of pseudo files", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO, + .read_type = PSEUDO_READ_SEQ, + .read = { + .ops = { + .start = pseudos_start, + .stop = pseudos_stop, + .next = pseudos_next, + .show = pseudos_show + } + }, + .write_type = PSEUDO_WRITE_NONE + }, + [PSEUDO_BMAP_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_BMAP_ID, + .pops = NULL, + .label = "bmap", + .desc = "returns a list of blocks for this file", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO, + .read_type = PSEUDO_READ_SEQ, + .read = { + .ops = { + .start = bmap_start, + .stop = bmap_stop, + .next = bmap_next, + .show = bmap_show + } + }, + .write_type = PSEUDO_WRITE_NONE + }, + [PSEUDO_READDIR_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_READDIR_ID, + .pops = NULL, + .label = "readdir", + .desc = "returns a list of names in the dir", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO, + .read_type = PSEUDO_READ_SEQ, + .read = { + .ops = { + .start = readdir_start, + .stop = readdir_stop, + .next = readdir_next, + .show = readdir_show + } + }, + .write_type = PSEUDO_WRITE_NONE + }, + [PSEUDO_PLUGIN_ID] = { + .h = { + .type_id = 
REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_PLUGIN_ID, + .pops = NULL, + .label = "plugin", + .desc = "plugin", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_PLUGINS_ID, + .try = NULL, + .readdirable = 0, + .lookup = lookup_plugin_field, + /* + * foo/..../plugin/bar is much like a directory. So, why + * is there no S_IFDIR term in the .lookup_mode, you ask? + * + * fs/namei.c:may_open(): + * + * if (S_ISDIR(inode->i_mode) && (flag & FMODE_WRITE)) + * return -EISDIR; + * + * Directory cannot be opened for write. How smart. + */ + .lookup_mode = S_IFREG | S_IRUGO | S_IWUSR | S_IXUGO, + .read_type = PSEUDO_READ_SINGLE, + .read = { + .single_show = show_plugin + }, + .write_type = PSEUDO_WRITE_STRING, + .write = { + .gets = set_plugin + }, + .readdir = readdir_plugin_field + }, + [PSEUDO_PLUGINS_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_PLUGINS_ID, + .pops = NULL, + .label = "plugin", + .desc = "list of plugins", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = lookup_plugins, + .lookup_mode = S_IFDIR | S_IRUGO | S_IXUGO, + .read_type = PSEUDO_READ_NONE, + .write_type = PSEUDO_WRITE_NONE, + .readdir = readdir_plugins + }, + [PSEUDO_PLUGIN_FIELD_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_PLUGIN_FIELD_ID, + .pops = NULL, + .label = "plugin-field", + .desc = "plugin field", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_PLUGIN_ID, + .try = NULL, + .readdirable = 0, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO, + .read_type = PSEUDO_READ_SINGLE, + .read = { + .single_show = show_plugin_field + }, + .write_type = PSEUDO_WRITE_NONE, + .readdir = NULL + }, + [PSEUDO_ITEMS_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_ITEMS_ID, + .pops = NULL, + .label = "items", + .desc = "returns a list of items for this file", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = 
PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IRUGO, + .read_type = PSEUDO_READ_SEQ, + .read = { + .ops = { + .start = items_start, + .stop = items_stop, + .next = items_next, + .show = items_show + } + }, + .write_type = PSEUDO_WRITE_NONE + }, + [PSEUDO_NEW_ID] = { + .h = { + .type_id = REISER4_PSEUDO_PLUGIN_TYPE, + .id = PSEUDO_NEW_ID, + .pops = NULL, + .label = "new", + .desc = "creates new file in the host", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .parent = PSEUDO_METAS_ID, + .try = try_by_label, + .readdirable = 1, + .lookup = NULL, + .lookup_mode = S_IFREG | S_IWUSR, + .read_type = PSEUDO_READ_NONE, + .read = { + .single_show = show_rwx + }, + .write_type = PSEUDO_WRITE_STRING, + .write = { + .gets = get_new + } + }, +}; + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/pseudo/pseudo.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/pseudo/pseudo.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,176 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Handling of "pseudo" files representing unified access to meta data in + reiser4. See pseudo.c for more comments. */ + +#if !defined( __REISER4_PSEUDO_H__ ) +#define __REISER4_PSEUDO_H__ + +#include "../plugin_header.h" +#include "../../key.h" + +#include +#include + +/* + * tag used by wrappers in plugin/file/pseudo.c to perform actions for the + * particular flavor of pseudo file. 
+ */ +typedef enum { + /* this pseudo file cannot be read */ + PSEUDO_READ_NONE, + /* this pseudo file uses seq_* functions (fs/seq_file.c) to generate + * its content */ + PSEUDO_READ_SEQ, + /* this pseudo file contains a single value */ + PSEUDO_READ_SINGLE, + /* this pseudo file has some special ->read() method that should be + * called */ + PSEUDO_READ_FORWARD +} pseudo_read_type; + +typedef enum { + /* this pseudo file cannot be written into */ + PSEUDO_WRITE_NONE, + /* data written to this pseudo file is passed as a string to ->gets() */ + PSEUDO_WRITE_STRING, + /* this pseudo file has some special ->write() method that should be + * called */ + PSEUDO_WRITE_FORWARD +} pseudo_write_type; + +/* low level operations on the pseudo files. + + Methods from this interface are directly callable by reiser4 system call. + + This operation structure looks suspiciously like yet another plugin + type. Making it one would simplify some things. For example, there are already + functions to look up plugin by name, etc. + +*/ +struct pseudo_plugin; +typedef struct pseudo_plugin pseudo_plugin; +struct pseudo_plugin { + + /* common fields */ + plugin_header h; + + /* + * id of plugin of the parent pseudo file in the directory + * hierarchy. See comment for + * plugin/pseudo/pseudo.c:lookup_of_plugin(). + */ + int parent; + + /* + * check whether this pseudo file matches name @name within @parent + */ + int (*try) (pseudo_plugin *pplug, + const struct inode *parent, const char *name); + /* + * true if this pseudo file is visible in readdir. + */ + int readdirable; + /* lookup method applicable to this pseudo file by method name. + + This is for something like "foo/..acl/dup", here "..acl" is the + name of a pseudo file, and "dup" is name of an operation (method) + applicable to "..acl". Once "..acl" is resolved to ACL object, + ->lookup( "dup" ) can be called to get operation. 
+ + */ + int (*lookup)(struct inode *parent, struct dentry ** dentry); + + /* + * rwx bits returned by stat(2) for this pseudo file + */ + umode_t lookup_mode; + + /* NOTE-NIKITA some other operations. Reiser4 syntax people should + add something here. */ + + /* + * how content of this pseudo file is generated + */ + pseudo_read_type read_type; + union { + /* for PSEUDO_READ_SEQ */ + struct seq_operations ops; + /* for PSEUDO_READ_SINGLE */ + int (*single_show) (struct seq_file *, void *); + /* for PSEUDO_READ_FORWARD */ + ssize_t (*read)(struct file *, char __user *, size_t , loff_t *); + } read; + + /* + * how this pseudo file reacts to write(2) calls + */ + pseudo_write_type write_type; + union { + /* for PSEUDO_WRITE_STRING */ + int (*gets)(struct file *, const char *); + /* for PSEUDO_WRITE_FORWARD */ + ssize_t (*write)(struct file *, + const char __user *, size_t , loff_t *); + } write; + /* + * ->readdir method + */ + int (*readdir)(struct file *f, void *dirent, filldir_t filld); +}; + +/* portion of reiser4_inode specific for pseudo files */ +typedef struct pseudo_info { + /* pseudo file plugin controlling this file */ + pseudo_plugin *plugin; + /* host object, for /etc/passwd/..oid, this is pointer to inode of + * /etc/passwd */ + struct inode *host; + /* immediate parent object. This is different from ->host for deeply + * nested pseudo files like foo/..plugin/foo */ + struct inode *parent; + /* for private use of pseudo file plugin */ + unsigned long datum; +} pseudo_info_t; + +extern int lookup_pseudo_file(struct inode *parent, struct dentry **dentry); + +/* + * ids of pseudo files. See plugin/pseudo/pseudo.c for more details on each + * particular pseudo file. 
+ */ +typedef enum { + PSEUDO_METAS_ID, + PSEUDO_UID_ID, + PSEUDO_GID_ID, + PSEUDO_RWX_ID, + PSEUDO_OID_ID, + PSEUDO_KEY_ID, + PSEUDO_SIZE_ID, + PSEUDO_NLINK_ID, + PSEUDO_LOCALITY_ID, + PSEUDO_PSEUDOS_ID, + PSEUDO_BMAP_ID, + PSEUDO_READDIR_ID, + PSEUDO_PLUGIN_ID, + PSEUDO_PLUGINS_ID, + PSEUDO_PLUGIN_FIELD_ID, + PSEUDO_ITEMS_ID, + PSEUDO_NEW_ID, + LAST_PSEUDO_ID +} reiser4_pseudo_id; + +/* __REISER4_PSEUDO_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/security/perm.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/security/perm.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,91 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ +/* NIKITA-FIXME-HANS: this comment describes what code? */ +/* definition of item plugins. */ + +#include "../plugin.h" +#include "../plugin_header.h" +#include "../../debug.h" + +#include +#include /* for struct dentry */ +#include +#include + +static int +mask_ok_common(struct inode *inode, int mask) +{ + return generic_permission(inode, mask, NULL); +} + +static int +setattr_ok_common(struct dentry *dentry, struct iattr *attr) +{ + int result; + struct inode *inode; + + assert("nikita-2272", dentry != NULL); + assert("nikita-2273", attr != NULL); + + inode = dentry->d_inode; + assert("nikita-2274", inode != NULL); + + result = inode_change_ok(inode, attr); + if (result == 0) { + unsigned int valid; + + valid = attr->ia_valid; + if ((valid & ATTR_UID && attr->ia_uid != inode->i_uid) || + (valid & ATTR_GID && attr->ia_gid != inode->i_gid)) + result = DQUOT_TRANSFER(inode, attr) ? -EDQUOT : 0; + } + return result; +} + +static int +read_ok_common( + struct file * file, const char *buf, size_t size, loff_t *off) +{ + return access_ok(VERIFY_WRITE, buf, size) ? 
0 : -EFAULT; +} + +static int +write_ok_common( + struct file * file, const char *buf, size_t size, loff_t *off) +{ + return access_ok(VERIFY_READ, buf, size) ? 0 : -EFAULT; +} + +perm_plugin perm_plugins[LAST_PERM_ID] = { +/* NIKITA-FIXME-HANS: what file contains rwx permissions methods code? */ + [RWX_PERM_ID] = { + .h = { + .type_id = REISER4_PERM_PLUGIN_TYPE, + .id = RWX_PERM_ID, + .pops = NULL, + .label = "rwx", + .desc = "standard UNIX permissions", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .read_ok = read_ok_common, + .write_ok = write_ok_common, + .lookup_ok = NULL, + .create_ok = NULL, + .link_ok = NULL, + .unlink_ok = NULL, + .delete_ok = NULL, + .mask_ok = mask_ok_common, + .setattr_ok = setattr_ok_common, + .getattr_ok = NULL, + .rename_ok = NULL, + }, +}; + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/security/perm.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/security/perm.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,88 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Perm (short for "permissions") plugins common stuff. */ + +#if !defined( __REISER4_PERM_H__ ) +#define __REISER4_PERM_H__ + +#include "../../forward.h" +#include "../plugin_header.h" + +#include +#include /* for struct file */ +#include /* for struct dentry */ + +/* interface for perm plugin. + + Perm plugin method can be implemented through: + + 1. consulting ->i_mode bits in stat data + + 2. obtaining acl from the tree and inspecting it + + 3. asking some kernel module or user-level program to authorize access. + + This allows for integration with things like capabilities, SELinux-style + security contexts, etc. + +*/ +/* NIKITA-FIXME-HANS: define what this is targeted for. It does not seem to be intended for use with sys_reiser4. Explain. 
*/ +typedef struct perm_plugin { + /* generic plugin fields */ + plugin_header h; + + /* check permissions for read/write */ + int (*read_ok) (struct file * file, const char *buf, size_t size, loff_t * off); + int (*write_ok) (struct file * file, const char *buf, size_t size, loff_t * off); + + /* check permissions for lookup */ + int (*lookup_ok) (struct inode * parent, struct dentry * dentry); + + /* check permissions for create */ + int (*create_ok) (struct inode * parent, struct dentry * dentry, reiser4_object_create_data * data); + + /* check permissions for linking @where to @existing */ + int (*link_ok) (struct dentry * existing, struct inode * parent, struct dentry * where); + + /* check permissions for unlinking @victim from @parent */ + int (*unlink_ok) (struct inode * parent, struct dentry * victim); + + /* check permissions for deletion of @object whose last reference is + by @parent */ + int (*delete_ok) (struct inode * parent, struct dentry * victim); + int (*mask_ok) (struct inode * inode, int mask); + /* check whether attribute change is acceptable */ + int (*setattr_ok) (struct dentry * dentry, struct iattr * attr); + + /* check whether stat(2) is allowed */ + int (*getattr_ok) (struct vfsmount * mnt UNUSED_ARG, struct dentry * dentry, struct kstat * stat); + /* check whether rename(2) is allowed */ + int (*rename_ok) (struct inode * old_dir, struct dentry * old, + struct inode * new_dir, struct dentry * new); +} perm_plugin; +/* NIKITA-FIXME-HANS: I really hate things like this that kill the ability of Meta-. to work. Please eliminate this macro, exce */ +/* call ->check_ok method of perm plugin for inode */ +#define perm_chk(inode, check, ...) \ +({ \ + perm_plugin *perm; \ + \ + perm = inode_perm_plugin(inode); \ + (perm == NULL || perm->check ## _ok == NULL) ? \ + 0 : \ + perm->check ## _ok(__VA_ARGS__); \ +}) + +typedef enum { RWX_PERM_ID, LAST_PERM_ID } reiser4_perm_id; + +/* __REISER4_PERM_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/space/bitmap.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/space/bitmap.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1646 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "../../debug.h" +#include "../../dformat.h" +#include "../../txnmgr.h" +#include "../../jnode.h" +#include "../../block_alloc.h" +#include "../../tree.h" +#include "../../super.h" +#include "../../lib.h" +#include "../plugin.h" +#include "space_allocator.h" +#include "bitmap.h" + +#include +#include /* for struct super_block */ +#include +#include + +/* Proposed (but discarded) optimization: dynamic loading/unloading of bitmap + * blocks + + A useful optimization of reiser4 bitmap handling would be dynamic loading + and unloading of bitmap blocks, unlike v3.x where all bitmap + blocks are loaded at mount time. + + To implement bitmap block unloading we need to count bitmap block usage + and detect currently unused blocks, allowing them to be unloaded. It is not + a simple task since we allow several threads to modify one bitmap block + simultaneously. + + Briefly speaking, the following scheme is proposed: we keep a special + counter associated with each bitmap block. It counts block + alloc/dealloc operations on that bitmap block. With the deferred block + deallocation feature of reiser4 all those operations will be represented in + atom dirty/deleted lists as jnodes for freshly allocated or deleted + nodes. + + So, we increment the usage counter for each new node allocated or deleted, and + decrement it at atom commit once for each node from the atom's + dirty/deleted lists. Of course, deletion of a freshly allocated node and node reuse + from the atom's deleted list (if we do so) should decrement the bitmap usage counter + also. 
+ + This scheme seems workable, but that reference counting is + not easy to debug. I think we should agree with Hans and not implement + it in v4.0. Current code implements "on-demand" bitmap blocks loading only. + + For simplicity all bitmap nodes (both commit and working bitmap blocks) are + either loaded into memory at fs mount time, or each bitmap node is loaded at the + first access to it; the "dont_load_bitmap" mount option controls whether + bitmap nodes should be loaded at mount time. Dynamic unloading of bitmap + nodes currently is not supported. */ + +#define CHECKSUM_SIZE 4 + +#define BYTES_PER_LONG (sizeof(long)) + +#if BITS_PER_LONG == 64 +# define LONG_INT_SHIFT (6) +#else +# define LONG_INT_SHIFT (5) +#endif + +#define LONG_INT_MASK (BITS_PER_LONG - 1UL) + +typedef unsigned long ulong_t; + + +#define bmap_size(blocksize) ((blocksize) - CHECKSUM_SIZE) +#define bmap_bit_count(blocksize) (bmap_size(blocksize) << 3) + +/* Block allocation/deallocation are done through special bitmap objects which + are allocated in an array at fs mount. */ +struct bitmap_node { + struct semaphore sema; /* long term lock object */ + + jnode *wjnode; /* j-nodes for WORKING ... */ + jnode *cjnode; /* ... 
and COMMIT bitmap blocks */ + + bmap_off_t first_zero_bit; /* for skip_busy option implementation */ + + atomic_t loaded; /* a flag which shows that bnode is loaded + * already */ +}; + +static inline char * +bnode_working_data(struct bitmap_node *bnode) +{ + char *data; + + data = jdata(bnode->wjnode); + assert("zam-429", data != NULL); + + return data + CHECKSUM_SIZE; +} + +static inline char * +bnode_commit_data(const struct bitmap_node *bnode) +{ + char *data; + + data = jdata(bnode->cjnode); + assert("zam-430", data != NULL); + + return data + CHECKSUM_SIZE; +} + +static inline __u32 +bnode_commit_crc(const struct bitmap_node *bnode) +{ + char *data; + + data = jdata(bnode->cjnode); + assert("vpf-261", data != NULL); + + return d32tocpu((d32 *) data); +} + +static inline void +bnode_set_commit_crc(struct bitmap_node *bnode, __u32 crc) +{ + char *data; + + data = jdata(bnode->cjnode); + assert("vpf-261", data != NULL); + + cputod32(crc, (d32 *) data); +} + +/* ZAM-FIXME-HANS: is the idea that this might be a union someday? having + * written the code, does this added abstraction still have */ +/* ANSWER(Zam): No, the abstractions is in the level above (exact place is the + * reiser4_space_allocator structure) */ +/* ZAM-FIXME-HANS: I don't understand your english in comment above. */ +/* FIXME-HANS(Zam): I don't understand the questions like "might be a union + * someday?". What they about? If there is a reason to have a union, it should + * be a union, if not, it should not be a union. "..might be someday" means no + * reason. 
*/ +struct bitmap_allocator_data { + /* an array for bitmap blocks direct access */ + struct bitmap_node *bitmap; +}; + +#define get_barray(super) \ +(((struct bitmap_allocator_data *)(get_super_private(super)->space_allocator.u.generic)) -> bitmap) + +#define get_bnode(super, i) (get_barray(super) + i) + +/* allocate and initialize jnode with JNODE_BITMAP type */ +static jnode * +bnew(void) +{ + jnode *jal = jalloc(); + + if (jal) + jnode_init(jal, current_tree, JNODE_BITMAP); + + return jal; +} + +/* this file contains: + - bitmap based implementation of space allocation plugin + - all the helper functions like set bit, find_first_zero_bit, etc */ + +/* Audited by: green(2002.06.12) */ +static int +find_next_zero_bit_in_word(ulong_t word, int start_bit) +{ + ulong_t mask = 1UL << start_bit; + int i = start_bit; + + while ((word & mask) != 0) { + mask <<= 1; + if (++i >= BITS_PER_LONG) + break; + } + + return i; +} + +#include + +#if BITS_PER_LONG == 64 + +#define OFF(addr) (((ulong_t)(addr) & (BYTES_PER_LONG - 1)) << 3) +#define BASE(addr) ((ulong_t*) ((ulong_t)(addr) & ~(BYTES_PER_LONG - 1))) + +static inline void reiser4_set_bit(int nr, void * addr) +{ + ext2_set_bit(nr + OFF(addr), BASE(addr)); +} + +static inline void reiser4_clear_bit(int nr, void * addr) +{ + ext2_clear_bit(nr + OFF(addr), BASE(addr)); +} + +static inline int reiser4_test_bit(int nr, void * addr) +{ + return ext2_test_bit(nr + OFF(addr), BASE(addr)); +} +static inline int reiser4_find_next_zero_bit(void * addr, int maxoffset, int offset) +{ + int off = OFF(addr); + + return ext2_find_next_zero_bit(BASE(addr), maxoffset + off, offset + off) - off; +} + +#else + +#define reiser4_set_bit(nr, addr) ext2_set_bit(nr, addr) +#define reiser4_clear_bit(nr, addr) ext2_clear_bit(nr, addr) +#define reiser4_test_bit(nr, addr) ext2_test_bit(nr, addr) + +#define reiser4_find_next_zero_bit(addr, maxoffset, offset) \ +ext2_find_next_zero_bit(addr, maxoffset, offset) +#endif + +/* Search for a set bit in the 
bit array [@start_offset, @max_offset[, offsets + * are counted from @addr; return the offset of the first set bit if one is found, + * @max_offset otherwise. The scan works on whole words, so a set bit found in + * the last word may yield an offset beyond @max_offset. */ +static bmap_off_t __reiser4_find_next_set_bit( + void *addr, bmap_off_t max_offset, bmap_off_t start_offset) +{ + ulong_t *base = addr; + /* start_offset is in bits, convert it to a word index within the bitmap. */ + int word_nr = start_offset >> LONG_INT_SHIFT; + /* bit number within the word. */ + int bit_nr = start_offset & LONG_INT_MASK; + int max_word_nr = (max_offset - 1) >> LONG_INT_SHIFT; + + assert("zam-387", max_offset != 0); + + /* Unaligned @start_offset case. */ + if (bit_nr != 0) { + bmap_nr_t nr; + + nr = find_next_zero_bit_in_word(~(base[word_nr]), bit_nr); + + if (nr < BITS_PER_LONG) + return (word_nr << LONG_INT_SHIFT) + nr; + + ++word_nr; + } + + /* Fast scan through aligned words. */ + while (word_nr <= max_word_nr) { + if (base[word_nr] != 0) { + return (word_nr << LONG_INT_SHIFT) + + find_next_zero_bit_in_word(~(base[word_nr]), 0); + } + + ++word_nr; + } + + return max_offset; +} + +#if BITS_PER_LONG == 64 + +static bmap_off_t reiser4_find_next_set_bit( + void *addr, bmap_off_t max_offset, bmap_off_t start_offset) +{ + bmap_off_t off = OFF(addr); + + return __reiser4_find_next_set_bit(BASE(addr), max_offset + off, start_offset + off) - off; +} + +#else +#define reiser4_find_next_set_bit(addr, max_offset, start_offset) \ + __reiser4_find_next_set_bit(addr, max_offset, start_offset) +#endif + +/* search for the last set bit in a single word, scanning down from @start_bit. 
*/ +static int find_last_set_bit_in_word (ulong_t word, int start_bit) +{ + ulong_t bit_mask; + int nr = start_bit; + + assert ("zam-965", start_bit < BITS_PER_LONG); + assert ("zam-966", start_bit >= 0); + + bit_mask = (1UL << nr); + + while (bit_mask != 0) { + if (bit_mask & word) + return nr; + bit_mask >>= 1; + nr --; + } + return BITS_PER_LONG; +} + +/* Search bitmap for a set bit in backward direction from the end to the + * beginning of given region + * + * @result: result offset of the last set bit + * @addr: base memory address, + * @low_off: low end of the search region, edge bit included into the region, + * @high_off: high end of the search region, edge bit included into the region, + * + * @return: 0 - set bit was found, -1 otherwise. + */ +static int +reiser4_find_last_set_bit (bmap_off_t * result, void * addr, bmap_off_t low_off, bmap_off_t high_off) +{ + ulong_t * base = addr; + int last_word; + int first_word; + int last_bit; + int nr; + + assert ("zam-961", high_off >= 0); + assert ("zam-962", high_off >= low_off); + + last_word = high_off >> LONG_INT_SHIFT; + last_bit = high_off & LONG_INT_MASK; + first_word = low_off >> LONG_INT_SHIFT; + + if (last_bit < BITS_PER_LONG) { + nr = find_last_set_bit_in_word(base[last_word], last_bit); + if (nr < BITS_PER_LONG) { + *result = (last_word << LONG_INT_SHIFT) + nr; + return 0; + } + -- last_word; + } + while (last_word >= first_word) { + if (base[last_word] != 0x0) { + last_bit = find_last_set_bit_in_word(base[last_word], BITS_PER_LONG - 1); + assert ("zam-972", last_bit < BITS_PER_LONG); + *result = (last_word << LONG_INT_SHIFT) + last_bit; + return 0; + } + -- last_word; + } + + return -1; /* set bit not found */ +} + +/* Search bitmap for a clear bit in backward direction from the end to the + * beginning of given region */ +static int +reiser4_find_last_zero_bit (bmap_off_t * result, void * addr, bmap_off_t low_off, bmap_off_t high_off) +{ + ulong_t * base = addr; + int last_word; + int first_word; + 
int last_bit; + int nr; + + last_word = high_off >> LONG_INT_SHIFT; + last_bit = high_off & LONG_INT_MASK; + first_word = low_off >> LONG_INT_SHIFT; + + if (last_bit < BITS_PER_LONG) { + nr = find_last_set_bit_in_word(~base[last_word], last_bit); + if (nr < BITS_PER_LONG) { + *result = (last_word << LONG_INT_SHIFT) + nr; + return 0; + } + -- last_word; + } + while (last_word >= first_word) { + if (base[last_word] != (ulong_t)(-1)) { + *result = (last_word << LONG_INT_SHIFT) + + find_last_set_bit_in_word(~base[last_word], BITS_PER_LONG - 1); + return 0; + } + -- last_word; + } + + return -1; /* zero bit not found */ +} + +/* Audited by: green(2002.06.12) */ +static void +reiser4_clear_bits(char *addr, bmap_off_t start, bmap_off_t end) +{ + int first_byte; + int last_byte; + + unsigned char first_byte_mask = 0xFF; + unsigned char last_byte_mask = 0xFF; + + assert("zam-410", start < end); + + first_byte = start >> 3; + last_byte = (end - 1) >> 3; + + if (last_byte > first_byte + 1) + memset(addr + first_byte + 1, 0, (size_t) (last_byte - first_byte - 1)); + + first_byte_mask >>= 8 - (start & 0x7); + last_byte_mask <<= ((end - 1) & 0x7) + 1; + + if (first_byte == last_byte) { + addr[first_byte] &= (first_byte_mask | last_byte_mask); + } else { + addr[first_byte] &= first_byte_mask; + addr[last_byte] &= last_byte_mask; + } +} + +/* Audited by: green(2002.06.12) */ +/* ZAM-FIXME-HANS: comment this */ +static void +reiser4_set_bits(char *addr, bmap_off_t start, bmap_off_t end) +{ + int first_byte; + int last_byte; + + unsigned char first_byte_mask = 0xFF; + unsigned char last_byte_mask = 0xFF; + + assert("zam-386", start < end); + + first_byte = start >> 3; + last_byte = (end - 1) >> 3; + + if (last_byte > first_byte + 1) + memset(addr + first_byte + 1, 0xFF, (size_t) (last_byte - first_byte - 1)); + + first_byte_mask <<= start & 0x7; + last_byte_mask >>= 7 - ((end - 1) & 0x7); + + if (first_byte == last_byte) { + addr[first_byte] |= (first_byte_mask & last_byte_mask); + 
} else { + addr[first_byte] |= first_byte_mask; + addr[last_byte] |= last_byte_mask; + } +} + +#define ADLER_BASE 65521 +#define ADLER_NMAX 5552 + +/* Calculates the adler32 checksum for the data pointed by `data` of the + length `len`. This function was originally taken from zlib, version 1.1.3, + July 9th, 1998. + + Copyright (C) 1995-1998 Jean-loup Gailly and Mark Adler + + This software is provided 'as-is', without any express or implied + warranty. In no event will the authors be held liable for any damages + arising from the use of this software. + + Permission is granted to anyone to use this software for any purpose, + including commercial applications, and to alter it and redistribute it + freely, subject to the following restrictions: + + 1. The origin of this software must not be misrepresented; you must not + claim that you wrote the original software. If you use this software + in a product, an acknowledgment in the product documentation would be + appreciated but is not required. + 2. Altered source versions must be plainly marked as such, and must not be + misrepresented as being the original software. + 3. This notice may not be removed or altered from any source distribution. + + Jean-loup Gailly Mark Adler + jloup@gzip.org madler@alumni.caltech.edu + + The above comment applies only to the adler32 function. +*/ + +static __u32 +adler32(char *data, __u32 len) +{ + unsigned char *t = data; + __u32 s1 = 1; + __u32 s2 = 0; + int k; + + while (len > 0) { + k = len < ADLER_NMAX ? 
len : ADLER_NMAX; + len -= k; + + while (k--) { + s1 += *t++; + s2 += s1; + } + + s1 %= ADLER_BASE; + s2 %= ADLER_BASE; + } + return (s2 << 16) | s1; +} + +#define sb_by_bnode(bnode) \ + ((struct super_block *)jnode_get_tree(bnode->wjnode)->super) + +static __u32 +bnode_calc_crc(const struct bitmap_node *bnode, unsigned long size) +{ + return adler32(bnode_commit_data(bnode), bmap_size(size)); +} + + +static int +bnode_check_adler32(const struct bitmap_node *bnode, unsigned long size) +{ + if (bnode_calc_crc(bnode, size) != bnode_commit_crc (bnode)) { + bmap_nr_t bmap; + + bmap = bnode - get_bnode(sb_by_bnode(bnode), 0); + + warning("vpf-263", + "Checksum for the bitmap block %llu is incorrect",bmap); + + return RETERR(-EIO); + } + + return 0; +} + +#define REISER4_CHECK_BMAP_CRC (0) + +#if REISER4_CHECK_BMAP_CRC +static int +bnode_check_crc(const struct bitmap_node *bnode) +{ + /* pass the block size: bnode_calc_crc() applies bmap_size() itself */ + return bnode_check_adler32(bnode, + sb_by_bnode(bnode)->s_blocksize); +} + +/* REISER4_CHECK_BMAP_CRC */ +#else + +#define bnode_check_crc(bnode) (0) + +/* REISER4_CHECK_BMAP_CRC */ +#endif + +/* Recalculates the adler32 checksum after a change of a single byte. + adler - previous adler checksum + old_data, data - old and new values of the changed byte. + tail == (chunk - offset): the length the checksum was calculated for, minus + the offset of the changed byte within this chunk. + This function can be used to optimise checksum calculation. +*/ + +static __u32 +adler32_recalc(__u32 adler, unsigned char old_data, unsigned char data, __u32 tail) +{ + __u32 delta = data - old_data + 2 * ADLER_BASE; + __u32 s1 = adler & 0xffff; + __u32 s2 = (adler >> 16) & 0xffff; + + s1 = (delta + s1) % ADLER_BASE; + s2 = (delta * tail + s2) % ADLER_BASE; + + return (s2 << 16) | s1; +} + + +#define LIMIT(val, boundary) ((val) > (boundary) ? (boundary) : (val)) + +/* The number of bitmap blocks for a given fs. This number can be stored on disk + or calculated on the fly; it depends on disk format. 
+VS-FIXME-HANS: explain calculation, using device with block count of 8 * 4096 blocks as an example. + FIXME-VS: number of blocks in a filesystem is taken from reiser4 + super private data */ +/* Audited by: green(2002.06.12) */ +static bmap_nr_t +get_nr_bmap(const struct super_block *super) +{ + assert("zam-393", reiser4_block_count(super) != 0); + + return div64_32(reiser4_block_count(super) - 1, bmap_bit_count(super->s_blocksize), NULL) + 1; + +} + +/* calculate bitmap block number and offset within that bitmap block */ +static void +parse_blocknr(const reiser4_block_nr * block, bmap_nr_t * bmap, bmap_off_t * offset) +{ + struct super_block *super = get_current_context()->super; + + *bmap = div64_32(*block, bmap_bit_count(super->s_blocksize), offset); + + assert("zam-433", *bmap < get_nr_bmap(super)); +} + +#if REISER4_DEBUG +/* Audited by: green(2002.06.12) */ +static void +check_block_range(const reiser4_block_nr * start, const reiser4_block_nr * len) +{ + struct super_block *sb = reiser4_get_current_sb(); + + assert("zam-436", sb != NULL); + + assert("zam-455", start != NULL); + assert("zam-437", *start != 0); + assert("zam-541", !blocknr_is_fake(start)); + assert("zam-441", *start < reiser4_block_count(sb)); + + if (len != NULL) { + assert("zam-438", *len != 0); + assert("zam-442", *start + *len <= reiser4_block_count(sb)); + } +} + +static void +check_bnode_loaded(const struct bitmap_node *bnode) +{ + assert("zam-485", bnode != NULL); + assert("zam-483", jnode_page(bnode->wjnode) != NULL); + assert("zam-484", jnode_page(bnode->cjnode) != NULL); + assert("nikita-2820", jnode_is_loaded(bnode->wjnode)); + assert("nikita-2821", jnode_is_loaded(bnode->cjnode)); +} + +#else + +# define check_block_range(start, len) do { /* nothing */} while(0) +# define check_bnode_loaded(bnode) do { /* nothing */} while(0) + +#endif + +/* modify bnode->first_zero_bit (if we free bits before); bnode should be + spin-locked */ +static inline void +adjust_first_zero_bit(struct 
bitmap_node *bnode, bmap_off_t offset) +{ + if (offset < bnode->first_zero_bit) + bnode->first_zero_bit = offset; +} + +/* return a physical disk address for logical bitmap number @bmap */ +/* FIXME-VS: this is somehow related to disk layout? */ +/* ZAM-FIXME-HANS: your answer is? Use not more than one function dereference + * per block allocation so that performance is not affected. Probably this + * whole file should be considered part of the disk layout plugin, and other + * disk layouts can use other defines and efficiency will not be significantly + * affected. */ + +#define REISER4_FIRST_BITMAP_BLOCK \ + ((REISER4_MASTER_OFFSET / PAGE_CACHE_SIZE) + 2) + +/* Audited by: green(2002.06.12) */ +static void +get_bitmap_blocknr(struct super_block *super, bmap_nr_t bmap, reiser4_block_nr * bnr) +{ + + assert("zam-390", bmap < get_nr_bmap(super)); + +#ifdef CONFIG_REISER4_BADBLOCKS +#define BITMAP_PLUGIN_DISKMAP_ID ((0xc0e1<<16) | (0xe0ff)) + /* Check if the diskmap have this already, first. 
*/ + if ( reiser4_get_diskmap_value( BITMAP_PLUGIN_DISKMAP_ID, bmap, bnr) == 0 ) + return; /* Found it in diskmap */ +#endif + /* FIXME_ZAM: pending a discussion of disk layouts and disk format + plugins, I implement a bitmap location scheme which is close to the scheme + used in reiser 3.6 */ + if (bmap == 0) { + *bnr = REISER4_FIRST_BITMAP_BLOCK; + } else { + *bnr = bmap * bmap_bit_count(super->s_blocksize); + } +} + +/* construct a fake block number for shadow bitmap (WORKING BITMAP) block */ +/* Audited by: green(2002.06.12) */ +static void +get_working_bitmap_blocknr(bmap_nr_t bmap, reiser4_block_nr * bnr) +{ + *bnr = (reiser4_block_nr) ((bmap & ~REISER4_BLOCKNR_STATUS_BIT_MASK) | REISER4_BITMAP_BLOCKS_STATUS_VALUE); +} + +/* bnode structure initialization */ +static void +init_bnode(struct bitmap_node *bnode, + struct super_block *super UNUSED_ARG, bmap_nr_t bmap UNUSED_ARG) +{ + memset(bnode, 0, sizeof (struct bitmap_node)); + + sema_init(&bnode->sema, 1); + atomic_set(&bnode->loaded, 0); +} + +static void +release(jnode *node) +{ + jrelse(node); + JF_SET(node, JNODE_HEARD_BANSHEE); + jput(node); +} + +/* This function is for internal bitmap.c use because it assumes that the jnode is + under full control of this thread */ +static void +done_bnode(struct bitmap_node *bnode) +{ + if (bnode) { + atomic_set(&bnode->loaded, 0); + if (bnode->wjnode != NULL) + release(bnode->wjnode); + if (bnode->cjnode != NULL) + release(bnode->cjnode); + bnode->wjnode = bnode->cjnode = NULL; + } +} + +/* ZAM-FIXME-HANS: comment this. 
Called only by load_and_lock_bnode()*/ +static int +prepare_bnode(struct bitmap_node *bnode, jnode **cjnode_ret, jnode **wjnode_ret) +{ + struct super_block *super; + jnode *cjnode; + jnode *wjnode; + bmap_nr_t bmap; + int ret; + + super = reiser4_get_current_sb(); + + *wjnode_ret = wjnode = bnew(); + if (wjnode == NULL) + return RETERR(-ENOMEM); + + *cjnode_ret = cjnode = bnew(); + if (cjnode == NULL) + return RETERR(-ENOMEM); + + bmap = bnode - get_bnode(super, 0); + + get_working_bitmap_blocknr(bmap, &wjnode->blocknr); + get_bitmap_blocknr(super, bmap, &cjnode->blocknr); + + jref(cjnode); + jref(wjnode); + + /* load commit bitmap */ + ret = jload_gfp(cjnode, GFP_NOFS, 1); + + if (ret) + goto error; + + /* allocate memory for working bitmap block. Note that for + * bitmaps jinit_new() doesn't actually modifies node content, + * so parallel calls to this are ok. */ + ret = jinit_new(wjnode, GFP_NOFS); + + if (ret != 0) { + jrelse(cjnode); + goto error; + } + + return 0; + + error: + jput(cjnode); + jput(wjnode); + *wjnode_ret = *cjnode_ret = NULL; + return ret; + +} + +/* Check the bnode data on read. */ +static int check_struct_bnode(struct bitmap_node *bnode, __u32 blksize) { + void *data; + int ret; + + /* Check CRC */ + ret = bnode_check_adler32(bnode, blksize); + + if (ret) { + return ret; + } + + data = jdata(bnode->cjnode) + CHECKSUM_SIZE; + + /* Check the very first bit -- it must be busy. */ + if (!reiser4_test_bit(0, data)) { + warning("vpf-1362", "The allocator block %llu is not marked " + "as used.", (unsigned long long)bnode->cjnode->blocknr); + + return -EINVAL; + } + + return 0; +} + +/* load bitmap blocks "on-demand" */ +static int +load_and_lock_bnode(struct bitmap_node *bnode) +{ + int ret; + + jnode *cjnode; + jnode *wjnode; + + assert("nikita-3040", schedulable()); + +/* ZAM-FIXME-HANS: since bitmaps are never unloaded, this does not + * need to be atomic, right? 
Just leave a comment that if bitmaps were + * unloadable, this would need to be atomic. */ + if (atomic_read(&bnode->loaded)) { + /* bitmap is already loaded, nothing to do */ + check_bnode_loaded(bnode); + down(&bnode->sema); + assert("nikita-2827", atomic_read(&bnode->loaded)); + return 0; + } + + ret = prepare_bnode(bnode, &cjnode, &wjnode); + if (ret == 0) { + down(&bnode->sema); + + if (!atomic_read(&bnode->loaded)) { + assert("nikita-2822", cjnode != NULL); + assert("nikita-2823", wjnode != NULL); + assert("nikita-2824", jnode_is_loaded(cjnode)); + assert("nikita-2825", jnode_is_loaded(wjnode)); + + bnode->wjnode = wjnode; + bnode->cjnode = cjnode; + + ret = check_struct_bnode(bnode, current_blocksize); + if (!ret) { + cjnode = wjnode = NULL; + atomic_set(&bnode->loaded, 1); + /* working bitmap is initialized by on-disk + * commit bitmap. This should be performed + * under semaphore. */ + memcpy(bnode_working_data(bnode), + bnode_commit_data(bnode), + bmap_size(current_blocksize)); + } else { + up(&bnode->sema); + } + } else + /* race: someone already loaded bitmap while we were + * busy initializing data. */ + check_bnode_loaded(bnode); + } + + if (wjnode != NULL) + release(wjnode); + if (cjnode != NULL) + release(cjnode); + + return ret; +} + +static void +release_and_unlock_bnode(struct bitmap_node *bnode) +{ + check_bnode_loaded(bnode); + up(&bnode->sema); +} + +/* This function does all block allocation work but only for one bitmap + block.*/ +/* FIXME_ZAM: It does not allow us to allocate block ranges across bitmap + block responsibility zone boundaries. This had no sense in v3.6 but may + have it in v4.x */ +/* ZAM-FIXME-HANS: do you mean search one bitmap block forward? 
*/ +static int +search_one_bitmap_forward(bmap_nr_t bmap, bmap_off_t * offset, bmap_off_t max_offset, + int min_len, int max_len) +{ + struct super_block *super = get_current_context()->super; + struct bitmap_node *bnode = get_bnode(super, bmap); + + char *data; + + bmap_off_t search_end; + bmap_off_t start; + bmap_off_t end; + + int set_first_zero_bit = 0; + + int ret; + + assert("zam-364", min_len > 0); + assert("zam-365", max_len >= min_len); + assert("zam-366", *offset < max_offset); + + ret = load_and_lock_bnode(bnode); + + if (ret) + return ret; + + data = bnode_working_data(bnode); + + start = *offset; + + if (bnode->first_zero_bit >= start) { + start = bnode->first_zero_bit; + set_first_zero_bit = 1; + } + + while (start + min_len < max_offset) { + + start = reiser4_find_next_zero_bit((long *) data, max_offset, start); + if (set_first_zero_bit) { + bnode->first_zero_bit = start; + set_first_zero_bit = 0; + } + if (start >= max_offset) + break; + + search_end = LIMIT(start + max_len, max_offset); + end = reiser4_find_next_set_bit((long *) data, search_end, start); + if (end >= start + min_len) { + /* we can't trust the find_next_set_bit result if a set bit + was not found; the result may be bigger than + max_offset */ + if (end > search_end) + end = search_end; + + ret = end - start; + *offset = start; + + reiser4_set_bits(data, start, end); + + /* FIXME: we may advance first_zero_bit if [start, + end] region overlaps the first_zero_bit point */ + + break; + } + + start = end + 1; + } + + release_and_unlock_bnode(bnode); + + return ret; +} + +static int +search_one_bitmap_backward (bmap_nr_t bmap, bmap_off_t * start_offset, bmap_off_t end_offset, + int min_len, int max_len) +{ + struct super_block *super = get_current_context()->super; + struct bitmap_node *bnode = get_bnode(super, bmap); + char *data; + bmap_off_t start; + int ret; + + assert("zam-958", min_len > 0); + assert("zam-959", max_len >= min_len); + assert("zam-960", *start_offset >= end_offset); + + ret = 
load_and_lock_bnode(bnode); + if (ret) + return ret; + + data = bnode_working_data(bnode); + start = *start_offset; + + while (1) { + bmap_off_t end, search_end; + + /* Find the beginning of the zero filled region */ + if (reiser4_find_last_zero_bit(&start, data, end_offset, start)) + break; + /* Is there more than `min_len' bits from `start' to + * `end_offset'? */ + if (start < end_offset + min_len - 1) + break; + + /* Do not search to `end_offset' if we need to find less than + * `max_len' zero bits. */ + if (end_offset + max_len - 1 < start) + search_end = start - max_len + 1; + else + search_end = end_offset; + + if (reiser4_find_last_set_bit(&end, data, search_end, start)) + end = search_end; + else + end ++; + + if (end + min_len <= start + 1) { + if (end < search_end) + end = search_end; + ret = start - end + 1; + *start_offset = end; /* `end' is lowest offset */ + assert ("zam-987", reiser4_find_next_set_bit(data, start + 1, end) >= start + 1); + reiser4_set_bits(data, end, start + 1); + break; + } + + if (end <= end_offset) + /* left search boundary reached. 
*/ + break; + start = end - 1; + } + + release_and_unlock_bnode(bnode); + return ret; +} + +/* allocate contiguous range of blocks in bitmap */ +static int bitmap_alloc_forward(reiser4_block_nr * start, const reiser4_block_nr * end, + int min_len, int max_len) +{ + bmap_nr_t bmap, end_bmap; + bmap_off_t offset, end_offset; + int len; + + reiser4_block_nr tmp; + + struct super_block *super = get_current_context()->super; + const bmap_off_t max_offset = bmap_bit_count(super->s_blocksize); + + parse_blocknr(start, &bmap, &offset); + + tmp = *end - 1; + parse_blocknr(&tmp, &end_bmap, &end_offset); + ++end_offset; + + assert("zam-358", end_bmap >= bmap); + assert("zam-359", ergo(end_bmap == bmap, end_offset > offset)); + + for (; bmap < end_bmap; bmap++, offset = 0) { + len = search_one_bitmap_forward(bmap, &offset, max_offset, min_len, max_len); + if (len != 0) + goto out; + } + + len = search_one_bitmap_forward(bmap, &offset, end_offset, min_len, max_len); +out: + *start = bmap * max_offset + offset; + return len; +} + +/* allocate contiguous range of blocks in bitmap (from @start to @end in + * backward direction) */ +static int bitmap_alloc_backward(reiser4_block_nr * start, const reiser4_block_nr * end, + int min_len, int max_len) +{ + bmap_nr_t bmap, end_bmap; + bmap_off_t offset, end_offset; + int len; + struct super_block *super = get_current_context()->super; + const bmap_off_t max_offset = bmap_bit_count(super->s_blocksize); + + parse_blocknr(start, &bmap, &offset); + parse_blocknr(end, &end_bmap, &end_offset); + + assert("zam-961", end_bmap <= bmap); + assert("zam-962", ergo(end_bmap == bmap, end_offset <= offset)); + + for (; bmap > end_bmap; bmap --, offset = max_offset - 1) { + len = search_one_bitmap_backward(bmap, &offset, 0, min_len, max_len); + if (len != 0) + goto out; + } + + len = search_one_bitmap_backward(bmap, &offset, end_offset, min_len, max_len); + out: + *start = bmap * max_offset + offset; + return len; +} + +/* 
plugin->u.space_allocator.alloc_blocks() */ +static int +alloc_blocks_forward(reiser4_blocknr_hint * hint, int needed, + reiser4_block_nr * start, reiser4_block_nr * len) +{ + struct super_block *super = get_current_context()->super; + int actual_len; + + reiser4_block_nr search_start; + reiser4_block_nr search_end; + + assert("zam-398", super != NULL); + assert("zam-412", hint != NULL); + assert("zam-397", hint->blk < reiser4_block_count(super)); + + if (hint->max_dist == 0) + search_end = reiser4_block_count(super); + else + search_end = LIMIT(hint->blk + hint->max_dist, reiser4_block_count(super)); + + /* We use @hint -> blk as a search start and search from it to the end + of the disk or in given region if @hint -> max_dist is not zero */ + search_start = hint->blk; + + actual_len = bitmap_alloc_forward(&search_start, &search_end, 1, needed); + + /* There is only one bitmap search if max_dist was specified or first + pass was from the beginning of the bitmap. We also do one pass for + scanning bitmap in backward direction. 
*/ + if (!(actual_len != 0 || hint->max_dist != 0 || search_start == 0)) { + /* next step is a scanning from 0 to search_start */ + search_end = search_start; + search_start = 0; + actual_len = bitmap_alloc_forward(&search_start, &search_end, 1, needed); + } + if (actual_len == 0) + return RETERR(-ENOSPC); + if (actual_len < 0) + return RETERR(actual_len); + *len = actual_len; + *start = search_start; + return 0; +} + +static int alloc_blocks_backward (reiser4_blocknr_hint * hint, int needed, + reiser4_block_nr * start, reiser4_block_nr * len) +{ + reiser4_block_nr search_start; + reiser4_block_nr search_end; + int actual_len; + + ON_DEBUG(struct super_block * super = reiser4_get_current_sb()); + + assert ("zam-969", super != NULL); + assert ("zam-970", hint != NULL); + assert ("zam-971", hint->blk < reiser4_block_count(super)); + + search_start = hint->blk; + if (hint->max_dist == 0 || search_start <= hint->max_dist) + search_end = 0; + else + search_end = search_start - hint->max_dist; + + actual_len = bitmap_alloc_backward(&search_start, &search_end, 1, needed); + if (actual_len == 0) + return RETERR(-ENOSPC); + if (actual_len < 0) + return RETERR(actual_len); + *len = actual_len; + *start = search_start; + return 0; +} + +/* plugin->u.space_allocator.alloc_blocks() */ +reiser4_internal int +alloc_blocks_bitmap(reiser4_space_allocator * allocator UNUSED_ARG, + reiser4_blocknr_hint * hint, int needed, + reiser4_block_nr * start, reiser4_block_nr * len) +{ + if (hint->backward) + return alloc_blocks_backward(hint, needed, start, len); + return alloc_blocks_forward(hint, needed, start, len); +} + +/* plugin->u.space_allocator.dealloc_blocks(). */ +/* It just frees blocks in WORKING BITMAP. Usually deletion of formatted and unformatted + nodes is deferred until transaction commit. 
However, deallocation + of temporary objects like wandered blocks and transaction commit records + requires immediate node deletion from WORKING BITMAP.*/ +reiser4_internal void +dealloc_blocks_bitmap(reiser4_space_allocator * allocator UNUSED_ARG, reiser4_block_nr start, reiser4_block_nr len) +{ + struct super_block *super = reiser4_get_current_sb(); + + bmap_nr_t bmap; + bmap_off_t offset; + + struct bitmap_node *bnode; + int ret; + + assert("zam-468", len != 0); + check_block_range(&start, &len); + + parse_blocknr(&start, &bmap, &offset); + + assert("zam-469", offset + len <= bmap_bit_count(super->s_blocksize)); + + bnode = get_bnode(super, bmap); + + assert("zam-470", bnode != NULL); + + ret = load_and_lock_bnode(bnode); + assert("zam-481", ret == 0); + + reiser4_clear_bits(bnode_working_data(bnode), offset, (bmap_off_t) (offset + len)); + + adjust_first_zero_bit(bnode, offset); + + release_and_unlock_bnode(bnode); +} + + +/* plugin->u.space_allocator.check_blocks(). */ +reiser4_internal void +check_blocks_bitmap(const reiser4_block_nr * start, const reiser4_block_nr * len, int desired) +{ +#if REISER4_DEBUG + struct super_block *super = reiser4_get_current_sb(); + + bmap_nr_t bmap; + bmap_off_t start_offset; + bmap_off_t end_offset; + + struct bitmap_node *bnode; + int ret; + + assert("zam-622", len != NULL); + check_block_range(start, len); + parse_blocknr(start, &bmap, &start_offset); + + end_offset = start_offset + *len; + assert("nikita-2214", end_offset <= bmap_bit_count(super->s_blocksize)); + + bnode = get_bnode(super, bmap); + + assert("nikita-2215", bnode != NULL); + + ret = load_and_lock_bnode(bnode); + assert("zam-626", ret == 0); + + assert("nikita-2216", jnode_is_loaded(bnode->wjnode)); + + if (desired) { + assert("zam-623", reiser4_find_next_zero_bit(bnode_working_data(bnode), end_offset, start_offset) + >= end_offset); + } else { + assert("zam-624", reiser4_find_next_set_bit(bnode_working_data(bnode), end_offset, start_offset) + >= end_offset); 
+ } + + release_and_unlock_bnode(bnode); +#endif +} + +/* conditional insertion of @node into atom's overwrite set if it was not there */ +static void +cond_add_to_overwrite_set (txn_atom * atom, jnode * node) +{ + assert("zam-546", atom != NULL); + assert("zam-547", atom->stage == ASTAGE_PRE_COMMIT); + assert("zam-548", node != NULL); + + LOCK_ATOM(atom); + LOCK_JNODE(node); + + if (node->atom == NULL) { + JF_SET(node, JNODE_OVRWR); + insert_into_atom_ovrwr_list(atom, node); + } else { + assert("zam-549", node->atom == atom); + } + + UNLOCK_JNODE(node); + UNLOCK_ATOM(atom); +} + +/* an actor which applies delete set to COMMIT bitmap pages and link modified + pages in a single-linked list */ +static int +apply_dset_to_commit_bmap(txn_atom * atom, const reiser4_block_nr * start, const reiser4_block_nr * len, void *data) +{ + + bmap_nr_t bmap; + bmap_off_t offset; + int ret; + + long long *blocks_freed_p = data; + + struct bitmap_node *bnode; + + struct super_block *sb = reiser4_get_current_sb(); + + check_block_range(start, len); + + parse_blocknr(start, &bmap, &offset); + + /* FIXME-ZAM: we assume that all block ranges are allocated by this + bitmap-based allocator and each block range can't go over a zone of + responsibility of one bitmap block; same assumption is used in + other journal hooks in bitmap code. 
*/ + bnode = get_bnode(sb, bmap); + assert("zam-448", bnode != NULL); + + /* it is safe to unlock an atom which is in ASTAGE_PRE_COMMIT */ + assert ("zam-767", atom->stage == ASTAGE_PRE_COMMIT); + ret = load_and_lock_bnode(bnode); + if (ret) + return ret; + + /* put bnode into atom's overwrite set */ + cond_add_to_overwrite_set (atom, bnode->cjnode); + + data = bnode_commit_data(bnode); + + ret = bnode_check_crc(bnode); + if (ret != 0) { + /* do not leave the bnode locked on the error path */ + release_and_unlock_bnode(bnode); + return ret; + } + + if (len != NULL) { + /* FIXME-ZAM: a check that all bits are set should be there */ + assert("zam-443", offset + *len <= bmap_bit_count(sb->s_blocksize)); + reiser4_clear_bits(data, offset, (bmap_off_t) (offset + *len)); + + (*blocks_freed_p) += *len; + } else { + reiser4_clear_bit(offset, data); + (*blocks_freed_p)++; + } + + bnode_set_commit_crc(bnode, bnode_calc_crc(bnode, sb->s_blocksize)); + + release_and_unlock_bnode(bnode); + + return 0; +} + +/* plugin->u.space_allocator.pre_commit_hook(). */ +/* It just applies transaction changes to fs-wide COMMIT BITMAP, hoping the + rest is done by transaction manager (allocate wandered locations for COMMIT + BITMAP blocks, copy COMMIT BITMAP blocks data). */ +/* Only one instance of this function can be running at any given time, because + only one transaction can be committed at a time, therefore it is safe to access + some global variables without any locking */ + +#if REISER4_COPY_ON_CAPTURE + +extern spinlock_t scan_lock; + +reiser4_internal int +pre_commit_hook_bitmap(void) +{ + struct super_block * super = reiser4_get_current_sb(); + txn_atom *atom; + + long long blocks_freed = 0; + + atom = get_current_atom_locked (); + BUG_ON(atom->stage != ASTAGE_PRE_COMMIT); + assert ("zam-876", atom->stage == ASTAGE_PRE_COMMIT); + spin_unlock_atom(atom); + + + + { /* scan atom's captured list and find all freshly allocated nodes, + * mark corresponding bits in COMMIT BITMAP as used */ + /* how cpu significant is this scan, should we someday have a freshly_allocated list? 
-Hans */ + capture_list_head *head = ATOM_CLEAN_LIST(atom); + jnode *node; + + spin_lock(&scan_lock); + node = capture_list_front(head); + + while (!capture_list_end(head, node)) { + int ret; + + assert("vs-1445", NODE_LIST(node) == CLEAN_LIST); + BUG_ON(node->atom != atom); + JF_SET(node, JNODE_SCANNED); + spin_unlock(&scan_lock); + + /* we detect freshly allocated jnodes */ + if (JF_ISSET(node, JNODE_RELOC)) { + bmap_nr_t bmap; + + bmap_off_t offset; + bmap_off_t index; + struct bitmap_node *bn; + __u32 size = bmap_size(super->s_blocksize); + char byte; + __u32 crc; + + assert("zam-559", !JF_ISSET(node, JNODE_OVRWR)); + assert("zam-460", !blocknr_is_fake(&node->blocknr)); + + parse_blocknr(&node->blocknr, &bmap, &offset); + bn = get_bnode(super, bmap); + + index = offset >> 3; + assert("vpf-276", index < size); + + ret = bnode_check_crc(bn); + if (ret != 0) + return ret; + + check_bnode_loaded(bn); + load_and_lock_bnode(bn); + + byte = *(bnode_commit_data(bn) + index); + reiser4_set_bit(offset, bnode_commit_data(bn)); + + crc = adler32_recalc(bnode_commit_crc(bn), byte, + *(bnode_commit_data(bn) + + index), + size - index); + + bnode_set_commit_crc(bn, crc); + + release_and_unlock_bnode(bn); + + ret = bnode_check_crc(bn); + if (ret != 0) + return ret; + + /* working of this depends on how it inserts + new j-node into clean list, because we are + scanning the same list now. 
It is OK, if + insertion is done to the list front */ + cond_add_to_overwrite_set (atom, bn->cjnode); + } + + spin_lock(&scan_lock); + JF_CLR(node, JNODE_SCANNED); + node = capture_list_next(node); + } + spin_unlock(&scan_lock); + } + + blocknr_set_iterator(atom, &atom->delete_set, apply_dset_to_commit_bmap, &blocks_freed, 0); + + blocks_freed -= atom->nr_blocks_allocated; + + { + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(super); + + reiser4_spin_lock_sb(sbinfo); + sbinfo->blocks_free_committed += blocks_freed; + reiser4_spin_unlock_sb(sbinfo); + } + + return 0; +} + +#else /* ! REISER4_COPY_ON_CAPTURE */ + +reiser4_internal int +pre_commit_hook_bitmap(void) +{ + struct super_block * super = reiser4_get_current_sb(); + txn_atom *atom; + + long long blocks_freed = 0; + + atom = get_current_atom_locked (); + assert ("zam-876", atom->stage == ASTAGE_PRE_COMMIT); + spin_unlock_atom(atom); + + { /* scan atom's captured list and find all freshly allocated nodes, + * mark corresponded bits in COMMIT BITMAP as used */ + capture_list_head *head = ATOM_CLEAN_LIST(atom); + jnode *node = capture_list_front(head); + + while (!capture_list_end(head, node)) { + /* we detect freshly allocated jnodes */ + if (JF_ISSET(node, JNODE_RELOC)) { + int ret; + bmap_nr_t bmap; + + bmap_off_t offset; + bmap_off_t index; + struct bitmap_node *bn; + __u32 size = bmap_size(super->s_blocksize); + __u32 crc; + char byte; + + assert("zam-559", !JF_ISSET(node, JNODE_OVRWR)); + assert("zam-460", !blocknr_is_fake(&node->blocknr)); + + parse_blocknr(&node->blocknr, &bmap, &offset); + bn = get_bnode(super, bmap); + + index = offset >> 3; + assert("vpf-276", index < size); + + ret = bnode_check_crc(bn); + if (ret != 0) + return ret; + + check_bnode_loaded(bn); + load_and_lock_bnode(bn); + + byte = *(bnode_commit_data(bn) + index); + reiser4_set_bit(offset, bnode_commit_data(bn)); + + crc = adler32_recalc(bnode_commit_crc(bn), byte, + *(bnode_commit_data(bn) + + index), + size - 
index), + + bnode_set_commit_crc(bn, crc); + + release_and_unlock_bnode(bn); + + ret = bnode_check_crc(bn); + if (ret != 0) + return ret; + + /* working of this depends on how it inserts + new j-node into clean list, because we are + scanning the same list now. It is OK, if + insertion is done to the list front */ + cond_add_to_overwrite_set (atom, bn->cjnode); + } + + node = capture_list_next(node); + } + } + + blocknr_set_iterator(atom, &atom->delete_set, apply_dset_to_commit_bmap, &blocks_freed, 0); + + blocks_freed -= atom->nr_blocks_allocated; + + { + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(super); + + reiser4_spin_lock_sb(sbinfo); + sbinfo->blocks_free_committed += blocks_freed; + reiser4_spin_unlock_sb(sbinfo); + } + + return 0; +} +#endif /* ! REISER4_COPY_ON_CAPTURE */ + +/* plugin->u.space_allocator.init_allocator + constructor of reiser4_space_allocator object. It is called on fs mount */ +reiser4_internal int +init_allocator_bitmap(reiser4_space_allocator * allocator, struct super_block *super, void *arg UNUSED_ARG) +{ + struct bitmap_allocator_data *data = NULL; + bmap_nr_t bitmap_blocks_nr; + bmap_nr_t i; + + assert("nikita-3039", schedulable()); + + /* getting memory for bitmap allocator private data holder */ + data = reiser4_kmalloc(sizeof (struct bitmap_allocator_data), GFP_KERNEL); + + if (data == NULL) + return RETERR(-ENOMEM); + + /* allocation and initialization for the array of bnodes */ + bitmap_blocks_nr = get_nr_bmap(super); + + /* FIXME-ZAM: it is not clear what to do with huge number of bitmaps + which is bigger than 2^32 (= 8 * 4096 * 4096 * 2^32 bytes = 5.76e+17, + may I never meet someone who still uses the ia32 architecture when + storage devices of that size enter the market, and wants to use ia32 + with that storage device, much less reiser4. ;-) -Hans). Kmalloc is not possible and, + probably, another dynamic data structure should replace a static + array of bnodes. 
*/ + /*data->bitmap = reiser4_kmalloc((size_t) (sizeof (struct bitmap_node) * bitmap_blocks_nr), GFP_KERNEL);*/ + data->bitmap = vmalloc(sizeof (struct bitmap_node) * bitmap_blocks_nr); + if (data->bitmap == NULL) { + reiser4_kfree(data); + return RETERR(-ENOMEM); + } + + for (i = 0; i < bitmap_blocks_nr; i++) + init_bnode(data->bitmap + i, super, i); + + allocator->u.generic = data; + +#if REISER4_DEBUG + get_super_private(super)->min_blocks_used += bitmap_blocks_nr; +#endif + + /* Load all bitmap blocks at mount time. */ + if (!test_bit(REISER4_DONT_LOAD_BITMAP, &get_super_private(super)->fs_flags)) { + __u64 start_time, elapsed_time; + struct bitmap_node * bnode; + int ret; + + if (REISER4_DEBUG) + printk(KERN_INFO "loading reiser4 bitmap..."); + start_time = jiffies; + + for (i = 0; i < bitmap_blocks_nr; i++) { + bnode = data->bitmap + i; + ret = load_and_lock_bnode(bnode); + if (ret) { + destroy_allocator_bitmap(allocator, super); + return ret; + } + release_and_unlock_bnode(bnode); + } + + elapsed_time = jiffies - start_time; + if (REISER4_DEBUG) + printk("...done (%llu jiffies)\n", + (unsigned long long)elapsed_time); + } + + return 0; +} + +/* plugin->u.space_allocator.destroy_allocator + destructor. 
It is called on fs unmount */ +reiser4_internal int +destroy_allocator_bitmap(reiser4_space_allocator * allocator, struct super_block *super) +{ + bmap_nr_t bitmap_blocks_nr; + bmap_nr_t i; + + struct bitmap_allocator_data *data = allocator->u.generic; + + assert("zam-414", data != NULL); + assert("zam-376", data->bitmap != NULL); + + bitmap_blocks_nr = get_nr_bmap(super); + + for (i = 0; i < bitmap_blocks_nr; i++) { + struct bitmap_node *bnode = data->bitmap + i; + + down(&bnode->sema); + +#if REISER4_DEBUG + if (atomic_read(&bnode->loaded)) { + jnode *wj = bnode->wjnode; + jnode *cj = bnode->cjnode; + + assert("zam-480", jnode_page(cj) != NULL); + assert("zam-633", jnode_page(wj) != NULL); + + assert("zam-634", + memcmp(jdata(wj), jdata(cj), + bmap_size(super->s_blocksize)) == 0); + + } +#endif + done_bnode(bnode); + up(&bnode->sema); + } + + /*reiser4_kfree(data->bitmap);*/ + vfree(data->bitmap); + reiser4_kfree(data); + + allocator->u.generic = NULL; + + return 0; +} + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 80 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/space/bitmap.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/space/bitmap.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,41 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#if !defined (__REISER4_PLUGIN_SPACE_BITMAP_H__) +#define __REISER4_PLUGIN_SPACE_BITMAP_H__ + +#include "../../dformat.h" +#include "../../block_alloc.h" + +#include /* for __u?? */ +#include /* for struct super_block */ +/* EDWARD-FIXME-HANS: write something as informative as the below for every .h file lacking it. */ +/* declarations of functions implementing methods of space allocator plugin for + bitmap based allocator. 
The functions themselves are in bitmap.c */ +extern int init_allocator_bitmap(reiser4_space_allocator *, struct super_block *, void *); +extern int destroy_allocator_bitmap(reiser4_space_allocator *, struct super_block *); +extern int alloc_blocks_bitmap(reiser4_space_allocator *, + reiser4_blocknr_hint *, int needed, reiser4_block_nr * start, reiser4_block_nr * len); +extern void check_blocks_bitmap(const reiser4_block_nr *, const reiser4_block_nr *, int); + +extern void dealloc_blocks_bitmap(reiser4_space_allocator *, reiser4_block_nr, reiser4_block_nr); +extern int pre_commit_hook_bitmap(void); + +#define post_commit_hook_bitmap() do{}while(0) +#define post_write_back_hook_bitmap() do{}while(0) +#define print_info_bitmap(pref, al) do{}while(0) + +typedef __u64 bmap_nr_t; +typedef __u32 bmap_off_t; + +#endif /* __REISER4_PLUGIN_SPACE_BITMAP_H__ */ + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/space/space_allocator.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/space/space_allocator.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,80 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#ifndef __SPACE_ALLOCATOR_H__ +#define __SPACE_ALLOCATOR_H__ + +#include "../../forward.h" +#include "bitmap.h" +/* NIKITA-FIXME-HANS: surely this could use a comment. Something about how bitmap is the only space allocator for now, + * but... 
*/ +#define DEF_SPACE_ALLOCATOR(allocator) \ + \ +static inline int sa_init_allocator (reiser4_space_allocator * al, struct super_block *s, void * opaque) \ +{ \ + return init_allocator_##allocator (al, s, opaque); \ +} \ + \ +static inline void sa_destroy_allocator (reiser4_space_allocator *al, struct super_block *s) \ +{ \ + destroy_allocator_##allocator (al, s); \ +} \ + \ +static inline int sa_alloc_blocks (reiser4_space_allocator *al, reiser4_blocknr_hint * hint, \ + int needed, reiser4_block_nr * start, reiser4_block_nr * len) \ +{ \ + return alloc_blocks_##allocator (al, hint, needed, start, len); \ +} \ +static inline void sa_dealloc_blocks (reiser4_space_allocator * al, reiser4_block_nr start, reiser4_block_nr len) \ +{ \ + dealloc_blocks_##allocator (al, start, len); \ +} \ + \ +static inline void sa_check_blocks (const reiser4_block_nr * start, const reiser4_block_nr * end, int desired) \ +{ \ + check_blocks_##allocator (start, end, desired); \ +} \ + \ +static inline void sa_pre_commit_hook (void) \ +{ \ + pre_commit_hook_##allocator (); \ +} \ + \ +static inline void sa_post_commit_hook (void) \ +{ \ + post_commit_hook_##allocator (); \ +} \ + \ +static inline void sa_post_write_back_hook (void) \ +{ \ + post_write_back_hook_##allocator(); \ +} \ + \ +static inline void sa_print_info(const char * prefix, reiser4_space_allocator * al) \ +{ \ + print_info_##allocator (prefix, al); \ +} + +DEF_SPACE_ALLOCATOR(bitmap) + +/* this object is part of reiser4 private in-core super block */ +struct reiser4_space_allocator { + union { + /* space allocators might use this pointer to reference their + * data. */ + void *generic; + } u; +}; + +/* __SPACE_ALLOCATOR_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/symlink.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/symlink.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,85 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#include "../forward.h" +#include "../debug.h" +#include "item/static_stat.h" +#include "plugin.h" +#include "../tree.h" +#include "../vfs_ops.h" +#include "../inode.h" +#include "object.h" + +#include +#include /* for struct inode */ + +/* symlink plugin's specific functions */ + +reiser4_internal int +create_symlink(struct inode *symlink, /* inode of symlink */ + struct inode *dir UNUSED_ARG, /* parent directory */ + reiser4_object_create_data * data /* info passed + * to us, this + * is filled by + * reiser4() + * syscall in + * particular */ ) +{ + int result; + + assert("nikita-680", symlink != NULL); + assert("nikita-681", S_ISLNK(symlink->i_mode)); + assert("nikita-685", inode_get_flag(symlink, REISER4_NO_SD)); + assert("nikita-682", dir != NULL); + assert("nikita-684", data != NULL); + assert("nikita-686", data->id == SYMLINK_FILE_PLUGIN_ID); + + /* + * stat data of symlink has symlink extension in which we store + * symlink content, that is, path symlink is pointing to. 
+ */ + reiser4_inode_data(symlink)->extmask |= (1 << SYMLINK_STAT); + + assert("vs-838", symlink->u.generic_ip == 0); + symlink->u.generic_ip = (void *) data->name; + + assert("vs-843", symlink->i_size == 0); + INODE_SET_FIELD(symlink, i_size, strlen(data->name)); + + /* insert stat data appended with data->name */ + result = write_sd_by_inode_common(symlink); + if (result) { + /* FIXME-VS: Make sure that symlink->u.generic_ip is not attached + to kmalloced data */ + INODE_SET_FIELD(symlink, i_size, 0); + } else { + assert("vs-849", symlink->u.generic_ip && inode_get_flag(symlink, REISER4_GENERIC_PTR_USED)); + assert("vs-850", !memcmp((char *) symlink->u.generic_ip, data->name, (size_t) symlink->i_size + 1)); + } + return result; +} + +/* plugin->destroy_inode() */ +reiser4_internal void +destroy_inode_symlink(struct inode * inode) +{ + assert("edward-799", inode_file_plugin(inode) == file_plugin_by_id(SYMLINK_FILE_PLUGIN_ID)); + assert("edward-800", !is_bad_inode(inode) && is_inode_loaded(inode)); + assert("edward-801", inode_get_flag(inode, REISER4_GENERIC_PTR_USED)); + assert("vs-839", S_ISLNK(inode->i_mode)); + + reiser4_kfree_in_sb(inode->u.generic_ip, inode->i_sb); + inode->u.generic_ip = 0; + inode_clr_flag(inode, REISER4_GENERIC_PTR_USED); +} + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ + diff -puN /dev/null fs/reiser4/plugin/symlink.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/symlink.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,24 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#if !defined( __REISER4_SYMLINK_H__ ) +#define __REISER4_SYMLINK_H__ + +#include "../forward.h" +#include /* for struct inode */ + +int create_symlink(struct inode *symlink, struct inode *dir, reiser4_object_create_data * data); +void destroy_inode_symlink(struct inode * inode); + +/* __REISER4_SYMLINK_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/plugin/tail_policy.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/plugin/tail_policy.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,109 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Tail policy plugins */ + +/* Tail policy is used by object plugin (of regular file) to convert file + between two representations. TO BE CONTINUED. +NIKITA-FIXME-HANS: the "TO BE CONTINUED" means what? +GREV-FIXME-HANS: why the references to tails above? fix comments and website.... tail implies it is less than the whole file that is formatted, and it is not.... not in v4.... + + Currently following policies are implemented: + + never tail + + always tail + + only tail if file is smaller than 4 blocks (default). 
+*/ + +#include "../tree.h" +#include "../inode.h" +#include "../super.h" +#include "object.h" +#include "plugin.h" +#include "node/node.h" +#include "plugin_header.h" +#include "../lib.h" + +#include +#include /* For struct inode */ + +/* Never store file's tail as direct item */ +/* Audited by: green(2002.06.12) */ +static int +have_formatting_never(const struct inode *inode UNUSED_ARG /* inode to operate on */ , + loff_t size UNUSED_ARG /* new object size */ ) +{ + return 0; +} + +/* Always store file's tail as direct item */ +/* Audited by: green(2002.06.12) */ +static int +have_formatting_always(const struct inode *inode UNUSED_ARG /* inode to operate on */ , + loff_t size UNUSED_ARG /* new object size */ ) +{ + return 1; +} + +/* This function makes test if we should store file denoted @inode as tails only or + as extents only. */ +static int +have_formatting_default(const struct inode *inode UNUSED_ARG /* inode to operate on */ , + loff_t size /* new object size */ ) +{ + assert("umka-1253", inode != NULL); + + if (size > inode->i_sb->s_blocksize * 4) + return 0; + + return 1; +} + +/* tail plugins */ +formatting_plugin formatting_plugins[LAST_TAIL_FORMATTING_ID] = { + [NEVER_TAILS_FORMATTING_ID] = { + .h = { + .type_id = REISER4_FORMATTING_PLUGIN_TYPE, + .id = NEVER_TAILS_FORMATTING_ID, + .pops = NULL, + .label = "never", + .desc = "Never store file's tail", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .have_tail = have_formatting_never + }, + [ALWAYS_TAILS_FORMATTING_ID] = { + .h = { + .type_id = REISER4_FORMATTING_PLUGIN_TYPE, + .id = ALWAYS_TAILS_FORMATTING_ID, + .pops = NULL, + .label = "always", + .desc = "Always store file's tail", + .linkage = TYPE_SAFE_LIST_LINK_ZERO + }, + .have_tail = have_formatting_always + }, + [SMALL_FILE_FORMATTING_ID] = { + .h = { + .type_id = REISER4_FORMATTING_PLUGIN_TYPE, + .id = SMALL_FILE_FORMATTING_ID, + .pops = NULL, + .label = "4blocks", + .desc = "store files shorter than 4 blocks in tail items", + .linkage = 
TYPE_SAFE_LIST_LINK_ZERO + }, + .have_tail = have_formatting_default + } +}; + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/pool.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/pool.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,226 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Fast pool allocation. + + There are situations when some sub-system normally asks memory allocator + for only few objects, but under some circumstances could require much + more. Typical and actually motivating example is tree balancing. It needs + to keep track of nodes that were involved into it, and it is well-known + that in reasonable packed balanced tree most (92.938121%) percent of all + balancings end up after working with only few nodes (3.141592 on + average). But in rare cases balancing can involve much more nodes + (3*tree_height+1 in extremal situation). + + On the one hand, we don't want to resort to dynamic allocation (slab, + malloc(), etc.) to allocate data structures required to keep track of + nodes during balancing. On the other hand, we cannot statically allocate + required amount of space on the stack, because first: it is useless wastage + of precious resource, and second: this amount is unknown in advance (tree + height can change). + + Pools, implemented in this file are solution for this problem: + + - some configurable amount of objects is statically preallocated on the + stack + + - if this preallocated pool is exhausted and more objects is requested + they are allocated dynamically. + + Pools encapsulate distinction between statically and dynamically allocated + objects. Both allocation and recycling look exactly the same. + + To keep track of dynamically allocated objects, pool adds its own linkage + to each object. 
+ + NOTE-NIKITA This linkage also contains some balancing-specific data. This + is not perfect. On the other hand, balancing is currently the only client + of pool code. + + NOTE-NIKITA Another desirable feature is to rewrite all pool manipulation + functions in the style of tslist/tshash, i.e., make them unreadable, but + type-safe. + + +*/ + +#include "debug.h" +#include "pool.h" +#include "super.h" + +#include +#include + +/* initialise new pool object */ +static void +reiser4_init_pool_obj(reiser4_pool_header * h /* pool object to + * initialise */ ) +{ + pool_usage_list_clean(h); + pool_level_list_clean(h); + pool_extra_list_clean(h); +} + +/* initialise new pool */ +reiser4_internal void +reiser4_init_pool(reiser4_pool * pool /* pool to initialise */ , + size_t obj_size /* size of objects in @pool */ , + int num_of_objs /* number of preallocated objects */ , + char *data /* area for preallocated objects */ ) +{ + reiser4_pool_header *h; + int i; + + assert("nikita-955", pool != NULL); + assert("nikita-1044", obj_size > 0); + assert("nikita-956", num_of_objs >= 0); + assert("nikita-957", data != NULL); + + memset(pool, 0, sizeof *pool); + pool->obj_size = obj_size; + pool->data = data; + pool_usage_list_init(&pool->free); + pool_usage_list_init(&pool->used); + pool_extra_list_init(&pool->extra); + memset(data, 0, obj_size * num_of_objs); + for (i = 0; i < num_of_objs; ++i) { + h = (reiser4_pool_header *) (data + i * obj_size); + reiser4_init_pool_obj(h); + pool_usage_list_push_back(&pool->free, h); + } +} + +/* release pool resources + + Release all resources acquired by this pool, specifically, dynamically + allocated objects. + +*/ +reiser4_internal void +reiser4_done_pool(reiser4_pool * pool UNUSED_ARG /* pool to destroy */ ) +{ +} + +/* allocate carry object from pool + + First, try to get preallocated object. If this fails, resort to dynamic + allocation. 
+ +*/ +static void * +reiser4_pool_alloc(reiser4_pool * pool /* pool to allocate object + * from */ ) +{ + reiser4_pool_header *result; + + assert("nikita-959", pool != NULL); + + if (!pool_usage_list_empty(&pool->free)) { + result = pool_usage_list_pop_front(&pool->free); + pool_usage_list_clean(result); + assert("nikita-965", pool_extra_list_is_clean(result)); + } else { + /* pool is empty. Extra allocations don't deserve dedicated + slab to be served from, as they are expected to be rare. */ + result = reiser4_kmalloc(pool->obj_size, GFP_KERNEL); + if (result != 0) { + reiser4_init_pool_obj(result); + pool_extra_list_push_front(&pool->extra, result); + } else + return ERR_PTR(RETERR(-ENOMEM)); + } + ++pool->objs; + pool_level_list_clean(result); + pool_usage_list_push_front(&pool->used, result); + memset(result + 1, 0, pool->obj_size - sizeof *result); + return result; +} + +/* return object back to the pool */ +reiser4_internal void +reiser4_pool_free(reiser4_pool * pool, + reiser4_pool_header * h /* pool to return object back + * into */ ) +{ + assert("nikita-961", h != NULL); + assert("nikita-962", pool != NULL); + + -- pool->objs; + assert("nikita-963", pool->objs >= 0); + + pool_usage_list_remove_clean(h); + pool_level_list_remove_clean(h); + if (pool_extra_list_is_clean(h)) + pool_usage_list_push_front(&pool->free, h); + else { + pool_extra_list_remove_clean(h); + reiser4_kfree(h); + } +} + +/* add new object to the carry level list + + Carry level is FIFO most of the time, but not always. Complications arise + when make_space() function tries to go to the left neighbor and thus adds + carry node before existing nodes, and also, when updating delimiting keys + after moving data between two nodes, we want left node to be locked before + right node. + + Latter case is confusing at the first glance. 
Problem is that COP_UPDATE + operation that updates delimiting keys is sometimes called with two nodes + (when data are moved between two nodes) and sometimes with only one node + (when leftmost item is deleted in a node). In any case operation is + supplied with at least node whose left delimiting key is to be updated + (that is "right" node). + +*/ +reiser4_internal reiser4_pool_header * +add_obj(reiser4_pool * pool /* pool from which to + * allocate new object */ , + pool_level_list_head * list /* list where to add + * object */ , + pool_ordering order /* where to add */ , + reiser4_pool_header * reference /* after (or + * before) which + * existing + * object to + * add */ ) +{ + reiser4_pool_header *result; + + assert("nikita-972", pool != NULL); + + result = reiser4_pool_alloc(pool); + if (IS_ERR(result)) + return result; + + assert("nikita-973", result != NULL); + + switch (order) { + case POOLO_BEFORE: + pool_level_list_insert_before(reference, result); + break; + case POOLO_AFTER: + pool_level_list_insert_after(reference, result); + break; + case POOLO_LAST: + pool_level_list_push_back(list, result); + break; + case POOLO_FIRST: + pool_level_list_push_front(list, result); + break; + default: + wrong_return_value("nikita-927", "order"); + } + return result; +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/pool.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/pool.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,69 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Fast pool allocation */ + +#ifndef __REISER4_POOL_H__ +#define __REISER4_POOL_H__ + +#include "type_safe_list.h" +#include + +/* each pool object is either on a "used" or "free" list. 
*/ +TYPE_SAFE_LIST_DECLARE(pool_usage); + +/* list of extra pool objects */ +TYPE_SAFE_LIST_DECLARE(pool_extra); + +/* list of pool objects on a given level */ +TYPE_SAFE_LIST_DECLARE(pool_level); + +typedef struct reiser4_pool { + size_t obj_size; + int objs; + char *data; + pool_usage_list_head free; + pool_usage_list_head used; + pool_extra_list_head extra; +} reiser4_pool; + +typedef struct reiser4_pool_header { + /* object is either on free or "used" lists */ + pool_usage_list_link usage_linkage; + pool_level_list_link level_linkage; + pool_extra_list_link extra_linkage; +} reiser4_pool_header; + +typedef enum { + POOLO_BEFORE, + POOLO_AFTER, + POOLO_LAST, + POOLO_FIRST +} pool_ordering; + +/* each pool object is either on a "used" or "free" list. */ +TYPE_SAFE_LIST_DEFINE(pool_usage, reiser4_pool_header, usage_linkage); +/* list of extra pool objects */ +TYPE_SAFE_LIST_DEFINE(pool_extra, reiser4_pool_header, extra_linkage); +/* list of pool objects on a given level */ +TYPE_SAFE_LIST_DEFINE(pool_level, reiser4_pool_header, level_linkage); + +/* pool manipulation functions */ + +extern void reiser4_init_pool(reiser4_pool * pool, size_t obj_size, int num_of_objs, char *data); +extern void reiser4_done_pool(reiser4_pool * pool); +extern void reiser4_pool_free(reiser4_pool * pool, reiser4_pool_header * h); +reiser4_pool_header *add_obj(reiser4_pool * pool, pool_level_list_head * list, + pool_ordering order, reiser4_pool_header * reference); + +/* __REISER4_POOL_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/readahead.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/readahead.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,378 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +#include "forward.h" +#include "tree.h" +#include "tree_walk.h" +#include "super.h" +#include "inode.h" +#include "key.h" +#include "znode.h" + +#include /* for totalram_pages */ + +reiser4_internal void init_ra_info(ra_info_t * rai) +{ + rai->key_to_stop = *min_key(); +} + +/* global formatted node readahead parameter. It can be set by mount option -o readahead:NUM:1 */ +static inline int ra_adjacent_only(int flags) +{ + return flags & RA_ADJACENT_ONLY; +} + +/* this is used by formatted_readahead to decide whether read for right neighbor of node is to be issued. It returns 1 + if right neighbor's first key is less or equal to readahead's stop key */ +static int +should_readahead_neighbor(znode *node, ra_info_t *info) +{ + return (UNDER_RW(dk, ZJNODE(node)->tree, read, + keyle(znode_get_rd_key(node), &info->key_to_stop))); +} + +#define LOW_MEM_PERCENTAGE (5) + +static int +low_on_memory(void) +{ + unsigned int freepages; + + freepages = nr_free_pages(); + return freepages < (totalram_pages * LOW_MEM_PERCENTAGE / 100); +} + +/* start read for @node and for a few of its right neighbors */ +reiser4_internal void +formatted_readahead(znode *node, ra_info_t *info) +{ + ra_params_t *ra_params; + znode *cur; + int i; + int grn_flags; + lock_handle next_lh; + + /* do nothing if node block number has not been assigned to node (which means it is still in cache). 
*/ + if (blocknr_is_fake(znode_get_block(node))) + return; + + ra_params = get_current_super_ra_params(); + + if (znode_page(node) == NULL) + jstartio(ZJNODE(node)); + + if (znode_get_level(node) != LEAF_LEVEL) + return; + + /* don't waste memory for read-ahead when low on memory */ + if (low_on_memory()) + return; + + /* We can have locked nodes on upper tree levels, in this situation lock + priorities do not help to resolve deadlocks, we have to use TRY_LOCK + here. */ + grn_flags = (GN_CAN_USE_UPPER_LEVELS | GN_TRY_LOCK); + + i = 0; + cur = zref(node); + init_lh(&next_lh); + while (i < ra_params->max) { + const reiser4_block_nr *nextblk; + + if (!should_readahead_neighbor(cur, info)) + break; + + if (reiser4_get_right_neighbor(&next_lh, cur, ZNODE_READ_LOCK, grn_flags)) + break; + + if (JF_ISSET(ZJNODE(next_lh.node), JNODE_EFLUSH)) { + /* emergency flushed znode is encountered. That means we are low on memory. Do not readahead + then */ + break; + } + + nextblk = znode_get_block(next_lh.node); + if (blocknr_is_fake(nextblk) || + (ra_adjacent_only(ra_params->flags) && *nextblk != *znode_get_block(cur) + 1)) { + break; + } + + zput(cur); + cur = zref(next_lh.node); + done_lh(&next_lh); + if (znode_page(cur) == NULL) + jstartio(ZJNODE(cur)); + else + /* Do not scan read-ahead window if pages already + * allocated (and i/o already started). */ + break; + + i ++; + } + zput(cur); + done_lh(&next_lh); +} + +static inline loff_t get_max_readahead(struct reiser4_file_ra_state *ra) +{ + /* NOTE: ra->max_window_size is initialized in + * reiser4_get_file_fsdata(). */ + return ra->max_window_size; +} + +static inline loff_t get_min_readahead(struct reiser4_file_ra_state *ra) +{ + return VM_MIN_READAHEAD * 1024; +} + + +/* Start read for the given window. 
*/ +static loff_t do_reiser4_file_readahead (struct inode * inode, loff_t offset, loff_t size) +{ + reiser4_tree * tree = current_tree; + reiser4_inode * object; + reiser4_key start_key; + reiser4_key stop_key; + + lock_handle lock; + lock_handle next_lock; + + coord_t coord; + tap_t tap; + + loff_t result; + + assert("zam-994", lock_stack_isclean(get_current_lock_stack())); + + object = reiser4_inode_data(inode); + key_by_inode_unix_file(inode, offset, &start_key); + key_by_inode_unix_file(inode, offset + size, &stop_key); + + init_lh(&lock); + init_lh(&next_lock); + + /* Stop on twig level */ + result = coord_by_key( + current_tree, &start_key, &coord, &lock, ZNODE_READ_LOCK, + FIND_EXACT, TWIG_LEVEL, TWIG_LEVEL, 0, NULL); + if (result < 0) + goto error; + if (result != CBK_COORD_FOUND) { + result = 0; + goto error; + } + + tap_init(&tap, &coord, &lock, ZNODE_WRITE_LOCK); + result = tap_load(&tap); + if (result) + goto error0; + + /* Advance coord to right (even across node boundaries) while coord key + * less than stop_key. */ + while (1) { + reiser4_key key; + znode * child; + reiser4_block_nr blk; + + /* Currently this read-ahead is for formatted nodes only */ + if (!item_is_internal(&coord)) + break; + + item_key_by_coord(&coord, &key); + if (keyge(&key, &stop_key)) + break; + + result = item_utmost_child_real_block(&coord, LEFT_SIDE, &blk); + if (result || blk == 0) + break; + + child = zget(tree, &blk, lock.node, LEAF_LEVEL, GFP_KERNEL); + + if (IS_ERR(child)) { + result = PTR_ERR(child); + break; + } + + /* If znode's page is present that usually means that i/o was + * already started for the page. */ + if (znode_page(child) == NULL) { + result = jstartio(ZJNODE(child)); + if (result) { + zput(child); + break; + } + } + zput(child); + + /* Advance coord by one unit ... */ + result = coord_next_unit(&coord); + if (result == 0) + continue; + + /* ... and continue on the right neighbor if needed. 
*/ + result = reiser4_get_right_neighbor ( + &next_lock, lock.node, ZNODE_READ_LOCK, + GN_CAN_USE_UPPER_LEVELS); + if (result) + break; + + if (znode_page(next_lock.node) == NULL) { + loff_t end_offset; + + result = jstartio(ZJNODE(next_lock.node)); + if (result) + break; + + read_lock_dk(tree); + end_offset = get_key_offset(znode_get_ld_key(next_lock.node)); + read_unlock_dk(tree); + + result = end_offset - offset; + break; + } + + result = tap_move(&tap, &next_lock); + if (result) + break; + + done_lh(&next_lock); + coord_init_first_unit(&coord, lock.node); + } + + if (! result || result == -E_NO_NEIGHBOR) + result = size; + error0: + tap_done(&tap); + error: + done_lh(&lock); + done_lh(&next_lock); + return result; +} + +typedef unsigned long long int ull_t; +#define PRINTK(...) noop +/* This is derived from the linux original read-ahead code (mm/readahead.c), and + * cannot be licensed from Namesys in its current state. */ +int reiser4_file_readahead (struct file * file, loff_t offset, size_t size) +{ + loff_t min; + loff_t max; + loff_t orig_next_size; + loff_t actual; + struct reiser4_file_ra_state * ra; + struct inode * inode = file->f_dentry->d_inode; + + assert ("zam-995", inode != NULL); + + PRINTK ("R/A REQ: off=%llu, size=%llu\n", (ull_t)offset, (ull_t)size); + ra = &reiser4_get_file_fsdata(file)->ra1; + + max = get_max_readahead(ra); + if (max == 0) + goto out; + + min = get_min_readahead(ra); + orig_next_size = ra->next_size; + + if (!ra->slow_start) { + ra->slow_start = 1; + /* + * Special case - first read from first page. + * We'll assume it's a whole-file read, and + * grow the window fast. + */ + ra->next_size = max / 2; + goto do_io; + + } + + /* + * Is this request outside the current window? + */ + if (offset < ra->start || offset > (ra->start + ra->size)) { + /* R/A miss. */ + + /* next r/a window size is shrunk by fixed offset and enlarged + * by 2 * size of read request. 
This makes r/a window smaller + * for small unordered requests and larger for big read + * requests. */ + ra->next_size += -2 * PAGE_CACHE_SIZE + 2 * size ; + if (ra->next_size < 0) + ra->next_size = 0; +do_io: + ra->start = offset; + ra->size = size + orig_next_size; + actual = do_reiser4_file_readahead(inode, offset, ra->size); + if (actual > 0) + ra->size = actual; + + ra->ahead_start = ra->start + ra->size; + ra->ahead_size = ra->next_size; + + actual = do_reiser4_file_readahead(inode, ra->ahead_start, ra->ahead_size); + if (actual > 0) + ra->ahead_size = actual; + + PRINTK ("R/A MISS: cur = [%llu, +%llu[, ahead = [%llu, +%llu[\n", + (ull_t)ra->start, (ull_t)ra->size, + (ull_t)ra->ahead_start, (ull_t)ra->ahead_size); + } else { + /* R/A hit. */ + + /* Enlarge r/a window size. */ + ra->next_size += 2 * size; + if (ra->next_size > max) + ra->next_size = max; + + PRINTK("R/A HIT\n"); + while (offset + size >= ra->ahead_start) { + ra->start = ra->ahead_start; + ra->size = ra->ahead_size; + + ra->ahead_start = ra->start + ra->size; + ra->ahead_size = ra->next_size; + + actual = do_reiser4_file_readahead( + inode, ra->ahead_start, ra->ahead_size); + if (actual > 0) { + ra->ahead_size = actual; + } + + PRINTK ("R/A ADVANCE: cur = [%llu, +%llu[, ahead = [%llu, +%llu[\n", + (ull_t)ra->start, (ull_t)ra->size, + (ull_t)ra->ahead_start, (ull_t)ra->ahead_size); + + } + } + +out: + return 0; +} + +reiser4_internal void +reiser4_readdir_readahead_init(struct inode *dir, tap_t *tap) +{ + reiser4_key *stop_key; + + assert("nikita-3542", dir != NULL); + assert("nikita-3543", tap != NULL); + + stop_key = &tap->ra_info.key_to_stop; + /* initialize readdir readahead information: include into readahead + * stat data of all files of the directory */ + set_key_locality(stop_key, get_inode_oid(dir)); + set_key_type(stop_key, KEY_SD_MINOR); + set_key_ordering(stop_key, get_key_ordering(max_key())); + set_key_objectid(stop_key, get_key_objectid(max_key())); + set_key_offset(stop_key, 
get_key_offset(max_key())); +} + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 80 + End: +*/ diff -puN /dev/null fs/reiser4/readahead.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/readahead.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,50 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#ifndef __READAHEAD_H__ +#define __READAHEAD_H__ + +#include "key.h" + +typedef enum { + RA_ADJACENT_ONLY = 1, /* only requests nodes which are adjacent. Default is NO (not only adjacent) */ +} ra_global_flags; + +/* reiser4 super block has a field of this type. It controls readahead during tree traversals */ +typedef struct formatted_read_ahead_params { + unsigned long max; /* request not more than this amount of nodes. Default is totalram_pages / 4 */ + int flags; +} ra_params_t; + + +typedef struct { + reiser4_key key_to_stop; +} ra_info_t; + +void formatted_readahead(znode *, ra_info_t *); +void init_ra_info(ra_info_t * rai); + +struct reiser4_file_ra_state { + loff_t start; /* Current window */ + loff_t size; + loff_t next_size; /* Next window size */ + loff_t ahead_start; /* Ahead window */ + loff_t ahead_size; + loff_t max_window_size; /* Maximum readahead window */ + loff_t slow_start; /* enlarging r/a size algorithm. */ +}; + +extern int reiser4_file_readahead(struct file *, loff_t, size_t); +extern void reiser4_readdir_readahead_init(struct inode *dir, tap_t *tap); + +/* __READAHEAD_H__ */ +#endif + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/README --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/README Mon Jun 13 15:05:23 2005 @@ -0,0 +1,125 @@ +[LICENSING] + +Reiser4 is hereby licensed under the GNU General +Public License version 2. 
+ +Source code files that contain the phrase "licensing governed by +reiser4/README" are "governed files" throughout this file. Governed +files are licensed under the GPL. The portions of them owned by Hans +Reiser, or authorized to be licensed by him, have been in the past, +and likely will be in the future, licensed to other parties under +other licenses. If you add your code to governed files, and don't +want it to be owned by Hans Reiser, put your copyright label on that +code so the poor blight and his customers can keep things straight. +All portions of governed files not labeled otherwise are owned by Hans +Reiser, and by adding your code to it, widely distributing it to +others or sending us a patch, and leaving the sentence in stating that +licensing is governed by the statement in this file, you accept this. +It will be a kindness if you identify whether Hans Reiser is allowed +to license code labeled as owned by you on your behalf other than +under the GPL, because he wants to know if it is okay to do so and put +a check in the mail to you (for non-trivial improvements) when he +makes his next sale. He makes no guarantees as to the amount if any, +though he feels motivated to motivate contributors, and you can surely +discuss this with him before or after contributing. You have the +right to decline to allow him to license your code contribution other +than under the GPL. + +Further licensing options are available for commercial and/or other +interests directly from Hans Reiser: reiser@namesys.com. If you interpret +the GPL as not allowing those additional licensing options, you read +it wrongly, and Richard Stallman agrees with me, when carefully read +you can see that those restrictions on additional terms do not apply +to the owner of the copyright, and my interpretation of this shall +govern for this license. 
+ +[END LICENSING] + +Reiser4 is a file system based on dancing tree algorithms, and is +described at http://www.namesys.com + +mkfs.reiser4 and other utilities are on our webpage or wherever your +Linux provider put them. You really want to be running the latest +version off the website if you use fsck. + +Yes, if you update your reiser4 kernel module you do have to +recompile your kernel, most of the time. The errors you get will be +quite cryptic if you forget to do so. + +Hideous Commercial Pitch: Spread your development costs across other OS +vendors. Select from the best in the world, not the best in your +building, by buying from third party OS component suppliers. Leverage +the software component development power of the internet. Be the most +aggressive in taking advantage of the commercial possibilities of +decentralized internet development, and add value through your branded +integration that you sell as an operating system. Let your competitors +be the ones to compete against the entire internet by themselves. Be +hip, get with the new economic trend, before your competitors do. Send +email to reiser@namesys.com + +Hans Reiser was the primary architect of Reiser4, but a whole team +chipped their ideas in. He invested everything he had into Namesys +for 5.5 dark years of no money before Reiser3 finally started to work well +enough to bring in money. He owns the copyright. + +DARPA was the primary sponsor of Reiser4. DARPA does not endorse +Reiser4, it merely sponsors it. DARPA is, in solely Hans's personal +opinion, unique in its willingness to invest into things more +theoretical than the VC community can readily understand, and more +longterm than allows them to be sure that they will be the ones to +extract the economic benefits from. DARPA also integrated us into a +security community that transformed our security worldview. + +Vladimir Saveliev is our lead programmer, with us from the beginning, +and he worked long hours writing the cleanest code.
This is why he is +now the lead programmer after years of commitment to our work. He +always made the effort to be the best he could be, and to make his +code the best that it could be. What resulted was quite remarkable. I +don't think that money can ever motivate someone to work the way he +did, he is one of the most selfless men I know. + +Alexander Lyamin was our sysadmin, and helped to educate us in +security issues. Moscow State University and IMT were very generous +in the internet access they provided us, and in lots of other little +ways that a generous institution can be. + +Alexander Zarochentcev (sometimes known as zam, or sasha), wrote the +locking code, the block allocator, and finished the flushing code. +His code is always crystal clean and well structured. + +Nikita Danilov wrote the core of the balancing code, the core of the +plugins code, and the directory code. He worked a steady pace of long +hours that produced a whole lot of well abstracted code. He is our +senior computer scientist. + +Vladimir Demidov wrote the parser. Writing an in kernel parser is +something very few persons have the skills for, and it is thanks to +him that we can say that the parser is really not so big compared to +various bits of our other code, and making a parser work in the kernel +was not so complicated as everyone would imagine mainly because it was +him doing it... + +Joshua McDonald wrote the transaction manager, and the flush code. +The flush code unexpectedly turned out to be extremely hairy for reasons +you can read about on our web page, and he did a great job on an +extremely difficult task. + +Nina Reiser handled our accounting, government relations, and much +more. + +Ramon Reiser developed our website. + +Beverly Palmer drew our graphics. + +Vitaly Fertman developed librepair, userspace plugins repair code, fsck +and worked with Umka on developing libreiser4 and userspace plugins.
+ +Yury Umanets (aka Umka) developed libreiser4, userspace plugins and +userspace tools (reiser4progs). + +Oleg Drokin (aka Green) is the release manager who fixes everything. +It is so nice to have someone like that on the team. He (plus Chris +and Jeff) make it possible for the entire rest of the Namesys team to +focus on Reiser4, and he fixed a whole lot of Reiser4 bugs also. It +is just amazing to watch his talent for spotting bugs in action. + diff -puN /dev/null fs/reiser4/reiser4.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/reiser4.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,277 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* definitions of common constants used by reiser4 */ + +#if !defined( __REISER4_H__ ) +#define __REISER4_H__ + +#include +#include /* for HZ */ +#include +#include +#include +#include +#include + +/* + * reiser4 compilation options. + */ + +#if defined(CONFIG_REISER4_DEBUG) +/* turn on assertion checks */ +#define REISER4_DEBUG (1) +#else +#define REISER4_DEBUG (0) +#endif + + +#if defined(CONFIG_REISER4_COPY_ON_CAPTURE) +/* + * Turns on copy-on-capture (COC) optimization. See + * http://www.namesys.com/v4/v4.html#cp_on_capture + */ +#define REISER4_COPY_ON_CAPTURE (1) +#else +#define REISER4_COPY_ON_CAPTURE (0) +#endif + +/* + * Turn on large keys mode. In this mode (which is default), reiser4 key has 4 + * 8-byte components. In the old "small key" mode, it's 3 8-byte + * components. Additional component, referred to as "ordering" is used to + * order items from which given object is composed. As such, ordering is + * placed between locality and objectid. For directory item ordering contains + * initial prefix of the file name this item is for. This sorts all directory + * items within given directory lexicographically (but see + * fibration.[ch]). For file body and stat-data, ordering contains initial + * prefix of the name file was initially created with.
In the common case + * (files with single name) this allows to order file bodies and stat-datas in + * the same order as their respective directory entries, thus speeding up + * readdir. + * + * Note, that kernel can only mount file system with the same key size as one + * it is compiled for, so flipping this option may render your data + * inaccessible. + */ +#define REISER4_LARGE_KEY (1) +/*#define REISER4_LARGE_KEY (0)*/ + +/* + * This will be turned on automatically when viewmasks are for + * obvious reasons. + */ +#define ENABLE_REISER4_PSEUDO (0) + +/* + * PLEASE update fs/reiser4/kattr.c:show_options() when adding new compilation + * option + */ + + +extern const char *REISER4_SUPER_MAGIC_STRING; +extern const int REISER4_MAGIC_OFFSET; /* offset to magic string from the + * beginning of device */ + +/* here go tunable parameters that are not worth special entry in kernel + configuration */ + +/* default number of slots in coord-by-key caches */ +#define CBK_CACHE_SLOTS (16) +/* how many elementary tree operation to carry on the next level */ +#define CARRIES_POOL_SIZE (5) +/* size of pool of preallocated nodes for carry process. */ +#define NODES_LOCKED_POOL_SIZE (5) + +#define REISER4_NEW_NODE_FLAGS (COPI_LOAD_LEFT | COPI_LOAD_RIGHT | COPI_GO_LEFT) +#define REISER4_NEW_EXTENT_FLAGS (COPI_LOAD_LEFT | COPI_LOAD_RIGHT | COPI_GO_LEFT) +#define REISER4_PASTE_FLAGS (COPI_GO_LEFT) +#define REISER4_INSERT_FLAGS (COPI_GO_LEFT) + +/* we are supporting reservation of disk space on uid basis */ +#define REISER4_SUPPORT_UID_SPACE_RESERVATION (0) +/* we are supporting reservation of disk space for groups */ +#define REISER4_SUPPORT_GID_SPACE_RESERVATION (0) +/* we are supporting reservation of disk space for root */ +#define REISER4_SUPPORT_ROOT_SPACE_RESERVATION (0) +/* we use rapid flush mode, see flush.c for comments. */ +#define REISER4_USE_RAPID_FLUSH (1) + +/* + * set this to 0 if you don't want to use wait-for-flush in ->writepage(). 
+ */ +#define REISER4_USE_ENTD (1) + +/* Using of emergency flush is an option. */ +#define REISER4_USE_EFLUSH (1) + +/* key allocation is Plan-A */ +#define REISER4_PLANA_KEY_ALLOCATION (1) +/* key allocation follows good old 3.x scheme */ +#define REISER4_3_5_KEY_ALLOCATION (0) + +/* size of hash-table for znodes */ +#define REISER4_ZNODE_HASH_TABLE_SIZE (1 << 13) + +/* number of buckets in lnode hash-table */ +#define LNODE_HTABLE_BUCKETS (1024) + +/* some ridiculously high maximal limit on height of znode tree. This + is used in declaration of various per level arrays and + to allocate statistics gathering array for per-level stats. */ +#define REISER4_MAX_ZTREE_HEIGHT (8) + +#define REISER4_PANIC_MSG_BUFFER_SIZE (1024) + +/* If array contains less than REISER4_SEQ_SEARCH_BREAK elements then, + sequential search is on average faster than binary. This is because + of better optimization and because sequential search is more CPU + cache friendly. This number (25) was found by experiments on dual AMD + Athlon(tm), 1400MHz. + + NOTE: testing in kernel has shown that binary search is more effective than + implied by results of the user level benchmarking. Probably because in the + node keys are separated by other data. So value was adjusted after few + tests. More thorough tuning is needed. +*/ +#define REISER4_SEQ_SEARCH_BREAK (3) + +/* don't allow tree to be lower than this */ +#define REISER4_MIN_TREE_HEIGHT (TWIG_LEVEL) + +/* NOTE NIKITA this is no longer used: maximal atom size is auto-adjusted to + * available memory. */ +/* Default value of maximal atom size. Can be overwritten by + tmgr.atom_max_size mount option. By default infinity. */ +#define REISER4_ATOM_MAX_SIZE ((unsigned)(~0)) + +/* Default value of maximal atom age (in jiffies). After reaching this age + atom will be forced to commit, either synchronously or asynchronously. Can + be overwritten by tmgr.atom_max_age mount option.
*/ +#define REISER4_ATOM_MAX_AGE (600 * HZ) + +/* sleeping period for ktxnmrgd */ +#define REISER4_TXNMGR_TIMEOUT (5 * HZ) + +/* timeout to wait for ent thread in writepage. Default: 3 milliseconds. */ +#define REISER4_ENTD_TIMEOUT (3 * HZ / 1000) + +/* start complaining after that many restarts in coord_by_key(). + + This either means incredibly heavy contention for this part of a tree, or + some corruption or bug. +*/ +#define REISER4_CBK_ITERATIONS_LIMIT (100) + +/* return -EIO after that many iterations in coord_by_key(). + + I have witnessed more than 800 iterations (in 30 thread test) before cbk + finished. --nikita +*/ +#define REISER4_MAX_CBK_ITERATIONS 500000 + +/* put a per-inode limit on maximal number of directory entries with identical + keys in hashed directory. + + Disable this until inheritance interfaces stabilize: we need some way to + set per directory limit. +*/ +#define REISER4_USE_COLLISION_LIMIT (0) + +/* If flush finds more than FLUSH_RELOCATE_THRESHOLD adjacent dirty leaf-level blocks it + will force them to be relocated. */ +#define FLUSH_RELOCATE_THRESHOLD 64 +/* If flush can find a block allocation at most FLUSH_RELOCATE_DISTANCE + from the preceder, it will relocate to that position. */ +#define FLUSH_RELOCATE_DISTANCE 64 + +/* If we have written this much or more blocks before encountering busy jnode + in flush list - abort flushing hoping that next time we get called + this jnode will be clean already, and we will save some seeks. */ +#define FLUSH_WRITTEN_THRESHOLD 50 + +/* The maximum number of nodes to scan left on a level during flush. */ +#define FLUSH_SCAN_MAXNODES 10000 + +/* per-atom limit of flushers */ +#define ATOM_MAX_FLUSHERS (1) + +/* default tracing buffer size */ +#define REISER4_TRACE_BUF_SIZE (1 << 15) + +/* what size units of IO we would like cp, etc., to use, in writing to + reiser4. In bytes. + + Can be overwritten by optimal_io_size mount option.
+*/ +#define REISER4_OPTIMAL_IO_SIZE (64 * 1024) + +/* see comments in inode.c:oid_to_uino() */ +#define REISER4_UINO_SHIFT (1 << 30) + +/* Mark function argument as unused to avoid compiler warnings. */ +#define UNUSED_ARG __attribute__((unused)) + +#if ((__GNUC__ == 3) && (__GNUC_MINOR__ >= 3)) || (__GNUC__ > 3) +#define NONNULL __attribute__((nonnull)) +#else +#define NONNULL +#endif + +/* master super block offset in bytes.*/ +#define REISER4_MASTER_OFFSET 65536 + +/* size of VFS block */ +#define VFS_BLKSIZE 512 +/* number of bits in size of VFS block (512==2^9) */ +#define VFS_BLKSIZE_BITS 9 + +#define REISER4_I reiser4_inode_data + +/* implication */ +#define ergo( antecedent, consequent ) ( !( antecedent ) || ( consequent ) ) +/* logical equivalence */ +#define equi( p1, p2 ) ( ergo( ( p1 ), ( p2 ) ) && ergo( ( p2 ), ( p1 ) ) ) + +#define sizeof_array(x) ((int) (sizeof(x) / sizeof(x[0]))) + + +#define NOT_YET (0) + +/** Reiser4 specific error codes **/ + +#define REISER4_ERROR_CODE_BASE 500 + +/* Neighbor is not available (side neighbor or parent) */ +#define E_NO_NEIGHBOR (REISER4_ERROR_CODE_BASE) + +/* Node was not found in cache */ +#define E_NOT_IN_CACHE (REISER4_ERROR_CODE_BASE + 1) + +/* node has no free space enough for completion of balancing operation */ +#define E_NODE_FULL (REISER4_ERROR_CODE_BASE + 2) + +/* repeat operation */ +#define E_REPEAT (REISER4_ERROR_CODE_BASE + 3) + +/* deadlock happens */ +#define E_DEADLOCK (REISER4_ERROR_CODE_BASE + 4) + +/* operation cannot be performed, because it would block and non-blocking mode + * was requested. */ +#define E_BLOCK (REISER4_ERROR_CODE_BASE + 5) + +/* wait some event (depends on context), then repeat */ +#define E_WAIT (REISER4_ERROR_CODE_BASE + 6) + +#endif /* __REISER4_H__ */ + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/safe_link.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/safe_link.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,352 @@ +/* Copyright 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Safe-links. */ + +/* + * Safe-links are used to maintain file system consistency during operations + * that span multiple transactions. For example: + * + * 1. Unlink. UNIX supports "open-but-unlinked" files, that is files + * without user-visible names in the file system, but still opened by some + * active process. What happens here is that unlink proper (i.e., removal + * of the last file name) and file deletion (truncate of file body to zero + * and deletion of stat-data, that happens when last file descriptor is + * closed), may belong to different transactions T1 and T2. If a crash + * happens after T1 commit, but before T2 commit, on-disk file system has + * a file without name, that is, disk space leak. + * + * 2. Truncate. Truncate of large file may span multiple transactions. If + * system crashes while truncate was in-progress, file is left partially + * truncated, which violates "atomicity guarantees" of reiser4, viz. that + * every system call is atomic. + * + * Safe-links address both above cases. Basically, safe-link is a way to post + * some operation to be executed during commit of some other transaction than + * current one. (Another way to look at the safe-link is to interpret it as a + * logical logging.) + * + * Specifically, at the beginning of unlink safe-link is inserted in the + * tree. This safe-link is normally removed by file deletion code (during + * transaction T2 in the above terms). Truncate also inserts safe-link that is + * normally removed when truncate operation is finished.
+ * + * This means, that in the case of "clean umount" there are no safe-links in + * the tree. If safe-links are observed during mount, it means that (a) system + * was terminated abnormally, and (b) safe-links correspond to the "pending" + * (i.e., not finished) operations that were in-progress during system + * termination. Each safe-link records enough information to complete + * corresponding operation, and mount simply "replays" them (hence, the + * analogy with the logical logging). + * + * Safe-links are implemented as blackbox items (see + * plugin/item/blackbox.[ch]). + * + * For the reference: ext3 also has similar mechanism, it's called "an orphan + * list" there. + */ + +#include "safe_link.h" +#include "debug.h" +#include "inode.h" + +#include "plugin/item/blackbox.h" + +#include + +/* + * On-disk format of safe-link. + */ +typedef struct safelink { + reiser4_key sdkey; /* key of stat-data for the file safe-link is + * for */ + d64 size; /* size to which file should be truncated */ +} safelink_t; + +/* + * locality where safe-link items are stored. Next to the objectid of root + * directory. + */ +static oid_t +safe_link_locality(reiser4_tree *tree) +{ + return get_key_objectid(get_super_private(tree->super)->df_plug->root_dir_key(tree->super)) + 1; +} + +/* + Construct a key for the safe-link. Key has the following format: + +| 60 | 4 | 64 | 4 | 60 | 64 | ++---------------+---+------------------+---+---------------+------------------+ +| locality | 0 | 0 | 0 | objectid | link type | ++---------------+---+------------------+---+---------------+------------------+ +| | | | | +| 8 bytes | 8 bytes | 8 bytes | 8 bytes | + + This is in large keys format. In small keys format second 8 byte chunk is + out. Locality is a constant returned by safe_link_locality(). objectid is + an oid of a file on which operation protected by this safe-link is + performed. link-type is used to distinguish safe-links for different + operations.
+ + */ +static reiser4_key * +build_link_key(struct inode *inode, reiser4_safe_link_t link, reiser4_key *key) +{ + reiser4_key_init(key); + set_key_locality(key, safe_link_locality(tree_by_inode(inode))); + set_key_objectid(key, get_inode_oid(inode)); + set_key_offset(key, link); + return key; +} + +/* + * how much disk space is necessary to insert and remove (in the + * error-handling path) safe-link. + */ +static __u64 +safe_link_tograb(reiser4_tree *tree) +{ + return + /* insert safe link */ + estimate_one_insert_item(tree) + + /* remove safe link */ + estimate_one_item_removal(tree) + + /* drill to the leaf level during insertion */ + 1 + estimate_one_insert_item(tree) + + /* + * possible update of existing safe-link. Actually, if + * safe-link existed already (we failed to remove it), then no + * insertion is necessary, so this term is already "covered", + * but for simplicity let's leave it. + */ + 1; +} + +/* + * grab enough disk space to insert and remove (in the error-handling path) + * safe-link. + */ +reiser4_internal int safe_link_grab(reiser4_tree *tree, reiser4_ba_flags_t flags) +{ + int result; + + grab_space_enable(); + /* The sbinfo->delete semaphore can be taken here. + * safe_link_release() should be called before leaving reiser4 + * context. */ + result = reiser4_grab_reserved(tree->super, safe_link_tograb(tree), flags); + grab_space_enable(); + return result; +} + +/* + * release unused disk space reserved by safe_link_grab(). + */ +reiser4_internal void safe_link_release(reiser4_tree * tree) +{ + reiser4_release_reserved(tree->super); +} + +/* + * insert into tree safe-link for operation @link on inode @inode.
+ */ +reiser4_internal int safe_link_add(struct inode *inode, reiser4_safe_link_t link) +{ + reiser4_key key; + safelink_t sl; + int length; + int result; + reiser4_tree *tree; + + build_sd_key(inode, &sl.sdkey); + length = sizeof sl.sdkey; + + if (link == SAFE_TRUNCATE) { + /* + * for truncate we have to store final file length also, + * expand item. + */ + length += sizeof(sl.size); + cputod64(inode->i_size, &sl.size); + } + tree = tree_by_inode(inode); + build_link_key(inode, link, &key); + + result = store_black_box(tree, &key, &sl, length); + if (result == -EEXIST) + result = update_black_box(tree, &key, &sl, length); + return result; +} + +/* + * remove safe-link corresponding to the operation @link on inode @inode from + * the tree. + */ +reiser4_internal int safe_link_del(struct inode *inode, reiser4_safe_link_t link) +{ + reiser4_key key; + + return kill_black_box(tree_by_inode(inode), + build_link_key(inode, link, &key)); +} + +/* + * in-memory structure to keep information extracted from safe-link. This is + * used to iterate over all safe-links. + */ +typedef struct { + reiser4_tree *tree; /* internal tree */ + reiser4_key key; /* safe-link key*/ + reiser4_key sdkey; /* key of object stat-data */ + reiser4_safe_link_t link; /* safe-link type */ + oid_t oid; /* object oid */ + __u64 size; /* final size for truncate */ +} safe_link_context; + +/* + * start iterating over all safe-links. + */ +static void safe_link_iter_begin(reiser4_tree *tree, safe_link_context *ctx) +{ + ctx->tree = tree; + reiser4_key_init(&ctx->key); + set_key_locality(&ctx->key, safe_link_locality(tree)); + set_key_objectid(&ctx->key, get_key_objectid(max_key())); + set_key_offset(&ctx->key, get_key_offset(max_key())); +} + +/* + * return next safe-link. 
+ */ +static int safe_link_iter_next(safe_link_context *ctx) +{ + int result; + safelink_t sl; + + result = load_black_box(ctx->tree, + &ctx->key, &sl, sizeof sl, 0); + if (result == 0) { + ctx->oid = get_key_objectid(&ctx->key); + ctx->link = get_key_offset(&ctx->key); + ctx->sdkey = sl.sdkey; + if (ctx->link == SAFE_TRUNCATE) + ctx->size = d64tocpu(&sl.size); + } + return result; +} + +/* + * check whether there are any more safe-links left in the tree. + */ +static int safe_link_iter_finished(safe_link_context *ctx) +{ + return get_key_locality(&ctx->key) != safe_link_locality(ctx->tree); +} + + +/* + * finish safe-link iteration. + */ +static void safe_link_iter_end(safe_link_context *ctx) +{ + /* nothing special */ +} + +/* + * process single safe-link. + */ +static int process_safelink(struct super_block *super, reiser4_safe_link_t link, + reiser4_key *sdkey, oid_t oid, __u64 size) +{ + struct inode *inode; + int result; + + /* + * obtain object inode by reiser4_iget(), then call object plugin + * ->safelink() method to do actual work, then delete safe-link on + * success. + */ + inode = reiser4_iget(super, sdkey, 1); + if (!IS_ERR(inode)) { + file_plugin *fplug; + + fplug = inode_file_plugin(inode); + assert("nikita-3428", fplug != NULL); + if (fplug->safelink != NULL) { + /* txn_restart_current is not necessary because + * mounting is single-threaded. However, without it + * deadlock detection code will complain (see + * nikita-3361).
*/ + txn_restart_current(); + result = fplug->safelink(inode, link, size); + } else { + warning("nikita-3430", + "Cannot handle safelink for %lli", + (unsigned long long)oid); + print_key("key", sdkey); + result = 0; + } + if (result != 0) { + warning("nikita-3431", + "Error processing safelink for %lli: %i", + (unsigned long long)oid, result); + } + reiser4_iget_complete(inode); + iput(inode); + if (result == 0) { + result = safe_link_grab(tree_by_inode(inode), + BA_CAN_COMMIT); + if (result == 0) + result = safe_link_del(inode, link); + safe_link_release(tree_by_inode(inode)); + /* + * restart transaction: if there was large number of + * safe-links, their processing may fail to fit into + * single transaction. + */ + if (result == 0) + txn_restart_current(); + } + } else + result = PTR_ERR(inode); + return result; +} + +/* + * iterate over all safe-links in the file-system processing them one by one. + */ +reiser4_internal int process_safelinks(struct super_block *super) +{ + safe_link_context ctx; + int result; + + if (rofs_super(super)) + /* do nothing on the read-only file system */ + return 0; + safe_link_iter_begin(&get_super_private(super)->tree, &ctx); + result = 0; + do { + result = safe_link_iter_next(&ctx); + if (safe_link_iter_finished(&ctx) || result == -ENOENT) { + result = 0; + break; + } + if (result == 0) + result = process_safelink(super, ctx.link, + &ctx.sdkey, ctx.oid, ctx.size); + } while (result == 0); + safe_link_iter_end(&ctx); + return result; +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/safe_link.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/safe_link.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,29 @@ +/* Copyright 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Safe-links. See safe_link.c for details. 
*/ + +#if !defined( __FS_SAFE_LINK_H__ ) +#define __FS_SAFE_LINK_H__ + +#include "tree.h" + +int safe_link_grab(reiser4_tree *tree, reiser4_ba_flags_t flags); +void safe_link_release(reiser4_tree *tree); +int safe_link_add(struct inode *inode, reiser4_safe_link_t link); +int safe_link_del(struct inode *inode, reiser4_safe_link_t link); + +int process_safelinks(struct super_block *super); + +/* __FS_SAFE_LINK_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/seal.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/seal.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,238 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ +/* Seals implementation. */ +/* Seals are "weak" tree pointers. They are analogous to tree coords in + allowing to bypass tree traversal. But normal usage of coords implies that + node pointed to by coord is locked, whereas seals don't keep a lock (or + even a reference) to znode. Instead, each znode contains a version number, + increased on each znode modification. This version number is copied into a + seal when seal is created. Later, one can "validate" seal by calling + seal_validate(). If znode is in cache and its version number is still the + same, seal is "pristine" and coord associated with it can be re-used + immediately. + + If, on the other hand, znode is out of cache, or it is obviously different + one from the znode seal was initially attached to (for example, it is on + the different level, or is being removed from the tree), seal is + irreparably invalid ("burned") and tree traversal has to be repeated. + + Otherwise, there is some hope, that while znode was modified (and seal was + "broken" as a result), key attached to the seal is still in the node.
This + is checked by first comparing this key with delimiting keys of node and, if + key is ok, doing intra-node lookup. + + Znode version is maintained in the following way: + + there is reiser4_tree.znode_epoch counter. Whenever new znode is created, + znode_epoch is incremented and its new value is stored in ->version field + of new znode. Whenever znode is dirtied (which means it was probably + modified), znode_epoch is also incremented and its new value is stored in + znode->version. This is done because just incrementing znode->version + on each update is not enough: it may happen that a znode gets deleted, a new + znode is allocated for the same disk block and gets the same version + counter, tricking seal code into a false positive. +*/ + +#include "forward.h" +#include "debug.h" +#include "key.h" +#include "coord.h" +#include "seal.h" +#include "plugin/item/item.h" +#include "plugin/node/node.h" +#include "jnode.h" +#include "znode.h" +#include "super.h" + +static znode *seal_node(const seal_t * seal); +static int seal_matches(const seal_t * seal, znode * node); + +/* initialise seal. This can be called several times on the same seal. @coord + and @key can be NULL.
*/ +reiser4_internal void +seal_init(seal_t * seal /* seal to initialise */ , + const coord_t * coord /* coord @seal will be attached to */ , + const reiser4_key * key UNUSED_ARG /* key @seal will be + * attached to */ ) +{ + assert("nikita-1886", seal != NULL); + memset(seal, 0, sizeof *seal); + if (coord != NULL) { + znode *node; + + node = coord->node; + assert("nikita-1987", node != NULL); + spin_lock_znode(node); + seal->version = node->version; + assert("nikita-1988", seal->version != 0); + seal->block = *znode_get_block(node); +#if REISER4_DEBUG + seal->coord1 = *coord; + if (key != NULL) + seal->key = *key; +#ifdef CONFIG_FRAME_POINTER + seal->bt[0] = __builtin_return_address(0); + seal->bt[1] = __builtin_return_address(1); + seal->bt[2] = __builtin_return_address(2); + seal->bt[3] = __builtin_return_address(3); + seal->bt[4] = __builtin_return_address(4); +#endif +#endif + spin_unlock_znode(node); + } +} + +/* finish with seal */ +reiser4_internal void +seal_done(seal_t * seal /* seal to clear */) +{ + assert("nikita-1887", seal != NULL); + seal->version = 0; +} + +/* true if seal was initialised */ +reiser4_internal int +seal_is_set(const seal_t * seal /* seal to query */ ) +{ + assert("nikita-1890", seal != NULL); + return seal->version != 0; +} + +#if REISER4_DEBUG +/* helper function for seal_validate(). It checks that item at @coord has + * expected key. This is to detect cases where node was modified but wasn't + * marked dirty. 
*/ +static inline int +check_seal_match(const coord_t * coord /* coord to check */, + const reiser4_key * k /* expected key */) +{ + reiser4_key ukey; + + return (coord->between != AT_UNIT) || + /* FIXME-VS: we can only compare keys for items whose units + represent exactly one key */ + ((coord_is_existing_unit(coord)) && (item_is_extent(coord) || keyeq(k, unit_key_by_coord(coord, &ukey)))) || + ((coord_is_existing_unit(coord)) && (item_is_ctail(coord)) && keyge(k, unit_key_by_coord(coord, &ukey))); +} +#endif + + +/* this is used by seal_validate. It accepts return value of + * longterm_lock_znode and returns 1 if it can be interpreted as seal + * validation failure. For instance, when longterm_lock_znode returns -EINVAL, + * seal_validate returns -E_REPEAT and caller will call tree search. We cannot + * do this in longterm_lock_znode(), because sometimes we want to distinguish + * between -EINVAL and -E_REPEAT. */ +static int +should_repeat(int return_code) +{ + return return_code == -EINVAL; +} + +/* (re-)validate seal. + + Checks whether seal is pristine, and tries to revalidate it if possible. + + If seal was burned, or broken irreparably, return -E_REPEAT. + + NOTE-NIKITA currently seal_validate() returns -E_REPEAT if key we are + looking for is in range of keys covered by the sealed node, but item wasn't + found by node ->lookup() method. Alternative is to return -ENOENT in this + case, but this would complicate callers' logic.
+ +*/ +reiser4_internal int +seal_validate(seal_t * seal /* seal to validate */ , + coord_t * coord /* coord to validate against */ , + const reiser4_key * key /* key to validate against */ , + lock_handle * lh /* resulting lock handle */ , + znode_lock_mode mode /* lock node */ , + znode_lock_request request /* locking priority */ ) +{ + znode *node; + int result; + + assert("nikita-1889", seal != NULL); + assert("nikita-1881", seal_is_set(seal)); + assert("nikita-1882", key != NULL); + assert("nikita-1883", coord != NULL); + assert("nikita-1884", lh != NULL); + assert("nikita-1885", keyeq(&seal->key, key)); + assert("nikita-1989", coords_equal(&seal->coord1, coord)); + + /* obtain znode by block number */ + node = seal_node(seal); + if (node != NULL) { + /* znode was in cache, lock it */ + result = longterm_lock_znode(lh, node, mode, request); + zput(node); + if (result == 0) { + if (seal_matches(seal, node)) { + /* if seal version and znode version + coincide */ + ON_DEBUG(coord_update_v(coord)); + assert("nikita-1990", node == seal->coord1.node); + assert("nikita-1898", WITH_DATA_RET(coord->node, 1, check_seal_match(coord, key))); + } else + result = RETERR(-E_REPEAT); + } + if (result != 0) { + if (should_repeat(result)) + result = RETERR(-E_REPEAT); + /* unlock node on failure */ + done_lh(lh); + } + } else { + /* znode wasn't in cache */ + result = RETERR(-E_REPEAT); + } + return result; +} + +/* helpers functions */ + +/* obtain reference to znode seal points to, if in cache */ +static znode * +seal_node(const seal_t * seal /* seal to query */ ) +{ + assert("nikita-1891", seal != NULL); + return zlook(current_tree, &seal->block); +} + +/* true if @seal version and @node version coincide */ +static int +seal_matches(const seal_t * seal /* seal to check */ , + znode * node /* node to check */ ) +{ + assert("nikita-1991", seal != NULL); + assert("nikita-1993", node != NULL); + + return UNDER_SPIN(jnode, ZJNODE(node), (seal->version == node->version)); +} + 
+#if REISER4_DEBUG_OUTPUT +/* debugging function: print human readable form of @seal. */ +reiser4_internal void +print_seal(const char *prefix, const seal_t * seal) +{ + if (seal == NULL) { + printk("%s: null seal\n", prefix); + } else { + printk("%s: version: %llu, block: %llu\n", prefix, seal->version, seal->block); +#if REISER4_DEBUG + print_key("seal key", &seal->key); + print_coord("seal coord", &seal->coord1, 0); +#endif + } +} +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/seal.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/seal.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,51 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Declaration of seals: "weak" tree pointers. See seal.c for comments. */ + +#ifndef __SEAL_H__ +#define __SEAL_H__ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "key.h" +#include "coord.h" + +/* for __u?? types */ +/*#include */ + +/* seal. See comment at the top of seal.c */ +typedef struct seal_s { + /* version of znode recorded at the time of seal creation */ + __u64 version; + /* block number of znode attached to this seal */ + reiser4_block_nr block; +#if REISER4_DEBUG + /* coord this seal is attached to. For debugging. */ + coord_t coord1; + /* key this seal is attached to. For debugging. */ + reiser4_key key; + void *bt[5]; +#endif +} seal_t; + +extern void seal_init(seal_t *, const coord_t *, const reiser4_key *); +extern void seal_done(seal_t *); +extern int seal_is_set(const seal_t *); +extern int seal_validate(seal_t *, coord_t *, + const reiser4_key *, lock_handle *, + znode_lock_mode mode, znode_lock_request request); + + +/* __SEAL_H__ */ +#endif + +/* Make Linus happy.
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/search.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/search.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1633 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "key.h" +#include "coord.h" +#include "seal.h" +#include "plugin/item/item.h" +#include "plugin/node/node.h" +#include "plugin/plugin.h" +#include "jnode.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree_walk.h" +#include "tree.h" +#include "reiser4.h" +#include "super.h" +#include "inode.h" + +#include + +static const char * bias_name(lookup_bias bias); + +/* tree searching algorithm, intranode searching algorithms are in + plugin/node/ */ + +/* tree lookup cache + * + * The coord by key cache consists of small list of recently accessed nodes + * maintained according to the LRU discipline. Before doing real top-to-down + * tree traversal this cache is scanned for nodes that can contain key + * requested. + * + * The efficiency of coord cache depends heavily on locality of reference for + * tree accesses. Our user level simulations show reasonably good hit ratios + * for coord cache under most loads so far. 
+ */ + +/* Initialise coord cache slot */ +static void +cbk_cache_init_slot(cbk_cache_slot * slot) +{ + assert("nikita-345", slot != NULL); + + cbk_cache_list_clean(slot); + slot->node = NULL; +} + +/* Initialise coord cache */ +reiser4_internal int +cbk_cache_init(cbk_cache * cache /* cache to init */ ) +{ + int i; + + assert("nikita-346", cache != NULL); + + cache->slot = kmalloc(sizeof (cbk_cache_slot) * cache->nr_slots, GFP_KERNEL); + if (cache->slot == NULL) + return RETERR(-ENOMEM); + + cbk_cache_list_init(&cache->lru); + for (i = 0; i < cache->nr_slots; ++i) { + cbk_cache_init_slot(cache->slot + i); + cbk_cache_list_push_back(&cache->lru, cache->slot + i); + } + rw_cbk_cache_init(cache); + return 0; +} + +/* free cbk cache data */ +reiser4_internal void +cbk_cache_done(cbk_cache * cache /* cache to release */ ) +{ + assert("nikita-2493", cache != NULL); + if (cache->slot != NULL) { + kfree(cache->slot); + cache->slot = NULL; + } +} + +/* macro to iterate over all cbk cache slots */ +#define for_all_slots( cache, slot ) \ + for( ( slot ) = cbk_cache_list_front( &( cache ) -> lru ) ; \ + !cbk_cache_list_end( &( cache ) -> lru, ( slot ) ) ; \ + ( slot ) = cbk_cache_list_next( slot ) ) + +#if REISER4_DEBUG +/* this function assures that [cbk-cache-invariant] invariant holds */ +static int +cbk_cache_invariant(const cbk_cache * cache) +{ + cbk_cache_slot *slot; + int result; + int unused; + + if (cache->nr_slots == 0) + return 1; + + assert("nikita-2469", cache != NULL); + unused = 0; + result = 1; + read_lock_cbk_cache((cbk_cache *) cache); + for_all_slots(cache, slot) { + /* in LRU first go all `used' slots followed by `unused' */ + if (unused && (slot->node != NULL)) + result = 0; + if (slot->node == NULL) + unused = 1; + else { + cbk_cache_slot *scan; + + /* all cached nodes are different */ + scan = slot; + while (result) { + scan = cbk_cache_list_next(scan); + if (cbk_cache_list_end(&cache->lru, scan)) + break; + if (slot->node == scan->node) + result = 0; 
+ } + } + if (!result) + break; + } + read_unlock_cbk_cache((cbk_cache *) cache); + return result; +} + +#endif + +/* Remove references, if any, to @node from coord cache */ +reiser4_internal void +cbk_cache_invalidate(const znode * node /* node to remove from cache */ , + reiser4_tree * tree /* tree to remove node from */ ) +{ + cbk_cache_slot *slot; + cbk_cache *cache; + int i; + + assert("nikita-350", node != NULL); + assert("nikita-1479", LOCK_CNT_GTZ(rw_locked_tree)); + + cache = &tree->cbk_cache; + assert("nikita-2470", cbk_cache_invariant(cache)); + + write_lock_cbk_cache(cache); + for (i = 0, slot = cache->slot; i < cache->nr_slots; ++ i, ++ slot) { + if (slot->node == node) { + cbk_cache_list_remove(slot); + cbk_cache_list_push_back(&cache->lru, slot); + slot->node = NULL; + break; + } + } + write_unlock_cbk_cache(cache); + assert("nikita-2471", cbk_cache_invariant(cache)); +} + +/* add to the cbk-cache in the "tree" information about "node". This + can actually be update of existing slot in a cache. 
*/ +static void +cbk_cache_add(const znode * node /* node to add to the cache */ ) +{ + cbk_cache *cache; + cbk_cache_slot *slot; + int i; + + assert("nikita-352", node != NULL); + + cache = &znode_get_tree(node)->cbk_cache; + assert("nikita-2472", cbk_cache_invariant(cache)); + + if (cache->nr_slots == 0) + return; + + write_lock_cbk_cache(cache); + /* find slot to update/add */ + for (i = 0, slot = cache->slot; i < cache->nr_slots; ++ i, ++ slot) { + /* oops, this node is already in a cache */ + if (slot->node == node) + break; + } + /* if all slots are used, reuse least recently used one */ + if (i == cache->nr_slots) { + slot = cbk_cache_list_back(&cache->lru); + slot->node = (znode *) node; + } + cbk_cache_list_remove(slot); + cbk_cache_list_push_front(&cache->lru, slot); + write_unlock_cbk_cache(cache); + assert("nikita-2473", cbk_cache_invariant(cache)); +} + +static int setup_delimiting_keys(cbk_handle * h); +static lookup_result coord_by_handle(cbk_handle * handle); +static lookup_result traverse_tree(cbk_handle * h); +static int cbk_cache_search(cbk_handle * h); + +static level_lookup_result cbk_level_lookup(cbk_handle * h); +static level_lookup_result cbk_node_lookup(cbk_handle * h); + +/* helper functions */ + +static void update_stale_dk(reiser4_tree *tree, znode *node); + +/* release parent node during traversal */ +static void put_parent(cbk_handle * h); +/* check consistency of fields */ +static int sanity_check(cbk_handle * h); +/* release resources in handle */ +static void hput(cbk_handle * h); + +static level_lookup_result search_to_left(cbk_handle * h); + +/* pack numerous (numberous I should say) arguments of coord_by_key() into + * cbk_handle */ +static cbk_handle * +cbk_pack(cbk_handle *handle, + reiser4_tree * tree, + const reiser4_key * key, + coord_t * coord, + lock_handle * active_lh, + lock_handle * parent_lh, + znode_lock_mode lock_mode, + lookup_bias bias, + tree_level lock_level, + tree_level stop_level, + __u32 flags, + ra_info_t 
*info) +{ + memset(handle, 0, sizeof *handle); + + handle->tree = tree; + handle->key = key; + handle->lock_mode = lock_mode; + handle->bias = bias; + handle->lock_level = lock_level; + handle->stop_level = stop_level; + handle->coord = coord; + /* set flags. See comment in tree.h:cbk_flags */ + handle->flags = flags | CBK_TRUST_DK | CBK_USE_CRABLOCK; + + handle->active_lh = active_lh; + handle->parent_lh = parent_lh; + handle->ra_info = info; + return handle; +} + +/* main tree lookup procedure + + Check coord cache. If key we are looking for is not found there, call cbk() + to do real tree traversal. + + As we have extents on the twig level, @lock_level and @stop_level can + be different from LEAF_LEVEL and each other. + + Thread cannot keep any reiser4 locks (tree, znode, dk spin-locks, or znode + long term locks) while calling this. +*/ +reiser4_internal lookup_result +coord_by_key(reiser4_tree * tree /* tree to perform search + * in. Usually this tree is + * part of file-system + * super-block */ , + const reiser4_key * key /* key to look for */ , + coord_t * coord /* where to store found + * position in a tree. Fields + * in "coord" are only valid if + * coord_by_key() returned + * "CBK_COORD_FOUND" */ , + lock_handle * lh, /* resulting lock handle */ + znode_lock_mode lock_mode /* type of lookup we + * want on node. Pass + * ZNODE_READ_LOCK here + * if you only want to + * read item found and + * ZNODE_WRITE_LOCK if + * you want to modify + * it */ , + lookup_bias bias /* what to return if coord + * with exactly the @key is + * not in the tree */ , + tree_level lock_level /* tree level where to start + * taking @lock type of + * locks */ , + tree_level stop_level /* tree level to stop. 
Pass + * LEAF_LEVEL or TWIG_LEVEL + * here Item being looked + * for has to be between + * @lock_level and + * @stop_level, inclusive */ , + __u32 flags /* search flags */, + ra_info_t *info /* information about desired tree traversal readahead */) +{ + cbk_handle handle; + lock_handle parent_lh; + lookup_result result; + + init_lh(lh); + init_lh(&parent_lh); + + assert("nikita-3023", schedulable()); + + assert("nikita-353", tree != NULL); + assert("nikita-354", key != NULL); + assert("nikita-355", coord != NULL); + assert("nikita-356", (bias == FIND_EXACT) || (bias == FIND_MAX_NOT_MORE_THAN)); + assert("nikita-357", stop_level >= LEAF_LEVEL); + /* no locks can be held during tree traversal */ + assert("nikita-2104", lock_stack_isclean(get_current_lock_stack())); + + cbk_pack(&handle, + tree, + key, + coord, + lh, + &parent_lh, + lock_mode, + bias, + lock_level, + stop_level, + flags, + info); + + result = coord_by_handle(&handle); + assert("nikita-3247", ergo(!IS_CBKERR(result), coord->node == lh->node)); + return result; +} + +/* like coord_by_key(), but starts traversal from vroot of @object rather than + * from tree root. */ +reiser4_internal lookup_result +object_lookup(struct inode *object, + const reiser4_key * key, + coord_t * coord, + lock_handle * lh, + znode_lock_mode lock_mode, + lookup_bias bias, + tree_level lock_level, + tree_level stop_level, + __u32 flags, + ra_info_t *info) +{ + cbk_handle handle; + lock_handle parent_lh; + lookup_result result; + + init_lh(lh); + init_lh(&parent_lh); + + assert("nikita-3023", schedulable()); + + assert("nikita-354", key != NULL); + assert("nikita-355", coord != NULL); + assert("nikita-356", (bias == FIND_EXACT) || (bias == FIND_MAX_NOT_MORE_THAN)); + assert("nikita-357", stop_level >= LEAF_LEVEL); + /* no locks can be held during tree search by key */ + assert("nikita-2104", lock_stack_isclean(get_current_lock_stack())); + + cbk_pack(&handle, + object != NULL ? 
tree_by_inode(object) : current_tree, + key, + coord, + lh, + &parent_lh, + lock_mode, + bias, + lock_level, + stop_level, + flags, + info); + handle.object = object; + + result = coord_by_handle(&handle); + assert("nikita-3247", ergo(!IS_CBKERR(result), coord->node == lh->node)); + return result; +} + +/* lookup by cbk_handle. Common part of coord_by_key() and object_lookup(). */ +static lookup_result +coord_by_handle(cbk_handle * handle) +{ + /* + * first check cbk_cache (which is look-aside cache for our tree) and + * if this fails, start traversal. + */ + /* first check whether "key" is in cache of recent lookups. */ + if (cbk_cache_search(handle) == 0) + return handle->result; + else + return traverse_tree(handle); +} + +/* Execute actor for each item (or unit, depending on @through_units_p), + starting from @coord, right-ward, until either: + + - end of the tree is reached + - unformatted node is met + - error occurred + - @actor returns 0 or less + + Error code, or last actor return value is returned. + + This is used by plugin/dir/hashed_dir.c:find_entry() to move through + sequence of entries with identical keys and the like.
+*/ +reiser4_internal int +iterate_tree(reiser4_tree * tree /* tree to scan */ , + coord_t * coord /* coord to start from */ , + lock_handle * lh /* lock handle to start with and to + * update along the way */ , + tree_iterate_actor_t actor /* function to call on each + * item/unit */ , + void *arg /* argument to pass to @actor */ , + znode_lock_mode mode /* lock mode on scanned nodes */ , + int through_units_p /* call @actor on each item or on each + * unit */ ) +{ + int result; + + assert("nikita-1143", tree != NULL); + assert("nikita-1145", coord != NULL); + assert("nikita-1146", lh != NULL); + assert("nikita-1147", actor != NULL); + + result = zload(coord->node); + coord_clear_iplug(coord); + if (result != 0) + return result; + if (!coord_is_existing_unit(coord)) { + zrelse(coord->node); + return -ENOENT; + } + while ((result = actor(tree, coord, lh, arg)) > 0) { + /* move further */ + if ((through_units_p && coord_next_unit(coord)) || + (!through_units_p && coord_next_item(coord))) { + do { + lock_handle couple; + + /* move to the next node */ + init_lh(&couple); + result = reiser4_get_right_neighbor( + &couple, coord->node, (int) mode, GN_CAN_USE_UPPER_LEVELS); + zrelse(coord->node); + if (result == 0) { + + result = zload(couple.node); + if (result != 0) { + done_lh(&couple); + return result; + } + + coord_init_first_unit(coord, couple.node); + done_lh(lh); + move_lh(lh, &couple); + } else + return result; + } while (node_is_empty(coord->node)); + } + + assert("nikita-1149", coord_is_existing_unit(coord)); + } + zrelse(coord->node); + return result; +} + +/* return locked uber znode for @tree */ +reiser4_internal int get_uber_znode(reiser4_tree * tree, znode_lock_mode mode, + znode_lock_request pri, lock_handle *lh) +{ + int result; + + result = longterm_lock_znode(lh, tree->uber, mode, pri); + return result; +} + +/* true if @key is strictly within @node + + we are looking for possibly non-unique key and it is item is at the edge of + @node. 
Maybe it is in the neighbor. +*/ +static int +znode_contains_key_strict(znode * node /* node to check key + * against */ , + const reiser4_key * key /* key to check */, + int isunique) +{ + int answer; + + assert("nikita-1760", node != NULL); + assert("nikita-1722", key != NULL); + + if (keyge(key, &node->rd_key)) + return 0; + + answer = keycmp(&node->ld_key, key); + + if (isunique) + return answer != GREATER_THAN; + else + return answer == LESS_THAN; +} + +/* + * Virtual Root (vroot) code. + * + * For given file system object (e.g., regular file or directory) let's + * define its "virtual root" as lowest in the tree (that is, farthest + * from the tree root) node such that all body items of said object are + * located in a tree rooted at this node. + * + * Once vroot of object is found all tree lookups for items within body of + * this object ("object lookups") can be started from its vroot rather + * than from real root. This has following advantages: + * + * 1. number of nodes traversed during lookup (and, hence, number of + * key comparisons made) decreases, and + * + * 2. contention on tree root is decreased. This latter was actually + * motivating reason behind vroot, because spin lock of root node, + * which is taken when acquiring long-term lock on root node is the + * hottest lock in the reiser4. + * + * How to find vroot. + * + * When vroot of object F is not yet determined, all object lookups start + * from the root of the tree. At each tree level during traversal we have + * a node N such that a key we are looking for (which is the key inside + * object's body) is located within N. In function handle_vroot() called + * from cbk_level_lookup() we check whether N is possible vroot for + * F. Check is trivial---if neither leftmost nor rightmost item of N + * belongs to F (and we already have helpful ->owns_item() method of + * object plugin for this), then N is possible vroot of F.
This, of + * course, relies on the assumption that each object occupies contiguous + * range of keys in the tree. + * + * Thus, traversing tree downward and checking each node as we go, we can + * find lowest such node, which, by definition, is vroot. + * + * How to track vroot. + * + * Nohow. If actual vroot changes, next object lookup will just restart + * from the actual tree root, refreshing object's vroot along the way. + * + */ + +/* + * Check whether @node is possible vroot of @object. + */ +static void +handle_vroot(struct inode *object, znode *node) +{ + file_plugin *fplug; + coord_t coord; + + fplug = inode_file_plugin(object); + assert("nikita-3353", fplug != NULL); + assert("nikita-3354", fplug->owns_item != NULL); + + if (unlikely(node_is_empty(node))) + return; + + coord_init_first_unit(&coord, node); + /* + * if leftmost item of @node belongs to @object, we cannot be sure + * that @node is vroot of @object, because, some items of @object are + * probably in the sub-tree rooted at the left neighbor of @node. + */ + if (fplug->owns_item(object, &coord)) + return; + coord_init_last_unit(&coord, node); + /* mutatis mutandis for the rightmost item */ + if (fplug->owns_item(object, &coord)) + return; + /* otherwise, @node is possible vroot of @object */ + inode_set_vroot(object, node); +} + +/* + * helper function used by traverse tree to start tree traversal not from the + * tree root, but from @h->object's vroot, if possible. + */ +static int +prepare_object_lookup(cbk_handle * h) +{ + znode *vroot; + int result; + + vroot = inode_get_vroot(h->object); + if (vroot == NULL) { + /* + * object doesn't have known vroot, start from real tree root. 
+ */ + return LOOKUP_CONT; + } + + h->level = znode_get_level(vroot); + /* take a long-term lock on vroot */ + h->result = longterm_lock_znode(h->active_lh, vroot, + cbk_lock_mode(h->level, h), + ZNODE_LOCK_LOPRI); + result = LOOKUP_REST; + if (h->result == 0) { + int isunique; + int inside; + + isunique = h->flags & CBK_UNIQUE; + /* check that key is inside vroot */ + inside = + UNDER_RW(dk, h->tree, read, + znode_contains_key_strict(vroot, + h->key, + isunique)) && + !ZF_ISSET(vroot, JNODE_HEARD_BANSHEE); + if (inside) { + h->result = zload(vroot); + if (h->result == 0) { + /* search for key in vroot. */ + result = cbk_node_lookup(h); + zrelse(vroot);/*h->active_lh->node);*/ + if (h->active_lh->node != vroot) { + result = LOOKUP_REST; + } else if (result == LOOKUP_CONT) { + move_lh(h->parent_lh, h->active_lh); + h->flags &= ~CBK_DKSET; + } + } + } + } else + /* long-term locking failed. Restart. */ + ; + + zput(vroot); + + if (IS_CBKERR(h->result) || result == LOOKUP_REST) + hput(h); + return result; +} + +/* main function that handles common parts of tree traversal: starting + (fake znode handling), restarts, error handling, completion */ +static lookup_result +traverse_tree(cbk_handle * h /* search handle */ ) +{ + int done; + int iterations; + int vroot_used; + + assert("nikita-365", h != NULL); + assert("nikita-366", h->tree != NULL); + assert("nikita-367", h->key != NULL); + assert("nikita-368", h->coord != NULL); + assert("nikita-369", (h->bias == FIND_EXACT) || (h->bias == FIND_MAX_NOT_MORE_THAN)); + assert("nikita-370", h->stop_level >= LEAF_LEVEL); + assert("nikita-2949", !(h->flags & CBK_DKSET)); + assert("zam-355", lock_stack_isclean(get_current_lock_stack())); + + done = 0; + iterations = 0; + vroot_used = 0; + + /* loop for restarts */ +restart: + + assert("nikita-3024", schedulable()); + + h->result = CBK_COORD_FOUND; + /* connect_znode() needs it */ + h->ld_key = *min_key(); + h->rd_key = *max_key(); + h->flags |= CBK_DKSET; + h->error = NULL; + + 
if (!vroot_used && h->object != NULL) { + vroot_used = 1; + done = prepare_object_lookup(h); + if (done == LOOKUP_REST) { + goto restart; + } else if (done == LOOKUP_DONE) + return h->result; + } + if (h->parent_lh->node == NULL) { + done = get_uber_znode(h->tree, ZNODE_READ_LOCK, ZNODE_LOCK_LOPRI, + h->parent_lh); + + assert("nikita-1637", done != -E_DEADLOCK); + + h->block = h->tree->root_block; + h->level = h->tree->height; + h->coord->node = h->parent_lh->node; + + if (done != 0) + return done; + } + + /* loop descending a tree */ + while (!done) { + + if (unlikely((iterations > REISER4_CBK_ITERATIONS_LIMIT) && + IS_POW(iterations))) { + warning("nikita-1481", "Too many iterations: %i", iterations); + print_key("key", h->key); + ++iterations; + } else if (unlikely(iterations > REISER4_MAX_CBK_ITERATIONS)) { + h->error = + "reiser-2018: Too many iterations. Tree corrupted, or (less likely) starvation occurring."; + h->result = RETERR(-EIO); + break; + } + switch (cbk_level_lookup(h)) { + case LOOKUP_CONT: + move_lh(h->parent_lh, h->active_lh); + continue; + default: + wrong_return_value("nikita-372", "cbk_level"); + case LOOKUP_DONE: + done = 1; + break; + case LOOKUP_REST: + hput(h); + /* deadlock avoidance is normal case. */ + if (h->result != -E_DEADLOCK) + ++iterations; + preempt_point(); + goto restart; + } + } + /* that's all. The rest is error handling */ + if (unlikely(h->error != NULL)) { + warning("nikita-373", "%s: level: %i, " + "lock_level: %i, stop_level: %i " + "lock_mode: %s, bias: %s", + h->error, h->level, h->lock_level, h->stop_level, + lock_mode_name(h->lock_mode), bias_name(h->bias)); + print_address("block", &h->block); + print_key("key", h->key); + print_coord_content("coord", h->coord); + print_znode("active", h->active_lh->node); + print_znode("parent", h->parent_lh->node); + } + /* `unlikely' error case */ + if (unlikely(IS_CBKERR(h->result))) { + /* failure. 
do cleanup */ + hput(h); + } else { + assert("nikita-1605", WITH_DATA_RET + (h->coord->node, 1, + ergo((h->result == CBK_COORD_FOUND) && + (h->bias == FIND_EXACT) && + (!node_is_empty(h->coord->node)), coord_is_existing_item(h->coord)))); + } + return h->result; +} + +/* find delimiting keys of child + + Determine left and right delimiting keys for child pointed to by + @parent_coord. + +*/ +static void +find_child_delimiting_keys(znode * parent /* parent znode, passed + * locked */ , + const coord_t * parent_coord /* coord where + * pointer to + * child is + * stored */ , + reiser4_key * ld /* where to store left + * delimiting key */ , + reiser4_key * rd /* where to store right + * delimiting key */ ) +{ + coord_t neighbor; + + assert("nikita-1484", parent != NULL); + assert("nikita-1485", rw_dk_is_locked(znode_get_tree(parent))); + + coord_dup(&neighbor, parent_coord); + + if (neighbor.between == AT_UNIT) + /* imitate item ->lookup() behavior. */ + neighbor.between = AFTER_UNIT; + + if (coord_set_to_left(&neighbor) == 0) + unit_key_by_coord(&neighbor, ld); + else { + assert("nikita-14851", 0); + *ld = *znode_get_ld_key(parent); + } + + coord_dup(&neighbor, parent_coord); + if (neighbor.between == AT_UNIT) + neighbor.between = AFTER_UNIT; + if (coord_set_to_right(&neighbor) == 0) + unit_key_by_coord(&neighbor, rd); + else + *rd = *znode_get_rd_key(parent); +} + +/* + * setup delimiting keys for a child + * + * @parent parent node + * + * @coord location in @parent where pointer to @child is + * + * @child child node + */ +reiser4_internal int +set_child_delimiting_keys(znode * parent, + const coord_t * coord, znode * child) +{ + reiser4_tree *tree; + + assert("nikita-2952", + znode_get_level(parent) == znode_get_level(coord->node)); + + /* fast check without taking dk lock. This is safe, because + * JNODE_DKSET is never cleared once set. 
*/ + if (!ZF_ISSET(child, JNODE_DKSET)) { + tree = znode_get_tree(parent); + WLOCK_DK(tree); + if (likely(!ZF_ISSET(child, JNODE_DKSET))) { + find_child_delimiting_keys(parent, coord, + &child->ld_key, + &child->rd_key); + ON_DEBUG( + child->ld_key_version = atomic_inc_return(&delim_key_version); + child->rd_key_version = atomic_inc_return(&delim_key_version); + ); + ZF_SET(child, JNODE_DKSET); + } + WUNLOCK_DK(tree); + return 1; + } + return 0; +} + +/* Perform tree lookup at one level. This is called from traverse_tree() + function that drives lookup through tree and calls cbk_node_lookup() to + perform lookup within one node. + + See comments in the code. +*/ +static level_lookup_result +cbk_level_lookup(cbk_handle * h /* search handle */ ) +{ + int ret; + int setdk; + int ldkeyset = 0; + reiser4_key ldkey; + reiser4_key key; + znode *active; + + assert("nikita-3025", schedulable()); + + /* acquire reference to @active node */ + active = zget(h->tree, &h->block, h->parent_lh->node, h->level, GFP_KERNEL); + + if (IS_ERR(active)) { + h->result = PTR_ERR(active); + return LOOKUP_DONE; + } + + /* lock @active */ + h->result = longterm_lock_znode(h->active_lh, + active, + cbk_lock_mode(h->level, h), + ZNODE_LOCK_LOPRI); + /* longterm_lock_znode() acquires additional reference to znode (which + will be later released by longterm_unlock_znode()). Release + reference acquired by zget(). + */ + zput(active); + if (unlikely(h->result != 0)) + goto fail_or_restart; + + setdk = 0; + /* if @active is accessed for the first time, set up delimiting keys on + it. Delimiting keys are taken from the parent node. See + setup_delimiting_keys() for details.
+ */ + if (h->flags & CBK_DKSET) { + setdk = setup_delimiting_keys(h); + h->flags &= ~CBK_DKSET; + } else { + znode *parent; + + parent = h->parent_lh->node; + h->result = zload(parent); + if (unlikely(h->result != 0)) + goto fail_or_restart; + + if (!ZF_ISSET(active, JNODE_DKSET)) + setdk = set_child_delimiting_keys(parent, + h->coord, active); + else { + UNDER_RW_VOID(dk, h->tree, read, + find_child_delimiting_keys(parent, + h->coord, + &ldkey, &key)); + ldkeyset = 1; + } + zrelse(parent); + } + + /* this is ugly kludge. Reminder: this is necessary, because + ->lookup() method returns coord with ->between field probably set + to something different from AT_UNIT. + */ + h->coord->between = AT_UNIT; + + if (znode_just_created(active) && (h->coord->node != NULL)) { + WLOCK_TREE(h->tree); + /* if we are going to load znode right now, setup + ->in_parent: coord where pointer to this node is stored in + parent. + */ + coord_to_parent_coord(h->coord, &active->in_parent); + WUNLOCK_TREE(h->tree); + } + + /* check connectedness without holding tree lock---false negatives + * will be re-checked by connect_znode(), and false positives are + * impossible---@active cannot suddenly turn into unconnected + * state. */ + if (!znode_is_connected(active)) { + h->result = connect_znode(h->coord, active); + if (unlikely(h->result != 0)) { + put_parent(h); + goto fail_or_restart; + } + } + + jload_prefetch(ZJNODE(active)); + + if (setdk) + update_stale_dk(h->tree, active); + + /* put_parent() cannot be called earlier, because connect_znode() + assumes parent node is referenced; */ + put_parent(h); + + if ((!znode_contains_key_lock(active, h->key) && + (h->flags & CBK_TRUST_DK)) || ZF_ISSET(active, JNODE_HEARD_BANSHEE)) { + /* 1. key was moved out of this node while this thread was + waiting for the lock. Restart. More elaborate solution is + to determine where key moved (to the left, or to the right) + and try to follow it through sibling pointers. + + 2. 
or, node itself is going to be removed from the + tree. Release lock and restart. + */ + h->result = -E_REPEAT; + } + if (h->result == -E_REPEAT) + return LOOKUP_REST; + + h->result = zload_ra(active, h->ra_info); + if (h->result) { + return LOOKUP_DONE; + } + + /* sanity checks */ + if (sanity_check(h)) { + zrelse(active); + return LOOKUP_DONE; + } + + /* check that key of leftmost item in the @active is the same as in + * its parent */ + if (ldkeyset && !node_is_empty(active) && + !keyeq(leftmost_key_in_node(active, &key), &ldkey)) { + warning("vs-3533", "Keys are inconsistent. Fsck?"); + print_key("inparent", &ldkey); + print_key("inchild", &key); + h->result = RETERR(-EIO); + zrelse(active); + return LOOKUP_DONE; + } + + if (h->object != NULL) + handle_vroot(h->object, active); + + ret = cbk_node_lookup(h); + + /* reget @active from handle, because it can change in + cbk_node_lookup() */ + /*active = h->active_lh->node;*/ + zrelse(active); + + return ret; + +fail_or_restart: + if (h->result == -E_DEADLOCK) + return LOOKUP_REST; + return LOOKUP_DONE; +} + +#if REISER4_DEBUG +/* check left and right delimiting keys of a znode */ +void +check_dkeys(znode *node) +{ + znode *left; + znode *right; + + RLOCK_TREE(current_tree); + RLOCK_DK(current_tree); + + assert("vs-1710", znode_is_any_locked(node)); + assert("vs-1197", !keygt(znode_get_ld_key(node), znode_get_rd_key(node))); + + left = node->left; + right = node->right; + + if (ZF_ISSET(node, JNODE_LEFT_CONNECTED) && ZF_ISSET(node, JNODE_DKSET) && + left != NULL && ZF_ISSET(left, JNODE_DKSET)) + /* check left neighbor. Note that left neighbor is not locked, + so it might get wrong delimiting keys therefore */ + assert("vs-1198", (keyeq(znode_get_rd_key(left), znode_get_ld_key(node)) || + ZF_ISSET(left, JNODE_HEARD_BANSHEE))); + + if (ZF_ISSET(node, JNODE_RIGHT_CONNECTED) && ZF_ISSET(node, JNODE_DKSET) && + right != NULL && ZF_ISSET(right, JNODE_DKSET)) + /* check right neighbor. 
Note that right neighbor is not + locked, so it might get wrong delimiting keys therefore */ + assert("vs-1199", (keyeq(znode_get_rd_key(node), znode_get_ld_key(right)) || + ZF_ISSET(right, JNODE_HEARD_BANSHEE))); + + RUNLOCK_DK(current_tree); + RUNLOCK_TREE(current_tree); +} +#endif + +/* true if @key is left delimiting key of @node */ +static int key_is_ld(znode * node, const reiser4_key * key) +{ + int ld; + + assert("nikita-1716", node != NULL); + assert("nikita-1758", key != NULL); + + RLOCK_DK(znode_get_tree(node)); + assert("nikita-1759", znode_contains_key(node, key)); + ld = keyeq(znode_get_ld_key(node), key); + RUNLOCK_DK(znode_get_tree(node)); + return ld; +} + +/* Process one node during tree traversal. + + This is called by cbk_level_lookup(). */ +static level_lookup_result +cbk_node_lookup(cbk_handle * h /* search handle */ ) +{ + /* node plugin of @active */ + node_plugin *nplug; + /* item plugin of item that was found */ + item_plugin *iplug; + /* search bias */ + lookup_bias node_bias; + /* node we are operating upon */ + znode *active; + /* tree we are searching in */ + reiser4_tree *tree; + /* result */ + int result; + + assert("nikita-379", h != NULL); + + active = h->active_lh->node; + tree = h->tree; + + nplug = active->nplug; + assert("nikita-380", nplug != NULL); + + ON_DEBUG(check_dkeys(active)); + + /* return item from "active" node with maximal key not greater than + "key" */ + node_bias = h->bias; + result = nplug->lookup(active, h->key, node_bias, h->coord); + if (unlikely(result != NS_FOUND && result != NS_NOT_FOUND)) { + /* error occurred */ + h->result = result; + return LOOKUP_DONE; + } + if (h->level == h->stop_level) { + /* welcome to the stop level */ + assert("nikita-381", h->coord->node == active); + if (result == NS_FOUND) { + /* success of tree lookup */ + if (!(h->flags & CBK_UNIQUE) && key_is_ld(active, h->key)) { + return search_to_left(h); + } else + h->result = CBK_COORD_FOUND; + } else { + h->result = 
CBK_COORD_NOTFOUND; + } + if (!(h->flags & CBK_IN_CACHE)) + cbk_cache_add(active); + return LOOKUP_DONE; + } + + if (h->level > TWIG_LEVEL && result == NS_NOT_FOUND) { + h->error = "not found on internal node"; + h->result = result; + return LOOKUP_DONE; + } + + assert("vs-361", h->level > h->stop_level); + + if (handle_eottl(h, &result)) { + /**/ + assert("vs-1674", result == LOOKUP_DONE || result == LOOKUP_REST); + return result; + } + + assert("nikita-2116", item_is_internal(h->coord)); + iplug = item_plugin_by_coord(h->coord); + + /* go down to next level */ + assert("vs-515", item_is_internal(h->coord)); + iplug->s.internal.down_link(h->coord, h->key, &h->block); + --h->level; + return LOOKUP_CONT; /* continue */ +} + +/* scan cbk_cache slots looking for a match for @h */ +static int +cbk_cache_scan_slots(cbk_handle * h /* cbk handle */ ) +{ + level_lookup_result llr; + znode *node; + reiser4_tree *tree; + cbk_cache_slot *slot; + cbk_cache *cache; + tree_level level; + int isunique; + const reiser4_key *key; + int result; + + assert("nikita-1317", h != NULL); + assert("nikita-1315", h->tree != NULL); + assert("nikita-1316", h->key != NULL); + + tree = h->tree; + cache = &tree->cbk_cache; + if (cache->nr_slots == 0) + /* size of cbk cache was set to 0 by mount time option. */ + return RETERR(-ENOENT); + + assert("nikita-2474", cbk_cache_invariant(cache)); + node = NULL; /* to keep gcc happy */ + level = h->level; + key = h->key; + isunique = h->flags & CBK_UNIQUE; + result = RETERR(-ENOENT); + + /* + * this is time-critical function and dragons had, hence, been settled + * here. + * + * Loop below scans cbk cache slots trying to find matching node with + * suitable range of delimiting keys and located at the h->level. + * + * Scan is done under cbk cache spin lock that protects slot->node + * pointers. If suitable node is found we want to pin it in + * memory. But slot->node can point to the node with x_count 0 + * (unreferenced). 
Such node can be recycled at any moment, or can + * already be in the process of being recycled (within jput()). + * + * As we found node in the cbk cache, it means that jput() hasn't yet + * called cbk_cache_invalidate(). + * + * We acquire reference to the node without holding tree lock, and + * later, check node's RIP bit. This avoids races with jput(). + * + */ + + rcu_read_lock(); + read_lock_cbk_cache(cache); + slot = cbk_cache_list_prev(cbk_cache_list_front(&cache->lru)); + while (1) { + + slot = cbk_cache_list_next(slot); + + if (!cbk_cache_list_end(&cache->lru, slot)) + node = slot->node; + else + node = NULL; + + if (unlikely(node == NULL)) + break; + + /* + * this is (hopefully) the only place in the code where we are + * working with delimiting keys without holding dk lock. This + * is fine here, because this is only "guess" anyway---keys + * are rechecked under dk lock below. + */ + if (znode_get_level(node) == level && + /* min_key < key < max_key */ + znode_contains_key_strict(node, key, isunique)) { + zref(node); + result = 0; + spin_lock_prefetch(&tree->tree_lock.lock); + break; + } + } + read_unlock_cbk_cache(cache); + + assert("nikita-2475", cbk_cache_invariant(cache)); + + if (unlikely(result == 0 && ZF_ISSET(node, JNODE_RIP))) + result = -ENOENT; + + rcu_read_unlock(); + + if (result != 0) { + h->result = CBK_COORD_NOTFOUND; + return RETERR(-ENOENT); + } + + result = longterm_lock_znode(h->active_lh, node, cbk_lock_mode(level, h), ZNODE_LOCK_LOPRI); + zput(node); + if (result != 0) + return result; + result = zload(node); + if (result != 0) + return result; + + /* recheck keys */ + result = + UNDER_RW(dk, tree, read, + znode_contains_key_strict(node, key, isunique)) && + !ZF_ISSET(node, JNODE_HEARD_BANSHEE); + + if (result) { + /* do lookup inside node */ + llr = cbk_node_lookup(h); + /* if cbk_node_lookup() wandered to another node (due to eottl + or non-unique keys), adjust @node */ + /*node = h->active_lh->node;*/ + + if (llr != LOOKUP_DONE) 
{ + /* restart or continue on the next level */ + result = RETERR(-ENOENT); + } else if (IS_CBKERR(h->result)) + /* io or oom */ + result = RETERR(-ENOENT); + else { + /* good. Either item found or definitely not found. */ + result = 0; + + write_lock_cbk_cache(cache); + if (slot->node == h->active_lh->node/*node*/) { + /* if this node is still in cbk cache---move + its slot to the head of the LRU list. */ + cbk_cache_list_remove(slot); + cbk_cache_list_push_front(&cache->lru, slot); + } + write_unlock_cbk_cache(cache); + } + } else { + /* race. While this thread was waiting for the lock, node was + rebalanced and item we are looking for, shifted out of it + (if it ever was here). + + Continuing scanning is almost hopeless: node key range was + moved to, is almost certainly at the beginning of the LRU + list at this time, because it's hot, but restarting + scanning from the very beginning is complex. Just return, + so that cbk() will be performed. This is not that + important, because such races should be rare. Are they? + */ + result = RETERR(-ENOENT); /* -ERAUGHT */ + } + zrelse(node); + assert("nikita-2476", cbk_cache_invariant(cache)); + return result; +} + +/* look for item with given key in the coord cache + + This function, called by coord_by_key(), scans "coord cache" (&cbk_cache) + which is a small LRU list of znodes accessed lately. For each znode in + this list, it checks whether key we are looking for fits into key + range covered by this node. If so, and in addition, node lies at allowed + level (this is to handle extents on a twig level), node is locked, and + lookup inside it is performed. + + we need a measurement of the cost of this cache search compared to the cost + of coord_by_key. + +*/ +static int +cbk_cache_search(cbk_handle * h /* cbk handle */ ) +{ + int result = 0; + tree_level level; + + /* add CBK_IN_CACHE to the handle flags.
This means that + * cbk_node_lookup() assumes that cbk_cache is being scanned and would not add + * found node to the cache. */ + h->flags |= CBK_IN_CACHE; + for (level = h->stop_level; level <= h->lock_level; ++level) { + h->level = level; + result = cbk_cache_scan_slots(h); + if (result != 0) { + done_lh(h->active_lh); + done_lh(h->parent_lh); + } else { + assert("nikita-1319", !IS_CBKERR(h->result)); + break; + } + } + h->flags &= ~CBK_IN_CACHE; + return result; +} + +/* type of lock we want to obtain during tree traversal. On stop level + we want type of lock user asked for, on upper levels: read lock. */ +reiser4_internal znode_lock_mode cbk_lock_mode(tree_level level, cbk_handle * h) +{ + assert("nikita-382", h != NULL); + + return (level <= h->lock_level) ? h->lock_mode : ZNODE_READ_LOCK; +} + +/* update outdated delimiting keys */ +static void stale_dk(reiser4_tree *tree, znode *node) +{ + znode *right; + + RLOCK_TREE(tree); + WLOCK_DK(tree); + right = node->right; + + if (ZF_ISSET(node, JNODE_RIGHT_CONNECTED) && + right && ZF_ISSET(right, JNODE_DKSET) && + !keyeq(znode_get_rd_key(node), znode_get_ld_key(right))) + znode_set_rd_key(node, znode_get_ld_key(right)); + + WUNLOCK_DK(tree); + RUNLOCK_TREE(tree); +} + +/* check for possibly outdated delimiting keys, and update them if + * necessary. */ +static void update_stale_dk(reiser4_tree *tree, znode *node) +{ + znode *right; + reiser4_key rd; + + RLOCK_TREE(tree); + RLOCK_DK(tree); + rd = *znode_get_rd_key(node); + right = node->right; + if (unlikely(ZF_ISSET(node, JNODE_RIGHT_CONNECTED) && + right && ZF_ISSET(right, JNODE_DKSET) && + !keyeq(&rd, znode_get_ld_key(right)))) { + /* does this ever happen? */ + warning("nikita-38210", "stale dk"); + assert("nikita-38211", ZF_ISSET(node, JNODE_DKSET)); + RUNLOCK_DK(tree); + RUNLOCK_TREE(tree); + stale_dk(tree, node); + return; + } + RUNLOCK_DK(tree); + RUNLOCK_TREE(tree); +} + +/* + * handle searches for a non-unique key.
+ * + * Suppose that we are looking for an item with possibly non-unique key 100. + * + * Root node contains two pointers: one to a node with left delimiting key 0, + * and another to a node with left delimiting key 100. The item we are interested in + * may well be in the sub-tree rooted at the first pointer. + * + * To handle this, search_to_left() is called when search reaches stop + * level. This function checks whether it is _possible_ that item we are looking for + * is in the left neighbor (this can be done by comparing delimiting keys) and + * if so, tries to lock left neighbor (this is low priority lock, so it can + * deadlock, tree traversal is just restarted if it did) and then checks + * whether left neighbor actually contains items with our key. + * + * Note that this is done on the stop level only. It is possible to try such + * left-check on each level, but as duplicate keys are supposed to be rare + * (very unlikely that more than one node is completely filled with items with + * duplicate keys), it is cheaper to scan to the left on the stop level once.
+ * + */ +static level_lookup_result +search_to_left(cbk_handle * h /* search handle */ ) +{ + level_lookup_result result; + coord_t *coord; + znode *node; + znode *neighbor; + + lock_handle lh; + + assert("nikita-1761", h != NULL); + assert("nikita-1762", h->level == h->stop_level); + + init_lh(&lh); + coord = h->coord; + node = h->active_lh->node; + assert("nikita-1763", coord_is_leftmost_unit(coord)); + + h->result = reiser4_get_left_neighbor( + &lh, node, (int) h->lock_mode, GN_CAN_USE_UPPER_LEVELS); + neighbor = NULL; + switch (h->result) { + case -E_DEADLOCK: + result = LOOKUP_REST; + break; + case 0:{ + node_plugin *nplug; + coord_t crd; + lookup_bias bias; + + neighbor = lh.node; + h->result = zload(neighbor); + if (h->result != 0) { + result = LOOKUP_DONE; + break; + } + + nplug = neighbor->nplug; + + coord_init_zero(&crd); + bias = h->bias; + h->bias = FIND_EXACT; + h->result = nplug->lookup(neighbor, h->key, h->bias, &crd); + h->bias = bias; + + if (h->result == NS_NOT_FOUND) { + case -E_NO_NEIGHBOR: + h->result = CBK_COORD_FOUND; + if (!(h->flags & CBK_IN_CACHE)) + cbk_cache_add(node); + default: /* some other error */ + result = LOOKUP_DONE; + } else if (h->result == NS_FOUND) { + RLOCK_DK(znode_get_tree(neighbor)); + h->rd_key = *znode_get_ld_key(node); + leftmost_key_in_node(neighbor, &h->ld_key); + RUNLOCK_DK(znode_get_tree(neighbor)); + h->flags |= CBK_DKSET; + + h->block = *znode_get_block(neighbor); + /* clear coord -> node so that cbk_level_lookup() + wouldn't overwrite parent hint in neighbor. 
+ + Parent hint was set up by + reiser4_get_left_neighbor() + */ + UNDER_RW_VOID(tree, znode_get_tree(neighbor), write, + h->coord->node = NULL); + result = LOOKUP_CONT; + } else { + result = LOOKUP_DONE; + } + if (neighbor != NULL) + zrelse(neighbor); + } + } + done_lh(&lh); + return result; +} + +/* debugging aid: return symbolic name of search bias */ +static const char * +bias_name(lookup_bias bias /* bias to get name of */ ) +{ + if (bias == FIND_EXACT) + return "exact"; + else if (bias == FIND_MAX_NOT_MORE_THAN) + return "left-slant"; +/* else if( bias == RIGHT_SLANT_BIAS ) */ +/* return "right-bias"; */ + else { + static char buf[30]; + + sprintf(buf, "unknown: %i", bias); + return buf; + } +} + +#if REISER4_DEBUG +/* debugging aid: print human readable information about @p */ +reiser4_internal void +print_coord_content(const char *prefix /* prefix to print */ , + coord_t * p /* coord to print */ ) +{ + reiser4_key key; + + if (p == NULL) { + printk("%s: null\n", prefix); + return; + } + if ((p->node != NULL) && znode_is_loaded(p->node) && coord_is_existing_item(p)) + printk("%s: data: %p, length: %i\n", prefix, item_body_by_coord(p), item_length_by_coord(p)); + print_znode(prefix, p->node); + if (znode_is_loaded(p->node)) { + item_key_by_coord(p, &key); + print_key(prefix, &key); + } +} + +/* debugging aid: print human readable information about @block */ +reiser4_internal void +print_address(const char *prefix /* prefix to print */ , + const reiser4_block_nr * block /* block number to print */ ) +{ + printk("%s: %s\n", prefix, sprint_address(block)); +} +#endif + +/* return string containing human readable representation of @block */ +reiser4_internal char * +sprint_address(const reiser4_block_nr * block /* block number to print */ ) +{ + static char address[30]; + + if (block == NULL) + sprintf(address, "null"); + else if (blocknr_is_fake(block)) + sprintf(address, "%llx", (unsigned long long)(*block)); + else + sprintf(address, "%llu", (unsigned long 
long)(*block)); + return address; +} + +/* release parent node during traversal */ +static void +put_parent(cbk_handle * h /* search handle */ ) +{ + assert("nikita-383", h != NULL); + if (h->parent_lh->node != NULL) { + longterm_unlock_znode(h->parent_lh); + } +} + +/* helper function used by coord_by_key(): release reference to parent znode + stored in handle before processing its child. */ +static void +hput(cbk_handle * h /* search handle */ ) +{ + assert("nikita-385", h != NULL); + done_lh(h->parent_lh); + done_lh(h->active_lh); +} + +/* Helper function used by cbk(): update delimiting keys of child node (stored + in h->active_lh->node) using key taken from parent on the parent level. */ +static int +setup_delimiting_keys(cbk_handle * h /* search handle */) +{ + znode *active; + reiser4_tree *tree; + + assert("nikita-1088", h != NULL); + + active = h->active_lh->node; + + /* fast check without taking dk lock. This is safe, because + * JNODE_DKSET is never cleared once set. */ + if (!ZF_ISSET(active, JNODE_DKSET)) { + tree = znode_get_tree(active); + WLOCK_DK(tree); + if (!ZF_ISSET(active, JNODE_DKSET)) { + znode_set_ld_key(active, &h->ld_key); + znode_set_rd_key(active, &h->rd_key); + ZF_SET(active, JNODE_DKSET); + } + WUNLOCK_DK(tree); + return 1; + } + return 0; +} + +/* true if @block makes sense for the @tree. Used to detect corrupted node + * pointers */ +static int +block_nr_is_correct(reiser4_block_nr * block /* block number to check */ , + reiser4_tree * tree /* tree to check against */ ) +{ + assert("nikita-757", block != NULL); + assert("nikita-758", tree != NULL); + + /* check to see if it exceeds the size of the device. 
*/ + return reiser4_blocknr_is_sane_for(tree->super, block); +} + +/* check consistency of fields */ +static int +sanity_check(cbk_handle * h /* search handle */ ) +{ + assert("nikita-384", h != NULL); + + if (h->level < h->stop_level) { + h->error = "Buried under leaves"; + h->result = RETERR(-EIO); + return LOOKUP_DONE; + } else if (!block_nr_is_correct(&h->block, h->tree)) { + h->error = "bad block number"; + h->result = RETERR(-EIO); + return LOOKUP_DONE; + } else + return 0; +} + + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/spin_macros.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/spin_macros.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,477 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Wrapper functions/macros for spin locks. */ + +/* + * This file implements wrapper functions and macros to work with spin locks + * and read write locks embedded into kernel objects. Wrapper functions + * provide the following functionality: + * + * (1) encapsulation of locks: instead of writing spin_lock(&obj->lock), + * where obj is object of type foo, one writes spin_lock_foo(obj). + * + * (2) optional keeping (in per-thread reiser4_context->locks) information + * about number of locks of particular type currently held by thread. This + * is done if REISER4_DEBUG is on. + * + * (3) optional checking of lock ordering. For object type foo, it is + * possible to provide "lock ordering predicate" (possibly using + * information stored in reiser4_context->locks) checking that locks are + * acquired in the proper order. This is done if REISER4_DEBUG is on. + * + * (4) optional collection of spin lock contention statistics. In this mode + * two sysfs objects (located in /sys/profregion) are associated with each + * spin lock type.
One object (foo_t) shows how much time was spent trying + * to acquire spin locks of foo type. Another (foo_h) shows how much time + * spin locks of the type foo were held locked. See spinprof.h for more + * details on this. + * + */ + +#ifndef __SPIN_MACROS_H__ +#define __SPIN_MACROS_H__ + +#include +#include + +#include "debug.h" + +/* Checks that read write lock @s is locked (or not) by the -current- + * thread. not yet implemented */ +#define check_is_write_locked(s) ((void)(s), 1) +#define check_is_read_locked(s) ((void)(s), 1) +#define check_is_not_read_locked(s) ((void)(s), 1) +#define check_is_not_write_locked(s) ((void)(s), 1) + +/* Checks that spin lock @s is locked (or not) by the -current- thread. */ +#define check_spin_is_not_locked(s) ((void)(s), 1) +#define spin_is_not_locked(s) ((void)(s), 1) +#if defined(CONFIG_SMP) +# define check_spin_is_locked(s) spin_is_locked(s) +#else +# define check_spin_is_locked(s) ((void)(s), 1) +#endif + +/* + * Data structure embedded into kernel objects together with spin lock. + */ +typedef struct reiser4_spin_data { + /* spin lock proper */ + spinlock_t lock; +} reiser4_spin_data; + +/* + * Data structure embedded into kernel objects together with read write lock. + */ +typedef struct reiser4_rw_data { + /* read write lock proper */ + rwlock_t lock; +} reiser4_rw_data; + +#if REISER4_DEBUG +#define __ODCA(l, e) ON_DEBUG_CONTEXT(assert(l, e)) +#else +#define __ODCA(l, e) noop +#endif + +/* Define several inline functions for each type of spinlock. This is long + * monster macro definition. */ +#define SPIN_LOCK_FUNCTIONS(NAME,TYPE,FIELD) \ + \ +/* Initialize spin lock embedded in @x */ \ +static inline void spin_ ## NAME ## _init(TYPE *x) \ +{ \ + __ODCA("nikita-2987", x != NULL); \ + memset(& x->FIELD, 0, sizeof x->FIELD); \ + spin_lock_init(& x->FIELD.lock); \ +} \ + \ +/* Increment per-thread lock counter for this lock type and total counter */ \ +/* of acquired spin locks. 
This is helper function used by spin lock */ \ +/* acquiring functions below */ \ +static inline void spin_ ## NAME ## _inc(void) \ +{ \ + LOCK_CNT_INC(spin_locked_ ## NAME); \ + LOCK_CNT_INC(spin_locked); \ +} \ + \ +/* Decrement per-thread lock counter and total counter of acquired spin */ \ +/* locks. This is helper function used by spin lock releasing functions */ \ +/* below. */ \ +static inline void spin_ ## NAME ## _dec(void) \ +{ \ + LOCK_CNT_DEC(spin_locked_ ## NAME); \ + LOCK_CNT_DEC(spin_locked); \ +} \ + \ +/* Return true if spin lock embedded in @x is acquired by -current- */ \ +/* thread */ \ +static inline int spin_ ## NAME ## _is_locked (const TYPE *x) \ +{ \ + return check_spin_is_locked (& x->FIELD.lock) && \ + LOCK_CNT_GTZ(spin_locked_ ## NAME); \ +} \ + \ +/* Return true if spin lock embedded in @x is not acquired by -current- */ \ +/* thread */ \ +static inline int spin_ ## NAME ## _is_not_locked (TYPE *x) \ +{ \ + return check_spin_is_not_locked (& x->FIELD.lock); \ +} \ + \ +/* Acquire spin lock embedded in @x without checking lock ordering. */ \ +/* This is useful when, for example, locking just created object. */ \ +static inline void spin_lock_ ## NAME ## _no_ord (TYPE *x) \ +{ \ + __ODCA("nikita-2703", spin_ ## NAME ## _is_not_locked(x)); \ + spin_lock(&x->FIELD.lock); \ + spin_ ## NAME ## _inc(); \ +} \ + \ +/* Account for spin lock acquired by some other means. For example */ \ +/* through atomic_dec_and_lock() or similar. */ \ +static inline void spin_lock_ ## NAME ## _acc (TYPE *x) \ +{ \ + spin_ ## NAME ## _inc(); \ +} \ + \ +/* Lock @x with explicit indication of spin lock profiling "sites". */ \ +/* Locksite is used by spin lock profiling code (spinprof.[ch]) to */ \ +/* identify fragment of code that locks @x. */ \ +/* */ \ +/* If clock interrupt finds that current thread is spinning waiting for */ \ +/* the lock on @x, counters in @t will be incremented.
*/ \ +/* */ \ +/* If clock interrupt finds that current thread holds the lock on @x, */ \ +/* counters in @h will be incremented. */ \ +/* */ \ +static inline void spin_lock_ ## NAME ## _at (TYPE *x) \ +{ \ + __ODCA("nikita-1383", spin_ordering_pred_ ## NAME(x)); \ + spin_lock_ ## NAME ## _no_ord(x); \ +} \ + \ +/* Lock @x. */ \ +static inline void spin_lock_ ## NAME (TYPE *x) \ +{ \ + __ODCA("nikita-1383", spin_ordering_pred_ ## NAME(x)); \ + spin_lock_ ## NAME ## _no_ord(x); \ +} \ + \ +/* Try to obtain lock @x. On success, returns 1 with @x locked. */ \ +/* If @x is already locked, return 0 immediately. */ \ +static inline int spin_trylock_ ## NAME (TYPE *x) \ +{ \ + if (spin_trylock (& x->FIELD.lock)) { \ + spin_ ## NAME ## _inc(); \ + return 1; \ + } \ + return 0; \ +} \ + \ +/* Unlock @x. */ \ +static inline void spin_unlock_ ## NAME (TYPE *x) \ +{ \ + __ODCA("nikita-1375", LOCK_CNT_GTZ(spin_locked_ ## NAME)); \ + __ODCA("nikita-1376", LOCK_CNT_GTZ(spin_locked > 0)); \ + __ODCA("nikita-2703", spin_ ## NAME ## _is_locked(x)); \ + \ + spin_ ## NAME ## _dec(); \ + spin_unlock (& x->FIELD.lock); \ +} \ + \ +typedef struct { int foo; } NAME ## _spin_dummy + +/* + * Helper macro to perform a simple operation that requires taking of spin + * lock. + * + * 1. Acquire spin lock on object @obj of type @obj_type. + * + * 2. Execute @exp under spin lock, and store result. + * + * 3. Release spin lock. + * + * 4. Return result of @exp. + * + * Example: + * + * right_delimiting_key = UNDER_SPIN(dk, current_tree, *znode_get_rd_key(node)); + * + */ +#define UNDER_SPIN(obj_type, obj, exp) \ +({ \ + typeof (obj) __obj; \ + typeof (exp) __result; \ + \ + __obj = (obj); \ + __ODCA("nikita-2492", __obj != NULL); \ + spin_lock_ ## obj_type ## _at (__obj); \ + __result = exp; \ + spin_unlock_ ## obj_type (__obj); \ + __result; \ +}) + +/* + * The same as UNDER_SPIN, but without storing and returning @exp's result. 
+ */ +#define UNDER_SPIN_VOID(obj_type, obj, exp) \ +({ \ + typeof (obj) __obj; \ + \ + __obj = (obj); \ + __ODCA("nikita-2492", __obj != NULL); \ + spin_lock_ ## obj_type ## _at (__obj); \ + exp; \ + spin_unlock_ ## obj_type (__obj); \ +}) + + +/* Define several inline functions for each type of read write lock. This is + * insanely long macro definition. */ +#define RW_LOCK_FUNCTIONS(NAME,TYPE,FIELD) \ + \ + \ +/* Initialize read write lock embedded into @x. */ \ +static inline void rw_ ## NAME ## _init(TYPE *x) \ +{ \ + __ODCA("nikita-2988", x != NULL); \ + memset(& x->FIELD, 0, sizeof x->FIELD); \ + rwlock_init(& x->FIELD.lock); \ +} \ + \ +/* True, if @x is read locked by the -current- thread. */ \ +static inline int rw_ ## NAME ## _is_read_locked (const TYPE *x) \ +{ \ + return check_is_read_locked (& x->FIELD.lock); \ +} \ + \ +/* True, if @x is write locked by the -current- thread. */ \ +static inline int rw_ ## NAME ## _is_write_locked (const TYPE *x) \ +{ \ + return check_is_write_locked (& x->FIELD.lock); \ +} \ + \ +/* True, if @x is not read locked by the -current- thread. */ \ +static inline int rw_ ## NAME ## _is_not_read_locked (TYPE *x) \ +{ \ + return check_is_not_read_locked (& x->FIELD.lock); \ +} \ + \ +/* True, if @x is not write locked by the -current- thread. */ \ +static inline int rw_ ## NAME ## _is_not_write_locked (TYPE *x) \ +{ \ + return check_is_not_write_locked (& x->FIELD.lock); \ +} \ + \ +/* True, if @x is either read or write locked by the -current- thread. */ \ +static inline int rw_ ## NAME ## _is_locked (const TYPE *x) \ +{ \ + return check_is_read_locked (& x->FIELD.lock) || \ + check_is_write_locked (& x->FIELD.lock); \ +} \ + \ +/* True, if @x is neither read nor write locked by the -current- thread. 
*/ \ +static inline int rw_ ## NAME ## _is_not_locked (const TYPE *x) \ +{ \ + return check_is_not_read_locked (& x->FIELD.lock) && \ + check_is_not_write_locked (& x->FIELD.lock); \ +} \ + \ +/* This is helper function used by lock acquiring functions below */ \ +static inline void read_ ## NAME ## _inc(void) \ +{ \ + LOCK_CNT_INC(read_locked_ ## NAME); \ + LOCK_CNT_INC(rw_locked_ ## NAME); \ + LOCK_CNT_INC(spin_locked); \ +} \ + \ +/* This is helper function used by lock acquiring functions below */ \ +static inline void read_ ## NAME ## _dec(void) \ +{ \ + LOCK_CNT_DEC(read_locked_ ## NAME); \ + LOCK_CNT_DEC(rw_locked_ ## NAME); \ + LOCK_CNT_DEC(spin_locked); \ +} \ + \ +/* This is helper function used by lock acquiring functions below */ \ +static inline void write_ ## NAME ## _inc(void) \ +{ \ + LOCK_CNT_INC(write_locked_ ## NAME); \ + LOCK_CNT_INC(rw_locked_ ## NAME); \ + LOCK_CNT_INC(spin_locked); \ +} \ + \ +/* This is helper function used by lock acquiring functions below */ \ +static inline void write_ ## NAME ## _dec(void) \ +{ \ + LOCK_CNT_DEC(write_locked_ ## NAME); \ + LOCK_CNT_DEC(rw_locked_ ## NAME); \ + LOCK_CNT_DEC(spin_locked); \ +} \ + \ +/* Acquire read lock on @x without checking lock ordering predicates. */ \ +/* This is useful when, for example, locking just created object. */ \ +static inline void read_lock_ ## NAME ## _no_ord (TYPE *x) \ +{ \ + __ODCA("nikita-2976", rw_ ## NAME ## _is_not_read_locked(x)); \ + read_lock(&x->FIELD.lock); \ + read_ ## NAME ## _inc(); \ +} \ + \ +/* Acquire write lock on @x without checking lock ordering predicates. */ \ +/* This is useful when, for example, locking just created object. */ \ +static inline void write_lock_ ## NAME ## _no_ord (TYPE *x) \ +{ \ + __ODCA("nikita-2977", rw_ ## NAME ## _is_not_write_locked(x)); \ + write_lock(&x->FIELD.lock); \ + write_ ## NAME ## _inc(); \ +} \ + \ +/* Read lock @x with explicit indication of spin lock profiling "sites". 
*/ \ +/* See spin_lock_foo_at() above for more information. */ \ +static inline void read_lock_ ## NAME ## _at (TYPE *x) \ +{ \ + __ODCA("nikita-2975", rw_ordering_pred_ ## NAME(x)); \ + read_lock_ ## NAME ## _no_ord(x); \ +} \ + \ +/* Write lock @x with explicit indication of spin lock profiling "sites". */ \ +/* See spin_lock_foo_at() above for more information. */ \ +static inline void write_lock_ ## NAME ## _at (TYPE *x) \ +{ \ + __ODCA("nikita-2978", rw_ordering_pred_ ## NAME(x)); \ + write_lock_ ## NAME ## _no_ord(x); \ +} \ + \ +/* Read lock @x. */ \ +static inline void read_lock_ ## NAME (TYPE *x) \ +{ \ + __ODCA("nikita-2975", rw_ordering_pred_ ## NAME(x)); \ + read_lock_ ## NAME ## _no_ord(x); \ +} \ + \ +/* Write lock @x. */ \ +static inline void write_lock_ ## NAME (TYPE *x) \ +{ \ + __ODCA("nikita-2978", rw_ordering_pred_ ## NAME(x)); \ + write_lock_ ## NAME ## _no_ord(x); \ +} \ + \ +/* Release read lock on @x. */ \ +static inline void read_unlock_ ## NAME (TYPE *x) \ +{ \ + __ODCA("nikita-2979", LOCK_CNT_GTZ(read_locked_ ## NAME)); \ + __ODCA("nikita-2980", LOCK_CNT_GTZ(rw_locked_ ## NAME)); \ + __ODCA("nikita-2980", LOCK_CNT_GTZ(spin_locked)); \ + read_ ## NAME ## _dec(); \ + __ODCA("nikita-2703", rw_ ## NAME ## _is_read_locked(x)); \ + read_unlock (& x->FIELD.lock); \ +} \ + \ +/* Release write lock on @x. */ \ +static inline void write_unlock_ ## NAME (TYPE *x) \ +{ \ + __ODCA("nikita-2979", LOCK_CNT_GTZ(write_locked_ ## NAME)); \ + __ODCA("nikita-2980", LOCK_CNT_GTZ(rw_locked_ ## NAME)); \ + __ODCA("nikita-2980", LOCK_CNT_GTZ(spin_locked)); \ + write_ ## NAME ## _dec(); \ + __ODCA("nikita-2703", rw_ ## NAME ## _is_write_locked(x)); \ + write_unlock (& x->FIELD.lock); \ +} \ + \ +/* Try to obtain write lock on @x. On success, returns 1 with @x locked. */ \ +/* If @x is already locked, return 0 immediately. 
*/ \ +static inline int write_trylock_ ## NAME (TYPE *x) \ +{ \ + if (write_trylock (& x->FIELD.lock)) { \ + write_ ## NAME ## _inc(); \ + return 1; \ + } \ + return 0; \ +} \ + \ + \ +typedef struct { int foo; } NAME ## _rw_dummy + +/* + * Helper macro to perform a simple operation that requires taking of read + * write lock. + * + * 1. Acquire read or write (depending on @rw parameter) lock on object @obj + * of type @obj_type. + * + * 2. Execute @exp under lock, and store result. + * + * 3. Release lock. + * + * 4. Return result of @exp. + * + * Example: + * + * tree_height = UNDER_RW(tree, current_tree, read, current_tree->height); + */ +#define UNDER_RW(obj_type, obj, rw, exp) \ +({ \ + typeof (obj) __obj; \ + typeof (exp) __result; \ + \ + __obj = (obj); \ + __ODCA("nikita-2981", __obj != NULL); \ + rw ## _lock_ ## obj_type ## _at (__obj); \ + __result = exp; \ + rw ## _unlock_ ## obj_type (__obj); \ + __result; \ +}) + +/* + * The same as UNDER_RW, but without storing and returning @exp's result. 
+ */ +#define UNDER_RW_VOID(obj_type, obj, rw, exp) \ +({ \ + typeof (obj) __obj; \ + \ + __obj = (obj); \ + __ODCA("nikita-2982", __obj != NULL); \ + rw ## _lock_ ## obj_type ## _at (__obj); \ + exp; \ + rw ## _unlock_ ## obj_type (__obj); \ +}) + +#define LOCK_JNODE(node) spin_lock_jnode(node) +#define LOCK_JLOAD(node) spin_lock_jload(node) +#define LOCK_ATOM(atom) spin_lock_atom(atom) +#define LOCK_TXNH(txnh) spin_lock_txnh(txnh) +#define LOCK_INODE(inode) spin_lock_inode_object(inode) +#define RLOCK_TREE(tree) read_lock_tree(tree) +#define WLOCK_TREE(tree) write_lock_tree(tree) +#define RLOCK_DK(tree) read_lock_dk(tree) +#define WLOCK_DK(tree) write_lock_dk(tree) +#define RLOCK_ZLOCK(lock) read_lock_zlock(lock) +#define WLOCK_ZLOCK(lock) write_lock_zlock(lock) + +#define UNLOCK_JNODE(node) spin_unlock_jnode(node) +#define UNLOCK_JLOAD(node) spin_unlock_jload(node) +#define UNLOCK_ATOM(atom) spin_unlock_atom(atom) +#define UNLOCK_TXNH(txnh) spin_unlock_txnh(txnh) +#define UNLOCK_INODE(inode) spin_unlock_inode_object(inode) +#define RUNLOCK_TREE(tree) read_unlock_tree(tree) +#define WUNLOCK_TREE(tree) write_unlock_tree(tree) +#define RUNLOCK_DK(tree) read_unlock_dk(tree) +#define WUNLOCK_DK(tree) write_unlock_dk(tree) +#define RUNLOCK_ZLOCK(lock) read_unlock_zlock(lock) +#define WUNLOCK_ZLOCK(lock) write_unlock_zlock(lock) + +/* __SPIN_MACROS_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/status_flags.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/status_flags.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,195 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Functions that deal with reiser4 status block, query status and update it, if needed */ + +#include +#include +#include +#include +#include +#include "debug.h" +#include "dformat.h" +#include "status_flags.h" +#include "super.h" + +/* This is our end I/O handler that marks page uptodate if IO was successful. It also + unconditionally unlocks the page, so we can see that io was done. + We do not free bio, because we hope to reuse that. */ +static int reiser4_status_endio(struct bio *bio, unsigned int bytes_done, int err) +{ + if (bio->bi_size) + return 1; + + if (test_bit(BIO_UPTODATE, &bio->bi_flags)) { + SetPageUptodate(bio->bi_io_vec->bv_page); + } else { + ClearPageUptodate(bio->bi_io_vec->bv_page); + SetPageError(bio->bi_io_vec->bv_page); + } + unlock_page(bio->bi_io_vec->bv_page); +// bio_put(bio); + return 0; +} + +/* Initialise status code. This is expected to be called from the disk format + code. The block parameter is where the status block lives.
*/ +reiser4_internal int reiser4_status_init(reiser4_block_nr block) +{ + struct super_block *sb = reiser4_get_current_sb(); + struct reiser4_status *statuspage; + struct bio *bio; + struct page *page; + + get_super_private(sb)->status_page = NULL; + get_super_private(sb)->status_bio = NULL; + + page = alloc_pages(GFP_KERNEL, 0); + if (!page) + return -ENOMEM; + + bio = bio_alloc(GFP_KERNEL, 1); + if (bio != NULL) { + bio->bi_sector = block * (sb->s_blocksize >> 9); + bio->bi_bdev = sb->s_bdev; + bio->bi_io_vec[0].bv_page = page; + bio->bi_io_vec[0].bv_len = sb->s_blocksize; + bio->bi_io_vec[0].bv_offset = 0; + bio->bi_vcnt = 1; + bio->bi_size = sb->s_blocksize; + bio->bi_end_io = reiser4_status_endio; + } else { + __free_pages(page, 0); + return -ENOMEM; + } + lock_page(page); + submit_bio(READ, bio); + blk_run_address_space(get_super_fake(sb)->i_mapping); + /*blk_run_queues();*/ + wait_on_page_locked(page); + if ( !PageUptodate(page) ) { + warning("green-2007", "I/O error while tried to read status page\n"); + return -EIO; + } + + statuspage = (struct reiser4_status *)kmap_atomic(page, KM_USER0); + if ( memcmp( statuspage->magic, REISER4_STATUS_MAGIC, sizeof(REISER4_STATUS_MAGIC)) ) { + /* Magic does not match. */ + kunmap_atomic((char *)statuspage, KM_USER0); + warning("green-2008", "Wrong magic in status block\n"); + __free_pages(page, 0); + bio_put(bio); + return -EINVAL; + } + kunmap_atomic((char *)statuspage, KM_USER0); + + get_super_private(sb)->status_page = page; + get_super_private(sb)->status_bio = bio; + return 0; +} + +/* Query the status of fs. Returns if the FS can be safely mounted. + Also if "status" and "extended" parameters are given, it will fill + actual parts of status from disk there. */ +reiser4_internal int reiser4_status_query(u64 *status, u64 *extended) +{ + struct super_block *sb = reiser4_get_current_sb(); + struct reiser4_status *statuspage; + int retval; + + if ( !get_super_private(sb)->status_page ) { // No status page? 
+ return REISER4_STATUS_MOUNT_UNKNOWN; + } + statuspage = (struct reiser4_status *) + kmap_atomic(get_super_private(sb)->status_page, KM_USER0); + switch ( (long)d64tocpu(&statuspage->status) ) { // FIXME: this cast is a hack for 32 bit arches to work. + case REISER4_STATUS_OK: + retval = REISER4_STATUS_MOUNT_OK; + break; + case REISER4_STATUS_CORRUPTED: + retval = REISER4_STATUS_MOUNT_WARN; + break; + case REISER4_STATUS_DAMAGED: + case REISER4_STATUS_DESTROYED: + case REISER4_STATUS_IOERROR: + retval = REISER4_STATUS_MOUNT_RO; + break; + default: + retval = REISER4_STATUS_MOUNT_UNKNOWN; + break; + } + + if ( status ) + *status = d64tocpu(&statuspage->status); + if ( extended ) + *extended = d64tocpu(&statuspage->extended_status); + + kunmap_atomic((char *)statuspage, KM_USER0); + return retval; +} + +/* This function should be called when something bad happens (e.g. from reiser4_panic). + It fills the status structure and tries to push it to disk. */ +reiser4_internal int +reiser4_status_write(u64 status, u64 extended_status, char *message) +{ + struct super_block *sb = reiser4_get_current_sb(); + struct reiser4_status *statuspage; + struct bio *bio = get_super_private(sb)->status_bio; + + if ( !get_super_private(sb)->status_page ) { // No status page? 
+ return -1; + } + statuspage = (struct reiser4_status *) + kmap_atomic(get_super_private(sb)->status_page, KM_USER0); + + cputod64(status, &statuspage->status); + cputod64(extended_status, &statuspage->extended_status); + strncpy(statuspage->texterror, message, REISER4_TEXTERROR_LEN); + +#ifdef CONFIG_FRAME_POINTER +#define GETFRAME(no) \ + cputod64((unsigned long)__builtin_return_address(no), \ + &statuspage->stacktrace[no]) + + GETFRAME(0); + GETFRAME(1); + GETFRAME(2); + GETFRAME(3); + GETFRAME(4); + GETFRAME(5); + GETFRAME(6); + GETFRAME(7); + GETFRAME(8); + GETFRAME(9); + +#undef GETFRAME +#endif + kunmap_atomic((char *)statuspage, KM_USER0); + bio->bi_bdev = sb->s_bdev; + bio->bi_io_vec[0].bv_page = get_super_private(sb)->status_page; + bio->bi_io_vec[0].bv_len = sb->s_blocksize; + bio->bi_io_vec[0].bv_offset = 0; + bio->bi_vcnt = 1; + bio->bi_size = sb->s_blocksize; + bio->bi_end_io = reiser4_status_endio; + lock_page(get_super_private(sb)->status_page); // Safe as nobody should touch our page. + /* We can block now, but we have no other choice anyway */ + submit_bio(WRITE, bio); + blk_run_address_space(get_super_fake(sb)->i_mapping); + /*blk_run_queues();*/ // Now start the i/o. + return 0; // We do not wait for io to finish. +} + +/* Frees the page with status and bio structure. 
Should be called by disk format at umount time */ +reiser4_internal int reiser4_status_finish(void) +{ + struct super_block *sb = reiser4_get_current_sb(); + + __free_pages(get_super_private(sb)->status_page, 0); + get_super_private(sb)->status_page = NULL; + bio_put(get_super_private(sb)->status_bio); + get_super_private(sb)->status_bio = NULL; + return 0; +} + diff -puN /dev/null fs/reiser4/status_flags.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/status_flags.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,43 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Here we declare structures and flags that store reiser4 status on disk. + The status that helps us to find out if the filesystem is valid or if it + contains some critical, or not so critical errors */ + +#if !defined( __REISER4_STATUS_FLAGS_H__ ) +#define __REISER4_STATUS_FLAGS_H__ + +#include "dformat.h" +/* These are major status flags */ +#define REISER4_STATUS_OK 0 +#define REISER4_STATUS_CORRUPTED 0x1 +#define REISER4_STATUS_DAMAGED 0x2 +#define REISER4_STATUS_DESTROYED 0x4 +#define REISER4_STATUS_IOERROR 0x8 + +/* Return values for reiser4_status_query() */ +#define REISER4_STATUS_MOUNT_OK 0 +#define REISER4_STATUS_MOUNT_WARN 1 +#define REISER4_STATUS_MOUNT_RO 2 +#define REISER4_STATUS_MOUNT_UNKNOWN -1 + +#define REISER4_TEXTERROR_LEN 256 + +#define REISER4_STATUS_MAGIC "ReiSeR4StATusBl" +/* We probably need to keep its size under sector size which is 512 bytes */ +struct reiser4_status { + char magic[16]; + d64 status; /* Current FS state */ + d64 extended_status; /* Any additional info that might have sense in addition to "status". E.g. 
+ last sector where io error happened if status is "io error encountered" */ + d64 stacktrace[10]; /* Last ten functional calls made (addresses)*/ + char texterror[REISER4_TEXTERROR_LEN]; /* Any error message if appropriate, otherwise filled with zeroes */ +}; + +int reiser4_status_init(reiser4_block_nr block); +int reiser4_status_query(u64 *status, u64 *extended); +int reiser4_status_write(u64 status, u64 extended_status, char *message); +int reiser4_status_finish(void); + +#endif diff -puN /dev/null fs/reiser4/super.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/super.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,480 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Super-block manipulations. */ + +#include "debug.h" +#include "dformat.h" +#include "key.h" +#include "plugin/security/perm.h" +#include "plugin/space/space_allocator.h" +#include "plugin/plugin.h" +#include "tree.h" +#include "vfs_ops.h" +#include "super.h" +#include "reiser4.h" + +#include /* for __u?? 
*/ +#include /* for struct super_block */ + +/*const __u32 REISER4_SUPER_MAGIC = 0x52345362;*/ /* (*(__u32 *)"R4Sb"); */ + +static __u64 reserved_for_gid(const struct super_block *super, gid_t gid); +static __u64 reserved_for_uid(const struct super_block *super, uid_t uid); +static __u64 reserved_for_root(const struct super_block *super); + +/* Return reiser4-specific part of super block */ +reiser4_internal reiser4_super_info_data * +get_super_private_nocheck(const struct super_block *super /* super block + * queried */ ) +{ + return (reiser4_super_info_data *) super->s_fs_info; +} + + +/* Return reiser4 fstype: value that is returned in ->f_type field by statfs() */ +reiser4_internal long +statfs_type(const struct super_block *super UNUSED_ARG /* super block + * queried */ ) +{ + assert("nikita-448", super != NULL); + assert("nikita-449", is_reiser4_super(super)); + return (long) REISER4_SUPER_MAGIC; +} + +/* functions to read/modify fields of reiser4_super_info_data */ + +/* get number of blocks in file system */ +reiser4_internal __u64 +reiser4_block_count(const struct super_block * super /* super block + queried */ ) +{ + assert("vs-494", super != NULL); + assert("vs-495", is_reiser4_super(super)); + return get_super_private(super)->block_count; +} + +/* + * number of blocks in the current file system + */ +reiser4_internal __u64 reiser4_current_block_count(void) +{ + return get_current_super_private()->block_count; +} + + +/* set number of block in filesystem */ +reiser4_internal void +reiser4_set_block_count(const struct super_block *super, __u64 nr) +{ + assert("vs-501", super != NULL); + assert("vs-502", is_reiser4_super(super)); + get_super_private(super)->block_count = nr; + /* The proper calculation of the reserved space counter (%5 of device + block counter) we need a 64 bit division which is missing in Linux on + i386 platform. 
Because we do not need a precise calculation here we + can replace a div64 operation by this combination of multiplication + and shift: 51 / 2^10 == 0.0498. */ + /* FIXME: this is a bug. It comes up only for very small filesystems + which probably are never used. Nevertheless, it is a bug. The number of + reserved blocks must not be less than the maximal number of blocks which + get grabbed with BA_RESERVED. */ + get_super_private(super)->blocks_reserved = ((nr * 51) >> 10); +} + +/* amount of blocks used (allocated for data) in file system */ +reiser4_internal __u64 +reiser4_data_blocks(const struct super_block *super /* super block + queried */ ) +{ + assert("nikita-452", super != NULL); + assert("nikita-453", is_reiser4_super(super)); + return get_super_private(super)->blocks_used; +} + +/* set number of blocks used in filesystem */ +reiser4_internal void +reiser4_set_data_blocks(const struct super_block *super, __u64 nr) +{ + assert("vs-503", super != NULL); + assert("vs-504", is_reiser4_super(super)); + get_super_private(super)->blocks_used = nr; +} + +/* amount of free blocks in file system */ +reiser4_internal __u64 +reiser4_free_blocks(const struct super_block *super /* super block + queried */ ) +{ + assert("nikita-454", super != NULL); + assert("nikita-455", is_reiser4_super(super)); + return get_super_private(super)->blocks_free; +} + +/* set number of blocks free in filesystem */ +reiser4_internal void +reiser4_set_free_blocks(const struct super_block *super, __u64 nr) +{ + assert("vs-505", super != NULL); + assert("vs-506", is_reiser4_super(super)); + get_super_private(super)->blocks_free = nr; +} + +/* get mkfs unique identifier */ +reiser4_internal __u32 +reiser4_mkfs_id(const struct super_block *super /* super block + queried */ ) +{ + assert("vpf-221", super != NULL); + assert("vpf-222", is_reiser4_super(super)); + return get_super_private(super)->mkfs_id; +} + +/* set mkfs unique identifier */ +reiser4_internal void +reiser4_set_mkfs_id(const struct
super_block *super, __u32 id) +{ + assert("vpf-223", super != NULL); + assert("vpf-224", is_reiser4_super(super)); + get_super_private(super)->mkfs_id = id; +} + +/* amount of free blocks in file system */ +reiser4_internal __u64 +reiser4_free_committed_blocks(const struct super_block *super) +{ + assert("vs-497", super != NULL); + assert("vs-498", is_reiser4_super(super)); + return get_super_private(super)->blocks_free_committed; +} + +/* amount of blocks in the file system reserved for @uid and @gid */ +reiser4_internal long +reiser4_reserved_blocks(const struct super_block *super /* super block + queried */ , + uid_t uid /* user id */ , gid_t gid /* group id */ ) +{ + long reserved; + + assert("nikita-456", super != NULL); + assert("nikita-457", is_reiser4_super(super)); + + reserved = 0; + if (REISER4_SUPPORT_GID_SPACE_RESERVATION) + reserved += reserved_for_gid(super, gid); + if (REISER4_SUPPORT_UID_SPACE_RESERVATION) + reserved += reserved_for_uid(super, uid); + if (REISER4_SUPPORT_ROOT_SPACE_RESERVATION && (uid == 0)) + reserved += reserved_for_root(super); + return reserved; +} + +/* get/set value of/to grabbed blocks counter */ +reiser4_internal __u64 reiser4_grabbed_blocks(const struct super_block * super) +{ + assert("zam-512", super != NULL); + assert("zam-513", is_reiser4_super(super)); + + return get_super_private(super)->blocks_grabbed; +} + +reiser4_internal __u64 flush_reserved (const struct super_block *super) +{ + assert ("vpf-285", super != NULL); + assert ("vpf-286", is_reiser4_super (super)); + + return get_super_private(super)->blocks_flush_reserved; +} + +/* get/set value of/to counter of fake allocated formatted blocks */ +reiser4_internal __u64 reiser4_fake_allocated(const struct super_block *super) +{ + assert("zam-516", super != NULL); + assert("zam-517", is_reiser4_super(super)); + + return get_super_private(super)->blocks_fake_allocated; +} + +/* get/set value of/to counter of fake allocated unformatted blocks */ +reiser4_internal 
__u64 +reiser4_fake_allocated_unformatted(const struct super_block *super) +{ + assert("zam-516", super != NULL); + assert("zam-517", is_reiser4_super(super)); + + return get_super_private(super)->blocks_fake_allocated_unformatted; +} + +/* get/set value of/to counter of clustered blocks */ +reiser4_internal __u64 reiser4_clustered_blocks(const struct super_block *super) +{ + assert("edward-601", super != NULL); + assert("edward-602", is_reiser4_super(super)); + + return get_super_private(super)->blocks_clustered; +} + +/* space allocator used by this file system */ +reiser4_internal reiser4_space_allocator * +get_space_allocator(const struct super_block * super) +{ + assert("nikita-1965", super != NULL); + assert("nikita-1966", is_reiser4_super(super)); + return &get_super_private(super)->space_allocator; +} + +/* return fake inode used to bind formatted nodes in the page cache */ +reiser4_internal struct inode * +get_super_fake(const struct super_block *super /* super block + queried */ ) +{ + assert("nikita-1757", super != NULL); + return get_super_private(super)->fake; +} + +/* return fake inode used to bind copied on capture nodes in the page cache */ +reiser4_internal struct inode * +get_cc_fake(const struct super_block *super /* super block + queried */ ) +{ + assert("nikita-1757", super != NULL); + return get_super_private(super)->cc; +} + +/* tree used by this file system */ +reiser4_internal reiser4_tree * +get_tree(const struct super_block * super /* super block + * queried */ ) +{ + assert("nikita-460", super != NULL); + assert("nikita-461", is_reiser4_super(super)); + return &get_super_private(super)->tree; +} + +/* Check that @super is (looks like) reiser4 super block. This is mainly for + use in assertions. 
*/ +reiser4_internal int +is_reiser4_super(const struct super_block *super /* super block + * queried */ ) +{ + return + super != NULL && + get_super_private(super) != NULL && + super->s_op == &get_super_private(super)->ops.super; +} + +reiser4_internal int +reiser4_is_set(const struct super_block *super, reiser4_fs_flag f) +{ + return test_bit((int) f, &get_super_private(super)->fs_flags); +} + +/* amount of blocks reserved for given group in file system */ +static __u64 +reserved_for_gid(const struct super_block *super UNUSED_ARG /* super + * block + * queried */ , + gid_t gid UNUSED_ARG /* group id */ ) +{ + return 0; +} + +/* amount of blocks reserved for given user in file system */ +static __u64 +reserved_for_uid(const struct super_block *super UNUSED_ARG /* super + block + queried */ , + uid_t uid UNUSED_ARG /* user id */ ) +{ + return 0; +} + +/* amount of blocks reserved for super user in file system */ +static __u64 +reserved_for_root(const struct super_block *super UNUSED_ARG /* super + block + queried */ ) +{ + return 0; +} + +/* + * true if block number @blk makes sense for the file system at @super. + */ +reiser4_internal int +reiser4_blocknr_is_sane_for(const struct super_block *super, + const reiser4_block_nr *blk) +{ + reiser4_super_info_data *sbinfo; + + assert("nikita-2957", super != NULL); + assert("nikita-2958", blk != NULL); + + if (blocknr_is_fake(blk)) + return 1; + + sbinfo = get_super_private(super); + return *blk < sbinfo->block_count; +} + +/* + * true, if block number @blk makes sense for the current file system + */ +reiser4_internal int +reiser4_blocknr_is_sane(const reiser4_block_nr *blk) +{ + return reiser4_blocknr_is_sane_for(reiser4_get_current_sb(), blk); +} + +/* + * construct various VFS related operation vectors that are embedded into @ops + * inside of @super. 
+ */ +reiser4_internal void +build_object_ops(struct super_block *super, object_ops *ops) +{ + struct inode_operations iops; + + assert("nikita-3248", super != NULL); + assert("nikita-3249", ops != NULL); + + iops = reiser4_inode_operations; + + /* setup super_operations... */ + ops->super = reiser4_super_operations; + /* ...and export operations for NFS */ + ops->export = reiser4_export_operations; + + /* install pointers to the per-super-block vectors into super-block + * fields */ + super->s_op = &ops->super; + super->s_export_op = &ops->export; + + /* cleanup XATTR related fields in inode operations---we don't support + * Linux xattr API... */ + iops.setxattr = NULL; + iops.getxattr = NULL; + iops.listxattr = NULL; + iops.removexattr = NULL; + + /* ...and we don't need ->clear_inode, because its only user was + * xattrs */ + /*ops->super.clear_inode = NULL;*/ + + ops->regular = iops; + ops->dir = iops; + + ops->file = reiser4_file_operations; + ops->symlink = reiser4_symlink_inode_operations; + ops->special = reiser4_special_inode_operations; + ops->dentry = reiser4_dentry_operations; + ops->as = reiser4_as_operations; + +#if !ENABLE_REISER4_PSEUDO + /* if we don't support pseudo files, we need neither ->open, + * nor ->lookup on regular files */ + ops->regular.lookup = NULL; + ops->file.open = NULL; +#endif +} + +#if REISER4_DEBUG_OUTPUT +/* + * debugging function: output human readable information about file system + * parameters + */ +reiser4_internal void +print_fs_info(const char *prefix, const struct super_block *s) +{ + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(s); + + printk("================ fs info (%s) =================\n", prefix); + printk("root block: %lli\ntree height: %i\n", sbinfo->tree.root_block, sbinfo->tree.height); + sa_print_info("", get_space_allocator(s)); + + printk("Oids: next to use %llu, in use %llu\n", sbinfo->next_to_use, sbinfo->oids_in_use); + printk("Block counters:\n\tblock count\t%llu\n\tfree 
blocks\t%llu\n" + "\tused blocks\t%llu\n\tgrabbed\t%llu\n\tfake allocated formatted\t%llu\n" + "\tfake allocated unformatted\t%llu\n", + reiser4_block_count(s), reiser4_free_blocks(s), + reiser4_data_blocks(s), reiser4_grabbed_blocks(s), + reiser4_fake_allocated(s), reiser4_fake_allocated_unformatted(s)); + print_key("Root directory key", sbinfo->df_plug->root_dir_key(s)); + + if ( sbinfo->diskmap_block) + printk("Diskmap is present in %llu block\n", sbinfo->diskmap_block); + else + printk("Diskmap is not present\n"); + + if (sbinfo->df_plug->print_info) { + printk("=========== disk format info (%s) =============\n", sbinfo->df_plug->h.label); + sbinfo->df_plug->print_info(s); + } + +} +#endif + + +#if REISER4_DEBUG + +/* this is called when unallocated extent pointer is added */ +void +inc_unalloc_unfm_ptr(void) +{ + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(get_current_context()->super); + reiser4_spin_lock_sb(sbinfo); + sbinfo->unalloc_extent_pointers ++; + reiser4_spin_unlock_sb(sbinfo); +} + +/* this is called when unallocated extent is converted to allocated */ +void +dec_unalloc_unfm_ptrs(int nr) +{ + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(get_current_context()->super); + reiser4_spin_lock_sb(sbinfo); + BUG_ON(sbinfo->unalloc_extent_pointers < nr); + sbinfo->unalloc_extent_pointers -= nr; + reiser4_spin_unlock_sb(sbinfo); +} + +void +inc_unfm_ef(void) +{ + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(get_current_context()->super); + reiser4_spin_lock_sb(sbinfo); + sbinfo->eflushed_unformatted ++; + reiser4_spin_unlock_sb(sbinfo); +} + +void +dec_unfm_ef(void) +{ + reiser4_super_info_data *sbinfo; + + sbinfo = get_super_private(get_current_context()->super); + reiser4_spin_lock_sb(sbinfo); + BUG_ON(sbinfo->eflushed_unformatted == 0); + sbinfo->eflushed_unformatted --; + reiser4_spin_unlock_sb(sbinfo); +} + +#endif + +/* Make Linus happy.
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/super.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/super.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,538 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Super-block functions. See super.c for details. */ + +#if !defined( __REISER4_SUPER_H__ ) +#define __REISER4_SUPER_H__ + +#include "forward.h" +#include "debug.h" +#include "tree.h" +#include "context.h" +#include "entd.h" +#include "plugin/plugin.h" +#include "wander.h" + +#include "plugin/space/space_allocator.h" + +#include "plugin/disk_format/disk_format40.h" +#include "plugin/security/perm.h" +#include "plugin/dir/dir.h" + +#include "emergency_flush.h" + +#include +#include /* for __u??, etc. */ +#include /* for struct super_block, etc. */ +#include /* for struct list_head */ +#include /* for kobject */ + +/* + * Flush algorithm parameters. + */ +typedef struct { + unsigned relocate_threshold; + unsigned relocate_distance; + unsigned written_threshold; + unsigned scan_maxnodes; +} flush_params; + +typedef enum { + /* True if this file system doesn't support hard-links (multiple + names) for directories: this is default UNIX behavior. + + If hard-links on directories are not allowed, the file system is an + Acyclic Directed Graph (modulo dot, and dotdot, of course). + + This is used by reiser4_link(). + */ + REISER4_ADG = 0, + /* set if all nodes in internal tree have the same node layout plugin. + If so, znode_guess_plugin() will return tree->node_plugin instead + of guessing plugin by plugin id stored in the node. + */ + REISER4_ONE_NODE_PLUGIN = 1, + /* if set, bsd gid assignment is supported. */ + REISER4_BSD_GID = 2, + /* [mac]_time are 32 bit in inode */ + REISER4_32_BIT_TIMES = 3, + /* allow concurrent flushes */ + REISER4_MTFLUSH = 4, + /* disable support for pseudo files.
Don't treat regular files as + * directories. */ + REISER4_NO_PSEUDO = 5, + /* do not load all bitmap blocks at mount time */ + REISER4_DONT_LOAD_BITMAP = 6 +} reiser4_fs_flag; + +/* + * VFS related operation vectors. + * + * Usually a file system has one instance of those, but in reiser4 we sometimes + * want to be able to modify vectors on per-mount basis. For example, reiser4 + * needs ->open method to handle pseudo files correctly, but if file system is + * mounted with "nopseudo" mount option, it's better to have ->open set to + * NULL, as this makes sys_open() a little bit more efficient. + * + */ +typedef struct object_ops { + struct super_operations super; + struct file_operations file; + struct dentry_operations dentry; + struct address_space_operations as; + + struct inode_operations regular; + struct inode_operations dir; + struct inode_operations symlink; + struct inode_operations special; + + struct export_operations export; +} object_ops; + +/* reiser4-specific part of super block + + Locking + + Fields immutable after mount: + + ->oid* + ->space* + ->default_[ug]id + ->mkfs_id + ->trace_flags + ->debug_flags + ->fs_flags + ->df_plug + ->optimal_io_size + ->plug + ->flush + ->u (bad name) + ->txnmgr + ->ra_params + ->fsuid + ->journal_header + ->journal_footer + + Fields protected by ->lnode_guard + + ->lnode_htable + + Fields protected by per-super block spin lock + + ->block_count + ->blocks_used + ->blocks_free + ->blocks_free_committed + ->blocks_grabbed + ->blocks_fake_allocated_unformatted + ->blocks_fake_allocated + ->blocks_flush_reserved + ->eflushed + ->blocknr_hint_default + + After journal replaying during mount, + + ->last_committed_tx + + is protected by ->tmgr.commit_semaphore + + Invariants involving this data-type: + + [sb-block-counts] + [sb-grabbed] + [sb-fake-allocated] +*/ +struct reiser4_super_info_data { + /* guard spinlock which protects reiser4 super + block fields (currently blocks_free, + blocks_free_committed) + */ +
reiser4_spin_data guard; + + /* + * object id manager + */ + /* next oid that will be returned by oid_allocate() */ + oid_t next_to_use; + /* total number of used oids */ + oid_t oids_in_use; + + /* space manager plugin */ + reiser4_space_allocator space_allocator; + + /* reiser4 internal tree */ + reiser4_tree tree; + + /* default user id used for light-weight files without their own + stat-data. */ + uid_t default_uid; + + /* default group id used for light-weight files without their own + stat-data. */ + gid_t default_gid; + + /* mkfs identifier generated at mkfs time. */ + __u32 mkfs_id; + /* amount of blocks in a file system */ + __u64 block_count; + + /* inviolable reserve */ + __u64 blocks_reserved; + + /* amount of blocks used by file system data and meta-data. */ + __u64 blocks_used; + + /* amount of free blocks. This is "working" free blocks counter. It is + like "working" bitmap, please see block_alloc.c for description. */ + __u64 blocks_free; + + /* free block count for fs committed state. This is "commit" version + of free block counter. */ + __u64 blocks_free_committed; + + /* number of blocks reserved for further allocation, for all threads. */ + __u64 blocks_grabbed; + + /* number of fake allocated unformatted blocks in tree. */ + __u64 blocks_fake_allocated_unformatted; + + /* number of fake allocated formatted blocks in tree. */ + __u64 blocks_fake_allocated; + + /* number of blocks reserved for flush operations. */ + __u64 blocks_flush_reserved; + + /* number of blocks reserved for cluster operations. */ + __u64 blocks_clustered; + + /* unique file-system identifier */ + /* does this conform to Andreas Dilger UUID stuff? */ + __u32 fsuid; + + /* per-fs tracing flags. Use reiser4_trace_flags enum to set + bits in it. */ + __u32 trace_flags; + + /* per-fs log flags. Use reiser4_log_flags enum to set + bits in it. */ + __u32 log_flags; + __u32 oid_to_log; + + /* per-fs debugging flags. This is bitmask populated from + reiser4_debug_flags enum. 
*/ + __u32 debug_flags; + + /* super block flags */ + + /* file-system wide flags. See reiser4_fs_flag enum */ + unsigned long fs_flags; + + /* transaction manager */ + txn_mgr tmgr; + + /* ent thread */ + entd_context entd; + + /* fake inode used to bind formatted nodes */ + struct inode *fake; + /* inode used to bind bitmaps (and journal heads) */ + struct inode *bitmap; + /* inode used to bind copied on capture nodes */ + struct inode *cc; + + /* disk layout plugin */ + disk_format_plugin *df_plug; + + /* disk layout specific part of reiser4 super info data */ + union { + format40_super_info format40; + } u; + + /* + * value we return in st_blksize on stat(2). + */ + unsigned long optimal_io_size; + + /* parameters for the flush algorithm */ + flush_params flush; + + /* see emergency_flush.c for details */ + reiser4_spin_data eflush_guard; + /* number of emergency flushed nodes */ + int eflushed; +#if REISER4_USE_EFLUSH + /* hash table used by emergency flush. Protected by ->eflush_guard */ + ef_hash_table efhash_table; +#endif + /* pointers to jnodes for journal header and footer */ + jnode *journal_header; + jnode *journal_footer; + + journal_location jloc; + + /* head block number of last committed transaction */ + __u64 last_committed_tx; + + /* we remember last written location for using as a hint for + new block allocation */ + __u64 blocknr_hint_default; + + /* committed number of files (oid allocator state variable ) */ + __u64 nr_files_committed; + + ra_params_t ra_params; + + /* A semaphore for serializing cut tree operation if + out-of-free-space: the only one cut_tree thread is allowed to grab + space from reserved area (it is 5% of disk space) */ + struct semaphore delete_sema; + /* task owning ->delete_sema */ + struct task_struct *delete_sema_owner; + + /* serialize semaphore */ + struct semaphore flush_sema; + + /* Diskmap's blocknumber */ + __u64 diskmap_block; + + /* What to do in case of error */ + int onerror; + + /* operations for objects on 
this file system */ + object_ops ops; + + /* dir_cursor_info see plugin/dir/dir.[ch] for more details */ + d_cursor_info d_info; + +#ifdef CONFIG_REISER4_BADBLOCKS + /* Alternative master superblock offset (in bytes) */ + unsigned long altsuper; +#endif +#if REISER4_DEBUG + /* minimum used blocks value (includes super blocks, bitmap blocks and + * other fs reserved areas), depends on fs format and fs size. */ + __u64 min_blocks_used; + /* number of space allocated by kmalloc. For debugging. */ + int kmallocs; + + /* + * when debugging is on, all jnodes (including znodes, bitmaps, etc.) + * are kept on a list anchored at sbinfo->all_jnodes. This list is + * protected by sbinfo->all_guard spin lock. This lock should be taken + * with _irq modifier, because it is also modified from interrupt + * contexts (by RCU). + */ + + spinlock_t all_guard; + /* list of all jnodes */ + struct list_head all_jnodes; + + /*XXX debugging code */ + __u64 eflushed_unformatted; /* number of eflushed unformatted nodes */ + __u64 unalloc_extent_pointers; /* number of unallocated extent pointers in the tree */ +#endif + struct repacker * repacker; + struct page * status_page; + struct bio * status_bio; +}; + + + +extern reiser4_super_info_data *get_super_private_nocheck(const struct + super_block *super); + +extern struct super_operations reiser4_super_operations; + +/* Return reiser4-specific part of super block */ +static inline reiser4_super_info_data * +get_super_private(const struct super_block * super) +{ + assert("nikita-447", super != NULL); + + return (reiser4_super_info_data *) super->s_fs_info; +} + +/* "Current" super-block: main super block used during current system + call. Reference to this super block is stored in reiser4_context. */ +static inline struct super_block * +reiser4_get_current_sb(void) +{ + return get_current_context()->super; +} + +/* Reiser4-specific part of "current" super-block: main super block used + during current system call. 
Reference to this super block is stored in + reiser4_context. */ +static inline reiser4_super_info_data * +get_current_super_private(void) +{ + return get_super_private(reiser4_get_current_sb()); +} + +static inline ra_params_t * +get_current_super_ra_params(void) +{ + return &(get_current_super_private()->ra_params); +} + +/* + * true, if file system on @super is read-only + */ +static inline int rofs_super(struct super_block *super) +{ + return super->s_flags & MS_RDONLY; +} + +/* + * true, if @tree represents read-only file system + */ +static inline int rofs_tree(reiser4_tree *tree) +{ + return rofs_super(tree->super); +} + +/* + * true, if file system where @inode lives on, is read-only + */ +static inline int rofs_inode(struct inode *inode) +{ + return rofs_super(inode->i_sb); +} + +/* + * true, if file system where @node lives on, is read-only + */ +static inline int rofs_jnode(jnode *node) +{ + return rofs_tree(jnode_get_tree(node)); +} + +extern __u64 reiser4_current_block_count(void); + +extern void build_object_ops(struct super_block *super, object_ops *ops); + +#define REISER4_SUPER_MAGIC 0x52345362 /* (*(__u32 *)"R4Sb"); */ + +#define spin_ordering_pred_super(private) (1) +SPIN_LOCK_FUNCTIONS(super, reiser4_super_info_data, guard); + +#define spin_ordering_pred_super_eflush(private) (1) +SPIN_LOCK_FUNCTIONS(super_eflush, reiser4_super_info_data, eflush_guard); + +/* + * lock reiser4-specific part of super block + */ +static inline void reiser4_spin_lock_sb(reiser4_super_info_data *sbinfo) +{ + spin_lock_super(sbinfo); +} + +/* + * unlock reiser4-specific part of super block + */ +static inline void reiser4_spin_unlock_sb(reiser4_super_info_data *sbinfo) +{ + spin_unlock_super(sbinfo); +} + +/* + * lock emergency flush data-structures for super block @s + */ +static inline void spin_lock_eflush(const struct super_block * s) +{ + reiser4_super_info_data * sbinfo = get_super_private (s); + spin_lock_super_eflush(sbinfo); +} + +/* + * unlock emergency 
flush data-structures for super block @s + */ +static inline void spin_unlock_eflush(const struct super_block * s) +{ + reiser4_super_info_data * sbinfo = get_super_private (s); + spin_unlock_super_eflush(sbinfo); +} + + +extern __u64 flush_reserved ( const struct super_block*); +extern int reiser4_is_set(const struct super_block *super, reiser4_fs_flag f); +extern long statfs_type(const struct super_block *super); +extern __u64 reiser4_block_count(const struct super_block *super); +extern void reiser4_set_block_count(const struct super_block *super, __u64 nr); +extern __u64 reiser4_data_blocks(const struct super_block *super); +extern void reiser4_set_data_blocks(const struct super_block *super, __u64 nr); +extern __u64 reiser4_free_blocks(const struct super_block *super); +extern void reiser4_set_free_blocks(const struct super_block *super, __u64 nr); +extern __u32 reiser4_mkfs_id(const struct super_block *super); +extern void reiser4_set_mkfs_id(const struct super_block *super, __u32 id); + +extern __u64 reiser4_free_committed_blocks(const struct super_block *super); + +extern __u64 reiser4_grabbed_blocks(const struct super_block *); +extern __u64 reiser4_fake_allocated(const struct super_block *); +extern __u64 reiser4_fake_allocated_unformatted(const struct super_block *); +extern __u64 reiser4_clustered_blocks(const struct super_block *); + +extern long reiser4_reserved_blocks(const struct super_block *super, uid_t uid, gid_t gid); + +extern reiser4_space_allocator *get_space_allocator(const struct super_block + *super); +extern reiser4_oid_allocator *get_oid_allocator(const struct super_block + *super); +extern struct inode *get_super_fake(const struct super_block *super); +extern struct inode *get_cc_fake(const struct super_block *super); +extern reiser4_tree *get_tree(const struct super_block *super); +extern int is_reiser4_super(const struct super_block *super); + +extern int reiser4_blocknr_is_sane(const reiser4_block_nr *blk); +extern int 
reiser4_blocknr_is_sane_for(const struct super_block *super, + const reiser4_block_nr *blk); + +/* Maximal possible object id. */ +#define ABSOLUTE_MAX_OID ((oid_t)~0) + +#define OIDS_RESERVED ( 1 << 16 ) +int oid_init_allocator(struct super_block *, oid_t nr_files, oid_t next); +oid_t oid_allocate(struct super_block *); +int oid_release(struct super_block *, oid_t); +oid_t oid_next(const struct super_block *); +void oid_count_allocated(void); +void oid_count_released(void); +long oids_used(const struct super_block *); +long oids_free(const struct super_block *); + + +#if REISER4_DEBUG +void print_fs_info(const char *prefix, const struct super_block *); +#endif + +#if REISER4_DEBUG + +void inc_unalloc_unfm_ptr(void); +void dec_unalloc_unfm_ptrs(int nr); +void inc_unfm_ef(void); +void dec_unfm_ef(void); + +#else + +#define inc_unalloc_unfm_ptr() noop +#define dec_unalloc_unfm_ptrs(nr) noop +#define inc_unfm_ef() noop +#define dec_unfm_ef() noop + +#endif + + +/* __REISER4_SUPER_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/tap.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/tap.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,389 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* + Tree Access Pointer (tap). + + tap is data structure combining coord and lock handle (mostly). It is + useful when one has to scan tree nodes (for example, in readdir, or flush), + for tap functions allow to move tap in either direction transparently + crossing unit/item/node borders. + + Tap doesn't provide automatic synchronization of its fields as it is + supposed to be per-thread object. 
+*/ + +#include "forward.h" +#include "debug.h" +#include "coord.h" +#include "tree.h" +#include "context.h" +#include "tap.h" +#include "znode.h" +#include "tree_walk.h" + +#if REISER4_DEBUG +static int tap_invariant(const tap_t * tap); +static void tap_check(const tap_t * tap); +#else +#define tap_check(tap) noop +#endif + +/** load node tap is pointing to, if not loaded already */ +reiser4_internal int +tap_load(tap_t * tap) +{ + tap_check(tap); + if (tap->loaded == 0) { + int result; + + result = zload_ra(tap->coord->node, &tap->ra_info); + if (result != 0) + return result; + coord_clear_iplug(tap->coord); + } + ++tap->loaded; + tap_check(tap); + return 0; +} + +/** release node tap is pointing to. Dual to tap_load() */ +reiser4_internal void +tap_relse(tap_t * tap) +{ + tap_check(tap); + if (tap->loaded > 0) { + --tap->loaded; + if (tap->loaded == 0) { + zrelse(tap->coord->node); + } + } + tap_check(tap); +} + +/** + * init tap to consist of @coord and @lh. Locks on nodes will be acquired with + * @mode + */ +reiser4_internal void +tap_init(tap_t * tap, coord_t * coord, lock_handle * lh, znode_lock_mode mode) +{ + tap->coord = coord; + tap->lh = lh; + tap->mode = mode; + tap->loaded = 0; + tap_list_clean(tap); + init_ra_info(&tap->ra_info); +} + +/** add @tap to the per-thread list of all taps */ +reiser4_internal void +tap_monitor(tap_t * tap) +{ + assert("nikita-2623", tap != NULL); + tap_check(tap); + tap_list_push_front(taps_list(), tap); + tap_check(tap); +} + +/* duplicate @src into @dst. Copy lock handle. @dst is not initially + * loaded. 
*/ +reiser4_internal void +tap_copy(tap_t * dst, tap_t * src) +{ + assert("nikita-3193", src != NULL); + assert("nikita-3194", dst != NULL); + + *dst->coord = *src->coord; + if (src->lh->node) + copy_lh(dst->lh, src->lh); + dst->mode = src->mode; + dst->loaded = 0; + tap_list_clean(dst); + dst->ra_info = src->ra_info; +} + +/** finish with @tap */ +reiser4_internal void +tap_done(tap_t * tap) +{ + assert("nikita-2565", tap != NULL); + tap_check(tap); + if (tap->loaded > 0) + zrelse(tap->coord->node); + done_lh(tap->lh); + tap->loaded = 0; + tap_list_remove_clean(tap); + tap->coord->node = NULL; +} + +/** + * move @tap to the new node, locked with @target. Load @target, if @tap was + * already loaded. + */ +reiser4_internal int +tap_move(tap_t * tap, lock_handle * target) +{ + int result = 0; + + assert("nikita-2567", tap != NULL); + assert("nikita-2568", target != NULL); + assert("nikita-2570", target->node != NULL); + assert("nikita-2569", tap->coord->node == tap->lh->node); + + tap_check(tap); + if (tap->loaded > 0) + result = zload_ra(target->node, &tap->ra_info); + + if (result == 0) { + if (tap->loaded > 0) + zrelse(tap->coord->node); + done_lh(tap->lh); + copy_lh(tap->lh, target); + tap->coord->node = target->node; + coord_clear_iplug(tap->coord); + } + tap_check(tap); + return result; +} + +/** + * move @tap to @target. Acquire lock on @target, if @tap was already + * loaded. 
+ */ +static int +tap_to(tap_t * tap, znode * target) +{ + int result; + + assert("nikita-2624", tap != NULL); + assert("nikita-2625", target != NULL); + + tap_check(tap); + result = 0; + if (tap->coord->node != target) { + lock_handle here; + + init_lh(&here); + result = longterm_lock_znode(&here, target, + tap->mode, ZNODE_LOCK_HIPRI); + if (result == 0) { + result = tap_move(tap, &here); + done_lh(&here); + } + } + tap_check(tap); + return result; +} + +/** + * move @tap to given @target, loading and locking @target->node if + * necessary + */ +reiser4_internal int +tap_to_coord(tap_t * tap, coord_t * target) +{ + int result; + + tap_check(tap); + result = tap_to(tap, target->node); + if (result == 0) + coord_dup(tap->coord, target); + tap_check(tap); + return result; +} + +/** return list of all taps */ +reiser4_internal tap_list_head * +taps_list(void) +{ + return &get_current_context()->taps; +} + +/** helper function for go_{next,prev}_{item,unit,node}() */ +reiser4_internal int +go_dir_el(tap_t * tap, sideof dir, int units_p) +{ + coord_t dup; + coord_t *coord; + int result; + + int (*coord_dir) (coord_t *); + int (*get_dir_neighbor) (lock_handle *, znode *, int, int); + void (*coord_init) (coord_t *, const znode *); + ON_DEBUG(int (*coord_check) (const coord_t *)); + + assert("nikita-2556", tap != NULL); + assert("nikita-2557", tap->coord != NULL); + assert("nikita-2558", tap->lh != NULL); + assert("nikita-2559", tap->coord->node != NULL); + + tap_check(tap); + if (dir == LEFT_SIDE) { + coord_dir = units_p ? coord_prev_unit : coord_prev_item; + get_dir_neighbor = reiser4_get_left_neighbor; + coord_init = coord_init_last_unit; + } else { + coord_dir = units_p ? coord_next_unit : coord_next_item; + get_dir_neighbor = reiser4_get_right_neighbor; + coord_init = coord_init_first_unit; + } + ON_DEBUG(coord_check = units_p ? 
coord_is_existing_unit : coord_is_existing_item); + assert("nikita-2560", coord_check(tap->coord)); + + coord = tap->coord; + coord_dup(&dup, coord); + if (coord_dir(&dup) != 0) { + do { + /* move to the left neighboring node */ + lock_handle dup; + + init_lh(&dup); + result = get_dir_neighbor( + &dup, coord->node, (int) tap->mode, GN_CAN_USE_UPPER_LEVELS); + if (result == 0) { + result = tap_move(tap, &dup); + if (result == 0) + coord_init(tap->coord, dup.node); + done_lh(&dup); + } + /* skip empty nodes */ + } while ((result == 0) && node_is_empty(coord->node)); + } else { + result = 0; + coord_dup(coord, &dup); + } + assert("nikita-2564", ergo(!result, coord_check(tap->coord))); + tap_check(tap); + return result; +} + +/** + * move @tap to the next unit, transparently crossing item and node + * boundaries + */ +reiser4_internal int +go_next_unit(tap_t * tap) +{ + return go_dir_el(tap, RIGHT_SIDE, 1); +} + +/** + * move @tap to the previous unit, transparently crossing item and node + * boundaries + */ +reiser4_internal int +go_prev_unit(tap_t * tap) +{ + return go_dir_el(tap, LEFT_SIDE, 1); +} + +/** + * @shift times apply @actor to the @tap. This is used to move @tap by + * @shift units (or items, or nodes) in either direction. 
+ */ +static int +rewind_to(tap_t * tap, go_actor_t actor, int shift) +{ + int result; + + assert("nikita-2555", shift >= 0); + assert("nikita-2562", tap->coord->node == tap->lh->node); + + tap_check(tap); + result = tap_load(tap); + if (result != 0) + return result; + + for (; shift > 0; --shift) { + result = actor(tap); + assert("nikita-2563", tap->coord->node == tap->lh->node); + if (result != 0) + break; + } + tap_relse(tap); + tap_check(tap); + return result; +} + +/** move @tap @shift units rightward */ +reiser4_internal int +rewind_right(tap_t * tap, int shift) +{ + return rewind_to(tap, go_next_unit, shift); +} + +/** move @tap @shift units leftward */ +reiser4_internal int +rewind_left(tap_t * tap, int shift) +{ + return rewind_to(tap, go_prev_unit, shift); +} + +#if REISER4_DEBUG +/** debugging function: print @tap content in human readable form */ +static void +print_tap(const char * prefix, const tap_t * tap) +{ + if (tap == NULL) { + printk("%s: null tap\n", prefix); + return; + } + printk("%s: loaded: %i, in-list: %i, node: %p, mode: %s\n", prefix, + tap->loaded, tap_list_is_clean(tap), tap->lh->node, + lock_mode_name(tap->mode)); + print_coord("\tcoord", tap->coord, 0); +} + +/** check [tap-sane] invariant */ +static int tap_invariant(const tap_t * tap) +{ + /* [tap-sane] invariant */ + + if (tap == NULL) + return 1; + /* tap->mode is one of + * + * {ZNODE_NO_LOCK, ZNODE_READ_LOCK, ZNODE_WRITE_LOCK}, and + */ + if (tap->mode != ZNODE_NO_LOCK && + tap->mode != ZNODE_READ_LOCK && tap->mode != ZNODE_WRITE_LOCK) + return 2; + /* tap->coord != NULL, and */ + if (tap->coord == NULL) + return 3; + /* tap->lh != NULL, and */ + if (tap->lh == NULL) + return 4; + /* tap->loaded > 0 => znode_is_loaded(tap->coord->node), and */ + if (!ergo(tap->loaded, znode_is_loaded(tap->coord->node))) + return 5; + /* tap->coord->node == tap->lh->node if tap->lh->node is not 0 */ + if (tap->lh->node != NULL && tap->coord->node != tap->lh->node) + return 6; + return 0; +} + 
+/** debugging function: check internal @tap consistency */ +static void tap_check(const tap_t * tap) +{ + int result; + + result = tap_invariant(tap); + if (result != 0) { + print_tap("broken", tap); + reiser4_panic("nikita-2831", "tap broken: %i\n", result); + } +} +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/tap.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/tap.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,73 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* Tree Access Pointers. See tap.c for more details. */ + +#if !defined( __REISER4_TAP_H__ ) +#define __REISER4_TAP_H__ + +#include "forward.h" +#include "type_safe_list.h" +#include "readahead.h" + +TYPE_SAFE_LIST_DECLARE(tap); + +/** + tree_access_pointer aka tap. Data structure combining coord_t and lock + handle. + Invariants involving this data-type, see doc/lock-ordering for details: + + [tap-sane] + */ +struct tree_access_pointer { + /* coord tap is at */ + coord_t *coord; + /* lock handle on ->coord->node */ + lock_handle *lh; + /* mode of lock acquired by this tap */ + znode_lock_mode mode; + /* incremented by tap_load(). Decremented by tap_relse(). 
*/ + int loaded; + /* list of taps */ + tap_list_link linkage; + /* read-ahead hint */ + ra_info_t ra_info; +}; + +TYPE_SAFE_LIST_DEFINE(tap, tap_t, linkage); + +typedef int (*go_actor_t) (tap_t * tap); + +extern int tap_load(tap_t * tap); +extern void tap_relse(tap_t * tap); +extern void tap_init(tap_t * tap, coord_t * coord, lock_handle * lh, znode_lock_mode mode); +extern void tap_monitor(tap_t * tap); +extern void tap_copy(tap_t * dst, tap_t * src); +extern void tap_done(tap_t * tap); +extern int tap_move(tap_t * tap, lock_handle * target); +extern int tap_to_coord(tap_t * tap, coord_t * target); + +extern int go_dir_el(tap_t * tap, sideof dir, int units_p); +extern int go_next_unit(tap_t * tap); +extern int go_prev_unit(tap_t * tap); +extern int rewind_right(tap_t * tap, int shift); +extern int rewind_left(tap_t * tap, int shift); + +extern tap_list_head *taps_list(void); + +#define for_all_taps( tap ) \ + for (tap = tap_list_front ( taps_list() ); \ + ! tap_list_end ( taps_list(), tap ); \ + tap = tap_list_next ( tap ) ) + +/* __REISER4_TAP_H__ */ +#endif +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/tree.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/tree.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1825 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* + * KEYS IN A TREE. + * + * The tree consists of nodes located on the disk. Node in the tree is either + * formatted or unformatted. Formatted node is one that has structure + * understood by the tree balancing and traversal code. Formatted nodes are + * further classified into leaf and internal nodes. The latter distinction is + * (almost) of only historical importance: general structure of leaves and + * internal nodes is the same in Reiser4. 
Unformatted nodes contain raw data + * that are part of bodies of ordinary files and attributes. + * + * Each node in the tree spans some interval in the key space. Key ranges for + * all nodes in the tree are disjoint. Actually, this only holds in some weak + * sense, because of the non-unique keys: intersection of key ranges for + * different nodes is either empty, or consists of exactly one key. + * + * Formatted node consists of a sequence of items. Each item spans some + * interval in key space. Key ranges for all items in a tree are disjoint, + * modulo non-unique keys again. Items within nodes are ordered in the key + * order of the smallest key in an item. + * + * Particular type of item can be further split into units. Unit is piece of + * item that can be cut from item and moved into another item of the same + * type. Units are used by balancing code to repack data during balancing. + * + * Unit can be further split into smaller entities (for example, extent unit + * represents several pages, and it is natural for extent code to operate on + * particular pages and even bytes within one unit), but this is of no + * relevance to the generic balancing and lookup code. + * + * Although item is said to "span" range or interval of keys, it is not + * necessary that item contains piece of data addressable by each and every + * key in this range. For example, compound directory item, consisting of + * units corresponding to directory entries and keyed by hashes of file names, + * looks more as having "discrete spectrum": only some disjoint keys inside + * range occupied by this item really address data. + * + * Nonetheless, each item always has well-defined least (minimal) key, that + * is recorded in item header, stored in the node this item is in. Also, item + * plugin can optionally define method ->max_key_inside() returning maximal + * key that can _possibly_ be located within this item. 
This method is used + * (mainly) to determine when given piece of data should be merged into + * existing item, instead of creating new one. Because of this, even though + * ->max_key_inside() can be larger than any key actually located in the item, + * intervals + * + * [ min_key( item ), ->max_key_inside( item ) ] + * + * are still disjoint for all items within the _same_ node. + * + * In memory node is represented by znode. It plays several roles: + * + * . something locks are taken on + * + * . something tracked by transaction manager (this is going to change) + * + * . something used to access node data + * + * . something used to maintain tree structure in memory: sibling and + * parental linkage. + * + * . something used to organize nodes into "slums" + * + * For more on znodes, see znode.[ch] + * + * DELIMITING KEYS + * + * To simplify balancing, allow some flexibility in locking and speed up + * important coord cache optimization, we keep delimiting keys of nodes in + * memory. Depending on disk format (implemented by appropriate node plugin) + * node on disk can record both left and right delimiting key, only one of + * them, or none. Still, our balancing and tree traversal code keep both + * delimiting keys for a node that is in memory stored in the znode. When + * node is first brought into memory during tree traversal, its left + * delimiting key is taken from its parent, and its right delimiting key is + * either next key in its parent, or is right delimiting key of parent if + * node is the rightmost child of parent. + * + * Physical consistency of delimiting key is protected by special dk + * read-write lock. That is, delimiting keys can only be inspected or + * modified under this lock. But dk lock is only sufficient for fast + * "pessimistic" check, because to simplify code and to decrease lock + * contention, balancing (carry) only updates delimiting keys right before + * unlocking all locked nodes on the given tree level. 
For example, + * coord-by-key cache scans LRU list of recently accessed znodes. For each + * node it first does fast check under dk spin lock. If key looked for is + * not between delimiting keys for this node, next node is inspected and so + * on. If key is inside of the key range, long term lock is taken on node + * and key range is rechecked. + * + * COORDINATES + * + * To find something in the tree, you supply a key, and the key is resolved + * by coord_by_key() into a coord (coordinate) that is valid as long as the + * node the coord points to remains locked. As mentioned above trees + * consist of nodes that consist of items that consist of units. A unit is + * the smallest and indivisible piece of tree as far as balancing and tree + * search are concerned. Each node, item, and unit can be addressed by + * giving its level in the tree and the key occupied by this entity. A node + * knows what the key ranges are of the items within it, and how to find its + * items and invoke their item handlers, but it does not know how to access + * individual units within its items except through the item handlers. + * coord is a structure containing a pointer to the node, the ordinal number + * of the item within this node (a sort of item offset), and the ordinal + * number of the unit within this item. + * + * TREE LOOKUP + * + * There are two types of access to the tree: lookup and modification. + * + * Lookup is a search for the key in the tree. Search can look for either + * exactly the key given to it, or for the largest key that is not greater + * than the key given to it. This distinction is determined by "bias" + * parameter of search routine (coord_by_key()). coord_by_key() either + * returns error (key is not in the tree, or some kind of external error + * occurred), or successfully resolves key into coord. + * + * This resolution is done by traversing tree top-to-bottom from root level + * to the desired level. 
On levels above twig level (level one above the + * leaf level) nodes consist exclusively of internal items. Internal item is + * nothing more than pointer to the tree node on the child level. On twig + * level nodes consist of internal items intermixed with extent + * items. Internal items form normal search tree structure used by traversal + * to descend through the tree. + * + * TREE LOOKUP OPTIMIZATIONS + * + * Tree lookup described above is expensive even if all nodes traversed are + * already in the memory: for each node binary search within it has to be + * performed and binary searches are CPU consuming and tend to destroy CPU + * caches. + * + * Several optimizations are used to work around this: + * + * . cbk_cache (look-aside cache for tree traversals, see search.c for + * details) + * + * . seals (see seal.[ch]) + * + * . vroot (see search.c) + * + * General search-by-key is layered thusly: + * + * [check seal, if any] --ok--> done + * | + * failed + * | + * V + * [vroot defined] --no--> node = tree_root + * | | + * yes | + * | | + * V | + * node = vroot | + * | | + * | | + * | | + * V V + * [check cbk_cache for key] --ok--> done + * | + * failed + * | + * V + * [start tree traversal from node] + * + */ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "key.h" +#include "coord.h" +#include "plugin/item/static_stat.h" +#include "plugin/item/item.h" +#include "plugin/node/node.h" +#include "plugin/plugin.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree_walk.h" +#include "carry.h" +#include "carry_ops.h" +#include "tap.h" +#include "tree.h" +#include "vfs_ops.h" +#include "page_cache.h" +#include "super.h" +#include "reiser4.h" +#include "inode.h" + +#include /* for struct super_block */ +#include + +/* Disk address (block number) never ever used for any real tree node. This is + used as block number of "uber" znode. + + Invalid block addresses are 0 by tradition. 
+ +*/ +const reiser4_block_nr UBER_TREE_ADDR = 0ull; + +#define CUT_TREE_MIN_ITERATIONS 64 + +static int find_child_by_addr(znode * parent, znode * child, coord_t * result); + +/* return node plugin of coord->node */ +reiser4_internal node_plugin * +node_plugin_by_coord(const coord_t * coord) +{ + assert("vs-1", coord != NULL); + assert("vs-2", coord->node != NULL); + + return coord->node->nplug; +} + +/* insert item into tree. Fields of @coord are updated so that they can be + * used by consequent insert operation. */ +reiser4_internal insert_result +insert_by_key(reiser4_tree * tree /* tree to insert new item + * into */ , + const reiser4_key * key /* key of new item */ , + reiser4_item_data * data /* parameters for item + * creation */ , + coord_t * coord /* resulting insertion coord */ , + lock_handle * lh /* resulting lock + * handle */ , + tree_level stop_level /** level where to insert */ , + __u32 flags /* insertion flags */ ) +{ + int result; + + assert("nikita-358", tree != NULL); + assert("nikita-360", coord != NULL); + + result = coord_by_key(tree, key, coord, lh, ZNODE_WRITE_LOCK, + FIND_EXACT, stop_level, stop_level, flags | CBK_FOR_INSERT, 0/*ra_info*/); + switch (result) { + default: + break; + case CBK_COORD_FOUND: + result = IBK_ALREADY_EXISTS; + break; + case CBK_COORD_NOTFOUND: + assert("nikita-2017", coord->node != NULL); + result = insert_by_coord(coord, data, key, lh, 0 /*flags */ ); + break; + } + return result; +} + +/* insert item by calling carry. 
Helper function called if short-cut + insertion failed */ +static insert_result +insert_with_carry_by_coord(coord_t * coord /* coord where to insert */ , + lock_handle * lh /* lock handle of insertion + * node */ , + reiser4_item_data * data /* parameters of new + * item */ , + const reiser4_key * key /* key of new item */ , + carry_opcode cop /* carry operation to perform */ , + cop_insert_flag flags /* carry flags */) +{ + int result; + carry_pool *pool; + carry_level lowest_level; + carry_op *op; + carry_insert_data cdata; + + assert("umka-314", coord != NULL); + + pool = init_carry_pool(); + if (IS_ERR(pool)) + return PTR_ERR(pool); + init_carry_level(&lowest_level, pool); + + op = post_carry(&lowest_level, cop, coord->node, 0); + if (IS_ERR(op) || (op == NULL)) { + done_carry_pool(pool); + return RETERR(op ? PTR_ERR(op) : -EIO); + } + cdata.coord = coord; + cdata.data = data; + cdata.key = key; + op->u.insert.d = &cdata; + if (flags == 0) + flags = znode_get_tree(coord->node)->carry.insert_flags; + op->u.insert.flags = flags; + op->u.insert.type = COPT_ITEM_DATA; + op->u.insert.child = 0; + if (lh != NULL) { + assert("nikita-3245", lh->node == coord->node); + lowest_level.track_type = CARRY_TRACK_CHANGE; + lowest_level.tracked = lh; + } + + result = carry(&lowest_level, 0); + done_carry_pool(pool); + + return result; +} + +/* form carry queue to perform paste of @data with @key at @coord, and launch + its execution by calling carry(). + + Instruct carry to update @lh if, after balancing, insertion coord moves into + a different block. 
+ +*/ +static int +paste_with_carry(coord_t * coord /* coord of paste */ , + lock_handle * lh /* lock handle of node + * where item is + * pasted */ , + reiser4_item_data * data /* parameters of new + * item */ , + const reiser4_key * key /* key of new item */ , + unsigned flags /* paste flags */ ) +{ + int result; + carry_pool *pool; + carry_level lowest_level; + carry_op *op; + carry_insert_data cdata; + + assert("umka-315", coord != NULL); + assert("umka-316", key != NULL); + + pool = init_carry_pool(); + if (IS_ERR(pool)) + return PTR_ERR(pool); + init_carry_level(&lowest_level, pool); + + op = post_carry(&lowest_level, COP_PASTE, coord->node, 0); + if (IS_ERR(op) || (op == NULL)) { + done_carry_pool(pool); + return RETERR(op ? PTR_ERR(op) : -EIO); + } + cdata.coord = coord; + cdata.data = data; + cdata.key = key; + op->u.paste.d = &cdata; + if (flags == 0) + flags = znode_get_tree(coord->node)->carry.paste_flags; + op->u.paste.flags = flags; + op->u.paste.type = COPT_ITEM_DATA; + if (lh != NULL) { + lowest_level.track_type = CARRY_TRACK_CHANGE; + lowest_level.tracked = lh; + } + + result = carry(&lowest_level, 0); + done_carry_pool(pool); + + return result; +} + +/* insert item at the given coord. + + First try to skip carry by directly calling ->create_item() method of node + plugin. If this is impossible (there is not enough free space in the node, + or leftmost item in the node is created), call insert_with_carry_by_coord() + that will do full carry(). + +*/ +reiser4_internal insert_result +insert_by_coord(coord_t * coord /* coord where to + * insert. 
coord->node has + * to be write locked by + * caller */ , + reiser4_item_data * data /* data to be + * inserted */ , + const reiser4_key * key /* key of new item */ , + lock_handle * lh /* lock handle of write + * lock on node */ , + __u32 flags /* insertion flags */ ) +{ + unsigned item_size; + int result; + znode *node; + + assert("vs-247", coord != NULL); + assert("vs-248", data != NULL); + assert("vs-249", data->length >= 0); + assert("nikita-1191", znode_is_write_locked(coord->node)); + + node = coord->node; + coord_clear_iplug(coord); + result = zload(node); + if (result != 0) + return result; + + item_size = space_needed(node, NULL, data, 1); + if (item_size > znode_free_space(node) && + (flags & COPI_DONT_SHIFT_LEFT) && (flags & COPI_DONT_SHIFT_RIGHT) && (flags & COPI_DONT_ALLOCATE)) { + /* we are forced to use free space of coord->node and new item + does not fit into it. + + Currently we get here only when we allocate and copy units + of extent item from a node to its left neighbor during + "squalloc"-ing. If @node (this is left neighbor) does not + have enough free space - we do not want to attempt any + shifting and allocations because we are in squeezing and + everything to the left of @node is tightly packed. + */ + result = -E_NODE_FULL; + } else if ((item_size <= znode_free_space(node)) && + !coord_is_before_leftmost(coord) && + (node_plugin_by_node(node)->fast_insert != NULL) && node_plugin_by_node(node)->fast_insert(coord)) { + /* shortcut insertion without carry() overhead. + + Only possible if: + + - there is enough free space + + - insertion is not into the leftmost position in a node + (otherwise it would require updating of delimiting key in a + parent) + + - node plugin agrees with this + + */ + result = node_plugin_by_node(node)->create_item(coord, key, data, NULL); + znode_make_dirty(node); + } else { + /* otherwise do full-fledged carry(). 
*/ + result = insert_with_carry_by_coord(coord, lh, data, key, COP_INSERT, flags); + } + zrelse(node); + return result; +} + +/* @coord is set to leaf level and @data is to be inserted to twig level */ +reiser4_internal insert_result +insert_extent_by_coord(coord_t * coord /* coord where to insert. coord->node has to be write locked by caller */ , + reiser4_item_data * data /* data to be inserted */ , + const reiser4_key * key /* key of new item */ , + lock_handle * lh /* lock handle of write lock on node */) +{ + assert("vs-405", coord != NULL); + assert("vs-406", data != NULL); + assert("vs-407", data->length > 0); + assert("vs-408", znode_is_write_locked(coord->node)); + assert("vs-409", znode_get_level(coord->node) == LEAF_LEVEL); + + return insert_with_carry_by_coord(coord, lh, data, key, COP_EXTENT, 0 /* flags */ ); +} + +/* Insert into the item at the given coord. + + First try to skip carry by directly calling ->paste() method of item + plugin. If this is impossible (there is not enough free space in the node, + or we are pasting into leftmost position in the node), call + paste_with_carry() that will do full carry(). 
+ +*/ +/* paste_into_item */ +reiser4_internal int +insert_into_item(coord_t * coord /* coord of pasting */ , + lock_handle * lh /* lock handle on node involved */ , + const reiser4_key * key /* key of unit being pasted */ , + reiser4_item_data * data /* parameters for new unit */ , + unsigned flags /* insert/paste flags */ ) +{ + int result; + int size_change; + node_plugin *nplug; + item_plugin *iplug; + + assert("umka-317", coord != NULL); + assert("umka-318", key != NULL); + + iplug = item_plugin_by_coord(coord); + nplug = node_plugin_by_coord(coord); + + assert("nikita-1480", iplug == data->iplug); + + size_change = space_needed(coord->node, coord, data, 0); + if (size_change > (int) znode_free_space(coord->node) && + (flags & COPI_DONT_SHIFT_LEFT) && (flags & COPI_DONT_SHIFT_RIGHT) && (flags & COPI_DONT_ALLOCATE)) { + /* we are forced to use free space of coord->node and new data + does not fit into it. */ + return -E_NODE_FULL; + } + + /* shortcut paste without carry() overhead. + + Only possible if: + + - there is enough free space + + - paste is not into the leftmost unit in a node (otherwise + it would require updating of delimiting key in a parent) + + - node plugin agrees with this + + - item plugin agrees with us + */ + if (size_change <= (int) znode_free_space(coord->node) && + (coord->item_pos != 0 || + coord->unit_pos != 0 || coord->between == AFTER_UNIT) && + coord->unit_pos != 0 && nplug->fast_paste != NULL && + nplug->fast_paste(coord) && + iplug->b.fast_paste != NULL && iplug->b.fast_paste(coord)) { + if (size_change > 0) + nplug->change_item_size(coord, size_change); + /* NOTE-NIKITA: huh? where @key is used? */ + result = iplug->b.paste(coord, data, NULL); + if (size_change < 0) + nplug->change_item_size(coord, size_change); + znode_make_dirty(coord->node); + } else + /* otherwise do full-fledged carry(). 
*/ + result = paste_with_carry(coord, lh, data, key, flags); + return result; +} + +/* this either appends or truncates item @coord */ +reiser4_internal int +resize_item(coord_t * coord /* coord of item being resized */ , + reiser4_item_data * data /* parameters of resize */ , + reiser4_key * key /* key of new unit */ , + lock_handle * lh /* lock handle of node + * being modified */ , + cop_insert_flag flags /* carry flags */ ) +{ + int result; + znode *node; + + assert("nikita-362", coord != NULL); + assert("nikita-363", data != NULL); + assert("vs-245", data->length != 0); + + node = coord->node; + coord_clear_iplug(coord); + result = zload(node); + if (result != 0) + return result; + + if (data->length < 0) + result = node_plugin_by_coord(coord)->shrink_item(coord, + data->length); + else + result = insert_into_item(coord, lh, key, data, flags); + + zrelse(node); + return result; +} + +/* insert flow @f */ +reiser4_internal int +insert_flow(coord_t * coord, lock_handle * lh, flow_t * f) +{ + int result; + carry_pool *pool; + carry_level lowest_level; + carry_op *op; + reiser4_item_data data; + + pool = init_carry_pool(); + if (IS_ERR(pool)) + return PTR_ERR(pool); + init_carry_level(&lowest_level, pool); + + op = post_carry(&lowest_level, COP_INSERT_FLOW, coord->node, 0 /* operate directly on coord -> node */ ); + if (IS_ERR(op) || (op == NULL)) { + done_carry_pool(pool); + return RETERR(op ? 
PTR_ERR(op) : -EIO); + } + + /* these are permanent during insert_flow */ + data.user = 1; + data.iplug = item_plugin_by_id(FORMATTING_ID); + data.arg = 0; + /* data.length and data.data will be set before calling paste or + insert */ + data.length = 0; + data.data = 0; + + op->u.insert_flow.flags = 0; + op->u.insert_flow.insert_point = coord; + op->u.insert_flow.flow = f; + op->u.insert_flow.data = &data; + op->u.insert_flow.new_nodes = 0; + + lowest_level.track_type = CARRY_TRACK_CHANGE; + lowest_level.tracked = lh; + + result = carry(&lowest_level, 0); + done_carry_pool(pool); + + return result; +} + +/* Given a coord in parent node, obtain a znode for the corresponding child */ +reiser4_internal znode * +child_znode(const coord_t * parent_coord /* coord of pointer to + * child */ , + znode * parent /* parent of child */ , + int incore_p /* if !0 only return child if already in + * memory */ , + int setup_dkeys_p /* if !0 update delimiting keys of + * child */ ) +{ + znode *child; + + assert("nikita-1374", parent_coord != NULL); + assert("nikita-1482", parent != NULL); + assert("nikita-1384", ergo(setup_dkeys_p, + rw_dk_is_not_locked(znode_get_tree(parent)))); + assert("nikita-2947", znode_is_any_locked(parent)); + + if (znode_get_level(parent) <= LEAF_LEVEL) { + /* trying to get child of leaf node */ + warning("nikita-1217", "Child of maize?"); + print_znode("node", parent); + return ERR_PTR(RETERR(-EIO)); + } + if (item_is_internal(parent_coord)) { + reiser4_block_nr addr; + item_plugin *iplug; + reiser4_tree *tree; + + iplug = item_plugin_by_coord(parent_coord); + assert("vs-512", iplug->s.internal.down_link); + iplug->s.internal.down_link(parent_coord, NULL, &addr); + + tree = znode_get_tree(parent); + if (incore_p) + child = zlook(tree, &addr); + else + child = zget(tree, &addr, parent, znode_get_level(parent) - 1, GFP_KERNEL); + if ((child != NULL) && !IS_ERR(child) && setup_dkeys_p) + set_child_delimiting_keys(parent, parent_coord, child); + } else { + 
warning("nikita-1483", "Internal item expected"); + print_znode("node", parent); + child = ERR_PTR(RETERR(-EIO)); + } + return child; +} + +/* remove znode from transaction */ +static void uncapture_znode (znode * node) +{ + struct page * page; + + assert ("zam-1001", ZF_ISSET(node, JNODE_HEARD_BANSHEE)); + + /* Get e-flush block allocation back before deallocating node's + * block number. */ + spin_lock_znode(node); + if (ZF_ISSET(node, JNODE_EFLUSH)) + eflush_del(ZJNODE(node), 0); + spin_unlock_znode(node); + + if (!blocknr_is_fake(znode_get_block(node))) { + int ret; + + /* An already allocated block goes right to the atom's delete set. */ + ret = reiser4_dealloc_block( + znode_get_block(node), 0, BA_DEFER | BA_FORMATTED); + if (ret) + warning("zam-942", "can\'t add a block (%llu) number to atom's delete set\n", + (unsigned long long)(*znode_get_block(node))); + + spin_lock_znode(node); + /* Here we return flush reserved block which was reserved at the + * moment when this allocated node was marked dirty and still + * not used by flush in node relocation procedure. */ + if (ZF_ISSET(node, JNODE_FLUSH_RESERVED)) { + txn_atom * atom ; + + atom = jnode_get_atom(ZJNODE(node)); + assert("zam-939", atom != NULL); + spin_unlock_znode(node); + flush_reserved2grabbed(atom, (__u64)1); + UNLOCK_ATOM(atom); + } else + spin_unlock_znode(node); + } else { + /* znode has assigned block which is counted as "fake + allocated". Return it back to "free blocks". */ + fake_allocated2free((__u64) 1, BA_FORMATTED); + } + + /* + * uncapture page from transaction. There is a possibility of a race + * with ->releasepage(): reiser4_releasepage() detaches page from this + * jnode and we have nothing to uncapture. To avoid this, get + * reference of node->pg under jnode spin lock. uncapture_page() will + * deal with released page itself. 
+ */ + spin_lock_znode(node); + page = znode_page(node); + if (likely(page != NULL)) { + /* + * uncapture_page() can only be called when we are sure that + * znode is pinned in memory, which we are, because + * forget_znode() is only called from longterm_unlock_znode(). + */ + page_cache_get(page); + spin_unlock_znode(node); + lock_page(page); + uncapture_page(page); + unlock_page(page); + page_cache_release(page); + } else { + txn_atom * atom; + + /* handle "flush queued" znodes */ + while (1) { + atom = jnode_get_atom(ZJNODE(node)); + assert("zam-943", atom != NULL); + + if (!ZF_ISSET(node, JNODE_FLUSH_QUEUED) || !atom->nr_running_queues) + break; + + spin_unlock_znode(node); + atom_wait_event(atom); + spin_lock_znode(node); + } + + uncapture_block(ZJNODE(node)); + UNLOCK_ATOM(atom); + zput(node); + } +} + +/* This is called from longterm_unlock_znode() when last lock is released from + the node that has been removed from the tree. At this point node is removed + from sibling list and its lock is invalidated. */ +reiser4_internal void +forget_znode(lock_handle * handle) +{ + znode *node; + reiser4_tree *tree; + + assert("umka-319", handle != NULL); + + node = handle->node; + tree = znode_get_tree(node); + + assert("vs-164", znode_is_write_locked(node)); + assert("nikita-1280", ZF_ISSET(node, JNODE_HEARD_BANSHEE)); + assert("nikita-3337", rw_zlock_is_locked(&node->lock)); + + /* We assume that this node was detached from its parent before + * unlocking, it gives no way to reach this node from parent through a + * down link. The node should have no children and, thereby, can't be + * reached from them by their parent pointers. The only way to obtain a + * reference to the node is to use sibling pointers from its left and + * right neighbors. In the next several lines we remove the node from + * the sibling list. 
*/ + + WLOCK_TREE(tree); + sibling_list_remove(node); + znode_remove(node, tree); + WUNLOCK_TREE(tree); + + /* Here we set JNODE_DYING and cancel all pending lock requests. It + * forces all lock requestor threads to repeat iterations of getting + * lock on a child, neighbor or parent node. But, those threads can't + * come to this node again, because this node is no longer a child, + * neighbor or parent of any other node. This order of znode + * invalidation does not allow other threads to waste CPU time in a busy + * loop, trying to lock a dying object. The exception is in the flush + * code when we take node directly from atom's capture list. */ + + write_unlock_zlock(&node->lock); + /* and, remove from atom's capture list. */ + uncapture_znode(node); + write_lock_zlock(&node->lock); + + invalidate_lock(handle); +} + +/* Check that internal item at @pointer really contains pointer to @child. */ +reiser4_internal int +check_tree_pointer(const coord_t * pointer /* would-be pointer to + * @child */ , + const znode * child /* child znode */ ) +{ + assert("nikita-1016", pointer != NULL); + assert("nikita-1017", child != NULL); + assert("nikita-1018", pointer->node != NULL); + + assert("nikita-1325", znode_is_any_locked(pointer->node)); + + assert("nikita-2985", + znode_get_level(pointer->node) == znode_get_level(child) + 1); + + coord_clear_iplug((coord_t *) pointer); + + if (coord_is_existing_unit(pointer)) { + item_plugin *iplug; + reiser4_block_nr addr; + + if (item_is_internal(pointer)) { + iplug = item_plugin_by_coord(pointer); + assert("vs-513", iplug->s.internal.down_link); + iplug->s.internal.down_link(pointer, NULL, &addr); + /* check that cached value is correct */ + if (disk_addr_eq(&addr, znode_get_block(child))) { + return NS_FOUND; + } + } + } + /* warning ("jmacd-1002", "tree pointer incorrect"); */ + return NS_NOT_FOUND; +} + +/* find coord of pointer to new @child in @parent. 
+ + Find the &coord_t in the @parent where pointer to a given @child will + be in. + +*/ +reiser4_internal int +find_new_child_ptr(znode * parent /* parent znode, passed locked */ , + znode * child UNUSED_ARG /* child znode, passed locked */ , + znode * left /* left brother of new node */ , + coord_t * result /* where result is stored in */ ) +{ + int ret; + + assert("nikita-1486", parent != NULL); + assert("nikita-1487", child != NULL); + assert("nikita-1488", result != NULL); + + ret = find_child_ptr(parent, left, result); + if (ret != NS_FOUND) { + warning("nikita-1489", "Cannot find brother position: %i", ret); + return RETERR(-EIO); + } else { + result->between = AFTER_UNIT; + return RETERR(NS_NOT_FOUND); + } +} + +/* find coord of pointer to @child in @parent. + + Find the &coord_t in the @parent where pointer to a given @child is in. + +*/ +reiser4_internal int +find_child_ptr(znode * parent /* parent znode, passed locked */ , + znode * child /* child znode, passed locked */ , + coord_t * result /* where result is stored in */ ) +{ + int lookup_res; + node_plugin *nplug; + /* left delimiting key of a child */ + reiser4_key ld; + reiser4_tree *tree; + + assert("nikita-934", parent != NULL); + assert("nikita-935", child != NULL); + assert("nikita-936", result != NULL); + assert("zam-356", znode_is_loaded(parent)); + + coord_init_zero(result); + result->node = parent; + + nplug = parent->nplug; + assert("nikita-939", nplug != NULL); + + tree = znode_get_tree(parent); + /* NOTE-NIKITA taking read-lock on tree here assumes that @result is + * not aliased to ->in_parent of some znode. Otherwise, + * parent_coord_to_coord() below would modify data protected by tree + * lock. */ + RLOCK_TREE(tree); + /* fast path. Try to use cached value. Lock tree to keep + node->pos_in_parent and pos->*_blocknr consistent. 
*/ + if (child->in_parent.item_pos + 1 != 0) { + parent_coord_to_coord(&child->in_parent, result); + if (check_tree_pointer(result, child) == NS_FOUND) { + RUNLOCK_TREE(tree); + return NS_FOUND; + } + + child->in_parent.item_pos = (unsigned short)~0; + } + RUNLOCK_TREE(tree); + + /* if the above failed, find some key from @child. We are looking for the + least key in a child. */ + UNDER_RW_VOID(dk, tree, read, ld = *znode_get_ld_key(child)); + /* + * now, look up parent with the key just found. Note that the left delimiting + * key doesn't identify a node uniquely, because (in an extremely rare + * case) two nodes can have equal left delimiting keys, if one of them + * is completely filled with directory entries that all happen to be + * hash collisions. But, we check block number in check_tree_pointer() + * and, so, are safe. + */ + lookup_res = nplug->lookup(parent, &ld, FIND_EXACT, result); + /* update cached pos_in_node */ + if (lookup_res == NS_FOUND) { + WLOCK_TREE(tree); + coord_to_parent_coord(result, &child->in_parent); + WUNLOCK_TREE(tree); + lookup_res = check_tree_pointer(result, child); + } + if (lookup_res == NS_NOT_FOUND) + lookup_res = find_child_by_addr(parent, child, result); + return lookup_res; +} + +/* find coord of pointer to @child in @parent by scanning + + Find the &coord_t in the @parent where pointer to a given @child + is in by scanning all internal items in @parent and comparing block + numbers in them with that of @child. 
+ +*/ +static int +find_child_by_addr(znode * parent /* parent znode, passed locked */ , + znode * child /* child znode, passed locked */ , + coord_t * result /* where result is stored in */ ) +{ + int ret; + + assert("nikita-1320", parent != NULL); + assert("nikita-1321", child != NULL); + assert("nikita-1322", result != NULL); + + ret = NS_NOT_FOUND; + + for_all_units(result, parent) { + if (check_tree_pointer(result, child) == NS_FOUND) { + UNDER_RW_VOID(tree, znode_get_tree(parent), write, + coord_to_parent_coord(result, + &child->in_parent)); + ret = NS_FOUND; + break; + } + } + return ret; +} + +/* true, if @addr is "unallocated block number", which is just address, with + highest bit set. */ +reiser4_internal int +is_disk_addr_unallocated(const reiser4_block_nr * addr /* address to + * check */ ) +{ + assert("nikita-1766", addr != NULL); + cassert(sizeof (reiser4_block_nr) == 8); + return (*addr & REISER4_BLOCKNR_STATUS_BIT_MASK) == REISER4_UNALLOCATED_STATUS_VALUE; +} + +/* returns true if removing bytes of given range of key [from_key, to_key] + causes removing of whole item @from */ +static int +item_removed_completely(coord_t * from, const reiser4_key * from_key, const reiser4_key * to_key) +{ + item_plugin *iplug; + reiser4_key key_in_item; + + assert("umka-325", from != NULL); + assert("", item_is_extent(from)); + + /* check first key just for case */ + item_key_by_coord(from, &key_in_item); + if (keygt(from_key, &key_in_item)) + return 0; + + /* check last key */ + iplug = item_plugin_by_coord(from); + assert("vs-611", iplug && iplug->s.file.append_key); + + iplug->s.file.append_key(from, &key_in_item); + set_key_offset(&key_in_item, get_key_offset(&key_in_item) - 1); + + if (keylt(to_key, &key_in_item)) + /* last byte is not removed */ + return 0; + return 1; +} + +/* helper function for prepare_twig_kill(): @left and @right are formatted + * neighbors of extent item being completely removed. 
Load and lock neighbors + * and store lock handles into @kdata for later use by kill_hook_extent() */ +static int +prepare_children(znode *left, znode *right, carry_kill_data *kdata) +{ + int result; + int left_loaded; + int right_loaded; + + result = 0; + left_loaded = right_loaded = 0; + + if (left != NULL) { + result = zload(left); + if (result == 0) { + left_loaded = 1; + result = longterm_lock_znode(kdata->left, left, + ZNODE_READ_LOCK, + ZNODE_LOCK_LOPRI); + } + } + if (result == 0 && right != NULL) { + result = zload(right); + if (result == 0) { + right_loaded = 1; + result = longterm_lock_znode(kdata->right, right, + ZNODE_READ_LOCK, + ZNODE_LOCK_HIPRI | ZNODE_LOCK_NONBLOCK); + } + } + if (result != 0) { + done_lh(kdata->left); + done_lh(kdata->right); + if (left_loaded != 0) + zrelse(left); + if (right_loaded != 0) + zrelse(right); + } + return result; +} + +static void +done_children(carry_kill_data *kdata) +{ + if (kdata->left != NULL && kdata->left->node != NULL) { + zrelse(kdata->left->node); + done_lh(kdata->left); + } + if (kdata->right != NULL && kdata->right->node != NULL) { + zrelse(kdata->right->node); + done_lh(kdata->right); + } +} + +/* part of cut_node. It is called when cut_node is called to remove or cut part + of extent item. When head of that item is removed - we have to update right + delimiting key of left neighbor of extent. When item is removed completely - we + have to set sibling link between left and right neighbor of removed + extent. This may return -E_DEADLOCK because of trying to get left neighbor + locked. 
So, caller should repeat an attempt +*/ +/* Audited by: umka (2002.06.16) */ +static int +prepare_twig_kill(carry_kill_data *kdata, znode * locked_left_neighbor) +{ + int result; + reiser4_key key; + lock_handle left_lh; + lock_handle right_lh; + coord_t left_coord; + coord_t *from; + znode *left_child; + znode *right_child; + reiser4_tree *tree; + int left_zloaded_here, right_zloaded_here; + + from = kdata->params.from; + assert("umka-326", from != NULL); + assert("umka-327", kdata->params.to != NULL); + + /* for one extent item only yet */ + assert("vs-591", item_is_extent(from)); + assert ("vs-592", from->item_pos == kdata->params.to->item_pos); + + if ((kdata->params.from_key && keygt(kdata->params.from_key, item_key_by_coord(from, &key))) || + from->unit_pos != 0) { + /* head of item @from is not removed, there is nothing to + worry about */ + return 0; + } + + result = 0; + left_zloaded_here = 0; + right_zloaded_here = 0; + + left_child = right_child = NULL; + + coord_dup(&left_coord, from); + init_lh(&left_lh); + init_lh(&right_lh); + if (coord_prev_unit(&left_coord)) { + /* @from is leftmost item in its node */ + if (!locked_left_neighbor) { + result = reiser4_get_left_neighbor(&left_lh, from->node, ZNODE_READ_LOCK, GN_CAN_USE_UPPER_LEVELS); + switch (result) { + case 0: + break; + case -E_NO_NEIGHBOR: + /* there is no formatted node to the left of + from->node */ + warning("vs-605", + "extent item has smallest key in " "the tree and it is about to be removed"); + return 0; + case -E_DEADLOCK: + /* need to restart */ + default: + return result; + } + + /* we have acquired left neighbor of from->node */ + result = zload(left_lh.node); + if (result) + goto done; + + locked_left_neighbor = left_lh.node; + } else { + /* squalloc_right_twig_cut should have supplied locked + * left neighbor */ + assert("vs-834", znode_is_write_locked(locked_left_neighbor)); + result = zload(locked_left_neighbor); + if (result) + return result; + } + + left_zloaded_here = 1; + 
coord_init_last_unit(&left_coord, locked_left_neighbor); + } + + if (!item_is_internal(&left_coord)) { + /* what else but extent can be on twig level */ + assert("vs-606", item_is_extent(&left_coord)); + + /* there is no left formatted child */ + if (left_zloaded_here) + zrelse(locked_left_neighbor); + done_lh(&left_lh); + return 0; + } + + tree = znode_get_tree(left_coord.node); + left_child = child_znode(&left_coord, left_coord.node, 1, 0); + + if (IS_ERR(left_child)) { + result = PTR_ERR(left_child); + goto done; + } + + /* left child is acquired, calculate new right delimiting key for it + and get right child if it is necessary */ + if (item_removed_completely(from, kdata->params.from_key, kdata->params.to_key)) { + /* try to get right child of removed item */ + coord_t right_coord; + + assert("vs-607", kdata->params.to->unit_pos == coord_last_unit_pos(kdata->params.to)); + coord_dup(&right_coord, kdata->params.to); + if (coord_next_unit(&right_coord)) { + /* @to is rightmost unit in the node */ + result = reiser4_get_right_neighbor(&right_lh, from->node, ZNODE_READ_LOCK, GN_CAN_USE_UPPER_LEVELS); + switch (result) { + case 0: + result = zload(right_lh.node); + if (result) + goto done; + + right_zloaded_here = 1; + coord_init_first_unit(&right_coord, right_lh.node); + item_key_by_coord(&right_coord, &key); + break; + + case -E_NO_NEIGHBOR: + /* there is no formatted node to the right of + from->node */ + UNDER_RW_VOID(dk, tree, read, + key = *znode_get_rd_key(from->node)); + right_coord.node = 0; + result = 0; + break; + default: + /* real error */ + goto done; + } + } else { + /* there is an item to the right of @from - take its key */ + item_key_by_coord(&right_coord, &key); + } + + /* try to get right child of @from */ + if (right_coord.node && /* there is right neighbor of @from */ + item_is_internal(&right_coord)) { /* it is internal item */ + right_child = child_znode(&right_coord, + right_coord.node, 1, 0); + + if (IS_ERR(right_child)) { + result = 
PTR_ERR(right_child); + goto done; + } + + } + /* whole extent is removed between znodes left_child and right_child. Prepare them for linking and + update of right delimiting key of left_child */ + result = prepare_children(left_child, right_child, kdata); + } else { + /* head of item @to is removed. left_child has to get right delimiting key update. Prepare it for that */ + result = prepare_children(left_child, NULL, kdata); + } + + done: + if (right_child) + zput(right_child); + if (right_zloaded_here) + zrelse(right_lh.node); + done_lh(&right_lh); + + if (left_child) + zput(left_child); + if (left_zloaded_here) + zrelse(locked_left_neighbor); + done_lh(&left_lh); + return result; +} + +/* this is used to remove part of node content between coordinates @from and @to. Units to which @from and @to are set + are to be cut completely */ +/* for try_to_merge_with_left, delete_copied, delete_node */ +reiser4_internal int +cut_node_content(coord_t *from, coord_t *to, + const reiser4_key * from_key /* first key to be removed */ , + const reiser4_key * to_key /* last key to be removed */ , + reiser4_key * smallest_removed /* smallest key actually removed */) +{ + carry_pool *pool; + carry_level lowest_level; + carry_op *op; + carry_cut_data cut_data; + int result; + + assert("vs-1715", coord_compare(from, to) != COORD_CMP_ON_RIGHT); + + pool = init_carry_pool(); + if (IS_ERR(pool)) + return PTR_ERR(pool); + init_carry_level(&lowest_level, pool); + + op = post_carry(&lowest_level, COP_CUT, from->node, 0); + assert("vs-1509", op != 0); + if (IS_ERR(op)) { + done_carry_pool(pool); + return PTR_ERR(op); + } + + cut_data.params.from = from; + cut_data.params.to = to; + cut_data.params.from_key = from_key; + cut_data.params.to_key = to_key; + cut_data.params.smallest_removed = smallest_removed; + + op->u.cut_or_kill.is_cut = 1; + op->u.cut_or_kill.u.cut = &cut_data; + + result = carry(&lowest_level, 0); + done_carry_pool(pool); + + return result; +} + +/* cut part of the node + 
+ Cut part or whole content of node. + + cut data between @from and @to of @from->node and call carry() to make + corresponding changes in the tree. @from->node may become empty. If so - + pointer to it will be removed. Neighboring nodes are not changed. Smallest + removed key is stored in @smallest_removed + +*/ +reiser4_internal int +kill_node_content(coord_t * from /* coord of the first unit/item that will be + * eliminated */ , + coord_t * to /* coord of the last unit/item that will be + * eliminated */ , + const reiser4_key * from_key /* first key to be removed */ , + const reiser4_key * to_key /* last key to be removed */ , + reiser4_key * smallest_removed /* smallest key actually + * removed */ , + znode * locked_left_neighbor, /* this is set when kill_node_content is called with left neighbor + * locked (in squalloc_right_twig_cut, namely) */ + struct inode *inode, /* inode of file whose item (or its part) is to be killed. This is necessary to + invalidate pages together with item pointing to them */ + int truncate) /* this call is made for file truncate) */ +{ + int result; + carry_pool *pool; + carry_level lowest_level; + carry_op *op; + carry_kill_data kdata; + lock_handle left_child; + lock_handle right_child; + + assert("umka-328", from != NULL); + assert("vs-316", !node_is_empty(from->node)); + assert("nikita-1812", coord_is_existing_unit(from) && coord_is_existing_unit(to)); + + init_lh(&left_child); + init_lh(&right_child); + + kdata.params.from = from; + kdata.params.to = to; + kdata.params.from_key = from_key; + kdata.params.to_key = to_key; + kdata.params.smallest_removed = smallest_removed; + kdata.params.truncate = truncate; + kdata.flags = 0; + kdata.inode = inode; + kdata.left = &left_child; + kdata.right = &right_child; + + if (znode_get_level(from->node) == TWIG_LEVEL && item_is_extent(from)) { + /* left child of extent item may have to get updated right + delimiting key and to get linked with right child of extent + @from if it will be 
removed completely */ + result = prepare_twig_kill(&kdata, locked_left_neighbor); + if (result) { + done_children(&kdata); + return result; + } + } + + pool = init_carry_pool(); + if (IS_ERR(pool)) + return PTR_ERR(pool); + init_carry_level(&lowest_level, pool); + + op = post_carry(&lowest_level, COP_CUT, from->node, 0); + if (IS_ERR(op) || (op == NULL)) { + done_carry_pool(pool); + done_children(&kdata); + return RETERR(op ? PTR_ERR(op) : -EIO); + } + + op->u.cut_or_kill.is_cut = 0; + op->u.cut_or_kill.u.kill = &kdata; + + result = carry(&lowest_level, 0); + + done_carry_pool(pool); + done_children(&kdata); + return result; +} + +void +fake_kill_hook_tail(struct inode *inode, loff_t start, loff_t end, int truncate) +{ + if (inode_get_flag(inode, REISER4_HAS_MMAP)) { + pgoff_t start_pg, end_pg; + + start_pg = start >> PAGE_CACHE_SHIFT; + end_pg = (end - 1) >> PAGE_CACHE_SHIFT; + + if ((start & (PAGE_CACHE_SIZE - 1)) == 0) { + /* + * kill up to the page boundary. + */ + assert("vs-123456", start_pg == end_pg); + reiser4_invalidate_pages(inode->i_mapping, start_pg, 1, truncate); + } else if (start_pg != end_pg) { + /* + * page boundary is within killed portion of node. + */ + assert("vs-654321", end_pg - start_pg == 1); + reiser4_invalidate_pages(inode->i_mapping, end_pg, end_pg - start_pg, 1); + } + } + inode_sub_bytes(inode, end - start); +} + +/** + * Delete whole @node from the reiser4 tree without loading it. + * + * @left: locked left neighbor, + * @node: node to be deleted, + * @smallest_removed: leftmost key of deleted node, + * @object: inode pointer, if we truncate a file body. + * @truncate: true if called for file truncate. + * + * @return: 0 if success, error code otherwise. + * + * NOTE: if @object!=NULL we assume that @smallest_removed != NULL and it + * contains the right value of the smallest removed key from the previous + * cut_worker() iteration. This is needed for proper accounting of + * "i_blocks" and "i_bytes" fields of the @object. 
+ */ +reiser4_internal int delete_node (znode * node, reiser4_key * smallest_removed, + struct inode * object, int truncate) +{ + lock_handle parent_lock; + coord_t cut_from; + coord_t cut_to; + reiser4_tree * tree; + int ret; + + assert ("zam-937", node != NULL); + assert ("zam-933", znode_is_write_locked(node)); + assert ("zam-999", smallest_removed != NULL); + + init_lh(&parent_lock); + + ret = reiser4_get_parent(&parent_lock, node, ZNODE_WRITE_LOCK, 0); + if (ret) + return ret; + + assert("zam-934", !znode_above_root(parent_lock.node)); + + ret = zload(parent_lock.node); + if (ret) + goto failed_nozrelse; + + ret = find_child_ptr(parent_lock.node, node, &cut_from); + if (ret) + goto failed; + + /* decrement child counter and set parent pointer to NULL before + deleting the list from parent node because of checks in + internal_kill_item_hook (we can delete the last item from the parent + node, the parent node is going to be deleted and its c_count should + be zero). */ + + tree = znode_get_tree(node); + WLOCK_TREE(tree); + init_parent_coord(&node->in_parent, NULL); + -- parent_lock.node->c_count; + WUNLOCK_TREE(tree); + + assert("zam-989", item_is_internal(&cut_from)); + + /* @node should be deleted after unlocking. */ + ZF_SET(node, JNODE_HEARD_BANSHEE); + + /* remove a pointer from the parent node to the node being deleted. */ + coord_dup(&cut_to, &cut_from); + /* FIXME: shouldn't this be kill_node_content */ + ret = cut_node_content(&cut_from, &cut_to, NULL, NULL, NULL); + if (ret) + /* FIXME(Zam): Should we re-connect the node to its parent if + * cut_node fails? */ + goto failed; + + { + reiser4_tree * tree = current_tree; + __u64 start_offset = 0, end_offset = 0; + + RLOCK_TREE(tree); + WLOCK_DK(tree); + if (object) { + /* We use @smallest_removed and the left delimiting of + * the current node for @object->i_blocks, i_bytes + * calculation. We assume that the items after the + * *@smallest_removed key have been deleted from the + * file body. 
*/ + start_offset = get_key_offset(znode_get_ld_key(node)); + end_offset = get_key_offset(smallest_removed); + } + + assert("zam-1021", znode_is_connected(node)); + if (node->left) + znode_set_rd_key(node->left, znode_get_rd_key(node)); + + *smallest_removed = *znode_get_ld_key(node); + + WUNLOCK_DK(tree); + RUNLOCK_TREE(tree); + + if (object) { + /* we used to perform actions which are to be performed on items on their removal from tree in + special item method - kill_hook. Here for optimization reasons we avoid reading node + containing item we remove and can not call item's kill hook. Instead we call function which + does exactly the same things as tail kill hook on the assumption that node we avoid reading + contains only one item and that item is a tail one. */ + fake_kill_hook_tail(object, start_offset, end_offset, truncate); + } + } + failed: + zrelse(parent_lock.node); + failed_nozrelse: + done_lh(&parent_lock); + + return ret; +} + +/** + * This subroutine is not optimal but implementation seems to + * be easier. + * + * @tap: the point deletion process begins from, + * @from_key: the beginning of the deleted key range, + * @to_key: the end of the deleted key range, + * @smallest_removed: the smallest removed key, + * @truncate: true if called for file truncate. + * @progress: return true if progress in file item deletion was made, + * @smallest_removed value is valid in that case. + * + * @return: 0 if success, error code otherwise, -E_REPEAT means that long cut_tree + * operation was interrupted for allowing atom commit. 
+ */ +reiser4_internal int +cut_tree_worker_common (tap_t * tap, const reiser4_key * from_key, + const reiser4_key * to_key, reiser4_key * smallest_removed, + struct inode * object, int truncate, int *progress) +{ + lock_handle next_node_lock; + coord_t left_coord; + int result; + + assert("zam-931", tap->coord->node != NULL); + assert("zam-932", znode_is_write_locked(tap->coord->node)); + + *progress = 0; + init_lh(&next_node_lock); + + while (1) { + znode *node; /* node from which items are cut */ + node_plugin *nplug; /* node plugin for @node */ + + node = tap->coord->node; + + /* Move next_node_lock to the next node on the left. */ + result = reiser4_get_left_neighbor( + &next_node_lock, node, ZNODE_WRITE_LOCK, GN_CAN_USE_UPPER_LEVELS); + if (result != 0 && result != -E_NO_NEIGHBOR) + break; + /* Check can we delete the node as a whole. */ + if (*progress && znode_get_level(node) == LEAF_LEVEL && + UNDER_RW(dk, current_tree, read, keyle(from_key, znode_get_ld_key(node)))) + { + result = delete_node(node, smallest_removed, object, truncate); + } else { + result = tap_load(tap); + if (result) + return result; + + /* Prepare the second (right) point for cut_node() */ + if (*progress) + coord_init_last_unit(tap->coord, node); + + else if (item_plugin_by_coord(tap->coord)->b.lookup == NULL) + /* set rightmost unit for the items without lookup method */ + tap->coord->unit_pos = coord_last_unit_pos(tap->coord); + + nplug = node->nplug; + + assert("vs-686", nplug); + assert("vs-687", nplug->lookup); + + /* left_coord is leftmost unit cut from @node */ + result = nplug->lookup(node, from_key, + FIND_MAX_NOT_MORE_THAN, &left_coord); + + if (IS_CBKERR(result)) + break; + + /* adjust coordinates so that they are set to existing units */ + if (coord_set_to_right(&left_coord) || coord_set_to_left(tap->coord)) { + result = 0; + break; + } + + if (coord_compare(&left_coord, tap->coord) == COORD_CMP_ON_RIGHT) { + /* keys from @from_key to @to_key are not in the tree */ + result 
= 0; + break; + } + + if (left_coord.item_pos != tap->coord->item_pos) { + /* do not allow to cut more than one item. It is added to solve problem of truncating + partially converted files. If file is partially converted there may exist a twig node + containing both internal item or items pointing to leaf nodes with formatting items + and extent item. We do not want to kill internal items being at twig node here + because cut_tree_worker assumes killing them from level level */ + coord_dup(&left_coord, tap->coord); + assert("vs-1652", coord_is_existing_unit(&left_coord)); + left_coord.unit_pos = 0; + } + + /* cut data from one node */ + // *smallest_removed = *min_key(); + result = kill_node_content(&left_coord, tap->coord, from_key, to_key, + smallest_removed, next_node_lock.node, + object, truncate); + tap_relse(tap); + } + if (result) + break; + + ++ (*progress); + + /* Check whether all items with keys >= from_key were removed + * from the tree. */ + if (keyle(smallest_removed, from_key)) + /* result = 0;*/ + break; + + if (next_node_lock.node == NULL) + break; + + result = tap_move(tap, &next_node_lock); + done_lh(&next_node_lock); + if (result) + break; + + /* Break long cut_tree operation (deletion of a large file) if + * atom requires commit. */ + if (*progress > CUT_TREE_MIN_ITERATIONS + && current_atom_should_commit()) + { + result = -E_REPEAT; + break; + } + } + done_lh(&next_node_lock); + // assert("vs-301", !keyeq(&smallest_removed, min_key())); + return result; +} + + +/* there is a fundamental problem with optimizing deletes: VFS does it + one file at a time. Another problem is that if an item can be + anything, then deleting items must be done one at a time. It just + seems clean to writes this to specify a from and a to key, and cut + everything between them though. */ + +/* use this function with care if deleting more than what is part of a single file. 
*/ +/* do not use this when cutting a single item, it is suboptimal for that */ + +/* You are encouraged to write plugin specific versions of this. It + cannot be optimal for all plugins because it works item at a time, + and some plugins could sometimes work node at a time. Regular files + however are not optimizable to work node at a time because of + extents needing to free the blocks they point to. + + Optimizations compared to v3 code: + + It does not balance (that task is left to memory pressure code). + + Nodes are deleted only if empty. + + Uses extents. + + Performs read-ahead of formatted nodes whose contents are part of + the deletion. +*/ + + +/** + * Delete everything from the reiser4 tree between two keys: @from_key and + * @to_key. + * + * @from_key: the beginning of the deleted key range, + * @to_key: the end of the deleted key range, + * @smallest_removed: the smallest removed key, + * @object: owner of cutting items. + * @truncate: true if called for file truncate. + * @progress: return true if a progress in file items deletions was made, + * @smallest_removed value is actual in that case. + * + * @return: 0 if success, error code otherwise, -E_REPEAT means that long cut_tree + * operation was interrupted for allowing atom commit . 
+ */ + +reiser4_internal int +cut_tree_object(reiser4_tree * tree, const reiser4_key * from_key, + const reiser4_key * to_key, reiser4_key * smallest_removed_p, + struct inode * object, int truncate, int *progress) +{ + lock_handle lock; + int result; + tap_t tap; + coord_t right_coord; + reiser4_key smallest_removed; + int (*cut_tree_worker)(tap_t *, const reiser4_key *, const reiser4_key *, + reiser4_key *, struct inode *, int, int *); + STORE_COUNTERS; + + assert("umka-329", tree != NULL); + assert("umka-330", from_key != NULL); + assert("umka-331", to_key != NULL); + assert("zam-936", keyle(from_key, to_key)); + + if (smallest_removed_p == NULL) + smallest_removed_p = &smallest_removed; + + init_lh(&lock); + + do { + /* Find rightmost item to cut away from the tree. */ + result = object_lookup( + object, to_key, &right_coord, &lock, + ZNODE_WRITE_LOCK, FIND_MAX_NOT_MORE_THAN, TWIG_LEVEL, + LEAF_LEVEL, CBK_UNIQUE, 0/*ra_info*/); + if (result != CBK_COORD_FOUND) + break; + if (object == NULL || inode_file_plugin(object)->cut_tree_worker == NULL) + cut_tree_worker = cut_tree_worker_common; + else + cut_tree_worker = inode_file_plugin(object)->cut_tree_worker; + tap_init(&tap, &right_coord, &lock, ZNODE_WRITE_LOCK); + result = cut_tree_worker( + &tap, from_key, to_key, smallest_removed_p, object, truncate, progress); + tap_done(&tap); + + preempt_point(); + + } while (0); + + done_lh(&lock); + + if (result) { + switch (result) { + case -E_NO_NEIGHBOR: + result = 0; + break; + case -E_DEADLOCK: + result = -E_REPEAT; + case -E_REPEAT: + case -ENOMEM: + case -ENOENT: + break; + default: + warning("nikita-2861", "failure: %i", result); + } + } + + CHECK_COUNTERS; + return result; +} + +/* repeat cut_tree_object until everything is deleted. unlike cut_file_items, it + * does not end current transaction if -E_REPEAT is returned by + * cut_tree_object. 
*/ +reiser4_internal int +cut_tree(reiser4_tree *tree, const reiser4_key *from, const reiser4_key *to, + struct inode *inode, int truncate) +{ + int result; + int progress; + + do { + result = cut_tree_object(tree, from, to, NULL, inode, truncate, &progress); + } while (result == -E_REPEAT); + + return result; +} + + +/* first step of reiser4 tree initialization */ +reiser4_internal void +init_tree_0(reiser4_tree * tree) +{ + assert("zam-683", tree != NULL); + rw_tree_init(tree); + spin_epoch_init(tree); +} + +/* finishing reiser4 initialization */ +reiser4_internal int +init_tree(reiser4_tree * tree /* pointer to structure being + * initialized */ , + const reiser4_block_nr * root_block /* address of a root block + * on a disk */ , + tree_level height /* height of a tree */ , + node_plugin * nplug /* default node plugin */ ) +{ + int result; + + assert("nikita-306", tree != NULL); + assert("nikita-307", root_block != NULL); + assert("nikita-308", height > 0); + assert("nikita-309", nplug != NULL); + assert("zam-587", tree->super != NULL); + + /* someone might not call init_tree_0 before calling init_tree. */ + init_tree_0(tree); + + tree->root_block = *root_block; + tree->height = height; + tree->estimate_one_insert = calc_estimate_one_insert(height); + tree->nplug = nplug; + + tree->znode_epoch = 1ull; + + cbk_cache_init(&tree->cbk_cache); + + result = znodes_tree_init(tree); + if (result == 0) + result = jnodes_tree_init(tree); + if (result == 0) { + tree->uber = zget(tree, &UBER_TREE_ADDR, NULL, 0, GFP_KERNEL); + if (IS_ERR(tree->uber)) { + result = PTR_ERR(tree->uber); + tree->uber = NULL; + } + } + return result; +} + +/* release resources associated with @tree */ +reiser4_internal void +done_tree(reiser4_tree * tree /* tree to release */ ) +{ + if (tree == NULL) + return; + + if (tree->uber != NULL) { + zput(tree->uber); + tree->uber = NULL; + } + znodes_tree_done(tree); + jnodes_tree_done(tree); + cbk_cache_done(&tree->cbk_cache); +} + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/tree.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/tree.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,551 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Tree operations. See fs/reiser4/tree.c for comments */ + +#if !defined( __REISER4_TREE_H__ ) +#define __REISER4_TREE_H__ + +#include "forward.h" +#include "debug.h" +#include "spin_macros.h" +#include "dformat.h" +#include "type_safe_list.h" +#include "plugin/node/node.h" +#include "plugin/plugin.h" +#include "jnode.h" +#include "znode.h" +#include "tap.h" + +#include /* for __u?? */ +#include /* for struct super_block */ +#include +#include /* for struct task_struct */ + +/* fictive block number never actually used */ +extern const reiser4_block_nr UBER_TREE_ADDR; + +/* define typed list for cbk_cache lru */ +TYPE_SAFE_LIST_DECLARE(cbk_cache); + +/* &cbk_cache_slot - entry in a coord cache. + + This is entry in a coord_by_key (cbk) cache, represented by + &cbk_cache. + +*/ +typedef struct cbk_cache_slot { + /* cached node */ + znode *node; + /* linkage to the next cbk cache slot in a LRU order */ + cbk_cache_list_link lru; +} cbk_cache_slot; + +/* &cbk_cache - coord cache. This is part of reiser4_tree. + + cbk_cache is supposed to speed up tree lookups by caching results of recent + successful lookups (we don't cache negative results as dentry cache + does). Cache consists of relatively small number of entries kept in a LRU + order. Each entry (&cbk_cache_slot) contains a pointer to znode, from + which we can obtain a range of keys that covered by this znode. Before + embarking into real tree traversal we scan cbk_cache slot by slot and for + each slot check whether key we are looking for is between minimal and + maximal keys for node pointed to by this slot. 
If no match is found, real + tree traversal is performed and if result is successful, appropriate entry + is inserted into cache, possibly pulling least recently used entry out of + it. + + Tree spin lock is used to protect coord cache. If contention for this + lock proves to be too high, finer-grained locking can be added. + + Invariants involving parts of this data-type: + + [cbk-cache-invariant] +*/ +typedef struct cbk_cache { + /* serializer */ + reiser4_rw_data guard; + int nr_slots; + /* head of LRU list of cache slots */ + cbk_cache_list_head lru; + /* actual array of slots */ + cbk_cache_slot *slot; +} cbk_cache; + +#define rw_ordering_pred_cbk_cache(cache) (1) + +/* define read-write locking functions for cbk_cache */ +RW_LOCK_FUNCTIONS(cbk_cache, cbk_cache, guard); + +/* define list manipulation functions for cbk_cache LRU list */ +TYPE_SAFE_LIST_DEFINE(cbk_cache, cbk_cache_slot, lru); + +/* level_lookup_result - possible outcome of looking up key at some level. + This is used by coord_by_key when traversing tree downward. */ +typedef enum { + /* continue to the next level */ + LOOKUP_CONT, + /* done. Either required item was found, or we can prove it + doesn't exist, or some error occurred. */ + LOOKUP_DONE, + /* restart traversal from the root. Infamous "repetition". */ + LOOKUP_REST +} level_lookup_result; + +/* This is representation of internal reiser4 tree where all file-system + data and meta-data are stored. This structure is passed to all tree + manipulation functions. It's different from the super block because: + we don't want to limit ourselves to strictly one to one mapping + between super blocks and trees, and, because they are logically + different: there are things in a super block that have no relation to + the tree (bitmaps, journalling area, mount options, etc.) and there + are things in a tree that bear no relation to the super block, like + tree of znodes.
+ + At this time, there is only one tree + per filesystem, and this struct is part of the super block. We only + call the super block the super block for historical reasons (most + other filesystems call the per filesystem metadata the super block). +*/ + +struct reiser4_tree { + /* block_nr == 0 is fake znode. Write lock it, while changing + tree height. */ + /* disk address of root node of a tree */ + reiser4_block_nr root_block; + + /* level of the root node. If this is 1, tree consists of root + node only */ + tree_level height; + + /* + * this is cached here to avoid calling plugins through function + * dereference all the time. + */ + __u64 estimate_one_insert; + + /* cache of recent tree lookup results */ + cbk_cache cbk_cache; + + /* hash table to look up znodes by block number. */ + z_hash_table zhash_table; + z_hash_table zfake_table; + /* hash table to look up jnodes by inode and offset. */ + j_hash_table jhash_table; + + /* lock protecting: + - parent pointers, + - sibling pointers, + - znode hash table + - coord cache + */ + /* NOTE: The "giant" tree lock can be replaced by more spin locks, + hoping they will be less contended. We can use one spin lock per one + znode hash bucket. With adding of some code complexity, sibling + pointers can be protected by both znode spin locks. Although this looks + more SMP scalable, we should test this locking change on n-way (n > + 4) SMP machines. Current 4-way machine test does not show that the tree + lock is contended or that it is a bottleneck (2003.07.25). */ + + reiser4_rw_data tree_lock; + + /* lock protecting delimiting keys */ + reiser4_rw_data dk_lock; + + /* spin lock protecting znode_epoch */ + reiser4_spin_data epoch_lock; + /* version stamp used to mark znode updates. See seal.[ch] for more + * information.
*/ + __u64 znode_epoch; + + znode *uber; + node_plugin *nplug; + struct super_block *super; + struct { + /* carry flags used for insertion of new nodes */ + __u32 new_node_flags; + /* carry flags used for insertion of new extents */ + __u32 new_extent_flags; + /* carry flags used for paste operations */ + __u32 paste_flags; + /* carry flags used for insert operations */ + __u32 insert_flags; + } carry; +}; + +#define spin_ordering_pred_epoch(tree) (1) +SPIN_LOCK_FUNCTIONS(epoch, reiser4_tree, epoch_lock); + +extern void init_tree_0(reiser4_tree *); + +extern int init_tree(reiser4_tree * tree, + const reiser4_block_nr * root_block, tree_level height, node_plugin * default_plugin); +extern void done_tree(reiser4_tree * tree); + +/* &reiser4_item_data - description of data to be inserted or pasted + + Q: articulate the reasons for the difference between this and flow. + + A: Besides flow, we insert other things into the tree: stat data, directory + entries, etc. To insert them into the tree one has to provide this structure. + To insert a flow, one can use insert_flow, where this structure + does not have to be created. +*/ +struct reiser4_item_data { + /* actual data to be inserted. If NULL, ->create_item() will not + do xmemcpy itself, leaving this up to the caller. This can + save some amount of unnecessary memory copying, for example, + during insertion of stat data. + + */ + char *data; + /* 1 if 'char * data' contains pointer to user space and 0 if it is + kernel space */ + int user; + /* amount of data we are going to insert or paste */ + int length; + /* "Arg" is opaque data that is passed down to the + ->create_item() method of node layout, which in turn + hands it to the ->create_hook() of item being created. This + arg is currently used by: + + . ->create_hook() of internal item + (fs/reiser4/plugin/item/internal.c:internal_create_hook()), + . ->paste() method of directory item. + . 
->create_hook() of extent item + + For internal item, this is left "brother" of new node being + inserted and it is used to add new node into sibling list + after parent to it was just inserted into parent. + + While ->arg does look like a somewhat unnecessary complication, + it actually saves a lot of headache in many places, because + all data necessary to insert or paste new data into tree are + collected in one place, and this eliminates a lot of extra + argument passing and storing everywhere. + + */ + void *arg; + /* plugin of item we are inserting */ + item_plugin *iplug; +}; + +/* cbk flags: options for coord_by_key() */ +typedef enum { + /* coord_by_key() is called for insertion. This is necessary because + of extents being located at the twig level. For explanation, see + comment just above is_next_item_internal(). + */ + CBK_FOR_INSERT = (1 << 0), + /* coord_by_key() is called with key that is known to be unique */ + CBK_UNIQUE = (1 << 1), + /* coord_by_key() can trust delimiting keys. This option is not user + accessible. coord_by_key() will set it automatically. It will be + only cleared by special-case in extents-on-the-twig-level handling + where it is necessary to insert item with a key smaller than + leftmost key in a node. This is necessary because of extents being + located at the twig level. For explanation, see comment just above + is_next_item_internal(). + */ + CBK_TRUST_DK = (1 << 2), + CBK_READA = (1 << 3), /* original: readahead leaves which contain items of certain file */ + CBK_READDIR_RA = (1 << 4), /* readdir: readahead whole directory and all its stat datas */ + CBK_DKSET = (1 << 5), + CBK_EXTENDED_COORD = (1 << 6), /* coord_t is actually */ + CBK_IN_CACHE = (1 << 7), /* node is already in cache */ + CBK_USE_CRABLOCK = (1 << 8) /* use crab_lock instead of long term + * lock */ +} cbk_flags; + +/* insertion outcome.
IBK = insert by key */ +typedef enum { + IBK_INSERT_OK = 0, + IBK_ALREADY_EXISTS = -EEXIST, + IBK_IO_ERROR = -EIO, + IBK_NO_SPACE = -E_NODE_FULL, + IBK_OOM = -ENOMEM +} insert_result; + +#define IS_CBKERR(err) ((err) != CBK_COORD_FOUND && (err) != CBK_COORD_NOTFOUND) + +typedef int (*tree_iterate_actor_t) (reiser4_tree * tree, coord_t * coord, lock_handle * lh, void *arg); +extern int iterate_tree(reiser4_tree * tree, coord_t * coord, lock_handle * lh, + tree_iterate_actor_t actor, void *arg, znode_lock_mode mode, int through_units_p); +extern int get_uber_znode(reiser4_tree * tree, znode_lock_mode mode, + znode_lock_request pri, lock_handle *lh); + +/* return node plugin of @node */ +static inline node_plugin * +node_plugin_by_node(const znode * node /* node to query */ ) +{ + assert("vs-213", node != NULL); + assert("vs-214", znode_is_loaded(node)); + + return node->nplug; +} + +/* number of items in @node */ +static inline pos_in_node_t +node_num_items(const znode * node) +{ + assert("nikita-2754", znode_is_loaded(node)); + assert("nikita-2468", + node_plugin_by_node(node)->num_of_items(node) == node->nr_items); + + return node->nr_items; +} + +/* Return the number of items at the present node. Asserts coord->node != + NULL. 
*/ +static inline unsigned +coord_num_items(const coord_t * coord) +{ + assert("jmacd-9805", coord->node != NULL); + + return node_num_items(coord->node); +} + +/* true if @node is empty */ +static inline int +node_is_empty(const znode * node) +{ + return node_num_items(node) == 0; +} + +typedef enum { + SHIFTED_SOMETHING = 0, + SHIFT_NO_SPACE = -E_NODE_FULL, + SHIFT_IO_ERROR = -EIO, + SHIFT_OOM = -ENOMEM, +} shift_result; + +extern node_plugin *node_plugin_by_coord(const coord_t * coord); +extern int is_coord_in_node(const coord_t * coord); +extern int key_in_node(const reiser4_key *, const coord_t *); +extern void coord_item_move_to(coord_t * coord, int items); +extern void coord_unit_move_to(coord_t * coord, int units); + +/* there are two types of repetitive accesses (ra): intra-syscall + (local) and inter-syscall (global). Local ra is used when + during single syscall we add/delete several items and units in the + same place in a tree. Note that plan-A fragments local ra by + separating stat-data and file body in key-space. Global ra is + used when user does repetitive modifications in the same place in a + tree. + + Our ra implementation serves following purposes: + 1 it affects balancing decisions so that next operation in a row + can be performed faster; + 2 it affects lower-level read-ahead in page-cache; + 3 it allows to avoid unnecessary lookups by maintaining some state + across several operations (this is only for local ra); + 4 it leaves room for lazy-micro-balancing: when we start a sequence of + operations they are performed without actually doing any intra-node + shifts, until we finish sequence or scope of sequence leaves + current node, only then we really pack node (local ra only). +*/ + +/* another thing that can be useful is to keep per-tree and/or + per-process cache of recent lookups. This cache can be organised as a + list of block numbers of formatted nodes sorted by starting key in + this node. 
Balancings should invalidate appropriate parts of this + cache. +*/ + +lookup_result coord_by_key(reiser4_tree * tree, const reiser4_key * key, + coord_t * coord, lock_handle * handle, + znode_lock_mode lock, lookup_bias bias, + tree_level lock_level, tree_level stop_level, __u32 flags, + ra_info_t *); + +lookup_result object_lookup(struct inode *object, + const reiser4_key * key, + coord_t * coord, + lock_handle * lh, + znode_lock_mode lock_mode, + lookup_bias bias, + tree_level lock_level, + tree_level stop_level, + __u32 flags, + ra_info_t *info); + +insert_result insert_by_key(reiser4_tree * tree, const reiser4_key * key, + reiser4_item_data * data, coord_t * coord, + lock_handle * lh, + tree_level stop_level, __u32 flags); +insert_result insert_by_coord(coord_t * coord, + reiser4_item_data * data, const reiser4_key * key, + lock_handle * lh, + __u32); +insert_result insert_extent_by_coord(coord_t * coord, + reiser4_item_data * data, const reiser4_key * key, lock_handle * lh); +int cut_node_content(coord_t *from, coord_t *to, + const reiser4_key *from_key, const reiser4_key *to_key, + reiser4_key *smallest_removed); +int kill_node_content(coord_t *from, coord_t *to, + const reiser4_key *from_key, const reiser4_key *to_key, + reiser4_key *smallest_removed, + znode *locked_left_neighbor, + struct inode *inode, int truncate); + +int resize_item(coord_t * coord, reiser4_item_data * data, + reiser4_key * key, lock_handle * lh, cop_insert_flag); +int insert_into_item(coord_t * coord, lock_handle * lh, const reiser4_key * key, reiser4_item_data * data, unsigned); +int insert_flow(coord_t * coord, lock_handle * lh, flow_t * f); +int find_new_child_ptr(znode * parent, znode * child, znode * left, coord_t * result); + +int shift_right_of_but_excluding_insert_coord(coord_t * insert_coord); +int shift_left_of_and_including_insert_coord(coord_t * insert_coord); + +void fake_kill_hook_tail(struct inode *, loff_t start, loff_t end, int); + +extern int 
cut_tree_worker_common(tap_t *, const reiser4_key *, const reiser4_key *, + reiser4_key *, struct inode *, int, int*); +extern int cut_tree_object(reiser4_tree*, const reiser4_key*, const reiser4_key*, reiser4_key*, struct inode*, int, int*); +extern int cut_tree(reiser4_tree *tree, const reiser4_key *from, const reiser4_key *to, struct inode*, int); + +extern int delete_node(znode * node, reiser4_key *, struct inode *, int); +extern int check_tree_pointer(const coord_t * pointer, const znode * child); +extern int find_new_child_ptr(znode * parent, znode * child UNUSED_ARG, znode * left, coord_t * result); +extern int find_child_ptr(znode * parent, znode * child, coord_t * result); +extern int set_child_delimiting_keys(znode * parent, const coord_t * in_parent, znode *child); +extern znode *child_znode(const coord_t * in_parent, znode * parent, int incore_p, int setup_dkeys_p); + +extern int cbk_cache_init(cbk_cache * cache); +extern void cbk_cache_done(cbk_cache * cache); +extern void cbk_cache_invalidate(const znode * node, reiser4_tree * tree); + +extern char *sprint_address(const reiser4_block_nr * block); + +#if REISER4_DEBUG +extern void print_coord_content(const char *prefix, coord_t * p); +extern void print_address(const char *prefix, const reiser4_block_nr * block); +extern void print_tree_rec(const char *prefix, reiser4_tree * tree, __u32 flags); +#else +#define print_coord_content(p, c) noop +#define print_address(p, b) noop +#endif + +extern void forget_znode(lock_handle * handle); +extern int deallocate_znode(znode * node); + +extern int is_disk_addr_unallocated(const reiser4_block_nr * addr); + +/* struct used internally to pack all numerous arguments of tree lookup. + Used to avoid passing a lot of arguments to helper functions. 
*/ +typedef struct cbk_handle { + /* tree we are in */ + reiser4_tree *tree; + /* key we are going after */ + const reiser4_key *key; + /* coord we will store result in */ + coord_t *coord; + /* type of lock to take on target node */ + znode_lock_mode lock_mode; + /* lookup bias. See comments at the declaration of lookup_bias */ + lookup_bias bias; + /* lock level: level starting from which tree traversal starts taking + * write locks. */ + tree_level lock_level; + /* level where search will stop. Either item will be found between + lock_level and stop_level, or CBK_COORD_NOTFOUND will be + returned. + */ + tree_level stop_level; + /* level we are currently at */ + tree_level level; + /* block number of @active node. Tree traversal operates on two + nodes: active and parent. */ + reiser4_block_nr block; + /* put here error message to be printed by caller */ + const char *error; + /* result passed back to caller */ + lookup_result result; + /* lock handles for active and parent */ + lock_handle *parent_lh; + lock_handle *active_lh; + reiser4_key ld_key; + reiser4_key rd_key; + /* flags, passed to the cbk routine. Bits of this bitmask are defined + in tree.h:cbk_flags enum. 
*/ + __u32 flags; + ra_info_t *ra_info; + struct inode *object; +} cbk_handle; + +extern znode_lock_mode cbk_lock_mode(tree_level level, cbk_handle * h); + +/* eottl.c */ +extern int handle_eottl(cbk_handle * h, int *outcome); + +int lookup_multikey(cbk_handle * handle, int nr_keys); +int lookup_couple(reiser4_tree * tree, + const reiser4_key * key1, const reiser4_key * key2, + coord_t * coord1, coord_t * coord2, + lock_handle * lh1, lock_handle * lh2, + znode_lock_mode lock_mode, lookup_bias bias, + tree_level lock_level, tree_level stop_level, __u32 flags, int *result1, int *result2); + +/* ordering constraint for tree spin lock: tree lock is "strongest" */ +#define rw_ordering_pred_tree(tree) \ + (lock_counters()->spin_locked_txnh == 0) && \ + (lock_counters()->rw_locked_tree == 0) && \ + (lock_counters()->rw_locked_dk == 0) + +/* Define spin_lock_tree, spin_unlock_tree, and spin_tree_is_locked: + spin lock protecting znode hash, and parent and sibling pointers. */ +RW_LOCK_FUNCTIONS(tree, reiser4_tree, tree_lock); + +/* ordering constraint for delimiting key spin lock: dk lock is weaker than + tree lock */ +#define rw_ordering_pred_dk( tree ) 1 +#if 0 + (lock_counters()->rw_locked_tree == 0) && \ + (lock_counters()->spin_locked_jnode == 0) && \ + (lock_counters()->rw_locked_zlock == 0) && \ + (lock_counters()->spin_locked_txnh == 0) && \ + (lock_counters()->spin_locked_atom == 0) && \ + (lock_counters()->spin_locked_inode_object == 0) && \ + (lock_counters()->spin_locked_txnmgr == 0) +#endif + +/* Define spin_lock_dk(), spin_unlock_dk(), etc: locking for delimiting + keys. */ +RW_LOCK_FUNCTIONS(dk, reiser4_tree, dk_lock); + +#if REISER4_DEBUG +#define check_tree() print_tree_rec( "", current_tree, REISER4_TREE_CHECK ) +#else +#define check_tree() noop +#endif + +/* estimate api. 
Implementation is in estimate.c */ +reiser4_block_nr estimate_one_insert_item(reiser4_tree *); +reiser4_block_nr estimate_one_insert_into_item(reiser4_tree *); +reiser4_block_nr estimate_insert_flow(tree_level); +reiser4_block_nr estimate_one_item_removal(reiser4_tree *); +reiser4_block_nr calc_estimate_one_insert(tree_level); +reiser4_block_nr estimate_disk_cluster(struct inode *); +reiser4_block_nr estimate_insert_cluster(struct inode *, int); + +/* take read or write tree lock, depending on @takeread argument */ +#define XLOCK_TREE(tree, takeread) \ + (takeread ? RLOCK_TREE(tree) : WLOCK_TREE(tree)) + +/* release read or write tree lock, depending on @takeread argument */ +#define XUNLOCK_TREE(tree, takeread) \ + (takeread ? RUNLOCK_TREE(tree) : WUNLOCK_TREE(tree)) + +/* __REISER4_TREE_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/tree_mod.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/tree_mod.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,364 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* + * Functions to add/delete new nodes to/from the tree. + * + * Functions from this file are used by carry (see carry*) to handle: + * + * . insertion of new formatted node into tree + * + * . addition of new tree root, increasing tree height + * + * . removing tree root, decreasing tree height + * + */ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "key.h" +#include "coord.h" +#include "plugin/plugin.h" +#include "jnode.h" +#include "znode.h" +#include "tree_mod.h" +#include "block_alloc.h" +#include "tree_walk.h" +#include "tree.h" +#include "super.h" + +#include + +static int add_child_ptr(znode * parent, znode * child); +/* warning only issued if error is not -E_REPEAT */ +#define ewarning( error, ... 
) \ + if( ( error ) != -E_REPEAT ) \ + warning( __VA_ARGS__ ) + +/* allocate new node on the @level and immediately on the right of @brother. */ +reiser4_internal znode * +new_node(znode * brother /* existing left neighbor of new node */ , + tree_level level /* tree level at which new node is to + * be allocated */ ) +{ + znode *result; + int retcode; + reiser4_block_nr blocknr; + + assert("nikita-930", brother != NULL); + assert("umka-264", level < REAL_MAX_ZTREE_HEIGHT); + + retcode = assign_fake_blocknr_formatted(&blocknr); + if (retcode == 0) { + result = zget(znode_get_tree(brother), &blocknr, NULL, level, GFP_KERNEL); + if (IS_ERR(result)) { + ewarning(PTR_ERR(result), "nikita-929", + "Cannot allocate znode for carry: %li", PTR_ERR(result)); + return result; + } + /* cheap test, can be executed even when debugging is off */ + if (!znode_just_created(result)) { + warning("nikita-2213", "Allocated already existing block: %llu", + (unsigned long long)blocknr); + zput(result); + return ERR_PTR(RETERR(-EIO)); + } + + assert("nikita-931", result != NULL); + result->nplug = znode_get_tree(brother)->nplug; + assert("nikita-933", result->nplug != NULL); + + retcode = zinit_new(result, GFP_KERNEL); + if (retcode == 0) { + ZF_SET(result, JNODE_CREATED); + zrelse(result); + } else { + zput(result); + result = ERR_PTR(retcode); + } + } else { + /* failure to allocate new node during balancing. + This should never happen. Ever. Returning -E_REPEAT + is not viable solution, because "out of disk space" + is not transient error that will go away by itself. + */ + ewarning(retcode, "nikita-928", + "Cannot allocate block for carry: %i", retcode); + result = ERR_PTR(retcode); + } + assert("nikita-1071", result != NULL); + return result; +} + +/* allocate new root and add it to the tree + + This helper function is called by add_new_root(). 
+ +*/ +reiser4_internal znode * +add_tree_root(znode * old_root /* existing tree root */ , + znode * fake /* "fake" znode */ ) +{ + reiser4_tree *tree = znode_get_tree(old_root); + znode *new_root = NULL; /* to shut gcc up */ + int result; + + assert("nikita-1069", old_root != NULL); + assert("umka-262", fake != NULL); + assert("umka-263", tree != NULL); + + /* "fake" znode---one always hanging just above current root. This + node is locked when new root is created or existing root is + deleted. Downward tree traversal takes lock on it before taking + lock on a root node. This avoids race conditions with root + manipulations. + + */ + assert("nikita-1348", znode_above_root(fake)); + assert("nikita-1211", znode_is_root(old_root)); + + result = 0; + if (tree->height >= REAL_MAX_ZTREE_HEIGHT) { + warning("nikita-1344", "Tree is too tall: %i", tree->height); + /* ext2 returns -ENOSPC when it runs out of free inodes with a + following comment (fs/ext2/ialloc.c:441): Is it really + ENOSPC? + + -EXFULL? -EINVAL? + */ + result = RETERR(-ENOSPC); + } else { + /* Allocate block for new root. It's not that + important where it will be allocated, as root is + almost always in memory. Moreover, allocate on + flush can be going here. + */ + assert("nikita-1448", znode_is_root(old_root)); + new_root = new_node(fake, tree->height + 1); + if (!IS_ERR(new_root) && (result = zload(new_root)) == 0) { + lock_handle rlh; + + init_lh(&rlh); + result = longterm_lock_znode(&rlh, new_root, ZNODE_WRITE_LOCK, ZNODE_LOCK_LOPRI); + if (result == 0) { + parent_coord_t *in_parent; + + znode_make_dirty(fake); + + /* new root is a child of "fake" node */ + WLOCK_TREE(tree); + + ++tree->height; + + /* recalculate max balance overhead */ + tree->estimate_one_insert = estimate_one_insert_item(tree); + + tree->root_block = *znode_get_block(new_root); + in_parent = &new_root->in_parent; + init_parent_coord(in_parent, fake); + /* manually insert new root into sibling + * list. 
With this all nodes involved into + * balancing are connected after balancing is + * done---useful invariant to check. */ + sibling_list_insert_nolock(new_root, NULL); + WUNLOCK_TREE(tree); + + /* insert into new root pointer to the + @old_root. */ + assert("nikita-1110", WITH_DATA(new_root, node_is_empty(new_root))); + WLOCK_DK(tree); + znode_set_ld_key(new_root, min_key()); + znode_set_rd_key(new_root, max_key()); + WUNLOCK_DK(tree); + if (REISER4_DEBUG) { + ZF_CLR(old_root, JNODE_LEFT_CONNECTED); + ZF_CLR(old_root, JNODE_RIGHT_CONNECTED); + ZF_SET(old_root, JNODE_ORPHAN); + } + result = add_child_ptr(new_root, old_root); + done_lh(&rlh); + } + zrelse(new_root); + } + } + if (result != 0) + new_root = ERR_PTR(result); + return new_root; +} + +/* build &reiser4_item_data for inserting child pointer + + Build &reiser4_item_data that can be later used to insert pointer to @child + in its parent. + +*/ +reiser4_internal void +build_child_ptr_data(znode * child /* node pointer to which will be + * inserted */ , + reiser4_item_data * data /* where to store result */ ) +{ + assert("nikita-1116", child != NULL); + assert("nikita-1117", data != NULL); + + /* this is subtle assignment to meditate upon */ + data->data = (char *) znode_get_block(child); + /* data -> data is kernel space */ + data->user = 0; + data->length = sizeof (reiser4_block_nr); + /* FIXME-VS: hardcoded internal item? */ + + /* AUDIT: Is it possible that "item_plugin_by_id" may find nothing? */ + data->iplug = item_plugin_by_id(NODE_POINTER_ID); +} + +/* add pointer to @child into empty @parent. + + This is used when pointer to old root is inserted into new root which is + empty. 
+*/ +static int +add_child_ptr(znode * parent, znode * child) +{ + coord_t coord; + reiser4_item_data data; + int result; + reiser4_key *key; + + assert("nikita-1111", parent != NULL); + assert("nikita-1112", child != NULL); + assert("nikita-1115", znode_get_level(parent) == znode_get_level(child) + 1); + + result = zload(parent); + if (result != 0) + return result; + assert("nikita-1113", node_is_empty(parent)); + coord_init_first_unit(&coord, parent); + + build_child_ptr_data(child, &data); + data.arg = NULL; + + key = UNDER_RW(dk, znode_get_tree(parent), read, znode_get_ld_key(child)); + result = node_plugin_by_node(parent)->create_item(&coord, key, &data, NULL); + znode_make_dirty(parent); + zrelse(parent); + return result; +} + +/* actually remove tree root */ +static int +kill_root(reiser4_tree * tree /* tree from which root is being + * removed */ , + znode * old_root /* root node that is being removed */ , + znode * new_root /* new root---sole child of * + * @old_root */ , + const reiser4_block_nr * new_root_blk /* disk address of + * @new_root */ ) +{ + znode *uber; + int result; + lock_handle handle_for_uber; + + assert("umka-265", tree != NULL); + assert("nikita-1198", new_root != NULL); + assert("nikita-1199", znode_get_level(new_root) + 1 == znode_get_level(old_root)); + + assert("nikita-1201", znode_is_write_locked(old_root)); + + assert("nikita-1203", disk_addr_eq(new_root_blk, znode_get_block(new_root))); + + init_lh(&handle_for_uber); + /* obtain and lock "fake" znode protecting changes in tree height. */ + result = get_uber_znode(tree, ZNODE_WRITE_LOCK, ZNODE_LOCK_HIPRI, + &handle_for_uber); + if (result == 0) { + uber = handle_for_uber.node; + + znode_make_dirty(uber); + + /* don't take long term lock a @new_root. Take spinlock. 
*/ + + WLOCK_TREE(tree); + + tree->root_block = *new_root_blk; + --tree->height; + + /* recalculate max balance overhead */ + tree->estimate_one_insert = estimate_one_insert_item(tree); + + assert("nikita-1202", tree->height == znode_get_level(new_root)); + + /* new root is a child of the "fake" node */ + init_parent_coord(&new_root->in_parent, uber); + ++uber->c_count; + + /* sibling_list_insert_nolock(new_root, NULL); */ + WUNLOCK_TREE(tree); + + /* reinitialise old root. */ + result = node_plugin_by_node(old_root)->init(old_root); + znode_make_dirty(old_root); + if (result == 0) { + assert("nikita-1279", node_is_empty(old_root)); + ZF_SET(old_root, JNODE_HEARD_BANSHEE); + old_root->c_count = 0; + } + } + done_lh(&handle_for_uber); + + return result; +} + +/* remove tree root + + This function removes the tree root, decreasing tree height by one. The tree + root and its only child (which is going to become the new tree root) are + write locked on entry. + + To remove the tree root we need to take a lock on the special "fake" znode + that protects changes of tree height. See comments in add_tree_root() for + more on this. + + Also parent pointers have to be updated in + old and new root. To simplify the code, the function is split into two parts: + the outer kill_tree_root() collects all necessary arguments and calls + kill_root() to do the actual job.
+ +*/ +reiser4_internal int +kill_tree_root(znode * old_root /* tree root that we are removing */ ) +{ + int result; + coord_t down_link; + znode *new_root; + reiser4_tree *tree; + + assert("umka-266", current_tree != NULL); + assert("nikita-1194", old_root != NULL); + assert("nikita-1196", znode_is_root(old_root)); + assert("nikita-1200", node_num_items(old_root) == 1); + assert("nikita-1401", znode_is_write_locked(old_root)); + + coord_init_first_unit(&down_link, old_root); + + tree = znode_get_tree(old_root); + new_root = child_znode(&down_link, old_root, 0, 1); + if (!IS_ERR(new_root)) { + result = kill_root(tree, old_root, new_root, znode_get_block(new_root)); + zput(new_root); + } else + result = PTR_ERR(new_root); + + return result; +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/tree_mod.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/tree_mod.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,29 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Functions to add/delete new nodes to/from the tree. See tree_mod.c for + * comments. */ + +#if !defined( __REISER4_TREE_MOD_H__ ) +#define __REISER4_TREE_MOD_H__ + +#include "forward.h" + +znode *new_node(znode * brother, tree_level level); +znode *add_tree_root(znode * old_root, znode * fake); +int kill_tree_root(znode * old_root); +void build_child_ptr_data(znode * child, reiser4_item_data * data); + +/* __REISER4_TREE_MOD_H__ */ +#endif + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/tree_walk.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/tree_walk.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1232 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Routines and macros to: + + get_left_neighbor() + + get_right_neighbor() + + get_parent() + + get_first_child() + + get_last_child() + + various routines to walk the whole tree and do things to it like + repack it, or move it to tertiary storage. Please make them as + generic as is reasonable. + +*/ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "coord.h" +#include "plugin/item/item.h" +#include "jnode.h" +#include "znode.h" +#include "tree_walk.h" +#include "tree.h" +#include "super.h" + +/* These macros are used internally in tree_walk.c in attempt to make + lock_neighbor() code usable to build lock_parent(), lock_right_neighbor, + lock_left_neighbor */ +#define GET_NODE_BY_PTR_OFFSET(node, off) (*(znode**)(((unsigned long)(node)) + (off))) +#define FIELD_OFFSET(name) offsetof(znode, name) +#define PARENT_PTR_OFFSET FIELD_OFFSET(in_parent.node) +#define LEFT_PTR_OFFSET FIELD_OFFSET(left) +#define RIGHT_PTR_OFFSET FIELD_OFFSET(right) + +/* This is the generic procedure to get and lock `generic' neighbor (left or + right neighbor or parent). It implements common algorithm for all cases of + getting lock on neighbor node, only znode structure field is different in + each case. This is parameterized by ptr_offset argument, which is byte + offset for the pointer to the desired neighbor within the current node's + znode structure. 
This function should be called with the tree lock held */ +static int +lock_neighbor( + /* resulting lock handle*/ + lock_handle * result, + /* znode to lock */ + znode * node, + /* pointer to neighbor (or parent) znode field offset, in bytes from + the base address of znode structure */ + int ptr_offset, + /* lock mode for longterm_lock_znode call */ + znode_lock_mode mode, + /* lock request for longterm_lock_znode call */ + znode_lock_request req, + /* GN_* flags */ + int flags, + int rlocked) +{ + reiser4_tree *tree = znode_get_tree(node); + znode *neighbor; + int ret; + + assert("umka-236", node != NULL); + assert("umka-237", tree != NULL); + assert("umka-301", rw_tree_is_locked(tree)); + + if (flags & GN_TRY_LOCK) + req |= ZNODE_LOCK_NONBLOCK; + if (flags & GN_SAME_ATOM) + req |= ZNODE_LOCK_DONT_FUSE; + + /* get neighbor's address by using of sibling link, quit while loop + (and return) if link is not available. */ + while (1) { + neighbor = GET_NODE_BY_PTR_OFFSET(node, ptr_offset); + + /* return -E_NO_NEIGHBOR if parent or side pointer is NULL or if + * node pointed by it is not connected. + * + * However, GN_ALLOW_NOT_CONNECTED option masks "connected" + * check and allows passing reference to not connected znode to + * subsequent longterm_lock_znode() call. This kills possible + * busy loop if we are trying to get longterm lock on locked but + * not yet connected parent node. */ + if (neighbor == NULL || !((flags & GN_ALLOW_NOT_CONNECTED) + || znode_is_connected(neighbor))) { + return RETERR(-E_NO_NEIGHBOR); + } + + /* protect it from deletion. */ + zref(neighbor); + + XUNLOCK_TREE(tree, rlocked); + + ret = longterm_lock_znode(result, neighbor, mode, req); + + /* The lock handle obtains its own reference, release the one from above. */ + zput(neighbor); + + XLOCK_TREE(tree, rlocked); + + /* restart if node we got reference to is being + invalidated. 
we should not get reference to this node + again.*/ + if (ret == -EINVAL) + continue; + if (ret) + return ret; + + /* check if neighbor link still points to just locked znode; + the link could have been changed while the process slept. */ + if (neighbor == GET_NODE_BY_PTR_OFFSET(node, ptr_offset)) + return 0; + + /* znode was locked by mistake; unlock it and restart locking + process from beginning. */ + XUNLOCK_TREE(tree, rlocked); + longterm_unlock_znode(result); + XLOCK_TREE(tree, rlocked); + } +} +/* get parent node with longterm lock, accepts GN* flags. */ +reiser4_internal int +reiser4_get_parent_flags(lock_handle * result /* resulting lock handle */, + znode * node /* child node */, + znode_lock_mode mode /* type of lock: read or write */, + int flags /* GN_* flags */) +{ + return UNDER_RW(tree, znode_get_tree(node), read, + lock_neighbor(result, node, PARENT_PTR_OFFSET, mode, + ZNODE_LOCK_HIPRI, flags, 1)); +} + +/* A wrapper for reiser4_get_parent_flags(). */ +reiser4_internal int +reiser4_get_parent(lock_handle * result /* resulting lock + * handle */ , + znode * node /* child node */ , + znode_lock_mode mode /* type of lock: read or write */ , + int only_connected_p /* if this is true, parent is + * only returned when it is + * connected. If parent is + * unconnected, -E_NO_NEIGHBOR is + * returned. Normal users should + * pass 1 here. Only during carry + * we want to access still + * unconnected parents. */ ) +{ + assert("umka-238", znode_get_tree(node) != NULL); + + return reiser4_get_parent_flags(result, node, mode, + only_connected_p ? 
0 : GN_ALLOW_NOT_CONNECTED); +} + +/* wrapper function to lock right or left neighbor depending on GN_GO_LEFT + bit in @flags parameter */ +/* Audited by: umka (2002.06.14) */ +static inline int +lock_side_neighbor(lock_handle * result, + znode * node, + znode_lock_mode mode, + int flags, int rlocked) +{ + int ret; + int ptr_offset; + znode_lock_request req; + + if (flags & GN_GO_LEFT) { + ptr_offset = LEFT_PTR_OFFSET; + req = ZNODE_LOCK_LOPRI; + } else { + ptr_offset = RIGHT_PTR_OFFSET; + req = ZNODE_LOCK_HIPRI; + } + + ret = lock_neighbor(result, node, ptr_offset, mode, req, flags, rlocked); + + if (ret == -E_NO_NEIGHBOR) /* if we walk left or right -E_NO_NEIGHBOR does not + * guarantee that neighbor is absent in the + * tree; in this case we return -ENOENT, + * meaning the neighbor was not found in + * the cache */ + return RETERR(-ENOENT); + + return ret; +} + +#if REISER4_DEBUG + +int check_sibling_list(znode * node) +{ + znode *scan; + znode *next; + + assert("nikita-3283", LOCK_CNT_GTZ(write_locked_tree)); + + if (node == NULL) + return 1; + + if (ZF_ISSET(node, JNODE_RIP)) + return 1; + + assert("nikita-3270", node != NULL); + assert("nikita-3269", rw_tree_is_write_locked(znode_get_tree(node))); + + for (scan = node; znode_is_left_connected(scan); scan = next) { + next = scan->left; + if (next != NULL && !ZF_ISSET(next, JNODE_RIP)) { + assert("nikita-3271", znode_is_right_connected(next)); + assert("nikita-3272", next->right == scan); + } else + break; + } + for (scan = node; znode_is_right_connected(scan); scan = next) { + next = scan->right; + if (next != NULL && !ZF_ISSET(next, JNODE_RIP)) { + assert("nikita-3273", znode_is_left_connected(next)); + assert("nikita-3274", next->left == scan); + } else + break; + } + return 1; +} + +#endif + +/* Znode sibling pointers maintenance. */ + +/* Znode sibling pointers are established between any neighboring nodes which are + in cache.
There are two znode state bits (JNODE_LEFT_CONNECTED, + JNODE_RIGHT_CONNECTED), if left or right sibling pointer contains actual + value (even NULL), corresponded JNODE_*_CONNECTED bit is set. + + Reiser4 tree operations which may allocate new znodes (CBK, tree balancing) + take care about searching (hash table lookup may be required) of znode + neighbors, establishing sibling pointers between them and setting + JNODE_*_CONNECTED state bits. */ + +/* adjusting of sibling pointers and `connected' states for two + neighbors; works if one neighbor is NULL (was not found). */ + +/* FIXME-VS: this is unstatic-ed to use in tree.c in prepare_twig_cut */ +reiser4_internal void +link_left_and_right(znode * left, znode * right) +{ + assert("nikita-3275", check_sibling_list(left)); + assert("nikita-3275", check_sibling_list(right)); + + if (left != NULL) { + if (left->right == NULL) { + left->right = right; + ZF_SET(left, JNODE_RIGHT_CONNECTED); + + ON_DEBUG(left->right_version = atomic_inc_return(&delim_key_version);); + + } else if (ZF_ISSET(left->right, JNODE_HEARD_BANSHEE)) { + + ON_DEBUG( + left->right->left_version = atomic_inc_return(&delim_key_version); + left->right_version = atomic_inc_return(&delim_key_version); + ); + + left->right->left = NULL; + left->right = right; + ZF_SET(left, JNODE_RIGHT_CONNECTED); + } else + /* + * there is a race condition in renew_sibling_link() + * and assertions below check that it is only one + * there. Thread T1 calls renew_sibling_link() without + * GN_NO_ALLOC flag. zlook() doesn't find neighbor + * node, but before T1 gets to the + * link_left_and_right(), another thread T2 creates + * neighbor node and connects it. check for + * left->right == NULL above protects T1 from + * overwriting correct left->right pointer installed + * by T2. 
+ */ + assert("nikita-3302", + right == NULL || left->right == right); + } + if (right != NULL) { + if (right->left == NULL) { + right->left = left; + ZF_SET(right, JNODE_LEFT_CONNECTED); + + ON_DEBUG(right->left_version = atomic_inc_return(&delim_key_version);); + + } else if (ZF_ISSET(right->left, JNODE_HEARD_BANSHEE)) { + + ON_DEBUG( + right->left->right_version = atomic_inc_return(&delim_key_version); + right->left_version = atomic_inc_return(&delim_key_version); + ); + + right->left->right = NULL; + right->left = left; + ZF_SET(right, JNODE_LEFT_CONNECTED); + + } else + assert("nikita-3303", + left == NULL || right->left == left); + } + assert("nikita-3275", check_sibling_list(left)); + assert("nikita-3275", check_sibling_list(right)); +} + +/* Audited by: umka (2002.06.14) */ +static void +link_znodes(znode * first, znode * second, int to_left) +{ + if (to_left) + link_left_and_right(second, first); + else + link_left_and_right(first, second); +} + +/* getting of next (to left or to right, depend on gn_to_left bit in flags) + coord's unit position in horizontal direction, even across node + boundary. Should be called under tree lock, it protects nonexistence of + sibling link on parent level, if lock_side_neighbor() fails with + -ENOENT. */ +static int +far_next_coord(coord_t * coord, lock_handle * handle, int flags) +{ + int ret; + znode *node; + reiser4_tree *tree; + + assert("umka-243", coord != NULL); + assert("umka-244", handle != NULL); + + handle->owner = NULL; /* mark lock handle as unused */ + + ret = (flags & GN_GO_LEFT) ? coord_prev_unit(coord) : coord_next_unit(coord); + if (!ret) + return 0; + + ret = lock_side_neighbor(handle, coord->node, ZNODE_READ_LOCK, flags, 0); + if (ret) + return ret; + + node = handle->node; + tree = znode_get_tree(node); + WUNLOCK_TREE(tree); + + coord_init_zero(coord); + + /* We avoid synchronous read here if it is specified by flag. 
*/ + if ((flags & GN_ASYNC) && znode_page(handle->node) == NULL) { + ret = jstartio(ZJNODE(handle->node)); + if (!ret) + ret = -E_REPEAT; + goto error_locked; + } + + /* the corresponding zrelse() should be called by the clients of + far_next_coord(), at the point where this node gets unlocked. */ + ret = zload(handle->node); + if (ret) + goto error_locked; + + if (flags & GN_GO_LEFT) + coord_init_last_unit(coord, node); + else + coord_init_first_unit(coord, node); + + if (0) { + error_locked: + longterm_unlock_znode(handle); + } + WLOCK_TREE(tree); + return ret; +} + +/* Very significant function which performs a step in horizontal direction + when sibling pointer is not available. In fact, it is the only function that + does so. + Note: this function does not restore locking status at exit, + the caller must take care of proper unlocking and zrelse()ing */ +static int +renew_sibling_link(coord_t * coord, lock_handle * handle, znode * child, tree_level level, int flags, int *nr_locked) +{ + int ret; + int to_left = flags & GN_GO_LEFT; + reiser4_block_nr da; + /* parent of the neighbor node; we assume it is the child's parent until + we detect that child and neighbor node do not share one parent */ + znode *side_parent = coord->node; + reiser4_tree *tree = znode_get_tree(child); + znode *neighbor = NULL; + + assert("umka-245", coord != NULL); + assert("umka-246", handle != NULL); + assert("umka-247", child != NULL); + assert("umka-303", tree != NULL); + + WLOCK_TREE(tree); + ret = far_next_coord(coord, handle, flags); + + if (ret) { + if (ret != -ENOENT) { + WUNLOCK_TREE(tree); + return ret; + } + } else { + item_plugin *iplug; + + if (handle->owner != NULL) { + (*nr_locked)++; + side_parent = handle->node; + } + + /* does the coord point to an internal item? We do not + support sibling pointers between znodes of formatted and + unformatted nodes and return -E_NO_NEIGHBOR in that case.
*/ + iplug = item_plugin_by_coord(coord); + if (!item_is_internal(coord)) { + link_znodes(child, NULL, to_left); + WUNLOCK_TREE(tree); + /* we know there can't be formatted neighbor */ + return RETERR(-E_NO_NEIGHBOR); + } + WUNLOCK_TREE(tree); + + iplug->s.internal.down_link(coord, NULL, &da); + + if (flags & GN_NO_ALLOC) { + neighbor = zlook(tree, &da); + } else { + neighbor = zget(tree, &da, side_parent, level, GFP_KERNEL); + } + + if (IS_ERR(neighbor)) { + ret = PTR_ERR(neighbor); + return ret; + } + + if (neighbor) + /* update delimiting keys */ + set_child_delimiting_keys(coord->node, coord, neighbor); + + WLOCK_TREE(tree); + } + + if (likely(neighbor == NULL || + (znode_get_level(child) == znode_get_level(neighbor) && child != neighbor))) + link_znodes(child, neighbor, to_left); + else { + warning("nikita-3532", + "Sibling nodes on the different levels: %i != %i\n", + znode_get_level(child), znode_get_level(neighbor)); + ret = RETERR(-EIO); + } + + WUNLOCK_TREE(tree); + + /* if GN_NO_ALLOC isn't set we keep reference to neighbor znode */ + if (neighbor != NULL && (flags & GN_NO_ALLOC)) + /* atomic_dec(&ZJNODE(neighbor)->x_count); */ + zput(neighbor); + + return ret; +} + +/* This function is for establishing of one side relation. */ +/* Audited by: umka (2002.06.14) */ +static int +connect_one_side(coord_t * coord, znode * node, int flags) +{ + coord_t local; + lock_handle handle; + int nr_locked; + int ret; + + assert("umka-248", coord != NULL); + assert("umka-249", node != NULL); + + coord_dup_nocheck(&local, coord); + + init_lh(&handle); + + ret = renew_sibling_link(&local, &handle, node, znode_get_level(node), flags | GN_NO_ALLOC, &nr_locked); + + if (handle.owner != NULL) { + /* complementary operations for zload() and lock() in far_next_coord() */ + zrelse(handle.node); + longterm_unlock_znode(&handle); + } + + /* we catch error codes which are not interesting for us because we + run renew_sibling_link() only for znode connection. 
*/ + if (ret == -ENOENT || ret == -E_NO_NEIGHBOR) + return 0; + + return ret; +} + +/* if @child is not in `connected' state, performs hash searches for left and + right neighbor nodes and establishes horizontal sibling links */ +/* Audited by: umka (2002.06.14), umka (2002.06.15) */ +reiser4_internal int +connect_znode(coord_t * parent_coord, znode * child) +{ + reiser4_tree *tree = znode_get_tree(child); + int ret = 0; + + assert("zam-330", parent_coord != NULL); + assert("zam-331", child != NULL); + assert("zam-332", parent_coord->node != NULL); + assert("umka-305", tree != NULL); + + /* it is trivial to `connect' root znode because it can't have + neighbors */ + if (znode_above_root(parent_coord->node)) { + child->left = NULL; + child->right = NULL; + ZF_SET(child, JNODE_LEFT_CONNECTED); + ZF_SET(child, JNODE_RIGHT_CONNECTED); + + ON_DEBUG( + child->left_version = atomic_inc_return(&delim_key_version); + child->right_version = atomic_inc_return(&delim_key_version); + ); + + return 0; + } + + /* load parent node */ + coord_clear_iplug(parent_coord); + ret = zload(parent_coord->node); + + if (ret != 0) + return ret; + + /* protect `connected' state check by tree_lock */ + RLOCK_TREE(tree); + + if (!znode_is_right_connected(child)) { + RUNLOCK_TREE(tree); + /* connect right (default is right) */ + ret = connect_one_side(parent_coord, child, GN_NO_ALLOC); + if (ret) + goto zrelse_and_ret; + + RLOCK_TREE(tree); + } + + ret = znode_is_left_connected(child); + + RUNLOCK_TREE(tree); + + if (!ret) { + ret = connect_one_side(parent_coord, child, GN_NO_ALLOC | GN_GO_LEFT); + } else + ret = 0; + +zrelse_and_ret: + zrelse(parent_coord->node); + + return ret; +} + +/* this function is like renew_sibling_link() but allocates neighbor node if + it doesn't exist and `connects' it. It may require making two steps in + horizontal direction, first one for neighbor node finding/allocation, + second one is for finding neighbor of neighbor to connect freshly allocated + znode. 
*/ +/* Audited by: umka (2002.06.14), umka (2002.06.15) */ +static int +renew_neighbor(coord_t * coord, znode * node, tree_level level, int flags) +{ + coord_t local; + lock_handle empty[2]; + reiser4_tree *tree = znode_get_tree(node); + znode *neighbor = NULL; + int nr_locked = 0; + int ret; + + assert("umka-250", coord != NULL); + assert("umka-251", node != NULL); + assert("umka-307", tree != NULL); + assert("umka-308", level <= tree->height); + + /* umka (2002.06.14) + Here probably should be a check for given "level" validness. + Something like assert("xxx-yyy", level < REAL_MAX_ZTREE_HEIGHT); + */ + + coord_dup(&local, coord); + + ret = renew_sibling_link(&local, &empty[0], node, level, flags & ~GN_NO_ALLOC, &nr_locked); + if (ret) + goto out; + + /* tree lock is not needed here because we keep parent node(s) locked + and reference to neighbor znode incremented */ + neighbor = (flags & GN_GO_LEFT) ? node->left : node->right; + + ret = UNDER_RW(tree, tree, read, znode_is_connected(neighbor)); + + if (ret) { + ret = 0; + goto out; + } + + ret = renew_sibling_link(&local, &empty[nr_locked], neighbor, level, flags | GN_NO_ALLOC, &nr_locked); + /* second renew_sibling_link() call is used for znode connection only, + so we can live with these errors */ + if (-ENOENT == ret || -E_NO_NEIGHBOR == ret) + ret = 0; + +out: + + for (--nr_locked; nr_locked >= 0; --nr_locked) { + zrelse(empty[nr_locked].node); + longterm_unlock_znode(&empty[nr_locked]); + } + + if (neighbor != NULL) + /* decrement znode reference counter without actually + releasing it. */ + atomic_dec(&ZJNODE(neighbor)->x_count); + + return ret; +} + +/* + reiser4_get_neighbor() -- lock node's neighbor. + + reiser4_get_neighbor() locks node's neighbor (left or right one, depends on + given parameter) using sibling link to it. If sibling link is not available + (i.e. neighbor znode is not in cache) and flags allow read blocks, we go one + level up for information about neighbor's disk address. 
We lock node's + parent, if it is common parent for both 'node' and its neighbor, neighbor's + disk address is in next (to left or to right) down link from link that points + to original node. If not, we need to lock parent's neighbor, read its content + and take first(last) downlink with neighbor's disk address. That locking + could be done by using sibling link and lock_neighbor() function, if sibling + link exists. In another case we have to go level up again until we find + common parent or valid sibling link. Then go down + allocating/connecting/locking/reading nodes until neighbor of first one is + locked. + + @neighbor: result lock handle, + @node: a node which we lock neighbor of, + @lock_mode: lock mode {LM_READ, LM_WRITE}, + @flags: logical OR of {GN_*} (see description above) subset. + + @return: 0 if success, negative value if lock was impossible due to an error + or lack of neighbor node. +*/ + +/* Audited by: umka (2002.06.14), umka (2002.06.15) */ +reiser4_internal int +reiser4_get_neighbor ( + lock_handle * neighbor, znode * node, znode_lock_mode lock_mode, int flags) +{ + reiser4_tree *tree = znode_get_tree(node); + lock_handle path[REAL_MAX_ZTREE_HEIGHT]; + + coord_t coord; + + tree_level base_level; + tree_level h = 0; + int ret; + + assert("umka-252", tree != NULL); + assert("umka-253", neighbor != NULL); + assert("umka-254", node != NULL); + + base_level = znode_get_level(node); + + assert("umka-310", base_level <= tree->height); + + coord_init_zero(&coord); + +again: + /* first, we try to use simple lock_neighbor() which requires sibling + link existence */ + ret = UNDER_RW(tree, tree, read, + lock_side_neighbor(neighbor, node, lock_mode, flags, 1)); + + if (!ret) { + /* load znode content if it was specified */ + if (flags & GN_LOAD_NEIGHBOR) { + ret = zload(node); + if (ret) + longterm_unlock_znode(neighbor); + } + return ret; + } + + /* only -ENOENT means we may look upward and try to connect + @node with its neighbor (if @flags allow us to 
do it) */ + if (ret != -ENOENT || !(flags & GN_CAN_USE_UPPER_LEVELS)) + return ret; + + /* before establishing of sibling link we lock parent node; it is + required by renew_neighbor() to work. */ + init_lh(&path[0]); + ret = reiser4_get_parent(&path[0], node, ZNODE_READ_LOCK, 1); + if (ret) + return ret; + if (znode_above_root(path[0].node)) { + longterm_unlock_znode(&path[0]); + return RETERR(-E_NO_NEIGHBOR); + } + + while (1) { + znode *child = (h == 0) ? node : path[h - 1].node; + znode *parent = path[h].node; + + ret = zload(parent); + if (ret) + break; + + ret = find_child_ptr(parent, child, &coord); + + if (ret) { + zrelse(parent); + break; + } + + /* try to establish missing sibling link */ + ret = renew_neighbor(&coord, child, h + base_level, flags); + + zrelse(parent); + + switch (ret) { + case 0: + /* unlocking of parent znode prevents simple + deadlock situation */ + done_lh(&path[h]); + + /* depend on tree level we stay on we repeat first + locking attempt ... */ + if (h == 0) + goto again; + + /* ... or repeat establishing of sibling link at + one level below. */ + --h; + break; + + case -ENOENT: + /* sibling link is not available -- we go + upward. */ + init_lh(&path[h + 1]); + ret = reiser4_get_parent(&path[h + 1], parent, ZNODE_READ_LOCK, 1); + if (ret) + goto fail; + ++h; + if (znode_above_root(path[h].node)) { + ret = RETERR(-E_NO_NEIGHBOR); + goto fail; + } + break; + + case -E_DEADLOCK: + /* there was lock request from hi-pri locker. if + it is possible we unlock last parent node and + re-lock it again. */ + while (check_deadlock()) { + if (h == 0) + goto fail; + + done_lh(&path[--h]); + } + + break; + + default: /* other errors. 
*/ + goto fail; + } + } +fail: + ON_DEBUG(check_lock_node_data(node)); + ON_DEBUG(check_lock_data()); + + /* unlock path */ + do { + longterm_unlock_znode(&path[h]); + --h; + } while (h + 1 != 0); + + return ret; +} + +/* remove node from sibling list */ +/* Audited by: umka (2002.06.14) */ +reiser4_internal void +sibling_list_remove(znode * node) +{ + reiser4_tree *tree; + + tree = znode_get_tree(node); + assert("umka-255", node != NULL); + assert("zam-878", rw_tree_is_write_locked(tree)); + assert("nikita-3275", check_sibling_list(node)); + + WLOCK_DK(tree); + if (znode_is_right_connected(node) && node->right != NULL && + znode_is_left_connected(node) && node->left != NULL) { + assert("zam-32245", keyeq(znode_get_rd_key(node), znode_get_ld_key(node->right))); + znode_set_rd_key(node->left, znode_get_ld_key(node->right)); + } + WUNLOCK_DK(tree); + + if (znode_is_right_connected(node) && node->right != NULL) { + assert("zam-322", znode_is_left_connected(node->right)); + node->right->left = node->left; + ON_DEBUG(node->right->left_version = atomic_inc_return(&delim_key_version);); + } + if (znode_is_left_connected(node) && node->left != NULL) { + assert("zam-323", znode_is_right_connected(node->left)); + node->left->right = node->right; + ON_DEBUG(node->left->right_version = atomic_inc_return(&delim_key_version);); + } + + ZF_CLR(node, JNODE_LEFT_CONNECTED); + ZF_CLR(node, JNODE_RIGHT_CONNECTED); + ON_DEBUG( + node->left = node->right = NULL; + node->left_version = atomic_inc_return(&delim_key_version); + node->right_version = atomic_inc_return(&delim_key_version); + ); + assert("nikita-3276", check_sibling_list(node)); +} + +/* disconnect node from sibling list */ +reiser4_internal void +sibling_list_drop(znode * node) +{ + znode *right; + znode *left; + + assert("nikita-2464", node != NULL); + assert("nikita-3277", check_sibling_list(node)); + + right = node->right; + if (right != NULL) { + assert("nikita-2465", znode_is_left_connected(right)); + right->left = 
NULL; + ON_DEBUG(right->left_version = atomic_inc_return(&delim_key_version);); + } + left = node->left; + if (left != NULL) { + assert("zam-323", znode_is_right_connected(left)); + left->right = NULL; + ON_DEBUG(left->right_version = atomic_inc_return(&delim_key_version);); + } + ZF_CLR(node, JNODE_LEFT_CONNECTED); + ZF_CLR(node, JNODE_RIGHT_CONNECTED); + ON_DEBUG( + node->left = node->right = NULL; + node->left_version = atomic_inc_return(&delim_key_version); + node->right_version = atomic_inc_return(&delim_key_version); + ); +} + +/* Insert new node into sibling list. Regular balancing inserts new node + after (at right side) existing and locked node (@before), except one case + of adding new tree root node. @before should be NULL in that case. */ +reiser4_internal void +sibling_list_insert_nolock(znode * new, znode * before) +{ + assert("zam-334", new != NULL); + assert("nikita-3298", !znode_is_left_connected(new)); + assert("nikita-3299", !znode_is_right_connected(new)); + assert("nikita-3300", new->left == NULL); + assert("nikita-3301", new->right == NULL); + assert("nikita-3278", check_sibling_list(new)); + assert("nikita-3279", check_sibling_list(before)); + + if (before != NULL) { + assert("zam-333", znode_is_connected(before)); + new->right = before->right; + new->left = before; + ON_DEBUG( + new->right_version = atomic_inc_return(&delim_key_version); + new->left_version = atomic_inc_return(&delim_key_version); + ); + if (before->right != NULL) { + before->right->left = new; + ON_DEBUG(before->right->left_version = atomic_inc_return(&delim_key_version);); + } + before->right = new; + ON_DEBUG(before->right_version = atomic_inc_return(&delim_key_version);); + } else { + new->right = NULL; + new->left = NULL; + ON_DEBUG( + new->right_version = atomic_inc_return(&delim_key_version); + new->left_version = atomic_inc_return(&delim_key_version); + ); + } + ZF_SET(new, JNODE_LEFT_CONNECTED); + ZF_SET(new, JNODE_RIGHT_CONNECTED); + assert("nikita-3280", 
check_sibling_list(new)); + assert("nikita-3281", check_sibling_list(before)); +} + +struct tw_handle { + /* A key for tree walking (re)start, updated after each successful tree + * node processing */ + reiser4_key start_key; + /* A tree traversal current position. */ + tap_t tap; + /* An externally supplied pair of functions for formatted and + * unformatted nodes processing. */ + struct tree_walk_actor * actor; + /* It is passed to actor functions as is. */ + void * opaque; + /* A direction of a tree traversal: 1 if going from right to left. */ + int go_left:1; + /* "Done" flag */ + int done:1; + /* Current node was processed completely */ + int node_completed:1; +}; + +/* it locks the root node, handles the restarts inside */ +static int lock_tree_root (lock_handle * lock, znode_lock_mode mode) +{ + int ret; + + reiser4_tree * tree = current_tree; + lock_handle uber_znode_lock; + znode * root; + + init_lh(&uber_znode_lock); + again: + + ret = get_uber_znode(tree, mode, ZNODE_LOCK_HIPRI, &uber_znode_lock); + if (ret) + return ret; + + root = zget(tree, &tree->root_block, uber_znode_lock.node, tree->height, GFP_KERNEL); + if (IS_ERR(root)) { + done_lh(&uber_znode_lock); + return PTR_ERR(root); + } + + ret = longterm_lock_znode(lock, root, ZNODE_WRITE_LOCK, ZNODE_LOCK_HIPRI); + + zput(root); + done_lh(&uber_znode_lock); + + if (ret == -E_DEADLOCK) + goto again; + + return ret; +} + +/* Update the handle->start_key by the first key of the node is being + * processed. */ +static int update_start_key(struct tw_handle * h) +{ + int ret; + + ret = tap_load(&h->tap); + if (ret == 0) { + unit_key_by_coord(h->tap.coord, &h->start_key); + tap_relse(&h->tap); + } + return ret; +} + +/* Move tap to the next node, load it. 
*/ +static int go_next_node (struct tw_handle * h, lock_handle * lock, const coord_t * coord) +{ + int ret; + + assert ("zam-948", ergo (coord != NULL, lock->node == coord->node)); + + tap_relse(&h->tap); + + ret = tap_move(&h->tap, lock); + if (ret) + return ret; + + ret = tap_load(&h->tap); + if (ret) + goto error; + + if (coord) + coord_dup(h->tap.coord, coord); + else { + if (h->go_left) + coord_init_last_unit(h->tap.coord, lock->node); + else + coord_init_first_unit(h->tap.coord, lock->node); + } + + if (h->actor->process_znode != NULL) { + ret = (h->actor->process_znode)(&h->tap, h->opaque); + if (ret) + goto error; + } + + ret = update_start_key(h); + + error: + done_lh(lock); + return ret; +} + +static void next_unit (struct tw_handle * h) +{ + if (h->go_left) + h->node_completed = coord_prev_unit(h->tap.coord); + else + h->node_completed = coord_next_unit(h->tap.coord); +} + + +/* Move tree traversal position (which is embedded into tree_walk_handle) to the + * parent of current node (h->lh.node). */ +static int tw_up (struct tw_handle * h) +{ + coord_t coord; + lock_handle lock; + load_count load; + int ret; + + init_lh(&lock); + init_load_count(&load); + + do { + ret = reiser4_get_parent(&lock, h->tap.lh->node, ZNODE_WRITE_LOCK, 0); + if (ret) + break; + if (znode_above_root(lock.node)) { + h->done = 1; + break; + } + ret = incr_load_count_znode(&load, lock.node); + if (ret) + break; + ret = find_child_ptr(lock.node, h->tap.lh->node, &coord); + if (ret) + break; + ret = go_next_node(h, &lock, &coord); + if (ret) + break; + next_unit(h); + } while (0); + + done_load_count(&load); + done_lh(&lock); + + return ret; +} + +/* Move tree traversal position to the child of current node pointed by + * h->tap.coord. 
*/ +static int tw_down(struct tw_handle * h) +{ + reiser4_block_nr block; + lock_handle lock; + znode * child; + item_plugin * iplug; + tree_level level = znode_get_level(h->tap.lh->node); + int ret; + + assert ("zam-943", item_is_internal(h->tap.coord)); + + iplug = item_plugin_by_coord(h->tap.coord); + iplug->s.internal.down_link(h->tap.coord, NULL, &block); + init_lh(&lock); + + do { + child = zget(current_tree, &block, h->tap.lh->node, level - 1, GFP_KERNEL); + if (IS_ERR(child)) + return PTR_ERR(child); + ret = connect_znode(h->tap.coord, child); + if (ret) + break; + ret = longterm_lock_znode(&lock, child, ZNODE_WRITE_LOCK, 0); + if (ret) + break; + set_child_delimiting_keys(h->tap.coord->node, h->tap.coord, child); + ret = go_next_node (h, &lock, NULL); + } while(0); + + zput(child); + done_lh(&lock); + return ret; +} +/* Traverse the reiser4 tree until either all tree traversing is done or an + * error encountered (including recoverable ones as -E_DEADLOCK or -E_REPEAT). The + * @actor function is able to stop tree traversal by returning an appropriate + * error code. 
*/ +static int tw_by_handle (struct tw_handle * h) +{ + int ret; + lock_handle next_lock; + + ret = tap_load(&h->tap); + if (ret) + return ret; + + init_lh (&next_lock); + + while (!h->done) { + tree_level level; + + if (h->node_completed) { + h->node_completed = 0; + ret = tw_up(h); + if (ret) + break; + continue; + } + + assert ("zam-944", coord_is_existing_unit(h->tap.coord)); + level = znode_get_level(h->tap.lh->node); + + if (level == LEAF_LEVEL) { + h->node_completed = 1; + continue; + } + + if (item_is_extent(h->tap.coord)) { + if (h->actor->process_extent != NULL) { + ret = (h->actor->process_extent)(&h->tap, h->opaque); + if (ret) + break; + } + next_unit(h); + continue; + } + + ret = tw_down(h); + if (ret) + break; + } + + done_lh(&next_lock); + return ret; +} + +/* Walk the reiser4 tree in parent-first order */ +reiser4_internal int +tree_walk (const reiser4_key *start_key, int go_left, struct tree_walk_actor * actor, void * opaque) +{ + coord_t coord; + lock_handle lock; + struct tw_handle handle; + + int ret; + + assert ("zam-950", actor != NULL); + + handle.actor = actor; + handle.opaque = opaque; + handle.go_left = !!go_left; + handle.done = 0; + handle.node_completed = 0; + + init_lh(&lock); + + if (start_key == NULL) { + if (actor->before) { + ret = actor->before(opaque); + if (ret) + return ret; + } + + ret = lock_tree_root(&lock, ZNODE_WRITE_LOCK); + if (ret) + return ret; + ret = zload(lock.node); + if (ret) + goto done; + + if (go_left) + coord_init_last_unit(&coord, lock.node); + else + coord_init_first_unit_nocheck(&coord, lock.node); + + zrelse(lock.node); + goto no_start_key; + } else + handle.start_key = *start_key; + + do { + if (actor->before) { + ret = actor->before(opaque); + if (ret) + return ret; + } + + ret = coord_by_key(current_tree, &handle.start_key, &coord, &lock, ZNODE_WRITE_LOCK, + FIND_MAX_NOT_MORE_THAN, TWIG_LEVEL, LEAF_LEVEL, 0, NULL); + if (ret != CBK_COORD_FOUND) + break; + no_start_key: + tap_init(&handle.tap, &coord, 
&lock, ZNODE_WRITE_LOCK); + + ret = update_start_key(&handle); + if (ret) { + tap_done(&handle.tap); + break; + } + ret = tw_by_handle(&handle); + tap_done (&handle.tap); + + } while (!handle.done && (ret == -E_DEADLOCK || ret == -E_REPEAT)); + + done: + done_lh(&lock); + return ret; +} + + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 80 + End: +*/ diff -puN /dev/null fs/reiser4/tree_walk.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/tree_walk.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,117 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +/* definitions of reiser4 tree walk functions */ + +#ifndef __FS_REISER4_TREE_WALK_H__ +#define __FS_REISER4_TREE_WALK_H__ + +#include "debug.h" +#include "forward.h" + +/* establishes horizontal links between cached znodes */ +int connect_znode(coord_t * coord, znode * node); + +/* tree traversal functions (reiser4_get_parent(), reiser4_get_neighbor()) + have the following common arguments: + + return codes: + + @return : 0 - OK, + +ZAM-FIXME-HANS: wrong return code name. Change them all. + -ENOENT - neighbor is not in cache, what is detected by sibling + link absence. + + -E_NO_NEIGHBOR - we are sure that neighbor (or parent) node cannot be + found (because we are left-/right- most node of the + tree, for example). Also, this return code is for + reiser4_get_parent() when we see no parent link -- it + means that our node is root node. + + -E_DEADLOCK - deadlock detected (request from high-priority process + received), other error codes are conformed to + /usr/include/asm/errno.h . +*/ + +int +reiser4_get_parent_flags(lock_handle * result, znode * node, + znode_lock_mode mode, int flags); + +int reiser4_get_parent(lock_handle * result, znode * node, znode_lock_mode mode, int only_connected_p); + +/* bits definition for reiser4_get_neighbor function `flags' arg. 
*/ +typedef enum { + /* If sibling pointer is NULL, this flag allows get_neighbor() to try to + * find a not-yet-allocated, not-yet-connected neighbor by going through upper + * levels */ + GN_CAN_USE_UPPER_LEVELS = 0x1, + /* locking left neighbor instead of right one */ + GN_GO_LEFT = 0x2, + /* automatically load neighbor node content */ + GN_LOAD_NEIGHBOR = 0x4, + /* return -E_REPEAT if can't lock */ + GN_TRY_LOCK = 0x8, + /* used internally in tree_walk.c, causes renew_sibling to not + allocate neighbor znode, but only search for it in znode cache */ + GN_NO_ALLOC = 0x10, + /* do not go across atom boundaries */ + GN_SAME_ATOM = 0x20, + /* allow locking of nodes that are not connected */ + GN_ALLOW_NOT_CONNECTED = 0x40, + /* Avoid synchronous jload, instead, call jstartio() and return -E_REPEAT. */ + GN_ASYNC = 0x80 +} znode_get_neigbor_flags; + +int reiser4_get_neighbor(lock_handle * neighbor, znode * node, znode_lock_mode lock_mode, int flags); + +/* these are wrappers for the most common usages of reiser4_get_neighbor() */ +static inline int +reiser4_get_left_neighbor(lock_handle * result, znode * node, int lock_mode, int flags) +{ + return reiser4_get_neighbor(result, node, lock_mode, flags | GN_GO_LEFT); +} + +static inline int +reiser4_get_right_neighbor(lock_handle * result, znode * node, int lock_mode, int flags) +{ + ON_DEBUG(check_lock_node_data(node)); + ON_DEBUG(check_lock_data()); + return reiser4_get_neighbor(result, node, lock_mode, flags & (~GN_GO_LEFT)); +} + +extern void invalidate_lock(lock_handle * _link); + +extern void sibling_list_remove(znode * node); +extern void sibling_list_drop(znode * node); +extern void sibling_list_insert_nolock(znode * new, znode * before); +extern void link_left_and_right(znode * left, znode * right); + +/* Functions called by tree_walk() when tree_walk() ... */ +struct tree_walk_actor { + /* ... meets a formatted node, */ + int (*process_znode)(tap_t* , void*); + /* ... meets an extent, */ + int (*process_extent)(tap_t*, void*); + /* ... 
begins tree traversal or repeats it after -E_REPEAT was returned by + * node or extent processing functions. */ + int (*before)(void *); +}; +extern int tree_walk(const reiser4_key *, int, struct tree_walk_actor *, void *); + +#if REISER4_DEBUG +int check_sibling_list(znode * node); +#else +#define check_sibling_list(n) (1) +#endif + +#endif /* __FS_REISER4_TREE_WALK_H__ */ + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/txnmgr.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/txnmgr.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,4172 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Joshua MacDonald wrote the first draft of this code. */ + +/* ZAM-LONGTERM-FIXME-HANS: The locking in this file is badly designed, and a +filesystem scales only as well as its worst locking design. You need to +substantially restructure this code. Josh was not as experienced a programmer +as you. Particularly review how the locking style differs from what you did +for znodes using hi-lo priority locking, and present to me an opinion on +whether the differences are well founded. */ + +/* I cannot help but to disagree with the sentiment above. Locking of + * transaction manager is _not_ badly designed, and, at the very least, is not + * the scaling bottleneck. Scaling bottleneck is _exactly_ hi-lo priority + * locking on znodes, especially on the root node of the tree. --nikita, + * 2003.10.13 */ + +/* The txnmgr is a set of interfaces that keep track of atoms and transcrash handles. The + txnmgr processes capture_block requests and manages the relationship between jnodes and + atoms through the various stages of a transcrash, and it also oversees the fusion and + capture-on-copy processes. The main difficulty with this task is maintaining a + deadlock-free lock ordering between atoms and jnodes/handles. 
The reason for the + difficulty is that jnodes, handles, and atoms contain pointer circles, and the cycle + must be broken. The main requirement is that atom-fusion be deadlock free, so once you + hold the atom_lock you may then wait to acquire any jnode or handle lock. This implies + that any time you check the atom-pointer of a jnode or handle and then try to lock that + atom, you must use trylock() and possibly reverse the order. + + This code implements the design documented at: + + http://namesys.com/txn-doc.html + +ZAM-FIXME-HANS: update v4.html to contain all of the information present in the above (but updated), and then remove the +above document and reference the new. Be sure to provide some credit to Josh. I already have some writings on this +topic in v4.html, but they are lacking in details present in the above. Cure that. Remember to write for the bright 12 +year old --- define all technical terms used. + +*/ + +/* Thoughts on the external transaction interface: + + In the current code, a TRANSCRASH handle is created implicitly by init_context() (which + creates state that lasts for the duration of a system call and is called at the start + of ReiserFS methods implementing VFS operations), and closed by reiser4_exit_context(), + occupying the scope of a single system call. We wish to give certain applications an + interface to begin and close (commit) transactions. Since our implementation of + transactions does not yet support isolation, allowing an application to open a + transaction implies trusting it to later close the transaction. Part of the + transaction interface will be aimed at enabling that trust, but the interface for + actually using transactions is fairly narrow. + + BEGIN_TRANSCRASH: Returns a transcrash identifier. It should be possible to translate + this identifier into a string that a shell-script could use, allowing you to start a + transaction by issuing a command. 
Once open, the transcrash should be set in the task + structure, and there should be options (I suppose) to allow it to be carried across + fork/exec. A transcrash has several options: + + - READ_FUSING or WRITE_FUSING: The default policy is for txn-capture to capture only + on writes (WRITE_FUSING) and allow "dirty reads". If the application wishes to + capture on reads as well, it should set READ_FUSING. + + - TIMEOUT: Since a non-isolated transcrash cannot be undone, every transcrash must + eventually close (or else the machine must crash). If the application dies an + unexpected death with an open transcrash, for example, or if it hangs for a long + duration, one solution (to avoid crashing the machine) is to simply close it anyway. + This is a dangerous option, but it is one way to solve the problem until isolated + transcrashes are available for untrusted applications. + + It seems to be what databases do, though it is unclear how one avoids a DoS attack + creating a vulnerability based on resource starvation. Guaranteeing that some + minimum amount of computational resources are made available would seem more correct + than guaranteeing some amount of time. When we again have someone to code the work, + this issue should be considered carefully. -Hans + + RESERVE_BLOCKS: A running transcrash should indicate to the transaction manager how + many dirty blocks it expects. The reserve_blocks interface should be called at a point + where it is safe for the application to fail, because the system may not be able to + grant the allocation and the application must be able to back-out. For this reason, + the number of reserve-blocks can also be passed as an argument to BEGIN_TRANSCRASH, but + the application may also wish to extend the allocation after beginning its transcrash. + + CLOSE_TRANSCRASH: The application closes the transcrash when it is finished making + modifications that require transaction protection. 
When isolated transactions are + supported the CLOSE operation is replaced by either COMMIT or ABORT. For example, if a + RESERVE_BLOCKS call fails for the application, it should "abort" by calling + CLOSE_TRANSCRASH, even though it really commits any changes that were made (which is + why, for safety, the application should call RESERVE_BLOCKS before making any changes). + + For actually implementing these out-of-system-call-scoped transcrashes, the + reiser4_context has a "txn_handle *trans" pointer that may be set to an open + transcrash. Currently there are no dynamically-allocated transcrashes, but there is a + "kmem_cache_t *_txnh_slab" created for that purpose in this file. +*/ + +/* Extending the other system call interfaces for future transaction features: + + Specialized applications may benefit from passing flags to the ordinary system call + interface such as read(), write(), or stat(). For example, the application specifies + WRITE_FUSING by default but wishes to add that a certain read() command should be + treated as READ_FUSING. But which read? Is it the directory-entry read, the stat-data + read, or the file-data read? These issues are straightforward, but there are a lot of + them and adding the necessary flags-passing code will be tedious. + + When supporting isolated transactions, there is a corresponding READ_MODIFY_WRITE (RMW) + flag, which specifies that although it is a read operation being requested, a + write-lock should be taken. The reason is that read-locks are shared while write-locks + are exclusive, so taking a read-lock when a later-write is known in advance will often + lead to deadlock. If a reader knows it will write later, it should issue read + requests with the RMW flag set. +*/ + +/* + The znode/atom deadlock avoidance. + + FIXME(Zam): writing of this comment is in progress. + + The atom's special stage ASTAGE_CAPTURE_WAIT introduces a kind of long-term + locking on atoms, which makes the reiser4 locking scheme more complex. 
It had + deadlocks until we implemented deadlock avoidance algorithms. Those deadlocks + looked like the following: one stopped thread waits for a long-term lock on a + znode, while the thread that owns that lock waits until fusion with another atom is + allowed. + + The source of the deadlocks is an optimization of not capturing index nodes + for read. Let's prove it. Suppose we have a dumb node capturing scheme which + unconditionally captures each block before locking it. + + That scheme has no deadlocks. Let's begin with a thread whose stage is + ASTAGE_CAPTURE_WAIT and which waits for a znode lock. The thread can't wait for + a capture because its stage allows fusion with any atom except those which are + being committed currently. The process of atom commit can't deadlock because + the atom commit procedure does not acquire locks and does not fuse with other + atoms. Reiser4 does capturing right before going to sleep inside the + longterm_lock_znode() function, which means the znode which we want to lock is + already captured and its atom is in the ASTAGE_CAPTURE_WAIT stage. If we + continue the analysis we understand that no process in the sequence may + wait for atom fusion. Therefore there are no deadlocks of the described kind. + + The capturing optimization makes the deadlocks possible. A thread can wait for a + lock whose owner did not capture that node. The lock owner's current atom + is not fused with the first atom and it does not get an ASTAGE_CAPTURE_WAIT + state. A deadlock is possible when that atom meets another one which is in + ASTAGE_CAPTURE_WAIT already. + + The deadlock avoidance scheme includes two algorithms: + + The first algorithm is used when a thread captures a node which is locked but not + captured by another thread. Those nodes are marked MISSED_IN_CAPTURE at the + moment we skip their capturing. 
If such a node (marked MISSED_IN_CAPTURE) is + being captured by a thread whose current atom is in ASTAGE_CAPTURE_WAIT, the + routine which forces all lock owners to join with the current atom is executed. + + The second algorithm does not allow skipping the capture of already captured nodes. + + Both algorithms together prevent waiting for a longterm lock without atom fusion + with the atoms of all lock owners, which is the key condition for atom/znode + locking deadlocks. +*/ + +/* + * Transactions and mmap(2). + * + * 1. Transactions are not supported for accesses through mmap(2), because + * this would effectively amount to user-level transactions whose duration + * is beyond control of the kernel. + * + * 2. That said, we still want to preserve some decency with regard to + * mmap(2). During normal write(2) call, following sequence of events + * happens: + * + * 1. page is created; + * + * 2. jnode is created, dirtied and captured into current atom. + * + * 3. extent is inserted and modified. + * + * Steps (2) and (3) take place under long term lock on the twig node. + * + * When file is accessed through mmap(2) page is always created during + * page fault. After this (in reiser4_readpage()->readpage_extent()): + * + * 1. if access is made to non-hole page new jnode is created, (if + * necessary) + * + * 2. if access is made to the hole page, jnode is not created (XXX + * not clear why). + * + * Also, even if page is created by write page fault it is not marked + * dirty immediately by handle_mm_fault(). Probably this is to avoid races + * with page write-out. + * + * Dirty bit installed by hardware is only transferred to the struct page + * later, when page is unmapped (in zap_pte_range(), or + * try_to_unmap_one()). + * + * So, with mmap(2) we have to handle following irksome situations: + * + * 1. there exists modified page (clean or dirty) without jnode + * + * 2. there exists modified page (clean or dirty) with clean jnode + * + * 3. 
clean page which is a part of atom can be transparently modified + * at any moment through mapping without becoming dirty. + * + * (1) and (2) can lead to the out-of-memory situation: ->writepage() + * doesn't know what to do with such pages and ->sync_sb()/->writepages() + * don't see them, because these methods operate on atoms. + * + * (3) can lead to the loss of data: suppose we have dirty page with dirty + * captured jnode captured by some atom. As part of early flush (for + * example) page was written out. Dirty bit was cleared on both page and + * jnode. After this page is modified through mapping, but kernel doesn't + * notice and just discards page and jnode as part of commit. (XXX + * actually it doesn't, because to reclaim page ->releasepage() has to be + * called and before this dirty bit will be transferred to the struct + * page). + * + */ + +#include "debug.h" +#include "type_safe_list.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree.h" +#include "wander.h" +#include "ktxnmgrd.h" +#include "super.h" +#include "page_cache.h" +#include "reiser4.h" +#include "vfs_ops.h" +#include "inode.h" +#include "flush.h" + +#include +#include +#include +#include +#include +#include +#include +#include /* for totalram_pages */ + +static void atom_free(txn_atom * atom); + +static long commit_txnh(txn_handle * txnh); + +static void wakeup_atom_waitfor_list(txn_atom * atom); +static void wakeup_atom_waiting_list(txn_atom * atom); + +static void capture_assign_txnh_nolock(txn_atom * atom, txn_handle * txnh); + +static void capture_assign_block_nolock(txn_atom * atom, jnode * node); + +static int capture_assign_block(txn_handle * txnh, jnode * node); + +static int capture_assign_txnh(jnode * node, txn_handle * txnh, txn_capture mode, int can_coc); + +static int fuse_not_fused_lock_owners(txn_handle * txnh, znode * node); + +static int capture_init_fusion(jnode * node, txn_handle * txnh, txn_capture mode, int 
can_coc); + +static int capture_fuse_wait(jnode * node, txn_handle * txnh, txn_atom * atomf, txn_atom * atomh, txn_capture mode); + +static void capture_fuse_into(txn_atom * small, txn_atom * large); + +static int capture_copy(jnode * node, txn_handle * txnh, txn_atom * atomf, txn_atom * atomh, txn_capture mode, int can_coc); + +void invalidate_list(capture_list_head *); + +/* GENERIC STRUCTURES */ + +typedef struct _txn_wait_links txn_wait_links; + +struct _txn_wait_links { + lock_stack *_lock_stack; + fwaitfor_list_link _fwaitfor_link; + fwaiting_list_link _fwaiting_link; + int (*waitfor_cb)(txn_atom *atom, struct _txn_wait_links *wlinks); + int (*waiting_cb)(txn_atom *atom, struct _txn_wait_links *wlinks); +}; + +TYPE_SAFE_LIST_DEFINE(txnh, txn_handle, txnh_link); + +TYPE_SAFE_LIST_DEFINE(fwaitfor, txn_wait_links, _fwaitfor_link); +TYPE_SAFE_LIST_DEFINE(fwaiting, txn_wait_links, _fwaiting_link); + +/* FIXME: In theory, we should be using the slab cache init & destructor + methods instead of, e.g., jnode_init, etc. */ +static kmem_cache_t *_atom_slab = NULL; +/* this is for user-visible, cross system-call transactions. */ +static kmem_cache_t *_txnh_slab = NULL; + +ON_DEBUG(extern atomic_t flush_cnt;) + +/* TXN_INIT */ +/* Initialize static variables in this file. 
*/ +reiser4_internal int +txnmgr_init_static(void) +{ + assert("jmacd-600", _atom_slab == NULL); + assert("jmacd-601", _txnh_slab == NULL); + + ON_DEBUG(atomic_set(&flush_cnt, 0)); + + _atom_slab = kmem_cache_create("txn_atom", sizeof (txn_atom), 0, + SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT, + NULL, NULL); + + if (_atom_slab == NULL) { + goto error; + } + + _txnh_slab = kmem_cache_create("txn_handle", sizeof (txn_handle), 0, SLAB_HWCACHE_ALIGN, NULL, NULL); + + if (_txnh_slab == NULL) { + goto error; + } + + return 0; + +error: + + if (_atom_slab != NULL) { + kmem_cache_destroy(_atom_slab); + } + if (_txnh_slab != NULL) { + kmem_cache_destroy(_txnh_slab); + } + return RETERR(-ENOMEM); +} + +/* Un-initialize static variables in this file. */ +reiser4_internal int +txnmgr_done_static(void) +{ + int ret1, ret2, ret3; + + ret1 = ret2 = ret3 = 0; + + if (_atom_slab != NULL) { + ret1 = kmem_cache_destroy(_atom_slab); + _atom_slab = NULL; + } + + if (_txnh_slab != NULL) { + ret2 = kmem_cache_destroy(_txnh_slab); + _txnh_slab = NULL; + } + + return ret1 ? : ret2; +} + +/* Initialize a new transaction manager. Called when the super_block is initialized. */ +reiser4_internal void +txnmgr_init(txn_mgr * mgr) +{ + assert("umka-169", mgr != NULL); + + mgr->atom_count = 0; + mgr->id_count = 1; + + atom_list_init(&mgr->atoms_list); + spin_txnmgr_init(mgr); + + sema_init(&mgr->commit_semaphore, 1); +} + +/* Free transaction manager. */ +reiser4_internal int +txnmgr_done(txn_mgr * mgr UNUSED_ARG) +{ + assert("umka-170", mgr != NULL); + + return 0; +} + +/* Initialize a transaction handle. */ +/* Audited by: umka (2002.06.13) */ +static void +txnh_init(txn_handle * txnh, txn_mode mode) +{ + assert("umka-171", txnh != NULL); + + txnh->mode = mode; + txnh->atom = NULL; + txnh->flags = 0; + + spin_txnh_init(txnh); + + txnh_list_clean(txnh); +} + +#if REISER4_DEBUG +/* Check if a transaction handle is clean. 
*/ +static int +txnh_isclean(txn_handle * txnh) +{ + assert("umka-172", txnh != NULL); + return txnh->atom == NULL && spin_txnh_is_not_locked(txnh); +} +#endif + +/* Initialize an atom. */ +static void +atom_init(txn_atom * atom) +{ + int level; + + assert("umka-173", atom != NULL); + + memset(atom, 0, sizeof (txn_atom)); + + atom->stage = ASTAGE_FREE; + atom->start_time = jiffies; + + for (level = 0; level < REAL_MAX_ZTREE_HEIGHT + 1; level += 1) + capture_list_init(ATOM_DIRTY_LIST(atom, level)); + + capture_list_init(ATOM_CLEAN_LIST(atom)); + capture_list_init(ATOM_OVRWR_LIST(atom)); + capture_list_init(ATOM_WB_LIST(atom)); + capture_list_init(&atom->inodes); + spin_atom_init(atom); + txnh_list_init(&atom->txnh_list); + atom_list_clean(atom); + fwaitfor_list_init(&atom->fwaitfor_list); + fwaiting_list_init(&atom->fwaiting_list); + prot_list_init(&atom->protected); + blocknr_set_init(&atom->delete_set); + blocknr_set_init(&atom->wandered_map); + + init_atom_fq_parts(atom); +} + +#if REISER4_DEBUG +/* Check if an atom is clean. */ +static int +atom_isclean(txn_atom * atom) +{ + int level; + + assert("umka-174", atom != NULL); + + for (level = 0; level < REAL_MAX_ZTREE_HEIGHT + 1; level += 1) { + if (!capture_list_empty(ATOM_DIRTY_LIST(atom, level))) { + return 0; + } + } + + return + atom->stage == ASTAGE_FREE && + atom->txnh_count == 0 && + atom->capture_count == 0 && + atomic_read(&atom->refcount) == 0 && + atom_list_is_clean(atom) && + txnh_list_empty(&atom->txnh_list) && + capture_list_empty(ATOM_CLEAN_LIST(atom)) && + capture_list_empty(ATOM_OVRWR_LIST(atom)) && + capture_list_empty(ATOM_WB_LIST(atom)) && + fwaitfor_list_empty(&atom->fwaitfor_list) && + fwaiting_list_empty(&atom->fwaiting_list) && + prot_list_empty(&atom->protected) && + atom_fq_parts_are_clean(atom); +} +#endif + +/* Begin a transaction in this context. Currently this uses the reiser4_context's + trans_in_ctx, which means that transaction handles are stack-allocated. 
Eventually + this will be extended to allow transaction handles to span several contexts. */ +/* Audited by: umka (2002.06.13) */ +reiser4_internal void +txn_begin(reiser4_context * context) +{ + assert("jmacd-544", context->trans == NULL); + + context->trans = &context->trans_in_ctx; + + /* FIXME_LATER_JMACD Currently there's no way to begin a TXN_READ_FUSING + transcrash. Default should be TXN_WRITE_FUSING. Also, the _trans variable is + stack allocated right now, but we would like to allow for dynamically allocated + transcrashes that span multiple system calls. + */ + txnh_init(context->trans, TXN_WRITE_FUSING); +} + +/* Finish a transaction handle context. */ +reiser4_internal long +txn_end(reiser4_context * context) +{ + long ret = 0; + txn_handle *txnh; + + assert("umka-283", context != NULL); + assert("nikita-3012", schedulable()); + + /* closing non top-level context---nothing to do */ + if (context != context->parent) + return 0; + + assert("nikita-2967", lock_stack_isclean(get_current_lock_stack())); + + txnh = context->trans; + + if (txnh != NULL) { + /* The txnh's field "atom" can be checked for NULL w/o holding a + lock because txnh->atom could be set by this thread's call to + try_capture or the deadlock prevention code in + fuse_not_fused_lock_owners(). But that code may assign an + atom to this transaction handle only if there are locked and + not yet fused nodes. It cannot happen because lock stack + should be clean at this moment. */ + if (txnh->atom != NULL) + ret = commit_txnh(txnh); + + assert("jmacd-633", txnh_isclean(txnh)); + + context->trans = NULL; + } + + return ret; +} + +reiser4_internal void +txn_restart(reiser4_context * context) +{ + txn_end(context); + preempt_point(); + txn_begin(context); +} + +reiser4_internal void +txn_restart_current(void) +{ + txn_restart(get_current_context()); +} + +/* TXN_ATOM */ + +/* Get the atom belonging to a txnh, which is not locked. Return txnh locked. Locks atom, if atom + is not NULL. 
This performs the necessary spin_trylock to break the lock-ordering cycle. May + return NULL. */ +static txn_atom * +txnh_get_atom(txn_handle * txnh) +{ + txn_atom *atom; + + assert("umka-180", txnh != NULL); + assert("jmacd-5108", spin_txnh_is_not_locked(txnh)); + + while (1) { + LOCK_TXNH(txnh); + atom = txnh->atom; + + if (atom == NULL) + break; + + if (spin_trylock_atom(atom)) + break; + + atomic_inc(&atom->refcount); + + UNLOCK_TXNH(txnh); + LOCK_ATOM(atom); + LOCK_TXNH(txnh); + + if (txnh->atom == atom) { + atomic_dec(&atom->refcount); + break; + } + + UNLOCK_TXNH(txnh); + atom_dec_and_unlock(atom); + } + + return atom; +} + +/* Get the current atom and spinlock it if current atom present. May return NULL */ +reiser4_internal txn_atom * +get_current_atom_locked_nocheck(void) +{ + reiser4_context *cx; + txn_atom *atom; + txn_handle *txnh; + + cx = get_current_context(); + assert("zam-437", cx != NULL); + + txnh = cx->trans; + assert("zam-435", txnh != NULL); + + atom = txnh_get_atom(txnh); + + UNLOCK_TXNH(txnh); + return atom; +} + +/* Get the atom belonging to a jnode, which is initially locked. Return with + both jnode and atom locked. This performs the necessary spin_trylock to + break the lock-ordering cycle. Assumes the jnode is already locked, and + returns NULL if atom is not set. */ +reiser4_internal txn_atom * +jnode_get_atom(jnode * node) +{ + txn_atom *atom; + + assert("umka-181", node != NULL); + + while (1) { + assert("jmacd-5108", spin_jnode_is_locked(node)); + + atom = node->atom; + /* node is not in any atom */ + if (atom == NULL) + break; + + /* If atom is not locked, grab the lock and return */ + if (spin_trylock_atom(atom)) + break; + + /* At least one jnode belongs to this atom it guarantees that + * atom->refcount > 0, we can safely increment refcount. 
*/ + atomic_inc(&atom->refcount); + UNLOCK_JNODE(node); + + /* re-acquire spin locks in the right order */ + LOCK_ATOM(atom); + LOCK_JNODE(node); + + /* check if node still points to the same atom. */ + if (node->atom == atom) { + atomic_dec(&atom->refcount); + break; + } + + /* releasing of atom lock and reference requires not holding + * locks on jnodes. */ + UNLOCK_JNODE(node); + + /* We are not sure that this atom has any references other than + * ours, so we must call the proper function, which may free the + * atom if the last reference is released. */ + atom_dec_and_unlock(atom); + + /* lock jnode again for getting valid node->atom pointer + * value. */ + LOCK_JNODE(node); + } + + return atom; +} + +/* Returns true if @node is dirty and part of the same atom as one of its neighbors. Used + by flush code to indicate whether the next node (in some direction) is suitable for + flushing. */ +reiser4_internal int +same_slum_check(jnode * node, jnode * check, int alloc_check, int alloc_value) +{ + int compat; + txn_atom *atom; + + assert("umka-182", node != NULL); + assert("umka-183", check != NULL); + + /* Not sure what this function is supposed to do if supplied with @check that is + neither formatted nor unformatted (bitmap or so). */ + assert("nikita-2373", jnode_is_znode(check) || jnode_is_unformatted(check)); + + /* Need a lock on CHECK to get its atom and to check various state bits. + Don't need a lock on NODE once we get the atom lock.
*/ + /* It is not enough to lock two nodes and check (node->atom == + check->atom) because atom could be locked and being fused at that + moment, jnodes of the atom of that state (being fused) can point to + different objects, but the atom is the same.*/ + LOCK_JNODE(check); + + atom = jnode_get_atom(check); + + if (atom == NULL) { + compat = 0; + } else { + compat = (node->atom == atom && jnode_is_dirty(check)); + + if (compat && jnode_is_znode(check)) { + compat &= znode_is_connected(JZNODE(check)); + } + + if (compat && alloc_check) { + compat &= (alloc_value == jnode_is_flushprepped(check)); + } + + UNLOCK_ATOM(atom); + } + + UNLOCK_JNODE(check); + + return compat; +} + +/* Decrement the atom's reference count and if it falls to zero, free it. */ +reiser4_internal void +atom_dec_and_unlock(txn_atom * atom) +{ + txn_mgr *mgr = &get_super_private(reiser4_get_current_sb())->tmgr; + + assert("umka-186", atom != NULL); + assert("jmacd-1071", spin_atom_is_locked(atom)); + assert("zam-1039", atomic_read(&atom->refcount) > 0); + + if (atomic_dec_and_test(&atom->refcount)) { + /* take txnmgr lock and atom lock in proper order. */ + if (!spin_trylock_txnmgr(mgr)) { + /* This atom should exist after we re-acquire its + * spinlock, so we increment its reference counter. */ + atomic_inc(&atom->refcount); + UNLOCK_ATOM(atom); + spin_lock_txnmgr(mgr); + LOCK_ATOM(atom); + + if (!atomic_dec_and_test(&atom->refcount)) { + UNLOCK_ATOM(atom); + spin_unlock_txnmgr(mgr); + return; + } + } + assert("nikita-2656", spin_txnmgr_is_locked(mgr)); + atom_free(atom); + spin_unlock_txnmgr(mgr); + } else + UNLOCK_ATOM(atom); +} + +/* Return a new atom, locked. This adds the atom to the transaction manager's list and + sets its reference count to 1, an artificial reference which is kept until it + commits. We play strange games to avoid allocation under jnode & txnh spinlocks.*/ + +/* ZAM-FIXME-HANS: should we set node->atom and txnh->atom here also? 
*/ +/* ANSWER(ZAM): there are special functions, capture_assign_txnh_nolock() and + capture_assign_block_nolock(), they are called right after calling + atom_begin_and_lock(). It could be done here, but, for understandability, it + is better to keep those calls inside try_capture_block main routine where all + assignments are made. */ +static txn_atom * +atom_begin_andlock(txn_atom ** atom_alloc, jnode * node, txn_handle * txnh) +{ + txn_atom *atom; + txn_mgr *mgr; + + assert("jmacd-43228", spin_jnode_is_locked(node)); + assert("jmacd-43227", spin_txnh_is_locked(txnh)); + assert("jmacd-43226", node->atom == NULL); + assert("jmacd-43225", txnh->atom == NULL); + + if (REISER4_DEBUG && rofs_jnode(node)) { + warning("nikita-3366", "Creating atom on rofs"); + dump_stack(); + } + + /* A memory allocation may schedule we have to release those spinlocks + * before kmem_cache_alloc() call. */ + UNLOCK_JNODE(node); + UNLOCK_TXNH(txnh); + + if (*atom_alloc == NULL) { + (*atom_alloc) = kmem_cache_alloc(_atom_slab, GFP_KERNEL); + + if (*atom_alloc == NULL) + return ERR_PTR(RETERR(-ENOMEM)); + } + + /* and, also, txnmgr spin lock should be taken before jnode and txnh + locks. */ + mgr = &get_super_private(reiser4_get_current_sb())->tmgr; + spin_lock_txnmgr(mgr); + + LOCK_JNODE(node); + LOCK_TXNH(txnh); + + /* Check if both atom pointers are still NULL... */ + if (node->atom != NULL || txnh->atom != NULL) { + /* NOTE-NIKITA probably it is rather better to free + * atom_alloc here than thread it up to try_capture(). */ + + UNLOCK_TXNH(txnh); + UNLOCK_JNODE(node); + spin_unlock_txnmgr(mgr); + + return ERR_PTR(-E_REPEAT); + } + + atom = *atom_alloc; + *atom_alloc = NULL; + + atom_init(atom); + + assert("jmacd-17", atom_isclean(atom)); + + /* Take the atom and txnmgr lock. No checks for lock ordering, because + @atom is new and inaccessible for others. 
*/ + spin_lock_atom_no_ord(atom); + + atom_list_push_back(&mgr->atoms_list, atom); + atom->atom_id = mgr->id_count++; + mgr->atom_count += 1; + + /* Release txnmgr lock */ + spin_unlock_txnmgr(mgr); + + /* One reference until it commits. */ + atomic_inc(&atom->refcount); + + atom->stage = ASTAGE_CAPTURE_FUSE; + + return atom; +} + +#if REISER4_DEBUG +/* Return true if an atom is currently "open". */ +static int atom_isopen(const txn_atom * atom) +{ + assert("umka-185", atom != NULL); + + return atom->stage > 0 && atom->stage < ASTAGE_PRE_COMMIT; +} +#endif + +/* Return the number of pointers to this atom that must be updated during fusion. This + approximates the amount of work to be done. Fusion chooses the atom with fewer + pointers to fuse into the atom with more pointers. */ +static int +atom_pointer_count(const txn_atom * atom) +{ + assert("umka-187", atom != NULL); + + /* This is a measure of the amount of work needed to fuse this atom + * into another. */ + return atom->txnh_count + atom->capture_count; +} + +/* Called holding the atom lock, this removes the atom from the transaction manager list + and frees it. 
*/ +static void +atom_free(txn_atom * atom) +{ + txn_mgr *mgr = &get_super_private(reiser4_get_current_sb())->tmgr; + + assert("umka-188", atom != NULL); + assert("jmacd-18", spin_atom_is_locked(atom)); + + /* Remove from the txn_mgr's atom list */ + assert("nikita-2657", spin_txnmgr_is_locked(mgr)); + mgr->atom_count -= 1; + atom_list_remove_clean(atom); + + /* Clean the atom */ + assert("jmacd-16", (atom->stage == ASTAGE_INVALID || atom->stage == ASTAGE_DONE)); + atom->stage = ASTAGE_FREE; + + blocknr_set_destroy(&atom->delete_set); + blocknr_set_destroy(&atom->wandered_map); + + assert("jmacd-16", atom_isclean(atom)); + + UNLOCK_ATOM(atom); + + kmem_cache_free(_atom_slab, atom); +} + +static int +atom_is_dotard(const txn_atom * atom) +{ + return time_after(jiffies, atom->start_time + + get_current_super_private()->tmgr.atom_max_age); +} + +static int atom_can_be_committed (txn_atom * atom) +{ + assert ("zam-884", spin_atom_is_locked(atom)); + assert ("zam-885", atom->txnh_count > atom->nr_waiters); + return atom->txnh_count == atom->nr_waiters + 1; +} + +/* Return true if an atom should commit now. This is determined by aging, atom + size or atom flags. */ +static int +atom_should_commit(const txn_atom * atom) +{ + assert("umka-189", atom != NULL); + return + (atom->flags & ATOM_FORCE_COMMIT) || + ((unsigned) atom_pointer_count(atom) > get_current_super_private()->tmgr.atom_max_size) || + atom_is_dotard(atom); +} + +/* return 1 if current atom exists and requires commit. 
*/ +reiser4_internal int current_atom_should_commit(void) +{ + txn_atom * atom; + int result = 0; + + atom = get_current_atom_locked_nocheck(); + if (atom) { + result = atom_should_commit(atom); + UNLOCK_ATOM(atom); + } + return result; +} + +static int +atom_should_commit_asap(const txn_atom * atom) +{ + unsigned int captured; + unsigned int pinnedpages; + + assert("nikita-3309", atom != NULL); + + captured = (unsigned) atom->capture_count; + pinnedpages = (captured >> PAGE_CACHE_SHIFT) * sizeof(znode); + + return + (pinnedpages > (totalram_pages >> 3)) || + (atom->flushed > 100); +} + +static jnode * find_first_dirty_in_list (capture_list_head * head, int flags) +{ + jnode * first_dirty; + + for_all_type_safe_list(capture, head, first_dirty) { + if (!(flags & JNODE_FLUSH_COMMIT)) { + if ( + /* skip jnodes which have "heard banshee" */ + JF_ISSET(first_dirty, JNODE_HEARD_BANSHEE) || + /* and with active I/O */ + JF_ISSET(first_dirty, JNODE_WRITEBACK)) + continue; + } + return first_dirty; + } + return NULL; +} + +/* Get first dirty node from the atom's dirty_nodes[n] lists; return NULL if atom has no dirty + nodes on atom's lists */ +reiser4_internal jnode * find_first_dirty_jnode (txn_atom * atom, int flags) +{ + jnode *first_dirty; + tree_level level; + + assert("zam-753", spin_atom_is_locked(atom)); + + /* The flush starts from LEAF_LEVEL (=1). */ + for (level = 1; level < REAL_MAX_ZTREE_HEIGHT + 1; level += 1) { + if (capture_list_empty(ATOM_DIRTY_LIST(atom, level))) + continue; + + first_dirty = find_first_dirty_in_list(ATOM_DIRTY_LIST(atom, level), flags); + if (first_dirty) + return first_dirty; + } + + /* znode-above-root is on the list #0. */ + return find_first_dirty_in_list(ATOM_DIRTY_LIST(atom, 0), flags); +} + +#if REISER4_COPY_ON_CAPTURE + +/* this spin lock is used to prevent races during steal on capture. 
+ FIXME: should be per filesystem or even per atom */ +spinlock_t scan_lock = SPIN_LOCK_UNLOCKED; + +/* Scan atom->writeback_nodes list and dispatch jnodes according to their state: + * move dirty and !writeback jnodes to @fq, clean jnodes to atom's clean + * list. */ +/* NOTE: doing that in end IO handler requires using of special spinlocks which + * disables interrupts in all places except IO handler. That is expensive. */ +static void dispatch_wb_list (txn_atom * atom, flush_queue_t * fq) +{ + jnode * cur; + int total, moved; + + assert("zam-905", spin_atom_is_locked(atom)); + + total = 0; + moved = 0; + + spin_lock(&scan_lock); + cur = capture_list_front(ATOM_WB_LIST(atom)); + while (!capture_list_end(ATOM_WB_LIST(atom), cur)) { + jnode * next; + + total ++; + JF_SET(cur, JNODE_SCANNED); + next = capture_list_next(cur); + if (!capture_list_end(ATOM_WB_LIST(atom), next)) + JF_SET(next, JNODE_SCANNED); + + spin_unlock(&scan_lock); + + LOCK_JNODE(cur); + assert("vs-1441", NODE_LIST(cur) == WB_LIST); + if (!JF_ISSET(cur, JNODE_WRITEBACK)) { + moved ++; + if (JF_ISSET(cur, JNODE_DIRTY)) { + queue_jnode(fq, cur); + } else { + /* move from writeback list to clean list */ + capture_list_remove(cur); + capture_list_push_back(ATOM_CLEAN_LIST(atom), cur); + ON_DEBUG(count_jnode(atom, cur, WB_LIST, CLEAN_LIST, 1)); + } + } + UNLOCK_JNODE(cur); + + spin_lock(&scan_lock); + JF_CLR(cur, JNODE_SCANNED); + cur = next; + assert("vs-1450", ergo(!capture_list_end(ATOM_WB_LIST(atom), cur), + JF_ISSET(cur, JNODE_SCANNED) && NODE_LIST(cur) == WB_LIST)); + } + spin_unlock(&scan_lock); +} + +#else + +static void dispatch_wb_list (txn_atom * atom, flush_queue_t * fq) +{ + jnode * cur; + + assert("zam-905", atom_is_protected(atom)); + + cur = capture_list_front(ATOM_WB_LIST(atom)); + while (!capture_list_end(ATOM_WB_LIST(atom), cur)) { + jnode * next = capture_list_next(cur); + + LOCK_JNODE(cur); + if (!JF_ISSET(cur, JNODE_WRITEBACK)) { + if (JF_ISSET(cur, JNODE_DIRTY)) { + 
queue_jnode(fq, cur); + } else { + capture_list_remove(cur); + capture_list_push_back(ATOM_CLEAN_LIST(atom), cur); + } + } + UNLOCK_JNODE(cur); + + cur = next; + } +} + +#endif + +/* Scan current atom->writeback_nodes list, re-submit dirty and !writeback + * jnodes to disk. */ +static int submit_wb_list (void) +{ + int ret; + flush_queue_t * fq; + + fq = get_fq_for_current_atom(); + if (IS_ERR(fq)) + return PTR_ERR(fq); + + dispatch_wb_list(fq->atom, fq); + UNLOCK_ATOM(fq->atom); + + ret = write_fq(fq, NULL, 1); + fq_put(fq); + + return ret; +} + +/* Wait completion of all writes, re-submit atom writeback list if needed. */ +static int current_atom_complete_writes (void) +{ + int ret; + + /* Each jnode from that list was modified and dirtied when it had i/o + * request running already. After i/o completion we have to resubmit + * them to disk again.*/ + ret = submit_wb_list(); + if (ret < 0) + return ret; + + /* Wait all i/o completion */ + ret = current_atom_finish_all_fq(); + if (ret) + return ret; + + /* Scan wb list again; all i/o should be completed, we re-submit dirty + * nodes to disk */ + ret = submit_wb_list(); + if(ret < 0) + return ret; + + /* Wait all nodes we just submitted */ + return current_atom_finish_all_fq(); +} + +#define TOOMANYFLUSHES (1 << 13) + +/* Called with the atom locked and no open "active" transaction handlers except + ours, this function calls flush_current_atom() until all dirty nodes are + processed. Then it initiates commit processing. + + Called by the single remaining open "active" txnh, which is closing. Other + open txnhs belong to processes which wait atom commit in commit_txnh() + routine. They are counted as "waiters" in atom->nr_waiters. Therefore as + long as we hold the atom lock none of the jnodes can be captured and/or + locked. + + Return value is an error code if commit fails. 
+*/ +static int commit_current_atom (long *nr_submitted, txn_atom ** atom) +{ + reiser4_super_info_data * sbinfo = get_current_super_private (); + long ret; + /* how many times jnode_flush() was called as a part of attempt to + * commit this atom. */ + int flushiters; + + assert ("zam-888", atom != NULL && *atom != NULL); + assert ("zam-886", spin_atom_is_locked(*atom)); + assert ("zam-887", get_current_context()->trans->atom == *atom); + assert("jmacd-151", atom_isopen(*atom)); + + /* lock ordering: delete_sema and commit_sema are unordered */ + assert("nikita-3184", + get_current_super_private()->delete_sema_owner != current); + + for (flushiters = 0 ;; ++ flushiters) { + ret = flush_current_atom(JNODE_FLUSH_WRITE_BLOCKS | JNODE_FLUSH_COMMIT, nr_submitted, atom); + if (ret != -E_REPEAT) + break; + + /* if atom's dirty list contains one znode which is + HEARD_BANSHEE and is locked we have to allow lock owner to + continue and uncapture that znode */ + preempt_point(); + + *atom = get_current_atom_locked(); + if (flushiters > TOOMANYFLUSHES && IS_POW(flushiters)) { + warning("nikita-3176", + "Flushing like mad: %i", flushiters); + info_atom("atom", *atom); + DEBUGON(flushiters > (1 << 20)); + } + } + + if (ret) + return ret; + + assert ("zam-882", spin_atom_is_locked(*atom)); + + if (!atom_can_be_committed(*atom)) { + UNLOCK_ATOM(*atom); + return RETERR(-E_REPEAT); + } + + /* Up to this point we have been flushing and after flush is called we + return -E_REPEAT. Now we can commit. We cannot return -E_REPEAT + at this point, commit should be successful. 
*/ + atom_set_stage(*atom, ASTAGE_PRE_COMMIT); + ON_DEBUG(((*atom)->committer = current)); + + UNLOCK_ATOM(*atom); + + ret = current_atom_complete_writes(); + if (ret) + return ret; + + assert ("zam-906", capture_list_empty(ATOM_WB_LIST(*atom))); + + /* isolate critical code path which should be executed by only one + * thread using tmgr semaphore */ + down(&sbinfo->tmgr.commit_semaphore); + + ret = reiser4_write_logs(nr_submitted); + if (ret < 0) + reiser4_panic("zam-597", "write log failed (%ld)\n", ret); + + /* The atom->ovrwr_nodes list is processed under commit semaphore held + because of bitmap nodes which are captured by special way in + bitmap_pre_commit_hook(), that way does not include + capture_fuse_wait() as a capturing of other nodes does -- the commit + semaphore is used for transaction isolation instead. */ + invalidate_list(ATOM_OVRWR_LIST(*atom)); + up(&sbinfo->tmgr.commit_semaphore); + + invalidate_list(ATOM_CLEAN_LIST(*atom)); + invalidate_list(ATOM_WB_LIST(*atom)); + assert("zam-927", capture_list_empty(&(*atom)->inodes)); + + LOCK_ATOM(*atom); + atom_set_stage(*atom, ASTAGE_DONE); + ON_DEBUG((*atom)->committer = 0); + + /* Atom's state changes, so wake up everybody waiting for this + event. */ + wakeup_atom_waiting_list(*atom); + + /* Decrement the "until commit" reference, at least one txnh (the caller) is + still open. */ + atomic_dec(&(*atom)->refcount); + + assert("jmacd-1070", atomic_read(&(*atom)->refcount) > 0); + assert("jmacd-1062", (*atom)->capture_count == 0); + BUG_ON((*atom)->capture_count != 0); + assert("jmacd-1071", spin_atom_is_locked(*atom)); + + return ret; +} + +/* TXN_TXNH */ + +/* commit current atom and wait commit completion; atom and txn_handle should be + * locked before call, this function unlocks them on exit. 
*/ +static int force_commit_atom_nolock (txn_handle * txnh) +{ + txn_atom * atom; + + assert ("zam-837", txnh != NULL); + assert ("zam-835", spin_txnh_is_locked(txnh)); + assert ("nikita-2966", lock_stack_isclean(get_current_lock_stack())); + + atom = txnh->atom; + + assert ("zam-834", atom != NULL); + assert ("zam-836", spin_atom_is_locked(atom)); + + /* Set flags for atom and txnh: forcing atom commit and waiting for + * commit completion */ + txnh->flags |= TXNH_WAIT_COMMIT; + atom->flags |= ATOM_FORCE_COMMIT; + + UNLOCK_TXNH(txnh); + UNLOCK_ATOM(atom); + + txn_restart_current(); + return 0; +} + +/* Called to force commit of any outstanding atoms. @commit_all_atoms controls + * whether we commit all atoms, including new ones which are created after this + * function is called. */ +reiser4_internal int +txnmgr_force_commit_all (struct super_block *super, int commit_all_atoms) +{ + int ret; + txn_atom *atom; + txn_mgr *mgr; + txn_handle *txnh; + unsigned long start_time = jiffies; + reiser4_context * ctx = get_current_context(); + + assert("nikita-2965", lock_stack_isclean(get_current_lock_stack())); + assert("nikita-3058", commit_check_locks()); + + txn_restart(ctx); + + mgr = &get_super_private(super)->tmgr; + + txnh = ctx->trans; + +again: + + spin_lock_txnmgr(mgr); + + for_all_type_safe_list(atom, &mgr->atoms_list, atom) { + LOCK_ATOM(atom); + + /* Commit any atom which can be committed. If @commit_all_atoms + * is not set we commit only atoms which were created before + * this call started.
*/ + if (commit_all_atoms || time_before_eq(atom->start_time, start_time)) { + if (atom->stage <= ASTAGE_POST_COMMIT) { + spin_unlock_txnmgr(mgr); + + if (atom->stage < ASTAGE_PRE_COMMIT) { + LOCK_TXNH(txnh); + /* Add force-context txnh */ + capture_assign_txnh_nolock(atom, txnh); + ret = force_commit_atom_nolock(txnh); + if(ret) + return ret; + } else + /* wait atom commit */ + atom_wait_event(atom); + + goto again; + } + } + + UNLOCK_ATOM(atom); + } + +#if REISER4_DEBUG + if (commit_all_atoms) { + reiser4_super_info_data * sbinfo = get_super_private(super); + reiser4_spin_lock_sb(sbinfo); + assert("zam-813", sbinfo->blocks_fake_allocated_unformatted == 0); + assert("zam-812", sbinfo->blocks_fake_allocated == 0); + reiser4_spin_unlock_sb(sbinfo); + } +#endif + + spin_unlock_txnmgr(mgr); + + return 0; +} + +/* check whether commit_some_atoms() can commit @atom. Locking is up to the + * caller */ +static int atom_is_committable(txn_atom *atom) +{ + return + atom->stage < ASTAGE_PRE_COMMIT && + atom->txnh_count == atom->nr_waiters && + atom_should_commit(atom); +} + +/* called periodically from ktxnmgrd to commit old atoms. 
Releases ktxnmgrd spin + * lock at exit */ +reiser4_internal int +commit_some_atoms(txn_mgr * mgr) +{ + int ret = 0; + txn_atom *atom; + txn_atom *next_atom; + txn_handle *txnh; + reiser4_context *ctx; + + ctx = get_current_context(); + assert("nikita-2444", ctx != NULL); + + txnh = ctx->trans; + spin_lock_txnmgr(mgr); + + /* look for atom to commit */ + for_all_type_safe_list_safe(atom, &mgr->atoms_list, atom, next_atom) { + /* first test without taking atom spin lock, whether it is + * eligible for committing at all */ + if (atom_is_committable(atom)) { + /* now, take spin lock and re-check */ + LOCK_ATOM(atom); + if (atom_is_committable(atom)) + break; + UNLOCK_ATOM(atom); + } + } + + ret = atom_list_end(&mgr->atoms_list, atom); + spin_unlock_txnmgr(mgr); + + if (ret) { + /* nothing found */ + spin_unlock(&mgr->daemon->guard); + return 0; + } + + LOCK_TXNH(txnh); + + /* Set the atom to force committing */ + atom->flags |= ATOM_FORCE_COMMIT; + + /* Add force-context txnh */ + capture_assign_txnh_nolock(atom, txnh); + + UNLOCK_TXNH(txnh); + UNLOCK_ATOM(atom); + + /* we are about to release daemon spin lock, notify daemon it + has to rescan atoms */ + mgr->daemon->rescan = 1; + spin_unlock(&mgr->daemon->guard); + txn_restart(ctx); + return 0; +} + +/* Calls jnode_flush for current atom if it exists; if not, just take another + atom and call jnode_flush() for him. If current transaction handle has + already assigned atom (current atom) we have to close current transaction + prior to switch to another atom or do something with current atom. This + code tries to flush current atom. + + flush_some_atom() is called as part of memory clearing process. It is + invoked from balance_dirty_pages(), pdflushd, and entd. + + If we can flush no nodes, atom is committed, because this frees memory. + + If atom is too large or too old it is committed also. 
+*/ +reiser4_internal int +flush_some_atom(long *nr_submitted, const struct writeback_control *wbc, int flags) +{ + reiser4_context *ctx = get_current_context(); + txn_handle *txnh = ctx->trans; + txn_atom *atom; + int ret; + int ret1; + + assert("zam-1042", txnh != NULL); + repeat: + if (txnh->atom == NULL) { + /* current atom is not available, take the first one from txnmgr */ + txn_mgr *tmgr = &get_super_private(ctx->super)->tmgr; + + spin_lock_txnmgr(tmgr); + + /* traverse the list of all atoms */ + for_all_type_safe_list(atom, &tmgr->atoms_list, atom) { + /* lock atom before checking its state */ + LOCK_ATOM(atom); + + /* we need an atom which is not being committed and which has no + * flushers (jnode_flush() adds one flusher at the beginning and + * subtracts one at the end). */ + if (atom->stage < ASTAGE_PRE_COMMIT && atom->nr_flushers == 0) { + LOCK_TXNH(txnh); + capture_assign_txnh_nolock(atom, txnh); + UNLOCK_TXNH(txnh); + + goto found; + } + + UNLOCK_ATOM(atom); + } + + /* Write throttling is the case when no atom can be + * flushed/committed. */ + if (!current_is_pdflush() && !wbc->nonblocking) { + for_all_type_safe_list(atom, &tmgr->atoms_list, atom) { + LOCK_ATOM(atom); + /* Repeat the check from above.
*/ + if (atom->stage < ASTAGE_PRE_COMMIT && atom->nr_flushers == 0) { + LOCK_TXNH(txnh); + capture_assign_txnh_nolock(atom, txnh); + UNLOCK_TXNH(txnh); + + goto found; + } + if (atom->stage <= ASTAGE_POST_COMMIT) { + spin_unlock_txnmgr(tmgr); + /* we just wait until atom's flusher + makes a progress in flushing or + committing the atom */ + atom_wait_event(atom); + goto repeat; + } + UNLOCK_ATOM(atom); + } + } + spin_unlock_txnmgr(tmgr); + return 0; + found: + spin_unlock_txnmgr(tmgr); + } else + atom = get_current_atom_locked(); + + ret = flush_current_atom(flags, nr_submitted, &atom); + if (ret == 0) { + if (*nr_submitted == 0 || atom_should_commit_asap(atom)) { + /* if early flushing could not make more nodes clean, + * or atom is too old/large, + * we force current atom to commit */ + /* wait for commit completion but only if this + * wouldn't stall pdflushd and ent thread. */ + if (!wbc->nonblocking && !ctx->entd) + txnh->flags |= TXNH_WAIT_COMMIT; + atom->flags |= ATOM_FORCE_COMMIT; + } + UNLOCK_ATOM(atom); + } else if (ret == -E_REPEAT) { + if (*nr_submitted == 0) + goto repeat; + ret = 0; + } + + ret1 = txn_end(ctx); + assert("vs-1692", ret1 == 0); + if (ret1 > 0) + *nr_submitted += ret1; + txn_begin(ctx); + + return ret; +} + +#if REISER4_COPY_ON_CAPTURE + +/* Remove processed nodes from atom's clean list (thereby remove them from transaction). 
*/ +void +invalidate_list(capture_list_head * head) +{ + txn_atom *atom; + + spin_lock(&scan_lock); + while (!capture_list_empty(head)) { + jnode *node; + + node = capture_list_front(head); + JF_SET(node, JNODE_SCANNED); + spin_unlock(&scan_lock); + + atom = node->atom; + LOCK_ATOM(atom); + LOCK_JNODE(node); + if (JF_ISSET(node, JNODE_CC) && node->pg) + page_cache_release(node->pg); + uncapture_block(node); + UNLOCK_ATOM(atom); + JF_CLR(node, JNODE_SCANNED); + jput(node); + + spin_lock(&scan_lock); + } + spin_unlock(&scan_lock); +} + +#else + +/* Remove processed nodes from atom's clean list (thereby remove them from transaction). */ +void +invalidate_list(capture_list_head * head) +{ + while (!capture_list_empty(head)) { + jnode *node; + + node = capture_list_front(head); + LOCK_JNODE(node); + uncapture_block(node); + jput(node); + } +} + +#endif + +static void +init_wlinks(txn_wait_links * wlinks) +{ + wlinks->_lock_stack = get_current_lock_stack(); + fwaitfor_list_clean(wlinks); + fwaiting_list_clean(wlinks); + wlinks->waitfor_cb = NULL; + wlinks->waiting_cb = NULL; +} + +/* Add atom to the atom's waitfor list and wait for somebody to wake us up; */ +reiser4_internal void atom_wait_event(txn_atom * atom) +{ + txn_wait_links _wlinks; + + assert("zam-744", spin_atom_is_locked(atom)); + assert("nikita-3156", + lock_stack_isclean(get_current_lock_stack()) || + atom->nr_running_queues > 0); + + init_wlinks(&_wlinks); + fwaitfor_list_push_back(&atom->fwaitfor_list, &_wlinks); + atomic_inc(&atom->refcount); + UNLOCK_ATOM(atom); + + prepare_to_sleep(_wlinks._lock_stack); + go_to_sleep(_wlinks._lock_stack); + + LOCK_ATOM (atom); + fwaitfor_list_remove(&_wlinks); + atom_dec_and_unlock (atom); +} + +reiser4_internal void +atom_set_stage(txn_atom *atom, txn_stage stage) +{ + assert("nikita-3535", atom != NULL); + assert("nikita-3538", spin_atom_is_locked(atom)); + assert("nikita-3536", ASTAGE_FREE <= stage && stage <= ASTAGE_INVALID); + /* Excelsior! 
*/ + assert("nikita-3537", stage >= atom->stage); + if (atom->stage != stage) { + atom->stage = stage; + atom_send_event(atom); + } +} + +/* wake all threads which wait for an event */ +reiser4_internal void +atom_send_event(txn_atom * atom) +{ + assert("zam-745", spin_atom_is_locked(atom)); + wakeup_atom_waitfor_list(atom); +} + +/* Informs txn manager code that owner of this txn_handle should wait for atom commit completion (for + example, because it does fsync(2)) */ +static int +should_wait_commit(txn_handle * h) +{ + return h->flags & TXNH_WAIT_COMMIT; +} + +typedef struct commit_data { + txn_atom *atom; + txn_handle *txnh; + long nr_written; + /* as an optimization we start committing atom by first trying to + * flush it a few times without switching into ASTAGE_CAPTURE_WAIT. This + * allows to reduce stalls due to other threads waiting for atom in + * ASTAGE_CAPTURE_WAIT stage. ->preflush is counter of these + * preliminary flushes. */ + int preflush; + /* have we waited on atom. */ + int wait; + int failed; + int wake_ktxnmgrd_up; +} commit_data; + +/* + * Called from commit_txnh() repeatedly, until either error happens, or atom + * commits successfully. + */ +static int +try_commit_txnh(commit_data *cd) +{ + int result; + + assert("nikita-2968", lock_stack_isclean(get_current_lock_stack())); + + /* Get the atom and txnh locked. */ + cd->atom = txnh_get_atom(cd->txnh); + assert("jmacd-309", cd->atom != NULL); + UNLOCK_TXNH(cd->txnh); + + if (cd->wait) { + cd->atom->nr_waiters --; + cd->wait = 0; + } + + if (cd->atom->stage == ASTAGE_DONE) + return 0; + + if (cd->failed) + return 0; + + if (atom_should_commit(cd->atom)) { + /* if atom is _very_ large schedule it for commit as soon as + * possible. */ + if (atom_should_commit_asap(cd->atom)) { + /* + * When atom is in PRE_COMMIT or later stage following + * invariant (encoded in atom_can_be_committed()) + * holds: there is exactly one non-waiter transaction + * handle opened on this atom.
When thread wants to + * wait until atom commits (for example sync()) it + * waits on atom event after increasing + * atom->nr_waiters (see below in this function). It + * cannot be guaranteed that atom is already committed + * after receiving event, so loop has to be + * re-started. But if atom switched into PRE_COMMIT + * stage and became too large, we cannot change its + * state back to CAPTURE_WAIT (atom stage can only + * increase monotonically), hence this check. + */ + if (cd->atom->stage < ASTAGE_CAPTURE_WAIT) + atom_set_stage(cd->atom, ASTAGE_CAPTURE_WAIT); + cd->atom->flags |= ATOM_FORCE_COMMIT; + } + if (cd->txnh->flags & TXNH_DONT_COMMIT) { + /* + * this thread (transaction handle that is) doesn't + * want to commit atom. Notify waiters that handle is + * closed. This can happen, for example, when we are + * under VFS directory lock and don't want to commit + * atom right now to avoid stalling other threads + * working in the same directory. + */ + + /* Wake the ktxnmgrd up if the ktxnmgrd is needed to + * commit this atom: no atom waiters and only one + * (our) open transaction handle. */ + cd->wake_ktxnmgrd_up = + cd->atom->txnh_count == 1 && + cd->atom->nr_waiters == 0; + atom_send_event(cd->atom); + result = 0; + } else if (!atom_can_be_committed(cd->atom)) { + if (should_wait_commit(cd->txnh)) { + /* sync(): wait for commit */ + cd->atom->nr_waiters++; + cd->wait = 1; + atom_wait_event(cd->atom); + result = RETERR(-E_REPEAT); + } else { + result = 0; + } + } else if (cd->preflush > 0 && !is_current_ktxnmgrd()) { + /* + * optimization: flush atom without switching it into + * ASTAGE_CAPTURE_WAIT. + * + * But don't do this for ktxnmgrd, because ktxnmgrd + * should never block on atom fusion. + */ + result = flush_current_atom(JNODE_FLUSH_WRITE_BLOCKS, + &cd->nr_written, &cd->atom); + if (result == 0) { + UNLOCK_ATOM(cd->atom); + cd->preflush = 0; + result = RETERR(-E_REPEAT); + } else /* Atom wasn't flushed + * completely. Rinse. Repeat.
*/ + -- cd->preflush; + } else { + /* We change atom state to ASTAGE_CAPTURE_WAIT to + prevent atom fusion and count ourself as an active + flusher */ + atom_set_stage(cd->atom, ASTAGE_CAPTURE_WAIT); + cd->atom->flags |= ATOM_FORCE_COMMIT; + + result = commit_current_atom(&cd->nr_written, &cd->atom); + if (result != 0 && result != -E_REPEAT) + cd->failed = 1; + } + } else + result = 0; + + assert("jmacd-1027", ergo(result == 0, spin_atom_is_locked(cd->atom))); + /* perfectly valid assertion, except that when atom/txnh is not locked + * fusion can take place, and cd->atom points nowhere. */ + /* + assert("jmacd-1028", ergo(result != 0, spin_atom_is_not_locked(cd->atom))); + */ + return result; +} + +/* Called to commit a transaction handle. This decrements the atom's number of open + handles and if it is the last handle to commit and the atom should commit, initiates + atom commit. if commit does not fail, return number of written blocks */ +static long +commit_txnh(txn_handle * txnh) +{ + commit_data cd; + assert("umka-192", txnh != NULL); + + memset(&cd, 0, sizeof cd); + cd.txnh = txnh; + cd.preflush = 10; + + /* calls try_commit_txnh() until either atom commits, or error + * happens */ + while (try_commit_txnh(&cd) != 0) + preempt_point(); + + assert("nikita-3171", spin_txnh_is_not_locked(txnh)); + LOCK_TXNH(txnh); + + cd.atom->txnh_count -= 1; + txnh->atom = NULL; + + txnh_list_remove(txnh); + + UNLOCK_TXNH(txnh); + atom_dec_and_unlock(cd.atom); + /* if we don't want to do a commit (TXNH_DONT_COMMIT is set, probably + * because it takes time) by current thread, we do that work + * asynchronously by ktxnmgrd daemon. */ + if (cd.wake_ktxnmgrd_up) + ktxnmgrd_kick(&get_current_super_private()->tmgr); + + return 0; +} + +/* TRY_CAPTURE */ + +/* This routine attempts a single block-capture request. 
It may return -E_REPEAT if some + condition indicates that the request should be retried, and it may block if the + txn_capture mode does not include the TXN_CAPTURE_NONBLOCKING request flag. + + This routine encodes the basic logic of block capturing described by: + + http://namesys.com/v4/v4.html + + Our goal here is to ensure that any two blocks that contain dependent modifications + should commit at the same time. This function enforces this discipline by initiating + fusion whenever a transaction handle belonging to one atom requests to read or write a + block belonging to another atom (TXN_CAPTURE_WRITE or TXN_CAPTURE_READ_ATOMIC). + + In addition, this routine handles the initial assignment of atoms to blocks and + transaction handles. These are possible outcomes of this function: + + 1. The block and handle are already part of the same atom: return immediate success + + 2. The block is assigned but the handle is not: call capture_assign_txnh to assign + the handle to the block's atom. + + 3. The handle is assigned but the block is not: call capture_assign_block to assign + the block to the handle's atom. + + 4. Both handle and block are assigned, but to different atoms: call capture_init_fusion + to fuse atoms. + + 5. Neither block nor handle is assigned: create a new atom and assign them both. + + 6. A read request for a non-captured block: return immediate success. + + This function acquires and releases the handle's spinlock. This function is called + under the jnode lock and if the return value is 0, it returns with the jnode lock still + held. If the return is -E_REPEAT or some other error condition, the jnode lock is + released. The external interface (try_capture) manages re-acquiring the jnode lock + in the failure case.
+*/ + +static int +try_capture_block(txn_handle * txnh, jnode * node, txn_capture mode, txn_atom ** atom_alloc, int can_coc) +{ + int ret; + txn_atom *block_atom; + txn_atom *txnh_atom; + + /* Should not call capture for READ_NONCOM requests, handled in try_capture. */ + assert("jmacd-567", CAPTURE_TYPE(mode) != TXN_CAPTURE_READ_NONCOM); + + /* FIXME-ZAM-HANS: FIXME_LATER_JMACD Should assert that atom->tree == node->tree somewhere. */ + + assert("umka-194", txnh != NULL); + assert("umka-195", node != NULL); + + /* The jnode is already locked! Being called from try_capture(). */ + assert("jmacd-567", spin_jnode_is_locked(node)); + + block_atom = node->atom; + + /* Get txnh spinlock, this allows us to compare txn_atom pointers but it doesn't + let us touch the atoms themselves. */ + LOCK_TXNH(txnh); + + txnh_atom = txnh->atom; + + if (txnh_atom != NULL && block_atom == txnh_atom) { + UNLOCK_TXNH(txnh); + return 0; + } + /* NIKITA-HANS: nothing */ + if (txnh_atom != NULL) { + /* It is time to perform deadlock prevention check over the + node we want to capture. It is possible this node was + locked for read without capturing it. The optimization + which allows to do it helps us in keeping atoms independent + as long as possible but it may cause lock/fuse deadlock + problems. + + A number of similar deadlock situations with locked but not + captured nodes were found. In each situation there are two + or more threads: one of them does flushing while another + one does routine balancing or tree lookup. The flushing + thread (F) sleeps in long term locking request for node + (N), another thread (A) sleeps in trying to capture some + node already belonging the atom F, F has a state which + prevents immediately fusion . + + Deadlocks of this kind cannot happen if node N was properly + captured by thread A. The F thread fuse atoms before + locking therefore current atom of thread F and current atom + of thread A became the same atom and thread A may proceed. 
+ This does not work if node N was not captured because the + fusion of atom does not happens. + + The following scheme solves the deadlock: If + longterm_lock_znode locks and does not capture a znode, + that znode is marked as MISSED_IN_CAPTURE. A node marked + this way is processed by the code below which restores the + missed capture and fuses current atoms of all the node lock + owners by calling the fuse_not_fused_lock_owners() + function. + */ + + if ( // txnh_atom->stage >= ASTAGE_CAPTURE_WAIT && + jnode_is_znode(node) && znode_is_locked(JZNODE(node)) + && JF_ISSET(node, JNODE_MISSED_IN_CAPTURE)) { + JF_CLR(node, JNODE_MISSED_IN_CAPTURE); + + ret = fuse_not_fused_lock_owners(txnh, JZNODE(node)); + + if (ret) { + JF_SET(node, JNODE_MISSED_IN_CAPTURE); + + assert("zam-687", spin_txnh_is_not_locked(txnh)); + assert("zam-688", spin_jnode_is_not_locked(node)); + + return ret; + } + + assert("zam-701", spin_txnh_is_locked(txnh)); + assert("zam-702", spin_jnode_is_locked(node)); + } + } + + if (block_atom != NULL) { + /* The block has already been assigned to an atom. */ + + /* case (block_atom == txnh_atom) is already handled above */ + if (txnh_atom == NULL) { + + /* The txnh is unassigned, try to assign it. */ + ret = capture_assign_txnh(node, txnh, mode, can_coc); + if (ret != 0) { + /* E_REPEAT or otherwise */ + assert("jmacd-6129", spin_txnh_is_not_locked(txnh)); + assert("jmacd-6130", spin_jnode_is_not_locked(node)); + return ret; + } + + /* Either the txnh is now assigned to the block's atom or the read-request was + granted because the block is committing. Locks still held. */ + } else { + if (mode & TXN_CAPTURE_DONT_FUSE) { + UNLOCK_TXNH(txnh); + UNLOCK_JNODE(node); + /* we are in a "no-fusion" mode and @node is + * already part of transaction. */ + return RETERR(-E_NO_NEIGHBOR); + } + /* In this case, both txnh and node belong to different atoms. This function + returns -E_REPEAT on successful fusion, 0 on the fall-through case. 
*/ + ret = capture_init_fusion(node, txnh, mode, can_coc); + if (ret != 0) { + assert("jmacd-6131", spin_txnh_is_not_locked(txnh)); + assert("jmacd-6132", spin_jnode_is_not_locked(node)); + return ret; + } + + /* The fall-through case is read request for committing block. Locks still + held. */ + } + + } else if ((mode & TXN_CAPTURE_WTYPES) != 0) { + + /* In this case, the page is unlocked and the txnh wishes exclusive access. */ + + if (txnh_atom != NULL) { + /* The txnh is already assigned: add the page to its atom. */ + ret = capture_assign_block(txnh, node); + if (ret != 0) { + /* E_REPEAT or otherwise */ + assert("jmacd-6133", spin_txnh_is_not_locked(txnh)); + assert("jmacd-6134", spin_jnode_is_not_locked(node)); + return ret; + } + + /* Success: Locks are still held. */ + + } else { + + /* In this case, neither txnh nor page are assigned to an atom. */ + block_atom = atom_begin_andlock(atom_alloc, node, txnh); + + if (!IS_ERR(block_atom)) { + /* Assign both, release atom lock. */ + assert("jmacd-125", block_atom->stage == ASTAGE_CAPTURE_FUSE); + + capture_assign_txnh_nolock(block_atom, txnh); + capture_assign_block_nolock(block_atom, node); + + UNLOCK_ATOM(block_atom); + } else { + /* all locks are released already */ + return PTR_ERR(block_atom); + } + + /* Success: Locks are still held. */ + } + + } else { + /* The jnode is uncaptured and its a read request -- fine. */ + assert("jmacd-411", CAPTURE_TYPE(mode) == TXN_CAPTURE_READ_ATOMIC); + } + + /* Successful case: both jnode and txnh are still locked. */ + assert("jmacd-740", spin_txnh_is_locked(txnh)); + assert("jmacd-741", spin_jnode_is_locked(node)); + + /* Release txnh lock, return with the jnode still locked. */ + UNLOCK_TXNH(txnh); + + return 0; +} + +static txn_capture +build_capture_mode(jnode * node, znode_lock_mode lock_mode, txn_capture flags) +{ + txn_capture cap_mode; + + assert("nikita-3187", spin_jnode_is_locked(node)); + + /* FIXME_JMACD No way to set TXN_CAPTURE_READ_MODIFY yet. 
*/ + + if (lock_mode == ZNODE_WRITE_LOCK) { + cap_mode = TXN_CAPTURE_WRITE; + } else if (node->atom != NULL) { + cap_mode = TXN_CAPTURE_WRITE; + } else if (0 && /* txnh->mode == TXN_READ_FUSING && */ + jnode_get_level(node) == LEAF_LEVEL) { + /* NOTE-NIKITA TXN_READ_FUSING is not currently used */ + /* We only need a READ_FUSING capture at the leaf level. This + is because the internal levels of the tree (twigs included) + are redundant from the point of the user that asked for a + read-fusing transcrash. The user only wants to read-fuse + atoms due to reading uncommitted data that another user has + written. It is the file system that reads/writes the + internal tree levels, the user only reads/writes leaves. */ + cap_mode = TXN_CAPTURE_READ_ATOMIC; + } else { + /* In this case (read lock at a non-leaf) there's no reason to + * capture. */ + /* cap_mode = TXN_CAPTURE_READ_NONCOM; */ + + /* Mark this node as "MISSED". It helps in further deadlock + * analysis */ + JF_SET(node, JNODE_MISSED_IN_CAPTURE); + return 0; + } + + cap_mode |= (flags & (TXN_CAPTURE_NONBLOCKING | + TXN_CAPTURE_DONT_FUSE)); + assert("nikita-3186", cap_mode != 0); + return cap_mode; +} + +/* This is an external interface to try_capture_block(), it calls + try_capture_block() repeatedly as long as -E_REPEAT is returned. + + @node: node to capture, + @lock_mode: read or write lock is used in capture mode calculation, + @flags: see txn_capture flags enumeration, + @can_coc : can copy-on-capture + + @return: 0 - node was successfully captured, -E_REPEAT - capture request + cannot be processed immediately as it was requested in flags, + < 0 - other errors. 
+*/ +reiser4_internal int +try_capture(jnode * node, znode_lock_mode lock_mode, + txn_capture flags, int can_coc) +{ + txn_atom *atom_alloc = NULL; + txn_capture cap_mode; + txn_handle * txnh = get_current_context()->trans; +#if REISER4_COPY_ON_CAPTURE + int coc_enabled = 1; +#endif + int ret; + + assert("jmacd-604", spin_jnode_is_locked(node)); + +repeat: + cap_mode = build_capture_mode(node, lock_mode, flags); + if (cap_mode == 0) + return 0; + + /* Repeat try_capture as long as -E_REPEAT is returned. */ +#if REISER4_COPY_ON_CAPTURE + ret = try_capture_block(txnh, node, cap_mode, &atom_alloc, can_coc && coc_enabled); + coc_enabled = 1; +#else + ret = try_capture_block(txnh, node, cap_mode, &atom_alloc, can_coc); +#endif + /* Regardless of non_blocking: + + If ret == 0 then jnode is still locked. + If ret != 0 then jnode is unlocked. + */ + assert("nikita-2674", ergo(ret == 0, spin_jnode_is_locked(node))); + assert("nikita-2675", ergo(ret != 0, spin_jnode_is_not_locked(node))); + + assert("nikita-2974", spin_txnh_is_not_locked(txnh)); + + if (ret == -E_REPEAT) { + /* E_REPEAT implies all locks were released, therefore we need + to take the jnode's lock again. */ + LOCK_JNODE(node); + + /* Although this may appear to be a busy loop, it is not. + There are several conditions that cause E_REPEAT to be + returned by the call to try_capture_block, all cases + indicating some kind of state change that means you should + retry the request and will get a different result. In some + cases this could be avoided with some extra code, but + generally it is done because the necessary locks were + released as a result of the operation and repeating is the + simplest thing to do (less bug potential). 
The cases are: + atom fusion returns E_REPEAT after it completes (jnode and + txnh were unlocked); race conditions in assign_block, + assign_txnh, and init_fusion return E_REPEAT (trylock + failure); after going to sleep in capture_fuse_wait + (request was blocked but may now succeed). I'm not quite + sure how capture_copy works yet, but it may also return + E_REPEAT. When the request is legitimately blocked, the + requestor goes to sleep in fuse_wait, so this is not a busy + loop. */ + /* NOTE-NIKITA: still don't understand: + + try_capture_block->capture_assign_txnh->spin_trylock_atom->E_REPEAT + + looks like busy loop? + */ + goto repeat; + } + +#if REISER4_COPY_ON_CAPTURE + if (ret == -E_WAIT) { + reiser4_stat_inc(coc.coc_wait); + /* disable COC for the next loop iteration */ + coc_enabled = 0; + LOCK_JNODE(node); + goto repeat; + } +#endif + + /* free extra atom object that was possibly allocated by + try_capture_block(). + + Do this before acquiring jnode spin lock to + minimize time spent under lock. --nikita */ + if (atom_alloc != NULL) { + kmem_cache_free(_atom_slab, atom_alloc); + } + + if (ret != 0) { + if (ret == -E_BLOCK) { + assert("nikita-3360", cap_mode & TXN_CAPTURE_NONBLOCKING); + ret = -E_REPEAT; + } + + /* Failure means jnode is not locked. FIXME_LATER_JMACD May + want to fix the above code to avoid releasing the lock and + re-acquiring it, but there are cases were failure occurs + when the lock is not held, and those cases would need to be + modified to re-take the lock. */ + LOCK_JNODE(node); + } + + /* Jnode is still locked. */ + assert("jmacd-760", spin_jnode_is_locked(node)); + return ret; +} + +/* This function sets up a call to try_capture_block and repeats as long as -E_REPEAT is + returned by that routine. The txn_capture request mode is computed here depending on + the transaction handle's type and the lock request. 
This is called from the depths of + the lock manager with the jnode lock held and it always returns with the jnode lock + held. +*/ + +/* fuse all 'active' atoms of lock owners of given node. */ +static int +fuse_not_fused_lock_owners(txn_handle * txnh, znode * node) +{ + lock_handle *lh; + int repeat = 0; + txn_atom *atomh = txnh->atom; + +/* assert ("zam-689", znode_is_rlocked (node));*/ + assert("zam-690", spin_znode_is_locked(node)); + assert("zam-691", spin_txnh_is_locked(txnh)); + assert("zam-692", atomh != NULL); + + RLOCK_ZLOCK(&node->lock); + + if (!spin_trylock_atom(atomh)) { + repeat = 1; + goto fail; + } + + /* inspect list of lock owners */ + for_all_type_safe_list(owners, &node->lock.owners, lh) { + reiser4_context *ctx; + txn_atom *atomf; + + ctx = get_context_by_lock_stack(lh->owner); + + if (ctx == get_current_context()) + continue; + + if (!spin_trylock_txnh(ctx->trans)) { + repeat = 1; + continue; + } + + atomf = ctx->trans->atom; + + if (atomf == NULL) { + capture_assign_txnh_nolock(atomh, ctx->trans); + UNLOCK_TXNH(ctx->trans); + + reiser4_wake_up(lh->owner); + continue; + } + + if (atomf == atomh) { + UNLOCK_TXNH(ctx->trans); + continue; + } + + if (!spin_trylock_atom(atomf)) { + UNLOCK_TXNH(ctx->trans); + repeat = 1; + continue; + } + + UNLOCK_TXNH(ctx->trans); + + if (atomf == atomh || atomf->stage > ASTAGE_CAPTURE_WAIT) { + UNLOCK_ATOM(atomf); + continue; + } + // repeat = 1; + + reiser4_wake_up(lh->owner); + + UNLOCK_TXNH(txnh); + RUNLOCK_ZLOCK(&node->lock); + spin_unlock_znode(node); + + /* @atomf is "small" and @atomh is "large", by + definition. 
Small atom is destroyed and large is unlocked + inside capture_fuse_into() + */ + capture_fuse_into(atomf, atomh); + return RETERR(-E_REPEAT); + } + + UNLOCK_ATOM(atomh); + + if (repeat) { +fail: + UNLOCK_TXNH(txnh); + RUNLOCK_ZLOCK(&node->lock); + spin_unlock_znode(node); + return RETERR(-E_REPEAT); + } + + RUNLOCK_ZLOCK(&node->lock); + return 0; +} + +/* This is the interface to capture unformatted nodes via their struct page + reference. Currently it is only used in reiser4_invalidatepage */ +reiser4_internal int +try_capture_page_to_invalidate(struct page *pg) +{ + int ret; + jnode *node; + + assert("umka-292", pg != NULL); + assert("nikita-2597", PageLocked(pg)); + + if (IS_ERR(node = jnode_of_page(pg))) { + return PTR_ERR(node); + } + + LOCK_JNODE(node); + unlock_page(pg); + + ret = try_capture(node, ZNODE_WRITE_LOCK, 0, 0/* no copy on capture */); + UNLOCK_JNODE(node); + jput(node); + lock_page(pg); + return ret; +} + +/* This informs the transaction manager when a node is deleted. Add the block to the + atom's delete set and uncapture the block. + +VS-FIXME-HANS: this E_REPEAT paradigm clutters the code and creates a need for +explanations. find all the functions that use it, and unless there is some very +good reason to use it (I have not noticed one so far and I doubt it exists, but maybe somewhere somehow....), +move the loop to inside the function. + +VS-FIXME-HANS: can this code be at all streamlined? In particular, can you lock and unlock the jnode fewer times? 
+ */ +reiser4_internal void +uncapture_page(struct page *pg) +{ + jnode *node; + txn_atom *atom; + + assert("umka-199", pg != NULL); + assert("nikita-3155", PageLocked(pg)); + + reiser4_clear_page_dirty(pg); + + reiser4_wait_page_writeback(pg); + + node = (jnode *) (pg->private); + if (node == NULL) + return; + + LOCK_JNODE(node); + + eflush_del(node, 1/* page is locked */); + /*assert ("zam-815", !JF_ISSET(node, JNODE_EFLUSH));*/ + + atom = jnode_get_atom(node); + if (atom == NULL) { + assert("jmacd-7111", !jnode_is_dirty(node)); + UNLOCK_JNODE (node); + return; + } + + /* We can remove jnode from transaction even if it is on flush queue + * prepped list, we only need to be sure that flush queue is not being + * written by write_fq(). write_fq() does not use atom spin lock for + * protection of the prepped nodes list, instead write_fq() increments + * atom's nr_running_queues counters for the time when prepped list is + * not protected by spin lock. Here we check this counter if we want + * to remove jnode from flush queue and, if the counter is not zero, + * wait all write_fq() for this atom to complete. This is not + * significant overhead. */ + while (JF_ISSET(node, JNODE_FLUSH_QUEUED) && atom->nr_running_queues) { + UNLOCK_JNODE(node); + /* + * at this moment we want to wait for "atom event", viz. wait + * until @node can be removed from flush queue. But + * atom_wait_event() cannot be called with page locked, because + * it deadlocks with jnode_extent_write(). Unlock page, after + * making sure (through page_cache_get()) that it cannot be + * released from memory. + */ + page_cache_get(pg); + unlock_page(pg); + atom_wait_event(atom); + lock_page(pg); + /* + * page may has been detached by ->writepage()->releasepage(). 
+ */ + reiser4_wait_page_writeback(pg); + LOCK_JNODE(node); + eflush_del(node, 1); + page_cache_release(pg); + atom = jnode_get_atom(node); +/* VS-FIXME-HANS: improve the commenting in this function */ + if (atom == NULL) { + UNLOCK_JNODE(node); + return; + } + } + uncapture_block(node); + UNLOCK_ATOM(atom); + jput(node); +} + +/* this is used in extent's kill hook to uncapture and unhash jnodes attached to inode's tree of jnodes */ +reiser4_internal void +uncapture_jnode(jnode *node) +{ + txn_atom *atom; + + assert("vs-1462", spin_jnode_is_locked(node)); + assert("", node->pg == 0); + + if (JF_ISSET(node, JNODE_EFLUSH)) { + eflush_free(node); + JF_CLR(node, JNODE_EFLUSH); + } + /*eflush_del(node, 0);*/ + + /*jnode_make_clean(node);*/ + atom = jnode_get_atom(node); + if (atom == NULL) { + assert("jmacd-7111", !jnode_is_dirty(node)); + UNLOCK_JNODE (node); + return; + } + + uncapture_block(node); + UNLOCK_ATOM(atom); + jput(node); +} + +/* No-locking version of assign_txnh. Sets the transaction handle's atom pointer, + increases atom refcount and txnh_count, adds to txnh_list. */ +static void +capture_assign_txnh_nolock(txn_atom * atom, txn_handle * txnh) +{ + assert("umka-200", atom != NULL); + assert("umka-201", txnh != NULL); + + assert("jmacd-822", spin_txnh_is_locked(txnh)); + assert("jmacd-823", spin_atom_is_locked(atom)); + assert("jmacd-824", txnh->atom == NULL); + assert("nikita-3540", atom_isopen(atom)); + + atomic_inc(&atom->refcount); + txnh->atom = atom; + txnh_list_push_back(&atom->txnh_list, txnh); + atom->txnh_count += 1; +} + +/* No-locking version of assign_block. Sets the block's atom pointer, references the + block, adds it to the clean or dirty capture_jnode list, increments capture_count. 
*/ +static void +capture_assign_block_nolock(txn_atom * atom, jnode * node) +{ + assert("umka-202", atom != NULL); + assert("umka-203", node != NULL); + assert("jmacd-321", spin_jnode_is_locked(node)); + assert("umka-295", spin_atom_is_locked(atom)); + assert("jmacd-323", node->atom == NULL); + BUG_ON(!capture_list_is_clean(node)); + assert("nikita-3470", !jnode_is_dirty(node)); + + /* Pointer from jnode to atom is not counted in atom->refcount. */ + node->atom = atom; + + capture_list_push_back(ATOM_CLEAN_LIST(atom), node); + atom->capture_count += 1; + /* reference to jnode is acquired by atom. */ + jref(node); + + ON_DEBUG(count_jnode(atom, node, NOT_CAPTURED, CLEAN_LIST, 1)); + + LOCK_CNT_INC(t_refs); +} + +#if REISER4_COPY_ON_CAPTURE +static void +set_cced_bit(jnode *node) +{ + BUG_ON(JF_ISSET(node, JNODE_CCED)); + JF_SET(node, JNODE_CCED); +} +#endif + +static void +clear_cced_bits(jnode *node) +{ + JF_CLR(node, JNODE_CCED); +} + +int +is_cced(const jnode *node) +{ + return JF_ISSET(node, JNODE_CCED); +} + +/* common code for dirtying both unformatted jnodes and formatted znodes. */ +static void +do_jnode_make_dirty(jnode * node, txn_atom * atom) +{ + assert("zam-748", spin_jnode_is_locked(node)); + assert("zam-750", spin_atom_is_locked(atom)); + assert("jmacd-3981", !jnode_is_dirty(node)); + + JF_SET(node, JNODE_DIRTY); + + get_current_context()->nr_marked_dirty ++; + + /* We grab2flush_reserve one additional block only if node was + not CREATED and jnode_flush did not sort it into neither + relocate set nor overwrite one. If node is in overwrite or + relocate set we assume that atom's flush reserved counter was + already adjusted. 
*/ + if (!JF_ISSET(node, JNODE_CREATED) && !JF_ISSET(node, JNODE_RELOC) + && !JF_ISSET(node, JNODE_OVRWR) && jnode_is_leaf(node) + && !jnode_is_cluster_page(node)) { + assert("vs-1093", !blocknr_is_fake(&node->blocknr)); + assert("vs-1506", *jnode_get_block(node) != 0); + grabbed2flush_reserved_nolock(atom, (__u64)1); + JF_SET(node, JNODE_FLUSH_RESERVED); + } + + if (!JF_ISSET(node, JNODE_FLUSH_QUEUED)) { + /* If the atom is not set yet, it will be added to the appropriate list in + capture_assign_block_nolock. */ + /* Sometimes a node is set dirty before being captured -- the case for new + jnodes. In that case the jnode will be added to the appropriate list + in capture_assign_block_nolock. Another reason not to re-link jnode is + that jnode is on a flush queue (see flush.c for details) */ + + int level = jnode_get_level(node); + + assert("nikita-3152", !JF_ISSET(node, JNODE_OVRWR)); + assert("zam-654", atom->stage < ASTAGE_PRE_COMMIT); + assert("nikita-2607", 0 <= level); + assert("nikita-2606", level <= REAL_MAX_ZTREE_HEIGHT); + + capture_list_remove(node); + capture_list_push_back(ATOM_DIRTY_LIST(atom, level), node); + ON_DEBUG(count_jnode(atom, node, NODE_LIST(node), DIRTY_LIST, 1)); + + /* + * JNODE_CCED bit protects clean copy (page created by + * copy-on-capture) from being evicted from the memory. This + * is necessary, because otherwise jload() would load obsolete + * disk block (up-to-date original is still in memory). But + * once jnode is dirtied, it cannot be released without + * storing its content on the disk, so protection is no longer + * necessary. + */ + clear_cced_bits(node); + } +} + +/* Set the dirty status for this (spin locked) jnode. 
*/ +reiser4_internal void +jnode_make_dirty_locked(jnode * node) +{ + assert("umka-204", node != NULL); + assert("zam-7481", spin_jnode_is_locked(node)); + + if (REISER4_DEBUG && rofs_jnode(node)) { + warning("nikita-3365", "Dirtying jnode on rofs"); + dump_stack(); + } + + /* Fast check for already dirty node */ + if (!jnode_is_dirty(node)) { + txn_atom * atom; + + atom = jnode_get_atom (node); + assert("vs-1094", atom); + /* Check jnode dirty status again because node spin lock might + * be released inside jnode_get_atom(). */ + if (likely(!jnode_is_dirty(node))) + do_jnode_make_dirty(node, atom); + UNLOCK_ATOM (atom); + } +} + +/* Set the dirty status for this znode. */ +reiser4_internal void +znode_make_dirty(znode * z) +{ + jnode *node; + struct page *page; + + assert("umka-204", z != NULL); + assert("nikita-3290", znode_above_root(z) || znode_is_loaded(z)); + assert("nikita-3291", !ZF_ISSET(z, JNODE_EFLUSH)); + assert("nikita-3560", znode_is_write_locked(z)); + + node = ZJNODE(z); + /* znode is longterm locked, we can check dirty bit without spinlock */ + if (JF_ISSET(node, JNODE_DIRTY)) { + /* znode is dirty already. All we have to do is to change znode version */ + z->version = znode_build_version(jnode_get_tree(node)); + return; + } + + LOCK_JNODE(node); + jnode_make_dirty_locked(node); + page = jnode_page(node); + if (page != NULL) { + /* this is useful assertion (allows one to check that no + * modifications are lost due to update of in-flight page), + * but it requires locking on page to check PG_writeback + * bit. */ + /* assert("nikita-3292", + !PageWriteback(page) || ZF_ISSET(z, JNODE_WRITEBACK)); */ + page_cache_get(page); + + /* jnode lock is not needed for the rest of + * znode_set_dirty(). */ + UNLOCK_JNODE(node); + /* reiser4 file write code calls set_page_dirty for + * unformatted nodes, for formatted nodes we do it here. 
*/ + set_page_dirty_internal(page, 0); + page_cache_release(page); + /* bump version counter in znode */ + z->version = znode_build_version(jnode_get_tree(node)); + } else { + assert("zam-596", znode_above_root(JZNODE(node))); + UNLOCK_JNODE(node); + } + + assert("nikita-1900", znode_is_write_locked(z)); + assert("jmacd-9777", node->atom != NULL); +} + +reiser4_internal int +sync_atom(txn_atom *atom) +{ + int result; + txn_handle *txnh; + + txnh = get_current_context()->trans; + + result = 0; + if (atom != NULL) { + if (atom->stage < ASTAGE_PRE_COMMIT) { + LOCK_TXNH(txnh); + capture_assign_txnh_nolock(atom, txnh); + result = force_commit_atom_nolock(txnh); + } else if (atom->stage < ASTAGE_POST_COMMIT) { + /* wait atom commit */ + atom_wait_event(atom); + /* try once more */ + result = RETERR(-E_REPEAT); + } else + UNLOCK_ATOM(atom); + } + return result; +} + +#if REISER4_DEBUG + +void check_fq(const txn_atom *atom); + +/* move jnode form one list to another + call this after atom->capture_count is updated */ +void +count_jnode(txn_atom *atom, jnode *node, atom_list old_list, atom_list new_list, int check_lists) +{ +#if REISER4_COPY_ON_CAPTURE + assert("", spin_atom_is_locked(atom)); +#else + assert("zam-1018", atom_is_protected(atom)); +#endif + assert("", spin_jnode_is_locked(node)); + assert("", NODE_LIST(node) == old_list); + + switch(NODE_LIST(node)) { + case NOT_CAPTURED: + break; + case DIRTY_LIST: + assert("", atom->dirty > 0); + atom->dirty --; + break; + case CLEAN_LIST: + assert("", atom->clean > 0); + atom->clean --; + break; + case FQ_LIST: + assert("", atom->fq > 0); + atom->fq --; + break; + case WB_LIST: + assert("", atom->wb > 0); + atom->wb --; + break; + case OVRWR_LIST: + assert("", atom->ovrwr > 0); + atom->ovrwr --; + break; + case PROTECT_LIST: + /* protect list is an intermediate atom's list to which jnodes + get put from dirty list before disk space is allocated for + them. 
From this list jnodes can either go to flush queue list + or back to dirty list */ + assert("", atom->protect > 0); + assert("", new_list == FQ_LIST || new_list == DIRTY_LIST); + atom->protect --; + break; + default: + impossible("", ""); + } + + switch(new_list) { + case NOT_CAPTURED: + break; + case DIRTY_LIST: + atom->dirty ++; + break; + case CLEAN_LIST: + atom->clean ++; + break; + case FQ_LIST: + atom->fq ++; + break; + case WB_LIST: + atom->wb ++; + break; + case OVRWR_LIST: + atom->ovrwr ++; + break; + case PROTECT_LIST: + assert("", old_list == DIRTY_LIST); + atom->protect ++; + break; + default: + impossible("", ""); + } + ASSIGN_NODE_LIST(node, new_list); + if (0 && check_lists) { + int count; + tree_level level; + jnode *node; + + count = 0; + + /* flush queue list */ + /*check_fq(atom);*/ + + /* dirty list */ + count = 0; + for (level = 0; level < REAL_MAX_ZTREE_HEIGHT + 1; level += 1) { + for_all_type_safe_list(capture, ATOM_DIRTY_LIST(atom, level), node) + count ++; + } + if (count != atom->dirty) + warning("", "dirty counter %d, real %d\n", atom->dirty, count); + + /* clean list */ + count = 0; + for_all_type_safe_list(capture, ATOM_CLEAN_LIST(atom), node) + count ++; + if (count != atom->clean) + warning("", "clean counter %d, real %d\n", atom->clean, count); + + /* wb list */ + count = 0; + for_all_type_safe_list(capture, ATOM_WB_LIST(atom), node) + count ++; + if (count != atom->wb) + warning("", "wb counter %d, real %d\n", atom->wb, count); + + /* overwrite list */ + count = 0; + for_all_type_safe_list(capture, ATOM_OVRWR_LIST(atom), node) + count ++; + + if (count != atom->ovrwr) + warning("", "ovrwr counter %d, real %d\n", atom->ovrwr, count); + } + assert("vs-1624", atom->num_queued == atom->fq); + if (atom->capture_count != atom->dirty + atom->clean + atom->ovrwr + atom->wb + atom->fq + atom->protect) { + printk("count %d, dirty %d clean %d ovrwr %d wb %d fq %d protect %d\n", atom->capture_count, atom->dirty, atom->clean, atom->ovrwr, 
atom->wb, atom->fq, atom->protect); + assert("vs-1622", + atom->capture_count == atom->dirty + atom->clean + atom->ovrwr + atom->wb + atom->fq + atom->protect); + } +} + +#endif + +/* Make node OVRWR and put it on atom->overwrite_nodes list, atom lock and jnode + * lock should be taken before calling this function. */ +reiser4_internal void jnode_make_wander_nolock (jnode * node) +{ + txn_atom * atom; + + assert("nikita-2431", node != NULL); + assert("nikita-2432", !JF_ISSET(node, JNODE_RELOC)); + assert("nikita-3153", jnode_is_dirty(node)); + assert("zam-897", !JF_ISSET(node, JNODE_FLUSH_QUEUED)); + assert("nikita-3367", !blocknr_is_fake(jnode_get_block(node))); + + atom = node->atom; + + assert("zam-895", atom != NULL); + assert("zam-894", atom_is_protected(atom)); + + JF_SET(node, JNODE_OVRWR); + capture_list_remove_clean(node); + capture_list_push_back(ATOM_OVRWR_LIST(atom), node); + /*XXXX*/ON_DEBUG(count_jnode(atom, node, DIRTY_LIST, OVRWR_LIST, 1)); +} + +/* Same as jnode_make_wander_nolock, but all necessary locks are taken inside + * this function. 
*/ +reiser4_internal void jnode_make_wander (jnode * node) +{ + txn_atom * atom; + + LOCK_JNODE(node); + atom = jnode_get_atom(node); + assert ("zam-913", atom != NULL); + assert ("zam-914", !JF_ISSET(node, JNODE_RELOC)); + + jnode_make_wander_nolock(node); + UNLOCK_ATOM(atom); + UNLOCK_JNODE(node); +} + +/* this just sets RELOC bit */ +static void +jnode_make_reloc_nolock(flush_queue_t *fq, jnode *node) +{ + assert("vs-1480", spin_jnode_is_locked(node)); + assert ("zam-916", jnode_is_dirty(node)); + assert ("zam-917", !JF_ISSET(node, JNODE_RELOC)); + assert ("zam-918", !JF_ISSET(node, JNODE_OVRWR)); + assert ("zam-920", !JF_ISSET(node, JNODE_FLUSH_QUEUED)); + assert ("nikita-3367", !blocknr_is_fake(jnode_get_block(node))); + + jnode_set_reloc(node); +} + +/* Make znode RELOC and put it on flush queue */ +reiser4_internal void znode_make_reloc (znode *z, flush_queue_t * fq) +{ + jnode *node; + txn_atom * atom; + + node = ZJNODE(z); + LOCK_JNODE(node); + + atom = jnode_get_atom(node); + assert ("zam-919", atom != NULL); + + jnode_make_reloc_nolock(fq, node); + queue_jnode(fq, node); + + UNLOCK_ATOM(atom); + UNLOCK_JNODE(node); + +} + +/* Make unformatted node RELOC and put it on flush queue */ +reiser4_internal void +unformatted_make_reloc(jnode *node, flush_queue_t * fq) +{ + assert("vs-1479", jnode_is_unformatted(node)); + + jnode_make_reloc_nolock(fq, node); + mark_jnode_queued(fq, node); +} + +static int +trylock_wait(txn_atom *atom, txn_handle * txnh, jnode * node) +{ + if (unlikely(!spin_trylock_atom(atom))) { + atomic_inc(&atom->refcount); + + UNLOCK_JNODE(node); + UNLOCK_TXNH(txnh); + + LOCK_ATOM(atom); + /* caller should eliminate extra reference by calling + * atom_dec_and_unlock() for this atom. */ + return 1; + } else + return 0; +} + +/* + * in transaction manager jnode spin lock and transaction handle spin lock + * nest within atom spin lock. 
During capturing we are in a situation when + * jnode and transaction handle spin locks are held and we want to manipulate + * atom's data (capture lists, and txnh list) to add node and/or handle to the + * atom. Releasing jnode (or txnh) spin lock at this point is unsafe, because + * concurrent fusion can render assumption made by capture so far (about + * ->atom pointers in jnode and txnh) invalid. Initial code used try-lock and + * if atom was busy returned -E_REPEAT to the top level. This can lead to the + * busy loop if atom is locked for long enough time. Function below tries to + * throttle this loop. + * + */ +/* ZAM-FIXME-HANS: how feasible would it be to use our hi-lo priority locking + mechanisms/code for this as well? Does that make any sense? */ +/* ANSWER(Zam): I am not sure that I understand you proposal right, but the idea + might be in inventing spin_lock_lopri() which should be a complex loop with + "release lock" messages check like we have in the znode locking. I think we + should not substitute spin locks by more complex busy loops. Once it was + done that way in try_capture_block() where spin lock waiting was spread in a + busy loop through several functions. The proper solution should be in + making spin lock contention rare. */ +static int +trylock_throttle(txn_atom *atom, txn_handle * txnh, jnode * node) +{ + assert("nikita-3224", atom != NULL); + assert("nikita-3225", txnh != NULL); + assert("nikita-3226", node != NULL); + + assert("nikita-3227", spin_txnh_is_locked(txnh)); + assert("nikita-3229", spin_jnode_is_locked(node)); + + if (unlikely(trylock_wait(atom, txnh, node) != 0)) { + atom_dec_and_unlock(atom); + return RETERR(-E_REPEAT); + } else + return 0; +} + +/* This function assigns a block to an atom, but first it must obtain the atom lock. If + the atom lock is busy, it returns -E_REPEAT to avoid deadlock with a fusing atom. Since + the transaction handle is currently open, we know the atom must also be open. 
*/ +static int +capture_assign_block(txn_handle * txnh, jnode * node) +{ + txn_atom *atom; + int result; + + assert("umka-206", txnh != NULL); + assert("umka-207", node != NULL); + + atom = txnh->atom; + + assert("umka-297", atom != NULL); + + result = trylock_throttle(atom, txnh, node); + if (result != 0) { + /* this avoids a busy loop, but we return -E_REPEAT anyway to + * simplify things. */ + return result; + } else { + assert("jmacd-19", atom_isopen(atom)); + + /* Add page to capture list. */ + capture_assign_block_nolock(atom, node); + + /* Success holds onto jnode & txnh locks. Unlock atom. */ + UNLOCK_ATOM(atom); + return 0; + } +} + +/* This function assigns a handle to an atom, but first it must obtain the atom lock. If + the atom is busy, it returns -E_REPEAT to avoid deadlock with a fusing atom. Unlike + capture_assign_block, the atom may be closed but we cannot know this until the atom is + locked. If the atom is closed and the request is to read, it is as if the block is + unmodified and the request is satisfied without actually assigning the transaction + handle. If the atom is closed and the handle requests to write the block, then + initiate copy-on-capture. +*/ +static int +capture_assign_txnh(jnode * node, txn_handle * txnh, txn_capture mode, int can_coc) +{ + txn_atom *atom; + + assert("umka-208", node != NULL); + assert("umka-209", txnh != NULL); + + atom = node->atom; + + assert("umka-298", atom != NULL); + + /* + * optimization: this code went through three evolution stages. Main + * driving force of evolution here is lock ordering: + * + * at the entry to this function following pre-conditions are met: + * + * 1. txnh and node are both spin locked, + * + * 2. node belongs to atom, and + * + * 3. txnh doesn't. + * + * What we want to do here is to acquire spin lock on node's atom and + * modify it somehow depending on its ->stage. In the simplest case, + * where ->stage is ASTAGE_CAPTURE_FUSE, txnh should be added to + * atom's list. 
Problem is that atom spin lock nests outside of jnode + * and transaction handle ones. So, we cannot just LOCK_ATOM here. + * + * Solutions tried here: + * + * 1. spin_trylock(atom), return -E_REPEAT on failure. + * + * 2. spin_trylock(atom). On failure to acquire lock, increment + * atom->refcount, release all locks, and spin on atom lock. Then + * decrement ->refcount, unlock atom and return -E_REPEAT. + * + * 3. like previous one, but before unlocking atom, re-acquire + * spin locks on node and txnh and re-check whether function + * pre-conditions are still met. Continue boldly if they are. + * + */ + if (trylock_wait(atom, txnh, node) != 0) { + LOCK_JNODE(node); + LOCK_TXNH(txnh); + /* NOTE-NIKITA is it at all possible that current txnh + * spontaneously changes ->atom from NULL to non-NULL? */ + if (node->atom == NULL || + txnh->atom != NULL || atom != node->atom) { + /* something changed. Caller has to re-decide */ + UNLOCK_TXNH(txnh); + UNLOCK_JNODE(node); + atom_dec_and_unlock(atom); + return RETERR(-E_REPEAT); + } else { + /* atom still has a jnode on its list (node->atom == + * atom), it means atom is not fused or finished + * (committed), we can safely decrement its refcount + * because it is not a last reference. */ + atomic_dec(&atom->refcount); + assert("zam-990", atomic_read(&atom->refcount) > 0); + } + } + + if (atom->stage == ASTAGE_CAPTURE_WAIT && + (atom->txnh_count != 0 || + atom_should_commit(atom) || atom_should_commit_asap(atom))) { + /* We don't fuse with the atom in ASTAGE_CAPTURE_WAIT only if + * there is an open transaction handle. It makes sense: those + * atoms should not wait for ktxnmgrd to flush and commit them. + * And, it solves deadlocks with loop back devices (reiser4 over + * loopback over reiser4), when ktxnmgrd is busy committing one + * atom (above the loop back device) and can't flush an atom + * below the loopback. */ + + /* The atom could be blocking requests--this is the first chance we've had + to test it. 
Since this txnh is not yet assigned, the fuse_wait logic + is not to avoid deadlock, it's just waiting. Releases all three locks + and returns E_REPEAT. */ + + return capture_fuse_wait(node, txnh, atom, NULL, mode); + + } else if (atom->stage > ASTAGE_CAPTURE_WAIT) { + + /* The block is involved with a committing atom. */ + if (CAPTURE_TYPE(mode) == TXN_CAPTURE_READ_ATOMIC) { + + /* A read request for a committing block can be satisfied w/o + COPY-ON-CAPTURE. */ + + /* Success holds onto the jnode & txnh lock. Continue to unlock + atom below. */ + + } else { + + /* Perform COPY-ON-CAPTURE. Copy and try again. This function + releases all three locks. */ + return capture_copy(node, txnh, atom, NULL, mode, can_coc); + } + + } else { + + assert("jmacd-160", atom->stage == ASTAGE_CAPTURE_FUSE || + (atom->stage == ASTAGE_CAPTURE_WAIT && atom->txnh_count == 0)); + + /* Add txnh to active list. */ + capture_assign_txnh_nolock(atom, txnh); + + /* Success holds onto the jnode & txnh lock. Continue to unlock atom + below. */ + } + + /* Unlock the atom */ + UNLOCK_ATOM(atom); + return 0; +} + +reiser4_internal int +capture_super_block(struct super_block *s) +{ + int result; + znode *uber; + lock_handle lh; + + init_lh(&lh); + result = get_uber_znode(get_tree(s), + ZNODE_WRITE_LOCK, ZNODE_LOCK_LOPRI, &lh); + if (result) + return result; + + uber = lh.node; + /* Grabbing one block for superblock */ + result = reiser4_grab_space_force((__u64)1, BA_RESERVED); + if (result != 0) + return result; + + znode_make_dirty(uber); + + done_lh(&lh); + return 0; +} + +/* Wakeup every handle on the atom's WAITFOR list */ +static void +wakeup_atom_waitfor_list(txn_atom * atom) +{ + txn_wait_links *wlinks; + + assert("umka-210", atom != NULL); + + /* atom is locked */ + for_all_type_safe_list(fwaitfor, &atom->fwaitfor_list, wlinks) { + if (wlinks->waitfor_cb == NULL || + wlinks->waitfor_cb(atom, wlinks)) + /* Wake up. 
*/ + reiser4_wake_up(wlinks->_lock_stack); + } +} + +/* Wakeup every handle on the atom's WAITING list */ +static void +wakeup_atom_waiting_list(txn_atom * atom) +{ + txn_wait_links *wlinks; + + assert("umka-211", atom != NULL); + + /* atom is locked */ + for_all_type_safe_list(fwaiting, &atom->fwaiting_list, wlinks) { + if (wlinks->waiting_cb == NULL || + wlinks->waiting_cb(atom, wlinks)) + /* Wake up. */ + reiser4_wake_up(wlinks->_lock_stack); + } +} + +/* helper function used by capture_fuse_wait() to avoid "spurious wake-ups" */ +static int wait_for_fusion(txn_atom * atom, txn_wait_links * wlinks) +{ + assert("nikita-3330", atom != NULL); + assert("nikita-3331", spin_atom_is_locked(atom)); + + + /* atom->txnh_count == 1 is for waking waiters up if we are releasing + * last transaction handle. */ + return atom->stage != ASTAGE_CAPTURE_WAIT || atom->txnh_count == 1; +} + +/* The general purpose of this function is to wait on the first of two possible events. + The situation is that a handle (and its atom atomh) is blocked trying to capture a + block (i.e., node) but the node's atom (atomf) is in the CAPTURE_WAIT state. The + handle's atom (atomh) is not in the CAPTURE_WAIT state. However, atomh could fuse with + another atom or, due to age, enter the CAPTURE_WAIT state itself, at which point it + needs to unblock the handle to avoid deadlock. When the txnh is unblocked it will + proceed and fuse the two atoms in the CAPTURE_WAIT state. + + In other words, if either atomh or atomf change state, the handle will be awakened, + thus there are two lists per atom: WAITING and WAITFOR. + + This is also called by capture_assign_txnh with (atomh == NULL) to wait for atomf to + close but it is not assigned to an atom of its own. + + Lock ordering in this method: all four locks are held: JNODE_LOCK, TXNH_LOCK, + BOTH_ATOM_LOCKS. Result: all four locks are released. 
+*/ +static int +capture_fuse_wait(jnode * node, txn_handle * txnh, txn_atom * atomf, txn_atom * atomh, txn_capture mode) +{ + int ret; + + /* Initialize the waiting list links. */ + txn_wait_links wlinks; + + assert("umka-212", node != NULL); + assert("umka-213", txnh != NULL); + assert("umka-214", atomf != NULL); + + /* We do not need the node lock. */ + UNLOCK_JNODE(node); + + if ((mode & TXN_CAPTURE_NONBLOCKING) != 0) { + UNLOCK_TXNH(txnh); + UNLOCK_ATOM(atomf); + + if (atomh) { + UNLOCK_ATOM(atomh); + } + + return RETERR(-E_BLOCK); + } + + init_wlinks(&wlinks); + + /* Add txnh to atomf's waitfor list, unlock atomf. */ + fwaitfor_list_push_back(&atomf->fwaitfor_list, &wlinks); + wlinks.waitfor_cb = wait_for_fusion; + atomic_inc(&atomf->refcount); + UNLOCK_ATOM(atomf); + + if (atomh) { + /* Add txnh to atomh's waiting list, unlock atomh. */ + fwaiting_list_push_back(&atomh->fwaiting_list, &wlinks); + atomic_inc(&atomh->refcount); + UNLOCK_ATOM(atomh); + } + + /* Go to sleep. */ + UNLOCK_TXNH(txnh); + + ret = prepare_to_sleep(wlinks._lock_stack); + if (ret == 0) { + go_to_sleep(wlinks._lock_stack); + ret = RETERR(-E_REPEAT); + } + + /* Remove from the waitfor list. */ + LOCK_ATOM(atomf); + fwaitfor_list_remove(&wlinks); + atom_dec_and_unlock(atomf); + + if (atomh) { + /* Remove from the waiting list. */ + LOCK_ATOM(atomh); + fwaiting_list_remove(&wlinks); + atom_dec_and_unlock(atomh); + } + + assert("nikita-2186", ergo(ret, spin_jnode_is_not_locked(node))); + return ret; +} + +static inline int +capture_init_fusion_locked(jnode * node, txn_handle * txnh, txn_capture mode, int can_coc) +{ + txn_atom *atomf; + txn_atom *atomh; + + assert("umka-216", txnh != NULL); + assert("umka-217", node != NULL); + + atomh = txnh->atom; + atomf = node->atom; + + /* The txnh atom must still be open (since the txnh is active)... the node atom may + be in some later stage (checked next). 
*/ + assert("jmacd-20", atom_isopen(atomh)); + + /* If the node atom is in the CAPTURE_WAIT state then we should wait, except to + avoid deadlock we still must fuse if the txnh atom is also in CAPTURE_WAIT. */ + if (atomf->stage == ASTAGE_CAPTURE_WAIT && + atomh->stage != ASTAGE_CAPTURE_WAIT && + (atomf->txnh_count != 0 || + atom_should_commit(atomf) || atom_should_commit_asap(atomf))) { + /* see comment in capture_assign_txnh() about the + * "atomf->txnh_count != 0" condition. */ + /* This unlocks all four locks and returns E_REPEAT. */ + return capture_fuse_wait(node, txnh, atomf, atomh, mode); + + } else if (atomf->stage > ASTAGE_CAPTURE_WAIT) { + + /* The block is involved with a committing atom. */ + if (CAPTURE_TYPE(mode) == TXN_CAPTURE_READ_ATOMIC) { + /* A read request for a committing block can be satisfied w/o + COPY-ON-CAPTURE. Success holds onto the jnode & txnh + locks. */ + UNLOCK_ATOM(atomf); + UNLOCK_ATOM(atomh); + return 0; + } else { + /* Perform COPY-ON-CAPTURE. Copy and try again. This function + releases all four locks. */ + return capture_copy(node, txnh, atomf, atomh, mode, can_coc); + } + } + + /* Because atomf's stage <= CAPTURE_WAIT */ + assert("jmacd-175", atom_isopen(atomf)); + + /* If we got here it's either because the atomh is in CAPTURE_WAIT or because the + atomf is not in CAPTURE_WAIT. */ + assert("jmacd-176", (atomh->stage == ASTAGE_CAPTURE_WAIT || atomf->stage != ASTAGE_CAPTURE_WAIT) || atomf->txnh_count == 0); + + /* Now release the txnh and jnode locks: only holding the atoms at this point. */ + UNLOCK_TXNH(txnh); + UNLOCK_JNODE(node); + + /* Decide which should be kept and which should be merged. */ + if (atom_pointer_count(atomf) < atom_pointer_count(atomh)) { + capture_fuse_into(atomf, atomh); + } else { + capture_fuse_into(atomh, atomf); + } + + /* Atoms are unlocked in capture_fuse_into. No locks held. 
*/ + return RETERR(-E_REPEAT); +} + +/* Perform the necessary work to prepare for fusing two atoms, which involves + * acquiring two atom locks in the proper order. If one of the node's atom is + * blocking fusion (i.e., it is in the CAPTURE_WAIT stage) and the handle's + * atom is not then the handle's request is put to sleep. If the node's atom + * is committing, then the node can be copy-on-captured. Otherwise, pick the + * atom with fewer pointers to be fused into the atom with more pointer and + * call capture_fuse_into. + */ +static int +capture_init_fusion(jnode * node, txn_handle * txnh, txn_capture mode, int can_coc) +{ + /* Have to perform two trylocks here. */ + if (likely(spin_trylock_atom(node->atom))) { + if (likely(spin_trylock_atom(txnh->atom))) + return capture_init_fusion_locked(node, txnh, mode, can_coc); + else { + UNLOCK_ATOM(node->atom); + } + } + + UNLOCK_JNODE(node); + UNLOCK_TXNH(txnh); + return RETERR(-E_REPEAT); +} +/* This function splices together two jnode lists (small and large) and sets all jnodes in + the small list to point to the large atom. Returns the length of the list. */ +static int +capture_fuse_jnode_lists(txn_atom * large, capture_list_head * large_head, capture_list_head * small_head) +{ + int count = 0; + jnode *node; + + assert("umka-218", large != NULL); + assert("umka-219", large_head != NULL); + assert("umka-220", small_head != NULL); + /* small atom should be locked also. */ + assert("zam-968", spin_atom_is_locked(large)); + + /* For every jnode on small's capture list... */ + for_all_type_safe_list(capture, small_head, node) { + count += 1; + + /* With the jnode lock held, update atom pointer. */ + UNDER_SPIN_VOID(jnode, node, node->atom = large); + } + + /* Splice the lists. */ + capture_list_splice(large_head, small_head); + + return count; +} + +/* This function splices together two txnh lists (small and large) and sets all txn handles in + the small list to point to the large atom. 
Returns the length of the list. */ +/* Audited by: umka (2002.06.13) */ +static int +capture_fuse_txnh_lists(txn_atom * large, txnh_list_head * large_head, txnh_list_head * small_head) +{ + int count = 0; + txn_handle *txnh; + + assert("umka-221", large != NULL); + assert("umka-222", large_head != NULL); + assert("umka-223", small_head != NULL); + + /* Adjust every txnh to the new atom. */ + for_all_type_safe_list(txnh, small_head, txnh) { + count += 1; + + /* With the txnh lock held, update atom pointer. */ + UNDER_SPIN_VOID(txnh, txnh, txnh->atom = large); + } + + /* Splice the txn_handle list. */ + txnh_list_splice(large_head, small_head); + + return count; +} + +/* This function fuses two atoms. The captured nodes and handles belonging to SMALL are + added to LARGE and their ->atom pointers are all updated. The associated counts are + updated as well, and any waiting handles belonging to either are awakened. Finally the + smaller atom's refcount is decremented. +*/ +static void +capture_fuse_into(txn_atom * small, txn_atom * large) +{ + int level; + unsigned zcount = 0; + unsigned tcount = 0; + protected_jnodes *prot_list; + + assert("umka-224", small != NULL); + assert("umka-225", large != NULL); + + assert("umka-299", spin_atom_is_locked(large)); + assert("umka-300", spin_atom_is_locked(small)); + + assert("jmacd-201", atom_isopen(small)); + assert("jmacd-202", atom_isopen(large)); + + /* Splice and update the per-level dirty jnode lists */ + for (level = 0; level < REAL_MAX_ZTREE_HEIGHT + 1; level += 1) { + zcount += capture_fuse_jnode_lists(large, ATOM_DIRTY_LIST(large, level), ATOM_DIRTY_LIST(small, level)); + } + + /* Splice and update the [clean,dirty] jnode and txnh lists */ + zcount += capture_fuse_jnode_lists(large, ATOM_CLEAN_LIST(large), ATOM_CLEAN_LIST(small)); + zcount += capture_fuse_jnode_lists(large, ATOM_OVRWR_LIST(large), ATOM_OVRWR_LIST(small)); + zcount += capture_fuse_jnode_lists(large, ATOM_WB_LIST(large), ATOM_WB_LIST(small)); + zcount 
+= capture_fuse_jnode_lists(large, &large->inodes, &small->inodes); + tcount += capture_fuse_txnh_lists(large, &large->txnh_list, &small->txnh_list); + + for_all_type_safe_list(prot, &small->protected, prot_list) { + jnode *node; + + for_all_type_safe_list(capture, &prot_list->nodes, node) { + zcount += 1; + + LOCK_JNODE(node); + assert("nikita-3375", node->atom == small); + /* With the jnode lock held, update atom pointer. */ + node->atom = large; + UNLOCK_JNODE(node); + } + } + /* Splice the lists of lists. */ + prot_list_splice(&large->protected, &small->protected); + + /* Check our accounting. */ + assert("jmacd-1063", zcount + small->num_queued == small->capture_count); + assert("jmacd-1065", tcount == small->txnh_count); + + /* sum numbers of waiters threads */ + large->nr_waiters += small->nr_waiters; + small->nr_waiters = 0; + + /* splice flush queues */ + fuse_fq(large, small); + + /* update counter of jnode on every atom' list */ + ON_DEBUG(large->dirty += small->dirty; + small->dirty = 0; + large->clean += small->clean; + small->clean = 0; + large->ovrwr += small->ovrwr; + small->ovrwr = 0; + large->wb += small->wb; + small->wb = 0; + large->fq += small->fq; + small->fq = 0; + large->protect += small->protect; + small->protect = 0; + ); + + /* count flushers in result atom */ + large->nr_flushers += small->nr_flushers; + small->nr_flushers = 0; + + /* update counts of flushed nodes */ + large->flushed += small->flushed; + small->flushed = 0; + + /* Transfer list counts to large. */ + large->txnh_count += small->txnh_count; + large->capture_count += small->capture_count; + + /* Add all txnh references to large. */ + atomic_add(small->txnh_count, &large->refcount); + atomic_sub(small->txnh_count, &small->refcount); + + /* Reset small counts */ + small->txnh_count = 0; + small->capture_count = 0; + + /* Assign the oldest start_time, merge flags. 
*/ + large->start_time = min(large->start_time, small->start_time); + large->flags |= small->flags; + + /* Merge blocknr sets. */ + blocknr_set_merge(&small->delete_set, &large->delete_set); + blocknr_set_merge(&small->wandered_map, &large->wandered_map); + + /* Merge allocated/deleted file counts */ + large->nr_objects_deleted += small->nr_objects_deleted; + large->nr_objects_created += small->nr_objects_created; + + small->nr_objects_deleted = 0; + small->nr_objects_created = 0; + + /* Merge allocated blocks counts */ + large->nr_blocks_allocated += small->nr_blocks_allocated; + + large->nr_running_queues += small->nr_running_queues; + small->nr_running_queues = 0; + + /* Merge blocks reserved for overwrite set. */ + large->flush_reserved += small->flush_reserved; + small->flush_reserved = 0; + + if (large->stage < small->stage) { + /* Large only needs to notify if it has changed state. */ + atom_set_stage(large, small->stage); + wakeup_atom_waiting_list(large); + } + + atom_set_stage(small, ASTAGE_INVALID); + + /* Notify any waiters--small needs to unload its wait lists. Waiters + actually remove themselves from the list before returning from the + fuse_wait function. */ + wakeup_atom_waiting_list(small); + + /* Unlock atoms */ + UNLOCK_ATOM(large); + atom_dec_and_unlock(small); +} + +reiser4_internal void +protected_jnodes_init(protected_jnodes *list) +{ + txn_atom *atom; + + assert("nikita-3376", list != NULL); + + atom = get_current_atom_locked(); + prot_list_push_front(&atom->protected, list); + capture_list_init(&list->nodes); + UNLOCK_ATOM(atom); +} + +reiser4_internal void +protected_jnodes_done(protected_jnodes *list) +{ + txn_atom *atom; + + assert("nikita-3379", capture_list_empty(&list->nodes)); + + atom = get_current_atom_locked(); + prot_list_remove(list); + UNLOCK_ATOM(atom); +} + +/* TXNMGR STUFF */ + +#if REISER4_COPY_ON_CAPTURE + +/* copy on capture steals jnode (J) from capture list. 
It may replace (J) with + special newly created jnode (CCJ) to which J's page gets attached. J in its + turn gets newly created copy of page. + Or, it may merely take J from capture list if J was never dirtied + + The problem with this replacement is that capture lists are being contiguously + scanned. + Race between replacement and scanning are avoided with one global spin lock + (scan_lock) and JNODE_SCANNED state of jnode. Replacement (in capture copy) + goes under scan_lock locked only if jnode is not in JNODE_SCANNED state. This + state gets set under scan_lock locked whenever scanning is working with that + jnode. +*/ + +/* remove jnode page from mapping's tree and insert new page with the same index */ +static void +replace_page_in_mapping(jnode *node, struct page *new_page) +{ + struct address_space *mapping; + unsigned long index; + + mapping = jnode_get_mapping(node); + index = jnode_get_index(node); + + spin_lock(&mapping->page_lock); + + /* delete old page from. This resembles __remove_from_page_cache */ + assert("vs-1416", radix_tree_lookup(&mapping->page_tree, index) == node->pg); + assert("vs-1428", node->pg->mapping == mapping); + __remove_from_page_cache(node->pg); + + /* insert new page into mapping */ + check_me("vs-1411", + radix_tree_insert(&mapping->page_tree, index, new_page) == 0); + + /* this resembles add_to_page_cache */ + page_cache_get(new_page); + ___add_to_page_cache(new_page, mapping, index); + + spin_unlock(&mapping->page_lock); + lru_cache_add(new_page); +} + +/* attach page of @node to @copy, @new_page to @node */ +static void +swap_jnode_pages(jnode *node, jnode *copy, struct page *new_page) +{ + /* attach old page to new jnode */ + assert("vs-1414", jnode_by_page(node->pg) == node); + copy->pg = node->pg; + copy->data = page_address(copy->pg); + jnode_set_block(copy, jnode_get_block(node)); + copy->pg->private = (unsigned long)copy; + + /* attach new page to jnode */ + assert("vs-1412", !PagePrivate(new_page)); + 
page_cache_get(new_page); + node->pg = new_page; + node->data = page_address(new_page); + new_page->private = (unsigned long)node; + SetPagePrivate(new_page); + + { + /* insert old page to new mapping */ + struct address_space *mapping; + unsigned long index; + + mapping = get_current_super_private()->cc->i_mapping; + index = (unsigned long)copy; + spin_lock(&mapping->page_lock); + + /* insert old page into new (fake) mapping. No page_cache_get + because page reference counter was not decreased on removing + it from old mapping */ + assert("vs-1416", radix_tree_lookup(&mapping->page_tree, index) == NULL); + check_me("vs-1418", radix_tree_insert(&mapping->page_tree, index, copy->pg) == 0); + ___add_to_page_cache(copy->pg, mapping, index); + + /* corresponding page_cache_release is in invalidate_list */ + page_cache_get(copy->pg); + spin_unlock(&mapping->page_lock); + } +} + +/* this is to make capture copied jnode looking like if there were jload called for it */ +static void +fake_jload(jnode *node) +{ + jref(node); + atomic_inc(&node->d_count); + JF_SET(node, JNODE_PARSED); +} + +/* for now - refuse to copy-on-capture any suspicious nodes (WRITEBACK, DIRTY, FLUSH_QUEUED) */ +static int +check_capturable(const jnode *node, const txn_atom *atom) +{ + assert("vs-1429", spin_jnode_is_locked(node)); + assert("vs-1487", check_spin_is_locked(&scan_lock)); + + if (JF_ISSET(node, JNODE_WRITEBACK)) { + reiser4_stat_inc(coc.writeback); + return RETERR(-E_WAIT); + } + if (JF_ISSET(node, JNODE_FLUSH_QUEUED)) { + reiser4_stat_inc(coc.flush_queued); + return RETERR(-E_WAIT); + } + if (JF_ISSET(node, JNODE_DIRTY)) { + reiser4_stat_inc(coc.dirty); + return RETERR(-E_WAIT); + } + if (JF_ISSET(node, JNODE_SCANNED)) { + reiser4_stat_inc(coc.scan_race); + return RETERR(-E_REPEAT); + } + if (node->atom != atom) { + reiser4_stat_inc(coc.atom_changed); + return RETERR(-E_WAIT); + } + return 0; /* OK */ +} + +static void +remove_from_capture_list(jnode *node) +{ + 
ON_DEBUG_MODIFY(znode_set_checksum(node, 1)); + JF_CLR(node, JNODE_DIRTY); + JF_CLR(node, JNODE_RELOC); + JF_CLR(node, JNODE_OVRWR); + JF_CLR(node, JNODE_CREATED); + JF_CLR(node, JNODE_WRITEBACK); + JF_CLR(node, JNODE_REPACK); + + capture_list_remove_clean(node); + node->atom->capture_count --; + atomic_dec(&node->x_count); + /*XXXX*/ON_DEBUG(count_jnode(node->atom, node, NODE_LIST(node), NOT_CAPTURED, 1)); + node->atom = 0; +} + +/* insert new jnode (copy) to capture list instead of old one */ +static void +replace_on_capture_list(jnode *node, jnode *copy) +{ + assert("vs-1415", node->atom); + assert("vs-1489", !capture_list_is_clean(node)); + assert("vs-1493", JF_ISSET(copy, JNODE_CC) && JF_ISSET(copy, JNODE_HEARD_BANSHEE)); + + copy->state |= node->state; + + /* insert cc-jnode @copy into capture list before old jnode @node */ + capture_list_insert_before(node, copy); + jref(copy); + copy->atom = node->atom; + node->atom->capture_count ++; + /*XXXX*/ON_DEBUG(count_jnode(node->atom, copy, NODE_LIST(copy), NODE_LIST(node), 1)); + + /* remove old jnode from capture list */ + remove_from_capture_list(node); +} + +/* when capture request is made for a node which is captured but was never + dirtied copy on capture will merely uncapture it */ +static int +copy_on_capture_clean(jnode *node, txn_atom *atom) +{ + int result; + + assert("vs-1625", spin_atom_is_locked(atom)); + assert("vs-1432", spin_jnode_is_locked(node)); + assert("vs-1627", !JF_ISSET(node, JNODE_WRITEBACK)); + + spin_lock(&scan_lock); + result = check_capturable(node, atom); + if (result == 0) { + /* remove jnode from capture list */ + remove_from_capture_list(node); + reiser4_stat_inc(coc.ok_clean); + } + spin_unlock(&scan_lock); + UNLOCK_JNODE(node); + UNLOCK_ATOM(atom); + + return result; +} + +static void +lock_two_nodes(jnode *node1, jnode *node2) +{ + if (node1 > node2) { + LOCK_JNODE(node2); + LOCK_JNODE(node1); + } else { + LOCK_JNODE(node1); + LOCK_JNODE(node2); + } +} + +/* capture request is 
made for node which does not have page. In most cases this + is "uber" znode */ +static int +copy_on_capture_nopage(jnode *node, txn_atom *atom) +{ + int result; + jnode *copy; + + assert("vs-1432", spin_atom_is_locked(atom)); + assert("vs-1432", spin_jnode_is_locked(node)); + + jref(node); + UNLOCK_JNODE(node); + UNLOCK_ATOM(atom); + assert("nikita-3475", schedulable()); + copy = jclone(node); + if (IS_ERR(copy)) { + jput(node); + return PTR_ERR(copy); + } + + LOCK_ATOM(atom); + lock_two_nodes(node, copy); + spin_lock(&scan_lock); + + result = check_capturable(node, atom); + if (result == 0) { + if (jnode_page(node) == NULL) { + replace_on_capture_list(node, copy); +#if REISER4_STATS + if (znode_above_root(JZNODE(node))) + reiser4_stat_inc(coc.ok_uber); + else + reiser4_stat_inc(coc.ok_nopage); +#endif + } else + result = RETERR(-E_REPEAT); + } + + spin_unlock(&scan_lock); + UNLOCK_JNODE(node); + UNLOCK_JNODE(copy); + UNLOCK_ATOM(atom); + assert("nikita-3476", schedulable()); + jput(copy); + assert("nikita-3477", schedulable()); + jput(node); + assert("nikita-3478", schedulable()); + ON_TRACE(TRACE_CAPTURE_COPY, "nopage\n"); + return result; +} + +static int +handle_coc(jnode *node, jnode *copy, struct page *page, struct page *new_page, + txn_atom *atom) +{ + char *to; + char *from; + int result; + + to = kmap(new_page); + lock_page(page); + from = kmap(page); + /* + * FIXME(zam): one preloaded radix tree node may be not enough for two + * insertions, one insertion is in replace_page_in_mapping(), another + * one is in swap_jnode_pages(). The radix_tree_delete() call might + * not help, because an empty radix tree node is freed and the node's + * free space may not be re-used in insertion. 
+ */ + radix_tree_preload(GFP_KERNEL); + LOCK_ATOM(atom); + lock_two_nodes(node, copy); + spin_lock(&scan_lock); + + result = check_capturable(node, atom); + if (result == 0) { + /* if node was jloaded by get_overwrite_set, we have to jrelse + it here, because we remove jnode from atom's capture list - + put_overwrite_set will not jrelse it */ + int was_jloaded; + + was_jloaded = JF_ISSET(node, JNODE_JLOADED_BY_GET_OVERWRITE_SET); + + replace_page_in_mapping(node, new_page); + swap_jnode_pages(node, copy, new_page); + replace_on_capture_list(node, copy); + /* statistics */ + if (JF_ISSET(copy, JNODE_RELOC)) { + reiser4_stat_inc(coc.ok_reloc); + } else if (JF_ISSET(copy, JNODE_OVRWR)) { + reiser4_stat_inc(coc.ok_ovrwr); + } else + impossible("", ""); + + memcpy(to, from, PAGE_CACHE_SIZE); + SetPageUptodate(new_page); + if (was_jloaded) + fake_jload(copy); + else + kunmap(page); + + assert("vs-1419", page_count(new_page) >= 3); + spin_unlock(&scan_lock); + UNLOCK_JNODE(node); + UNLOCK_JNODE(copy); + UNLOCK_ATOM(atom); + radix_tree_preload_end(); + unlock_page(page); + + if (was_jloaded) { + jrelse_tail(node); + assert("vs-1494", JF_ISSET(node, JNODE_JLOADED_BY_GET_OVERWRITE_SET)); + JF_CLR(node, JNODE_JLOADED_BY_GET_OVERWRITE_SET); + } else + kunmap(new_page); + + jput(copy); + jrelse(node); + jput(node); + page_cache_release(page); + page_cache_release(new_page); + ON_TRACE(TRACE_CAPTURE_COPY, "copy on capture done\n"); + } else { + spin_unlock(&scan_lock); + UNLOCK_JNODE(node); + UNLOCK_JNODE(copy); + UNLOCK_ATOM(atom); + radix_tree_preload_end(); + kunmap(page); + unlock_page(page); + kunmap(new_page); + page_cache_release(new_page); + } + return result; +} + +static int +real_copy_on_capture(jnode *node, txn_atom *atom) +{ + int result; + jnode *copy; + struct page *page; + struct page *new_page; + + assert("vs-1432", spin_jnode_is_locked(node)); + assert("vs-1490", !JF_ISSET(node, JNODE_EFLUSH)); + assert("vs-1491", node->pg); + assert("vs-1492", 
jprivate(node->pg) == node); + + page = node->pg; + page_cache_get(page); + jref(node); + UNLOCK_JNODE(node); + UNLOCK_ATOM(atom); + + /* prevent node from eflushing */ + result = jload(node); + if (!result) { + copy = jclone(node); + if (likely(!IS_ERR(copy))) { + new_page = alloc_page(GFP_KERNEL); + if (new_page) { + result = handle_coc(node, + copy, page, new_page, atom); + if (result == 0) + return 0; + } else + result = RETERR(-ENOMEM); + jput(copy); + } + jrelse(node); + } + + jput(node); + page_cache_release(page); + return result; +} + +/* create new jnode, create new page, jload old jnode, copy data, detach old + page from old jnode, attach new page to old jnode, attach old page to new + jnode this returns 0 if copy on capture succeeded, E_REPEAT to have + capture_fuse_wait to be called */ +static int +create_copy_and_replace(jnode *node, txn_atom *atom) +{ + int result; + struct inode *inode; /* inode for which filemap_nopage is blocked */ + + assert("jmacd-321", spin_jnode_is_locked(node)); + assert("umka-295", spin_atom_is_locked(atom)); + assert("vs-1381", node->atom == atom); + assert("vs-1409", atom->stage > ASTAGE_CAPTURE_WAIT && atom->stage < ASTAGE_DONE); + assert("vs-1410", jnode_is_znode(node) || jnode_is_unformatted(node)); + + + if (JF_ISSET(node, JNODE_CCED)) { + /* node is under copy on capture already */ + reiser4_stat_inc(coc.coc_race); + UNLOCK_JNODE(node); + UNLOCK_ATOM(atom); + return RETERR(-E_WAIT); + } + + /* measure how often suspicious (WRITEBACK, DIRTY, FLUSH_QUEUED) appear + here. 
For most often case we can return EAGAIN right here and avoid + all the preparations made for copy on capture */ + ON_TRACE(TRACE_CAPTURE_COPY, "copy_on_capture: node %p, atom %p..", node, atom); + if (JF_ISSET(node, JNODE_EFLUSH)) { + UNLOCK_JNODE(node); + UNLOCK_ATOM(atom); + + reiser4_stat_inc(coc.eflush); + ON_TRACE(TRACE_CAPTURE_COPY, "eflushed\n"); + result = jload(node); + if (result) + return RETERR(result); + jrelse(node); + return RETERR(-E_REPEAT); + } + + set_cced_bit(node); + + if (jnode_is_unformatted(node)) { + /* to capture_copy unformatted node we have to take care of its + page mappings. Page gets unmapped here and concurrent + mappings are blocked on reiser4 inodes's coc_sem in reiser4's + filemap_nopage */ + struct page *page; + + inode = mapping_jnode(node)->host; + page = jnode_page(node); + assert("vs-1640", inode != NULL); + assert("vs-1641", page != NULL); + assert("vs-1642", page->mapping != NULL); + UNLOCK_JNODE(node); + UNLOCK_ATOM(atom); + + down_write(&reiser4_inode_data(inode)->coc_sem); + lock_page(page); + pte_chain_lock(page); + + if (page_mapped(page)) { + result = try_to_unmap(page); + if (result == SWAP_AGAIN) { + result = RETERR(-E_REPEAT); + + } else if (result == SWAP_FAIL) + result = RETERR(-E_WAIT); + else { + assert("vs-1643", result == SWAP_SUCCESS); + result = 0; + } + if (result != 0) { + unlock_page(page); + pte_chain_unlock(page); + up_write(&reiser4_inode_data(inode)->coc_sem); + return result; + } + } + pte_chain_unlock(page); + unlock_page(page); + LOCK_ATOM(atom); + LOCK_JNODE(node); + } else + inode = NULL; + + if (!JF_ISSET(node, JNODE_OVRWR) && !JF_ISSET(node, JNODE_RELOC)) { + /* clean node can be made available for capturing. 
Just take + care to preserve atom list during uncapturing */ + ON_TRACE(TRACE_CAPTURE_COPY, "clean\n"); + result = copy_on_capture_clean(node, atom); + } else if (!node->pg) { + ON_TRACE(TRACE_CAPTURE_COPY, "uber\n"); + result = copy_on_capture_nopage(node, atom); + } else + result = real_copy_on_capture(node, atom); + if (result != 0) + clear_cced_bits(node); + assert("vs-1626", spin_atom_is_not_locked(atom)); + + if (inode != NULL) + up_write(&reiser4_inode_data(inode)->coc_sem); + + return result; +} +#endif /* REISER4_COPY_ON_CAPTURE */ + +/* Perform copy-on-capture of a block. */ +static int +capture_copy(jnode * node, txn_handle * txnh, txn_atom * atomf, txn_atom * atomh, txn_capture mode, int can_coc) +{ +#if REISER4_COPY_ON_CAPTURE + reiser4_stat_inc(coc.calls); + + /* do not copy on capture in ent thread to avoid deadlock on coc semaphore */ + if (can_coc && get_current_context()->entd == 0) { + int result; + + ON_TRACE(TRACE_TXN, "capture_copy\n"); + + /* The txnh and its (possibly NULL) atom's locks are not needed + at this point. */ + UNLOCK_TXNH(txnh); + if (atomh != NULL) + UNLOCK_ATOM(atomh); + + /* create a copy of node, detach node from atom and attach its copy + instead */ + atomic_inc(&atomf->refcount); + result = create_copy_and_replace(node, atomf); + assert("nikita-3474", schedulable()); + preempt_point(); + LOCK_ATOM(atomf); + atom_dec_and_unlock(atomf); + preempt_point(); + + if (result == 0) { + if (jnode_is_znode(node)) { + znode *z; + + z = JZNODE(node); + z->version = znode_build_version(jnode_get_tree(node)); + } + result = RETERR(-E_REPEAT); + } + return result; + } + + reiser4_stat_inc(coc.forbidden); + return capture_fuse_wait(node, txnh, atomf, atomh, mode); +#else + return capture_fuse_wait(node, txnh, atomf, atomh, mode); + +#endif +} + +/* Release a block from the atom, reversing the effects of being captured, + do not release atom's reference to jnode due to holding spin-locks. 
+ Currently this is only called when the atom commits. + + NOTE: this function does not release a (journal) reference to jnode + due to locking optimizations, you should call jput() somewhere after + calling uncapture_block(). */ +reiser4_internal void uncapture_block(jnode * node) +{ + txn_atom * atom; + + assert("umka-226", node != NULL); + atom = node->atom; + assert("umka-228", atom != NULL); + + assert("jmacd-1021", node->atom == atom); + assert("jmacd-1022", spin_jnode_is_locked(node)); +#if REISER4_COPY_ON_CAPTURE + assert("jmacd-1023", spin_atom_is_locked(atom)); +#else + assert("jmacd-1023", atom_is_protected(atom)); +#endif + + JF_CLR(node, JNODE_DIRTY); + JF_CLR(node, JNODE_RELOC); + JF_CLR(node, JNODE_OVRWR); + JF_CLR(node, JNODE_CREATED); + JF_CLR(node, JNODE_WRITEBACK); + JF_CLR(node, JNODE_REPACK); + clear_cced_bits(node); +#if REISER4_DEBUG + node->written = 0; +#endif + + capture_list_remove_clean(node); + if (JF_ISSET(node, JNODE_FLUSH_QUEUED)) { + assert("zam-925", atom_isopen(atom)); + assert("vs-1623", NODE_LIST(node) == FQ_LIST); + ON_DEBUG(atom->num_queued --); + JF_CLR(node, JNODE_FLUSH_QUEUED); + } + atom->capture_count -= 1; + ON_DEBUG(count_jnode(atom, node, NODE_LIST(node), NOT_CAPTURED, 1)); + node->atom = NULL; + + UNLOCK_JNODE(node); + LOCK_CNT_DEC(t_refs); +} + +/* Unconditional insert of jnode into atom's overwrite list. Currently used in + bitmap-based allocator code for adding modified bitmap blocks the + transaction. 
@atom and @node are spin locked */ +reiser4_internal void +insert_into_atom_ovrwr_list(txn_atom * atom, jnode * node) +{ + assert("zam-538", spin_atom_is_locked(atom) || atom->stage >= ASTAGE_PRE_COMMIT); + assert("zam-539", spin_jnode_is_locked(node)); + assert("zam-899", JF_ISSET(node, JNODE_OVRWR)); + assert("zam-543", node->atom == NULL); + assert("vs-1433", !jnode_is_unformatted(node) && !jnode_is_znode(node)); + + capture_list_push_front(ATOM_OVRWR_LIST(atom), node); + jref(node); + node->atom = atom; + atom->capture_count++; + ON_DEBUG(count_jnode(atom, node, NODE_LIST(node), OVRWR_LIST, 1)); +} + +/* when atom becomes that big, commit it as soon as possible. This was found + * to be most effective by testing. */ +reiser4_internal unsigned int +txnmgr_get_max_atom_size(struct super_block *super UNUSED_ARG) +{ + return totalram_pages / 4; +} + +#if REISER4_DEBUG + +reiser4_internal void +info_atom(const char *prefix, const txn_atom * atom) +{ + if (atom == NULL) { + printk("%s: no atom\n", prefix); + return; + } + + printk("%s: refcount: %i id: %i flags: %x txnh_count: %i" + " capture_count: %i stage: %x start: %lu, flushed: %i\n", prefix, + atomic_read(&atom->refcount), atom->atom_id, atom->flags, atom->txnh_count, + atom->capture_count, atom->stage, atom->start_time, atom->flushed); +} + +#endif + +static int count_deleted_blocks_actor ( + txn_atom *atom, const reiser4_block_nr * a, const reiser4_block_nr *b, void * data) +{ + reiser4_block_nr *counter = data; + + assert ("zam-995", data != NULL); + assert ("zam-996", a != NULL); + if (b == NULL) + *counter += 1; + else + *counter += *b; + return 0; +} +reiser4_internal reiser4_block_nr txnmgr_count_deleted_blocks (void) +{ + reiser4_block_nr result; + txn_mgr *tmgr = &get_super_private(reiser4_get_current_sb())->tmgr; + txn_atom * atom; + + result = 0; + + spin_lock_txnmgr(tmgr); + for_all_type_safe_list(atom, &tmgr->atoms_list, atom) { + LOCK_ATOM(atom); + blocknr_set_iterator(atom, &atom->delete_set, + 
count_deleted_blocks_actor, &result, 0); + UNLOCK_ATOM(atom); + } + spin_unlock_txnmgr(tmgr); + + return result; +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 80 + End: +*/ diff -puN /dev/null fs/reiser4/txnmgr.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/txnmgr.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,646 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* data-types and function declarations for transaction manager. See txnmgr.c + * for details. */ + +#ifndef __REISER4_TXNMGR_H__ +#define __REISER4_TXNMGR_H__ + +#include "forward.h" +#include "spin_macros.h" +#include "dformat.h" +#include "type_safe_list.h" + +#include +#include +#include +#include +#include +#include + +/* LIST TYPES */ + +/* list of all atoms controlled by single transaction manager (that is, file + * system) */ +TYPE_SAFE_LIST_DECLARE(atom); +/* list of transaction handles attached to given atom */ +TYPE_SAFE_LIST_DECLARE(txnh); + +/* + * ->fwaitfor and ->fwaiting lists. + * + * Each atom has one of these lists: one for its own handles waiting on + * another atom and one for reverse mapping. Used to prevent deadlock in the + * ASTAGE_CAPTURE_WAIT state. + * + * Thread that needs to wait for a given atom, attaches itself to the atom's + * ->fwaitfor list. This is done in atom_wait_event() (and, in + * capture_fuse_wait()). All threads waiting on this list are waked up + * whenever "event" occurs for this atom: it changes stage, commits, flush + * queue is released, etc. This is used, in particular, to implement sync(), + * where thread has to wait until atom commits. + */ +TYPE_SAFE_LIST_DECLARE(fwaitfor); + +/* + * This list is used to wait for atom fusion (in capture_fuse_wait()). Threads + * waiting on this list are waked up if atom commits or is fused into another. 
+ * + * This is used in capture_fuse_wait() which see for more comments. + */ +TYPE_SAFE_LIST_DECLARE(fwaiting); + +/* The transaction's list of captured jnodes */ +TYPE_SAFE_LIST_DECLARE(capture); +#if REISER4_DEBUG +TYPE_SAFE_LIST_DECLARE(inode_jnodes); +#endif + +TYPE_SAFE_LIST_DECLARE(blocknr_set); /* Used for the transaction's delete set + * and wandered mapping. */ + +/* list of flush queues attached to a given atom */ +TYPE_SAFE_LIST_DECLARE(fq); + +/* list of lists of jnodes that threads take into exclusive ownership during + * allocate-on-flush.*/ +TYPE_SAFE_LIST_DECLARE(prot); + +/* TYPE DECLARATIONS */ + +/* This enumeration describes the possible types of a capture request (try_capture). + A capture request dynamically assigns a block to the calling thread's transaction + handle. */ +typedef enum { + /* A READ_ATOMIC request indicates that a block will be read and that the caller's + atom should fuse in order to ensure that the block commits atomically with the + caller. */ + TXN_CAPTURE_READ_ATOMIC = (1 << 0), + + /* A READ_NONCOM request indicates that a block will be read and that the caller is + willing to read a non-committed block without causing atoms to fuse. */ + TXN_CAPTURE_READ_NONCOM = (1 << 1), + + /* A READ_MODIFY request indicates that a block will be read but that the caller + wishes for the block to be captured as it will be written. This capture request + mode is not currently used, but eventually it will be useful for preventing + deadlock in read-modify-write cycles. */ + TXN_CAPTURE_READ_MODIFY = (1 << 2), + + /* A WRITE capture request indicates that a block will be modified and that atoms + should fuse to make the commit atomic. */ + TXN_CAPTURE_WRITE = (1 << 3), + + /* CAPTURE_TYPES is a mask of the four above capture types, used to separate the + exclusive type designation from extra bits that may be supplied -- see + below. 
*/ + TXN_CAPTURE_TYPES = (TXN_CAPTURE_READ_ATOMIC | + TXN_CAPTURE_READ_NONCOM | TXN_CAPTURE_READ_MODIFY | TXN_CAPTURE_WRITE), + + /* A subset of CAPTURE_TYPES, CAPTURE_WTYPES is a mask of request types that + indicate modification will occur. */ + TXN_CAPTURE_WTYPES = (TXN_CAPTURE_READ_MODIFY | TXN_CAPTURE_WRITE), + + /* An option to try_capture, NONBLOCKING indicates that the caller would + prefer not to sleep waiting for an aging atom to commit. */ + TXN_CAPTURE_NONBLOCKING = (1 << 4), + + /* An option to try_capture to prevent atom fusion, just simple capturing is allowed */ + TXN_CAPTURE_DONT_FUSE = (1 << 5), + + /* if it is set - copy on capture is allowed */ + /*TXN_CAPTURE_CAN_COC = (1 << 6)*/ + + /* This macro selects only the exclusive capture request types, stripping out any + options that were supplied (e.g., NONBLOCKING). */ +#define CAPTURE_TYPE(x) ((x) & TXN_CAPTURE_TYPES) +} txn_capture; + +/* There are two kinds of transaction handle: WRITE_FUSING and READ_FUSING, the only + difference is in the handling of read requests. A WRITE_FUSING transaction handle + defaults read capture requests to TXN_CAPTURE_READ_NONCOM whereas a READ_FUSING + transaction handle defaults to TXN_CAPTURE_READ_ATOMIC. */ +typedef enum { + TXN_WRITE_FUSING = (1 << 0), + TXN_READ_FUSING = (1 << 1) | TXN_WRITE_FUSING, /* READ implies WRITE */ +} txn_mode; + +/* Every atom has a stage, which is one of these exclusive values: */ +typedef enum { + /* Initially an atom is free. */ + ASTAGE_FREE = 0, + + /* An atom begins by entering the CAPTURE_FUSE stage, where it proceeds to capture + blocks and fuse with other atoms. */ + ASTAGE_CAPTURE_FUSE = 1, + + /* We need to have an ASTAGE_CAPTURE_SLOW in which an atom fuses with one node for every X nodes it flushes to disk where X > 1. */ + + /* When an atom reaches a certain age it must do all it can to commit. An atom in + the CAPTURE_WAIT stage refuses new transaction handles and prevents fusion from + atoms in the CAPTURE_FUSE stage. 
*/ + ASTAGE_CAPTURE_WAIT = 2, + + /* Waiting for I/O before commit. Copy-on-capture (see + http://namesys.com/v4/v4.html). */ + ASTAGE_PRE_COMMIT = 3, + + /* Post-commit overwrite I/O. Steal-on-capture. */ + ASTAGE_POST_COMMIT = 4, + + /* An atom waiting for its last reference to be dropped, after which + * it is deleted from memory */ + ASTAGE_DONE = 5, + + /* invalid atom. */ + ASTAGE_INVALID = 6, + +} txn_stage; + +/* Certain flags may be set in the txn_atom->flags field. */ +typedef enum { + /* Indicates that the atom should commit as soon as possible. */ + ATOM_FORCE_COMMIT = (1 << 0) +} txn_flags; + +/* Flags for controlling commit_txnh */ +typedef enum { + /* Wait for atom commit completion in commit_txnh */ + TXNH_WAIT_COMMIT = 0x2, + /* Don't commit atom when this handle is closed */ + TXNH_DONT_COMMIT = 0x4 +} txn_handle_flags_t; + +/* TYPE DEFINITIONS */ + +/* A note on lock ordering: the handle and jnode spinlocks protect reading of their ->atom + fields, so typically an operation on the atom through either of these objects must (1) + lock the object, (2) read the atom pointer, (3) lock the atom. + + During atom fusion, the process holds locks on both atoms at once. Then, it iterates + through the list of handles and pages held by the smaller of the two atoms. For each + handle and page referencing the smaller atom, the fusing process must: (1) lock the + object, and (2) update the atom pointer. + + You can see that there is a conflict of lock ordering here, so the more-complex + procedure should have priority, i.e., the fusing process has priority so that it is + guaranteed to make progress and to avoid restarts. + + This decision, however, means additional complexity for acquiring the atom lock in the + first place. + + The general original procedure followed in the code was: + + TXN_OBJECT *obj = ...; + TXN_ATOM *atom; + + spin_lock (& obj->_lock); + + atom = obj->_atom; + + if (! 
spin_trylock_atom (atom)) + { + spin_unlock (& obj->_lock); + RESTART OPERATION, THERE WAS A RACE; + } + + ELSE YOU HAVE BOTH ATOM AND OBJ LOCKED + + + It has however been found that this wastes CPU a lot in a manner that is + hard to profile. So, proper refcounting was added to atoms, and new + standard locking sequence is like following: + + TXN_OBJECT *obj = ...; + TXN_ATOM *atom; + + spin_lock (& obj->_lock); + + atom = obj->_atom; + + if (! spin_trylock_atom (atom)) + { + atomic_inc (& atom->refcount); + spin_unlock (& obj->_lock); + spin_lock (&atom->_lock); + atomic_dec (& atom->refcount); + // HERE atom is locked + spin_unlock (&atom->_lock); + RESTART OPERATION, THERE WAS A RACE; + } + + ELSE YOU HAVE BOTH ATOM AND OBJ LOCKED + + (core of this is implemented in trylock_throttle() function) + + See the jnode_get_atom() function for a common case. + + As an additional (and important) optimization allowing to avoid restarts, + it is possible to re-check required pre-conditions at the HERE point in + code above and proceed without restarting if they are still satisfied. +*/ + +/* A block number set consists of only the list head. */ +struct blocknr_set { + blocknr_set_list_head entries; /* blocknr_set_list_head defined from a template from tslist.h */ +}; + +/* An atomic transaction: this is the underlying system representation + of a transaction, not the one seen by clients. + + Invariants involving this data-type: + + [sb-fake-allocated] +*/ +struct txn_atom { + /* The spinlock protecting the atom, held during fusion and various other state + changes. */ + reiser4_spin_data alock; + + /* The atom's reference counter, increasing (in case of a duplication + of an existing reference or when we are sure that some other + reference exists) may be done without taking spinlock, decrementing + of the ref. counter requires a spinlock to be held. + + Each transaction handle counts in ->refcount. 
All jnodes count as + one reference acquired in atom_begin_andlock(), released in + commit_current_atom(). + */ + atomic_t refcount; + + /* The atom_id identifies the atom in persistent records such as the log. */ + __u32 atom_id; + + /* Flags holding any of the txn_flags enumerated values (e.g., + ATOM_FORCE_COMMIT). */ + __u32 flags; + + /* Number of open handles. */ + __u32 txnh_count; + + /* The number of znodes captured by this atom. Equal to the sum of lengths of the + dirty_nodes[level] and clean_nodes lists. */ + __u32 capture_count; + +#if REISER4_DEBUG + int clean; + int dirty; + int ovrwr; + int wb; + int fq; + int protect; +#endif + + __u32 flushed; + + /* Current transaction stage. */ + txn_stage stage; + + /* Start time. */ + unsigned long start_time; + + /* The atom's delete set. It collects block numbers of the nodes + which were deleted during the transaction. */ + blocknr_set delete_set; + + /* The atom's wandered_block mapping. */ + blocknr_set wandered_map; + + /* The transaction's list of dirty captured nodes--per level. Index + by (level). dirty_nodes[0] is for znode-above-root */ + capture_list_head dirty_nodes1[REAL_MAX_ZTREE_HEIGHT + 1]; + + /* The transaction's list of clean captured nodes. */ + capture_list_head clean_nodes1; + + /* The atom's overwrite set */ + capture_list_head ovrwr_nodes1; + + /* nodes which are being written to disk */ + capture_list_head writeback_nodes1; + + /* list of inodes */ + capture_list_head inodes; + + /* List of handles associated with this atom. */ + txnh_list_head txnh_list; + + /* Transaction list link: list of atoms in the transaction manager. */ + atom_list_link atom_link; + + /* List of handles waiting FOR this atom: see 'capture_fuse_wait' comment. */ + fwaitfor_list_head fwaitfor_list; + + /* List of this atom's handles that are waiting: see 'capture_fuse_wait' comment. 
*/ + fwaiting_list_head fwaiting_list; + + prot_list_head protected; + + /* Numbers of objects which were deleted/created in this transaction + thereby numbers of objects IDs which were released/deallocated. */ + int nr_objects_deleted; + int nr_objects_created; + /* number of blocks allocated during the transaction */ + __u64 nr_blocks_allocated; + /* All atom's flush queue objects are on this list */ + fq_list_head flush_queues; +#if REISER4_DEBUG + /* number of flush queues for this atom. */ + int nr_flush_queues; + /* Number of jnodes which were removed from atom's lists and put + on flush_queue */ + int num_queued; +#endif + /* number of threads who wait for this atom to complete commit */ + int nr_waiters; + /* number of threads which do jnode_flush() over this atom */ + int nr_flushers; + /* number of flush queues which are IN_USE and jnodes from fq->prepped + are submitted to disk by the write_fq() routine. */ + int nr_running_queues; + /* A counter of grabbed unformatted nodes, see a description of the + * reiser4 space reservation scheme at block_alloc.c */ + reiser4_block_nr flush_reserved; +#if REISER4_DEBUG + void *committer; +#endif +}; + +#define ATOM_DIRTY_LIST(atom, level) (&(atom)->dirty_nodes1[level]) +#define ATOM_CLEAN_LIST(atom) (&(atom)->clean_nodes1) +#define ATOM_OVRWR_LIST(atom) (&(atom)->ovrwr_nodes1) +#define ATOM_WB_LIST(atom) (&(atom)->writeback_nodes1) +#define ATOM_FQ_LIST(fq) (&(fq)->prepped1) + +#define NODE_LIST(node) (node)->list1 +#define ASSIGN_NODE_LIST(node, list) ON_DEBUG(NODE_LIST(node) = list) +ON_DEBUG(void count_jnode(txn_atom *, jnode *, atom_list old_list, atom_list new_list, int check_lists)); + +typedef struct protected_jnodes { + prot_list_link inatom; + capture_list_head nodes; +} protected_jnodes; + +TYPE_SAFE_LIST_DEFINE(prot, protected_jnodes, inatom); + +TYPE_SAFE_LIST_DEFINE(atom, txn_atom, atom_link); + +/* A transaction handle: the client obtains and commits this handle which is assigned by + the system to a 
txn_atom. */ +struct txn_handle { + /* Spinlock protecting ->atom pointer */ + reiser4_spin_data hlock; + + /* Flags for controlling commit_txnh() behavior */ + /* from txn_handle_flags_t */ + txn_handle_flags_t flags; + + /* Whether it is READ_FUSING or WRITE_FUSING. */ + txn_mode mode; + + /* If assigned, the atom it is part of. */ + txn_atom *atom; + + /* Transaction list link. */ + txnh_list_link txnh_link; +}; + +TYPE_SAFE_LIST_DECLARE(txn_mgrs); + +/* The transaction manager: one is contained in the reiser4_super_info_data */ +struct txn_mgr { + /* A spinlock protecting the atom list, id_count, flush_control */ + reiser4_spin_data tmgr_lock; + + /* List of atoms. */ + atom_list_head atoms_list; + + /* Number of atoms. */ + int atom_count; + + /* A counter used to assign atom->atom_id values. */ + __u32 id_count; + + /* a semaphore object for commit serialization */ + struct semaphore commit_semaphore; + + /* a list of all txnmgrs served by a particular daemon. */ + txn_mgrs_list_link linkage; + + /* description of daemon for this txnmgr */ + ktxnmgrd_context *daemon; + + /* parameters. Adjustable through mount options. */ + unsigned int atom_max_size; + unsigned int atom_max_age; + /* max number of concurrent flushers for one atom, 0 - unlimited. */ + unsigned int atom_max_flushers; +}; + +/* list of all transaction managers in a system */ +TYPE_SAFE_LIST_DEFINE(txn_mgrs, txn_mgr, linkage); + +/* FUNCTION DECLARATIONS */ + +/* These are the externally (within Reiser4) visible transaction functions, therefore they + are prefixed with "txn_". For comments, see txnmgr.c. 
*/ + +extern int txnmgr_init_static(void); +extern void txnmgr_init(txn_mgr * mgr); + +extern int txnmgr_done_static(void); +extern int txnmgr_done(txn_mgr * mgr); + +extern int txn_reserve(int reserved); + +extern void txn_begin(reiser4_context * context); +extern long txn_end(reiser4_context * context); + +extern void txn_restart(reiser4_context * context); +extern void txn_restart_current(void); + +extern int txnmgr_force_commit_all(struct super_block *, int); +extern int current_atom_should_commit(void); + +extern jnode * find_first_dirty_jnode (txn_atom *, int); + +extern int commit_some_atoms(txn_mgr *); +extern int flush_current_atom (int, long *, txn_atom **); + +extern int flush_some_atom(long *, const struct writeback_control *, int); + +extern void atom_set_stage(txn_atom *atom, txn_stage stage); + +extern int same_slum_check(jnode * base, jnode * check, int alloc_check, int alloc_value); +extern void atom_dec_and_unlock(txn_atom * atom); + +extern int try_capture(jnode * node, znode_lock_mode mode, txn_capture flags, int can_coc); +extern int try_capture_page_to_invalidate(struct page *pg); + +extern void uncapture_page(struct page *pg); +extern void uncapture_block(jnode *); +extern void uncapture_jnode(jnode *); + +extern int capture_inode(struct inode *); +extern int uncapture_inode(struct inode *); + +extern txn_atom *get_current_atom_locked_nocheck(void); + +#define atom_is_protected(atom) (spin_atom_is_locked(atom) || (atom)->stage >= ASTAGE_PRE_COMMIT) + +/* Get the current atom and spinlock it if current atom present. 
May not return NULL */ +static inline txn_atom * +get_current_atom_locked(void) +{ + txn_atom *atom; + + atom = get_current_atom_locked_nocheck(); + assert("zam-761", atom != NULL); + + return atom; +} + +extern txn_atom *jnode_get_atom(jnode *); + +extern void atom_wait_event(txn_atom *); +extern void atom_send_event(txn_atom *); + +extern void insert_into_atom_ovrwr_list(txn_atom * atom, jnode * node); +extern int capture_super_block(struct super_block *s); + +/* See the comment on the function blocknrset.c:blocknr_set_add for the + calling convention of these three routines. */ +extern void blocknr_set_init(blocknr_set * bset); +extern void blocknr_set_destroy(blocknr_set * bset); +extern void blocknr_set_merge(blocknr_set * from, blocknr_set * into); +extern int blocknr_set_add_extent(txn_atom * atom, + blocknr_set * bset, + blocknr_set_entry ** new_bsep, + const reiser4_block_nr * start, const reiser4_block_nr * len); +extern int blocknr_set_add_pair(txn_atom * atom, + blocknr_set * bset, + blocknr_set_entry ** new_bsep, const reiser4_block_nr * a, const reiser4_block_nr * b); + +typedef int (*blocknr_set_actor_f) (txn_atom *, const reiser4_block_nr *, const reiser4_block_nr *, void *); + +extern int blocknr_set_iterator(txn_atom * atom, blocknr_set * bset, blocknr_set_actor_f actor, void *data, int delete); + +/* flush code takes care about how to fuse flush queues */ +extern void flush_init_atom(txn_atom * atom); +extern void flush_fuse_queues(txn_atom * large, txn_atom * small); + +/* INLINE FUNCTIONS */ + +#define spin_ordering_pred_atom(atom) \ + ( ( lock_counters() -> spin_locked_txnh == 0 ) && \ + ( lock_counters() -> spin_locked_jnode == 0 ) && \ + ( lock_counters() -> rw_locked_zlock == 0 ) && \ + ( lock_counters() -> rw_locked_dk == 0 ) && \ + ( lock_counters() -> rw_locked_tree == 0 ) ) + +#define spin_ordering_pred_txnh(txnh) \ + ( ( lock_counters() -> rw_locked_dk == 0 ) && \ + ( lock_counters() -> rw_locked_zlock == 0 ) && \ + ( lock_counters() 
-> rw_locked_tree == 0 ) ) + +#define spin_ordering_pred_txnmgr(tmgr) \ + ( ( lock_counters() -> spin_locked_atom == 0 ) && \ + ( lock_counters() -> spin_locked_txnh == 0 ) && \ + ( lock_counters() -> spin_locked_jnode == 0 ) && \ + ( lock_counters() -> rw_locked_zlock == 0 ) && \ + ( lock_counters() -> rw_locked_dk == 0 ) && \ + ( lock_counters() -> rw_locked_tree == 0 ) ) + +SPIN_LOCK_FUNCTIONS(atom, txn_atom, alock); +SPIN_LOCK_FUNCTIONS(txnh, txn_handle, hlock); +SPIN_LOCK_FUNCTIONS(txnmgr, txn_mgr, tmgr_lock); + +typedef enum { + FQ_IN_USE = 0x1 +} flush_queue_state_t; + +typedef struct flush_queue flush_queue_t; + +/* This is an accumulator for jnodes prepared for writing to disk. A flush queue + is filled by the jnode_flush() routine, and written to disk under memory + pressure or at atom commit time. */ +/* LOCKING: fq state and fq->atom are protected by guard spinlock, fq->nr_queued + field and fq->prepped list can be modified if atom is spin-locked and fq + object is "in-use" state. For read-only traversal of the fq->prepped list + and reading of the fq->nr_queued field it is enough to keep fq "in-use" or + only have atom spin-locked. */ +struct flush_queue { + /* linkage element is the first in this structure to make debugging + easier. See field in atom struct for description of list. */ + fq_list_link alink; + /* A spinlock to protect changes of fq state and fq->atom pointer */ + reiser4_spin_data guard; + /* flush_queue state: [in_use | ready] */ + flush_queue_state_t state; + /* A list which contains queued nodes, queued nodes are removed from any + * atom's list and put on this ->prepped one. 
*/ + capture_list_head prepped1; + /* number of submitted i/o requests */ + atomic_t nr_submitted; + /* number of i/o errors */ + atomic_t nr_errors; + /* An atom this flush queue is attached to */ + txn_atom *atom; + /* A semaphore for waiting on i/o completion */ + struct semaphore io_sem; +#if REISER4_DEBUG + /* A thread which took this fq in exclusive use, NULL if fq is free, + * used for debugging. */ + struct task_struct *owner; +#endif +}; + +extern int fq_by_atom(txn_atom *, flush_queue_t **); +extern int fq_by_jnode_gfp(jnode *, flush_queue_t **, int); +extern void fq_put_nolock(flush_queue_t *); +extern void fq_put(flush_queue_t *); +extern void fuse_fq(txn_atom * to, txn_atom * from); +extern void queue_jnode(flush_queue_t *, jnode *); +extern void mark_jnode_queued(flush_queue_t *, jnode *); + +extern int write_fq(flush_queue_t *, long *, int); +extern int current_atom_finish_all_fq(void); +extern void init_atom_fq_parts(txn_atom *); + +extern unsigned int txnmgr_get_max_atom_size(struct super_block *super); +extern reiser4_block_nr txnmgr_count_deleted_blocks (void); + +extern void znode_make_dirty(znode * node); +extern void jnode_make_dirty_locked(jnode * node); + +extern int sync_atom(txn_atom *atom); + +#if REISER4_DEBUG +extern int atom_fq_parts_are_clean (txn_atom *); +#endif + +extern void add_fq_to_bio(flush_queue_t *, struct bio *); +extern flush_queue_t *get_fq_for_current_atom(void); + +void protected_jnodes_init(protected_jnodes *list); +void protected_jnodes_done(protected_jnodes *list); +void invalidate_list(capture_list_head * head); + +#if REISER4_DEBUG +void info_atom(const char *prefix, const txn_atom * atom); +#else +#define info_atom(p,a) noop +#endif + +# endif /* __REISER4_TXNMGR_H__ */ + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/type_safe_hash.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/type_safe_hash.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,320 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* A hash table class that uses hash chains (singly-linked) and is + parametrized to provide type safety. */ + +#ifndef __REISER4_TYPE_SAFE_HASH_H__ +#define __REISER4_TYPE_SAFE_HASH_H__ + +#include "debug.h" + +#include +/* Step 1: Use TYPE_SAFE_HASH_DECLARE() to define the TABLE and LINK objects + based on the object type. You need to declare the item type before + this definition, define it after this definition. */ +#define TYPE_SAFE_HASH_DECLARE(PREFIX,ITEM_TYPE) \ + \ +typedef struct PREFIX##_hash_table_ PREFIX##_hash_table; \ +typedef struct PREFIX##_hash_link_ PREFIX##_hash_link; \ + \ +struct PREFIX##_hash_table_ \ +{ \ + ITEM_TYPE **_table; \ + __u32 _buckets; \ +}; \ + \ +struct PREFIX##_hash_link_ \ +{ \ + ITEM_TYPE *_next; \ +} + +/* Step 2: Define the object type of the hash: give it a field of type + PREFIX_hash_link. */ + +/* Step 3: Use TYPE_SAFE_HASH_DEFINE to define the hash table interface using + the type and field name used in step 2. The arguments are: + + ITEM_TYPE The item type being hashed + KEY_TYPE The type of key being hashed + KEY_NAME The name of the key field within the item + LINK_NAME The name of the link field within the item (which you must make type PREFIX_hash_link) + HASH_FUNC The name of the hash function (or macro, takes const pointer to key) + EQ_FUNC The name of the equality function (or macro, takes const pointers to two keys) + + It implements these functions: + + prefix_hash_init Initialize the table given its size. 
+ prefix_hash_insert Insert an item + prefix_hash_insert_index Insert an item w/ precomputed hash_index + prefix_hash_find Find an item by key + prefix_hash_find_index Find an item w/ precomputed hash_index + prefix_hash_remove Remove an item, returns 1 if found, 0 if not found + prefix_hash_remove_index Remove an item w/ precomputed hash_index + + If you'd like something to be done differently, feel free to ask me + for modifications. Additional features that could be added but + have not been: + + prefix_hash_remove_key Find and remove an item by key + prefix_hash_remove_key_index Find and remove an item by key w/ precomputed hash_index + + The hash_function currently receives only the key as an argument, + meaning it must somehow know the number of buckets. If this is a + problem let me know. + + This hash table uses a single-linked hash chain. This means + insertion is fast but deletion requires searching the chain. + + There is also the doubly-linked hash chain approach, under which + deletion requires no search but the code is longer and it takes two + pointers per item. + + The circularly-linked approach has the shortest code but requires + two pointers per bucket, doubling the size of the bucket array (in + addition to two pointers per item). 
+*/ +#define TYPE_SAFE_HASH_DEFINE(PREFIX,ITEM_TYPE,KEY_TYPE,KEY_NAME,LINK_NAME,HASH_FUNC,EQ_FUNC) \ + \ +static __inline__ void \ +PREFIX##_check_hash (PREFIX##_hash_table *table UNUSED_ARG, \ + __u32 hash UNUSED_ARG) \ +{ \ + assert("nikita-2780", hash < table->_buckets); \ +} \ + \ +static __inline__ int \ +PREFIX##_hash_init (PREFIX##_hash_table *hash, \ + __u32 buckets) \ +{ \ + hash->_table = (ITEM_TYPE**) KMALLOC (sizeof (ITEM_TYPE*) * buckets); \ + hash->_buckets = buckets; \ + if (hash->_table == NULL) \ + { \ + return RETERR(-ENOMEM); \ + } \ + memset (hash->_table, 0, sizeof (ITEM_TYPE*) * buckets); \ + ON_DEBUG(printk(#PREFIX "_hash_table: %i buckets\n", buckets)); \ + return 0; \ +} \ + \ +static __inline__ void \ +PREFIX##_hash_done (PREFIX##_hash_table *hash) \ +{ \ + if (REISER4_DEBUG && hash->_table != NULL) { \ + __u32 i; \ + for (i = 0 ; i < hash->_buckets ; ++ i) \ + assert("nikita-2905", hash->_table[i] == NULL); \ + } \ + if (hash->_table != NULL) \ + KFREE (hash->_table, sizeof (ITEM_TYPE*) * hash->_buckets); \ + hash->_table = NULL; \ +} \ + \ +static __inline__ void \ +PREFIX##_hash_prefetch_next (ITEM_TYPE *item) \ +{ \ + prefetch(item->LINK_NAME._next); \ +} \ + \ +static __inline__ void \ +PREFIX##_hash_prefetch_bucket (PREFIX##_hash_table *hash, \ + __u32 index) \ +{ \ + prefetch(hash->_table[index]); \ +} \ + \ +static __inline__ ITEM_TYPE* \ +PREFIX##_hash_find_index (PREFIX##_hash_table *hash, \ + __u32 hash_index, \ + KEY_TYPE const *find_key) \ +{ \ + ITEM_TYPE *item; \ + \ + PREFIX##_check_hash(hash, hash_index); \ + \ + for (item = hash->_table[hash_index]; \ + item != NULL; \ + item = item->LINK_NAME._next) \ + { \ + prefetch(item->LINK_NAME._next); \ + prefetch(item->LINK_NAME._next + offsetof(ITEM_TYPE, KEY_NAME)); \ + if (EQ_FUNC (& item->KEY_NAME, find_key)) \ + { \ + return item; \ + } \ + } \ + \ + return NULL; \ +} \ + \ +static __inline__ ITEM_TYPE* \ +PREFIX##_hash_find_index_lru (PREFIX##_hash_table *hash, \ + __u32 
hash_index, \ + KEY_TYPE const *find_key) \ +{ \ + ITEM_TYPE ** item = &hash->_table[hash_index]; \ + \ + PREFIX##_check_hash(hash, hash_index); \ + \ + while (*item != NULL) { \ + prefetch(&(*item)->LINK_NAME._next); \ + if (EQ_FUNC (&(*item)->KEY_NAME, find_key)) { \ + ITEM_TYPE *found; \ + \ + found = *item; \ + *item = found->LINK_NAME._next; \ + found->LINK_NAME._next = hash->_table[hash_index]; \ + hash->_table[hash_index] = found; \ + return found; \ + } \ + item = &(*item)->LINK_NAME._next; \ + } \ + return NULL; \ +} \ + \ +static __inline__ int \ +PREFIX##_hash_remove_index (PREFIX##_hash_table *hash, \ + __u32 hash_index, \ + ITEM_TYPE *del_item) \ +{ \ + ITEM_TYPE ** hash_item_p = &hash->_table[hash_index]; \ + \ + PREFIX##_check_hash(hash, hash_index); \ + \ + while (*hash_item_p != NULL) { \ + prefetch(&(*hash_item_p)->LINK_NAME._next); \ + if (*hash_item_p == del_item) { \ + *hash_item_p = (*hash_item_p)->LINK_NAME._next; \ + return 1; \ + } \ + hash_item_p = &(*hash_item_p)->LINK_NAME._next; \ + } \ + return 0; \ +} \ + \ +static __inline__ void \ +PREFIX##_hash_insert_index (PREFIX##_hash_table *hash, \ + __u32 hash_index, \ + ITEM_TYPE *ins_item) \ +{ \ + PREFIX##_check_hash(hash, hash_index); \ + \ + ins_item->LINK_NAME._next = hash->_table[hash_index]; \ + hash->_table[hash_index] = ins_item; \ +} \ + \ +static __inline__ void \ +PREFIX##_hash_insert_index_rcu (PREFIX##_hash_table *hash, \ + __u32 hash_index, \ + ITEM_TYPE *ins_item) \ +{ \ + PREFIX##_check_hash(hash, hash_index); \ + \ + ins_item->LINK_NAME._next = hash->_table[hash_index]; \ + smp_wmb(); \ + hash->_table[hash_index] = ins_item; \ +} \ + \ +static __inline__ ITEM_TYPE* \ +PREFIX##_hash_find (PREFIX##_hash_table *hash, \ + KEY_TYPE const *find_key) \ +{ \ + return PREFIX##_hash_find_index (hash, HASH_FUNC(hash, find_key), find_key); \ +} \ + \ +static __inline__ ITEM_TYPE* \ +PREFIX##_hash_find_lru (PREFIX##_hash_table *hash, \ + KEY_TYPE const *find_key) \ +{ \ + return 
PREFIX##_hash_find_index_lru (hash, HASH_FUNC(hash, find_key), find_key); \ +} \ + \ +static __inline__ int \ +PREFIX##_hash_remove (PREFIX##_hash_table *hash, \ + ITEM_TYPE *del_item) \ +{ \ + return PREFIX##_hash_remove_index (hash, \ + HASH_FUNC(hash, &del_item->KEY_NAME), del_item); \ +} \ + \ +static __inline__ int \ +PREFIX##_hash_remove_rcu (PREFIX##_hash_table *hash, \ + ITEM_TYPE *del_item) \ +{ \ + return PREFIX##_hash_remove (hash, del_item); \ +} \ + \ +static __inline__ void \ +PREFIX##_hash_insert (PREFIX##_hash_table *hash, \ + ITEM_TYPE *ins_item) \ +{ \ + return PREFIX##_hash_insert_index (hash, \ + HASH_FUNC(hash, &ins_item->KEY_NAME), ins_item); \ +} \ + \ +static __inline__ void \ +PREFIX##_hash_insert_rcu (PREFIX##_hash_table *hash, \ + ITEM_TYPE *ins_item) \ +{ \ + return PREFIX##_hash_insert_index_rcu (hash, HASH_FUNC(hash, &ins_item->KEY_NAME), \ + ins_item); \ +} \ + \ +static __inline__ ITEM_TYPE * \ +PREFIX##_hash_first (PREFIX##_hash_table *hash, __u32 ind) \ +{ \ + ITEM_TYPE *first; \ + \ + for (first = NULL; ind < hash->_buckets; ++ ind) { \ + first = hash->_table[ind]; \ + if (first != NULL) \ + break; \ + } \ + return first; \ +} \ + \ +static __inline__ ITEM_TYPE * \ +PREFIX##_hash_next (PREFIX##_hash_table *hash, \ + ITEM_TYPE *item) \ +{ \ + ITEM_TYPE *next; \ + \ + if (item == NULL) \ + return NULL; \ + next = item->LINK_NAME._next; \ + if (next == NULL) \ + next = PREFIX##_hash_first (hash, HASH_FUNC(hash, &item->KEY_NAME) + 1); \ + return next; \ +} \ + \ +typedef struct {} PREFIX##_hash_dummy + +#define for_all_ht_buckets(table, head) \ +for ((head) = &(table) -> _table[ 0 ] ; \ + (head) != &(table) -> _table[ (table) -> _buckets ] ; ++ (head)) + +#define for_all_in_bucket(bucket, item, next, field) \ +for ((item) = *(bucket), (next) = (item) ? (item) -> field._next : NULL ; \ + (item) != NULL ; \ + (item) = (next), (next) = (item) ? 
(item) -> field._next : NULL ) + +#define for_all_in_htable(table, prefix, item, next) \ +for ((item) = prefix ## _hash_first ((table), 0), \ + (next) = prefix ## _hash_next ((table), (item)) ; \ + (item) != NULL ; \ + (item) = (next), \ + (next) = prefix ## _hash_next ((table), (item))) + +/* __REISER4_TYPE_SAFE_HASH_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/type_safe_list.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/type_safe_list.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,436 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#ifndef __REISER4_TYPE_SAFE_LIST_H__ +#define __REISER4_TYPE_SAFE_LIST_H__ + +#include "debug.h" +/* A circular doubly linked list that differs from the previous + <linux/list.h> implementation because it is parametrized to provide + type safety. This data structure is also useful as a queue or stack. + + The "list template" consists of a set of types and methods for + implementing list operations. All of the types and methods + associated with a single list class are assigned unique names and + type signatures, thus allowing the compiler to verify correct + usage. + + The first parameter of a list class is the item type being stored + in the list. The list class maintains two pointers within each + item structure for its "next" and "prev" pointers. + + There are two structures associated with the list, in addition to + the item type itself. The "list link" contains the two pointers + that are embedded within the item itself. The "list head" also + contains two pointers which refer to the first item ("front") and + last item ("back") of the list. + + The list maintains a "circular" invariant, in that you can always + begin at the front and follow "next" pointers until eventually you + reach the same point. 
The "list head" is included within the + cycle, even though it does not have the correct item type. The + "list head" and "list link" types are different objects from the + user's perspective, but the core algorithms that operate on this + style of list treat the "list head" and "list link" as identical + types. That is why these algorithms are so simple. + + The <linux/list.h> implementation uses the same algorithms as those + in this file but uses only a single type "struct list_head". There + are two problems with this approach. First, there are no type + distinctions made between the two objects despite their distinct + types, which greatly increases the possibility for mistakes. For + example, the list_add function takes two "struct + list_head" arguments: the first is the item being inserted and the + second is the "struct list_head" which should precede the new + insertion to the list. You can use this function to insert at any + point in the list, but by far the most common list operations are + to insert at the front or back of the list. This common case + should accept two different argument types: a "list head" and an + "item", which leaves no room for confusion. + + The second problem with using a single "struct list_head" is that + it does not distinguish between list objects of distinct list + classes. If a single item can belong to two separate lists, there + is easily the possibility of a mistake being made that causes the + item to be added to a "list head" using the wrong "list link". By + using a parametrized list class we can detect such mistakes + statically, as soon as they are made. + + Creating a new list class takes several steps, which are described + below. Suppose for this example that you would like to link + together items of type "rx_event". You should decide on a + prefix-name to be used on all list functions and structures. 
For + example, the string "rx_event" can be used as a prefix for all the list + operations, resulting in a "list head" named rx_event_list_head and + a "list link" named rx_event_list_link. The list operations on + this list class would be named "rx_event_list_empty", + "rx_event_list_init", "rx_event_list_push_front", + "rx_event_list_push_back", and so on. +*/ + +#define TYPE_SAFE_LIST_LINK_INIT(name) { &(name), &(name) } +#define TYPE_SAFE_LIST_HEAD_INIT(name) { (void *)&(name), (void *)&(name) } +#define TYPE_SAFE_LIST_LINK_ZERO { NULL, NULL } +#define TYPE_SAFE_LIST_HEAD_ZERO { NULL, NULL } + +#define TS_LINK_TO_ITEM(ITEM_TYPE,LINK_NAME,LINK) \ + ((ITEM_TYPE *)((char *)(LINK)-(unsigned long)(&((ITEM_TYPE *)0)->LINK_NAME))) + +/* Step 1: Use the TYPE_SAFE_LIST_DECLARE() macro to define the "list head" + and "list link" objects. This macro takes one argument, the + prefix-name, which is prepended to every structure and function + name of the list class. Following the example, this will create + types named rx_event_list_head and rx_event_list_link. In the + example you would write: + + TYPE_SAFE_LIST_DECLARE(rx_event); + +*/ +#define TYPE_SAFE_LIST_DECLARE(PREFIX) \ + \ +typedef struct _##PREFIX##_list_head PREFIX##_list_head; \ +typedef struct _##PREFIX##_list_link PREFIX##_list_link; \ + \ +struct _##PREFIX##_list_link \ +{ \ + PREFIX##_list_link *_next; \ + PREFIX##_list_link *_prev; \ +}; \ + \ +struct _##PREFIX##_list_head \ +{ \ + PREFIX##_list_link *_next; \ + PREFIX##_list_link *_prev; \ +} + +/* Step 2: Once you have defined the two list classes, you should + define the item type you intend to use. The list classes must be + declared before the item type because the item type must contain an + embedded "list link" object. Following the example, you might define + rx_event as follows: + + typedef struct _rx_event rx_event; + + struct _rx_event + { + ... other members ... 
+ + rx_event_list_link _link; + }; + + In this case we have given the rx_event a field named "_link" of + the appropriate type. +*/ + +/* Step 3: The final step will define the list-functions for a + specific list class using the macro TYPE_SAFE_LIST_DEFINE. There are + three arguments to the TYPE_SAFE_LIST_DEFINE macro: the prefix-name, the + item type name, and field name of the "list link" element within + the item type. In the above example you would supply "rx_event" as + the type name and "_link" as the field name (without quotes). + E.g., + + TYPE_SAFE_LIST_DEFINE(rx_event,rx_event,_link) + + The list class you define is now complete with the functions: + + rx_event_list_init Initialize a list_head + rx_event_list_clean Initialize a list_link + rx_event_list_is_clean True if list_link is not in a list + rx_event_list_push_front Insert to the front of the list + rx_event_list_push_back Insert to the back of the list + rx_event_list_insert_before Insert just before given item in the list + rx_event_list_insert_after Insert just after given item in the list + rx_event_list_remove Remove an item from anywhere in the list + rx_event_list_remove_clean Remove an item from anywhere in the list and clean link_item + rx_event_list_remove_get_next Remove an item from anywhere in the list and return the next element + rx_event_list_remove_get_prev Remove an item from anywhere in the list and return the prev element + rx_event_list_pop_front Remove and return the front of the list, cannot be empty + rx_event_list_pop_back Remove and return the back of the list, cannot be empty + rx_event_list_front Get the front of the list + rx_event_list_back Get the back of the list + rx_event_list_next Iterate front-to-back through the list + rx_event_list_prev Iterate back-to-front through the list + rx_event_list_end Test to end an iteration, either direction + rx_event_list_splice Join two lists at the head + rx_event_list_empty True if the list is empty + rx_event_list_object_ok 
Check that list element satisfies double + list invariants. For debugging. + + To iterate over such a list use a for-loop such as: + + rx_event_list_head *head = ...; + rx_event *item; + + for (item = rx_event_list_front (head); + ! rx_event_list_end (head, item); + item = rx_event_list_next (item)) + {...} +*/ +#define TYPE_SAFE_LIST_DEFINE(PREFIX,ITEM_TYPE,LINK_NAME) \ + \ +static __inline__ int \ +PREFIX##_list_link_invariant (const PREFIX##_list_link *_link) \ +{ \ + return (_link != NULL) && \ + (_link->_prev != NULL) && (_link->_next != NULL ) && \ + (_link->_prev->_next == _link) && \ + (_link->_next->_prev == _link); \ +} \ + \ +static __inline__ void \ +PREFIX##_list_link_ok (const PREFIX##_list_link *_link UNUSED_ARG) \ +{ \ + assert ("nikita-1054", PREFIX##_list_link_invariant (_link)); \ +} \ + \ +static __inline__ void \ +PREFIX##_list_object_ok (const ITEM_TYPE *item) \ +{ \ + PREFIX##_list_link_ok (&item->LINK_NAME); \ +} \ + \ +static __inline__ void \ +PREFIX##_list_init (PREFIX##_list_head *head) \ +{ \ + head->_next = (PREFIX##_list_link*) head; \ + head->_prev = (PREFIX##_list_link*) head; \ +} \ + \ +static __inline__ void \ +PREFIX##_list_clean (ITEM_TYPE *item) \ +{ \ + PREFIX##_list_link *_link = &item->LINK_NAME; \ + \ + _link->_next = _link; \ + _link->_prev = _link; \ +} \ + \ +static __inline__ int \ +PREFIX##_list_is_clean (const ITEM_TYPE *item) \ +{ \ + const PREFIX##_list_link *_link = &item->LINK_NAME; \ + \ + PREFIX##_list_link_ok (_link); \ + return (_link == _link->_next) && (_link == _link->_prev); \ +} \ + \ +static __inline__ void \ +PREFIX##_list_insert_int (PREFIX##_list_link *next, \ + PREFIX##_list_link *item) \ +{ \ + PREFIX##_list_link *prev = next->_prev; \ + PREFIX##_list_link_ok (next); \ + PREFIX##_list_link_ok (prev); \ + next->_prev = item; \ + item->_next = next; \ + item->_prev = prev; \ + prev->_next = item; \ + PREFIX##_list_link_ok (next); \ + PREFIX##_list_link_ok (prev); \ + PREFIX##_list_link_ok (item); \ 
+} \ + \ +static __inline__ void \ +PREFIX##_list_push_front (PREFIX##_list_head *head, \ + ITEM_TYPE *item) \ +{ \ + PREFIX##_list_insert_int (head->_next, & item->LINK_NAME); \ +} \ + \ +static __inline__ void \ +PREFIX##_list_push_back (PREFIX##_list_head *head, \ + ITEM_TYPE *item) \ +{ \ + PREFIX##_list_insert_int ((PREFIX##_list_link *) head, & item->LINK_NAME); \ +} \ + \ +static __inline__ void \ +PREFIX##_list_insert_before (ITEM_TYPE *reference, \ + ITEM_TYPE *item) \ +{ \ + PREFIX##_list_insert_int (& reference->LINK_NAME, & item->LINK_NAME); \ +} \ + \ +static __inline__ void \ +PREFIX##_list_insert_after (ITEM_TYPE *reference, \ + ITEM_TYPE *item) \ +{ \ + PREFIX##_list_insert_int (reference->LINK_NAME._next, & item->LINK_NAME); \ +} \ + \ +static __inline__ PREFIX##_list_link* \ +PREFIX##_list_remove_int (PREFIX##_list_link *list_link) \ +{ \ + PREFIX##_list_link *next = list_link->_next; \ + PREFIX##_list_link *prev = list_link->_prev; \ + PREFIX##_list_link_ok (list_link); \ + PREFIX##_list_link_ok (next); \ + PREFIX##_list_link_ok (prev); \ + next->_prev = prev; \ + prev->_next = next; \ + PREFIX##_list_link_ok (next); \ + PREFIX##_list_link_ok (prev); \ + return list_link; \ +} \ + \ +static __inline__ void \ +PREFIX##_list_remove (ITEM_TYPE *item) \ +{ \ + PREFIX##_list_remove_int (& item->LINK_NAME); \ +} \ + \ +static __inline__ void \ +PREFIX##_list_remove_clean (ITEM_TYPE *item) \ +{ \ + PREFIX##_list_remove_int (& item->LINK_NAME); \ + PREFIX##_list_clean (item); \ +} \ + \ +static __inline__ ITEM_TYPE* \ +PREFIX##_list_remove_get_next (ITEM_TYPE *item) \ +{ \ + PREFIX##_list_link *next = item->LINK_NAME._next; \ + PREFIX##_list_remove_int (& item->LINK_NAME); \ + return TS_LINK_TO_ITEM(ITEM_TYPE,LINK_NAME,next); \ +} \ + \ +static __inline__ ITEM_TYPE* \ +PREFIX##_list_remove_get_prev (ITEM_TYPE *item) \ +{ \ + PREFIX##_list_link *prev = item->LINK_NAME._prev; \ + PREFIX##_list_remove_int (& item->LINK_NAME); \ + return 
TS_LINK_TO_ITEM(ITEM_TYPE,LINK_NAME,prev); \ +} \ + \ +static __inline__ int \ +PREFIX##_list_empty (const PREFIX##_list_head *head) \ +{ \ + return head == (PREFIX##_list_head*) head->_next; \ +} \ + \ +static __inline__ ITEM_TYPE* \ +PREFIX##_list_pop_front (PREFIX##_list_head *head) \ +{ \ + assert ("nikita-1913", ! PREFIX##_list_empty (head)); \ + return TS_LINK_TO_ITEM(ITEM_TYPE,LINK_NAME,PREFIX##_list_remove_int (head->_next)); \ +} \ + \ +static __inline__ ITEM_TYPE* \ +PREFIX##_list_pop_back (PREFIX##_list_head *head) \ +{ \ + assert ("nikita-1914", ! PREFIX##_list_empty (head)); /* WWI started */ \ + return TS_LINK_TO_ITEM(ITEM_TYPE,LINK_NAME,PREFIX##_list_remove_int (head->_prev)); \ +} \ + \ +static __inline__ ITEM_TYPE* \ +PREFIX##_list_front (const PREFIX##_list_head *head) \ +{ \ + return TS_LINK_TO_ITEM(ITEM_TYPE,LINK_NAME,head->_next); \ +} \ + \ +static __inline__ ITEM_TYPE* \ +PREFIX##_list_back (const PREFIX##_list_head *head) \ +{ \ + return TS_LINK_TO_ITEM(ITEM_TYPE,LINK_NAME,head->_prev); \ +} \ + \ +static __inline__ ITEM_TYPE* \ +PREFIX##_list_next (const ITEM_TYPE *item) \ +{ \ + return TS_LINK_TO_ITEM(ITEM_TYPE,LINK_NAME,item->LINK_NAME._next); \ +} \ + \ +static __inline__ ITEM_TYPE* \ +PREFIX##_list_prev (const ITEM_TYPE *item) \ +{ \ + return TS_LINK_TO_ITEM(ITEM_TYPE,LINK_NAME,item->LINK_NAME._prev); \ +} \ + \ +static __inline__ int \ +PREFIX##_list_end (const PREFIX##_list_head *head, \ + const ITEM_TYPE *item) \ +{ \ + return ((PREFIX##_list_link *) head) == (& item->LINK_NAME); \ +} \ + \ +static __inline__ void \ +PREFIX##_list_splice (PREFIX##_list_head *head_join, \ + PREFIX##_list_head *head_empty) \ +{ \ + if (PREFIX##_list_empty (head_empty)) { \ + return; \ + } \ + \ + head_empty->_prev->_next = (PREFIX##_list_link*) head_join; \ + head_empty->_next->_prev = head_join->_prev; \ + \ + head_join->_prev->_next = head_empty->_next; \ + head_join->_prev = head_empty->_prev; \ + \ + PREFIX##_list_link_ok ((PREFIX##_list_link*) 
head_join); \ + PREFIX##_list_link_ok (head_join->_prev); \ + PREFIX##_list_link_ok (head_join->_next); \ + \ + PREFIX##_list_init (head_empty); \ +} \ + \ +static __inline__ void \ +PREFIX##_list_split(PREFIX##_list_head *head_split, \ + PREFIX##_list_head *head_new, \ + ITEM_TYPE *item) \ +{ \ + assert("vs-1471", PREFIX##_list_empty(head_new)); \ + \ + /* attach to new list */ \ + head_new->_next = (& item->LINK_NAME); \ + head_new->_prev = head_split->_prev; \ + \ + /* cut from old list */ \ + item->LINK_NAME._prev->_next = (PREFIX##_list_link*)head_split; \ + head_split->_prev = item->LINK_NAME._prev; \ + \ + /* link new list */ \ + head_new->_next->_prev = (PREFIX##_list_link*)head_new; \ + head_new->_prev->_next = (PREFIX##_list_link*)head_new; \ +} \ + \ +static __inline__ void \ +PREFIX##_list_check (const PREFIX##_list_head *head) \ +{ \ + const PREFIX##_list_link *link; \ + \ + for (link = head->_next ; link != ((PREFIX##_list_link *) head) ; link = link->_next) \ + PREFIX##_list_link_ok (link); \ +} \ + \ +typedef struct { int foo; } PREFIX##_list_dummy_decl + +/* The final typedef is to allow a semicolon at the end of + * TYPE_SAFE_LIST_DEFINE(); */ + +#define for_all_type_safe_list(prefix, head, item) \ + for(item = prefix ## _list_front(head), \ + prefetch(prefix ## _list_next(item)); \ + !prefix ## _list_end(head, item) ; \ + item = prefix ## _list_next(item), \ + prefetch(prefix ## _list_next(item))) + +#define for_all_type_safe_list_safe(prefix, head, item, next) \ + for(item = prefix ## _list_front(head), \ + next = prefix ## _list_next(item); \ + !prefix ## _list_end(head, item) ; \ + item = next, \ + next = prefix ## _list_next(item)) + +/* __REISER4_TYPE_SAFE_LIST_H__ */ +#endif + +/* + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/vfs_ops.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/vfs_ops.c Mon Jun 13 15:05:23 
2005 @@ -0,0 +1,1640 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Interface to VFS. Reiser4 {super|export|dentry}_operations are defined + here. */ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "coord.h" +#include "plugin/item/item.h" +#include "plugin/file/file.h" +#include "plugin/security/perm.h" +#include "plugin/disk_format/disk_format.h" +#include "plugin/dir/dir.h" +#include "plugin/plugin.h" +#include "plugin/plugin_set.h" +#include "plugin/object.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree.h" +#include "vfs_ops.h" +#include "inode.h" +#include "page_cache.h" +#include "ktxnmgrd.h" +#include "super.h" +#include "reiser4.h" +#include "entd.h" +#include "emergency_flush.h" +#include "init_super.h" +#include "status_flags.h" +#include "flush.h" +#include "dscale.h" + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* super operations */ + +static struct inode *reiser4_alloc_inode(struct super_block *super); +static void reiser4_destroy_inode(struct inode *inode); +static void reiser4_drop_inode(struct inode *); +static void reiser4_delete_inode(struct inode *); +static void reiser4_write_super(struct super_block *); +static int reiser4_statfs(struct super_block *, struct kstatfs *); +static int reiser4_show_options(struct seq_file *m, struct vfsmount *mnt); +static void reiser4_sync_inodes(struct super_block *s, struct writeback_control * wbc); + +extern struct dentry_operations reiser4_dentry_operation; + +static struct file_system_type reiser4_fs_type; + +/* ->statfs() VFS method in reiser4 super_operations */ +static int +reiser4_statfs(struct super_block *super /* super block of file + * system in queried */ , + struct kstatfs *statfs /* buffer to fill with + * 
statistics */ ) +{ + sector_t total; + sector_t reserved; + sector_t free; + sector_t forroot; + sector_t deleted; + reiser4_context ctx; + + assert("nikita-408", super != NULL); + assert("nikita-409", statfs != NULL); + + init_context(&ctx, super); + + statfs->f_type = statfs_type(super); + statfs->f_bsize = super->s_blocksize; + + /* + * 5% of total block space is reserved. This is needed for flush and + * for truncates (so that we are able to perform truncate/unlink even + * on the otherwise completely full file system). If this reservation + * is hidden from statfs(2), users will mistakenly guess that they + * have enough free space to complete some operation, which is + * frustrating. + * + * Another possible solution is to subtract ->blocks_reserved from + * ->f_bfree, but changing available space seems less intrusive than + * letting user to see 5% of disk space to be used directly after + * mkfs. + */ + total = reiser4_block_count(super); + reserved = get_super_private(super)->blocks_reserved; + deleted = txnmgr_count_deleted_blocks(); + free = reiser4_free_blocks(super) + deleted; + forroot = reiser4_reserved_blocks(super, 0, 0); + + /* These counters may be in inconsistent state because we take the + * values without keeping any global spinlock. Here we do a sanity + * check that free block counter does not exceed the number of all + * blocks. */ + if (free > total) + free = total; + statfs->f_blocks = total - reserved; + /* make sure statfs->f_bfree is never larger than statfs->f_blocks */ + if (free > reserved) + free -= reserved; + else + free = 0; + statfs->f_bfree = free; + + if (free > forroot) + free -= forroot; + else + free = 0; + statfs->f_bavail = free; + +/* FIXME: Seems that various df implementations are way unhappy by such big numbers. + So we will leave those as zeroes. + statfs->f_files = oids_used(super) + oids_free(super); + statfs->f_ffree = oids_free(super); +*/ + + /* maximal acceptable name length depends on directory plugin. 
*/ + assert("nikita-3351", super->s_root->d_inode != NULL); + statfs->f_namelen = reiser4_max_filename_len(super->s_root->d_inode); + reiser4_exit_context(&ctx); + return 0; +} + +/* this is called whenever mark_inode_dirty is to be called. Stat-data are + * updated in the tree. */ +reiser4_internal int +reiser4_mark_inode_dirty(struct inode *inode) +{ + assert("vs-1207", is_in_reiser4_context()); + return reiser4_update_sd(inode); +} + +/* update inode stat-data by calling plugin */ +reiser4_internal int +reiser4_update_sd(struct inode *object) +{ + file_plugin *fplug; + + assert("nikita-2338", object != NULL); + /* check for read-only file system. */ + if (IS_RDONLY(object)) + return 0; + + fplug = inode_file_plugin(object); + assert("nikita-2339", fplug != NULL); + return fplug->write_sd_by_inode(object); +} + +/* helper function: increase inode nlink count and call plugin method to save + updated stat-data. + + Used by link/create and during creation of dot and dotdot in mkdir +*/ +reiser4_internal int +reiser4_add_nlink(struct inode *object /* object to which link is added */ , + struct inode *parent /* parent where new entry will be */ , + int write_sd_p /* true if stat-data has to be + * updated */ ) +{ + file_plugin *fplug; + int result; + + assert("nikita-1351", object != NULL); + + fplug = inode_file_plugin(object); + assert("nikita-1445", fplug != NULL); + + /* ask plugin whether it can add yet another link to this + object */ + if (!fplug->can_add_link(object)) + return RETERR(-EMLINK); + + assert("nikita-2211", fplug->add_link != NULL); + /* call plugin to do actual addition of link */ + result = fplug->add_link(object, parent); + + mark_inode_update(object, write_sd_p); + + /* optionally update stat data */ + if (result == 0 && write_sd_p) + result = fplug->write_sd_by_inode(object); + return result; +} + +/* helper function: decrease inode nlink count and call plugin method to save + updated stat-data. 
+ + Used by unlink/create +*/ +reiser4_internal int +reiser4_del_nlink(struct inode *object /* object from which link is + * removed */ , + struct inode *parent /* parent where entry was */ , + int write_sd_p /* true if stat-data has to be + * updated */ ) +{ + file_plugin *fplug; + int result; + + assert("nikita-1349", object != NULL); + + fplug = inode_file_plugin(object); + assert("nikita-1350", fplug != NULL); + assert("nikita-1446", object->i_nlink > 0); + assert("nikita-2210", fplug->rem_link != NULL); + + /* call plugin to do actual deletion of link */ + result = fplug->rem_link(object, parent); + mark_inode_update(object, write_sd_p); + /* optionally update stat data */ + if (result == 0 && write_sd_p) + result = fplug->write_sd_by_inode(object); + return result; +} + +/* slab for reiser4_dentry_fsdata */ +static kmem_cache_t *dentry_fsdata_slab; + +/* + * initializer for dentry_fsdata_slab called during boot or module load. + */ +static int +init_dentry_fsdata(void) +{ + dentry_fsdata_slab = kmem_cache_create("dentry_fsdata", + sizeof (reiser4_dentry_fsdata), + 0, + SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT, + NULL, + NULL); + return (dentry_fsdata_slab == NULL) ? RETERR(-ENOMEM) : 0; +} + +/* + * dual to init_dentry_fsdata(). Called on module unload. + */ +static void +done_dentry_fsdata(void) +{ + kmem_cache_destroy(dentry_fsdata_slab); +} + + +/* Return and lazily allocate if necessary per-dentry data that we + attach to each dentry. 
*/ +reiser4_internal reiser4_dentry_fsdata * +reiser4_get_dentry_fsdata(struct dentry *dentry /* dentry + * queried */ ) +{ + assert("nikita-1365", dentry != NULL); + + if (dentry->d_fsdata == NULL) { + dentry->d_fsdata = kmem_cache_alloc(dentry_fsdata_slab, + GFP_KERNEL); + if (dentry->d_fsdata == NULL) + return ERR_PTR(RETERR(-ENOMEM)); + memset(dentry->d_fsdata, 0, sizeof (reiser4_dentry_fsdata)); + } + return dentry->d_fsdata; +} + +/* opposite to reiser4_get_dentry_fsdata(), returns per-dentry data to the slab + * allocator */ +reiser4_internal void +reiser4_free_dentry_fsdata(struct dentry *dentry /* dentry released */ ) +{ + if (dentry->d_fsdata != NULL) { + kmem_cache_free(dentry_fsdata_slab, dentry->d_fsdata); + dentry->d_fsdata = NULL; + } +} + +/* Release reiser4 dentry. This is d_op->d_release() method. */ +static void +reiser4_d_release(struct dentry *dentry /* dentry released */ ) +{ + reiser4_free_dentry_fsdata(dentry); +} + +/* slab for reiser4_file_fsdata */ +static kmem_cache_t *file_fsdata_slab; + +/* + * initialize file_fsdata_slab. This is called during boot or module load. + */ +static int +init_file_fsdata(void) +{ + file_fsdata_slab = kmem_cache_create("file_fsdata", + sizeof (reiser4_file_fsdata), + 0, + SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT, + NULL, + NULL); + return (file_fsdata_slab == NULL) ? RETERR(-ENOMEM) : 0; +} + +/* + * dual to init_file_fsdata(). Called during module unload. + */ +static void +done_file_fsdata(void) +{ + kmem_cache_destroy(file_fsdata_slab); +} + +/* + * Create reiser4 specific per-file data: reiser4_file_fsdata. 
+ */ +reiser4_internal reiser4_file_fsdata * +create_fsdata(struct file *file, int gfp) +{ + reiser4_file_fsdata *fsdata; + + fsdata = kmem_cache_alloc(file_fsdata_slab, gfp); + if (fsdata != NULL) { + memset(fsdata, 0, sizeof *fsdata); + fsdata->ra1.max_window_size = VM_MAX_READAHEAD * 1024; + fsdata->back = file; + readdir_list_clean(fsdata); + } + return fsdata; +} + +/* Return and lazily allocate if necessary per-file data that we attach + to each struct file. */ +reiser4_internal reiser4_file_fsdata * +reiser4_get_file_fsdata(struct file *f /* file + * queried */ ) +{ + assert("nikita-1603", f != NULL); + + if (f->private_data == NULL) { + reiser4_file_fsdata *fsdata; + struct inode *inode; + + fsdata = create_fsdata(f, GFP_KERNEL); + if (fsdata == NULL) + return ERR_PTR(RETERR(-ENOMEM)); + + inode = f->f_dentry->d_inode; + spin_lock_inode(inode); + if (f->private_data == NULL) { + f->private_data = fsdata; + fsdata = NULL; + } + spin_unlock_inode(inode); + if (fsdata != NULL) + /* other thread initialized ->fsdata */ + kmem_cache_free(file_fsdata_slab, fsdata); + } + assert("nikita-2665", f->private_data != NULL); + return f->private_data; +} + +/* + * Dual to create_fsdata(). Free reiser4_file_fsdata. + */ +reiser4_internal void +reiser4_free_fsdata(reiser4_file_fsdata *fsdata) +{ + if (fsdata != NULL) + kmem_cache_free(file_fsdata_slab, fsdata); +} + +/* + * Dual to reiser4_get_file_fsdata(). + */ +reiser4_internal void +reiser4_free_file_fsdata(struct file *f) +{ + reiser4_file_fsdata *fsdata; + fsdata = f->private_data; + if (fsdata != NULL) { + readdir_list_remove_clean(fsdata); + if (fsdata->cursor == NULL) + reiser4_free_fsdata(fsdata); + } + f->private_data = NULL; +} + +/* our ->read_inode() is no-op. 
Reiser4 inodes should be loaded + through fs/reiser4/inode.c:reiser4_iget() */ +static void +noop_read_inode(struct inode *inode UNUSED_ARG) +{ +} + +/* initialization and shutdown */ + +/* slab cache for inodes */ +static kmem_cache_t *inode_cache; + +/* initialization function passed to the kmem_cache_create() to init new pages + grabbed by our inodecache. */ +static void +init_once(void *obj /* pointer to new inode */ , + kmem_cache_t * cache UNUSED_ARG /* slab cache */ , + unsigned long flags /* cache flags */ ) +{ + reiser4_inode_object *info; + + info = obj; + + if ((flags & (SLAB_CTOR_VERIFY | SLAB_CTOR_CONSTRUCTOR)) == SLAB_CTOR_CONSTRUCTOR) { + /* NOTE-NIKITA add here initializations for locks, list heads, + etc. that will be added to our private inode part. */ + inode_init_once(&info->vfs_inode); + readdir_list_init(get_readdir_list(&info->vfs_inode)); + init_rwsem(&info->p.coc_sem); + /* init semaphore which is used during inode loading */ + loading_init_once(&info->p); + INIT_RADIX_TREE(jnode_tree_by_reiser4_inode(&info->p), GFP_ATOMIC); +#if REISER4_DEBUG + info->p.nr_jnodes = 0; + info->p.captured_eflushed = 0; + info->p.anonymous_eflushed = 0; +#endif + } +} + +/* initialize slab cache where reiser4 inodes will live */ +static int +init_inodecache(void) +{ + inode_cache = kmem_cache_create("reiser4_inode", + sizeof (reiser4_inode_object), + 0, + SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT, + init_once, + NULL); + return (inode_cache != NULL) ? 
0 : RETERR(-ENOMEM); +} + +/* destroy slab cache where reiser4 inodes lived */ +static void +destroy_inodecache(void) +{ + if (kmem_cache_destroy(inode_cache) != 0) + warning("nikita-1695", "not all inodes were freed"); +} + +/* ->alloc_inode() super operation: allocate new inode */ +static struct inode * +reiser4_alloc_inode(struct super_block *super UNUSED_ARG /* super block new + * inode is + * allocated for */ ) +{ + reiser4_inode_object *obj; + + assert("nikita-1696", super != NULL); + obj = kmem_cache_alloc(inode_cache, SLAB_KERNEL); + if (obj != NULL) { + reiser4_inode *info; + + info = &obj->p; + + info->hset = info->pset = plugin_set_get_empty(); + info->extmask = 0; + info->locality_id = 0ull; + info->plugin_mask = 0; +#if !REISER4_INO_IS_OID + info->oid_hi = 0; +#endif + seal_init(&info->sd_seal, NULL, NULL); + coord_init_invalid(&info->sd_coord, NULL); + info->cluster_shift = 0; + info->crypt = NULL; + info->flags = 0; + spin_inode_object_init(info); + /* this deals with info's loading semaphore */ + loading_alloc(info); + info->vroot = UBER_TREE_ADDR; + return &obj->vfs_inode; + } else + return NULL; +} + +/* ->destroy_inode() super operation: recycle inode */ +static void +reiser4_destroy_inode(struct inode *inode /* inode being destroyed */) +{ + reiser4_inode *info; + + info = reiser4_inode_data(inode); + + assert("vs-1220", inode_has_no_jnodes(info)); + + if (!is_bad_inode(inode) && is_inode_loaded(inode)) { + file_plugin * fplug = inode_file_plugin(inode); + if (fplug->destroy_inode != NULL) + fplug->destroy_inode(inode); + } + dispose_cursors(inode); + if (info->pset) + plugin_set_put(info->pset); + + /* cannot add similar assertion about ->i_list as prune_icache returns + * inode to the slab with dangling ->list.{next,prev}. This is safe, + * because they are re-initialized in new_inode(). 
*/ + assert("nikita-2895", list_empty(&inode->i_dentry)); + assert("nikita-2896", hlist_unhashed(&inode->i_hash)); + assert("nikita-2898", readdir_list_empty(get_readdir_list(inode))); + + /* this deals with info's loading semaphore */ + loading_destroy(info); + + kmem_cache_free(inode_cache, container_of(info, reiser4_inode_object, p)); +} + +/* our ->drop_inode() method. This is called by iput_final() when last + * reference on inode is released */ +static void +reiser4_drop_inode(struct inode *object) +{ + file_plugin *fplug; + + assert("nikita-2643", object != NULL); + + /* -not- creating context in this method, because it is frequently + called and all existing ->not_linked() methods are one liners. */ + + fplug = inode_file_plugin(object); + /* fplug is NULL for fake inode */ + if (fplug != NULL) { + assert("nikita-3251", fplug->drop != NULL); + fplug->drop(object); + } else + generic_forget_inode(object); +} + +/* + * Called by reiser4_sync_inodes(), during speculative write-back (through + * pdflush, or balance_dirty_pages()). + */ +static void +writeout(struct super_block *sb, struct writeback_control *wbc) +{ + long written = 0; + int repeats = 0; + + /* + * Performs early flushing, trying to free some memory. If there is + * nothing to flush, commits some atoms. + */ + + /* reiser4 has its own means of periodical write-out */ + if (wbc->for_kupdate) + return; + + /* Commit all atoms if reiser4_writepages() is called from sys_sync() or + sys_fsync(). 
*/ + if (wbc->sync_mode != WB_SYNC_NONE) { + txnmgr_force_commit_all(sb, 1); + return; + } + + do { + long nr_submitted = 0; + struct inode *fake; + + fake = get_super_fake(sb); + if (fake != NULL) { + struct address_space *mapping; + + mapping = fake->i_mapping; + /* do not put more requests to overload write queue */ + if (wbc->nonblocking && + bdi_write_congested(mapping->backing_dev_info)) { + + blk_run_address_space(mapping); + /*blk_run_queues();*/ + wbc->encountered_congestion = 1; + break; + } + } + repeats ++; + flush_some_atom(&nr_submitted, wbc, JNODE_FLUSH_WRITE_BLOCKS); + if (!nr_submitted) + break; + + wbc->nr_to_write -= nr_submitted; + + written += nr_submitted; + + } while (wbc->nr_to_write > 0); + +} + +/* ->sync_inodes() method. This is called by pdflush, and synchronous + * writeback (throttling by balance_dirty_pages()). */ +static void +reiser4_sync_inodes(struct super_block * sb, struct writeback_control * wbc) +{ + reiser4_context ctx; + + init_context(&ctx, sb); + wbc->older_than_this = NULL; + + /* + * What we are trying to do here is to capture all "anonymous" pages. 
+ */ + generic_sync_sb_inodes(sb, wbc); + /*capture_reiser4_inodes(sb, wbc);*/ + spin_unlock(&inode_lock); + writeout(sb, wbc); + + /* avoid recursive calls to ->sync_inodes */ + context_set_commit_async(&ctx); + reiser4_exit_context(&ctx); + spin_lock(&inode_lock); +} + +void reiser4_throttle_write(struct inode * inode) +{ + txn_restart_current(); + balance_dirty_pages_ratelimited(inode->i_mapping); +} + +/* ->delete_inode() super operation */ +static void +reiser4_delete_inode(struct inode *object) +{ + reiser4_context ctx; + + init_context(&ctx, object->i_sb); + if (is_inode_loaded(object)) { + file_plugin *fplug; + + fplug = inode_file_plugin(object); + if (fplug != NULL && fplug->delete != NULL) + fplug->delete(object); + } + + object->i_blocks = 0; + clear_inode(object); + reiser4_exit_context(&ctx); +} + +/* ->clear_inode() super operation */ +static void +reiser4_clear_inode(struct inode *object) +{ +#if REISER4_DEBUG + reiser4_inode *r4_inode; + + r4_inode = reiser4_inode_data(object); + if (!inode_has_no_jnodes(r4_inode)) + warning("vs-1732", "reiser4 inode is not clear: ae %d, ce %d, jnodes %lu\n", + r4_inode->anonymous_eflushed, r4_inode->captured_eflushed, + r4_inode->nr_jnodes); +#endif +} + +const char *REISER4_SUPER_MAGIC_STRING = "ReIsEr4"; +const int REISER4_MAGIC_OFFSET = 16 * 4096; /* offset to magic string from the + * beginning of device */ + +/* type of option parseable by parse_option() */ +typedef enum { + /* value of option is arbitrary string */ + OPT_STRING, + /* option specifies bit in a bitmask */ + OPT_BIT, + /* value of option should conform to sscanf() format */ + OPT_FORMAT, + /* option can take one of predefined values */ + OPT_ONEOF, +} opt_type_t; + +typedef struct opt_bitmask_bit { + const char *bit_name; + int bit_nr; +} opt_bitmask_bit; + +/* description of option parseable by parse_option() */ +typedef struct opt_desc { + /* option name. + + parsed portion of string has a form "name=value". 
+ */ + const char *name; + /* type of option */ + opt_type_t type; + union { + /* where to store value of string option (type == OPT_STRING) */ + char **string; + /* description of bits for bit option (type == OPT_BIT) */ + struct { + int nr; + void *addr; + } bit; + /* description of format and targets for format option (type + == OPT_FORMAT) */ + struct { + const char *format; + int nr_args; + void *arg1; + void *arg2; + void *arg3; + void *arg4; + } f; + struct { + int *result; + const char *list[10]; + } oneof; + struct { + void *addr; + int nr_bits; + opt_bitmask_bit *bits; + } bitmask; + } u; +} opt_desc_t; + +/* parse one option */ +static int +parse_option(char *opt_string /* starting point of parsing */ , + opt_desc_t * opt /* option description */ ) +{ + /* foo=bar, + ^ ^ ^ + | | +-- replaced to '\0' + | +-- val_start + +-- opt_string + */ + char *val_start; + int result; + const char *err_msg; + + /* NOTE-NIKITA think about using lib/cmdline.c functions here. */ + + val_start = strchr(opt_string, '='); + if (val_start != NULL) { + *val_start = '\0'; + ++val_start; + } + + err_msg = NULL; + result = 0; + switch (opt->type) { + case OPT_STRING: + if (val_start == NULL) { + err_msg = "String arg missing"; + result = RETERR(-EINVAL); + } else + *opt->u.string = val_start; + break; + case OPT_BIT: + if (val_start != NULL) + err_msg = "Value ignored"; + else + set_bit(opt->u.bit.nr, opt->u.bit.addr); + break; + case OPT_FORMAT: + if (val_start == NULL) { + err_msg = "Formatted arg missing"; + result = RETERR(-EINVAL); + break; + } + if (sscanf(val_start, opt->u.f.format, + opt->u.f.arg1, opt->u.f.arg2, opt->u.f.arg3, opt->u.f.arg4) != opt->u.f.nr_args) { + err_msg = "Wrong conversion"; + result = RETERR(-EINVAL); + } + break; + case OPT_ONEOF: + { + int i = 0; + + if (val_start == NULL) { + err_msg = "Value is missing"; + result = RETERR(-EINVAL); + break; + } + err_msg = "Wrong option value"; + result = RETERR(-EINVAL); + while ( opt->u.oneof.list[i] ) { + if 
( !strcmp(opt->u.oneof.list[i], val_start) ) { + result = 0; + err_msg = NULL; + *opt->u.oneof.result = i; + break; + } + i++; + } + break; + } + default: + wrong_return_value("nikita-2100", "opt -> type"); + break; + } + if (err_msg != NULL) { + warning("nikita-2496", "%s when parsing option \"%s%s%s\"", + err_msg, opt->name, val_start ? "=" : "", val_start ? : ""); + } + return result; +} + +/* parse options */ +static int +parse_options(char *opt_string /* starting point */ , + opt_desc_t * opts /* array with option description */ , + int nr_opts /* number of elements in @opts */ ) +{ + int result; + + result = 0; + while ((result == 0) && opt_string && *opt_string) { + int j; + char *next; + + next = strchr(opt_string, ','); + if (next != NULL) { + *next = '\0'; + ++next; + } + for (j = 0; j < nr_opts; ++j) { + if (!strncmp(opt_string, opts[j].name, strlen(opts[j].name))) { + result = parse_option(opt_string, &opts[j]); + break; + } + } + if (j == nr_opts) { + warning("nikita-2307", "Unrecognized option: \"%s\"", opt_string); + /* traditionally, -EINVAL is returned on wrong mount + option */ + result = RETERR(-EINVAL); + } + opt_string = next; + } + return result; +} + +#define NUM_OPT( label, fmt, addr ) \ + { \ + .name = ( label ), \ + .type = OPT_FORMAT, \ + .u = { \ + .f = { \ + .format = ( fmt ), \ + .nr_args = 1, \ + .arg1 = ( addr ), \ + .arg2 = NULL, \ + .arg3 = NULL, \ + .arg4 = NULL \ + } \ + } \ + } + +#define SB_FIELD_OPT( field, fmt ) NUM_OPT( #field, fmt, &sbinfo -> field ) + +#define BIT_OPT(label, bitnr) \ + { \ + .name = label, \ + .type = OPT_BIT, \ + .u = { \ + .bit = { \ + .nr = bitnr, \ + .addr = &sbinfo->fs_flags \ + } \ + } \ + } + + +#define MAX_NR_OPTIONS (30) + +/* parse options during mount */ +reiser4_internal int +reiser4_parse_options(struct super_block *s, char *opt_string) +{ + int result; + reiser4_super_info_data *sbinfo = get_super_private(s); + char *log_file_name; + opt_desc_t *opts, *p; + + opts = kmalloc(sizeof(opt_desc_t) 
* MAX_NR_OPTIONS, GFP_KERNEL); + if (opts == NULL) + return RETERR(-ENOMEM); + + p = opts; + +#if REISER4_DEBUG +# define OPT_ARRAY_CHECK if ((p) >= (opts) + MAX_NR_OPTIONS) { \ + warning ("zam-1046", "opt array is overloaded"); break; \ + } +#else +# define OPT_ARRAY_CHECK noop +#endif + +#define PUSH_OPT(...) \ +do { \ + opt_desc_t o = __VA_ARGS__; \ + OPT_ARRAY_CHECK; \ + *p ++ = o; \ +} while (0) + +#define PUSH_SB_FIELD_OPT(field, format) PUSH_OPT(SB_FIELD_OPT(field, format)) +#define PUSH_BIT_OPT(name, bit) PUSH_OPT(BIT_OPT(name, bit)) + + /* trace_flags=N + + set trace flags to be N for this mount. N can be C numeric + literal recognized by %i scanf specifier. It is treated as + bitfield filled by values of debug.h:reiser4_trace_flags + enum + */ + PUSH_SB_FIELD_OPT(trace_flags, "%i"); + /* log_flags=N + + set log flags to be N for this mount. N can be C numeric + literal recognized by %i scanf specifier. It is treated as + bitfield filled by values of debug.h:reiser4_log_flags + enum + */ + PUSH_SB_FIELD_OPT(log_flags, "%i"); + /* debug_flags=N + + set debug flags to be N for this mount. N can be C numeric + literal recognized by %i scanf specifier. It is treated as + bitfield filled by values of debug.h:reiser4_debug_flags + enum + */ + PUSH_SB_FIELD_OPT(debug_flags, "%i"); + /* tmgr.atom_max_size=N + + Atoms containing more than N blocks will be forced to + commit. N is decimal. + */ + PUSH_SB_FIELD_OPT(tmgr.atom_max_size, "%u"); + /* tmgr.atom_max_age=N + + Atoms older than N seconds will be forced to commit. N is + decimal. + */ + PUSH_SB_FIELD_OPT(tmgr.atom_max_age, "%u"); + /* tmgr.atom_max_flushers=N + + limit of concurrent flushers for one atom. 0 means no limit. + */ + PUSH_SB_FIELD_OPT(tmgr.atom_max_flushers, "%u"); + /* tree.cbk_cache.nr_slots=N + + Number of slots in the cbk cache. 
+ */ + PUSH_SB_FIELD_OPT(tree.cbk_cache.nr_slots, "%u"); + + /* If flush finds more than FLUSH_RELOCATE_THRESHOLD adjacent + dirty leaf-level blocks it will force them to be + relocated. */ + PUSH_SB_FIELD_OPT(flush.relocate_threshold, "%u"); + /* If flush can find a block allocation at + most FLUSH_RELOCATE_DISTANCE away from the preceder it will + relocate to that position. */ + PUSH_SB_FIELD_OPT(flush.relocate_distance, "%u"); + /* If we have written this many or more blocks before + encountering busy jnode in flush list - abort flushing + hoping that next time we get called this jnode will be + clean already, and we will save some seeks. */ + PUSH_SB_FIELD_OPT(flush.written_threshold, "%u"); + /* The maximum number of nodes to scan left on a level during + flush. */ + PUSH_SB_FIELD_OPT(flush.scan_maxnodes, "%u"); + + /* preferred IO size */ + PUSH_SB_FIELD_OPT(optimal_io_size, "%u"); + + /* carry flags used for insertion of new nodes */ + PUSH_SB_FIELD_OPT(tree.carry.new_node_flags, "%u"); + /* carry flags used for insertion of new extents */ + PUSH_SB_FIELD_OPT(tree.carry.new_extent_flags, "%u"); + /* carry flags used for paste operations */ + PUSH_SB_FIELD_OPT(tree.carry.paste_flags, "%u"); + /* carry flags used for insert operations */ + PUSH_SB_FIELD_OPT(tree.carry.insert_flags, "%u"); + +#ifdef CONFIG_REISER4_BADBLOCKS + /* Alternative master superblock location in case its original + location is not writable/accessible. This is offset in BYTES. */ + PUSH_SB_FIELD_OPT(altsuper, "%lu"); +#endif + + /* turn on BSD-style gid assignment */ + PUSH_BIT_OPT("bsdgroups", REISER4_BSD_GID); + /* turn on 32 bit times */ + PUSH_BIT_OPT("32bittimes", REISER4_32_BIT_TIMES); + /* turn on concurrent (multi-threaded) flushing */ + PUSH_BIT_OPT("mtflush", REISER4_MTFLUSH); + /* disable pseudo files support */ + PUSH_BIT_OPT("nopseudo", REISER4_NO_PSEUDO); + /* Don't load all bitmap blocks at mount time, it is useful + for machines with tiny RAM and large disks. 
*/ + PUSH_BIT_OPT("dont_load_bitmap", REISER4_DONT_LOAD_BITMAP); + + PUSH_OPT ({ + /* tree traversal readahead parameters: + -o readahead=MAXNUM:FLAGS + MAXNUM - max number of nodes to request readahead for: -1UL will set it to max_sane_readahead() + FLAGS - combination of bits: RA_ADJCENT_ONLY, RA_ALL_LEVELS, CONTINUE_ON_PRESENT + */ + .name = "readahead", + .type = OPT_FORMAT, + .u = { + .f = { + .format = "%u:%u", + .nr_args = 2, + .arg1 = &sbinfo->ra_params.max, + .arg2 = &sbinfo->ra_params.flags, + .arg3 = NULL, + .arg4 = NULL + } + } + }); + + /* What to do in case of fs error */ + PUSH_OPT ({ + .name = "onerror", + .type = OPT_ONEOF, + .u = { + .oneof = { + .result = &sbinfo->onerror, + .list = {"panic", "remount-ro", "reboot", NULL}, + } + } + }); + + sbinfo->tmgr.atom_max_size = txnmgr_get_max_atom_size(s); + sbinfo->tmgr.atom_max_age = REISER4_ATOM_MAX_AGE / HZ; + sbinfo->tmgr.atom_max_flushers = ATOM_MAX_FLUSHERS; + + sbinfo->tree.cbk_cache.nr_slots = CBK_CACHE_SLOTS; + + sbinfo->flush.relocate_threshold = FLUSH_RELOCATE_THRESHOLD; + sbinfo->flush.relocate_distance = FLUSH_RELOCATE_DISTANCE; + sbinfo->flush.written_threshold = FLUSH_WRITTEN_THRESHOLD; + sbinfo->flush.scan_maxnodes = FLUSH_SCAN_MAXNODES; + + sbinfo->optimal_io_size = REISER4_OPTIMAL_IO_SIZE; + + sbinfo->tree.carry.new_node_flags = REISER4_NEW_NODE_FLAGS; + sbinfo->tree.carry.new_extent_flags = REISER4_NEW_EXTENT_FLAGS; + sbinfo->tree.carry.paste_flags = REISER4_PASTE_FLAGS; + sbinfo->tree.carry.insert_flags = REISER4_INSERT_FLAGS; + + log_file_name = NULL; + + /* + init default readahead params + */ + sbinfo->ra_params.max = num_physpages / 4; + sbinfo->ra_params.flags = 0; + + result = parse_options(opt_string, opts, p - opts); + kfree(opts); + if (result != 0) + return result; + + sbinfo->tmgr.atom_max_age *= HZ; + if (sbinfo->tmgr.atom_max_age <= 0) + /* overflow */ + sbinfo->tmgr.atom_max_age = REISER4_ATOM_MAX_AGE; + + /* round optimal io size down to a multiple of 512 bytes */ + 
sbinfo->optimal_io_size >>= VFS_BLKSIZE_BITS; + sbinfo->optimal_io_size <<= VFS_BLKSIZE_BITS; + if (sbinfo->optimal_io_size == 0) { + warning("nikita-2497", "optimal_io_size is too small"); + return RETERR(-EINVAL); + } + + /* disable single-threaded flush as it leads to deadlock */ + sbinfo->fs_flags |= (1 << REISER4_MTFLUSH); + return result; +} + +/* show mount options in /proc/mounts */ +static int +reiser4_show_options(struct seq_file *m, struct vfsmount *mnt) +{ + struct super_block *super; + reiser4_super_info_data *sbinfo; + + super = mnt->mnt_sb; + sbinfo = get_super_private(super); + + seq_printf(m, ",trace=0x%x", sbinfo->trace_flags); + seq_printf(m, ",log=0x%x", sbinfo->log_flags); + seq_printf(m, ",debug=0x%x", sbinfo->debug_flags); + seq_printf(m, ",atom_max_size=0x%x", sbinfo->tmgr.atom_max_size); + + return 0; +} + +/* ->write_super() method. Called by sync(2). */ +static void +reiser4_write_super(struct super_block *s) +{ + int ret; + reiser4_context ctx; + + assert("vs-1700", !rofs_super(s)); + + init_context(&ctx, s); + + ret = capture_super_block(s); + if (ret != 0) + warning("vs-1701", + "capture_super_block failed in write_super: %d", ret); + ret = txnmgr_force_commit_all(s, 1); + if (ret != 0) + warning("jmacd-77113", + "txn_force failed in write_super: %d", ret); + + s->s_dirt = 0; + + reiser4_exit_context(&ctx); +} + +static void +reiser4_put_super(struct super_block *s) +{ + reiser4_super_info_data *sbinfo; + reiser4_context context; + + sbinfo = get_super_private(s); + assert("vs-1699", sbinfo); + + init_context(&context, s); + stop_ktxnmgrd(&sbinfo->tmgr); + + /* have disk format plugin to free its resources */ + if (get_super_private(s)->df_plug->release) + get_super_private(s)->df_plug->release(s); + + done_ktxnmgrd_context(&sbinfo->tmgr); + done_entd_context(s); + + check_block_counters(s); + + rcu_barrier(); + /* done_formatted_fake just has finished with last jnodes (bitmap + * ones) */ + done_tree(&sbinfo->tree); + /* call 
finish_rcu(), because some znode were "released" in + * done_tree(). */ + rcu_barrier(); + done_formatted_fake(s); + + /* no assertions below this line */ + reiser4_exit_context(&context); + + kfree(sbinfo); + s->s_fs_info = NULL; +} + +/* ->get_sb() method of file_system operations. */ +static struct super_block * +reiser4_get_sb(struct file_system_type *fs_type /* file + * system + * type */ , + int flags /* flags */ , + const char *dev_name /* device name */ , + void *data /* mount options */ ) +{ + return get_sb_bdev(fs_type, flags, dev_name, data, reiser4_fill_super); +} + +int d_cursor_init(void); +void d_cursor_done(void); + +/* + * Reiser4 initialization/shutdown. + * + * Code below performs global reiser4 initialization that is done either as + * part of kernel initialization (when reiser4 is statically built-in), or + * during reiser4 module load (when compiled as module). + */ + +/* + * Initialization stages for reiser4. + * + * These enumerate various things that have to be done during reiser4 + * startup. Initialization code (init_reiser4()) keeps track of what stage was + * reached, so that proper undo can be done if error occurs during + * initialization. 
+ */ +typedef enum { + INIT_NONE, /* nothing is initialized yet */ + INIT_INODECACHE, /* inode cache created */ + INIT_CONTEXT_MGR, /* list of active contexts created */ + INIT_ZNODES, /* znode slab created */ + INIT_PLUGINS, /* plugins initialized */ + INIT_PLUGIN_SET, /* psets initialized */ + INIT_TXN, /* transaction manager initialized */ + INIT_FAKES, /* fake inode initialized */ + INIT_JNODES, /* jnode slab initialized */ + INIT_EFLUSH, /* emergency flush initialized */ + INIT_FQS, /* flush queues initialized */ + INIT_DENTRY_FSDATA, /* dentry_fsdata slab initialized */ + INIT_FILE_FSDATA, /* file_fsdata slab initialized */ + INIT_D_CURSOR, /* d_cursor support initialized */ + INIT_FS_REGISTERED, /* reiser4 file system type registered */ +} reiser4_init_stage; + +static reiser4_init_stage init_stage; + +/* finish with reiser4: this is called either at shutdown or at module unload. */ +static void +shutdown_reiser4(void) +{ +#define DONE_IF( stage, exp ) \ + if( init_stage == ( stage ) ) { \ + exp; \ + -- init_stage; \ + } + + /* + * undo initializations already done by init_reiser4(). + */ + + DONE_IF(INIT_FS_REGISTERED, unregister_filesystem(&reiser4_fs_type)); + DONE_IF(INIT_D_CURSOR, d_cursor_done()); + DONE_IF(INIT_FILE_FSDATA, done_file_fsdata()); + DONE_IF(INIT_DENTRY_FSDATA, done_dentry_fsdata()); + DONE_IF(INIT_FQS, done_fqs()); + DONE_IF(INIT_EFLUSH, eflush_done()); + DONE_IF(INIT_JNODES, jnode_done_static()); + DONE_IF(INIT_FAKES,;); + DONE_IF(INIT_TXN, txnmgr_done_static()); + DONE_IF(INIT_PLUGIN_SET,plugin_set_done()); + DONE_IF(INIT_PLUGINS,;); + DONE_IF(INIT_ZNODES, znodes_done()); + DONE_IF(INIT_CONTEXT_MGR,;); + DONE_IF(INIT_INODECACHE, destroy_inodecache()); + assert("nikita-2516", init_stage == INIT_NONE); + +#undef DONE_IF +} + +/* initialize reiser4: this is called either at bootup or at module load. 
*/ +static int __init +init_reiser4(void) +{ +#define CHECK_INIT_RESULT( exp ) \ +({ \ + result = exp; \ + if( result == 0 ) \ + ++ init_stage; \ + else { \ + shutdown_reiser4(); \ + return result; \ + } \ +}) + + int result; + /* + printk(KERN_INFO + "Loading Reiser4. " + "See www.namesys.com for a description of Reiser4.\n"); + */ + init_stage = INIT_NONE; + + CHECK_INIT_RESULT(init_inodecache()); + CHECK_INIT_RESULT(init_context_mgr()); + CHECK_INIT_RESULT(znodes_init()); + CHECK_INIT_RESULT(init_plugins()); + CHECK_INIT_RESULT(plugin_set_init()); + CHECK_INIT_RESULT(txnmgr_init_static()); + CHECK_INIT_RESULT(init_fakes()); + CHECK_INIT_RESULT(jnode_init_static()); + CHECK_INIT_RESULT(eflush_init()); + CHECK_INIT_RESULT(init_fqs()); + CHECK_INIT_RESULT(init_dentry_fsdata()); + CHECK_INIT_RESULT(init_file_fsdata()); + CHECK_INIT_RESULT(d_cursor_init()); + CHECK_INIT_RESULT(register_filesystem(&reiser4_fs_type)); + + assert("nikita-2515", init_stage == INIT_FS_REGISTERED); + return 0; +#undef CHECK_INIT_RESULT +} + +static void __exit +done_reiser4(void) +{ + shutdown_reiser4(); +} + +reiser4_internal void reiser4_handle_error(void) +{ + struct super_block *sb = reiser4_get_current_sb(); + + if ( !sb ) + return; + reiser4_status_write(REISER4_STATUS_DAMAGED, 0, "Filesystem error occurred"); + switch ( get_super_private(sb)->onerror ) { + case 0: + reiser4_panic("foobar-42", "Filesystem error occurred\n"); + case 1: + if ( sb->s_flags & MS_RDONLY ) + return; + sb->s_flags |= MS_RDONLY; + break; + case 2: + machine_restart(NULL); + } +} + +module_init(init_reiser4); +module_exit(done_reiser4); + +MODULE_DESCRIPTION("Reiser4 filesystem"); +MODULE_AUTHOR("Hans Reiser "); + +MODULE_LICENSE("GPL"); + +/* description of the reiser4 file system type in the VFS eyes. 
*/ +static struct file_system_type reiser4_fs_type = { + .owner = THIS_MODULE, + .name = "reiser4", + .fs_flags = FS_REQUIRES_DEV, + .get_sb = reiser4_get_sb, + .kill_sb = kill_block_super,/*reiser4_kill_super,*/ + .next = NULL +}; + +struct super_operations reiser4_super_operations = { + .alloc_inode = reiser4_alloc_inode, + .destroy_inode = reiser4_destroy_inode, + .read_inode = noop_read_inode, + .dirty_inode = NULL, + .write_inode = NULL, + .put_inode = NULL, + .drop_inode = reiser4_drop_inode, + .delete_inode = reiser4_delete_inode, + .put_super = reiser4_put_super, + .write_super = reiser4_write_super, + .sync_fs = NULL, + .write_super_lockfs = NULL, + .unlockfs = NULL, + .statfs = reiser4_statfs, + .remount_fs = NULL, + .clear_inode = reiser4_clear_inode, + .umount_begin = NULL, + .sync_inodes = reiser4_sync_inodes, + .show_options = reiser4_show_options +}; + +/* + * Object serialization support. + * + * To support knfsd file system provides export_operations that are used to + * construct and interpret NFS file handles. As a generalization of this, + * reiser4 object plugins have serialization support: it provides methods to + * create on-wire representation of identity of reiser4 object, and + * re-create/locate object given its on-wire identity. + * + */ + +/* + * return number of bytes that on-wire representation of @inode's identity + * consumes. + */ +static int +encode_inode_size(struct inode *inode) +{ + assert("nikita-3514", inode != NULL); + assert("nikita-3515", inode_file_plugin(inode) != NULL); + assert("nikita-3516", inode_file_plugin(inode)->wire.size != NULL); + + return inode_file_plugin(inode)->wire.size(inode) + sizeof(d16); +} + +/* + * store on-wire representation of @inode's identity at the area beginning at + * @start. 
+ */ +static char * +encode_inode(struct inode *inode, char *start) +{ + assert("nikita-3517", inode != NULL); + assert("nikita-3518", inode_file_plugin(inode) != NULL); + assert("nikita-3519", inode_file_plugin(inode)->wire.write != NULL); + + /* + * first, store two-byte identifier of object plugin, then + */ + save_plugin_id(file_plugin_to_plugin(inode_file_plugin(inode)), + (d16 *)start); + start += sizeof(d16); + /* + * call plugin to serialize object's identity + */ + return inode_file_plugin(inode)->wire.write(inode, start); +} + +/* + * Supported file-handle types + */ +typedef enum { + FH_WITH_PARENT = 0x10, /* file handle with parent */ + FH_WITHOUT_PARENT = 0x11 /* file handle without parent */ +} reiser4_fhtype; + +#define NFSERROR (255) + +/* this stores the number of 32-bit words used by the file handle in @lenp and + * returns the file-handle type. 255 is returned if file handle can not be stored */ +static int +reiser4_encode_fh(struct dentry *dentry, __u32 *data, int *lenp, int need_parent) +{ + struct inode *inode; + struct inode *parent; + char *addr; + int need; + int delta; + int result; + reiser4_context context; + + /* + * knfsd asks us to serialize object in @dentry, and, optionally its + * parent (if need_parent != 0). + * + * encode_inode() and encode_inode_size() are used to build + * representation of object and its parent. All hard work is done by + * object plugins. + */ + + inode = dentry->d_inode; + parent = dentry->d_parent->d_inode; + + addr = (char *)data; + + need = encode_inode_size(inode); + if (need < 0) + return NFSERROR; + if (need_parent) { + delta = encode_inode_size(parent); + if (delta < 0) + return NFSERROR; + need += delta; + } + + init_context(&context, dentry->d_inode->i_sb); + + if (need <= sizeof(__u32) * (*lenp)) { + addr = encode_inode(inode, addr); + if (need_parent) + addr = encode_inode(parent, addr); + + /* store in lenp number of 32bit words required for file + * handle. */ + *lenp = (need + sizeof(__u32) - 1) >> 2; + result = need_parent ? 
FH_WITH_PARENT : FH_WITHOUT_PARENT; + } else + /* not enough space in file handle */ + result = NFSERROR; + reiser4_exit_context(&context); + return result; +} + +/* + * read serialized object identity from @addr and store information about + * object in @obj. This is dual to encode_inode(). + */ +static char * +decode_inode(struct super_block *s, char *addr, reiser4_object_on_wire *obj) +{ + file_plugin *fplug; + + /* identifier of object plugin is stored in the first two bytes, + * followed by... */ + fplug = file_plugin_by_disk_id(get_tree(s), (d16 *)addr); + if (fplug != NULL) { + addr += sizeof(d16); + obj->plugin = fplug; + assert("nikita-3520", fplug->wire.read != NULL); + /* plugin specific encoding of object identity. */ + addr = fplug->wire.read(addr, obj); + } else + addr = ERR_PTR(RETERR(-EINVAL)); + return addr; +} + +/* initialize place-holder for object */ +static void +object_on_wire_init(reiser4_object_on_wire *o) +{ + o->plugin = NULL; +} + +/* finish with @o */ +static void +object_on_wire_done(reiser4_object_on_wire *o) +{ + if (o->plugin != NULL) + o->plugin->wire.done(o); +} + +/* decode knfsd file handle. 
This is dual to reiser4_encode_fh() */ +static struct dentry * +reiser4_decode_fh(struct super_block *s, __u32 *data, + int len, int fhtype, + int (*acceptable)(void *context, struct dentry *de), + void *context) +{ + reiser4_context ctx; + reiser4_object_on_wire object; + reiser4_object_on_wire parent; + char *addr; + int with_parent; + + init_context(&ctx, s); + + assert("vs-1482", + fhtype == FH_WITH_PARENT || fhtype == FH_WITHOUT_PARENT); + + with_parent = (fhtype == FH_WITH_PARENT); + + addr = (char *)data; + + object_on_wire_init(&object); + object_on_wire_init(&parent); + + addr = decode_inode(s, addr, &object); + if (!IS_ERR(addr)) { + if (with_parent) + addr = decode_inode(s, addr, &parent); + if (!IS_ERR(addr)) { + struct dentry *d; + typeof(s->s_export_op->find_exported_dentry) fn; + + fn = s->s_export_op->find_exported_dentry; + assert("nikita-3521", fn != NULL); + d = fn(s, &object, with_parent ? &parent : NULL, + acceptable, context); + if (d != NULL && !IS_ERR(d)) + /* FIXME check for -ENOMEM */ + reiser4_get_dentry_fsdata(d)->stateless = 1; + addr = (char *)d; + } + } + + object_on_wire_done(&object); + object_on_wire_done(&parent); + + reiser4_exit_context(&ctx); + return (void *)addr; +} + +static struct dentry * +reiser4_get_dentry(struct super_block *sb, void *data) +{ + reiser4_object_on_wire *o; + + assert("nikita-3522", sb != NULL); + assert("nikita-3523", data != NULL); + /* + * this is only supposed to be called by + * + * reiser4_decode_fh->find_exported_dentry + * + * so, reiser4_context should be here already. 
+ * + */ + assert("nikita-3526", is_in_reiser4_context()); + + o = (reiser4_object_on_wire *)data; + assert("nikita-3524", o->plugin != NULL); + assert("nikita-3525", o->plugin->wire.get != NULL); + + return o->plugin->wire.get(sb, o); +} + +static struct dentry * +reiser4_get_dentry_parent(struct dentry *child) +{ + struct inode *dir; + dir_plugin *dplug; + + assert("nikita-3527", child != NULL); + /* see comment in reiser4_get_dentry() about following assertion */ + assert("nikita-3528", is_in_reiser4_context()); + + dir = child->d_inode; + assert("nikita-3529", dir != NULL); + dplug = inode_dir_plugin(dir); + assert("nikita-3531", ergo(dplug != NULL, dplug->get_parent != NULL)); + if (dplug != NULL) + return dplug->get_parent(dir); + else + return ERR_PTR(RETERR(-ENOTDIR)); +} + +struct export_operations reiser4_export_operations = { + .encode_fh = reiser4_encode_fh, + .decode_fh = reiser4_decode_fh, + .get_parent = reiser4_get_dentry_parent, + .get_dentry = reiser4_get_dentry +}; + +struct dentry_operations reiser4_dentry_operations = { + .d_revalidate = NULL, + .d_hash = NULL, + .d_compare = NULL, + .d_delete = NULL, + .d_release = reiser4_d_release, + .d_iput = NULL, +}; + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/vfs_ops.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/vfs_ops.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,139 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* vfs_ops.c's exported symbols */ + +#if !defined( __FS_REISER4_VFS_OPS_H__ ) +#define __FS_REISER4_VFS_OPS_H__ + +#include "forward.h" +#include "coord.h" +#include "seal.h" +#include "type_safe_list.h" +#include "plugin/dir/dir.h" +#include "plugin/file/file.h" +#include "super.h" +#include "readahead.h" + +#include /* for loff_t */ +#include /* for struct address_space */ +#include /* for struct dentry */ +#include +#include + +extern int reiser4_mark_inode_dirty(struct inode *object); +extern int reiser4_update_sd(struct inode *object); +extern int reiser4_add_nlink(struct inode *, struct inode *, int); +extern int reiser4_del_nlink(struct inode *, struct inode *, int); + +extern struct file_operations reiser4_file_operations; +extern struct inode_operations reiser4_inode_operations; +extern struct inode_operations reiser4_symlink_inode_operations; +extern struct inode_operations reiser4_special_inode_operations; +extern struct super_operations reiser4_super_operations; +extern struct export_operations reiser4_export_operations; +extern struct address_space_operations reiser4_as_operations; +extern struct dentry_operations reiser4_dentry_operations; +extern int reiser4_invalidatepage(struct page *page, unsigned long offset); +extern int reiser4_releasepage(struct page *page, int gfp); +extern int reiser4_writepages(struct address_space *, struct writeback_control *wbc); +extern int reiser4_start_up_io(struct page *page); +extern void move_inode_out_from_sync_inodes_loop(struct address_space * mapping); +extern void reiser4_clear_page_dirty(struct page *); +extern void 
reiser4_throttle_write(struct inode*); +/* + * this is used to speed up lookups for directory entry: on initial call to + * ->lookup() seal and coord of directory entry (if found, that is) are stored + * in struct dentry and reused later to avoid tree traversals. + */ +typedef struct de_location { + /* seal covering directory entry */ + seal_t entry_seal; + /* coord of directory entry */ + coord_t entry_coord; + /* ordinal number of directory entry among all entries with the same + key. (Starting from 0.) */ + int pos; +} de_location; + +/* &reiser4_dentry_fsdata - reiser4-specific data attached to dentries. + + This is allocated dynamically and released in d_op->d_release() + + Currently it only contains cached location (hint) of directory entry, but + it is expected that other information will be accumulated here. +*/ +typedef struct reiser4_dentry_fsdata { + /* here will go fields filled by ->lookup() to speedup next + create/unlink, like blocknr of znode with stat-data, or key + of stat-data. + */ + de_location dec; + int stateless; /* created through reiser4_decode_fh, needs special + * treatment in readdir. */ +} reiser4_dentry_fsdata; + +/* declare data types and manipulation functions for readdir list. */ +TYPE_SAFE_LIST_DECLARE(readdir); + +struct dir_cursor; + +/* &reiser4_file_fsdata - reiser4-specific data attached to files. + + This is allocated dynamically and released in reiser4_release() */ +struct reiser4_file_fsdata { + /* pointer back to the struct file which this reiser4_file_fsdata is + * part of */ + struct file *back; + /* detached cursor for stateless readdir. */ + struct dir_cursor *cursor; + /* We need both directory and regular file parts here, because there + are file system objects that are files and directories. */ + struct { + readdir_pos readdir; + readdir_list_link linkage; + } dir; + /* hints to speed up operations with regular files: read and write. 
*/ + struct { + hint_t hint; + } reg; + /* */ + struct { + /* this is called by reiser4_readpages if set */ + void (*readpages)(struct address_space *, + struct list_head *pages, + void *data); + /* reiser4_readpaextended coord. It is set by read_extent before + calling page_cache_readahead */ + void *data; + } ra2; + struct reiser4_file_ra_state ra1; + +}; + +TYPE_SAFE_LIST_DEFINE(readdir, reiser4_file_fsdata, dir.linkage); + +extern reiser4_dentry_fsdata *reiser4_get_dentry_fsdata(struct dentry *dentry); +extern reiser4_file_fsdata *reiser4_get_file_fsdata(struct file *f); +extern void reiser4_free_dentry_fsdata(struct dentry *dentry); +extern void reiser4_free_file_fsdata(struct file *f); +extern void reiser4_free_fsdata(reiser4_file_fsdata *fsdata); + +extern reiser4_file_fsdata *create_fsdata(struct file *file, int gfp); + +extern void reiser4_handle_error(void); +extern int reiser4_parse_options (struct super_block *, char *); + +/* __FS_REISER4_VFS_OPS_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/wander.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/wander.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,2185 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Reiser4 Wandering Log */ + +/* You should read http://www.namesys.com/txn-doc.html + + That describes how filesystem operations are performed as atomic + transactions, and how we try to arrange it so that we can write most of the + data only once while performing the operation atomically. + + For the purposes of this code, it is enough for it to understand that it + has been told a given block should be written either once, or twice (if + twice then once to the wandered location and once to the real location). 
+ + This code guarantees that those blocks that are defined to be part of an + atom either all take effect or none of them take effect. + + Relocate set nodes are submitted for writing by the jnode_flush() routine, and + the overwrite set is submitted by reiser4_write_log(). This is because with + the overwrite set we seek to optimize writes, and with the relocate set we + seek to cause disk order to correlate with the parent first pre-order. + + reiser4_write_log() allocates and writes wandered blocks and maintains + additional on-disk structures of the atom as wander records (each wander + record occupies one block) for storing the "wandered map" (a table which + contains a relation between wandered and real block numbers) and other + information which might be needed at transaction recovery time. + + The wander records are unidirectionally linked into a circle: each wander + record contains the block number of the next wander record, and the last wander + record points to the first one. + + One wander record (named "tx head" in this file) has a format which is + different from the other wander records. The "tx head" has a reference to the + "tx head" block of the previously committed atom. Also, "tx head" contains + fs information (the free blocks counter, and the oid allocator state) which + is logged in a special way. + + There are two journal control blocks, named journal header and journal + footer, which have fixed on-disk locations. The journal header has a + reference to the "tx head" block of the last committed atom. The journal + footer points to the "tx head" of the last flushed atom. The atom is + "played" when all blocks from its overwrite set are written to disk the + second time (i.e. written to their real locations). + + NOTE: People who know reiserfs internals and its journal structure might be + confused by the terms journal footer and journal header. 
There is a table + with terms of similar semantics in reiserfs (reiser3) and reiser4: + + REISER3 TERM | REISER4 TERM | DESCRIPTION + --------------------+-----------------------+---------------------------- + commit record | journal header | atomic write of this record + | | ends transaction commit + --------------------+-----------------------+---------------------------- + journal header | journal footer | atomic write of this record + | | ends post-commit writes. + | | After successful + | | writing of this, journal + | | blocks (in reiser3) or + | | wandered blocks/records are + | | free for re-use. + --------------------+-----------------------+---------------------------- + + The atom commit process is the following: + + 1. The overwrite set is taken from atom's clean list, and its size is + counted. + + 2. The number of necessary wander records (including tx head) is calculated, + and the wander record blocks are allocated. + + 3. Allocate wandered blocks and populate wander records with the wandered map. + + 4. Submit write requests for wander records and wandered blocks. + + 5. Wait until submitted write requests complete. + + 6. Update journal header: change the pointer to the block number of the just + written tx head, submit an i/o for the modified journal header block and wait + for i/o completion. + + NOTE: The special logging for bitmap blocks and some reiser4 super block + fields makes the processes of atom commit, flush and recovery a bit more + complex (see comments in the source code for details). + + The atom playing process is the following: + + 1. Write atom's overwrite set in-place. + + 2. Wait on i/o. + + 3. Update journal footer: change the pointer to the block number of the tx head + block of the atom we are currently flushing, submit an i/o, wait on i/o + completion. + + 4. Free disk space which was used for wandered blocks and wander records. 
+ + After the wandered blocks and wander records are freed, the journal + footer points to an on-disk structure which might be overwritten soon. + Neither the log writer nor the journal recovery procedure uses that pointer + for accessing the data. When the journal recovery procedure looks for the oldest + transaction it compares the journal footer pointer value with the "prev_tx" + pointer value in tx head; if the values are equal, the oldest not flushed + transaction is found. + + NOTE on disk space leakage: the information about which blocks and how many + blocks are allocated for wandered blocks and wander records is not written to + the disk because of special logging for bitmaps and some super block + counters. After a system crash reiser4 does not remember those + allocations, thus this kind of disk space leakage cannot occur. +*/ + +/* Special logging of reiser4 super block fields. */ + +/* There are some reiser4 super block fields (free block count and OID allocator + state (number of files and next free OID)) which are logged separately from + super block to avoid unnecessary atom fusion. + + So, the reiser4 super block need not be captured by a transaction which + allocates/deallocates disk blocks or creates/deletes file objects. Moreover, + the reiser4 on-disk super block is not touched when such a transaction is + committed and flushed. Those "counters logged specially" are logged in "tx + head" blocks and in the journal footer block. + + A step-by-step description of special logging: + + 0. The per-atom information about deleted or created files and allocated or + freed blocks is collected during the transaction. The atom's + ->nr_objects_created and ->nr_objects_deleted are for object + deletion/creation tracking, the numbers of allocated and freed blocks are + calculated using atom's delete set and atom's capture list -- all new and + relocated nodes should be on atom's clean list and should have JNODE_RELOC + bit set. + + 1. 
The "logged specially" reiser4 super block fields have their "committed" + versions in the reiser4 in-memory super block. They get modified only at + atom commit time. The atom's commit thread has exclusive access to those + "committed" fields because the log writer implementation supports only one + atom commit at a time (there is a per-fs "commit" semaphore). At + that time "committed" counters are modified using per-atom information + collected during the transaction. These counters are stored on disk as a + part of the tx head block when the atom is committed. + + 2. When the atom is flushed the value of the free block counter and the OID + allocator state get written to the journal footer block. A special journal + procedure (journal_recover_sb_data()) takes those values from the journal + footer and updates the reiser4 in-memory super block. + + NOTE: That means free block count and OID allocator state are logged + separately from the reiser4 super block regardless of the fact that the + reiser4 super block has fields to store both the free block counter and the + OID allocator. + + Writing the whole super block at commit time requires knowing true values of + all its fields without changes made by not yet committed transactions. That would be + possible by keeping a "committed" version of the super block, just as the + reiser4 bitmap blocks have "committed" and "working" versions. However, + another scheme was implemented which stores the specially logged values in the + unused free space inside the transaction head block. In my opinion it has the + advantage of not writing the whole super block when only part of it was + modified. 
*/ + +#include "debug.h" +#include "dformat.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "block_alloc.h" +#include "page_cache.h" +#include "wander.h" +#include "reiser4.h" +#include "super.h" +#include "vfs_ops.h" +#include "writeout.h" +#include "inode.h" +#include "entd.h" + +#include +#include /* for struct super_block */ +#include /* for struct page */ +#include +#include /* for struct bio */ +#include + +static int write_jnodes_to_disk_extent( + capture_list_head * head, jnode *, int, const reiser4_block_nr *, flush_queue_t *, int ); + +/* The commit_handle is a container for objects needed at atom commit time */ +struct commit_handle { + /* A pointer to the list of OVRWR nodes */ + capture_list_head * overwrite_set; + /* atom's overwrite set size */ + int overwrite_set_size; + /* jnodes for wander record blocks */ + capture_list_head tx_list; + /* number of wander records */ + int tx_size; + /* 'committed' sb counters are saved here until atom is completely + flushed */ + __u64 free_blocks; + __u64 nr_files; + __u64 next_oid; + /* A pointer to the atom which is being committed */ + txn_atom *atom; + /* A pointer to current super block */ + struct super_block *super; + /* The counter of modified bitmaps */ + reiser4_block_nr nr_bitmap; +}; + +static void +init_commit_handle(struct commit_handle *ch, txn_atom * atom) +{ + memset(ch, 0, sizeof (struct commit_handle)); + capture_list_init(&ch->tx_list); + + ch->atom = atom; + ch->super = reiser4_get_current_sb(); +} + +static void +done_commit_handle(struct commit_handle *ch UNUSED_ARG) +{ + assert("zam-690", capture_list_empty(&ch->tx_list)); +} + +/* fill journal header block data */ +static void +format_journal_header(struct commit_handle *ch) +{ + struct reiser4_super_info_data *sbinfo; + struct journal_header *header; + jnode *txhead; + + sbinfo = get_super_private(ch->super); + assert("zam-479", sbinfo != NULL); + assert("zam-480", sbinfo->journal_header != NULL); + + txhead = 
capture_list_front(&ch->tx_list); + + jload(sbinfo->journal_header); + + header = (struct journal_header *) jdata(sbinfo->journal_header); + assert("zam-484", header != NULL); + + cputod64(*jnode_get_block(txhead), &header->last_committed_tx); + + jrelse(sbinfo->journal_header); +} + +/* fill journal footer block data */ +static void +format_journal_footer(struct commit_handle *ch) +{ + struct reiser4_super_info_data *sbinfo; + struct journal_footer *footer; + + jnode *tx_head; + + sbinfo = get_super_private(ch->super); + + tx_head = capture_list_front(&ch->tx_list); + + assert("zam-493", sbinfo != NULL); + assert("zam-494", sbinfo->journal_header != NULL); + + check_me("zam-691", jload(sbinfo->journal_footer) == 0); + + footer = (struct journal_footer *) jdata(sbinfo->journal_footer); + assert("zam-495", footer != NULL); + + cputod64(*jnode_get_block(tx_head), &footer->last_flushed_tx); + cputod64(ch->free_blocks, &footer->free_blocks); + + cputod64(ch->nr_files, &footer->nr_files); + cputod64(ch->next_oid, &footer->next_oid); + + jrelse(sbinfo->journal_footer); +} + +/* wander record capacity depends on current block size */ +static int +wander_record_capacity(const struct super_block *super) +{ + return (super->s_blocksize - sizeof (struct wander_record_header)) / sizeof (struct wander_entry); +} + +/* Fill first wander record (tx head) in accordance with supplied given data */ +static void +format_tx_head(struct commit_handle *ch) +{ + jnode *tx_head; + jnode *next; + struct tx_header *header; + + tx_head = capture_list_front(&ch->tx_list); + assert("zam-692", !capture_list_end(&ch->tx_list, tx_head)); + + next = capture_list_next(tx_head); + if (capture_list_end(&ch->tx_list, next)) + next = tx_head; + + header = (struct tx_header *) jdata(tx_head); + + assert("zam-460", header != NULL); + assert("zam-462", ch->super->s_blocksize >= sizeof (struct tx_header)); + + memset(jdata(tx_head), 0, (size_t) ch->super->s_blocksize); + memcpy(jdata(tx_head), 
TX_HEADER_MAGIC, TX_HEADER_MAGIC_SIZE); + + cputod32((__u32) ch->tx_size, &header->total); + cputod64(get_super_private(ch->super)->last_committed_tx, &header->prev_tx); + cputod64(*jnode_get_block(next), &header->next_block); + + cputod64(ch->free_blocks, &header->free_blocks); + cputod64(ch->nr_files, &header->nr_files); + cputod64(ch->next_oid, &header->next_oid); +} + +/* prepare ordinary wander record block (fill all service fields) */ +static void +format_wander_record(struct commit_handle *ch, jnode * node, int serial) +{ + struct wander_record_header *LRH; + jnode *next; + + assert("zam-464", node != NULL); + + LRH = (struct wander_record_header *) jdata(node); + next = capture_list_next(node); + + if (capture_list_end(&ch->tx_list, next)) + next = capture_list_front(&ch->tx_list); + + assert("zam-465", LRH != NULL); + assert("zam-463", ch->super->s_blocksize > sizeof (struct wander_record_header)); + + memset(jdata(node), 0, (size_t) ch->super->s_blocksize); + memcpy(jdata(node), WANDER_RECORD_MAGIC, WANDER_RECORD_MAGIC_SIZE); + + cputod32((__u32) ch->tx_size, &LRH->total); + cputod32((__u32) serial, &LRH->serial); + cputod64((__u64) * jnode_get_block(next), &LRH->next_block); +} + +/* add one wandered map entry to formatted wander record */ +static void +store_entry(jnode * node, int index, const reiser4_block_nr * a, const reiser4_block_nr * b) +{ + char *data; + struct wander_entry *pairs; + + data = jdata(node); + assert("zam-451", data != NULL); + + pairs = (struct wander_entry *) (data + sizeof (struct wander_record_header)); + + cputod64(*a, &pairs[index].original); + cputod64(*b, &pairs[index].wandered); +} + +/* currently, wander records contain only the wandered map, whose size depends on + the overwrite set size */ +static void +get_tx_size(struct commit_handle *ch) +{ + assert("zam-440", ch->overwrite_set_size != 0); + assert("zam-695", ch->tx_size == 0); + + /* count all ordinary wander records + ( - 1) / + 1 and add one + for tx head block */ + 
ch->tx_size = (ch->overwrite_set_size - 1) / wander_record_capacity(ch->super) + 2; +} + +/* A special structure for using in store_wmap_actor() for saving its state + between calls */ +struct store_wmap_params { + jnode *cur; /* jnode of current wander record to fill */ + int idx; /* free element index in wander record */ + int capacity; /* capacity */ + +#if REISER4_DEBUG + capture_list_head *tx_list; +#endif +}; + +/* an actor for use in blocknr_set_iterator routine which populates the list + of pre-formatted wander records by wandered map info */ +static int +store_wmap_actor(txn_atom * atom UNUSED_ARG, const reiser4_block_nr * a, const reiser4_block_nr * b, void *data) +{ + struct store_wmap_params *params = data; + + if (params->idx >= params->capacity) { + /* a new wander record should be taken from the tx_list */ + params->cur = capture_list_next(params->cur); + assert("zam-454", !capture_list_end(params->tx_list, params->cur)); + + params->idx = 0; + } + + store_entry(params->cur, params->idx, a, b); + params->idx++; + + return 0; +} + +/* This function is called after Relocate set gets written to disk, Overwrite + set is written to wandered locations and all wander records are written + also. 
The updated journal header block contains a pointer (block number) to + the first wander record of the just written transaction */ +static int +update_journal_header(struct commit_handle *ch) +{ + struct reiser4_super_info_data *sbinfo = get_super_private(ch->super); + + jnode *jh = sbinfo->journal_header; + jnode *head = capture_list_front(&ch->tx_list); + + int ret; + + format_journal_header(ch); + + ret = write_jnodes_to_disk_extent(&ch->tx_list, jh, 1, jnode_get_block(jh), NULL, 0); + + if (ret) + return ret; + + blk_run_address_space(sbinfo->fake->i_mapping); + /*blk_run_queues();*/ + + ret = jwait_io(jh, WRITE); + + if (ret) + return ret; + + sbinfo->last_committed_tx = *jnode_get_block(head); + + return 0; +} + +/* This function is called after write-back is finished. We update the journal + footer block and free blocks which were occupied by wandered blocks and + transaction wander records */ +static int +update_journal_footer(struct commit_handle *ch) +{ + reiser4_super_info_data *sbinfo = get_super_private(ch->super); + + jnode *jf = sbinfo->journal_footer; + + int ret; + + format_journal_footer(ch); + + ret = write_jnodes_to_disk_extent(&ch->tx_list, jf, 1, jnode_get_block(jf), NULL, 0); + if (ret) + return ret; + + blk_run_address_space(sbinfo->fake->i_mapping); + /*blk_run_queue();*/ + + ret = jwait_io(jf, WRITE); + if (ret) + return ret; + + return 0; +} + +/* free the block numbers of the wander records of a transaction that has already been written in place */ +static void +dealloc_tx_list(struct commit_handle *ch) +{ + while (!capture_list_empty(&ch->tx_list)) { + jnode *cur = capture_list_pop_front(&ch->tx_list); + + ON_DEBUG(capture_list_clean(cur)); + reiser4_dealloc_block(jnode_get_block(cur), BLOCK_NOT_COUNTED, BA_FORMATTED); + + unpin_jnode_data(cur); + drop_io_head(cur); + } +} + +/* An actor for use in the blocknr_set_iterator() routine which frees wandered blocks + from atom's overwrite set. 
*/ +static int +dealloc_wmap_actor(txn_atom * atom UNUSED_ARG, + const reiser4_block_nr * a UNUSED_ARG, const reiser4_block_nr * b, void *data UNUSED_ARG) +{ + + assert("zam-499", b != NULL); + assert("zam-500", *b != 0); + assert("zam-501", !blocknr_is_fake(b)); + + reiser4_dealloc_block(b, BLOCK_NOT_COUNTED, BA_FORMATTED); + return 0; +} + +/* free wandered block locations of already written in place transaction */ +static void +dealloc_wmap(struct commit_handle *ch) +{ + assert("zam-696", ch->atom != NULL); + + blocknr_set_iterator(ch->atom, &ch->atom->wandered_map, dealloc_wmap_actor, NULL, 1); +} + +/* helper function for alloc wandered blocks, which refill set of block + numbers needed for wandered blocks */ +static int +get_more_wandered_blocks(int count, reiser4_block_nr * start, int *len) +{ + reiser4_blocknr_hint hint; + int ret; + + reiser4_block_nr wide_len = count; + + /* FIXME-ZAM: A special policy needed for allocation of wandered blocks + ZAM-FIXME-HANS: yes, what happened to our discussion of using a fixed + reserved allocation area so as to get the best qualities of fixed + journals? */ + blocknr_hint_init(&hint); + hint.block_stage = BLOCK_GRABBED; + + ret = reiser4_alloc_blocks(&hint, start, &wide_len, + BA_FORMATTED | BA_USE_DEFAULT_SEARCH_START); + + *len = (int) wide_len; + + return ret; +} + +/* + * roll back changes made before issuing BIO in the case of IO error. 
+ */ +static void +undo_bio(struct bio *bio) +{ + int i; + + for (i = 0; i < bio->bi_vcnt; ++i) { + struct page *pg; + jnode *node; + + pg = bio->bi_io_vec[i].bv_page; + ClearPageWriteback(pg); + node = jprivate(pg); + LOCK_JNODE(node); + JF_CLR(node, JNODE_WRITEBACK); + JF_SET(node, JNODE_DIRTY); + UNLOCK_JNODE(node); + } + bio_put(bio); +} + +#if REISER4_COPY_ON_CAPTURE + +extern spinlock_t scan_lock; + +/* put overwrite set back to atom's clean list */ +static void put_overwrite_set(struct commit_handle * ch) +{ + jnode * cur; + + spin_lock(&scan_lock); + cur = capture_list_front(ch->overwrite_set); + while (!capture_list_end(ch->overwrite_set, cur)) { + assert("vs-1443", NODE_LIST(cur) == OVRWR_LIST); + JF_SET(cur, JNODE_SCANNED); + spin_unlock(&scan_lock); + JF_CLR(cur, JNODE_JLOADED_BY_GET_OVERWRITE_SET); + jrelse_tail(cur); + spin_lock(&scan_lock); + JF_CLR(cur, JNODE_SCANNED); + cur = capture_list_next(cur); + } + spin_unlock(&scan_lock); +} + +/* Count overwrite set size, grab disk space for wandered blocks allocation. + Since we have a separate list for atom's overwrite set we just scan the list, + count bitmap and other not leaf nodes which wandered blocks allocation we + have to grab space for. 
*/ +static int +get_overwrite_set(struct commit_handle *ch) +{ + int ret; + jnode *cur; + __u64 nr_not_leaves = 0; +#if REISER4_DEBUG + __u64 nr_formatted_leaves = 0; + __u64 nr_unformatted_leaves = 0; +#endif + + + assert("zam-697", ch->overwrite_set_size == 0); + + ch->overwrite_set = ATOM_OVRWR_LIST(ch->atom); + + spin_lock(&scan_lock); + cur = capture_list_front(ch->overwrite_set); + + while (!capture_list_end(ch->overwrite_set, cur)) { + jnode *next; + + /* FIXME: for all but first this bit is set already */ + assert("vs-1444", NODE_LIST(cur) == OVRWR_LIST); + JF_SET(cur, JNODE_SCANNED); + next = capture_list_next(cur); + if (!capture_list_end(ch->overwrite_set, next)) + JF_SET(next, JNODE_SCANNED); + spin_unlock(&scan_lock); + + if (jnode_is_znode(cur) && znode_above_root(JZNODE(cur))) { + ON_TRACE(TRACE_LOG, "fake znode found , WANDER=(%d)\n", JF_ISSET(cur, JNODE_OVRWR)); + } + + /* Count bitmap locks for getting correct statistics what number + * of blocks were cleared by the transaction commit. */ + if (jnode_get_type(cur) == JNODE_BITMAP) + ch->nr_bitmap ++; + + assert("zam-939", JF_ISSET(cur, JNODE_OVRWR) || jnode_get_type(cur) == JNODE_BITMAP); + + if (jnode_is_znode(cur) && znode_above_root(JZNODE(cur))) { + /* we replace fake znode by another (real) + znode which is suggested by disk_layout + plugin */ + + /* FIXME: it looks like fake znode should be + replaced by jnode supplied by + disk_layout. 
*/ + + struct super_block *s = reiser4_get_current_sb(); + reiser4_super_info_data *sbinfo = get_current_super_private(); + + if (sbinfo->df_plug->log_super) { + jnode *sj = sbinfo->df_plug->log_super(s); + + assert("zam-593", sj != NULL); + + if (IS_ERR(sj)) + return PTR_ERR(sj); + + LOCK_ATOM(ch->atom); + LOCK_JNODE(sj); + JF_SET(sj, JNODE_OVRWR); + insert_into_atom_ovrwr_list(ch->atom, sj); + UNLOCK_JNODE(sj); + UNLOCK_ATOM(ch->atom); + + /* jload it as the rest of overwrite set */ + jload_gfp(sj, GFP_KERNEL, 0); + + ch->overwrite_set_size++; + } + LOCK_ATOM(ch->atom); + LOCK_JNODE(cur); + uncapture_block(cur); + UNLOCK_ATOM(ch->atom); + jput(cur); + + spin_lock(&scan_lock); + JF_CLR(cur, JNODE_SCANNED); + cur = next; + nr_not_leaves ++; + } else { + int ret; + ch->overwrite_set_size++; + ret = jload_gfp(cur, GFP_KERNEL, 0); + if (ret) + reiser4_panic("zam-783", "cannot load e-flushed jnode back (ret = %d)\n", ret); + + /* Count not leaves here because we have to grab disk + * space for wandered blocks. They were not counted as + * "flush reserved". This should be done after doing + * jload() to avoid races with emergency + * flush. Counting should be done _after_ nodes are + * pinned * into memory by jload(). */ + if (!jnode_is_leaf(cur)) + nr_not_leaves ++; + /* this is to check atom's flush reserved space for + * overwritten leaves */ + else { +#if REISER4_DEBUG + /* at this point @cur either has + * JNODE_FLUSH_RESERVED or is + * eflushed. Locking is not strong enough to + * write an assertion checking for this. */ + if (jnode_is_znode(cur)) + nr_formatted_leaves ++; + else + nr_unformatted_leaves ++; +#endif + JF_CLR(cur, JNODE_FLUSH_RESERVED); + } + spin_lock(&scan_lock); + JF_SET(cur, JNODE_JLOADED_BY_GET_OVERWRITE_SET); + assert("", cur->pg); + JF_CLR(cur, JNODE_SCANNED); + cur = next; + } + + } + spin_unlock(&scan_lock); + + /* Grab space for writing (wandered blocks) of not leaves found in + * overwrite set. 
*/ + ret = reiser4_grab_space_force(nr_not_leaves, BA_RESERVED); + if (ret) + return ret; + + /* Disk space for allocation of wandered blocks of leaf nodes already + * reserved as "flush reserved", move it to grabbed space counter. */ + spin_lock_atom(ch->atom); + assert("zam-940", nr_formatted_leaves + nr_unformatted_leaves <= ch->atom->flush_reserved); + flush_reserved2grabbed(ch->atom, ch->atom->flush_reserved); + spin_unlock_atom(ch->atom); + + return ch->overwrite_set_size; +} + +/* Submit a write request for @nr jnodes beginning from the @first, other + jnodes are after the @first on the double-linked "capture" list. All + jnodes will be written to the disk region of @nr blocks starting with + @block_p block number. If @fq is not NULL it means that waiting for i/o + completion will be done more efficiently by using flush_queue_t objects + +ZAM-FIXME-HANS: brief me on why this function exists, and why bios are +aggregated in this function instead of being left to the layers below + +FIXME: ZAM->HANS: What layer are you talking about? Can you point me to that? +Why that layer needed? Why BIOs cannot be constructed here? 
+*/ +static int +write_jnodes_to_disk_extent(capture_list_head * head, jnode * first, int nr, + const reiser4_block_nr * block_p, flush_queue_t * fq, int flags) +{ + struct super_block *super = reiser4_get_current_sb(); + int for_reclaim = flags & WRITEOUT_FOR_PAGE_RECLAIM; + int max_blocks; + jnode *cur = first; + reiser4_block_nr block; + + assert("zam-571", first != NULL); + assert("zam-572", block_p != NULL); + assert("zam-570", nr > 0); + + block = *block_p; + + ON_TRACE (TRACE_IO_W, "write of %d blocks starting from %llu\n", + nr, (unsigned long long)block); + + max_blocks = bdev_get_queue(super->s_bdev)->max_sectors >> (super->s_blocksize_bits - 9); + + while (nr > 0) { + struct bio *bio; + int nr_blocks = min(nr, max_blocks); + int i; + int nr_used; + + bio = bio_alloc(GFP_NOIO, nr_blocks); + if (!bio) + return RETERR(-ENOMEM); + + bio->bi_bdev = super->s_bdev; + bio->bi_sector = block * (super->s_blocksize >> 9); + for (nr_used = 0, i = 0; i < nr_blocks; i++) { + struct page *pg; + ON_DEBUG(int jnode_is_releasable(jnode *)); + + assert("vs-1423", ergo(jnode_is_znode(cur) || jnode_is_unformatted(cur), JF_ISSET(cur, JNODE_SCANNED))); + pg = jnode_page(cur); + assert("zam-573", pg != NULL); + + page_cache_get(pg); + + lock_and_wait_page_writeback(pg); + + LOCK_JNODE(cur); + assert("nikita-3553", jnode_page(cur) == pg); + assert("nikita-3554", jprivate(pg) == cur); + + assert("nikita-3166", + ergo(!JF_ISSET(cur, JNODE_CC), pg->mapping == jnode_get_mapping(cur))); + if (!JF_ISSET(cur, JNODE_WRITEBACK)) { + assert("nikita-3165", !jnode_is_releasable(cur)); + UNLOCK_JNODE(cur); + if (!bio_add_page(bio, + pg, super->s_blocksize, 0)) { + /* + * underlying device is satiated. Stop + * adding pages to the bio. 
+ */ + unlock_page(pg); + page_cache_release(pg); + break; + } + + LOCK_JNODE(cur); + JF_SET(cur, JNODE_WRITEBACK); + JF_CLR(cur, JNODE_DIRTY); + UNLOCK_JNODE(cur); + + SetPageWriteback(pg); + if (for_reclaim) + ent_writes_page(super, pg); + spin_lock(&pg->mapping->page_lock); + + if (REISER4_STATS && !PageDirty(pg)) + reiser4_stat_inc(pages_clean); + + /* don't check return value: submit page even if + it wasn't dirty. */ + test_clear_page_dirty(pg); + + list_del(&pg->list); + list_add(&pg->list, &pg->mapping->locked_pages); + + spin_unlock(&pg->mapping->page_lock); + + nr_used ++; + } else { + /* jnode being WRITEBACK might be replaced on + ovrwr_nodes list with jnode CC. We just + encountered this CC jnode. Do not submit i/o + for it */ + assert("zam-912", JF_ISSET(cur, JNODE_CC)); + UNLOCK_JNODE(cur); + } + unlock_page(pg); + + nr --; + cur = capture_list_next(cur); + } + if (nr_used > 0) { + assert("nikita-3455", + bio->bi_size == super->s_blocksize * nr_used); + assert("nikita-3456", bio->bi_vcnt == nr_used); + + /* Check if we are allowed to write at all */ + if (super->s_flags & MS_RDONLY) + undo_bio(bio); + else { + add_fq_to_bio(fq, bio); + reiser4_submit_bio(WRITE, bio); + } + + block += nr_used - 1; + update_blocknr_hint_default (super, &block); + block += 1; + } else { + reiser4_stat_inc(txnmgr.empty_bio); + bio_put(bio); + } + } + return 0; +} + +/* @nr jnodes starting from @j are marked as JNODE_SCANNED. 
Clear this bit for + all those jnodes */ +static void +unscan_sequence_nolock(jnode *j, int nr) +{ + int i; + + for (i = 0; i < nr; i ++) { + assert("vs-1631", JF_ISSET(j, JNODE_SCANNED)); + JF_CLR(j, JNODE_SCANNED); + j = capture_list_next(j); + } +} + +static void +unscan_sequence(jnode *j, int nr) +{ + spin_lock(&scan_lock); + unscan_sequence_nolock(j, nr); + spin_unlock(&scan_lock); +} + +/* This is a procedure which recovers a contiguous sequences of disk block + numbers in the given list of j-nodes and submits write requests on this + per-sequence basis */ +reiser4_internal int +write_jnode_list(capture_list_head *head, flush_queue_t *fq, long *nr_submitted, int flags) +{ + int ret; + jnode *beg, *end; + + spin_lock(&scan_lock); + beg = capture_list_front(head); + while (!capture_list_end(head, beg)) { + int nr = 1; + jnode *cur; + + JF_SET(beg, JNODE_SCANNED); + end = beg; + cur = capture_list_next(beg); + + while (!capture_list_end(head, cur)) { + if (*jnode_get_block(cur) != *jnode_get_block(beg) + nr) + /* jnode from which next sequence of blocks starts */ + break; + + JF_SET(cur, JNODE_SCANNED); + ++ nr; + end = cur; + cur = capture_list_next(cur); + } + spin_unlock(&scan_lock); + + ret = write_jnodes_to_disk_extent(head, beg, nr, jnode_get_block(beg), fq, flags); + if (ret) { + unscan_sequence(beg, nr); + return ret; + } + + if (nr_submitted) + *nr_submitted += nr; + + spin_lock(&scan_lock); + unscan_sequence_nolock(beg, nr); + beg = capture_list_next(end); + } + + spin_unlock(&scan_lock); + return 0; +} + +/* add given wandered mapping to atom's wandered map + this starts from jnode which is in JNODE_SCANNED state. 
*/ +static int +add_region_to_wmap(jnode * cur, int len, const reiser4_block_nr * block_p) +{ + int ret; + blocknr_set_entry *new_bsep = NULL; + reiser4_block_nr block; + int first; + txn_atom *atom; + + assert("zam-568", block_p != NULL); + block = *block_p; + assert("zam-569", len > 0); + + while ((len--) > 0) { + assert("vs-1422", JF_ISSET(cur, JNODE_SCANNED)); + + do { + atom = get_current_atom_locked(); + assert("zam-536", !blocknr_is_fake(jnode_get_block(cur))); + ret = blocknr_set_add_pair(atom, &atom->wandered_map, &new_bsep, jnode_get_block(cur), &block); + } while (ret == -E_REPEAT); + + if (ret) { + /* deallocate blocks which were not added to wandered + map */ + reiser4_block_nr wide_len = len; + + reiser4_dealloc_blocks(&block, &wide_len, BLOCK_NOT_COUNTED, + BA_FORMATTED/* formatted, without defer */); + + return ret; + } + + UNLOCK_ATOM(atom); + + cur = capture_list_next(cur); + ++block; + first = 0; + } + + return 0; +} + +/* Allocate wandered blocks for current atom's OVERWRITE SET and immediately + submit IO for allocated blocks. We assume that current atom is in a stage + when any atom fusion is impossible and atom is unlocked and it is safe. 
*/ +static int +alloc_wandered_blocks(struct commit_handle *ch, flush_queue_t * fq) +{ + reiser4_block_nr block; + + int rest; + int len, prev_len = 0, i; + int ret; + jnode *cur, *beg, *end; + + assert("zam-534", ch->overwrite_set_size > 0); + + cur = beg = end = NULL; + + for (rest = ch->overwrite_set_size; rest > 0; rest -= len) { + ret = get_more_wandered_blocks(rest, &block, &len); + if (ret) { + if (beg != NULL) + unscan_sequence_nolock(beg, prev_len); + return ret; + } + + spin_lock(&scan_lock); + if (beg == NULL) + cur = capture_list_front(ch->overwrite_set); + else { + unscan_sequence_nolock(beg, prev_len); + cur = capture_list_next(end); + } + beg = cur; + + /* mark @len jnodes starting from @cur as scanned */ + for (i = 0; i < len; i ++) { + assert("vs-1633", !capture_list_end(ch->overwrite_set, cur)); + assert("vs-1632", !JF_ISSET(cur, JNODE_SCANNED)); + JF_SET(cur, JNODE_SCANNED); + end = cur; + cur = capture_list_next(cur); + } + prev_len = len; + spin_unlock(&scan_lock); + + ret = add_region_to_wmap(beg, len, &block); + if (ret) { + unscan_sequence(beg, len); + return ret; + } + ret = write_jnodes_to_disk_extent(ch->overwrite_set, beg, len, &block, fq, 0); + if (ret) { + unscan_sequence(beg, len); + return ret; + } + assert("vs-1638", rest >= len); + } + + assert("vs-1634", rest == 0); + assert("vs-1635", beg != NULL && end != NULL); + assert("vs-1639", cur == capture_list_next(end)); + assert("vs-1636", capture_list_end(ch->overwrite_set, cur)); + unscan_sequence(beg, len); + + return 0; +} + +#else /* !REISER4_COPY_ON_CAPTURE */ + +/* put overwrite set back to atom's clean list */ +static void put_overwrite_set(struct commit_handle * ch) +{ + jnode * cur; + + for_all_type_safe_list(capture, ch->overwrite_set, cur) + jrelse_tail(cur); +} + +/* Count overwrite set size, grab disk space for wandered blocks allocation. 
+ Since we have a separate list for atom's overwrite set we just scan the list, + count bitmap and other not leaf nodes which wandered blocks allocation we + have to grab space for. */ +static int +get_overwrite_set(struct commit_handle *ch) +{ + int ret; + jnode *cur; + __u64 nr_not_leaves = 0; +#if REISER4_DEBUG + __u64 nr_formatted_leaves = 0; + __u64 nr_unformatted_leaves = 0; +#endif + + + assert("zam-697", ch->overwrite_set_size == 0); + + ch->overwrite_set = ATOM_OVRWR_LIST(ch->atom); + cur = capture_list_front(ch->overwrite_set); + + while (!capture_list_end(ch->overwrite_set, cur)) { + jnode *next = capture_list_next(cur); + + /* Count bitmap locks for getting correct statistics what number + * of blocks were cleared by the transaction commit. */ + if (jnode_get_type(cur) == JNODE_BITMAP) + ch->nr_bitmap ++; + + assert("zam-939", JF_ISSET(cur, JNODE_OVRWR) || jnode_get_type(cur) == JNODE_BITMAP); + + if (jnode_is_znode(cur) && znode_above_root(JZNODE(cur))) { + /* we replace fake znode by another (real) + znode which is suggested by disk_layout + plugin */ + + /* FIXME: it looks like fake znode should be + replaced by jnode supplied by + disk_layout. 
*/ + + struct super_block *s = reiser4_get_current_sb(); + reiser4_super_info_data *sbinfo = get_current_super_private(); + + if (sbinfo->df_plug->log_super) { + jnode *sj = sbinfo->df_plug->log_super(s); + + assert("zam-593", sj != NULL); + + if (IS_ERR(sj)) + return PTR_ERR(sj); + + LOCK_JNODE(sj); + JF_SET(sj, JNODE_OVRWR); + insert_into_atom_ovrwr_list(ch->atom, sj); + UNLOCK_JNODE(sj); + + /* jload it as the rest of overwrite set */ + jload_gfp(sj, GFP_KERNEL, 0); + + ch->overwrite_set_size++; + } + LOCK_JNODE(cur); + uncapture_block(cur); + jput(cur); + + } else { + int ret; + ch->overwrite_set_size++; + ret = jload_gfp(cur, GFP_KERNEL, 0); + if (ret) + reiser4_panic("zam-783", "cannot load e-flushed jnode back (ret = %d)\n", ret); + } + + /* Count not leaves here because we have to grab disk space + * for wandered blocks. They were not counted as "flush + * reserved". Counting should be done _after_ nodes are pinned + * into memory by jload(). */ + if (!jnode_is_leaf(cur)) + nr_not_leaves ++; + else { +#if REISER4_DEBUG + /* at this point @cur either has JNODE_FLUSH_RESERVED + * or is eflushed. Locking is not strong enough to + * write an assertion checking for this. */ + if (jnode_is_znode(cur)) + nr_formatted_leaves ++; + else + nr_unformatted_leaves ++; +#endif + JF_CLR(cur, JNODE_FLUSH_RESERVED); + } + + cur = next; + } + + /* Grab space for writing (wandered blocks) of not leaves found in + * overwrite set. */ + ret = reiser4_grab_space_force(nr_not_leaves, BA_RESERVED); + if (ret) + return ret; + + /* Disk space for allocation of wandered blocks of leaf nodes already + * reserved as "flush reserved", move it to grabbed space counter. 
*/ + spin_lock_atom(ch->atom); + assert("zam-940", nr_formatted_leaves + nr_unformatted_leaves <= ch->atom->flush_reserved); + flush_reserved2grabbed(ch->atom, ch->atom->flush_reserved); + spin_unlock_atom(ch->atom); + + return ch->overwrite_set_size; +} + +/* Submit a write request for @nr jnodes beginning from the @first, other jnodes + are after the @first on the double-linked "capture" list. All jnodes will be + written to the disk region of @nr blocks starting with @block_p block number. + If @fq is not NULL it means that waiting for i/o completion will be done more + efficiently by using flush_queue_t objects. + + This function is the one which writes list of jnodes in batch mode. It does + all low-level things as bio construction and page states manipulation. +*/ +static int +write_jnodes_to_disk_extent(capture_list_head * head, jnode * first, int nr, + const reiser4_block_nr * block_p, flush_queue_t * fq, int flags) +{ + struct super_block *super = reiser4_get_current_sb(); + int for_reclaim = flags & WRITEOUT_FOR_PAGE_RECLAIM; + int max_blocks; + jnode *cur = first; + reiser4_block_nr block; + + assert("zam-571", first != NULL); + assert("zam-572", block_p != NULL); + assert("zam-570", nr > 0); + + block = *block_p; + max_blocks = bdev_get_queue(super->s_bdev)->max_sectors >> (super->s_blocksize_bits - 9); + + while (nr > 0) { + struct bio *bio; + int nr_blocks = min(nr, max_blocks); + int i; + int nr_used; + + bio = bio_alloc(GFP_NOIO, nr_blocks); + if (!bio) + return RETERR(-ENOMEM); + + bio->bi_bdev = super->s_bdev; + bio->bi_sector = block * (super->s_blocksize >> 9); + for (nr_used = 0, i = 0; i < nr_blocks; i++) { + struct page *pg; + ON_DEBUG(int jnode_is_releasable(jnode *)); + + pg = jnode_page(cur); + assert("zam-573", pg != NULL); + + page_cache_get(pg); + + lock_and_wait_page_writeback(pg); + + if (!bio_add_page(bio, pg, super->s_blocksize, 0)) { + /* + * underlying device is satiated. Stop adding + * pages to the bio. 
+ */ + unlock_page(pg); + page_cache_release(pg); + break; + } + + LOCK_JNODE(cur); + assert("nikita-3166", pg->mapping == jnode_get_mapping(cur)); + assert("zam-912", !JF_ISSET(cur, JNODE_WRITEBACK)); + assert("nikita-3165", !jnode_is_releasable(cur)); + JF_SET(cur, JNODE_WRITEBACK); + JF_CLR(cur, JNODE_DIRTY); + UNLOCK_JNODE(cur); + + set_page_writeback(pg); + if (for_reclaim) + ent_writes_page(super, pg); + /* clear DIRTY or REISER4_MOVED tag if it is set */ + reiser4_clear_page_dirty(pg); + + unlock_page(pg); + + cur = capture_list_next(cur); + nr_used ++; + } + if (nr_used > 0) { + assert("nikita-3453", + bio->bi_size == super->s_blocksize * nr_used); + assert("nikita-3454", bio->bi_vcnt == nr_used); + + /* Check if we are allowed to write at all */ + if (super->s_flags & MS_RDONLY) + undo_bio(bio); + else { + add_fq_to_bio(fq, bio); + reiser4_submit_bio(WRITE, bio); + } + + block += nr_used - 1; + update_blocknr_hint_default (super, &block); + block += 1; + } else { + bio_put(bio); + } + nr -= nr_used; + } + + return 0; +} + +/* This is a procedure which recovers a contiguous sequences of disk block + numbers in the given list of j-nodes and submits write requests on this + per-sequence basis */ +reiser4_internal int +write_jnode_list (capture_list_head * head, flush_queue_t * fq, long *nr_submitted, int flags) +{ + int ret; + jnode *beg = capture_list_front(head); + + while (!capture_list_end(head, beg)) { + int nr = 1; + jnode *cur = capture_list_next(beg); + + while (!capture_list_end(head, cur)) { + if (*jnode_get_block(cur) != *jnode_get_block(beg) + nr) + break; + ++nr; + cur = capture_list_next(cur); + } + + ret = write_jnodes_to_disk_extent(head, beg, nr, jnode_get_block(beg), fq, flags); + if (ret) + return ret; + + if (nr_submitted) + *nr_submitted += nr; + + beg = cur; + } + + return 0; +} + +/* add given wandered mapping to atom's wandered map */ +static int +add_region_to_wmap(jnode * cur, int len, const reiser4_block_nr * block_p) +{ + int ret; 
+ blocknr_set_entry *new_bsep = NULL; + reiser4_block_nr block; + + txn_atom *atom; + + assert("zam-568", block_p != NULL); + block = *block_p; + assert("zam-569", len > 0); + + while ((len--) > 0) { + do { + atom = get_current_atom_locked(); + assert("zam-536", !blocknr_is_fake(jnode_get_block(cur))); + ret = blocknr_set_add_pair(atom, &atom->wandered_map, &new_bsep, jnode_get_block(cur), &block); + } while (ret == -E_REPEAT); + + if (ret) { + /* deallocate blocks which were not added to wandered + map */ + reiser4_block_nr wide_len = len; + + reiser4_dealloc_blocks(&block, &wide_len, BLOCK_NOT_COUNTED, + BA_FORMATTED/* formatted, without defer */); + + return ret; + } + + UNLOCK_ATOM(atom); + + cur = capture_list_next(cur); + ++block; + } + + return 0; +} + +/* Allocate wandered blocks for current atom's OVERWRITE SET and immediately + submit IO for allocated blocks. We assume that current atom is in a stage + when any atom fusion is impossible and atom is unlocked and it is safe. */ +static int +alloc_wandered_blocks(struct commit_handle *ch, flush_queue_t * fq) +{ + reiser4_block_nr block; + + int rest; + int len; + int ret; + + jnode *cur; + + assert("zam-534", ch->overwrite_set_size > 0); + + rest = ch->overwrite_set_size; + + cur = capture_list_front(ch->overwrite_set); + while (!capture_list_end(ch->overwrite_set, cur)) { + assert("zam-567", JF_ISSET(cur, JNODE_OVRWR)); + + ret = get_more_wandered_blocks(rest, &block, &len); + if (ret) + return ret; + + rest -= len; + + ret = add_region_to_wmap(cur, len, &block); + if (ret) + return ret; + + ret = write_jnodes_to_disk_extent(ch->overwrite_set, cur, len, &block, fq, 0); + if (ret) + return ret; + + while ((len--) > 0) { + assert("zam-604", !capture_list_end(ch->overwrite_set, cur)); + cur = capture_list_next(cur); + } + } + + return 0; +} + +#endif /* ! 
REISER4_COPY_ON_CAPTURE */ + +/* allocate given number of nodes over the journal area and link them into a + list, return pointer to the first jnode in the list */ +static int +alloc_tx(struct commit_handle *ch, flush_queue_t * fq) +{ + reiser4_blocknr_hint hint; + + reiser4_block_nr allocated = 0; + reiser4_block_nr first, len; + + jnode *cur; + jnode *txhead; + int ret; + + assert("zam-698", ch->tx_size > 0); + assert("zam-699", capture_list_empty(&ch->tx_list)); + + while (allocated < (unsigned) ch->tx_size) { + len = (ch->tx_size - allocated); + + blocknr_hint_init(&hint); + + hint.block_stage = BLOCK_GRABBED; + + /* FIXME: there should be some block allocation policy for + nodes which contain wander records */ + + /* We assume that disk space for wandered record blocks can be + * taken from reserved area. */ + ret = reiser4_alloc_blocks (&hint, &first, &len, + BA_FORMATTED | BA_RESERVED | BA_USE_DEFAULT_SEARCH_START); + + blocknr_hint_done(&hint); + + if (ret) + return ret; + + allocated += len; + + /* create jnodes for all wander records */ + while (len--) { + cur = alloc_io_head(&first); + + if (cur == NULL) { + ret = RETERR(-ENOMEM); + goto free_not_assigned; + } + + ret = jinit_new(cur, GFP_KERNEL); + + if (ret != 0) { + jfree(cur); + goto free_not_assigned; + } + + pin_jnode_data(cur); + + capture_list_push_back(&ch->tx_list, cur); + + first++; + } + } + + { /* format a on-disk linked list of wander records */ + int serial = 1; + + txhead = capture_list_front(&ch->tx_list); + format_tx_head(ch); + + cur = capture_list_next(txhead); + while (!capture_list_end(&ch->tx_list, cur)) { + format_wander_record(ch, cur, serial++); + cur = capture_list_next(cur); + } + + } + + { /* Fill wander records with Wandered Set */ + struct store_wmap_params params; + txn_atom *atom; + + params.cur = capture_list_next(txhead); + + params.idx = 0; + params.capacity = wander_record_capacity(reiser4_get_current_sb()); + + atom = get_current_atom_locked(); + 
blocknr_set_iterator(atom, &atom->wandered_map, &store_wmap_actor, &params, 0); + UNLOCK_ATOM(atom); + } + + { /* relse all jnodes from tx_list */ + cur = capture_list_front(&ch->tx_list); + while (!capture_list_end(&ch->tx_list, cur)) { + jrelse(cur); + cur = capture_list_next(cur); + } + } + + ret = write_jnode_list(&ch->tx_list, fq, NULL, 0); + + return ret; + +free_not_assigned: + /* We deallocate blocks not yet assigned to jnodes on tx_list. The + caller takes care of invalidating the tx list */ + reiser4_dealloc_blocks(&first, &len, BLOCK_NOT_COUNTED, BA_FORMATTED); + + return ret; +} + +/* We assume that at this moment all captured blocks are marked as RELOC or + WANDER (belong to Relocate or Overwrite set), all nodes from Relocate set + are submitted to write. +*/ + +reiser4_internal int reiser4_write_logs(long * nr_submitted) +{ + txn_atom *atom; + struct super_block *super = reiser4_get_current_sb(); + reiser4_super_info_data *sbinfo = get_super_private(super); + struct commit_handle ch; + int ret; + + writeout_mode_enable(); + + /* block allocator may add j-nodes to the clean_list */ + ret = pre_commit_hook(); + if (ret) + return ret; + + /* No locks are required if we take atom which stage >= + * ASTAGE_PRE_COMMIT */ + atom = get_current_context()->trans->atom; + assert("zam-965", atom != NULL); + + /* relocate set is on the atom->clean_nodes list after + * current_atom_complete_writes() finishes. It can be safely + * uncaptured after commit_semaphore is taken, because any atom that + * captures these nodes is guaranteed to commit after current one. + * + * This can only be done after pre_commit_hook(), because it is where + * early flushed jnodes with CREATED bit are transferred to the + * overwrite list. */ + invalidate_list(ATOM_CLEAN_LIST(atom)); + LOCK_ATOM(atom); + /* There might be waiters for the relocate nodes which we have + * released, wake them up. 
*/ + atom_send_event(atom); + UNLOCK_ATOM(atom); + + if (REISER4_DEBUG) { + int level; + + for (level = 0; level < REAL_MAX_ZTREE_HEIGHT + 1; ++ level) + assert("nikita-3352", + capture_list_empty(ATOM_DIRTY_LIST(atom, level))); + } + + sbinfo->nr_files_committed += (unsigned) atom->nr_objects_created; + sbinfo->nr_files_committed -= (unsigned) atom->nr_objects_deleted; + + init_commit_handle(&ch, atom); + + ch.free_blocks = sbinfo->blocks_free_committed; + ch.nr_files = sbinfo->nr_files_committed; + /* ZAM-FIXME-HANS: email me what the contention level is for the super + * lock. */ + ch.next_oid = oid_next(super); + + /* count overwrite set and place it in a separate list */ + ret = get_overwrite_set(&ch); + + if (ret <= 0) { + /* It is possible that overwrite set is empty here, it means + all captured nodes are clean */ + goto up_and_ret; + } + + /* Inform the caller about what number of dirty pages will be + * submitted to disk. */ + *nr_submitted += ch.overwrite_set_size - ch.nr_bitmap; + + /* count all records needed for storing of the wandered set */ + get_tx_size(&ch); + + /* Grab more space for wandered records. */ + ret = reiser4_grab_space_force((__u64)(ch.tx_size), BA_RESERVED); + if (ret) + goto up_and_ret; + + { + flush_queue_t *fq; + + fq = get_fq_for_current_atom(); + + if (IS_ERR(fq)) { + ret = PTR_ERR(fq); + goto up_and_ret; + } + + UNLOCK_ATOM(fq->atom); + + do { + ret = alloc_wandered_blocks(&ch, fq); + if (ret) + break; + + ret = alloc_tx(&ch, fq); + if (ret) + break; + } while (0); + + + /* Release all grabbed space if it was not fully used for + * wandered blocks/records allocation. 
*/ + all_grabbed2free(); + + fq_put(fq); + if (ret) + goto up_and_ret; + } + + ret = current_atom_finish_all_fq(); + if (ret) + goto up_and_ret; + + if ((ret = update_journal_header(&ch))) + goto up_and_ret; + + UNDER_SPIN_VOID(atom, atom, atom_set_stage(atom, ASTAGE_POST_COMMIT)); + + post_commit_hook(); + + { + /* force j-nodes write back */ + + flush_queue_t *fq; + + fq = get_fq_for_current_atom(); + + if (IS_ERR(fq)) { + ret = PTR_ERR(fq); + goto up_and_ret; + } + + UNLOCK_ATOM(fq->atom); + + ret = write_jnode_list(ch.overwrite_set, fq, NULL, WRITEOUT_FOR_PAGE_RECLAIM); + + fq_put(fq); + + if (ret) + goto up_and_ret; + } + + ret = current_atom_finish_all_fq(); + + if (ret) + goto up_and_ret; + + if ((ret = update_journal_footer(&ch))) + goto up_and_ret; + + post_write_back_hook(); + +up_and_ret: + if (ret) { + /* there could be fq attached to current atom; the only way to + remove them is: */ + current_atom_finish_all_fq(); + } + + /* free blocks of flushed transaction */ + dealloc_tx_list(&ch); + dealloc_wmap(&ch); + + put_overwrite_set(&ch); + + done_commit_handle(&ch); + + writeout_mode_disable(); + + return ret; +} + +/* consistency checks for journal data/control blocks: header, footer, log + records, transactions head blocks. All functions return zero on success. */ + +static int +check_journal_header(const jnode * node UNUSED_ARG) +{ + /* FIXME: journal header has no magic field yet. */ + return 0; +} + +/* wait for write completion for all jnodes from given list */ +static int +wait_on_jnode_list(capture_list_head * head) +{ + jnode *scan; + int ret = 0; + + for_all_type_safe_list(capture, head, scan) { + struct page *pg = jnode_page(scan); + + if (pg) { + if (PageWriteback(pg)) + wait_on_page_writeback(pg); + + if (PageError(pg)) + ret++; + } + } + + return ret; +} + +static int +check_journal_footer(const jnode * node UNUSED_ARG) +{ + /* FIXME: journal footer has no magic field yet. 
*/ + return 0; +} + +static int +check_tx_head(const jnode * node) +{ + struct tx_header *header = (struct tx_header *) jdata(node); + + if (memcmp(&header->magic, TX_HEADER_MAGIC, TX_HEADER_MAGIC_SIZE) != 0) { + warning("zam-627", "tx head at block %s corrupted\n", sprint_address(jnode_get_block(node))); + return RETERR(-EIO); + } + + return 0; +} + +static int +check_wander_record(const jnode * node) +{ + struct wander_record_header *RH = (struct wander_record_header *) jdata(node); + + if (memcmp(&RH->magic, WANDER_RECORD_MAGIC, WANDER_RECORD_MAGIC_SIZE) != 0) { + warning("zam-628", "wander record at block %s corrupted\n", sprint_address(jnode_get_block(node))); + return RETERR(-EIO); + } + + return 0; +} + +/* fill commit_handler structure by everything what is needed for update_journal_footer */ +static int +restore_commit_handle(struct commit_handle *ch, jnode * tx_head) +{ + struct tx_header *TXH; + int ret; + + ret = jload(tx_head); + + if (ret) + return ret; + + TXH = (struct tx_header *) jdata(tx_head); + + ch->free_blocks = d64tocpu(&TXH->free_blocks); + ch->nr_files = d64tocpu(&TXH->nr_files); + ch->next_oid = d64tocpu(&TXH->next_oid); + + jrelse(tx_head); + + capture_list_push_front(&ch->tx_list, tx_head); + + return 0; +} + +/* replay one transaction: restore and write overwrite set in place */ +static int +replay_transaction(const struct super_block *s, + jnode * tx_head, + const reiser4_block_nr * log_rec_block_p, + const reiser4_block_nr * end_block, unsigned int nr_wander_records) +{ + reiser4_block_nr log_rec_block = *log_rec_block_p; + struct commit_handle ch; + capture_list_head overwrite_set; + jnode *log; + int ret; + + init_commit_handle(&ch, NULL); + capture_list_init(&overwrite_set); + ch.overwrite_set = &overwrite_set; + + restore_commit_handle(&ch, tx_head); + + while (log_rec_block != *end_block) { + struct wander_record_header *header; + struct wander_entry *entry; + + int i; + + if (nr_wander_records == 0) { + warning("zam-631", + 
"number of wander records in the linked list" " greater than number stored in tx head.\n"); + ret = RETERR(-EIO); + goto free_ow_set; + } + + log = alloc_io_head(&log_rec_block); + if (log == NULL) + return RETERR(-ENOMEM); + + ret = jload(log); + if (ret < 0) { + drop_io_head(log); + return ret; + } + + ret = check_wander_record(log); + if (ret) { + jrelse(log); + drop_io_head(log); + return ret; + } + + header = (struct wander_record_header *) jdata(log); + log_rec_block = d64tocpu(&header->next_block); + + entry = (struct wander_entry *) (header + 1); + + /* restore overwrite set from wander record content */ + for (i = 0; i < wander_record_capacity(s); i++) { + reiser4_block_nr block; + jnode *node; + + block = d64tocpu(&entry->wandered); + + if (block == 0) + break; + + node = alloc_io_head(&block); + if (node == NULL) { + ret = RETERR(-ENOMEM); + /* + * FIXME-VS:??? + */ + jrelse(log); + drop_io_head(log); + goto free_ow_set; + } + + ret = jload(node); + + if (ret < 0) { + drop_io_head(node); + /* + * FIXME-VS:??? 
+ */ + jrelse(log); + drop_io_head(log); + goto free_ow_set; + } + + block = d64tocpu(&entry->original); + + assert("zam-603", block != 0); + + jnode_set_block(node, &block); + + capture_list_push_back(ch.overwrite_set, node); + + ++entry; + } + + jrelse(log); + drop_io_head(log); + + --nr_wander_records; + } + + if (nr_wander_records != 0) { + warning("zam-632", "number of wander records in the linked list" + " less than number stored in tx head.\n"); + ret = RETERR(-EIO); + goto free_ow_set; + } + + { /* write wandered set in place */ + write_jnode_list(ch.overwrite_set, 0, NULL, 0); + ret = wait_on_jnode_list(ch.overwrite_set); + + if (ret) { + ret = RETERR(-EIO); + goto free_ow_set; + } + } + + ret = update_journal_footer(&ch); + +free_ow_set: + + while (!capture_list_empty(ch.overwrite_set)) { + jnode *cur = capture_list_front(ch.overwrite_set); + capture_list_remove_clean (cur); + jrelse(cur); + drop_io_head(cur); + } + + capture_list_remove_clean (tx_head); + + done_commit_handle(&ch); + + return ret; +} + +/* find oldest committed and not played transaction and play it. The transaction + * was committed and journal header block was updated but the blocks from the + * process of writing the atom's overwrite set in-place and updating of journal + * footer block were not completed. This function completes the process by + * recovering the atom's overwrite set from their wandered locations and writes + * them in-place and updating the journal footer. 
*/ +static int +replay_oldest_transaction(struct super_block *s) +{ + reiser4_super_info_data *sbinfo = get_super_private(s); + jnode *jf = sbinfo->journal_footer; + unsigned int total; + struct journal_footer *F; + struct tx_header *T; + + reiser4_block_nr prev_tx; + reiser4_block_nr last_flushed_tx; + reiser4_block_nr log_rec_block = 0; + + jnode *tx_head; + + int ret; + + if ((ret = jload(jf)) < 0) + return ret; + + F = (struct journal_footer *) jdata(jf); + + last_flushed_tx = d64tocpu(&F->last_flushed_tx); + + jrelse(jf); + + if (sbinfo->last_committed_tx == last_flushed_tx) { + /* all transactions are replayed */ + return 0; + } + + prev_tx = sbinfo->last_committed_tx; + + /* searching for oldest not flushed transaction */ + while (1) { + tx_head = alloc_io_head(&prev_tx); + if (!tx_head) + return RETERR(-ENOMEM); + + ret = jload(tx_head); + if (ret < 0) { + drop_io_head(tx_head); + return ret; + } + + ret = check_tx_head(tx_head); + if (ret) { + jrelse(tx_head); + drop_io_head(tx_head); + return ret; + } + + T = (struct tx_header *) jdata(tx_head); + + prev_tx = d64tocpu(&T->prev_tx); + + if (prev_tx == last_flushed_tx) + break; + + jrelse(tx_head); + drop_io_head(tx_head); + } + + total = d32tocpu(&T->total); + log_rec_block = d64tocpu(&T->next_block); + + pin_jnode_data(tx_head); + jrelse(tx_head); + + ret = replay_transaction(s, tx_head, &log_rec_block, jnode_get_block(tx_head), total - 1); + + unpin_jnode_data(tx_head); + drop_io_head(tx_head); + + if (ret) + return ret; + return -E_REPEAT; +} + +/* The current reiser4 journal implementation is optimized not to capture the + super block when only certain super block fields are modified. Currently, the set + is (<free block count>, <OID allocator>). These fields are logged in a + special way which includes storing them in each transaction head block at + atom commit time and writing that information to journal footer block at + atom flush time. 
For getting info from journal footer block to the + in-memory super block there is a special function + reiser4_journal_recover_sb_data() which should be called after disk format + plugin re-reads super block after journal replaying. +*/ + +/* get the information from journal footer in-memory super block */ +reiser4_internal int +reiser4_journal_recover_sb_data(struct super_block *s) +{ + reiser4_super_info_data *sbinfo = get_super_private(s); + struct journal_footer *jf; + int ret; + + assert("zam-673", sbinfo->journal_footer != NULL); + + ret = jload(sbinfo->journal_footer); + if (ret != 0) + return ret; + + ret = check_journal_footer(sbinfo->journal_footer); + if (ret != 0) + goto out; + + jf = (struct journal_footer *) jdata(sbinfo->journal_footer); + + /* was there at least one flushed transaction? */ + if (d64tocpu(&jf->last_flushed_tx)) { + + /* restore free block counter logged in this transaction */ + reiser4_set_free_blocks(s, d64tocpu(&jf->free_blocks)); + + /* restore oid allocator state */ + oid_init_allocator(s, + d64tocpu(&jf->nr_files), + d64tocpu(&jf->next_oid)); + } +out: + jrelse(sbinfo->journal_footer); + return ret; +} + +/* reiser4 replay journal procedure */ +reiser4_internal int +reiser4_journal_replay(struct super_block *s) +{ + reiser4_super_info_data *sbinfo = get_super_private(s); + jnode *jh, *jf; + + struct journal_header *header; + int nr_tx_replayed = 0; + + int ret; + + assert("zam-582", sbinfo != NULL); + + jh = sbinfo->journal_header; + jf = sbinfo->journal_footer; + + if (!jh || !jf) { + /* it is possible that disk layout does not support journal + structures, we just warn about this */ + warning("zam-583", + "journal control blocks were not loaded by disk layout plugin. " + "journal replaying is not possible.\n"); + return 0; + } + + /* Take free block count from journal footer block. 
The free block + counter value corresponds the last flushed transaction state */ + ret = jload(jf); + if (ret < 0) + return ret; + + ret = check_journal_footer(jf); + if (ret) { + jrelse(jf); + return ret; + } + + jrelse(jf); + + /* store last committed transaction info in reiser4 in-memory super + block */ + ret = jload(jh); + if (ret < 0) + return ret; + + ret = check_journal_header(jh); + if (ret) { + jrelse(jh); + return ret; + } + + header = (struct journal_header *) jdata(jh); + sbinfo->last_committed_tx = d64tocpu(&header->last_committed_tx); + + jrelse(jh); + + /* replay committed transactions */ + while ((ret = replay_oldest_transaction(s)) == -E_REPEAT) + nr_tx_replayed++; + + return ret; +} +/* load journal control block (either journal header or journal footer block) */ +static int +load_journal_control_block(jnode ** node, const reiser4_block_nr * block) +{ + int ret; + + *node = alloc_io_head(block); + if (!(*node)) + return RETERR(-ENOMEM); + + ret = jload(*node); + + if (ret) { + drop_io_head(*node); + *node = NULL; + return ret; + } + + pin_jnode_data(*node); + jrelse(*node); + + return 0; +} + +/* unload journal header or footer and free jnode */ +static void +unload_journal_control_block(jnode ** node) +{ + if (*node) { + unpin_jnode_data(*node); + drop_io_head(*node); + *node = NULL; + } +} + +/* release journal control blocks */ +reiser4_internal void +done_journal_info(struct super_block *s) +{ + reiser4_super_info_data *sbinfo = get_super_private(s); + + assert("zam-476", sbinfo != NULL); + + unload_journal_control_block(&sbinfo->journal_header); + unload_journal_control_block(&sbinfo->journal_footer); +} + +/* load journal control blocks */ +reiser4_internal int +init_journal_info(struct super_block *s) +{ + reiser4_super_info_data *sbinfo = get_super_private(s); + journal_location *loc; + int ret; + + loc = &sbinfo->jloc; + + assert("zam-651", loc != NULL); + assert("zam-652", loc->header != 0); + assert("zam-653", loc->footer != 0); + + 
ret = load_journal_control_block(&sbinfo->journal_header, &loc->header); + + if (ret) + return ret; + + ret = load_journal_control_block(&sbinfo->journal_footer, &loc->footer); + + if (ret) { + unload_journal_control_block(&sbinfo->journal_header); + } + + return ret; +} + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 80 + End: +*/ diff -puN /dev/null fs/reiser4/wander.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/wander.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,135 @@ +/* Copyright 2002, 2003 by Hans Reiser, licensing governed by reiser4/README */ + +#if !defined (__FS_REISER4_WANDER_H__) +#define __FS_REISER4_WANDER_H__ + +#include "dformat.h" + +#include <linux/fs.h> /* for struct super_block */ + +/* REISER4 JOURNAL ON-DISK DATA STRUCTURES */ + +#define TX_HEADER_MAGIC "TxMagic4" +#define WANDER_RECORD_MAGIC "LogMagc4" + +#define TX_HEADER_MAGIC_SIZE (8) +#define WANDER_RECORD_MAGIC_SIZE (8) + +/* journal header block format */ +struct journal_header { + /* last written transaction head location */ + d64 last_committed_tx; +}; + +typedef struct journal_location { + reiser4_block_nr footer; + reiser4_block_nr header; +} journal_location; + +/* The wander.c head comment describes usage and semantics of all these structures */ +/* journal footer block format */ +struct journal_footer { + /* last flushed transaction location. */ + /* This block number is no longer valid after the transaction it points + to gets flushed, this number is used only at journal replaying time + for detection of the end of on-disk list of committed transactions + which were not flushed completely */ + d64 last_flushed_tx; + + /* free block counter is written in journal footer at transaction + flushing, not in super block because free blocks counter is logged + in a different way than super block fields (root pointer, for + example). 
*/ + d64 free_blocks; + + /* number of used OIDs and maximal used OID are logged separately from + super block */ + d64 nr_files; + d64 next_oid; +}; + +/* Each wander record (except the first one) has unified format with wander + record header followed by an array of log entries */ +struct wander_record_header { + /* when there is no predefined location for wander records, this magic + string should help reiser4fsck. */ + char magic[WANDER_RECORD_MAGIC_SIZE]; + + /* transaction id */ + d64 id; + + /* total number of wander records in current transaction */ + d32 total; + + /* this block number in transaction */ + d32 serial; + + /* block number of the next wander record in the transaction */ + d64 next_block; +}; + +/* The first wander record (transaction head) of written transaction has the + special format */ +struct tx_header { + /* magic string makes first block in transaction different from other + logged blocks, it should help fsck. */ + char magic[TX_HEADER_MAGIC_SIZE]; + + /* transaction id */ + d64 id; + + /* total number of records (including this first tx head) in the + transaction */ + d32 total; + + /* align next field to 8-byte boundary; this field always is zero */ + d32 padding; + + /* block number of previous transaction head */ + d64 prev_tx; + + /* next wander record location */ + d64 next_block; + + /* committed versions of free blocks counter */ + d64 free_blocks; + + /* number of used OIDs (nr_files) and maximal used OID are logged + separately from super block */ + d64 nr_files; + d64 next_oid; +}; + +/* A transaction gets written to disk as a set of wander records (each wander + record size is fs block) */ + +/* As stated above, the rest of a wander record after its header is filled with these log entries; unused space is filled + with zeroes */ +struct wander_entry { + d64 original; /* block original location */ + d64 wandered; /* block wandered location */ +}; + +/* REISER4 JOURNAL WRITER FUNCTIONS */ + +extern int reiser4_write_logs(long *); +extern int 
reiser4_journal_replay(struct super_block *); +extern int reiser4_journal_recover_sb_data(struct super_block *); + +extern int init_journal_info(struct super_block *); +extern void done_journal_info(struct super_block *); + +extern int write_jnode_list (capture_list_head*, flush_queue_t*, long*, int); + +#endif /* __FS_REISER4_WANDER_H__ */ + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 80 + scroll-step: 1 + End: +*/ diff -puN /dev/null fs/reiser4/writeout.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/writeout.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,21 @@ +/* Copyright 2002, 2003, 2004 by Hans Reiser, licensing governed by reiser4/README */ + +#if !defined (__FS_REISER4_WRITEOUT_H__) + +#define WRITEOUT_SINGLE_STREAM (0x1) +#define WRITEOUT_FOR_PAGE_RECLAIM (0x2) + +extern int get_writeout_flags(void); + +#endif /* __FS_REISER4_WRITEOUT_H__ */ + + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 80 + End: +*/ diff -puN /dev/null fs/reiser4/znode.c --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/znode.c Mon Jun 13 15:05:23 2005 @@ -0,0 +1,1141 @@ +/* Copyright 2001, 2002, 2003 by Hans Reiser, licensing governed by + * reiser4/README */ +/* Znode manipulation functions. */ +/* Znode is the in-memory header for a tree node. It is stored + separately from the node itself so that it does not get written to + disk. In this respect znode is like buffer head or page head. We + also use znodes for additional reiser4 specific purposes: + + . they are organized into tree structure which is a part of whole + reiser4 tree. + . they are used to implement node grained locking + . they are used to keep additional state associated with a + node + . 
they contain links to lists used by the transaction manager + + Znode is attached to some variable "block number" which is instance of + fs/reiser4/tree.h:reiser4_block_nr type. Znode can exist without + appropriate node being actually loaded in memory. Existence of znode itself + is regulated by reference count (->x_count) in it. Each time thread + acquires reference to znode through call to zget(), ->x_count is + incremented and decremented on call to zput(). Data (content of node) are + brought in memory through call to zload(), which also increments ->d_count + reference counter. zload can block waiting on IO. Call to zrelse() + decreases this counter. Also, ->c_count keeps track of number of child + znodes and prevents parent znode from being recycled until all of its + children are. ->c_count is decremented whenever child goes out of existence + (being actually recycled in zdestroy()) which can be some time after last + reference to this child dies if we support some form of LRU cache for + znodes. + +*/ +/* EVERY ZNODE'S STORY + + 1. His infancy. + + Once upon a time, the znode was born deep inside of zget() by call to + zalloc(). At the return from zget() znode had: + + . reference counter (x_count) of 1 + . assigned block number, marked as used in bitmap + . pointer to parent znode. Root znode parent pointer points + to its father: "fake" znode. This, in turn, has NULL parent pointer. + . hash table linkage + . no data loaded from disk + . no node plugin + . no sibling linkage + + 2. His childhood + + Each node is either brought into memory as a result of tree traversal, or + created afresh, creation of the root being a special case of the latter. In + either case it's inserted into sibling list. This will typically require + some ancillary tree traversing, but ultimately both sibling pointers will + exist and JNODE_LEFT_CONNECTED and JNODE_RIGHT_CONNECTED will be true in + zjnode.state. + + 3. His youth. 
+ + If znode is bound to already existing node in a tree, its content is read + from the disk by call to zload(). At that moment, JNODE_LOADED bit is set + in zjnode.state and zdata() function starts to return non null for this + znode. zload() further calls zparse() that determines which node layout + this node is rendered in, and sets ->nplug on success. + + If znode is for new node just created, memory for it is allocated and + zinit_new() function is called to initialise data, according to selected + node layout. + + 4. His maturity. + + After this point, znode lingers in memory for some time. Threads can + acquire references to znode either by blocknr through call to zget(), or by + following a pointer to unallocated znode from internal item. Each time + reference to znode is obtained, x_count is increased. Thread can read/write + lock znode. Znode data can be loaded through calls to zload(), d_count will + be increased appropriately. If all references to znode are released + (x_count drops to 0), znode is not recycled immediately. Rather, it is + still cached in the hash table in the hope that it will be accessed + shortly. + + There are two ways in which znode existence can be terminated: + + . sudden death: node bound to this znode is removed from the tree + . overpopulation: znode is purged out of memory due to memory pressure + + 5. His death. + + Death is complex process. + + When we irrevocably commit ourselves to decision to remove node from the + tree, JNODE_HEARD_BANSHEE bit is set in zjnode.state of corresponding + znode. This is done either in ->kill_hook() of internal item or in + kill_root() function when tree root is removed. + + At this moment znode still has: + + . locks held on it, necessary write ones + . references to it + . disk block assigned to it + . data loaded from the disk + . pending requests for lock + + But once JNODE_HEARD_BANSHEE bit set, last call to unlock_znode() does node + deletion. Node deletion includes two phases. 
First, all ways to get + references to that znode (sibling and parent links and hash lookup using + block number stored in parent node) should be deleted. This is done through + sibling_list_remove(); we also assume that nobody uses the down link from + the parent node, either because it does not exist or because the parent node is + properly locked, and that nobody uses parent pointers from children, because + there are no children. Second, we + invalidate all pending lock requests which are still on the znode's lock + request queue; this is done by invalidate_lock(). Another znode status bit, + JNODE_IS_DYING, is used to invalidate pending lock requests. Once it is set, + all requesters are forced to return -EINVAL from + longterm_lock_znode(). Future locking attempts are not possible because all + ways to get references to that znode are removed already. Last, the node is + uncaptured from the transaction. + + When the last reference to the dying znode is just about to be released, + the block number for this znode is released and the znode is removed from the + hash table. + + Now the znode can be recycled. + + [it's possible to free bitmap block and remove znode from the hash + table when last lock is released. This will result in having + referenced but completely orphaned znode] + + 6. Limbo + + As has been mentioned above, znodes with reference counter 0 are + still cached in a hash table. Once memory pressure increases they are + purged out of there [this requires something like an LRU list for + efficient implementation. An LRU list would also greatly simplify + implementation of the coord cache, which would in this case morph to just + scanning some initial segment of the LRU list]. Data loaded into + an unreferenced znode are flushed back to the durable storage if + necessary and memory is freed. Znodes themselves can be recycled at + this point too.
+ +*/ + +#include "debug.h" +#include "dformat.h" +#include "key.h" +#include "coord.h" +#include "plugin/plugin_header.h" +#include "plugin/node/node.h" +#include "plugin/plugin.h" +#include "txnmgr.h" +#include "jnode.h" +#include "znode.h" +#include "block_alloc.h" +#include "tree.h" +#include "tree_walk.h" +#include "super.h" +#include "reiser4.h" + +#include +#include +#include +#include + +static z_hash_table *get_htable(reiser4_tree *, const reiser4_block_nr * const blocknr); +static z_hash_table *znode_get_htable(const znode *); +static void zdrop(znode *); + +/* hash table support */ + +/* compare two block numbers for equality. Used by hash-table macros */ +static inline int +blknreq(const reiser4_block_nr * b1, const reiser4_block_nr * b2) +{ + assert("nikita-534", b1 != NULL); + assert("nikita-535", b2 != NULL); + + return *b1 == *b2; +} + +/* Hash znode by block number. Used by hash-table macros */ +/* Audited by: umka (2002.06.11) */ +static inline __u32 +blknrhashfn(z_hash_table *table, const reiser4_block_nr * b) +{ + assert("nikita-536", b != NULL); + + return *b & (REISER4_ZNODE_HASH_TABLE_SIZE - 1); +} + +/* The hash table definition */ +#define KMALLOC(size) reiser4_kmalloc((size), GFP_KERNEL) +#define KFREE(ptr, size) reiser4_kfree(ptr) +TYPE_SAFE_HASH_DEFINE(z, znode, reiser4_block_nr, zjnode.key.z, zjnode.link.z, blknrhashfn, blknreq); +#undef KFREE +#undef KMALLOC + +/* slab for znodes */ +static kmem_cache_t *znode_slab; + +int znode_shift_order; + +/* ZNODE INITIALIZATION */ + +/* call this once on reiser4 initialisation */ +reiser4_internal int +znodes_init(void) +{ + znode_slab = kmem_cache_create("znode", sizeof (znode), 0, + SLAB_HWCACHE_ALIGN|SLAB_RECLAIM_ACCOUNT, + NULL, NULL); + if (znode_slab == NULL) { + return RETERR(-ENOMEM); + } else { + for (znode_shift_order = 0; + (1 << znode_shift_order) < sizeof(znode); + ++ znode_shift_order) + ; + -- znode_shift_order; + return 0; + } +} + +/* call this before unloading reiser4 */ 
+reiser4_internal int +znodes_done(void) +{ + return kmem_cache_destroy(znode_slab); +} + +/* call this to initialise tree of znodes */ +reiser4_internal int +znodes_tree_init(reiser4_tree * tree /* tree to initialise znodes for */ ) +{ + int result; + assert("umka-050", tree != NULL); + + rw_dk_init(tree); + + result = z_hash_init(&tree->zhash_table, REISER4_ZNODE_HASH_TABLE_SIZE); + if (result != 0) + return result; + result = z_hash_init(&tree->zfake_table, REISER4_ZNODE_HASH_TABLE_SIZE); + return result; +} + +/* free this znode */ +reiser4_internal void +zfree(znode * node /* znode to free */ ) +{ + assert("nikita-465", node != NULL); + assert("nikita-2120", znode_page(node) == NULL); + assert("nikita-2301", owners_list_empty(&node->lock.owners)); + assert("nikita-2302", requestors_list_empty(&node->lock.requestors)); + assert("nikita-2663", capture_list_is_clean(ZJNODE(node)) && NODE_LIST(ZJNODE(node)) == NOT_CAPTURED); + assert("nikita-2773", !JF_ISSET(ZJNODE(node), JNODE_EFLUSH)); + assert("nikita-3220", list_empty(&ZJNODE(node)->jnodes)); + assert("nikita-3293", !znode_is_right_connected(node)); + assert("nikita-3294", !znode_is_left_connected(node)); + assert("nikita-3295", node->left == NULL); + assert("nikita-3296", node->right == NULL); + + + /* not yet phash_jnode_destroy(ZJNODE(node)); */ + + /* poison memory. */ + ON_DEBUG(memset(node, 0xde, sizeof *node)); + kmem_cache_free(znode_slab, node); +} + +/* call this to free tree of znodes */ +reiser4_internal void +znodes_tree_done(reiser4_tree * tree /* tree to finish with znodes of */ ) +{ + znode *node; + znode *next; + z_hash_table *ztable; + + /* scan znode hash-tables and kill all znodes, then free hash tables + * themselves. 
*/ + + assert("nikita-795", tree != NULL); + + ztable = &tree->zhash_table; + + for_all_in_htable(ztable, z, node, next) { + node->c_count = 0; + node->in_parent.node = NULL; + assert("nikita-2179", atomic_read(&ZJNODE(node)->x_count) == 0); + zdrop(node); + } + + z_hash_done(&tree->zhash_table); + + ztable = &tree->zfake_table; + + for_all_in_htable(ztable, z, node, next) { + node->c_count = 0; + node->in_parent.node = NULL; + assert("nikita-2179", atomic_read(&ZJNODE(node)->x_count) == 0); + zdrop(node); + } + + z_hash_done(&tree->zfake_table); +} + +/* ZNODE STRUCTURES */ + +/* allocate fresh znode */ +reiser4_internal znode * +zalloc(int gfp_flag /* allocation flag */ ) +{ + znode *node; + + node = kmem_cache_alloc(znode_slab, gfp_flag); + return node; +} + +/* Initialize fields of znode + @node: znode to initialize; + @parent: parent znode; + @tree: tree we are in. */ +reiser4_internal void +zinit(znode * node, const znode * parent, reiser4_tree * tree) +{ + assert("nikita-466", node != NULL); + assert("umka-268", current_tree != NULL); + + memset(node, 0, sizeof *node); + + assert("umka-051", tree != NULL); + + jnode_init(&node->zjnode, tree, JNODE_FORMATTED_BLOCK); + reiser4_init_lock(&node->lock); + init_parent_coord(&node->in_parent, parent); +} + +/* + * remove znode from indices. This is called by jput() when the last reference to + * the znode is released. + */ +reiser4_internal void +znode_remove(znode * node /* znode to remove */ , reiser4_tree * tree) +{ + assert("nikita-2108", node != NULL); + assert("nikita-470", node->c_count == 0); + assert("zam-879", rw_tree_is_write_locked(tree)); + + /* remove reference to this znode from cbk cache */ + cbk_cache_invalidate(node, tree); + + /* update c_count of parent */ + if (znode_parent(node) != NULL) { + assert("nikita-472", znode_parent(node)->c_count > 0); + /* father, onto your hands I forward my spirit... */ + znode_parent(node)->c_count --; + node->in_parent.node = NULL; + } else { + /* orphaned znode?! Root?
*/ + } + + /* remove znode from hash-table */ + z_hash_remove_rcu(znode_get_htable(node), node); +} + +/* zdrop() -- Remove znode from the tree. + + This is called when znode is removed from the memory. */ +static void +zdrop(znode * node /* znode to finish with */ ) +{ + jdrop(ZJNODE(node)); +} + +/* + * put znode into right place in the hash table. This is called by relocate + * code. + */ +reiser4_internal int +znode_rehash(znode * node /* node to rehash */ , + const reiser4_block_nr * new_block_nr /* new block number */ ) +{ + z_hash_table *oldtable; + z_hash_table *newtable; + reiser4_tree *tree; + + assert("nikita-2018", node != NULL); + + tree = znode_get_tree(node); + oldtable = znode_get_htable(node); + newtable = get_htable(tree, new_block_nr); + + WLOCK_TREE(tree); + /* remove znode from hash-table */ + z_hash_remove_rcu(oldtable, node); + + /* assertion no longer valid due to RCU */ + /* assert("nikita-2019", z_hash_find(newtable, new_block_nr) == NULL); */ + + /* update blocknr */ + znode_set_block(node, new_block_nr); + node->zjnode.key.z = *new_block_nr; + + /* insert it into hash */ + z_hash_insert_rcu(newtable, node); + WUNLOCK_TREE(tree); + return 0; +} + +/* ZNODE LOOKUP, GET, PUT */ + +/* zlook() - get znode with given block_nr in a hash table or return NULL + + If result is non-NULL then the znode's x_count is incremented. Internal version + accepts pre-computed hash index. The hash table is accessed under caller's + tree->hash_lock. 
+*/ +reiser4_internal znode * +zlook(reiser4_tree * tree, const reiser4_block_nr * const blocknr) +{ + znode *result; + __u32 hash; + z_hash_table *htable; + + assert("jmacd-506", tree != NULL); + assert("jmacd-507", blocknr != NULL); + + htable = get_htable(tree, blocknr); + hash = blknrhashfn(htable, blocknr); + + rcu_read_lock(); + result = z_hash_find_index(htable, hash, blocknr); + + if (result != NULL) { + add_x_ref(ZJNODE(result)); + result = znode_rip_check(tree, result); + } + rcu_read_unlock(); + + return result; +} + +/* return hash table where znode with block @blocknr is (or should be) + * stored */ +static z_hash_table * +get_htable(reiser4_tree * tree, const reiser4_block_nr * const blocknr) +{ + z_hash_table *table; + if (is_disk_addr_unallocated(blocknr)) + table = &tree->zfake_table; + else + table = &tree->zhash_table; + return table; +} + +/* return hash table where znode @node is (or should be) stored */ +static z_hash_table * +znode_get_htable(const znode *node) +{ + return get_htable(znode_get_tree(node), znode_get_block(node)); +} + +/* zget() - get znode from hash table, allocating it if necessary. + + First a call to zlook, locating a x-referenced znode if one + exists. If znode is not found, allocate new one and return. Result + is returned with x_count reference increased. + + LOCKS TAKEN: TREE_LOCK, ZNODE_LOCK + LOCK ORDERING: NONE +*/ +reiser4_internal znode * +zget(reiser4_tree * tree, + const reiser4_block_nr * const blocknr, + znode * parent, + tree_level level, + int gfp_flag) +{ + znode *result; + __u32 hashi; + + z_hash_table *zth; + + assert("jmacd-512", tree != NULL); + assert("jmacd-513", blocknr != NULL); + assert("jmacd-514", level < REISER4_MAX_ZTREE_HEIGHT); + + zth = get_htable(tree, blocknr); + hashi = blknrhashfn(zth, blocknr); + + /* NOTE-NIKITA address-as-unallocated-blocknr still is not + implemented. */ + + z_hash_prefetch_bucket(zth, hashi); + + rcu_read_lock(); + /* Find a matching BLOCKNR in the hash table. 
If the znode is found, + we obtain an reference (x_count) but the znode remains unlocked. + Have to worry about race conditions later. */ + result = z_hash_find_index(zth, hashi, blocknr); + /* According to the current design, the hash table lock protects new + znode references. */ + if (result != NULL) { + add_x_ref(ZJNODE(result)); + /* NOTE-NIKITA it should be so, but special case during + creation of new root makes such assertion highly + complicated. */ + assert("nikita-2131", 1 || znode_parent(result) == parent || + (ZF_ISSET(result, JNODE_ORPHAN) && (znode_parent(result) == NULL))); + result = znode_rip_check(tree, result); + } + + rcu_read_unlock(); + + if (!result) { + znode * shadow; + + result = zalloc(gfp_flag); + if (!result) { + return ERR_PTR(RETERR(-ENOMEM)); + } + + zinit(result, parent, tree); + ZJNODE(result)->blocknr = *blocknr; + ZJNODE(result)->key.z = *blocknr; + result->level = level; + + WLOCK_TREE(tree); + + shadow = z_hash_find_index(zth, hashi, blocknr); + if (unlikely(shadow != NULL && !ZF_ISSET(shadow, JNODE_RIP))) { + jnode_list_remove(ZJNODE(result)); + zfree(result); + result = shadow; + } else { + result->version = znode_build_version(tree); + z_hash_insert_index_rcu(zth, hashi, result); + + if (parent != NULL) + ++ parent->c_count; + } + + add_x_ref(ZJNODE(result)); + + WUNLOCK_TREE(tree); + } + +#if REISER4_DEBUG + if (!blocknr_is_fake(blocknr) && *blocknr != 0) + reiser4_check_block(blocknr, 1); +#endif + /* Check for invalid tree level, return -EIO */ + if (unlikely(znode_get_level(result) != level)) { + warning("jmacd-504", + "Wrong level for cached block %llu: %i expecting %i", + (unsigned long long)(*blocknr), znode_get_level(result), level); + zput(result); + return ERR_PTR(RETERR(-EIO)); + } + + assert("nikita-1227", znode_invariant(result)); + + return result; +} + +/* ZNODE PLUGINS/DATA */ + +/* "guess" plugin for node loaded from the disk. 
Plugin id of node plugin is + stored at the fixed offset from the beginning of the node. */ +static node_plugin * +znode_guess_plugin(const znode * node /* znode to guess + * plugin of */ ) +{ + reiser4_tree * tree; + + assert("nikita-1053", node != NULL); + assert("nikita-1055", zdata(node) != NULL); + + tree = znode_get_tree(node); + assert("umka-053", tree != NULL); + + if (reiser4_is_set(tree->super, REISER4_ONE_NODE_PLUGIN)) { + return tree->nplug; + } else { + return node_plugin_by_disk_id + (tree, &((common_node_header *) zdata(node))->plugin_id); +#ifdef GUESS_EXISTS + reiser4_plugin *plugin; + + /* NOTE-NIKITA add locking here when dynamic plugins will be + * implemented */ + for_all_plugins(REISER4_NODE_PLUGIN_TYPE, plugin) { + if ((plugin->u.node.guess != NULL) && plugin->u.node.guess(node)) + return plugin; + } + warning("nikita-1057", "Cannot guess node plugin"); + print_znode("node", node); + return NULL; +#endif + } +} + +/* parse node header and install ->node_plugin */ +reiser4_internal int +zparse(znode * node /* znode to parse */ ) +{ + int result; + + assert("nikita-1233", node != NULL); + assert("nikita-2370", zdata(node) != NULL); + + if (node->nplug == NULL) { + node_plugin *nplug; + + nplug = znode_guess_plugin(node); + if (likely(nplug != NULL)) { + result = nplug->parse(node); + if (likely(result == 0)) + node->nplug = nplug; + } else { + result = RETERR(-EIO); + } + } else + result = 0; + return result; +} + +/* zload with readahead */ +reiser4_internal int +zload_ra(znode * node /* znode to load */, ra_info_t *info) +{ + int result; + + assert("nikita-484", node != NULL); + assert("nikita-1377", znode_invariant(node)); + assert("jmacd-7771", !znode_above_root(node)); + assert("nikita-2125", atomic_read(&ZJNODE(node)->x_count) > 0); + assert("nikita-3016", schedulable()); + + if (info) + formatted_readahead(node, info); + + result = jload(ZJNODE(node)); + assert("nikita-1378", znode_invariant(node)); + return result; +} + +/* load content 
of node into memory */ +reiser4_internal int zload(znode * node) +{ + return zload_ra(node, 0); +} + +/* call node plugin to initialise newly allocated node. */ +reiser4_internal int +zinit_new(znode * node /* znode to initialise */, int gfp_flags ) +{ + return jinit_new(ZJNODE(node), gfp_flags); +} + +/* drop reference to node data. When last reference is dropped, data are + unloaded. */ +reiser4_internal void +zrelse(znode * node /* znode to release references to */ ) +{ + assert("nikita-1381", znode_invariant(node)); + + jrelse(ZJNODE(node)); +} + +/* returns free space in node */ +reiser4_internal unsigned +znode_free_space(znode * node /* znode to query */ ) +{ + assert("nikita-852", node != NULL); + return node_plugin_by_node(node)->free_space(node); +} + +/* right delimiting key of znode */ +reiser4_internal reiser4_key * +znode_get_rd_key(znode * node /* znode to query */ ) +{ + assert("nikita-958", node != NULL); + assert("nikita-1661", rw_dk_is_locked(znode_get_tree(node))); + assert("nikita-3067", LOCK_CNT_GTZ(rw_locked_dk)); + assert("nikita-30671", node->rd_key_version != 0); + return &node->rd_key; +} + +/* left delimiting key of znode */ +reiser4_internal reiser4_key * +znode_get_ld_key(znode * node /* znode to query */ ) +{ + assert("nikita-974", node != NULL); + assert("nikita-1662", rw_dk_is_locked(znode_get_tree(node))); + assert("nikita-3068", LOCK_CNT_GTZ(rw_locked_dk)); + assert("nikita-30681", node->ld_key_version != 0); + return &node->ld_key; +} + +ON_DEBUG(atomic_t delim_key_version = ATOMIC_INIT(0);) + +/* update right-delimiting key of @node */ +reiser4_internal reiser4_key * +znode_set_rd_key(znode * node, const reiser4_key * key) +{ + assert("nikita-2937", node != NULL); + assert("nikita-2939", key != NULL); + assert("nikita-2938", rw_dk_is_write_locked(znode_get_tree(node))); + assert("nikita-3069", LOCK_CNT_GTZ(write_locked_dk)); + assert("nikita-2944", + znode_is_any_locked(node) || + znode_get_level(node) != LEAF_LEVEL || +
keyge(key, &node->rd_key) || + keyeq(&node->rd_key, min_key()) || + ZF_ISSET(node, JNODE_HEARD_BANSHEE)); + + node->rd_key = *key; + ON_DEBUG(node->rd_key_version = atomic_inc_return(&delim_key_version)); + return &node->rd_key; +} + + +/* update left-delimiting key of @node */ +reiser4_internal reiser4_key * +znode_set_ld_key(znode * node, const reiser4_key * key) +{ + assert("nikita-2940", node != NULL); + assert("nikita-2941", key != NULL); + assert("nikita-2942", rw_dk_is_write_locked(znode_get_tree(node))); + assert("nikita-3070", LOCK_CNT_GTZ(write_locked_dk > 0)); + assert("nikita-2943", + znode_is_any_locked(node) || + keyeq(&node->ld_key, min_key())); + + node->ld_key = *key; + ON_DEBUG(node->ld_key_version = atomic_inc_return(&delim_key_version)); + return &node->ld_key; +} + +/* true if @key is inside key range for @node */ +reiser4_internal int +znode_contains_key(znode * node /* znode to look in */ , + const reiser4_key * key /* key to look for */ ) +{ + assert("nikita-1237", node != NULL); + assert("nikita-1238", key != NULL); + + /* left_delimiting_key <= key <= right_delimiting_key */ + return keyle(znode_get_ld_key(node), key) && keyle(key, znode_get_rd_key(node)); +} + +/* same as znode_contains_key(), but lock dk lock */ +reiser4_internal int +znode_contains_key_lock(znode * node /* znode to look in */ , + const reiser4_key * key /* key to look for */ ) +{ + assert("umka-056", node != NULL); + assert("umka-057", key != NULL); + + return UNDER_RW(dk, znode_get_tree(node), + read, znode_contains_key(node, key)); +} + +/* get parent pointer, assuming tree is not locked */ +reiser4_internal znode * +znode_parent_nolock(const znode * node /* child znode */ ) +{ + assert("nikita-1444", node != NULL); + return node->in_parent.node; +} + +/* get parent pointer of znode */ +reiser4_internal znode * +znode_parent(const znode * node /* child znode */ ) +{ + assert("nikita-1226", node != NULL); + assert("nikita-1406", LOCK_CNT_GTZ(rw_locked_tree)); + return 
znode_parent_nolock(node); +} + +/* detect uber znode used to protect in-superblock tree root pointer */ +reiser4_internal int +znode_above_root(const znode * node /* znode to query */ ) +{ + assert("umka-059", node != NULL); + + return disk_addr_eq(&ZJNODE(node)->blocknr, &UBER_TREE_ADDR); +} + +/* check that @node is root, i.e. that its block number is recorded in the tree as + that of root node */ +#if REISER4_DEBUG +static int +znode_is_true_root(const znode * node /* znode to query */ ) +{ + assert("umka-060", node != NULL); + assert("umka-061", current_tree != NULL); + + return disk_addr_eq(znode_get_block(node), &znode_get_tree(node)->root_block); +} +#endif + +/* check that @node is root */ +reiser4_internal int +znode_is_root(const znode * node /* znode to query */ ) +{ + assert("nikita-1206", node != NULL); + + return znode_get_level(node) == znode_get_tree(node)->height; +} + +/* Returns true if @node was just created by zget() and wasn't ever loaded + into memory. */ +/* NIKITA-HANS: yes */ +reiser4_internal int +znode_just_created(const znode * node) +{ + assert("nikita-2188", node != NULL); + return (znode_page(node) == NULL); +} + +/* obtain updated ->znode_epoch. See seal.c for description.
*/ +reiser4_internal __u64 +znode_build_version(reiser4_tree * tree) +{ + return UNDER_SPIN(epoch, tree, ++tree->znode_epoch); +} + +reiser4_internal void +init_load_count(load_count * dh) +{ + assert("nikita-2105", dh != NULL); + memset(dh, 0, sizeof *dh); +} + +reiser4_internal void +done_load_count(load_count * dh) +{ + assert("nikita-2106", dh != NULL); + if (dh->node != NULL) { + for (; dh->d_ref > 0; --dh->d_ref) + zrelse(dh->node); + dh->node = NULL; + } +} + +static int +incr_load_count(load_count * dh) +{ + int result; + + assert("nikita-2110", dh != NULL); + assert("nikita-2111", dh->node != NULL); + + result = zload(dh->node); + if (result == 0) + ++dh->d_ref; + return result; +} + +reiser4_internal int +incr_load_count_znode(load_count * dh, znode * node) +{ + assert("nikita-2107", dh != NULL); + assert("nikita-2158", node != NULL); + assert("nikita-2109", ergo(dh->node != NULL, (dh->node == node) || (dh->d_ref == 0))); + + dh->node = node; + return incr_load_count(dh); +} + +reiser4_internal int +incr_load_count_jnode(load_count * dh, jnode * node) +{ + if (jnode_is_znode(node)) { + return incr_load_count_znode(dh, JZNODE(node)); + } + return 0; +} + +reiser4_internal void +copy_load_count(load_count * new, load_count * old) +{ + int ret = 0; + done_load_count(new); + new->node = old->node; + new->d_ref = 0; + + while ((new->d_ref < old->d_ref) && (ret = incr_load_count(new)) == 0) { + } + + assert("jmacd-87589", ret == 0); +} + +reiser4_internal void +move_load_count(load_count * new, load_count * old) +{ + done_load_count(new); + new->node = old->node; + new->d_ref = old->d_ref; + old->node = NULL; + old->d_ref = 0; +} + +/* convert parent pointer into coord */ +reiser4_internal void +parent_coord_to_coord(const parent_coord_t * pcoord, coord_t * coord) +{ + assert("nikita-3204", pcoord != NULL); + assert("nikita-3205", coord != NULL); + + coord_init_first_unit_nocheck(coord, pcoord->node); + coord_set_item_pos(coord, pcoord->item_pos); + 
coord->between = AT_UNIT; +} + +/* pack coord into parent_coord_t */ +reiser4_internal void +coord_to_parent_coord(const coord_t * coord, parent_coord_t * pcoord) +{ + assert("nikita-3206", pcoord != NULL); + assert("nikita-3207", coord != NULL); + + pcoord->node = coord->node; + pcoord->item_pos = coord->item_pos; +} + +/* Initialize a parent hint pointer. (parent hint pointer is a field in znode, + look for comments there) */ +reiser4_internal void +init_parent_coord(parent_coord_t * pcoord, const znode * node) +{ + pcoord->node = (znode *) node; + pcoord->item_pos = (unsigned short)~0; +} + + +#if REISER4_DEBUG +int jnode_invariant_f(const jnode * node, char const **msg); + +/* debugging aid: znode invariant */ +static int +znode_invariant_f(const znode * node /* znode to check */ , + char const **msg /* where to store error + * message, if any */ ) +{ +#define _ergo(ant, con) \ + ((*msg) = "{" #ant "} ergo {" #con "}", ergo((ant), (con))) + +#define _equi(e1, e2) \ + ((*msg) = "{" #e1 "} <=> {" #e2 "}", equi((e1), (e2))) + +#define _check(exp) ((*msg) = #exp, (exp)) + + return + jnode_invariant_f(ZJNODE(node), msg) && + + /* [znode-fake] invariant */ + + /* fake znode doesn't have a parent, and */ + _ergo(znode_get_level(node) == 0, znode_parent(node) == NULL) && + /* there is another way to express this very check, and */ + _ergo(znode_above_root(node), + znode_parent(node) == NULL) && + /* it has special block number, and */ + _ergo(znode_get_level(node) == 0, + disk_addr_eq(znode_get_block(node), &UBER_TREE_ADDR)) && + /* it is the only znode with such block number, and */ + _ergo(!znode_above_root(node) && znode_is_loaded(node), + !disk_addr_eq(znode_get_block(node), &UBER_TREE_ADDR)) && + /* it is parent of the tree root node */ + _ergo(znode_is_true_root(node), znode_above_root(znode_parent(node))) && + + /* [znode-level] invariant */ + + /* level of parent znode is one larger than that of child, + except for the fake znode, and */ + 
_ergo(znode_parent(node) && !znode_above_root(znode_parent(node)), + znode_get_level(znode_parent(node)) == + znode_get_level(node) + 1) && + /* left neighbor is at the same level, and */ + _ergo(znode_is_left_connected(node) && node->left != NULL, + znode_get_level(node) == znode_get_level(node->left)) && + /* right neighbor is at the same level */ + _ergo(znode_is_right_connected(node) && node->right != NULL, + znode_get_level(node) == znode_get_level(node->right)) && + + /* [znode-connected] invariant */ + + _ergo(node->left != NULL, znode_is_left_connected(node)) && + _ergo(node->right != NULL, znode_is_right_connected(node)) && + _ergo(!znode_is_root(node) && node->left != NULL, + znode_is_right_connected(node->left) && + node->left->right == node) && + _ergo(!znode_is_root(node) && node->right != NULL, + znode_is_left_connected(node->right) && + node->right->left == node) && + + /* [znode-c_count] invariant */ + + /* for any znode, c_count of its parent is greater than 0 */ + _ergo(znode_parent(node) != NULL && + !znode_above_root(znode_parent(node)), + znode_parent(node)->c_count > 0) && + /* leaves don't have children */ + _ergo(znode_get_level(node) == LEAF_LEVEL, + node->c_count == 0) && + + _check(node->zjnode.jnodes.prev != NULL) && + _check(node->zjnode.jnodes.next != NULL) && + /* orphan doesn't have a parent */ + _ergo(ZF_ISSET(node, JNODE_ORPHAN), znode_parent(node) == 0) && + + /* [znode-modify] invariant */ + + /* if znode is not write-locked, its checksum remains + * invariant */ + /* unfortunately, zlock is unordered w.r.t. jnode_lock, so we + * cannot check this. 
*/ + /* [znode-refs] invariant */ + /* only referenced znode can be long-term locked */ + _ergo(znode_is_locked(node), + atomic_read(&ZJNODE(node)->x_count) != 0); +} + +/* debugging aid: check znode invariant and print a warning if it doesn't hold */ +int +znode_invariant(const znode * node /* znode to check */ ) +{ + char const *failed_msg; + int result; + + assert("umka-063", node != NULL); + assert("umka-064", current_tree != NULL); + + spin_lock_znode((znode *) node); + RLOCK_TREE(znode_get_tree(node)); + result = znode_invariant_f(node, &failed_msg); + if (!result) { + /* print_znode("corrupted node", node); */ + warning("jmacd-555", "Condition %s failed", failed_msg); + } + RUNLOCK_TREE(znode_get_tree(node)); + spin_unlock_znode((znode *) node); + return result; +} + +/* debugging aid: output human readable information about @node */ +static void +info_znode(const char *prefix /* prefix to print */ , + const znode * node /* node to print */ ) +{ + if (node == NULL) { + return; + } + info_jnode(prefix, ZJNODE(node)); + if (!jnode_is_znode(ZJNODE(node))) + return; + + printk("c_count: %i, readers: %i, items: %i\n", + node->c_count, node->lock.nr_readers, node->nr_items); +} + +/* debugging aid: output more human readable information about @node than + info_znode().
*/ +reiser4_internal void +print_znode(const char *prefix /* prefix to print */ , + const znode * node /* node to print */ ) +{ + if (node == NULL) { + printk("%s: null\n", prefix); + return; + } + + info_znode(prefix, node); + if (!jnode_is_znode(ZJNODE(node))) + return; + info_znode("\tparent", znode_parent_nolock(node)); + info_znode("\tleft", node->left); + info_znode("\tright", node->right); + print_key("\tld", &node->ld_key); + print_key("\trd", &node->rd_key); + printk("\n"); +} + +/* print all znodes in @tree */ +reiser4_internal void +print_znodes(const char *prefix, reiser4_tree * tree) +{ + znode *node; + znode *next; + z_hash_table *htable; + int tree_lock_taken; + + if (tree == NULL) + tree = current_tree; + + /* this is debugging function. It can be called by reiser4_panic() + with tree spin-lock already held. Trylock is not exactly what we + want here, but it is passable. + */ + tree_lock_taken = write_trylock_tree(tree); + + htable = &tree->zhash_table; + for_all_in_htable(htable, z, node, next) { + info_znode(prefix, node); + } + + htable = &tree->zfake_table; + for_all_in_htable(htable, z, node, next) { + info_znode(prefix, node); + } + + if (tree_lock_taken) + WUNLOCK_TREE(tree); +} + +/* return non-0 iff data are loaded into znode */ +reiser4_internal int +znode_is_loaded(const znode * node /* znode to query */ ) +{ + assert("nikita-497", node != NULL); + return jnode_is_loaded(ZJNODE(node)); +} + +reiser4_internal unsigned long +znode_times_locked(const znode *z) +{ + return z->times_locked; +} + +#endif /* REISER4_DEBUG */ + +/* Make Linus happy. 
+ Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ diff -puN /dev/null fs/reiser4/znode.h --- /dev/null Thu Apr 11 07:25:15 2002 +++ 25-akpm/fs/reiser4/znode.h Mon Jun 13 15:05:23 2005 @@ -0,0 +1,451 @@ +/* Copyright 2001, 2002, 2003, 2004 by Hans Reiser, licensing governed by + * reiser4/README */ + +/* Declaration of znode (Zam's node). See znode.c for more details. */ + +#ifndef __ZNODE_H__ +#define __ZNODE_H__ + +#include "forward.h" +#include "debug.h" +#include "dformat.h" +#include "spin_macros.h" +#include "key.h" +#include "coord.h" +#include "type_safe_list.h" +#include "plugin/node/node.h" +#include "jnode.h" +#include "lock.h" +#include "readahead.h" + +#include +#include +#include /* for PAGE_CACHE_SIZE */ +#include +#include + +/* znode tracks its position within parent (internal item in a parent node, + * that contains znode's block number). */ +typedef struct parent_coord { + znode *node; + pos_in_node_t item_pos; +} parent_coord_t; + +/* &znode - node in a reiser4 tree. + + NOTE-NIKITA fields in this struct have to be rearranged (later) to reduce + cacheline pressure. + + Locking: + + Long term: data in a disk node attached to this znode are protected + by long term, deadlock aware lock ->lock; + + Spin lock: the following fields are protected by the spin lock: + + ->lock + + Following fields are protected by the global tree lock: + + ->left + ->right + ->in_parent + ->c_count + + Following fields are protected by the global delimiting key lock (dk_lock): + + ->ld_key (to update ->ld_key long-term lock on the node is also required) + ->rd_key + + Following fields are protected by the long term lock: + + ->nr_items + + ->node_plugin is never changed once set. This means that after code made + itself sure that field is valid it can be accessed without any additional + locking. + + ->level is immutable. 
+ + Invariants involving this data-type: + + [znode-fake] + [znode-level] + [znode-connected] + [znode-c_count] + [znode-refs] + [jnode-refs] + [jnode-queued] + [znode-modify] + + For this to be made into a clustering or NUMA filesystem, we would want to eliminate all of the global locks. + Suggestions for how to do that are desired.*/ +struct znode { + /* Embedded jnode. */ + jnode zjnode; + + /* contains two subfields, node and item_pos (see parent_coord_t). + + item_pos is only a hint that is cached to + speed up lookups during balancing. It is not required to be up to + date. Synched in find_child_ptr(). + + This value allows us to avoid expensive binary searches. + + in_parent.node points to the parent of this node, and is NOT a + hint. + */ + parent_coord_t in_parent; + + /* + * sibling list pointers + */ + + /* left-neighbor */ + znode *left; + /* right-neighbor */ + znode *right; + + /* long term lock on node content. This lock supports deadlock + detection. See lock.c + */ + zlock lock; + + /* You cannot remove from memory a node that has children in + memory. This is because we rely on the fact that parent of given + node can always be reached without blocking for io. When reading a + node into memory you must increase the c_count of its parent, when + removing it from memory you must decrease the c_count. This makes + the code simpler, and the cases where it is suboptimal are truly + obscure. + */ + int c_count; + + /* plugin of node attached to this znode. NULL if znode is not + loaded. */ + node_plugin *nplug; + + /* version of znode data. This is increased on each modification. This + * is necessary to implement seals (see seal.[ch]) efficiently. */ + __u64 version; + + /* left delimiting key. Necessary to efficiently perform + balancing with node-level locking. Kept in memory only. */ + reiser4_key ld_key; + /* right delimiting key. 
*/ + reiser4_key rd_key; + + /* znode's tree level */ + __u16 level; + /* number of items in this node. This field is modified by node + * plugin. */ + __u16 nr_items; + +#if REISER4_DEBUG + void *creator; + reiser4_key first_key; + unsigned long times_locked; + int left_version; /* when node->left was updated */ + int right_version; /* when node->right was updated */ + int ld_key_version; /* when node->ld_key was updated */ + int rd_key_version; /* when node->rd_key was updated */ +#endif + +} __attribute__((aligned(16))); + +ON_DEBUG(extern atomic_t delim_key_version;) + +/* In general I think these macros should not be exposed. */ +#define znode_is_locked(node) (lock_is_locked(&node->lock)) +#define znode_is_rlocked(node) (lock_is_rlocked(&node->lock)) +#define znode_is_wlocked(node) (lock_is_wlocked(&node->lock)) +#define znode_is_wlocked_once(node) (lock_is_wlocked_once(&node->lock)) +#define znode_can_be_rlocked(node) (lock_can_be_rlocked(&node->lock)) +#define is_lock_compatible(node, mode) (lock_mode_compatible(&node->lock, mode)) + +/* Macros for accessing the znode state. 
*/ +#define ZF_CLR(p,f) JF_CLR (ZJNODE(p), (f)) +#define ZF_ISSET(p,f) JF_ISSET(ZJNODE(p), (f)) +#define ZF_SET(p,f) JF_SET (ZJNODE(p), (f)) + +extern znode *zget(reiser4_tree * tree, const reiser4_block_nr * const block, + znode * parent, tree_level level, int gfp_flag); +extern znode *zlook(reiser4_tree * tree, const reiser4_block_nr * const block); +extern int zload(znode * node); +extern int zload_ra(znode * node, ra_info_t *info); +extern int zinit_new(znode * node, int gfp_flags); +extern void zrelse(znode * node); +extern void znode_change_parent(znode * new_parent, reiser4_block_nr * block); + +/* size of data in znode */ +static inline unsigned +znode_size(const znode * node UNUSED_ARG /* znode to query */ ) +{ + assert("nikita-1416", node != NULL); + return PAGE_CACHE_SIZE; +} + +extern void parent_coord_to_coord(const parent_coord_t *pcoord, coord_t *coord); +extern void coord_to_parent_coord(const coord_t *coord, parent_coord_t *pcoord); +extern void init_parent_coord(parent_coord_t * pcoord, const znode * node); + +extern unsigned znode_free_space(znode * node); + +extern reiser4_key *znode_get_rd_key(znode * node); +extern reiser4_key *znode_get_ld_key(znode * node); + +extern reiser4_key *znode_set_rd_key(znode * node, const reiser4_key * key); +extern reiser4_key *znode_set_ld_key(znode * node, const reiser4_key * key); + +/* `connected' state checks */ +static inline int +znode_is_right_connected(const znode * node) +{ + return ZF_ISSET(node, JNODE_RIGHT_CONNECTED); +} + +static inline int +znode_is_left_connected(const znode * node) +{ + return ZF_ISSET(node, JNODE_LEFT_CONNECTED); +} + +static inline int +znode_is_connected(const znode * node) +{ + return znode_is_right_connected(node) && znode_is_left_connected(node); +} + +extern int znode_rehash(znode * node, const reiser4_block_nr * new_block_nr); +extern void znode_remove(znode *, reiser4_tree *); +extern znode *znode_parent(const znode * node); +extern znode *znode_parent_nolock(const znode 
* node); +extern int znode_above_root(const znode * node); +extern int znodes_init(void); +extern int znodes_done(void); +extern int znodes_tree_init(reiser4_tree * ztree); +extern void znodes_tree_done(reiser4_tree * ztree); +extern int znode_contains_key(znode * node, const reiser4_key * key); +extern int znode_contains_key_lock(znode * node, const reiser4_key * key); +extern unsigned znode_save_free_space(znode * node); +extern unsigned znode_recover_free_space(znode * node); + +extern int znode_just_created(const znode * node); + +extern void zfree(znode * node); + +/* +#define znode_pre_write(n) noop +#define znode_post_write(n) noop +#define znode_set_checksum(n, l) noop +#define znode_at_read(n) (1) +*/ + +#if REISER4_DEBUG +extern void print_znode(const char *prefix, const znode * node); +extern void print_znodes(const char *prefix, reiser4_tree * tree); +extern void print_lock_stack(const char *prefix, lock_stack * owner); +#else +#define print_znode( p, n ) noop +#define print_znodes( p, t ) noop +#define print_lock_stack( p, o ) noop +#endif + +/* Make it look like various znode functions exist instead of treating znodes as + jnodes in znode-specific code. 
*/ +#define znode_page(x) jnode_page ( ZJNODE(x) ) +#define zdata(x) jdata ( ZJNODE(x) ) +#define znode_get_block(x) jnode_get_block ( ZJNODE(x) ) +#define znode_created(x) jnode_created ( ZJNODE(x) ) +#define znode_set_created(x) jnode_set_created ( ZJNODE(x) ) +#define znode_convertible(x) jnode_convertible (ZJNODE(x)) +#define znode_set_convertible(x) jnode_set_convertible (ZJNODE(x)) + +#define znode_is_dirty(x) jnode_is_dirty ( ZJNODE(x) ) +#define znode_check_dirty(x) jnode_check_dirty ( ZJNODE(x) ) +#define znode_make_clean(x) jnode_make_clean ( ZJNODE(x) ) +#define znode_set_block(x, b) jnode_set_block ( ZJNODE(x), (b) ) + +#define spin_lock_znode(x) LOCK_JNODE ( ZJNODE(x) ) +#define spin_unlock_znode(x) UNLOCK_JNODE ( ZJNODE(x) ) +#define spin_trylock_znode(x) spin_trylock_jnode ( ZJNODE(x) ) +#define spin_znode_is_locked(x) spin_jnode_is_locked ( ZJNODE(x) ) +#define spin_znode_is_not_locked(x) spin_jnode_is_not_locked ( ZJNODE(x) ) + +#if REISER4_DEBUG +extern int znode_x_count_is_protected(const znode * node); +extern int znode_invariant(const znode * node); +#endif + +/* acquire reference to @node */ +static inline znode * +zref(znode * node) +{ + /* change of x_count from 0 to 1 is protected by tree spin-lock */ + return JZNODE(jref(ZJNODE(node))); +} + +/* release reference to @node */ +static inline void +zput(znode * node) +{ + assert("nikita-3564", znode_invariant(node)); + jput(ZJNODE(node)); +} + +/* get the level field for a znode */ +static inline tree_level +znode_get_level(const znode * node) +{ + return node->level; +} + +/* get the level field for a jnode */ +static inline tree_level +jnode_get_level(const jnode * node) +{ + if (jnode_is_znode(node)) + return znode_get_level(JZNODE(node)); + else + /* unformatted nodes are all at the LEAF_LEVEL and for + "semi-formatted" nodes like bitmaps, level doesn't matter. 
*/ + return LEAF_LEVEL; +} + +/* true if jnode is on leaf level */ +static inline int jnode_is_leaf(const jnode * node) +{ + if (jnode_is_znode(node)) + return (znode_get_level(JZNODE(node)) == LEAF_LEVEL); + if (jnode_get_type(node) == JNODE_UNFORMATTED_BLOCK) + return 1; + return 0; +} + +/* return znode's tree */ +static inline reiser4_tree * +znode_get_tree(const znode * node) +{ + assert("nikita-2692", node != NULL); + return jnode_get_tree(ZJNODE(node)); +} + +/* resolve race with zput */ +static inline znode * +znode_rip_check(reiser4_tree *tree, znode * node) +{ + jnode *j; + + j = jnode_rip_sync(tree, ZJNODE(node)); + if (likely(j != NULL)) + node = JZNODE(j); + else + node = NULL; + return node; +} + +#if defined(REISER4_DEBUG) || defined(REISER4_DEBUG_MODIFY) || defined(REISER4_DEBUG_OUTPUT) +int znode_is_loaded(const znode * node /* znode to query */ ); +#endif + +extern __u64 znode_build_version(reiser4_tree * tree); + +/* Data-handles. A data handle object manages pairing calls to zload() and zrelse(). We + must load the data for a node in many places. We could do this by simply calling + zload() everywhere; the difficulty arises when we must release the loaded data by + calling zrelse(). In a function with many possible error/return paths, it requires extra + work to figure out which exit paths must call zrelse() and which must not. The data + handle automatically calls zrelse() for every zload() that it is responsible for. In that + sense, it acts much like a lock_handle. +*/ +typedef struct load_count { + znode *node; + int d_ref; +} load_count; + +extern void init_load_count(load_count * lc); /* Initialize a load_count: set the current node to NULL. */ +extern void done_load_count(load_count * dh); /* Finalize a load_count: call zrelse() if necessary */ +extern int incr_load_count_znode(load_count * dh, znode * node); /* Set the argument znode to the current node, call zload(). 
*/ +extern int incr_load_count_jnode(load_count * dh, jnode * node); /* If the argument jnode is formatted, do the same as + * incr_load_count_znode, otherwise do nothing (unformatted nodes + * don't require zload/zrelse treatment). */ +extern void move_load_count(load_count * new, load_count * old); /* Move the contents of a load_count. Old handle is released. */ +extern void copy_load_count(load_count * new, load_count * old); /* Copy the contents of a load_count. Old handle remains held. */ + +/* Variable initializers for load_count. */ +#define INIT_LOAD_COUNT ( load_count ){ .node = NULL, .d_ref = 0 } +#define INIT_LOAD_COUNT_NODE( n ) ( load_count ){ .node = ( n ), .d_ref = 0 } +/* A convenience macro for use in assertions or debug-only code, where loaded + data is only required to perform the debugging check. This macro + encapsulates an expression inside a pair of calls to zload()/zrelse(). */ +#define WITH_DATA( node, exp ) \ +({ \ + long __with_dh_result; \ + znode *__with_dh_node; \ + \ + __with_dh_node = ( node ); \ + __with_dh_result = zload( __with_dh_node ); \ + if( __with_dh_result == 0 ) { \ + __with_dh_result = ( long )( exp ); \ + zrelse( __with_dh_node ); \ + } \ + __with_dh_result; \ +}) + +/* Same as above, but accepts a return value in case zload fails. 
*/ +#define WITH_DATA_RET( node, ret, exp ) \ +({ \ + int __with_dh_result; \ + znode *__with_dh_node; \ + \ + __with_dh_node = ( node ); \ + __with_dh_result = zload( __with_dh_node ); \ + if( __with_dh_result == 0 ) { \ + __with_dh_result = ( int )( exp ); \ + zrelse( __with_dh_node ); \ + } else \ + __with_dh_result = ( ret ); \ + __with_dh_result; \ +}) + +#define WITH_COORD(coord, exp) \ +({ \ + coord_t *__coord; \ + \ + __coord = (coord); \ + coord_clear_iplug(__coord); \ + WITH_DATA(__coord->node, exp); \ +}) + + +#if REISER4_DEBUG +#define STORE_COUNTERS \ + lock_counters_info __entry_counters = *lock_counters() +#define CHECK_COUNTERS \ +ON_DEBUG_CONTEXT( \ +({ \ + __entry_counters.x_refs = lock_counters() -> x_refs; \ + __entry_counters.t_refs = lock_counters() -> t_refs; \ + __entry_counters.d_refs = lock_counters() -> d_refs; \ + assert("nikita-2159", \ + !memcmp(&__entry_counters, lock_counters(), \ + sizeof __entry_counters)); \ +}) ) + +#else +#define STORE_COUNTERS +#define CHECK_COUNTERS noop +#endif + +/* __ZNODE_H__ */ +#endif + +/* Make Linus happy. + Local variables: + c-indentation-style: "K&R" + mode-name: "LC" + c-basic-offset: 8 + tab-width: 8 + fill-column: 120 + End: +*/ _
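Note on the data handles declared in znode.h: the comment there says a load_count pairs every zload() with a zrelse() so that functions with many exit paths need only one cleanup call. The following is a minimal userspace sketch of that idea, not the reiser4 implementation. The znode here is a stand-in that only counts loads, fail_load and use_node() are hypothetical names invented for the illustration, and the bodies of init_load_count(), incr_load_count_znode() and done_load_count() are assumptions, since the header only declares them.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the kernel znode: it only counts outstanding loads. */
typedef struct znode {
	int loads;     /* zload() calls not yet matched by zrelse() */
	int fail_load; /* hypothetical knob: make the stub zload() fail */
} znode;

/* Stub zload(): 0 on success, negative on failure. */
static int zload(znode *node)
{
	if (node->fail_load)
		return -1;
	node->loads++;
	return 0;
}

/* Stub zrelse(): undoes one successful zload(). */
static void zrelse(znode *node)
{
	assert(node->loads > 0);
	node->loads--;
}

/* Same layout as the load_count declared in znode.h. */
typedef struct load_count {
	znode *node;
	int d_ref;
} load_count;

static void init_load_count(load_count *lc)
{
	lc->node = NULL;
	lc->d_ref = 0;
}

/* Assumed behaviour: remember the node, count each successful zload(). */
static int incr_load_count_znode(load_count *dh, znode *node)
{
	int ret;

	assert(dh->node == NULL || dh->node == node);
	ret = zload(node);
	if (ret == 0) {
		dh->node = node;
		dh->d_ref++;
	}
	return ret;
}

/* Assumed behaviour: one zrelse() per zload() the handle recorded. */
static void done_load_count(load_count *dh)
{
	for (; dh->d_ref > 0; dh->d_ref--)
		zrelse(dh->node);
	dh->node = NULL;
}

/* Hypothetical caller with several exit paths and one cleanup point. */
static int use_node(znode *node, int bail_out_early)
{
	load_count lc;
	int ret;

	init_load_count(&lc);
	ret = incr_load_count_znode(&lc, node);
	if (ret != 0)
		goto out;	/* nothing was loaded, handle stays empty */
	if (bail_out_early) {
		ret = -2;
		goto out;	/* loaded; the handle releases it below */
	}
	ret = 0;
out:
	done_load_count(&lc);
	return ret;
}
```

Whichever path use_node() takes, the stub load counter returns to zero, which is the property the header comment attributes to the handle: the exit paths do not have to track whether zrelse() is owed.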