From andrea@suse.de Thu Apr 20 02:19:12 2000 Date: Wed, 5 Apr 2000 16:31:52 +0200 (CEST) From: Andrea Arcangeli To: Heinz Mauelshagen Cc: Linus Torvalds , mge@EZ-Darmstadt.Telekom.de, alan@redhat.com, linux-kernel@vger.rutgers.edu Subject: Re: **** LVM 0.8final patch vs. 2.3.99-pre3 *** On Wed, 5 Apr 2000, Heinz Mauelshagen wrote: >Actually there are still some pieces missing which i need for LVM. > > File Reason > ---------------------------- ----------------------------------------- > drivers/block/Config.in enable LVM choice > drivers/block/ll_rw_blk.c initialization of LVM > drivers/block/lvm.c type correction for static LVM driver > fs/block_dev.c type corrections for static LVM driver > fs/partitions/check.c /proc/partitions LVM support > include/linux/fs.h missing definitions > kernel/ksyms.c missing exported symbols > > >I attached these missing pieces vs. 2.3.99-pre3 below. Thanks for the patch. I am attaching an incremental patch. It's against 2.3.99-pre3 plus the patch you posted, so it will fit perfectly on top of your current tree. Detailed description: o fix the plugging: there's no reason to do the plugging and the elevating in the lvm common per major queue. The elevating and plugging has to be done in the queue of the underlying blockdevice (in the physical level not at the logical level). o move buffer_IO_error() in fs.h, now it's the lowlevel make_request_fn that have to deal with IO errors and make sure the buffer is clean and not uptodate so it make sense to provide such API to all make_request_fn implementations. o fixed some retval according to the new error handling semantics and fixed I/O error handling using buffer_IO_error helper function. o dropped the broken hack that tries to give per logical device (minor number) readahead feature by clobbering the per-major readahead information with the readahead wanted by the logical volume that does the read/write(). It's racy because doing I/O in multiple blockdevices at the same time will lead to changing the readahead under the other I/O opeartions. And it can't work with mounted filesystem. Suppose an fs is mounted on /dev/vg0/lv0/. At this point you do a read from /dev/vg0/lv1 and lv1 is using a readahead of 10. After the read also the filesystem in /dev/vg0/lv0 will start using a readahead of 10 because the per-major readahead information is been clobbered! Dropping the hack cleanup lots of dirty code (that wants to plays with the fops directly) and it has no downside compared to the hack solution. Users will have simply to know that right now the readahead has to be set equal in all logical volumes otherwise all logical volumes will inherit the readahead setting of the latest logical volume activated (and that is true also without my patch). To provide the per-minor readahead feature (that lvm internally and userspace are ready to handle) we'll have to change the linux common code. It's impossible to provide that feature within lvm.c. o avoids side effects in the LVM_CORRECT_READ_AHEAD macro. o remove unused lvm_snapshot_lock o avoids binary breakage of the PV_FLUSH ioctl. I suggest not to change pv_flush_req_t but to make a new userspace data structure pv_flush_req2_t and a new PV_FLUSH2 ioctl instead of breaking the old one. Otherwise old binary tools will break. o invalidate the buffers in the BLKFLSBUF ioctl on the logical volume (so flushb /dev/vg0/lv0 will work correctly) o change md.c to use the buffer_IO_error helper function. o drops the lvm_hd_name_ptr dirty code and #undef LVM_HD_NAME accordingly. Do we really need to enable LVM_HD_NAME? I don't think it's interesting enough to add the lvm_hd_name_ptr stuff. o enlarge the possible readahead values in the 0-250 range (current binary tools supports that range). There's no reason to forbid an user to disable readahead. Some filesystem or blockdevice application may not want to have any readahead. With the pagecache we use madvicse(MAV_RANDOM) for that, with the blockdevice accesses via /dev/vg0/lv0 we only have the readahead option for disabling readahead. Also enlarged the maximum (for example the RAID0/MD default readahead is 128). o inserts a pad in the lv_t structure to avoid breaking the binary tools interface. I think everybody prefers to lose 32 bit per logical volume in the system than having to change binary tools while switching between 2.2.x and 2.3.x to make snapshotting to work :). I'm using such patch to boot into the 2.3.x kernels since LVM lives into the kernel, so it should be well tested. All seems to work. I'm using the same binary tools that I'm using with the 2.2.xaa kernels (so I can swtich between 2.2.x and 2.3.x without noticing). Comments are welcome. Thanks! diff -urN lvm-heinz/drivers/block/ll_rw_blk.c lvm/drivers/block/ll_rw_blk.c --- lvm-heinz/drivers/block/ll_rw_blk.c Wed Apr 5 04:21:22 2000 +++ lvm/drivers/block/ll_rw_blk.c Wed Apr 5 04:21:57 2000 @@ -774,15 +774,6 @@ bh->b_end_io(bh, test_bit(BH_Uptodate, &bh->b_state)); } -static inline void buffer_IO_error(struct buffer_head * bh) -{ - mark_buffer_clean(bh); - /* - * b_end_io has to clear the BH_Uptodate bitflag in the error case! - */ - bh->b_end_io(bh, 0); -} - int generic_make_request (request_queue_t *q, int rw, struct buffer_head * bh) { unsigned long flags; diff -urN lvm-heinz/drivers/block/lvm-snap.c lvm/drivers/block/lvm-snap.c --- lvm-heinz/drivers/block/lvm-snap.c Mon Feb 21 15:17:31 2000 +++ lvm/drivers/block/lvm-snap.c Wed Apr 5 04:21:58 2000 @@ -278,12 +278,12 @@ lvm_hash_link(lv_snap->lv_block_exception + idx, org_phys_dev, org_start, lv_snap); lv_snap->lv_remap_ptr = idx + 1; - return 0; + return 1; /* slow path */ out: lvm_drop_snapshot(lv_snap, reason); - return 1; + return -1; fail_out_of_space: reason = "out of space"; diff -urN lvm-heinz/drivers/block/lvm.c lvm/drivers/block/lvm.c --- lvm-heinz/drivers/block/lvm.c Wed Apr 5 04:21:22 2000 +++ lvm/drivers/block/lvm.c Wed Apr 5 04:21:58 2000 @@ -122,7 +122,7 @@ * - avoided "/dev/" in proc filesystem output * - avoided inline strings functions lvm_strlen etc. * 14/02/2000 - support for 2.3.43 - * - integrated Andrea Arcagnelli's snapshot code + * - integrated Andrea Arcangeli's snapshot code * */ @@ -171,9 +171,9 @@ #include #include -#define LVM_CORRECT_READ_AHEAD( a) \ - if ( a < LVM_MIN_READ_AHEAD || \ - a > LVM_MAX_READ_AHEAD) a = LVM_MAX_READ_AHEAD; +#define LVM_CORRECT_READ_AHEAD(a) \ + (((a) < LVM_MIN_READ_AHEAD || (a) > LVM_MAX_READ_AHEAD) \ + ? LVM_MAX_READ_AHEAD : (a)) #ifndef WRITEA # define WRITEA WRITE @@ -193,13 +193,11 @@ #define DEVICE_REQUEST lvm_dummy_device_request static int lvm_make_request_fn(request_queue_t *, int, struct buffer_head*); +static void lvm_plug_device_noop(request_queue_t *, kdev_t); static int lvm_blk_ioctl(struct inode *, struct file *, uint, ulong); static int lvm_blk_open(struct inode *, struct file *); -static ssize_t lvm_blk_read(struct file *, char *, size_t, loff_t *); -static ssize_t lvm_blk_write(struct file *, const char *, size_t, loff_t *); - static int lvm_chr_open(struct inode *, struct file *); static int lvm_chr_close(struct inode *, struct file *); @@ -298,7 +296,6 @@ static DECLARE_WAIT_QUEUE_HEAD(lvm_map_wait); static spinlock_t lvm_lock = SPIN_LOCK_UNLOCKED; -static spinlock_t lvm_snapshot_lock = SPIN_LOCK_UNLOCKED; static struct file_operations lvm_chr_fops = { @@ -307,18 +304,6 @@ ioctl: lvm_chr_ioctl, }; -static struct file_operations lvm_blk_fops = -{ - open: lvm_blk_open, - release: blkdev_close, - read: lvm_blk_read, - write: lvm_blk_write, - ioctl: lvm_blk_ioctl, - fsync: block_fsync, -}; - -#define BLOCK_DEVICE_OPERATIONS -/* block device operations structure needed for 2.3.38? and above */ static struct block_device_operations lvm_blk_dops = { open: lvm_blk_open, @@ -370,11 +355,7 @@ printk(KERN_ERR "%s -- register_chrdev failed\n", lvm_name); return -EIO; } -#ifdef BLOCK_DEVICE_OPERATIONS if (register_blkdev(MAJOR_NR, lvm_name, &lvm_blk_dops) < 0) -#else - if (register_blkdev(MAJOR_NR, lvm_name, &lvm_blk_fops) < 0) -#endif { printk("%s -- register_blkdev failed\n", lvm_name); if (unregister_chrdev(LVM_CHAR_MAJOR, lvm_name) < 0) @@ -410,6 +391,7 @@ blk_init_queue(BLK_DEFAULT_QUEUE(MAJOR_NR), DEVICE_REQUEST); blk_queue_make_request(BLK_DEFAULT_QUEUE(MAJOR_NR), lvm_make_request_fn); + blk_queue_pluggable(BLK_DEFAULT_QUEUE(MAJOR_NR), lvm_plug_device_noop); /* optional read root VGDA */ /* if ( *rootvg != 0) vg_read_with_pv_and_lv ( rootvg, &vg); @@ -488,8 +470,6 @@ loadtime = CURRENT_TIME; - lvm_lock = lvm_snapshot_lock = SPIN_LOCK_UNLOCKED; - pe_lock_req.lock = UNLOCK_PE; pe_lock_req.data.lv_dev = \ pe_lock_req.data.pv_dev = \ @@ -724,8 +704,19 @@ sizeof(pv_flush_req)) != 0) return -EFAULT; - fsync_dev(pv_flush_req.pv_dev); - invalidate_buffers(pv_flush_req.pv_dev); + for ( v = 0; v < ABS_MAX_VG; v++) { + unsigned int p; + if ( vg[v] == NULL) continue; + for ( p = 0; p < vg[v]->pv_max; p++) { + if ( vg[v]->pv[p] != NULL && + strcmp ( vg[v]->pv[p]->pv_name, + pv_flush_req.pv_name) == 0) { + fsync_dev ( vg[v]->pv[p]->pv_dev); + invalidate_buffers ( vg[v]->pv[p]->pv_dev); + return 0; + } + } + } return 0; default: @@ -813,10 +804,6 @@ if (!(lv_ptr->lv_access & LV_WRITE)) return -EACCES; } -#ifdef BLOCK_DEVICE_OPERATIONS - file->f_op = &lvm_blk_fops; -#endif - /* be sure to increment VG counter */ if (lv_ptr->lv_open == 0) vg_ptr->lv_open++; lv_ptr->lv_open++; @@ -837,34 +824,6 @@ /* - * block device read - */ -static ssize_t lvm_blk_read(struct file *file, char *buffer, - size_t size, loff_t * offset) -{ - int minor = MINOR(file->f_dentry->d_inode->i_rdev); - - read_ahead[MAJOR(file->f_dentry->d_inode->i_rdev)] = - vg[VG_BLK(minor)]->lv[LV_BLK(minor)]->lv_read_ahead; - return block_read(file, buffer, size, offset); -} - - -/* - * block device write - */ -static ssize_t lvm_blk_write(struct file *file, const char *buffer, - size_t size, loff_t * offset) -{ - int minor = MINOR(file->f_dentry->d_inode->i_rdev); - - read_ahead[MAJOR(file->f_dentry->d_inode->i_rdev)] = - vg[VG_BLK(minor)]->lv[LV_BLK(minor)]->lv_read_ahead; - return block_write(file, buffer, size, offset); -} - - -/* * block device i/o-control routine */ static int lvm_blk_ioctl(struct inode *inode, struct file *file, @@ -906,6 +865,7 @@ "%s -- lvm_blk_ioctl -- BLKFLSBUF\n", lvm_name); #endif fsync_dev(inode->i_rdev); + invalidate_buffers(inode->i_rdev); break; @@ -921,7 +881,7 @@ if ((long) arg < LVM_MIN_READ_AHEAD || (long) arg > LVM_MAX_READ_AHEAD) return -EINVAL; - lv_ptr->lv_read_ahead = (long) arg; + read_ahead[MAJOR_NR] = lv_ptr->lv_read_ahead = (long) arg; break; @@ -1293,7 +1253,6 @@ static int lvm_map(struct buffer_head *bh, int rw) { int minor = MINOR(bh->b_rdev); - int ret = 0; ulong index; ulong pe_start; ulong size = bh->b_size >> 9; @@ -1308,7 +1267,7 @@ printk(KERN_ALERT "%s - lvm_map: ll_rw_blk for inactive LV %s\n", lvm_name, lv->lv_name); - return -1; + goto error; } /* if ( lv->lv_access & LV_SNAPSHOT) @@ -1323,7 +1282,7 @@ (rw == WRITEA || rw == WRITE)) { printk ( "%s -- doing snapshot write for %02d:%02d[%02d:%02d] b_blocknr: %lu b_rsector: %lu\n", lvm_name, MAJOR ( bh->b_dev), MINOR ( bh->b_dev), MAJOR ( bh->b_rdev), MINOR ( bh->b_rdev), bh->b_blocknr, bh->b_rsector); - return 0; + goto error; } if ((rw == WRITE || rw == WRITEA) && @@ -1331,7 +1290,7 @@ printk(KERN_CRIT "%s - lvm_map: ll_rw_blk write for readonly LV %s\n", lvm_name, lv->lv_name); - return -1; + goto error; } #ifdef DEBUG_MAP printk(KERN_DEBUG @@ -1347,7 +1306,7 @@ printk(KERN_ALERT "%s - lvm_map *rsector: %lu or size: %lu wrong for" " minor: %2d\n", lvm_name, rsector_tmp, size, minor); - return -1; + goto error; } rsector_sav = rsector_tmp; rdev_sav = rdev_tmp; @@ -1444,11 +1403,11 @@ pe_start, lv_ptr)) { /* create a new mapping */ - ret = lvm_snapshot_COW(rdev_tmp, - rsector_tmp, - pe_start, - rsector_sav, - lv_ptr); + lvm_snapshot_COW(rdev_tmp, + rsector_tmp, + pe_start, + rsector_sav, + lv_ptr); } rdev_tmp = rdev_sav; rsector_tmp = rsector_sav; @@ -1467,7 +1426,11 @@ bh->b_rdev = rdev_tmp; bh->b_rsector = rsector_tmp; - return ret; + return 1; + + error: + buffer_IO_error(bh); + return -1; } /* lvm_map() */ @@ -1519,6 +1482,12 @@ return 1; } +/* + * plug device function is a noop because plugging has to happen + * in the queue of the physical blockdevice to allow the + * elevator to do a better job. + */ +static void lvm_plug_device_noop(request_queue_t *q, kdev_t dev) { } /******************************************************************** * @@ -2090,7 +2059,7 @@ lvm_size[MINOR(lv_ptr->lv_dev)] = lv_ptr->lv_size >> 1; vg_lv_map[MINOR(lv_ptr->lv_dev)].vg_number = vg_ptr->vg_number; vg_lv_map[MINOR(lv_ptr->lv_dev)].lv_number = lv_ptr->lv_number; - LVM_CORRECT_READ_AHEAD(lv_ptr->lv_read_ahead); + read_ahead[MAJOR_NR] = lv_ptr->lv_read_ahead = LVM_CORRECT_READ_AHEAD(lv_ptr->lv_read_ahead); vg_ptr->lv_cur++; lv_ptr->lv_status = lv_status_save; @@ -2328,7 +2297,7 @@ lvm_size[MINOR(lv_ptr->lv_dev)] = lv_ptr->lv_size >> 1; /* vg_lv_map array doesn't have to be changed here */ - LVM_CORRECT_READ_AHEAD(lv_ptr->lv_read_ahead); + read_ahead[MAJOR_NR] = lv_ptr->lv_read_ahead = LVM_CORRECT_READ_AHEAD(lv_ptr->lv_read_ahead); lv_ptr->lv_status = lv_status_save; return 0; diff -urN lvm-heinz/drivers/block/md.c lvm/drivers/block/md.c --- lvm-heinz/drivers/block/md.c Mon Apr 3 03:21:56 2000 +++ lvm/drivers/block/md.c Wed Apr 5 04:21:58 2000 @@ -185,8 +185,7 @@ if (mddev && mddev->pers) return mddev->pers->make_request(q, mddev, rw, bh); else { - mark_buffer_clean(bh); - bh->b_end_io(bh, 0); + buffer_IO_error(bh); return -1; } } diff -urN lvm-heinz/fs/block_dev.c lvm/fs/block_dev.c --- lvm-heinz/fs/block_dev.c Wed Apr 5 04:21:22 2000 +++ lvm/fs/block_dev.c Fri Mar 10 03:22:34 2000 @@ -313,7 +313,7 @@ * since the vma has no handle. */ -int block_fsync(struct file *filp, struct dentry *dentry) +static int block_fsync(struct file *filp, struct dentry *dentry) { return fsync_dev(dentry->d_inode->i_rdev); } @@ -650,7 +650,7 @@ return ret; } -int blkdev_close(struct inode * inode, struct file * filp) +static int blkdev_close(struct inode * inode, struct file * filp) { return blkdev_put(inode->i_bdev, BDEV_FILE); } diff -urN lvm-heinz/fs/partitions/check.c lvm/fs/partitions/check.c --- lvm-heinz/fs/partitions/check.c Wed Apr 5 04:21:22 2000 +++ lvm/fs/partitions/check.c Sun Feb 27 06:19:44 2000 @@ -71,11 +71,6 @@ NULL }; -#if defined CONFIG_BLK_DEV_LVM || defined CONFIG_BLK_DEV_LVM_MODULE -#include -void (*lvm_hd_name_ptr) (char *, int) = NULL; -#endif - /* * disk_name() is used by genhd.c and blkpg.c. * It formats the devicename of the indicated disk into @@ -102,13 +97,6 @@ * This requires special handling here. */ switch (hd->major) { -#if defined CONFIG_BLK_DEV_LVM || defined CONFIG_BLK_DEV_LVM_MODULE - case LVM_BLK_MAJOR: - *buf = 0; - if ( lvm_hd_name_ptr != NULL) - (lvm_hd_name_ptr) ( buf, minor); - return buf; -#endif case IDE9_MAJOR: unit += 2; case IDE8_MAJOR: diff -urN lvm-heinz/include/linux/fs.h lvm/include/linux/fs.h --- lvm-heinz/include/linux/fs.h Wed Apr 5 04:21:22 2000 +++ lvm/include/linux/fs.h Wed Apr 5 04:21:58 2000 @@ -827,7 +827,6 @@ extern struct block_device *bdget(dev_t); extern void bdput(struct block_device *); extern int blkdev_open(struct inode *, struct file *); -extern int blkdev_close(struct inode * inode, struct file * filp); extern struct file_operations def_blk_fops; extern struct file_operations def_fifo_fops; extern int ioctl_by_bdev(struct block_device *, unsigned, unsigned long); @@ -911,6 +910,21 @@ #define atomic_set_buffer_dirty(bh) test_and_set_bit(BH_Dirty, &(bh)->b_state) +/* + * If an error happens during the make_request, this function + * has to be recalled. It marks the buffer as clean and not + * uptodate, and it notifys the upper layer about the end + * of the I/O. + */ +static inline void buffer_IO_error(struct buffer_head * bh) +{ + mark_buffer_clean(bh); + /* + * b_end_io has to clear the BH_Uptodate bitflag in the error case! + */ + bh->b_end_io(bh, 0); +} + extern void balance_dirty(kdev_t); extern int check_disk_change(kdev_t); extern int invalidate_inodes(struct super_block *); @@ -1044,7 +1058,6 @@ /* Generic buffer handling for block filesystems.. */ extern int block_flushpage(struct page *, unsigned long); -extern int block_fsync(struct file *filp, struct dentry *dentry); extern int block_symlink(struct inode *, const char *, int); extern int block_write_full_page(struct page*, get_block_t*); extern int block_read_full_page(struct page*, get_block_t*); diff -urN lvm-heinz/include/linux/lvm.h lvm/include/linux/lvm.h --- lvm-heinz/include/linux/lvm.h Mon Feb 21 15:17:35 2000 +++ lvm/include/linux/lvm.h Wed Apr 5 04:21:58 2000 @@ -65,7 +65,7 @@ #define LVM_TOTAL_RESET #define LVM_GET_INODE -#define LVM_HD_NAME +#undef LVM_HD_NAME /* lots of debugging output (see driver source) #define DEBUG_LVM_GET_INFO @@ -261,8 +261,8 @@ #define LVM_MAX_STRIPES 128 /* max # of stripes */ #define LVM_MAX_SIZE ( 1024LU * 1024 * 1024 * 2) /* 1TB[sectors] */ #define LVM_MAX_MIRRORS 2 /* future use */ -#define LVM_MIN_READ_AHEAD 2 /* minimum read ahead sectors */ -#define LVM_MAX_READ_AHEAD 120 /* maximum read ahead sectors */ +#define LVM_MIN_READ_AHEAD 0 /* minimum read ahead sectors */ +#define LVM_MAX_READ_AHEAD 256 /* maximum read ahead sectors */ #define LVM_MAX_LV_IO_TIMEOUT 60 /* seconds I/O timeout (future use) */ #define LVM_PARTITION 0xfe /* LVM partition id */ #define LVM_NEW_PARTITION 0x8e /* new LVM partition id (10/09/1999) */ @@ -590,6 +590,7 @@ struct lv_v2 *lv_snapshot_prev; struct lv_v2 *lv_snapshot_next; lv_block_exception_t *lv_block_exception; + uint8_t __unused; uint32_t lv_remap_ptr; uint32_t lv_remap_end; uint32_t lv_chunk_size; @@ -786,7 +787,6 @@ typedef struct { char pv_name[NAME_LEN]; - kdev_t pv_dev; } pv_flush_req_t; diff -urN lvm-heinz/kernel/ksyms.c lvm/kernel/ksyms.c --- lvm-heinz/kernel/ksyms.c Wed Apr 5 04:21:22 2000 +++ lvm/kernel/ksyms.c Mon Apr 3 03:21:59 2000 @@ -52,11 +52,6 @@ #include #endif -#ifdef CONFIG_BLK_DEV_LVM_MODULE -extern void (*lvm_hd_name_ptr) ( char*, int); -EXPORT_SYMBOL(lvm_hd_name_ptr); -#endif - extern int console_loglevel; extern void set_device_ro(kdev_t dev,int flag); #if !defined(CONFIG_NFSD) && defined(CONFIG_NFSD_MODULE) @@ -239,7 +234,6 @@ EXPORT_SYMBOL(page_readlink); EXPORT_SYMBOL(page_follow_link); EXPORT_SYMBOL(page_symlink_inode_operations); -EXPORT_SYMBOL(block_fsync); EXPORT_SYMBOL(block_symlink); EXPORT_SYMBOL(vfs_readdir); @@ -277,7 +271,6 @@ EXPORT_SYMBOL(sync_dev); EXPORT_SYMBOL(devfs_register_partitions); EXPORT_SYMBOL(blkdev_open); -EXPORT_SYMBOL(blkdev_close); EXPORT_SYMBOL(blkdev_get); EXPORT_SYMBOL(blkdev_put); EXPORT_SYMBOL(ioctl_by_bdev); Andrea