RAW DEVICE IO FOR LINUX-2.1.131
===============================

First test release.  USE AT YOUR OWN RISK.

Stephen Tweedie, 10 December 1998
Copyright (C) Red Hat Software, 1998

What is in this package?
------------------------

Three things:

* A kernel patch to enable raw device IO, the O_DIRECT fcntl option,
  and a new system call, sys_fsattr, for the setting and querying of
  filesystem attribute flags on all inodes.

* A small library, libsetattr, to interface to the sys_fsattr system
  call.  libsetattr provides simple setattr(filename, flags) and
  getattr(filename) functions.

* Patches to e2fsprogs to extend lsattr and chattr to use the new
  attribute functions from libsetattr, and to understand the "raw"
  attribute.

The Kernel Patch
----------------

This kernel patch provides a new system call for accessing filesystem
inode attributes.  The existing mechanism for accessing attributes has
two major problems:

* It cannot be used on socket or device special files

* It is specific to ext2fs.

The sys_fsattr() system call (and its two user-space wrapper
functions, setattr and getattr from the libsetattr library) take a
filename as an argument, and do not require you to open the file
first.  Thus, you can set attribute bits on ANY inode in an ext2fs
filesystem.  For example, you can finally remove the immutable or
append-only bits from a corrupt block-device inode, something which
you could previously do only with debugfs.

With this patch in place, opening an inode with the raw bit set (see
the chattr instructions below) is exactly equivalent to opening that
file with the O_DIRECT fcntl option.  In either case, the kernel
behaves exactly the same: if the file is a block device, then all
[p]read() and [p]write() system calls to that file occur using raw IO.
The O_DIRECT option is currently ignored for files other than block
device special inodes.

What does raw IO give you?  Raw IO bypasses the kernel caching
entirely.
The data that you pass to the kernel's read/write system call gets
passed _directly_ to the device driver.  If the driver is DMA-capable,
then the device will be doing DMA straight into the user buffer, with
no data copies performed at all.

All raw IO is synchronous, and when the read or write system call has
completed, the kernel guarantees that the transfer to disk is
finished.  There is no read-ahead or write-behind, so performance on a
raw device might actually be worse than on a buffered device unless
you use a large enough blocksize.

See the top of linux/fs/raw.c for implementation notes.  Three points
are worth emphasising here:

* No cache-coherency is performed.  The raw IO goes straight to disk;
  any stale data in the buffer cache is not invalidated.  I'd like to
  know what other Unixes do in this situation.

* The implementation currently limits the IO block size to 64k, to
  reduce the amount of physical memory that a user can keep locked
  down at any point in time.  Using a larger block size will not help.
  If a larger block size is necessary for performance, please let me
  know.

* The caller MUST make sure that the buffer supplied for the IO is
  suitable for being transferred straight to disk.  In particular, the
  buffer size and the file offset must be multiples of the disk block
  size, and the buffer must be aligned such that no disk block
  straddles a page boundary in memory.

  Note that the GNU "dd" program does not use correctly-aligned
  buffers: get Larry McVoy's "lmdd" from the lmbench benchmark on
  www.bitmover.com if you want a working dd for raw devices.

The Libsetattr Stub Library
---------------------------

The kernel fsattr system call provides a single system call which
allows you to test and set inode attributes in a single call.  The set
of flags is passed to fsattr via an arbitrary-sized buffer, to allow
for future expansion.  However, for now, only a few attributes are
defined, and those attributes can easily be accessed as a single
integer.
As a result, it is convenient to have a way of setting or fetching
attributes which does not have to know about setting up the transfer
buffer for sys_fsattr().

The libsetattr.a library contained here provides two functions,
declared in /usr/include/setattr.h:

    extern int getattr(const char *filename);
    extern int setattr(const char *filename, int flags);

These both return -1 on error.  The flags mask is composed of the
following bits:

    #define ATTR_FLAG_SYNCRONOUS   1   /* Synchronous write */
    #define ATTR_FLAG_NOATIME      2   /* Don't update atime */
    #define ATTR_FLAG_APPEND       4   /* Append-only file */
    #define ATTR_FLAG_IMMUTABLE    8   /* Immutable file */
    #define ATTR_FLAG_NODIRATIME   16  /* Don't update atime for directory */
    #define ATTR_FLAG_DIRECT       32  /* Force unbuffered IO */
    #define ATTR_FLAG_NODUMP       64  /* Do not dump(8) this file */
    #define ATTR_FLAG_COMPRESS     128 /* Try to compress this file */
    #define ATTR_FLAG_BTREE        256 /* Use btree format on dir */

Calling these functions on a kernel without the raw patch applied will
return -1 with errno set to ENOSYS.

The E2fsprogs Modifications
---------------------------

There are four modifications to e2fsprogs in the patch supplied:

* The libe2p fset_flags and fget_flags functions now understand, and
  use, libsetattr if present.  As a result, you can now use these
  functions on a special device or socket inode without getting an
  error back.  If you use the new libe2p on an older kernel, the
  functions will correctly detect the ENOSYS kernel error and will
  fall back to using the old ext2fs-specific ioctl() commands.

* The libe2p "pf" print-flags function understands the new
  EXT2_DIRECT_FL raw IO flag, and prints an "r" character if it is
  set.

* The chattr program understands the "r" attribute, so that you can
  set or clear the EXT2_DIRECT_FL flag on any ext2fs inode.

* The lsattr program no longer tries to query the inode version number
  if you have not asked it to do so.
  You cannot query the ext2fs version number on a block device inode,
  so the old behaviour prevented lsattr from working properly on such
  devices.

To build the new e2fsprogs (for lsattr/chattr):

1) Make sure /usr/src/linux points to a kernel tree with the DIRECT
   attributes defined.

2) Build and install the libsetattr.a library.

3) Build the e2fsprogs srpm as normal, or build Ted's official
   e2fsprogs-1.12 with the supplied e2fsprogs-1.12-raw.diff patch
   applied.

The patched configure script will detect the libsetattr library if it
is present, and the lsattr/chattr source will build support for the
kernel ATTR_FLAG_DIRECT if defined in the normal #include path.

Performance
-----------

So far, this has been tested only on floppy, IDE and SCSI on a fairly
low-end machine: a 486-dx100 with aha1542.  The raw device was created
using

    cp -a /dev/sda1 /dev/rsda1
    chattr +r /dev/rsda1

Using the (fairly slow) buffered /dev/sda1 and raw /dev/rsda1 devices
on this machine gives me:

                      MB/sec read  %cpu read  MB/sec write  %cpu write
    /dev/sda1:        1.3501       13%        1.2275        23%
    /dev/rsda1,32k:   1.5505        5%        1.3285         5%
    /dev/rsda1,64k:   1.6303        3%        1.3976         4%

In these cases, at least, the raw device is uniformly but only
slightly faster than the buffered device for raw disk throughput, but
the %cpu utilisation is dramatically better for the raw devices.
Using a larger block size for the transfers improves performance still
further.

This is a test release.  All input, bug reports and requests are
welcome.

--Stephen
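
The buffer constraints listed under "The Kernel Patch" (buffer size and
file offset both multiples of the disk block size, and no disk block
straddling a page boundary in memory) can be checked in user space
before issuing a raw read or write.  The following is only a minimal
sketch and is not part of this package: raw_io_request_ok() is a
hypothetical helper, and the block_size and page_size arguments are
assumptions that the caller must supply for the device and machine in
question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of the kernel patch or of libsetattr).
 * Returns 1 if a transfer of `len` bytes at file offset `offset` using
 * buffer `buf` meets the raw IO constraints described in this README,
 * and 0 otherwise.  block_size and page_size are supplied by the
 * caller; how to discover them is outside the scope of this sketch.
 */
int raw_io_request_ok(const void *buf, size_t len, long long offset,
                      size_t block_size, size_t page_size)
{
    uintptr_t addr = (uintptr_t)buf;
    size_t done;

    assert(block_size > 0 && page_size > 0);

    /* Buffer size and file offset must be multiples of the block size. */
    if (len % block_size != 0)
        return 0;
    if (offset < 0 || offset % (long long)block_size != 0)
        return 0;

    /* No disk block may straddle a page boundary in memory. */
    for (done = 0; done < len; done += block_size) {
        uintptr_t first = addr + done;            /* first byte of block */
        uintptr_t last = first + block_size - 1;  /* last byte of block  */

        if (first / page_size != last / page_size)
            return 0;
    }
    return 1;
}
```

With a 512-byte block size and 4096-byte pages, a page-aligned buffer
passes the alignment test for any length that is a multiple of 512,
whereas a buffer that starts 256 bytes into a page fails as soon as one
of its blocks crosses into the next page.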