RAW DEVICE IO FOR LINUX-2.1.131
===============================

First test release.  USE AT YOUR OWN RISK.

Stephen Tweedie <sct@redhat.com>, 10 December 1998

Copyright (C) Red Hat Software, 1998


What is in this package?
------------------------

Three things:

* A kernel patch to enable raw device IO, the O_DIRECT fcntl option, and
  a new system call, sys_fsattr, for the setting and querying of 
  filesystem attribute flags on all inodes.

* A small library, libsetattr, to interface to the sys_fsattr system call.
  libsetattr provides simple setattr(filename, flags) and getattr(filename)
  functions.

* Patches to e2fsprogs to extend lsattr and chattr to use the new attribute
  functions from libsetattr, and to understand the "raw" attribute.


The Kernel Patch
----------------

This kernel patch provides a new system call for accessing filesystem
inode attributes.  The existing mechanism for accessing attributes has
two major problems:

* It cannot be used on socket or device special files.
* It is specific to ext2fs.

The sys_fsattr() system call (and its two user-space wrapper functions,
setattr and getattr from the libsetattr library) takes a filename as an
argument, and does not require you to open the file first.  Thus, you can
set attribute bits on ANY inode in an ext2fs filesystem.  For example,
you can finally remove the immutable or append-only bits from a corrupt
block-device inode, something which you could previously do only with
debugfs. 

With this patch in place, opening an inode with the raw bit set (see the
chattr instructions below) is exactly equivalent to opening that file
with the O_DIRECT fcntl option.  In either case, the kernel behaves
exactly the same: if the file is a block device, then all [p]read() and
[p]write() system calls to that file occur using raw IO.  The O_DIRECT
option is currently ignored for files other than block device special
inodes.

What does raw IO give you?

Raw IO bypasses the kernel caching entirely.  The data that you pass to
the kernel's read/write system call gets passed _directly_ to the device
driver.  If the driver is DMA-capable, then the device will be doing DMA
straight into the user buffer, with no data copies performed at all.
All raw IO is synchronous, and when the read or write system call has
completed, the kernel guarantees that the transfer to disk is finished.
There is no read-ahead or write-behind, so performance on a raw device
might actually be worse than on a buffered device unless you use a large
enough blocksize.

See the top of linux/fs/raw.c for implementation notes.  Three points are
worth emphasising here:

* No cache-coherency is performed.  The raw IO goes straight to disk;
  any stale data in the buffer cache is not invalidated.  I'd like to
  know what other Unixes do in this situation.

* The implementation currently limits the IO block size to 64k, to
  reduce the amount of physical memory that a user can keep locked down
  at any point in time.  Using a larger block size will not help.  If a
  larger block size is necessary for performance, please let me know.

* The caller MUST make sure that the buffer supplied for the IO is
  suitable for being transferred straight to disk.  In particular, the
  buffer size and the file offset must be multiples of the disk block
  size, and the buffer must be aligned such that no disk block straddles
  a page boundary in memory.

Note that the GNU "dd" program does not use correctly-aligned buffers:
get Larry McVoy's "lmdd" from the lmbench benchmark on www.bitmover.com
if you want a working dd for raw devices.
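
The alignment rules and the 64k limit described above can be sketched in
C roughly as follows.  This is only an illustration, not code from the
patch: the names raw_io_args_ok, raw_alloc, raw_read and RAW_MAX_IO are
made up for the example, and it assumes that the disk block size is a
power of two no larger than a page.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical name for the 64k per-call limit described above. */
#define RAW_MAX_IO (64 * 1024)

/* Return 1 if the arguments follow the raw IO rules: length and offset
 * are multiples of the block size, and the buffer is block-aligned.
 * With a power-of-two block size no larger than a page, a block-aligned
 * buffer guarantees that no block straddles a page boundary. */
int raw_io_args_ok(const void *buf, size_t len, off_t offset, size_t blksz)
{
    assert(blksz != 0 && (blksz & (blksz - 1)) == 0);
    return len % blksz == 0
        && offset % (off_t)blksz == 0
        && (uintptr_t)buf % blksz == 0;
}

/* Allocate a page-aligned buffer, which is block-aligned for any block
 * size that divides the page size.  Returns NULL on failure. */
void *raw_alloc(size_t len)
{
    void *buf = NULL;
    long page = sysconf(_SC_PAGESIZE);

    if (page <= 0 || posix_memalign(&buf, (size_t)page, len) != 0)
        return NULL;
    return buf;
}

/* Read len bytes at offset, issuing requests of at most RAW_MAX_IO
 * bytes each.  Returns the number of bytes read, or -1 on error. */
ssize_t raw_read(int fd, void *buf, size_t len, off_t offset)
{
    size_t done = 0;

    while (done < len) {
        size_t chunk = len - done;
        ssize_t n;

        if (chunk > RAW_MAX_IO)
            chunk = RAW_MAX_IO;
        n = pread(fd, (char *)buf + done, chunk, offset + (off_t)done);
        if (n < 0)
            return -1;
        done += (size_t)n;
        if ((size_t)n < chunk)
            break;          /* short read: stop, alignment may be lost */
    }
    return (ssize_t)done;
}
```

A caller would open the block device with O_DIRECT (or open a device
inode that has the raw bit set), allocate the buffer with raw_alloc(),
and check its arguments with raw_io_args_ok() before calling raw_read().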


The Libsetattr Stub Library
---------------------------

The kernel fsattr system call allows you to test and set inode
attributes in a single call.  The set of flags
is passed to fsattr via an arbitrary-sized buffer, to allow for future
expansion.  However, for now, only a few attributes are defined, and
those attributes can easily be accessed as a single integer.  As a
result, it is convenient to have a way of setting or fetching attributes
which does not have to know about setting up the transfer buffer for
sys_fsattr(). 

The libsetattr.a library contained here provides two functions, declared
in /usr/include/setattr.h:

extern int getattr(const char *filename);
extern int setattr(const char *filename, int flags);

These both return -1 on error.  The flags mask is composed of the
following bits:

#define ATTR_FLAG_SYNCRONOUS	1 	/* Syncronous write */
#define ATTR_FLAG_NOATIME	2 	/* Don't update atime */
#define ATTR_FLAG_APPEND	4 	/* Append-only file */
#define ATTR_FLAG_IMMUTABLE	8 	/* Immutable file */
#define ATTR_FLAG_NODIRATIME	16 	/* Don't update atime for directory */
#define ATTR_FLAG_DIRECT	32 	/* Force unbuffered IO */
#define ATTR_FLAG_NODUMP	64 	/* Do not dump(8) this file */
#define ATTR_FLAG_COMPRESS	128 	/* Try to compress this file */
#define ATTR_FLAG_BTREE		256 	/* Use btree format on dir */

Calling these functions on a kernel without the raw patch applied will
return -1 with errno set to ENOSYS.
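
One likely use of the two wrappers is a read-modify-write of the flags
mask, for example to turn ATTR_FLAG_DIRECT on or off without disturbing
the other bits.  The sketch below is an illustration only.  It assumes
that setattr() replaces the whole mask, which is not stated here, and it
takes the getter and setter as function pointers so that it compiles
without libsetattr installed.  The names set_direct_flag, demo_getattr
and demo_setattr are hypothetical.

```c
#include <stddef.h>

/* Value copied from the flag list above. */
#define ATTR_FLAG_DIRECT 32

/* Same signatures as the getattr/setattr declarations above. */
typedef int (*getattr_fn)(const char *filename);
typedef int (*setattr_fn)(const char *filename, int flags);

/* Set (enable != 0) or clear (enable == 0) the raw IO bit, leaving all
 * other attribute bits as they were.  Returns 0 on success, -1 on error.
 * Assumption: the setter replaces the whole flags mask. */
int set_direct_flag(const char *filename, int enable,
                    getattr_fn get, setattr_fn set)
{
    int flags = get(filename);
    int wanted;

    if (flags == -1)
        return -1;
    wanted = enable ? (flags | ATTR_FLAG_DIRECT)
                    : (flags & ~ATTR_FLAG_DIRECT);
    if (wanted == flags)
        return 0;           /* nothing to change */
    return set(filename, wanted) == -1 ? -1 : 0;
}

/* Hypothetical in-memory stand-ins, so the sketch can be exercised
 * without a patched kernel.  They ignore the filename. */
static int demo_flags = 2;  /* ATTR_FLAG_NOATIME */

int demo_getattr(const char *filename)
{
    (void)filename;
    return demo_flags;
}

int demo_setattr(const char *filename, int flags)
{
    (void)filename;
    demo_flags = flags;
    return 0;
}
```

With the real library, the getattr and setattr functions declared in
setattr.h would be passed in place of the demo_ stand-ins.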


The E2fsprogs Modifications
---------------------------

There are four modifications to e2fsprogs in the patch supplied:

* The libe2p fset_flags and fget_flags functions now understand, and
  use, libsetattr if present.  As a result, you can now use these
  functions on a special device or socket inode without getting an error
  back.  If you use the new libe2p on an older kernel, the functions
  will correctly detect the ENOSYS kernel error and will fall back to
  using the old ext2fs-specific ioctl() commands.

* The libe2p "pf" print-flags function understands the new
  EXT2_DIRECT_FL raw IO flag, and prints an "r" character if it is set. 
   
* The chattr program understands the "r" attribute, so that you can set
  or clear the EXT2_DIRECT_FL flag on any ext2fs inode.

* The lsattr program no longer tries to query the inode version number
  if you have not asked it to do so.  You cannot query the ext2fs
  version number on a block device inode, so the old behaviour prevented
  lsattr from working properly on such devices.
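
The ENOSYS fallback described in the first point above follows a common
pattern, sketched below.  This is not the libe2p code: the function and
type names are hypothetical, and both interfaces are represented by
function pointers with a made-up signature so that the example is
self-contained.

```c
#include <errno.h>

/* Hypothetical common signature for "read the flags of this file".
 * Returns 0 on success, or -1 with errno set on failure. */
typedef int (*flag_reader)(const char *filename, int *flags);

/* Try the newer interface first.  Fall back to the legacy one only if
 * the kernel reports that the newer call does not exist (ENOSYS); any
 * other error is returned to the caller unchanged. */
int read_flags_with_fallback(const char *filename, int *flags,
                             flag_reader newer, flag_reader legacy)
{
    if (newer(filename, flags) == 0)
        return 0;
    if (errno != ENOSYS)
        return -1;
    return legacy(filename, flags);
}

/* Stand-in for the new system call on an unpatched kernel. */
int demo_fsattr_missing(const char *filename, int *flags)
{
    (void)filename;
    (void)flags;
    errno = ENOSYS;
    return -1;
}

/* Stand-in for the new system call failing for another reason. */
int demo_fsattr_denied(const char *filename, int *flags)
{
    (void)filename;
    (void)flags;
    errno = EACCES;
    return -1;
}

/* Stand-in for the old ioctl()-based path; 8 is an arbitrary value. */
int demo_ioctl_reader(const char *filename, int *flags)
{
    (void)filename;
    *flags = 8;
    return 0;
}
```

Checking errno for ENOSYS specifically matters here: falling back on any
error would hide genuine failures such as a permission problem.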


To build the new e2fsprogs (for lsattr/chattr):

1) Make sure /usr/src/linux points to a kernel tree with the DIRECT 
   attributes defined.

2) Build and install the libsetattr.a library.

3) Build the e2fsprogs srpm as normal, or build Ted's official
   e2fsprogs-1.12 with the supplied e2fsprogs-1.12-raw.diff
   patch applied.  The patched configure script will detect the 
   libsetattr library if it is present, and the lsattr/chattr
   source will build support for the kernel ATTR_FLAG_DIRECT
   if defined in the normal #include <linux/*> path.


Performance
-----------

So far, this has been tested only on floppy, IDE and SCSI on a fairly
low-end machine: a 486-dx100 with aha1542.  The raw device was created using

	cp -a /dev/sda1 /dev/rsda1
	chattr +r /dev/rsda1

Using the (fairly slow) buffered /dev/sda1 and raw /dev/rsda1 devices on
this machine gives me:

		MB/sec read	%cpu read	MB/sec write	%cpu write
/dev/sda1:	1.3501		13%		1.2275		23%
/dev/rsda1,32k:	1.5505		5%		1.3285		5%
/dev/rsda1,64k:	1.6303		3%		1.3976		4%

In these cases, at least, the raw device is uniformly but only slightly
faster than the buffered device for raw disk throughput (roughly 15-21%
on reads and 8-14% on writes), but the %cpu utilisation is dramatically
better for the raw devices.  Using a larger block size for the transfers
improves performance still further.


This is a test release.  All input, bug reports and requests are
welcome.

--Stephen