dm-log-writes

This target takes two devices: one to pass all IO to normally, and one to log all of the write operations to. It is intended for file system developers who want to verify the integrity of metadata or data as the file system is written to. A log_write_entry is written for every WRITE request, and the target can take arbitrary data from userspace to insert into the log. The data in each WRITE request is copied into the log so that the replay happens exactly as it happened originally.

Log Ordering

We log things in order of completion once we are sure the write is no longer in cache. This means that normal WRITE requests are not actually logged until the next REQ_PREFLUSH request. This lets userspace replay the log in a way that correlates to what is on disk rather than what is in cache, which makes it easier to detect improper waiting/flushing.

This works by attaching each WRITE request to a list once the write completes. When we see a REQ_PREFLUSH request we splice this list onto the request, and once the FLUSH request completes we log all of the WRITEs and then the FLUSH. Only WRITEs that have completed by the time the REQ_PREFLUSH is issued are added, in order to simulate the worst case scenario with regard to power failures. Consider the following example (W means write, C means complete):

W1,W2,W3,C3,C2,Wflush,C1,Cflush

The log would show the following:

W3,W2,flush,W1….

Again, this is to simulate what is actually on disk. It allows us to detect cases where a power failure at a particular point in time would create an inconsistent file system.
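The ordering rule above can be modelled in a few lines of Python. This is only a sketch of the behaviour described here, not the target's implementation. The function name simulate_log and the event strings are made up for illustration.

```python
def simulate_log(events):
    """Model the rule: completed writes are logged only when a flush completes.

    events uses the notation from the example: "W1" issues write 1, "C1"
    completes it, "Wflush" issues a flush and "Cflush" completes it.
    """
    completed = []   # writes that have completed but are not yet logged
    attached = None  # writes spliced onto the in-flight flush
    log = []
    for ev in events:
        kind, name = ev[0], ev[1:]
        if kind == "W" and name == "flush":
            # Flush issued: it only covers writes that completed before now.
            attached, completed = completed, []
        elif kind == "C" and name == "flush":
            # Flush completed: log the attached writes, then the flush itself.
            log.extend(attached or [])
            log.append("flush")
            attached = None
        elif kind == "C":
            completed.append("W" + name)
        # Issuing a plain write ("W<n>") does not touch the log.
    return log, completed


log, pending = simulate_log(
    ["W1", "W2", "W3", "C3", "C2", "Wflush", "C1", "Cflush"])
print(",".join(log))  # W3,W2,flush
print(pending)        # ['W1']
```

W1 completed after the flush was issued, so it stays pending and would only be logged along with a later flush.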

Any REQ_FUA requests bypass this flushing mechanism and are logged as soon as they complete, since those requests bypass the device cache.

Any REQ_OP_DISCARD requests are treated like WRITE requests. Otherwise the log would contain all the DISCARD requests first, then the WRITE requests, and then the FLUSH request. Consider the following example:

WRITE block 1, DISCARD block 1, FLUSH

If we logged DISCARD when it completed, the replay would look like this:

DISCARD 1, WRITE 1, FLUSH

which isn’t quite what happened and wouldn’t be caught during the log replay.
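To make the difference concrete, here is a small Python model that compares the two policies. It is a sketch under simplifying assumptions: every operation completes in the order given, and the helper name replay_order and its flag are hypothetical.

```python
def replay_order(ops, discard_waits_for_flush):
    """Return the order in which ops would appear in the log.

    ops is a list such as ["WRITE 1", "DISCARD 1", "FLUSH"], each operation
    completing in the order given (a simplification for this sketch).
    """
    log, pending = [], []
    for op in ops:
        if op == "FLUSH":
            # The flush logs everything held back so far, then itself.
            log.extend(pending)
            pending = []
            log.append(op)
        elif op.startswith("DISCARD") and not discard_waits_for_flush:
            # Logged at completion, which puts it ahead of pending writes.
            log.append(op)
        else:
            # Held until the next flush, like a normal WRITE.
            pending.append(op)
    return log


ops = ["WRITE 1", "DISCARD 1", "FLUSH"]
print(replay_order(ops, discard_waits_for_flush=False))
# ['DISCARD 1', 'WRITE 1', 'FLUSH']
print(replay_order(ops, discard_waits_for_flush=True))
# ['WRITE 1', 'DISCARD 1', 'FLUSH']
```

Only the second policy, where DISCARD is held like a WRITE, preserves the order in which the operations were actually issued.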

Target interface

  1. Constructor

    log-writes <dev_path> <log_dev_path>

    dev_path

    Device that all of the IO will go to normally.

    log_dev_path

    Device where the log entries are written to.

  2. Status

    <#logged entries> <highest allocated sector>

    #logged entries

    Number of logged entries

    highest allocated sector

    Highest allocated sector
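    A test script may want to read these two fields. The Python sketch below parses a single status line and assumes it arrives in the usual dmsetup form, with the start sector, length and target name ahead of the target-specific fields. The sample values and the names LogWritesStatus and parse_status are hypothetical.

```python
from typing import NamedTuple


class LogWritesStatus(NamedTuple):
    logged_entries: int
    highest_allocated_sector: int


def parse_status(line):
    """Parse one status line for a log-writes target.

    Assumed layout: <start> <length> log-writes <#logged entries>
    <highest allocated sector>
    """
    fields = line.split()
    if len(fields) != 5 or fields[2] != "log-writes":
        raise ValueError("not a log-writes status line: %r" % line)
    return LogWritesStatus(int(fields[3]), int(fields[4]))


# Hypothetical sample line, for illustration only.
status = parse_status("0 2097152 log-writes 1024 65536")
print(status.logged_entries)            # 1024
print(status.highest_allocated_sector)  # 65536
```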

  3. Messages

mark <description>

You can use a dmsetup message to set an arbitrary mark in a log. For example, say you want to fsck a file system after every write, but first you need to replay up to the mkfs to make sure you are running fsck on something reasonable. You would do something like this:

mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs
<run test>

This would allow you to replay the log up to the mkfs mark and then replay from that point on, running the fsck check at whatever interval you want.

Every log has a mark at the end labeled “dm-log-writes-end”.
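If marks are set from a test harness rather than by hand, a thin wrapper around the dmsetup message command shown above may be convenient. This Python sketch only builds and runs that same command line. The helper names are hypothetical, and rejecting whitespace in the description is a conservative assumption of this sketch, not a documented rule.

```python
import subprocess


def mark_command(device, description):
    """Build the argv for: dmsetup message <device> 0 mark <description>."""
    # Assumption: keep the description to a single word to be safe.
    if not description or any(c.isspace() for c in description):
        raise ValueError("mark description must be a single non-empty word")
    return ["dmsetup", "message", device, "0", "mark", description]


def set_mark(device, description):
    """Run the command; needs root and an existing log-writes device."""
    subprocess.run(mark_command(device, description), check=True)


print(" ".join(mark_command("log", "mkfs")))
# dmsetup message log 0 mark mkfs
```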

Userspace component

There is a userspace tool that will replay the log for you in various ways. It can be found here: https://github.com/josefbacik/log-writes

Example usage

Say you want to test fsync on your file system. You would do something like this:

TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs

mount /dev/mapper/log /mnt/btrfs-test
<some test that does fsync at the end>
dmsetup message log 0 mark fsync
md5sum /mnt/btrfs-test/foo
umount /mnt/btrfs-test

dmsetup remove log
replay-log --log /dev/sdc --replay /dev/sdb --end-mark fsync
mount /dev/sdb /mnt/btrfs-test
md5sum /mnt/btrfs-test/foo
<verify the md5sums match>
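
The final verification step can also be scripted. A minimal Python sketch follows; the helper name md5_of is hypothetical, and the path in the usage comment is the one from the example above. The checksum is recorded before the unmount and compared again after the replay.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Usage, following the steps above:
#   before = md5_of("/mnt/btrfs-test/foo")   # while the log device is mounted
#   ... umount, dmsetup remove, replay-log, mount /dev/sdb ...
#   assert md5_of("/mnt/btrfs-test/foo") == before
```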

Another option is to do a complicated file system operation and verify the file
system is consistent during the entire operation.  You could do this with:

TABLE="0 $(blockdev --getsz /dev/sdb) log-writes /dev/sdb /dev/sdc"
dmsetup create log --table "$TABLE"
mkfs.btrfs -f /dev/mapper/log
dmsetup message log 0 mark mkfs

mount /dev/mapper/log /mnt/btrfs-test
<fsstress to dirty the fs>
btrfs filesystem balance /mnt/btrfs-test
umount /mnt/btrfs-test
dmsetup remove log

replay-log --log /dev/sdc --replay /dev/sdb --end-mark mkfs
btrfsck /dev/sdb
replay-log --log /dev/sdc --replay /dev/sdb --start-mark mkfs \
      --fsck "btrfsck /dev/sdb" --check fua

That will replay the log until it sees a FUA request, run the fsck command, and, if the fsck passes, replay to the next FUA, continuing until the log is complete or the fsck command exits abnormally.
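
The control flow of that last command can be pictured with a short Python model. This is a sketch of the loop as described here, not the replay-log source. The entry shape (payload, is_fua) and the function name replay_with_check are assumptions made for illustration.

```python
def replay_with_check(entries, apply_entry, check):
    """Replay entries in order, running check() after every FUA entry.

    entries is an iterable of (payload, is_fua) tuples (hypothetical shape).
    Returns the index of the entry after which check() failed, or None if
    the whole log was replayed with every check passing.
    """
    for index, (payload, is_fua) in enumerate(entries):
        apply_entry(payload)
        if is_fua and not check():
            return index  # stop at the first failing check
    return None


disk = []
entries = [("a", False), ("b", True), ("bad", False), ("c", True), ("d", True)]
failed_at = replay_with_check(entries, disk.append, lambda: "bad" not in disk)
print(failed_at, disk)  # 3 ['a', 'b', 'bad', 'c']
```

The check runs only at FUA boundaries, so the bad entry is detected at the next FUA rather than at the moment it is applied.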