Wednesday, July 12, 2006

FUSE implications on ZFS


I know it's been almost 2 weeks since my last post, but I'm still in my university exam season. Anyway, after my last exam next Wednesday (the 19th), I'll be free to work on this project full-time ;)

Today I received a few interesting questions from Jim Thompson that he and I both think you should know about.

"(snip) reading the ZFS mailing list I've seen a couple of mentions that ZFS turns off the write cache on the disks it manages. There may be other low-level disk control issues in ZFS as well. Is it possible for ZFS to accomplish these low-level operations when running from user code in FUSE?

Secondly, how does FUSE+ZFS ensure that the linux kernel's disk cache doesn't interfere with ZFS's writes to the disk. When ZFS thinks it's written a block to disk, is there any possibility that the block is actually cached inside the linux kernel's list of dirty disk pages?"

Actually, regarding the write cache, ZFS on (Open)Solaris enables it if you give it a whole disk. The real problem with disks' write caches is the reordering of writes. ZFS needs a guarantee that all the writes in the current transaction have been flushed to the disk platter before it writes the uberblock, in case power fails.

This will be accomplished in zfs-fuse by calling fsync(2) on file vdevs and ioctl(BLKFLSBUF) on block devices at the appropriate times (which ZFS already does), in order to flush all writes to disk. The (Linux) kernel guarantees that this will happen (on sane disks).
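To make the distinction between the two cases concrete, here is a minimal sketch in Python (not the actual zfs-fuse code, which is C). The function names `flush_vdev` and `demo` are made up for illustration, and the `BLKFLSBUF` value is the Linux request number `_IO(0x12, 97)` written out by hand, since Python's `fcntl` module does not export it.

```python
import fcntl
import os
import stat
import tempfile

# Linux ioctl request number for BLKFLSBUF, i.e. _IO(0x12, 97).
# Hard-coded here as an assumption; Python's fcntl module has no constant for it.
BLKFLSBUF = 0x1261


def flush_vdev(fd):
    """Flush pending writes for a vdev and return the name of the call used.

    Hypothetical helper: picks the flush call based on the file type.
    """
    mode = os.fstat(fd).st_mode
    if stat.S_ISBLK(mode):
        # Block device vdev: ask the kernel to flush its buffers for the device.
        fcntl.ioctl(fd, BLKFLSBUF)
        return "ioctl(BLKFLSBUF)"
    # File vdev: fsync(2) returns only after the file's dirty data is flushed.
    os.fsync(fd)
    return "fsync"


def demo():
    """Exercise the file-vdev path on a temporary file."""
    with tempfile.TemporaryFile() as f:
        f.write(b"x" * 512)
        f.flush()  # push Python's own buffer down to the kernel first
        return flush_vdev(f.fileno())
```

The block device path needs a real device and the right permissions, so `demo()` only exercises the regular file case.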

This is the only low-level interaction with the disks that ZFS cares about.

If your hard disk is broken or misbehaving so that it ignores the command to flush the write cache, you can always disable the write cache with hdparm(8)/sdparm(8)/blktool(8), just as you would have to do with any journaling filesystem.

I don't recommend disabling the write cache unless you know your disk misbehaves, because it's actually a good thing - it improves performance and your disk will last longer.

However, there's another thing that worries me a little more, and that I'll have to look into later on.

The issue is with the Linux kernel read cache. I don't know exactly at what level ZFS caches nodes/blocks, so if I'm not careful, there could be cache duplication, which will manifest itself in wasted memory usage.
FUSE has a few mount options that allow one to control the kernel cache behaviour - direct_io, kernel_cache and auto_cache.

Actually, I don't know what will be better - disabling the kernel cache or disabling the ZFS cache (or portions of it).
I'll try to investigate this issue when the time comes :)


Jim said...

"Actually, regarding the write cache, ZFS on (Open)Solaris enables it if you give it a whole disk."

It enables the cache? Are you sure what it's enabling isn't the cache write-through? How does ZFS guarantee the order that blocks are written to the disk if the cache is enabled?

Just trying to understand here.

wizeman said...

Ok, we are talking about 2 different write caches: the kernel cache and the disk cache. ZFS enables the disk writeback cache (see 'man hdparm', switch '-W', or 'man blktool', option 'wcache' -- they do the same thing).

To guarantee the order of writes, ZFS uses the sync/flush ioctl.

That command only returns when all cached/buffered writes are written to the disk platter. This includes both caches - the kernel cache and the disk writeback cache.

At that point ZFS can write the uberblock(s) and issue another sync/flush command.

In well-behaving disks, this guarantees the order of writes.
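The sequence described above (write the transaction's blocks, flush, write the uberblock, flush again) can be sketched roughly like this. It is only an illustration in Python on an ordinary file, using `os.fsync` as the flush; `commit_transaction`, `demo`, the offsets and the payloads are all invented for the example and are not ZFS code.

```python
import os
import tempfile


def commit_transaction(fd, blocks, uberblock_offset, uberblock):
    """Write data blocks, flush, then write the uberblock and flush again."""
    # Stage 1: write every block that belongs to the transaction.
    for offset, data in blocks:
        os.pwrite(fd, data, offset)
    # Barrier: do not continue until the blocks above have been flushed.
    os.fsync(fd)
    # Stage 2: only now publish the new state by writing the uberblock.
    os.pwrite(fd, uberblock, uberblock_offset)
    os.fsync(fd)


def demo():
    """Commit two made-up blocks plus a made-up uberblock to a temporary file."""
    with tempfile.TemporaryFile() as f:
        fd = f.fileno()
        blocks = [(4096, b"A" * 512), (8192, b"B" * 512)]
        commit_transaction(fd, blocks, 0, b"UBER")
        return os.pread(fd, 4, 0), os.pread(fd, 1, 4096), os.pread(fd, 1, 8192)
```

The point of the first flush is that the uberblock can never reach the disk before the blocks it refers to, whatever reordering the caches do in between.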

As a matter of fact, since the vdev label (which contains the uberblock) is the most critical piece of information in a ZFS pool, the ZFS on-disk structure has 4 labels per disk, and it uses 2-staged uberblock writes (see section 1.2.2 of this pdf), which should make it even safer.
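As a rough illustration of the 4-label layout and the two-stage update, here is a sketch in Python. It assumes, as far as I understand the on-disk format document, that each label is 256 KiB, that two labels sit at the start of the device and two at the end, and that the even labels (L0, L2) are written and flushed before the odd ones (L1, L3). The helper names and the device size (taken to be a multiple of the label size) are made up for the example.

```python
import os
import tempfile

LABEL_SIZE = 256 * 1024  # assumed label size: 256 KiB


def label_offsets(device_size):
    """Offsets of L0..L3: two at the start of the device, two at the end."""
    return [0, LABEL_SIZE, device_size - 2 * LABEL_SIZE, device_size - LABEL_SIZE]


def update_labels(fd, device_size, label):
    """Two-stage update: even labels (L0, L2) first, then odd labels (L1, L3)."""
    offsets = label_offsets(device_size)
    for stage in ((0, 2), (1, 3)):
        for index in stage:
            os.pwrite(fd, label, offsets[index])
        # Flush between stages so one pair of labels is always intact
        # if power fails in the middle of the update.
        os.fsync(fd)


def demo():
    """Run the update against a sparse temporary file standing in for a disk."""
    size = 8 * LABEL_SIZE
    with tempfile.TemporaryFile() as f:
        f.truncate(size)
        update_labels(f.fileno(), size, b"LABEL")
        return [os.pread(f.fileno(), 5, off) for off in label_offsets(size)]
```

If the update is interrupted during the first stage, the odd labels still hold the previous state; if it is interrupted during the second, the even labels already hold the new one.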

This sync/flush command is the same technique that journaled filesystems use to guarantee filesystem consistency - the biggest difference is that ZFS does copy-on-write (COW) transactional modifications, atomically updating the uberblock, while journaled filesystems must guarantee the order between normal filesystem modifications and journal modifications.

Anonymous said...

I presume your post-exam hangover must be wearing off by now; could you please post an update on the project?

wizeman said...

I will as soon as I work on it some more :p
I'm still going to have one more exam (tomorrow), so.. hang on a couple of days ;)