I know it's been almost 2 weeks since my last post, but I'm still in my university exam season. Anyway, after my last exam next Wednesday (the 19th), I'll be free to work on this project full-time ;)
Today I received a few interesting questions from Jim Thompson that he and I think you should know about.
"(snip) ...in reading the ZFS mailing list I've seen a couple of mentions that ZFS turns off the write cache on the disks it manages. There may be other low-level disk control issues in ZFS as well. Is it possible for ZFS to accomplish these low-level operations when running from user code in FUSE?
Secondly, how does FUSE+ZFS ensure that the linux kernel's disk cache doesn't interfere with ZFS's writes to the disk. When ZFS thinks it's written a block to disk, is there any possibility that the block is actually cached inside the linux kernel's list of dirty disk pages?"
Actually, regarding the write cache, ZFS on (Open)Solaris enables it if you give it a whole disk. The real problem with disk write caches is the reordering of writes: ZFS must have a guarantee that all the writes in the current transaction are flushed to the disk platter before it writes the uberblock, in case power fails.
This will be accomplished in zfs-fuse by calling fsync(2) on file vdevs and ioctl(BLKFLSBUF) on block devices at the appropriate times (which ZFS already does), in order to flush all writes to disk. The (Linux) kernel guarantees that this will happen (on sane disks).
This is the only low-level interaction with the disks that ZFS cares about.
If your hard disk is broken or misbehaving in a way that makes it ignore the command to flush the write cache, you can always disable the write cache with hdparm(8)/sdparm(8)/blktool(8), just as you would have to with any other journaling filesystem.
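For example (the device names here are placeholders — substitute your actual disk, and only do this if you know your disk ignores cache flushes):

```shell
# Disable the on-disk write cache:
hdparm -W 0 /dev/hda        # IDE/ATA disks
sdparm --clear=WCE /dev/sda # SCSI disks
```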
I don't recommend disabling the write cache unless you know your disk misbehaves, because it's actually a good thing - it improves performance and your disk will last longer.
However, there's another thing that worries me a little more, and that I'll have to look into later on.
The issue is the Linux kernel read cache. I don't know exactly at what level ZFS caches nodes/blocks, so if I'm not careful, there could be cache duplication, which would manifest itself as wasted memory.
FUSE has a few mount options that allow one to control the kernel cache behaviour - direct_io, kernel_cache and auto_cache.
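For reference, these are ordinary FUSE mount options; roughly speaking (the filesystem name below is purely illustrative, not the zfs-fuse command line):

```shell
# direct_io    - bypass the kernel page cache entirely
# kernel_cache - keep cached pages across open()s of a file
# auto_cache   - invalidate cached pages when the file's
#                modification time or size changes
some_fuse_fs -o direct_io /mnt/point   # hypothetical invocation
```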
I don't know yet which will be better: disabling the kernel cache or disabling the ZFS cache (or portions of it).
I'll try to investigate this issue when the time comes :)