Friday, July 21, 2006

Status update

Woohoo, exams are over!! :)

Finally I'm going to have time to work on the project, yay :))


Today I got zfs_ioctl.c to compile (it's not linking yet; I still have to get libzpool to compile in the simulated kernel context, which probably means copy/pasting most of zfs_context.h into the correct libsolkerncompat headers).
However, even after zfs_ioctl links with libzpool-kernel, I still have to code some additional functionality in order to get the zfs and zpool commands working.


In other news, this week I've got a free 3-month Safari account, thanks to Google (and O'Reilly), which will be quite useful. It's incredible how these guys are always surprising me :D

After a little browsing of the available books, I've found one which has already proved itself to be helpful: Solaris Internals - Core Kernel Components. Although it was written at a time when only Solaris 7 was available, the VFS chapter content was still mostly accurate. I only wish it were more detailed.. :)

So, even with the help of the book (and the OpenSolaris OpenGrok browser, which I've been using since the beginning -- amazing, I already can't live without it), I've had some difficulty understanding some Solaris vfs/vnode interfaces, but I think I got it mostly right.

Of course, even if I haven't, I'm sure my kind and dedicated testers will help me find all the bugs, eventually.. ;)

Wednesday, July 12, 2006

FUSE implications on ZFS


I know it's been almost 2 weeks since my last post, but I'm still in my university exam season. Anyway, after my last exam next Wednesday (the 19th), I'll be free to work on this project full-time ;)

Today I've received a few interesting questions from Jim Thompson that both he and I think you should know about.

"(snip) reading the ZFS mailing list I've seen a couple of mentions that ZFS turns off the write cache on the disks it manages. There may be other low-level disk control issues in ZFS as well. Is it possible for ZFS to accomplish these low-level operations when running from user code in FUSE?

Secondly, how does FUSE+ZFS ensure that the linux kernel's disk cache doesn't interfere with ZFS's writes to the disk. When ZFS thinks it's written a block to disk, is there any possibility that the block is actually cached inside the linux kernel's list of dirty disk pages?"

Actually, regarding the write cache, ZFS on (Open)Solaris enables it if you give it a whole disk. The real problem with disks' write caches is the reordering of writes. ZFS must have a guarantee that all the writes in the current transaction are flushed to the disk platter before writing the uberblock, in case power fails.

This will be accomplished in zfs-fuse by calling fsync(2) on file vdevs and ioctl(BLKFLSBUF) on block devices at the appropriate times (which ZFS already does), in order to flush all writes to disk. The (Linux) kernel guarantees that this will happen (on sane disks).

This is the only low-level interaction with the disks that ZFS cares about.

If your hard disk is broken/misbehaving in a way that makes it ignore the command to flush the write cache, you can always disable the write cache with hdparm(8)/sdparm(8)/blktool(8), just as you would have to do with any other journaling filesystem.

I don't recommend disabling the write cache unless you know your disk misbehaves, because it's actually a good thing: it improves performance and your disk will last longer.

However, there's another thing that worries me a little more, and that I'll have to look into later on.

The issue is the Linux kernel's read cache. I don't know exactly at what level ZFS caches nodes/blocks, so if I'm not careful there could be cache duplication, which would manifest itself as wasted memory.
FUSE has a few mount options that allow one to control the kernel cache behaviour: direct_io, kernel_cache and auto_cache.

Actually, I don't know which will be better - disabling the kernel cache or disabling the ZFS cache (or portions of it).
I'll try to investigate this issue when the time comes :)