Monday, April 16, 2007

ZFS in the Linux kernel?

Due to the recent surge of interest in porting ZFS to the Linux kernel (if you are in the mood to read dozens of messages, see this thread, the follow-up, plus this one and one more), I'd like to offer my view on things.

I have a feeling most Linux kernel hackers (or at least those that talk about ZFS on linux-kernel) don't really know how ZFS works or what it can do. The best example is perhaps this message from Rik van Riel.

Well, first of all, ZFS doesn't have/need fsck (what?! are you nuts??). This is because ZFS checks and repairs the filesystem online and on-the-fly, as it is being used. And when it can't repair, it will pinpoint exactly which files and which bytes in those files were corrupted. You might think this is complex or expensive, but it's really simple and beautiful actually. These slides explain this and a lot more, so please read them carefully.
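To give a rough feel for how corruption can be pinpointed down to the byte range, here is a toy sketch in Python. It is not ZFS code: the 4-byte block size, the SHA-256 checksum and the names `write_file` and `find_corruption` are all assumptions made for illustration. The only idea it borrows is that each block's checksum is stored in the pointer that refers to it, apart from the data itself.

```python
import hashlib

BLOCK_SIZE = 4  # tiny blocks so the example stays readable (assumption)


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def write_file(data: bytes):
    """Split data into blocks and return (blocks, pointers).

    Each pointer holds the checksum of the block it refers to, so the
    checksum lives apart from the data it protects.
    """
    blocks = [bytearray(data[i:i + BLOCK_SIZE])
              for i in range(0, len(data), BLOCK_SIZE)]
    pointers = [checksum(bytes(b)) for b in blocks]
    return blocks, pointers


def find_corruption(blocks, pointers):
    """Return the (start, end) byte ranges of blocks that fail verification."""
    bad = []
    for i, (block, expected) in enumerate(zip(blocks, pointers)):
        if checksum(bytes(block)) != expected:
            start = i * BLOCK_SIZE
            bad.append((start, start + len(block)))
    return bad


blocks, pointers = write_file(b"hello, zfs world")
blocks[2][0] ^= 0xFF  # simulate silent corruption on disk
damaged = find_corruption(blocks, pointers)
print(damaged)
```

Because verification happens per block, a failed check immediately says which byte range of which file is affected, without scanning anything else.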

The great thing is that ZFS can also repair metadata on-the-fly even on ZFS pools that don't have any inherent redundancy (in other words, this also works for single disks). This is due to a feature called ditto blocks, which basically keeps multiple copies of metadata dynamically spread through the disk. Oh and now this works for data too, so you can configure a filesystem holding important files to keep 2 or even 3 copies of data on the disk (this is independent of any inherent pool redundancy).
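The self-healing idea behind ditto blocks can be sketched the same way. Again, this is a toy model under my own assumptions, not how ZFS is implemented: the `DittoBlock` class and its in-memory copies are hypothetical stand-ins for copies placed at different locations on a disk.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


class DittoBlock:
    """A block stored as several copies, plus the checksum its parent holds."""

    def __init__(self, data: bytes, copies: int = 2):
        self.expected = checksum(data)
        self.copies = [bytearray(data) for _ in range(copies)]

    def read(self) -> bytes:
        """Return the first copy that verifies, rewriting any bad copies."""
        good = next(
            (bytes(c) for c in self.copies
             if checksum(bytes(c)) == self.expected),
            None,
        )
        if good is None:
            raise IOError("all copies failed verification")
        for i, c in enumerate(self.copies):
            if bytes(c) != good:
                self.copies[i] = bytearray(good)  # heal the bad copy in place
        return good


block = DittoBlock(b"directory entry", copies=3)
block.copies[0][0] ^= 0xFF  # damage one copy
data = block.read()
```

A read succeeds as long as at least one copy still matches the checksum, and the damaged copies get rewritten as a side effect of that read.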

ZFS has a lot of other nice things too, like cheap and instantaneous snapshots and clones, optional compression, variable block sizes, easy management, and more. I really think interested people should read these slides and try zfs-fuse.

Now regarding a ZFS port to the Linux kernel:

1) As for technical difficulty, I don't think it is a problem at all. I don't know Linux VFS internals, but if I was able to port it so easily to FUSE, a kernel port can certainly be done.

2) As for the license, well.. that is a real problem. I'm a big believer in FSF's ideals, but in this case I think the GPLv2 is preventing progress. It would be a big plus to have Linux benefit from a fully open-source, useful piece of functionality with 6 years of development behind it.

Of course, as Adrian Bunk put it, I don't think it'll be possible to get 10,000 (living and dead) people to agree on a licensing change.

One option would be to reimplement ZFS (or a comparable filesystem) from scratch. I don't think this is feasible, first because it would require a huge effort and several years to reach the same level of robustness as ZFS has right now. And second because Sun has filed more than 50 patents on ZFS. Even if Sun never uses those patents against Linux, some people might see it as a risk (in the United States).

The only way I can see ZFS getting into the Linux kernel is to convince Sun to dual-license ZFS under the GPL and the CDDL. Some people might say Sun would never do this, but Sun has been very open to the open-source community recently. And in fact, Sun's ZFS FAQ initially had an answer saying Sun was considering a ZFS port to Linux (not to FUSE, that was my idea ;).

Finally I'd like to debunk a couple of myths about zfs-fuse:

1) In terms of features, zfs-fuse will certainly be comparable to a ZFS kernel implementation (and in fact, most of it already works). The only thing that can't be done is to store swap on a ZFS pool, due to the way ZFS works. You can see the STATUS file for more details about implemented features.
2) As for performance, well.. zfs-fuse is slow right now, but it will certainly improve. I haven't even started to seriously look at performance. And FUSE-based filesystems can have comparable performance to kernel filesystems, as the bottleneck is usually the disk(s), not the CPU.


Tephra said...

Personally I think you have done a great job in porting to FUSE so quickly. Just ignore the person who said zfs-fuse was silly and slow...

I am sure you will get decent performance out of zfs-fuse even before the kernel people decide on what to do, let alone implement it :)

Felix said...

You have done a great job on porting ZFS to Linux, even though arguments about FUSE's performance are still hanging around. Personally I don't really like extreme things (even extreme performance), so I like your FUSE approach. As we can see with ntfs-3g, performance need not be a big problem for userland file systems; what I would rather see is the balance between performance and feasibility that the kernel-userspace bridge offers. Nothing is more important than balance, not even performance.

Romain LE DISEZ said...

For information, ZFS is now part of FreeBSD-CURRENT and it works very well. Performance is decent.

A small question: are you planning to upgrade the version of ZFS in your code? I ask because I upgraded my ZFS pool in FreeBSD and now I can't use ZFS-Fuse to read it, because its version is lower than the one in FreeBSD. (The FreeBSD version is 6.)

Alex said...

yeah, of course, you don't need fsck because ZFS is bug free by definition!

wizeman said...


Yes, but probably only after zfs-fuse 1.0 is released; I don't want to introduce instability in the beta versions.

Serge said...

Regarding the license issue, I think you're not looking at the whole picture.

Sun has agreed that, if things go as planned, they will re-release Solaris under the GPL3. This should include ZFS.

Then the only impediment to getting ZFS into the kernel would be the kernel team's reluctance to move to GPL3.

The idea of having a single license over everything is that it strengthens the license. I don't know if you remember the fiasco in Linux back in 1997/1998 where the sound system had to be replaced due to license issues. Those kinds of issues get ugly and so it's better to have one license.

Linus has recently said he'll look at GPL3, now that the third draft is out.

As much as I love the FUSE port, the feature that allows a ZFS filesystem to export block devices seems like one that would require an in-kernel port to operate properly.

That said, this work you've done has been incredible.

Stephen said...

Just to correct something Serge said:
"Sun has agreed that, if things go as planned, they will re-release Solaris under the GPL3. This should include ZFS."

Sun has not made any such statement or agreement. They have not made any commitment to relicensing or dual-licensing any OpenSolaris code under the GPLv3 yet.

Chris Samuel said...

Even if Sun were to dual license under the forthcoming GPLv3 I don't know that it would help. :-(

Anonymous said...

If you're a Tanenbaum-head like me, FUSE is preferable to in-kernel drivers. Let's move everything into userspace. Microkernels are cool!

Chris Samuel said...

Microkernels may be "cool", but there are penalties you pay for going that way. LWN did some interesting measurements and found:

The measured system call overhead for Minix is a full ten times higher than the value for Linux. The file copy tests ran between two and ten times faster on Linux. Pipe throughput differed by a factor of seven; Minix was 140 times slower at process creation.

So if you're happy with that trade-off then that's fine, you just need to be aware that it is there.

Anonymous said...

You might be interested in this:

Sun has put ZFS support into GRUB, which means that code is now GPL2 too.

The only thing left is the ZFS tools, which could be re-implemented.

Anonymous said...

Heh. OK, a Linux publication measured Linux -- a system with years of optimization for running on some of the world's fastest supercomputers -- against a teaching OS. One happened to be monolithic, the other a microkernel, and they decided that the difference in performance was due to this one architecture decision. Priceless!

(I should benchmark Windows XP against Minix for 3d graphics performance, and use that to determine whether free software is feasible for graphics.)

Older sources (which are not so Linux-biased) have found the performance hit of a microkernel architecture at 5-15%. And yeah, considering the time (and data, almost) I lost because of things like the Linux Firewire module giving me kernel panics, I'd gladly pay 10% in CPU time for reliability.

After all, if I want 10% more speed, I can buy a 10% faster CPU. You can't buy software reliability from hardware, though.

So if you're happy with that trade-off then that's fine, you just need to be aware that it is there.

Anonymous said...

Sun kinda likes (but won't admit) that they have something, in this case ZFS, which is simple and elegant for system administrators and which Linux cannot use, much like DTrace.

With Java this is different as it is a platform.

I'd like my software well tested before I'd trust my data to it. So if I wanted to run a NAS or SAN (I do), I would prefer either OpenSolaris with ZFS, FreeBSD's GEOM_RAID5 (or later with ZFS), or Linux's RAID implementation. I wouldn't want to trust such huge amounts of data to a userland application that is not well tested. But more importantly, I would really want to see some benchmarks performed by people who know how to benchmark. That excludes me though.

I still respect your work: because you can do it, for compatibility reasons, and simply because I love open source and ZFS on Linux sounds cool ;-). Heck, it might even allow me to test out ZFS! I'm sorry that I'm not that much of an enthusiast...

Anonymous said...

Has anyone contacted Simon Phipps in his role as the Sun Open Source ombudsman? Getting permission for the ZFS stuff to be incorporated into the kernel (and thus relicensed under the GPL) is something he can probably help with.

Anonymous said...

The LKML posts sound like a bunch of housewives at a Tupperware party wondering about that "Linux thing" and assuring themselves how much better Windows is, because it looks so nice and Linux is used only by acne-infested teenagers playing massacre games.
These guys did not even bother to look at the PowerPoint presentation, FAQ, intro or anything, but felt inclined to comment on LKML. Given that, they are most likely not productive kernel contributors either, so never mind them.

David Oftedal said...

Well done, well done!

Instead of moving non-GPL drivers INTO the kernel, though, shouldn't one really be talking about moving drivers out? It seems to me that if a filesystem driver can be run in userspace and eventually reach comparable performance, then other drivers such as binary-only graphics drivers and Wi-Fi drivers could be run in userspace as well.

Not only would this remove a potential license problem, but it would help solve a security problem as well: While it might be hard to probe the internals of a binary-only driver, it's easy enough to prevent a program running in userspace from accessing the filesystem or altering data in memory. A microkernel-like system might be a really good thing for Linux in more ways than one.

pradeep said...

Rik van Riel is correct, I guess. A huge file system database would need days to recover. The argument that ZFS fscks itself on the fly is feeble because you are still fscking it while you are online!!! It may not be like a traditional fsck, but you still cannot reliably get at mucked-up metadata or data that the fsck has not touched yet.

Good work though. Keep it up.

Bob Hunter said...

>2) As for the license, well.. that is a real problem.

No, it is not. There are other cases where the licence is far worse than ZFS's own, such as Nvidia's closed source for the video driver, and yet it has been possible to use those drivers at kernel level via add-ons. The same approach can be used for ZFS.

The real problem with ZFS is that Sun did not manage (yet) to boot from it.

Jack Ripoff said...

"Sun has been very open to the open-source community recently"

Not really. They are still unwilling to release documentation for their chips, and until they do we cannot write maintainable kernel drivers for them.

SDiZ said...

Some people still want to use fsck because they can't mount the filesystem. For example, the media is read-only. If I suspect the disk is corrupted, I would like to run an fsck rather than wait for a kernel panic at an unexpected time when it sees a weird error.

cp said...

For all the guys who want to fsck ZFS: it can be done online by manually 'scrubbing'.

When it finishes, you know that the data and the metadata are 100% consistent.

With ext's fsck you will never know whether the data is correct, since only metadata is checked.
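
As a rough illustration of what a scrub pass does conceptually, here is a toy sketch in Python. It is not the real ZFS scrub code: the `pool` dictionary layout and the `scrub` and `make_block` helpers are hypothetical, and SHA-256 is just a convenient checksum for the example. The point is that every block, data as well as metadata, is read and checked against its stored checksum.

```python
import hashlib


def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def make_block(data: bytes, copies: int):
    """Hypothetical block record: expected checksum plus its stored copies."""
    return {"sum": checksum(data), "copies": [data] * copies}


def scrub(pool):
    """Verify every copy of every block; repair from a good copy if one exists.

    Returns (repaired, unrecoverable) lists of block names.
    """
    repaired, unrecoverable = [], []
    for name, block in pool.items():
        good = [c for c in block["copies"] if checksum(c) == block["sum"]]
        if not good:
            unrecoverable.append(name)
        elif len(good) < len(block["copies"]):
            block["copies"] = [good[0]] * len(block["copies"])
            repaired.append(name)
    return repaired, unrecoverable


pool = {
    "metadata": make_block(b"inode table", copies=2),
    "file-a": make_block(b"user data", copies=1),
    "file-b": make_block(b"more data", copies=2),
}
pool["metadata"]["copies"][1] = b"inode tXble"  # one bad copy: repairable
pool["file-a"]["copies"][0] = b"user dXta"      # only copy is bad
result = scrub(pool)
```

In this model a block with a surviving good copy is repaired, while a block whose only copy is damaged is reported by name rather than silently returned as valid data.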