Friday, June 30, 2006

Why ZFS is needed even in desktops and laptops

Personally, I can't use my computer unless I know I have reliable software and hardware. I think I have achieved that goal mostly, since my computer pretty much never crashes or loses data (apart from the occasional application bugs).

Now, even though I use a reliable journaling filesystem (XFS) in my Linux system, I like to do a filesystem consistency check once in a while (usually at least once every 3 months), which can only happen in those rare times when I need (or want) to reboot. Today was one of those days.

And here are the results: xfs_repair.txt. I ended up with 90 files and empty dirs in lost+found. Why did this happen? It could be a hardware problem - either the hard disk, the SATA cable, the SATA controller or even the power supply; or a software bug - either in the SATA driver, the XFS code or somewhere else in the Linux kernel.

I actually suspect this is a hardware problem. This particular machine, back when I was using a different SATA controller and a different hard disk, had the very annoying problem of not flushing the disk write cache on reboots. This caused *a lot* of the problems you see above in the xfs_repair log. I even tried MS Windows, which would run chkdsk on every other reboot. So the problem was definitely hardware related. Even though I never fixed the problem, fortunately I never lost an important file!

Now, after the hard disk died, I bought a new one and switched SATA controllers (my motherboard has 2), just to be on the safe side. But, well, as you can see above, something's still definitely not working correctly.

This is one of the reasons I need ZFS. I don't want to lose files or end up with mysteriously corrupted ones. I want to see how often data is corrupted. I want to see if corruption only happens after a reboot (which means it's a disk write cache flush problem), or if it happens while the system is running (I can't fsck XFS filesystems while they're being used). Of course, I want to do this in order to diagnose the problem and fix it.

And even if the hardware only corrupts data once in a blue moon, I need my filesystem to properly report a checksum error and retry (or return an error), instead of returning corrupted data. Basically, I want a reliable computer.
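To make the "report a checksum error and retry" idea concrete, here is a minimal Python sketch of what I mean. It is only an illustration, not how ZFS is actually implemented: the names `checksum`, `read_verified` and `ChecksumError` are made up for this example, SHA-256 is just a convenient choice of checksum, and `read_block` stands in for whatever actually fetches the data from disk.

```python
import hashlib


class ChecksumError(IOError):
    """Raised when a block never matches its stored checksum (hypothetical name)."""


def checksum(data: bytes) -> bytes:
    # SHA-256 is an assumption made for this sketch only.
    return hashlib.sha256(data).digest()


def read_verified(read_block, expected: bytes, retries: int = 3) -> bytes:
    """Call read_block() up to `retries` times and return the data only if
    its checksum matches `expected`. Otherwise raise ChecksumError instead
    of handing corrupted data back to the caller."""
    for _ in range(retries):
        data = read_block()
        if checksum(data) == expected:
            return data
    raise ChecksumError(f"block failed checksum after {retries} attempts")


# Usage: simulate a read path that returns a corrupted block on the first try.
good = b"important data"
stored = checksum(good)            # would have been saved when the block was written
attempts = iter([b"importent data", good])
result = read_verified(lambda: next(attempts), stored)
```

The point is that the caller either gets data that matches what was written or gets an explicit error, never silently corrupted bytes.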

4 comments:

Jeb said...

Ouch -- glad everything is ok.

I totally agree that we need a fs like zfs to eliminate or at least know about potential points of failure. (Beyond the great features like snapshotting and pooling of storage...)

I'm really hoping that this project will get us closer to having an in-kernel ZFS patch (distros couldn't distribute the kernel mod, but users could patch...)

Thanks again for your great work.

Anonymous said...

amen...

im trying to repair an ext3 filesystem that was almost completely write once files...ive got about 90k of lines in my fsck log so far this is crazy...

i get the sense that fsck.ext3 finds itself pointed at the middle of an avi and decides its file metadata and starts "fixing" based on random bits...but im perhaps a little bitter right now ;)

thanks for this project...im using nexenta and zfs in a vmware container (against raw disk) right now rather than use xfs or ext3. it works well and performs well but id love a linux native solution.

Dave Abrahams said...

Go, man! ZFS is clearly the grail :)
Thanks for taking this one on.

When this is done, do we have Raid-Z capability automatically, or is that another project?

wizeman said...

Yep - RaidZ, RaidZ-2, mirroring and dynamic striping will definitely be supported :)