Friday, June 30, 2006

Why ZFS is needed even in desktops and laptops

Personally, I can't use my computer unless I know I have reliable software and hardware. I think I have mostly achieved that goal, since my computer pretty much never crashes or loses data (apart from the occasional application bug).

Now, even though I use a reliable journaling filesystem (XFS) in my Linux system, I like to do a filesystem consistency check once in a while (usually at least once every 3 months), which can only happen in those rare times when I need (or want) to reboot. Today was one of those days.

And here are the results: xfs_repair.txt. I ended up with 90 files and empty dirs in lost+found. Why did this happen? It could be a hardware problem - either the hard disk, the SATA cable, the SATA controller or even the power supply; or a software bug - either in the SATA driver, the XFS code or somewhere else in the Linux kernel.

I actually suspect this is a hardware problem. This particular machine, back when I was using a different SATA controller and a different hard disk, had the very annoying problem of not flushing the disk write cache on reboots. This caused *a lot* of the problems you see above in the xfs_repair log. I even tried MS Windows, which would run chkdsk on every other reboot. So the problem was definitely hardware related. Even though I never fixed the problem, fortunately I never lost an important file!

Now, after the hard disk died, I bought a new one and changed SATA controller (my motherboard has 2), just to be on the safe side. But, well, as you can see above, something's still definitely not working correctly.

This is one of the reasons I need ZFS. I don't want to lose files or end up with mysteriously corrupted ones. I want to see how often data is corrupted. I want to see if corruption only happens after a reboot (which means it's a disk write cache flush problem), or if it happens while the system is running (I can't fsck XFS filesystems while they're being used). Of course, I want to do this in order to diagnose the problem and fix it.

And even if the hardware only corrupts data once in a blue moon, I need my filesystem to properly report a checksum error and retry (or return an error), instead of returning corrupted data. Basically, I want a reliable computer.

Monday, June 26, 2006

zfs and zpool programs successfully compile and link.

Beware -- long post.

Sorry for the lack of updates recently, I've been kind of busy with other stuff.. ;)

I have good news. As you can see from the title, the zfs and the zpool programs now successfully compile and link :)

There are 2 known functionality losses: Linux doesn't seem to have a libdiskmgt equivalent and porting it seems rather complicated if not almost impossible, so there is no "device in use" detection. If anyone knows a good way to solve this, I'm all ears :)

The other loss is the "whole disk" support. This must eventually be solved, since it's rather common in Solaris to dedicate a whole disk as a zpool vdev. I believe this can be circumvented for now by creating a partition that uses the whole disk, and creating the zpool vdev using the partition device.

The specific problem is that ZFS uses EFI labels in disks. I only found one working EFI library for Linux, but the API is different from the OpenSolaris implementation. Once again, porting the EFI functionality from OpenSolaris proved to be difficult.


Unfortunately the zfs and zpool programs don't do anything useful yet.

In the original implementation, they use the ioctl(2) system call through the /dev/zfs device to communicate with the kernel. Since this is a userlevel implementation, there will be no /dev/zfs.

In the zfs-fuse implementation, instead of using ioctl(), we communicate through a UNIX domain (local) socket which is created in /tmp/.zfs-unix/. So in order to make these commands actually do something, there must be a zfs-fuse process that answers the messages sent from zpool and zfs.
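To make the idea more concrete, here is a minimal sketch of what the client side of such a socket could look like. It is only an illustration: the message layout (zfs_msg_t), the function names and the idea of a fixed-size request/reply are my assumptions, not the actual zfs-fuse protocol, and the socket path is left as a parameter because only the directory is known.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical fixed-size message; the real wire format is not described. */
typedef struct zfs_msg {
	uint32_t cmd;	/* stands in for the ioctl request number */
	int32_t error;	/* errno-style result filled in by the daemon */
} zfs_msg_t;

/* Connect to the daemon's UNIX socket; returns a descriptor or -1. */
int zfs_sock_connect(const char *path)
{
	struct sockaddr_un addr;
	int fd;

	if (strlen(path) >= sizeof (addr.sun_path)) {
		errno = ENAMETOOLONG;
		return (-1);
	}
	fd = socket(AF_UNIX, SOCK_STREAM, 0);
	if (fd == -1)
		return (-1);
	memset(&addr, 0, sizeof (addr));
	addr.sun_family = AF_UNIX;
	strcpy(addr.sun_path, path);
	if (connect(fd, (struct sockaddr *)&addr, sizeof (addr)) == -1) {
		int saved = errno;	/* close() may overwrite errno */
		close(fd);
		errno = saved;
		return (-1);
	}
	return (fd);
}

/* Transfer exactly len bytes, retrying on short transfers and EINTR. */
static int xfer_all(int fd, void *buf, size_t len, int do_write)
{
	char *p = buf;

	while (len > 0) {
		ssize_t n = do_write ? write(fd, p, len) : read(fd, p, len);
		if (n == -1 && errno == EINTR)
			continue;
		if (n <= 0)
			return (-1);
		p += n;
		len -= (size_t)n;
	}
	return (0);
}

/* Stand-in for ioctl(): send one request, then wait for the reply. */
int zfs_sock_ioctl(int fd, const zfs_msg_t *req, zfs_msg_t *resp)
{
	zfs_msg_t tmp = *req;

	if (xfer_all(fd, &tmp, sizeof (tmp), 1) == -1)
		return (-1);
	return (xfer_all(fd, resp, sizeof (*resp), 0));
}
```

The zfs-fuse process would sit on the other end of the socket, read each request, run the equivalent of the ioctl handler and write the reply back.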

My plan to make zfs-fuse work is to take advantage of as much code from the original implementation as possible. Yes, this also means I want to use the original ZPL code. It seems to be the most reliable way to do it, and perhaps also the easiest one :)

In order to do that, I've created a libsolkerncompat library that will implement/translate the necessary OpenSolaris kernel code to make the ZPL work. This library will also be necessary in order to use the original zfs_ioctl.c implementation, since it uses some kernel VFS (Virtual File System) operations, along with other things.

This will also take some time to get working, since a new zfs_context.h must be created or (more likely) the current zfs_context.h must be factored out to libsolkerncompat.

So, in a way, I can say I'm now in phase 3.5, since I'm actually working on the zfs-fuse process (which includes the necessary bits to make the ZPL code work), but I'm still not in phase 4, since zpool and zfs won't work until zfs_ioctl.c is ported.

Thursday, June 22, 2006

Massive cleanup, libzfs almost compiling

Today I did a massive cleanup of the source code.

As of now, there's a new library called libsolcompat that implements or translates all necessary Solaris-specific functions into glibc functions (yes, FreeBSD will require some work here).

This massive cleanup also means that all #includes will be exactly the same between the original source files and the Linux port! :)

This was achieved by overriding some glibc system headers with a local header, with the help of a gcc-specific preprocessor directive called #include_next. This directive allows one to include the overridden system header, while adding some functionality to it.

You can see a trivial example of this in the libsolcompat string.h, where a Solaris-specific function called strlcpy() was added.
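For those who haven't seen #include_next before, the technique looks roughly like the sketch below. This is not the actual libsolcompat file: the guard name, the solcompat_strlcpy name and the macro indirection (used here so the sketch also builds against a libc that already ships strlcpy) are my assumptions. The file is meant to be named string.h and to live in a directory that is passed with -I, so that it is found before the system header.

```c
/* string.h: local override, found before the system header via -I */
#ifndef SOLCOMPAT_STRING_H
#define SOLCOMPAT_STRING_H

/* Continue the search and pull in the real header that this file shadows. */
#include_next <string.h>

/*
 * Copy src into dst, which is size bytes long, always NUL-terminating
 * when size > 0. Returns strlen(src) so callers can detect truncation.
 */
static inline size_t solcompat_strlcpy(char *dst, const char *src, size_t size)
{
	size_t len = strlen(src);

	if (size > 0) {
		size_t n = (len < size - 1) ? len : size - 1;
		memcpy(dst, src, n);
		dst[n] = '\0';
	}
	return (len);
}

/* Route strlcpy() calls to the local version. */
#undef strlcpy
#define strlcpy solcompat_strlcpy

#endif
```

A source file that does #include <string.h> then gets everything glibc provides plus strlcpy(), without touching the #include lines of the original code.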

With this new file structure in place, it is now much easier to port new files.

This also means libzfs is almost fully compiling. There are only a few functions left to be ported, which don't seem too difficult. However, getting it to work correctly will still require some work, as there are some differences between a real filesystem and a FUSE filesystem (I don't think mount(2) will work, ioctls must be replaced by UNIX sockets, etc), and between Solaris and Linux, obviously (some /dev/dsk references are still in place, etc).

Tuesday, June 20, 2006

Phase 3 has begun


Today I finally started working on Phase 3.

I already got libuutil to compile (zpool needs it), but in the process I stumbled upon a very subtle problem. The problem is that, when porting OpenSolaris code to Linux, the -Wno-unknown-pragmas flag is dangerous (I was using it to silence the warnings about #pragma ident).

Why is it dangerous, you ask? Because it also hides the fact that gcc is ignoring #pragma init and #pragma fini. Then why do the OpenSolaris developers use -Wno-unknown-pragmas without problems?
Because when gcc is compiling to a Solaris target, it recognizes those pragmas. Not in Linux, though.

Well what does this mean? It means that to be on the safe side (I really don't want to track down an obscure bug related to a #pragma init that I missed) I had to remove that flag from the main SConstruct file, and I had to change 144 source files just to remove #pragma idents... On the bright side, I only have to do this once, since all future code integrations from OpenSolaris are patched automatically.
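To show why a missed #pragma init matters: on a Linux target the pragma is skipped, so the init function never runs. One way to get the same behavior with gcc on Linux is the constructor and destructor function attributes, sketched below. This is an illustration rather than a description of what the port does, and the module_* names are made up.

```c
/* Hypothetical module state that must be set up before main() runs. */
static int module_ready = 0;

/*
 * On Solaris the original code would say:
 *     #pragma init(module_init)
 *     #pragma fini(module_fini)
 * gcc on Linux skips those pragmas, so the attributes below are used instead.
 */
static void module_init(void) __attribute__((constructor));
static void module_fini(void) __attribute__((destructor));

/* Runs automatically before main(). */
static void module_init(void)
{
	module_ready = 1;
}

/* Runs automatically at normal process exit. */
static void module_fini(void)
{
	module_ready = 0;
}

/* Lets callers check that initialization really happened. */
int module_is_ready(void)
{
	return (module_ready);
}
```

If the attribute (or a working pragma) is missing, module_is_ready() simply returns 0 and nothing warns you, which is exactly the kind of obscure bug I want to avoid.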

There's a related problem I had to deal with earlier. The assembler used by gcc on Solaris treats the slash (/) symbol as the start of a comment. Unfortunately, in Linux it doesn't, which means I had to remove all comments from the ported assembly files.

Anyway, in other news, Ernst Rholicek has contributed a Gentoo Ebuild for all lazy Gentooers out there who want to help testing ;)

That's it for now -- as usual, I'll keep you posted on my progress.

Thursday, June 15, 2006

Week report


Unfortunately this week I've been very busy with school and some personal stuff, so I've not been able to make much progress.

However, I've received some successful ztest reports on single-CPU and SMP machines (and a few bug reports too ;)
In the meantime, I've set up a public access Subversion repository and I've been releasing minor version updates to fix bugs. Version 0.1.3 is finally expected to work right on 32-bit machines (there were a few timer-related integer overflows).
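As an illustration of this class of bug (not the exact code that was fixed), consider converting a struct timespec into nanoseconds. The hrtime_t type and the NANOSEC constant follow the Solaris names; the function name ts_to_hrtime is made up for this sketch.

```c
#define _POSIX_C_SOURCE 200809L
#include <time.h>

typedef long long hrtime_t;	/* nanoseconds, as in Solaris */
#define NANOSEC 1000000000	/* plain int constant */

/* Convert a timespec to nanoseconds without overflowing on 32-bit. */
hrtime_t ts_to_hrtime(const struct timespec *ts)
{
	/*
	 * The cast must come before the multiplication. Where tv_sec is a
	 * 32-bit type, tv_sec * NANOSEC is computed in 32 bits and
	 * overflows as soon as tv_sec is greater than 2.
	 */
	return ((hrtime_t)ts->tv_sec * NANOSEC + ts->tv_nsec);
}
```

On amd64 the uncast version happens to work because long is 64 bits wide there, which is why this kind of problem only shows up in reports from 32-bit machines.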

There is now also a known bug in the glibc version 2.3.2 that comes in Debian Stable (sarge), related to condition variables, that causes a deadlock. You can read the details here. There is no known workaround (except upgrading glibc manually..).
A special thanks goes to Eric Hill for reporting and helping me find this bug (and 2 other ones).

Since I'm going to have an exam tomorrow and I still have a few other things to do, I'm only going to be able to start phase 3 around Sunday or so. In the meantime, if you still haven't done so, you can help testing ;)

Saturday, June 10, 2006

Testing ZFS on FUSE

The result of the phase 2 of this project is now available for download.

If you want to help testing, please download the source code and follow the README and TESTING instructions.

A few notes:
  • If you have access to a real SMP (dual-core, dual-proc or even a quad-proc ;) machine, I would really appreciate it if you tested this program. There are no known bugs at this moment, but if there are any problems, the most probable cause is in the new threading code.
  • You need SCons to compile (just do the usual 'apt-get install scons', 'emerge scons', or 'yum install scons')
  • Currently, it only runs on the x86 and amd64 architectures.
If you have a SPARC machine and are willing to help testing, please let me know so that I can port the necessary code.

If you have a machine of another architecture and want to help port to it, you'll have to implement about 2 or 3 assembly functions. If you're interested, let me know so that I can help ;)

FreeBSD is not my primary goal at this moment. After the SoC finishes, I will make sure it works. If you can't wait, I accept patches :P

Friday, June 09, 2006

Successfully compiled zdb and ztest

Hi everyone, I have great news.

Apparently, I managed to compile zdb and ztest ;)

After fixing a few bugs, I also managed to run ztest a few times for 5 minutes with no problems (there are still bugs, however).

Once I manage to fix the remaining bugs (tomorrow or the day after, I hope), I'm going to upload the source code and give instructions for whoever wants to help testing.

Stay tuned :)

Saturday, June 03, 2006

Current issues

These are the libzpool files that are already compiling without warnings: util.c, bplist.c, dmu.c, dmu_objset.c, dmu_traverse.c, spa.c, spa_misc.c, vdev.c, zap.c and zap_micro.c. Also, zdb.c and zdb_il.c are already compiling.
The ZFS developers have done a very good job, since all of those files required only a few simple fixes, and I expect a lot of the other ones will too (except of course kernel.c, and perhaps taskq.c - I haven't had a close look at it yet).
So far, the hard part was getting the headers right, especially zfs_context.h.

Most of the code changes so far were specific to gcc (fixing warnings) and POSIX threads (and maybe a few glibc ones, I don't know for sure). I intend to separate Linux-specific code into a linux.c file, or at least mark it with a /* LINUX */ tag, so that porting to FreeBSD will be easier.

Anyway, here's what I still have to do to make zdb work:
  • Port kernel.c (hard)
  • Port the remaining libzpool files (should be easy)
  • Figure out how to implement the atomic_add_64() and atomic_cas_64() functions..
  • Figure out how to make read-write locks work (using POSIX threads). glibc already provides read-write locks (when __USE_UNIX98 is defined in the system header features.h), but I think it's impossible to know if the current thread owns a read (or write) lock, which is necessary to implement the RW_xxx_HELD() macros...
  • Port libumem. Actually, I'm planning on doing a simple stub implementation using malloc() and mutexes, and later integrate the libumem Linux port.
  • Port whatever else is needed that I still haven't found out ;)
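For the atomic functions in the list above, one possible direction is to lean on compiler builtins instead of hand-written assembly. The sketch below uses gcc's __atomic builtins, which need a reasonably recent gcc and, on 32-bit x86, may need a suitable -march or libatomic. The function signatures follow the Solaris ones; whether this is the right approach for this port is an open question.

```c
#include <stdint.h>

/* Atomically add delta to *target. */
void atomic_add_64(volatile uint64_t *target, int64_t delta)
{
	(void) __atomic_fetch_add(target, (uint64_t)delta, __ATOMIC_SEQ_CST);
}

/*
 * Compare-and-swap: if *target equals cmp, store newval.
 * Returns the value that was in *target before the call.
 */
uint64_t atomic_cas_64(volatile uint64_t *target, uint64_t cmp,
    uint64_t newval)
{
	uint64_t expected = cmp;

	/* On failure, the builtin writes the current value into expected. */
	(void) __atomic_compare_exchange_n(target, &expected, newval, 0,
	    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
	return (expected);
}
```

The caller can tell whether the swap happened by comparing the return value of atomic_cas_64() with cmp.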
Comments are welcome :)
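On the read-write lock question, one idea is to wrap pthread_rwlock_t in a structure that records the writer and counts the readers. This is only a sketch under my own assumptions: the field names and rw_compat_init() are made up, and RW_READ_HELD() can only say that some thread holds a read lock, not that the current one does, so it is good enough for assertions but not exact.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>

typedef enum { RW_WRITER, RW_READER } krw_t;

typedef struct krwlock {
	pthread_rwlock_t rw_lock;
	pthread_t rw_owner;	/* valid only while rw_has_owner is set */
	int rw_has_owner;	/* nonzero while a writer holds the lock */
	int rw_readers;		/* number of current read holders */
} krwlock_t;

#define	RW_WRITE_HELD(rw) \
	((rw)->rw_has_owner && pthread_equal((rw)->rw_owner, pthread_self()))
#define	RW_READ_HELD(rw) \
	(__atomic_load_n(&(rw)->rw_readers, __ATOMIC_SEQ_CST) > 0)
#define	RW_LOCK_HELD(rw)	(RW_READ_HELD(rw) || RW_WRITE_HELD(rw))

/* Hypothetical simplified initializer. */
void rw_compat_init(krwlock_t *rw)
{
	pthread_rwlock_init(&rw->rw_lock, NULL);
	rw->rw_has_owner = 0;
	rw->rw_readers = 0;
}

/* Take the lock and record who holds it. */
void rw_enter(krwlock_t *rw, krw_t type)
{
	if (type == RW_WRITER) {
		pthread_rwlock_wrlock(&rw->rw_lock);
		rw->rw_owner = pthread_self();
		rw->rw_has_owner = 1;
	} else {
		pthread_rwlock_rdlock(&rw->rw_lock);
		__atomic_add_fetch(&rw->rw_readers, 1, __ATOMIC_SEQ_CST);
	}
}

/* Undo the bookkeeping first, then release the lock. */
void rw_exit(krwlock_t *rw)
{
	if (RW_WRITE_HELD(rw))
		rw->rw_has_owner = 0;
	else
		__atomic_sub_fetch(&rw->rw_readers, 1, __ATOMIC_SEQ_CST);
	pthread_rwlock_unlock(&rw->rw_lock);
}
```

Tracking exactly which threads hold a read lock would need per-thread state, which is probably overkill if the macros are only used in ASSERT()s.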