Tuesday, December 26, 2006

First alpha of ZFS on FUSE with write support

Ladies (?) and gentlemen, the first preview of ZFS on FUSE/Linux with full write support is finally here!

You can consider it my (late) Christmas gift for the Linux community ;)

Don't forget this is an alpha-quality release. Testing has been very limited.

Performance sucks right now, but it should improve before 0.4.0 final, when the multi-threaded event loop and kernel caching support are working (both should be easy to implement, since FUSE provides the kernel caching).

For more information, see the README, and check the STATUS file for what works and what doesn't. Download here.

Let me know how it works, and don't forget to report bugs!

Friday, December 15, 2006

Read-only support for ZFS on Linux

I know it has been a loong time since my last post (sorry!), but today I'm very excited to bring you zfs-fuse 0.3.0, which is able to mount ZFS filesystems in read-only mode :)

Current status:
  • It is possible to create and destroy ZFS pools, filesystems and snapshots.
  • It is possible to use disks (any block device, actually) and files as virtual devices (vdevs).
  • It is possible to use any vdev configuration supported by the original ZFS implementation. This includes striping (RAID-0), mirroring (RAID-1), RAID-Z and RAID-Z2.
  • It is possible to change properties of filesystems.
  • It is possible to mount ZFS filesystems, but you can only read files and directories; you cannot create, modify or remove them yet.
  • ZIL replay is not implemented yet.
  • It is not possible to mount snapshots.
  • It is not possible to use 'zfs send/recv'.
  • ACLs and extended attributes do not work.
  • There is no support for ZVols.
  • It's buggy and probably has a few memory leaks :p
If you want to test it, just download it and follow the README (don't forget to read the prerequisites).

A few notes:

  • Even though you can't write to filesystems, the pools are opened in read-write mode. There are bugs, and they could corrupt your pools, so don't use zfs-fuse on important files!
  • There's no point in running benchmarks, since it is still highly unoptimized.
  • You cannot write to ZFS filesystems yet, so the best you can do right now is populate a filesystem in Solaris and then mount it in Linux. I recommend creating your zpools on files, since that makes it easier to switch between Linux and Solaris (see the example session after these notes), but you can also use block devices directly.
  • I recommend using EVMS if you keep ZFS pools on block devices, since it places all of them under /dev/evms, which makes importing pools easier.
  • If you create your zpools in Solaris directly on whole disks, ZFS writes an EFI label, so to mount them properly on Linux you'll need GPT/EFI partition support configured in the kernel (I think most x86 and amd64 kernels don't have it enabled, so you may have to compile the kernel yourself). Since my USB disk has died and I'm still waiting for a replacement, I can't properly test this yet. The last time I tried, I had some difficulty getting it to work, but I think I managed it with EVMS.
  • In order to import zpools on block devices, you'll need to run 'zpool import -d /dev'. Be careful: at the moment, zpool will try to open every device in /dev looking for ZFS pools! If you're using EVMS, use /dev/evms instead.
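
To give you an idea of the file-based route, a session could look something like this (the pool name and paths are made-up examples, not anything the tools require):

(on Solaris)
# mkfile 100m /zpools/test1 /zpools/test2
# zpool create mypool mirror /zpools/test1 /zpools/test2
# cp -r /some/files /mypool
# zpool export mypool

(on Linux, with zfs-fuse running and the same files copied over)
$ ./zpool import -d /zpools mypool
$ ls /mypool
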
The project is progressing at a fast pace since last week, when I did some major code restructuring and finished uncommenting most of the original ZPL code :)

And I am still quite confused about vnode lifetimes, so expect some bugs and probably some memory leaks until I eventually sort it out.

Enjoy!

Tuesday, October 03, 2006

First VFS operation working

After 3 afternoons of work during my recovery period and a couple of hours today, I got the first VFS operation to work! :)

So... if you compile the latest development code from the Mercurial repository, you can now run 'df' on ZFS filesystems in Linux (but I don't recommend trying it on filesystems you don't want to corrupt) :p

It was a lot of work since I had to make zfs_mount() work, which depends on a lot of other code (I even had to import the range locks code).

The VFS is quite complex -- I still don't have a firm grasp of the whole VFS and vnode allocation/deallocation/reference counting, so my code is very messy, and there are still a lot of bugs :p

I was also a little disappointed to find that FUSE doesn't support remounting, so you won't be able to change mount properties on mounted filesystems (for example, 'zfs set readonly=on pool'); you'll have to unmount and then mount again.

Next step is to fix bugs, clean up the code, and then implement readdir and stat (so that 'ls' works).

Wednesday, September 13, 2006

What's new

So what's new about zfs-fuse?
Unfortunately not much :p

Since my last post I've basically been enjoying what was left of my summer vacation (sorry :p). School started this week, and I've managed to schedule 2 class-free days per week, which is good for this project ;)

In the meantime, I've posted some patches to zfs-code, started working on the FUSE part of the project (finally!), and migrated the code repository from Subversion to Mercurial (you can access it here).

Mercurial will greatly simplify my code management. It's an awesome SCM, and I've always wanted to learn how to use it, so I thought this would be a great time. Mercurial still has a few limitations -- for one, it doesn't handle symlinks, so I had to create a few rules in SCons to create them automatically -- but it's nonetheless a great improvement over Subversion, even for a single developer.

Anyway, next week I'll be having surgery (it's sleep apnea related, nothing serious), and I'll need to be home for about 1 week recovering. I'm seriously hoping to take advantage of that time to finally get zfs-fuse to mount filesystems and do some basic operations, so stay tuned :)

Tuesday, August 22, 2006

ZFS being ported to the FreeBSD kernel

These last few days Pawel Dawidek has been working on porting ZFS to the FreeBSD kernel.

And he's made tremendous progress! He can already mount filesystems, list directories, create files and directories, change permissions... Best of all, he did it in only 10 days!

Wow, now that's impressive.

Sunday, August 20, 2006

zfs-fuse version 0.2.0 released

Hi,

If you were jealous of my previous post, now you can play with ZFS on Linux too ;)

Just head over to the download page and follow the README.

Note that it's still not possible to mount ZFS filesystems, so you won't be able to read or write files; however, you can already manage ZFS pools and filesystems.

As always, you should report any bugs or problems to rcorreia at wizy dot org, or by using the bug database.

Have fun!

Saturday, August 19, 2006

ZFS on Linux status

Hi everyone,

A lot of time has passed since my last post - sorry about that. I simply hadn't made any visible progress. My free time has been scarcer than I expected, and this part of the project needs a bit more of it than I originally thought ;)

Anyway... on to the news.

zfs_ioctl.c and libzpool-kernel are finally compiling, linking and partially working.

It's not possible to mount ZFS filesystems yet, however a few commands already work:

$ uname -a
Linux wizy 2.6.15-26-amd64-generic #1 SMP PREEMPT Thu Aug 3 02:52:35 UTC 2006 x86_64 GNU/Linux

$ ~/zfs/trunk/zfs-fuse/zfs-fuse &

$ ./zpool status
no pools available

$ dd if=/dev/zero of=/tmp/test1 bs=1M count=100
$ dd if=/dev/zero of=/tmp/test2 bs=1M count=100
$ dd if=/dev/zero of=/tmp/test3 bs=1M count=100

$ ./zpool create pool raidz /tmp/test1 /tmp/test2 /tmp/test3
cannot mount '/pool': failed to create mountpoint

$ ./zpool list
NAME    SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
pool    286M    87K   286M     0%  ONLINE  -

$ ./zpool scrub pool

$ ./zpool status
  pool: pool
 state: ONLINE
 scrub: scrub completed with 0 errors on Sat Aug 19 03:45:45 2006
config:

        NAME            STATE     READ WRITE CKSUM
        pool            ONLINE       0     0     0
          raidz1        ONLINE       0     0     0
            /tmp/test1  ONLINE       0     0     0
            /tmp/test2  ONLINE       0     0     0
            /tmp/test3  ONLINE       0     0     0

errors: No known data errors

$ dd if=/dev/urandom of=/tmp/test2 bs=1M count=30

$ ./zpool scrub pool

$ ./zpool status
  pool: pool
 state: DEGRADED
status: One or more devices could not be used because the label is missing or
        invalid.  Sufficient replicas exist for the pool to continue
        functioning in a degraded state.
action: Replace the device using 'zpool replace'.
   see: http://www.sun.com/msg/ZFS-8000-4J
 scrub: scrub completed with 0 errors on Sat Aug 19 03:47:37 2006
config:

        NAME            STATE     READ WRITE CKSUM
        pool            DEGRADED     0     0     0
          raidz1        DEGRADED     0     0     0
            /tmp/test1  ONLINE       0     0     0
            /tmp/test2  UNAVAIL      0     0     0  corrupted data
            /tmp/test3  ONLINE       0     0     0

errors: No known data errors

$ ./zfs list
NAME    USED  AVAIL  REFER  MOUNTPOINT
pool   60.6K   158M  2.00K  /pool

$ ./zfs create pool/test
cannot mount '/pool/test': failed to create mountpoint
filesystem successfully created, but not mounted

$ ./zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
pool       66.6K   158M  2.00K  /pool
pool/test  2.00K   158M  2.00K  /pool/test


There is still a major glitch in the zfs_ioctl <-> zpool/zfs communication, so I haven't uploaded the latest code to the SVN repository just yet, but I definitely expect to fix it tomorrow.

There's also an interesting bit of code that I implemented in order to help me debug zfs-on-fuse (also still not uploaded to SVN) that I'll talk about in my next post ;)

Stay tuned.

Friday, July 21, 2006

Status update

Woohoo, exams are over!! :)

Finally I'm going to have time to work on the project, yay :))

--

Today I got zfs_ioctl.c to compile (it's not linking yet; I still have to get libzpool to compile in the simulated kernel context, which probably means copying most of zfs_context.h into the correct libsolkerncompat headers).
However, even after zfs_ioctl links with libzpool-kernel, I'll still have to write some additional functionality to get the zfs and zpool commands working.

--

In other news, this week I got a free 3-month Safari account, thanks to Google (and O'Reilly), which will be quite useful. It's incredible how these guys keep surprising me :D

After a little browsing of the available books, I've found one which has already proved itself helpful: Solaris Internals - Core Kernel Components. Although it was written back when Solaris 7 was the latest release, the VFS chapter is still mostly accurate. I only wish it were more detailed... :)

So, even with the help of the book (and the OpenSolaris OpenGrok source browser, which I've been using since the beginning -- amazing, I can't live without it anymore), I've had some difficulty understanding some of the Solaris vfs/vnode interfaces, but I think I got it mostly right.

Of course, even if I haven't, I'm sure my kind and dedicated testers will help me find all the bugs, eventually.. ;)

Wednesday, July 12, 2006

FUSE implications on ZFS

Hi,

I know it's been almost 2 weeks since my last post, but I'm still in my university exam season. Anyway, after my last exam next Wednesday (the 19th), I'll be free to work on this project full-time ;)

Today I received a few interesting questions from Jim Thompson that I (and he) think you should know about.

"(snip) ...in reading the ZFS mailing list I've seen a couple of mentions that ZFS turns off the write cache on the disks it manages. There may be other low-level disk control issues in ZFS as well. Is it possible for ZFS to accomplish these low-level operations when running from user code in FUSE?

Secondly, how does FUSE+ZFS ensure that the linux kernel's disk cache doesn't interfere with ZFS's writes to the disk. When ZFS thinks it's written a block to disk, is there any possibility that the block is actually cached inside the linux kernel's list of dirty disk pages?"


Actually, regarding the write cache, ZFS on (Open)Solaris enables it if you give it a whole disk. The real problem with disk write caches is the reordering of writes: ZFS must have a guarantee that all the writes in the current transaction are flushed to the disk platter before it writes the uberblock, in case power fails.

zfs-fuse will accomplish this by calling fsync(2) on file vdevs and ioctl(BLKFLSBUF) on block devices at the appropriate times (which ZFS already takes care of), in order to flush all writes to disk. The (Linux) kernel guarantees that this will happen (on sane disks).
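
As a rough sketch of what that flush could look like (the helper and its arguments are made up for illustration, not the actual zfs-fuse code):

#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/fs.h>   /* BLKFLSBUF */

/* Flush a vdev before the uberblock is written. Illustrative only. */
int vdev_flush(int fd, int is_block_device)
{
        if (is_block_device)
                return ioctl(fd, BLKFLSBUF, 0);  /* flush block device buffers */

        return fsync(fd);                        /* flush a file vdev */
}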

This is the only low-level interaction with the disks that ZFS cares about.

If your hard disk is broken or misbehaving in a way that makes it ignore the command to flush the write cache, you can always disable the write cache with hdparm(8)/sdparm(8)/blktool(8), as you would have to do with any other journaling filesystem.
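
For an IDE disk, that should be something like the following (the device name is just an example):

# hdparm -W0 /dev/hda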

I don't recommend disabling the write cache unless you know your disk misbehaves, because the cache is actually a good thing: it improves performance and makes your disk last longer.

However, there's another thing that worries me a little more, and that I'll have to look into later on.

The issue is with the Linux kernel read cache. I don't know exactly at what level ZFS caches nodes/blocks, so if I'm not careful there could be cache duplication, which would manifest itself as wasted memory.
FUSE has a few mount options that allow one to control the kernel cache behaviour: direct_io, kernel_cache and auto_cache.

Actually, I don't know which will be better: disabling the kernel cache or disabling the ZFS cache (or portions of it).
I'll try to investigate this issue when the time comes :)

Friday, June 30, 2006

Why ZFS is needed even in desktops and laptops

Personally, I can't use my computer unless I know I have reliable software and hardware. I think I've mostly achieved that goal, since my computer pretty much never crashes or loses data (apart from the occasional application bug).

Now, even though I use a reliable journaling filesystem (XFS) on my Linux system, I like to do a filesystem consistency check once in a while (at least once every 3 months), which can only happen on those rare occasions when I need (or want) to reboot. Today was one of those days.

And here are the results: xfs_repair.txt. I ended up with 90 files and empty directories in lost+found. Why did this happen? It could be a hardware problem (the hard disk, the SATA cable, the SATA controller or even the power supply) or a software bug (in the SATA driver, the XFS code or somewhere else in the Linux kernel).

I actually suspect a hardware problem. Back when this machine had a different SATA controller and a different hard disk, it had the very annoying habit of not flushing the disk write cache on reboots. That caused *a lot* of the problems you see in the xfs_repair log above. I even tried MS Windows, which would run chkdsk on every other reboot, so the problem was definitely hardware-related. Even though I never fixed it, fortunately I never lost an important file!

Now, after that hard disk died, I bought a new one and switched to the other SATA controller (my motherboard has 2), just to be on the safe side. But, as you can see above, something is still definitely not working correctly.

This is one of the reasons I need ZFS. I don't want to lose files, or end up with mysteriously corrupted ones. I want to see how often data gets corrupted. I want to see whether corruption happens only after a reboot (which would mean it's a disk write cache flush problem) or while the system is running (I can't fsck XFS filesystems while they're in use). Of course, I want all this in order to diagnose the problem and fix it.

And even if the hardware corrupts data only once in a blue moon, I need my filesystem to properly report a checksum error and retry (or return an error) instead of handing me corrupted data. Basically, I want a reliable computer.

Monday, June 26, 2006

zfs and zpool programs successfully compile and link.

Beware -- long post.

Sorry for the lack of updates recently, I've been kind of busy with other stuff.. ;)

I have good news. As you can see from the title, the zfs and the zpool programs now successfully compile and link :)

There are 2 known functionality losses. First, Linux doesn't seem to have a libdiskmgt equivalent, and porting it seems complicated if not nearly impossible, so there is no "device in use" detection. If anyone knows a good way to solve this, I'm all ears :)

The other loss is "whole disk" support. This must eventually be solved, since it's rather common in Solaris to dedicate a whole disk as a zpool vdev. I believe it can be circumvented for now by creating a partition that spans the whole disk and creating the zpool vdev on the partition device.

The specific problem is that ZFS uses EFI labels in disks. I only found one working EFI library for Linux, but the API is different from the OpenSolaris implementation. Once again, porting the EFI functionality from OpenSolaris proved to be difficult.

--

Unfortunately the zfs and zpool programs don't do anything useful yet.

In the original implementation, they use the ioctl(2) system call through the /dev/zfs device to communicate with the kernel. Since this is a userlevel implementation, there will be no /dev/zfs.

In the zfs-fuse implementation, instead of using ioctl(), we communicate through a UNIX domain (local) socket which is created in /tmp/.zfs-unix/. So in order to make these commands actually do something, there must be a zfs-fuse process that answers the messages sent from zpool and zfs.
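
To give an idea, connecting to that socket from the zpool/zfs side looks roughly like this (the socket filename is made up for illustration; only the /tmp/.zfs-unix/ directory is as described above):

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <sys/un.h>

/* Illustrative sketch, not the actual zfs-fuse code. */
int zfsfuse_connect(void)
{
        struct sockaddr_un addr;
        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        if (fd == -1)
                return -1;

        memset(&addr, 0, sizeof(addr));
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, "/tmp/.zfs-unix/zfsfuse_socket", sizeof(addr.sun_path) - 1);

        if (connect(fd, (struct sockaddr *) &addr, sizeof(addr)) == -1) {
                close(fd);
                return -1;
        }

        return fd;  /* the former ioctl() requests now travel over this fd */
}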

My plan to make zfs-fuse work is to take advantage of as much code from the original implementation as possible. Yes, this also means I want to use the original ZPL code. It seems to be the most reliable way to do it, and perhaps also the easiest one :)

In order to do that, I've created a libsolkerncompat library that will implement/translate the necessary OpenSolaris kernel code to make the ZPL work. This library will also be necessary in order to use the original zfs_ioctl.c implementation, since it uses some kernel VFS (Virtual File System) operations, along with other things.

This will also take some time to get working, since a new zfs_context.h must be created, or (more likely) the current zfs_context.h must be factored out into libsolkerncompat.

So, in a way, I can say I'm now in phase 3.5, since I'm actually working on the zfs-fuse process (which includes the necessary bits to make the ZPL code work), but I'm still not in phase 4, since zpool and zfs won't work until zfs_ioctl.c is ported.

Thursday, June 22, 2006

Massive cleanup, libzfs almost compiling

Today I did a massive cleanup of the source code.

As of now, there's a new library called libsolcompat that implements or translates all necessary Solaris-specific functions into glibc functions (yes, FreeBSD will require some work here).

This massive cleanup also means that all #includes will be exactly the same between the original source files and the Linux port! :)

This was achieved by overriding some glibc system headers with local headers, with the help of a gcc-specific preprocessor directive called #include_next. This directive allows one to include the overridden system header while adding some functionality to it.

You can see a trivial example of this in the libsolcompat string.h, where a Solaris-specific function called strlcpy() was added.
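The pattern looks roughly like this (a simplified sketch, not the exact libsolcompat header):

/* libsolcompat's string.h -- found first in the include path */
#ifndef _SOL_STRING_H
#define _SOL_STRING_H

/* pull in the real glibc <string.h> */
#include_next <string.h>

/* add the Solaris-specific function that glibc lacks */
extern size_t strlcpy(char *dst, const char *src, size_t len);

#endif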

With this new file structure in place, it is now much easier to port new files.

This also means libzfs is almost fully compiling. There are only a few functions left to port, and they don't seem too difficult. However, getting it to work correctly will still require some effort, as there are differences between a real filesystem and a FUSE filesystem (I don't think mount(2) will work, ioctls must be replaced by UNIX sockets, etc.) and, obviously, between Solaris and Linux (some /dev/dsk references are still in place, etc.).

Tuesday, June 20, 2006

Phase 3 has begun

Hi,

Today I finally started working on Phase 3.

I already got libuutil to compile (zpool needs it), but in the process I stumbled upon a very subtle problem. The problem is that, when porting OpenSolaris code to Linux, the -Wno-unknown-pragmas flag is dangerous (I was using it to silence warnings about #pragma idents).

Why is it dangerous, you ask? Because it also silences warnings about #pragma init and #pragma fini. Then why do the OpenSolaris developers use -Wno-unknown-pragmas without problems?
Because when gcc compiles for a Solaris target, it recognizes those pragmas. Not on Linux, though.

Well, what does this mean? It means that, to be on the safe side (I really don't want to track down an obscure bug caused by a #pragma init that I missed), I had to remove that flag from the main SConstruct file, and I had to change 144 source files just to remove #pragma idents... On the bright side, I only have to do this once, since all future code integrations from OpenSolaris are patched automatically.
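
To make the hazard concrete, here's a sketch (the function name is made up):

/* On Solaris, gcc turns this pragma into a call to my_init() at load time.
   On Linux, it's an unknown pragma: with the warning silenced it is
   ignored, and my_init() simply never runs. */
#pragma init(my_init)

static void my_init(void)
{
        /* one-time initialization */
}

/* The GNU equivalent that does work on Linux:
   static void my_init(void) __attribute__((constructor)); */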

There's a related problem I had to deal with earlier: the gcc assembler on Solaris recognizes the slash (/) symbol as starting a comment. Unfortunately, on Linux it doesn't, which meant I had to remove all comments from the ported assembly files.

Anyway, in other news, Ernst Rholicek has contributed a Gentoo ebuild for all the lazy Gentooers out there who want to help test ;)

That's it for now -- as usual, I'll keep you posted on my progress.

Thursday, June 15, 2006

Week report

Hi,

Unfortunately this week I've been very busy with school and some personal stuff, so I've not been able to make much progress.

However, I've received some successful ztest reports on single-CPU and SMP machines (and a few bug reports too ;)
In the meantime, I've set up a public-access Subversion repository and have been releasing minor version updates to fix bugs. Version 0.1.3 is finally expected to work correctly on 32-bit machines (there were a few timer-related integer overflows).

There is now also a known bug in glibc 2.3.2, the version that ships with Debian stable (sarge), related to condition variables, which causes a deadlock. You can read the details here. There is no known workaround (except upgrading glibc manually).
A special thanks goes to Eric Hill for reporting and helping me track down this bug (and 2 other ones).

Since I have an exam tomorrow and still a few other things to do, I'll only be able to start phase 3 around Sunday or so. In the meantime, if you haven't done so already, you can help by testing ;)

Saturday, June 10, 2006

Testing ZFS on FUSE

The result of the phase 2 of this project is now available for download.

If you want to help testing, please download the source code and follow the README and TESTING instructions.

A few notes:
  • If you have access to a real SMP machine (dual-core, dual-proc or even quad-proc ;), I would really appreciate it if you tested this program. There are no known bugs at the moment, but if any problems show up, the most probable cause is the new threading code.
  • You need SCons to compile (just do the usual 'apt-get install scons', 'emerge scons', or 'yum install scons')
  • Currently, it only runs on the x86 and amd64 architectures.
If you have a sparc machine and are willing to help testing, please let me know so that I can port the necessary code.

If you have a machine of another architecture and want to help port to it, you'll have to implement about 2 or 3 assembly functions. If you're interested, let me know so that I can help ;)

FreeBSD is not my primary goal at this moment. After the SoC finishes, I will make sure it works. If you can't wait, I accept patches :P

Friday, June 09, 2006

Successfully compiled zdb and ztest

Hi everyone, I have great news.

Apparently, I managed to compile zdb and ztest ;)

After fixing a few bugs, I also managed to run ztest a few times, for 5 minutes each, with no problems (there are still bugs, however).

Once I manage to fix the remaining bugs (tomorrow or the day after, I hope), I'm going to upload the source code and give instructions for whoever wants to help testing.

Stay tuned :)

Saturday, June 03, 2006

Current issues

These are the libzpool files that are already compiling without warnings: util.c, bplist.c, dmu.c, dmu_objset.c, dmu_traverse.c, spa.c, spa_misc.c, vdev.c, zap.c and zap_micro.c. zdb.c and zdb_il.c are also compiling.
The ZFS developers have done a very good job: all of those files required only a few simple fixes, and I expect a lot of the remaining ones will too (except, of course, kernel.c, and perhaps taskq.c; I haven't had a close look at that one yet).
So far, the hard part has been getting the headers right, especially zfs_context.h.

Most of the code changes so far were specific to gcc (fixing warnings) and POSIX threads (and maybe a few glibc ones, I don't know for sure). I intend to separate Linux-specific code into a linux.c file, or at least mark it with a /* LINUX */ tag, so that porting to FreeBSD will be easier.

Anyway, here's what I still have to do to make zdb work:
  • Port kernel.c (hard)
  • Port the remaining libzpool files (should be easy)
  • Figure out how to implement the atomic_add_64() and atomic_cas_64() functions (see the sketch after this list).
  • Figure out how to make read-write locks work (using POSIX threads). Pthreads already provides read-write locks (when __USE_UNIX98 is defined in the system header features.h), but I think it's impossible to know whether the current thread holds a read (or write) lock, which is necessary to implement the RW_xxx_HELD() macros...
  • Port libumem. Actually, I'm planning on doing a simple stub implementation using malloc() and mutexes, and later integrate the libumem linux port.
  • Port whatever else is needed that I still haven't found out ;)
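
For the atomics, my current thinking is a mutex-protected stub until proper per-architecture implementations exist; something like this sketch (slow but correct, with signatures following the Solaris atomic functions as I understand them):

#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t atomic_lock = PTHREAD_MUTEX_INITIALIZER;

void atomic_add_64(volatile uint64_t *target, int64_t delta)
{
        pthread_mutex_lock(&atomic_lock);
        *target += delta;
        pthread_mutex_unlock(&atomic_lock);
}

uint64_t atomic_cas_64(volatile uint64_t *target, uint64_t cmp, uint64_t newval)
{
        uint64_t old;

        pthread_mutex_lock(&atomic_lock);
        old = *target;
        if (old == cmp)
                *target = newval;
        pthread_mutex_unlock(&atomic_lock);

        return old;
}
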
Comments are welcome :)

Wednesday, May 31, 2006

Project is now in Phase 2 (TM)

Yesterday I decided to proceed to phase 2 of the project (Porting libzpool, ztest and zdb). And I'm happy to announce that libavl and libnvpair are already ported :)

Actually, it wasn't that difficult. I only had to change 7 lines of code for them to compile cleanly with gcc -Wall -Wno-unknown-pragmas (I'm using that flag because all files in the OpenSolaris source seem to have a #pragma ident, which gcc doesn't recognize).

Now the real work begins. For libzpool to work, I'll have to implement a zfs_context.h which works with NPTL, and from the looks of it, it doesn't seem trivial :P

By the way, the ZFS on FUSE website was inaccessible today because my Internet connection failed. Sorry about that.

Sunday, May 28, 2006

Project website created

Hi,

The project website is finally up.

Now I really, really, really have to sleep. Bye! :)

Friday, May 26, 2006

ZFS On-Disk Specification

So, these past couple of days I've been reading the ZFS On-Disk Specification, which is not quite up-to-date, but it serves its purpose.

I find it relatively easy to understand, but I got slightly confused regarding the DSL and especially the Fat Zap structure. But I'm sure it'll start to make more sense when I dive into the code (then again, maybe I'll be more confused ;)

Anyway, tomorrow I'll definitely create the main project site, finish reading the spec and if time permits, I'll start importing the code.

Well, that's it for today :)

Announcing ZFS on FUSE/Linux

Hi everyone,

I'm very pleased to announce that, thanks to Google, Linux will (hopefully) have a working ZFS implementation by August 21st, 2006.

This site is where I'll keep you informed about my progress. A web site will soon be created for general information.

I'd just like to say that I'm very excited about this project. I think ZFS is a great filesystem, and I can't wait to have it working on Linux!

Of course, none of this would be possible without the work of Sun's engineers, who have done (and are still doing) a wonderful job, and of course Google, who is sponsoring my project (and 600+ other ones!).

Yay for Goooooooooooooogle!!!