CRFS source available, now with kernel building, mostly

Two weeks ago we received the approval to release the CRFS source code. It can be found at http://oss.oracle.com/projects/crfs/. Last week I sent a message to linux-fsdevel letting people know CRFS was up. I suppose I should be embarrassed that both Val and Evgeniy mentioned it before I did.

Over the last few days I fixed a few bugs in CRFS that were messing up kernel builds. You can now unpack and build a Linux kernel in a CRFS client mount. There are some lingering bugs, of course, but it’s enough for us to get a rough idea of the performance to expect relative to NFS.

As usual, the comparison doesn’t seem very sporting. A make -j2 in CRFS on this particular client took about 50 seconds; over NFS it took about 3 minutes. The crfsd process on the server was writing to a single spindle. The NFS server, Linux knfsd, was writing to a RAM disk. Even with the advantage of much faster storage, NFS can’t keep up because its caching model requires it to hit the network much more often than CRFS does.

The next major step in CRFS is to get the cache coherency protocol working so that multiple clients can be mounted. I think I’m about a quarter of the way into the first draft of an implementation that only covers file data. It’s looking pretty good so far.

LCA2008 CRFS talk went well

Well, it’s been almost three weeks since I gave a talk on CRFS at LCA 2008 and I’m just now getting around to sharing my thoughts on how it went. We’ll pretend that the delay makes the thoughts that much more.. thoughtful, but clearly that’s already not the case.

I was impressed by the quality of volunteers at LCA. On the morning of my talk they had two people in the room running the AV equipment and making sure that I got time cues. After the talk they had me put a PDF of the slides on to a USB flash drive. Within 24 hours they had both the slides and the video of the talk available for download from the conference’s programme page. People noticed, too. The next morning I awoke to find emails from people half-way around the world who had read the slides and had decent questions to ask about how CRFS works. That’s pretty great.

I am a little worried that these “linux.conf.au” links will break next year when the next incarnation of the conference builds their web site. I guess if I was clever I’d grab a copy now and serve up the talk materials locally.

I will admit to having some trouble deciding just which pieces of CRFS to try and squeeze into a short introductory talk. I tried to stick to the most fundamental basics but I’m not sure I can trust my judgment here. I have a tendency to misjudge the level of pre-existing knowledge in a given audience. I’d love to hear feedback from my colleagues who have different levels of experience with file systems.

I will also happily admit to going a little too far with LCA’s motto of being “fun, informal and seriously technical”. I really hammed it up in a few places. I felt like the audience enjoyed it but the video didn’t pick up the reasonably steady trickle of giggles from the audience so the viewer can be forgiven for thinking that I was just being a crazy person :). I think I’ll take Val’s positive characterization of the talk as “technical stand-up improv comedy” as an indication that I was doing something right.

The frighteningly keen LCA attendee may have noticed that one or two (or three) of us put “bonghits” in our talks. I blame the dangerous intersection of Dave Jones and conference-subsidized bottles of wine.

As for CRFS, it continues on at full speed. There have been signs of life in the process of getting approval to release the source so maybe I’ll have something exciting to report soon. I’ve just doomed the process by typing those words, of course.

Last week I converted crfsd from being a confusing threaded process to a group of processes with explicit boundaries for sharing state. I should have called it the honorary Rusty Russell Hates Threads commit but I chickened out. Commit messages are forever!

At the moment I’m pushing to get the coherency protocol stumbling along such that the initial release can bear more resemblance to what the final CRFS system will look like. With luck I’ll make it in time.

Melbourne bound!

Well, I’m heading down to Melbourne for LCA 2008 in a few hours. I’m not exactly excited by the length of the trip (SKW6084 and UAL839) but I’m definitely looking forward to attending LCA and to seeing Melbourne. It looks like a nice city. It’s a shame that I didn’t arrange to stay longer. Ah, well.

I’ll be giving a talk on CRFS while I’m down there. I did a practice run for a small audience of friendly Linux folks in Portland which was well received so I have high hopes that people at the conference will enjoy it. I know I certainly enjoy talking about this technology, but, well, I guess I would ;).

I thought I’d share a slide from the talk that I find geeky and satisfying:

[Slide image: silly-rename003.png]

The slide demonstrates a particularly weird behaviour of the Linux NFS client, adorably called silly renaming. I like the slide because it uses a relatively small set of system calls to illustrate how differently NFS can behave from “local” file systems. I use it during the talk to illustrate one of my primary motivations for working on CRFS: that we have a network file system that doesn’t penalize its users by requiring that their applications know to work around its behavioural quirks.
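The slide isn’t reproduced here, so what follows is only a minimal sketch of the kind of sequence it refers to, not the slide’s exact calls. It creates a file, unlinks it while a descriptor is still open, lists the directory, and reads back through the descriptor. The helper name unlink_while_open and the file name victim are made up for illustration.

```python
import os
import tempfile


def unlink_while_open(directory):
    """Unlink a file that is still open and report what the directory shows.

    Returns (names left in the directory, bytes read back through the fd).
    The function and file names are hypothetical, chosen for this sketch.
    """
    path = os.path.join(directory, "victim")
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.write(fd, b"still here")
        os.unlink(path)  # the name goes away, the open file does not
        leftover = sorted(os.listdir(directory))
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 64)  # the data is still reachable through the fd
    finally:
        os.close(fd)
    return leftover, data


if __name__ == "__main__":
    # Point this at a directory on an NFS mount to see the difference.
    with tempfile.TemporaryDirectory() as scratch:
        leftover, data = unlink_while_open(scratch)
        print("directory after unlink:", leftover)
        print("read through open fd:", data)
```

On a local file system the directory listing comes back empty. On a Linux NFS client mount the unlinked file is expected to linger under a name beginning with .nfs until the last close, which is the quirk that applications like "rm -r" on a busy tree end up tripping over.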

Anyway, if this stuff interests you I hope you’ll come have fun at the talk with us.

A little more CRFS detail

In my previous post about CRFS metadata performance I said that I didn’t want to go into too much detail until the source is released. I still don’t want to but Evgeniy Polyakov is tempting me! He’s having a good time learning by experimenting with network file systems and posted some theories about CRFS. I’ll respond to his theories with a series of facts about the CRFS protocol and implementation because, well, I love talking about this stuff and rarely get a chance.

The userspace server I’ve implemented (“crfsd”) is btrfs-specific. It works directly with the on-disk structures in a btrfs volume. You don’t specify a file system directory tree to export; you specify a block device which contains a btrfs file system. crfsd has exclusive access to the contents of that block device while it is running.

The CRFS client kernel module (“crfs.ko”) doesn’t require kernel patches. I happen to be tracking mainline but, so far, there has been nothing significant in the implementation that restricts it to modern kernels. The use of ->write_begin() will probably be the first thing that starts to restrict the kernel versions that it will support but that hasn’t happened yet.

CRFS does perform writeback caching of metadata operations. The huge performance benefit this brings justifies the complexity of implementing it, which can’t be overestimated. Designing the protocol and then implementing the kernel client such that we can keep this complexity under control is one of the most important aspects of the CRFS system as a whole.

The CRFS network protocol could be said to batch operations, it’s true, though phrasing it that way gives the wrong impression. It’s not like some kind of explicit compound RPC mechanism. Think of it more like the batching that happens when ext3 reads in a block full of inodes as it goes to read a specific inode that it is interested in. CRFS achieves similar results from a very different organization of metadata. Think of it as reading and writing groups of items from btrfs leaf blocks because that’s exactly what it is. The opportunistic priming of client caches when they perform normal metadata read requests, at insignificant additional cost, is a natural side-effect of the way CRFS represents metadata.
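To make the leaf-block analogy concrete, here is a toy model of that priming effect. It is a sketch under loud assumptions, not the CRFS protocol: ToyServer, ToyClient, the (inode, kind) keys, and the leaf size are all invented for illustration. The only idea it borrows from the text is that a request for one item returns the whole group of neighbouring items, which then sit in the client cache.

```python
from bisect import bisect_right


class ToyServer:
    """Hypothetical server holding sorted items grouped into fixed-size leaves."""

    def __init__(self, items, leaf_size):
        keys = sorted(items)
        self.leaves = [
            {key: items[key] for key in keys[i:i + leaf_size]}
            for i in range(0, len(keys), leaf_size)
        ]
        self.first_keys = [min(leaf) for leaf in self.leaves]
        self.requests = 0  # counts simulated network round trips

    def read_leaf_containing(self, key):
        """Return every item in the leaf whose key range covers `key`."""
        self.requests += 1
        if not self.leaves:
            return {}
        index = max(bisect_right(self.first_keys, key) - 1, 0)
        return self.leaves[index]


class ToyClient:
    """Hypothetical client that caches whole leaves, not single items."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def lookup(self, key):
        if key not in self.cache:
            # One request primes the cache with the neighbours as well.
            self.cache.update(self.server.read_leaf_containing(key))
        return self.cache.get(key)


if __name__ == "__main__":
    items = {(ino, kind): f"{kind}-{ino}"
             for ino in range(1, 9) for kind in ("inode", "name")}
    server = ToyServer(items, leaf_size=4)
    client = ToyClient(server)
    for key in sorted(items):
        client.lookup(key)
    print(len(items), "lookups,", server.requests, "round trips")
```

With sixteen items packed four to a leaf, sixteen lookups cost four simulated round trips. The toy does no negative caching or invalidation, which is exactly the part a real coherency protocol has to solve.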

And with that, I should really return to a nice holiday break.

CRFS performance teaser

Friends and colleagues have been hearing me talk about CRFS for a while. CRFS is an acronym that stands for “coherent remote file system”. It’s a project that I’ve been working on to implement a networked file system that is, well, great. I haven’t been too public about it partially for fear of being accused of peddling vapourware but mostly because we’re still working in Oracle to get approval to release the code.

That said, the implementation is far enough along that I can make some meaningful performance measurements. I thought I’d share one which demonstrates what CRFS can do for metadata performance.

These tests were run between two machines. Each has onboard e1000 chips connected to a cheap consumer-grade netlink gigabit switch. They each have 2 gig of memory and a single dual-core Intel processor of the Penryn generation.

Each test iteration is trivial. We make a new file system on the server, mount it on the client, untar a kernel source tree, purge the client’s data cache, and then read back the file data. Specifically, we run the following commands on the client:

tar -xf /dev/shm/linux-2.6.17.tar
echo 1 > /proc/sys/vm/drop_caches
find linux-2.6.17 -type f | xargs cat > /dev/null

We repeat this series first with the server storing the file system on a single SATA drive and then in RAM (tmpfs) only. The CRFS numbers would be pretty baffling on their own so we also run the test over NFS (v3, TCP). We record measurements just like the time(1) command: real wall clock time, CPU time spent in userspace, CPU time spent in the kernel.
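For anyone who wants to collect the same three numbers from a script, here is a minimal sketch. It is not the harness used for these tests; the function name timed is an assumption, and it relies on os.times() reporting the CPU time of waited-for child processes, which holds on Linux and other Unix systems.

```python
import os
import subprocess
import time


def timed(command):
    """Run a shell command and return (real, user, sys) seconds, like time(1).

    The name `timed` is hypothetical. User and sys are the CPU time charged
    to the command and any children it waited for.
    """
    before = os.times()
    start = time.perf_counter()
    subprocess.run(command, shell=True, check=True)
    real = time.perf_counter() - start
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return real, user, system


if __name__ == "__main__":
    # Run from inside the client mount being measured.
    real, user, system = timed("find . -type f | xargs cat > /dev/null")
    print(f"{real:.2f} {user:.2f} {system:.2f}")
```

Dropping the client’s cache between steps still has to be done separately, as root, with the echo into /proc/sys/vm/drop_caches shown earlier.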

           nfs (seconds)        crfs (seconds)
          real  user   sys      real  user   sys    command

disk:    45.12  0.12 10.22     12.55  0.09  2.69    tar -xf /dev/shm/linux-2.6.17.tar
         19.21  0.05  3.54     11.04  0.05  1.17    find linux-2.6.17 -type f | xargs cat > /dev/null

 ram:    43.83  0.13  9.91      7.90  0.12  2.66    tar -xf /dev/shm/linux-2.6.17.tar
         18.64  0.08  3.61     10.68  0.05  1.00    find linux-2.6.17 -type f | xargs cat > /dev/null

The NFS numbers are roughly the same whether it’s storing on disk or in RAM because we’re using the ‘async’ option. Asking NFS to actually perform each write operation on disk wouldn’t have been sporting at all.

CRFS is limited by the disk speed because its userspace server is waiting for writes to hit disk before sending a response to the client.

CRFS is able to do the same work in less time, even when writes go all the way to disk, because its network protocol goes to great lengths to reduce conversation over the network.
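To put a number on “less time”, the wall clock ratios can be worked out directly from the measurements in the table. The figures below are copied from it; the names RESULTS, real_time_ratio, and speedups are just labels chosen for this sketch.

```python
# (nfs, crfs) measurements copied from the table, each as (real, user, sys) seconds.
RESULTS = {
    ("disk", "untar"): ((45.12, 0.12, 10.22), (12.55, 0.09, 2.69)),
    ("disk", "read"): ((19.21, 0.05, 3.54), (11.04, 0.05, 1.17)),
    ("ram", "untar"): ((43.83, 0.13, 9.91), (7.90, 0.12, 2.66)),
    ("ram", "read"): ((18.64, 0.08, 3.61), (10.68, 0.05, 1.00)),
}


def real_time_ratio(nfs, crfs):
    """How many times longer NFS took than CRFS in wall clock time."""
    return nfs[0] / crfs[0]


def speedups(results):
    """Map each (storage, step) case to its NFS/CRFS wall clock ratio."""
    return {case: real_time_ratio(nfs, crfs)
            for case, (nfs, crfs) in results.items()}


if __name__ == "__main__":
    for (storage, step), ratio in speedups(RESULTS).items():
        print(f"{storage:>4} {step:<5} NFS/CRFS real time: {ratio:.1f}x")
```

The untar step comes out at roughly 3.6 times faster on disk and 5.5 times faster in RAM, while reading the files back is about 1.7 times faster in both cases.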

I won’t waste everyone’s time with details until the code is out there and available for people to play with. My intention is to give people something to look forward to :).

The description of my upcoming CRFS talk at LCA ’08 in Melbourne provides a little more detail. Do come to the talk if you can! It should be fun.