CRFS source available, now with kernel building, mostly

Two weeks ago we received the approval to release the CRFS source code. It can be found at http://oss.oracle.com/projects/crfs/. Last week I sent a message to linux-fsdevel letting people know CRFS was up. I suppose I should be embarrassed that both Val and Evgeniy mentioned it before I did.

Over the last few days I fixed a few bugs in CRFS that were messing up kernel builds. You can now unpack and build a linux kernel in a CRFS client mount. There are some lingering bugs, of course, but it’s enough for us to get a rough idea of the performance to expect relative to NFS.

As usual, the comparison doesn’t seem very sporting. A make -j2 in CRFS on this particular client took about 50 seconds; over NFS it took about 3 minutes. The crfsd process on the server was writing to a single spindle. The NFS server, Linux knfsd, was writing to a ram disk. Even with the advantage of much faster storage, NFS can’t keep up because its caching model requires it to hit the network much more often than CRFS.

The next major step in CRFS is to get the cache coherency protocol working so that multiple clients can be mounted. I think I’m about a quarter of the way into the first draft of an implementation that only covers file data. It’s looking pretty good so far.

LCA2008 CRFS talk went well

Well, it’s been almost three weeks since I gave a talk on CRFS at LCA 2008 and I’m just now getting around to sharing my thoughts on how it went. We’ll pretend that the delay makes the thoughts that much more.. thoughtful, but clearly that’s already not the case.

I was impressed by the quality of volunteers at LCA. On the morning of my talk they had two people in the room running the AV equipment and making sure that I got time cues. After the talk they had me put a PDF of the slides on to a USB flash drive. Within 24 hours they had both the slides and the video of the talk available for download from the conference’s programme page. People noticed, too. The next morning I awoke to find emails from people half-way around the world who had read the slides and had decent questions to ask about how CRFS works. That’s pretty great.

I am a little worried that these “linux.conf.au” links will break next year when the next incarnation of the conference builds their web site. I guess if I was clever I’d grab a copy now and serve up the talk materials locally.

I will admit to having some trouble deciding just which pieces of CRFS to try and squeeze into a short introductory talk. I tried to stick to the most fundamental basics but I’m not sure I can trust my judgment here. I have a tendency to misjudge the level of pre-existing knowledge in a given audience. I’d love to hear feedback from my colleagues who have different levels of experience with file systems.

I will also happily admit to going a little too far with LCA’s motto of being “fun, informal and seriously technical”. I really hammed it up in a few places. I felt like the audience enjoyed it but the video didn’t pick up the reasonably steady trickle of giggles from the audience so the viewer can be forgiven for thinking that I was just being a crazy person :). I think I’ll take Val’s positive characterization of the talk as “technical stand-up improv comedy” as an indication that I was doing something right.

The frighteningly keen LCA attendee may have noticed that one or two (or three) of us put “bonghits” in our talks. I blame the dangerous intersection of Dave Jones and conference subsidized bottles of wine.

As for CRFS, it continues on at full speed. There have been signs of life in the process of getting approval to release the source so maybe I’ll have something exciting to report soon. I’ve just doomed the process by typing those words, of course.

Last week I converted crfsd from being a confusing threaded process to a group of processes with explicit boundaries for sharing state. I should have called it the honorary Rusty Russell Hates Threads commit but I chickened out. Commit messages are forever!

At the moment I’m pushing to get the coherency protocol stumbling along such that the initial release can bear more resemblance to what the final CRFS system will look like. With luck I’ll make it in time.

Melbourne bound!

Well, I’m heading down to Melbourne for LCA 2008 in a few hours. I’m not exactly excited by the length of the trip (SKW6084 and UAL839) but I’m definitely looking forward to attending LCA and to seeing Melbourne. It looks like a nice city. It’s a shame that I didn’t arrange to stay longer. Ah, well.

I’ll be giving a talk on CRFS while I’m down there. I did a practice run for a small audience of friendly Linux folks in Portland which was well received so I have high hopes that people at the conference will enjoy it. I know I certainly enjoy talking about this technology, but, well, I guess I would ;).

I thought I’d share a slide from the talk that I find geeky and satisfying:

silly-rename003.png

The slide is demonstrating a particularly weird behaviour of the Linux NFS client adorably called silly renaming. I like the slide because it’s using a relatively small set of system calls to illustrate how differently NFS can behave from “local” file systems. I use it during the talk to illustrate one of my primary motivations for working on CRFS: that we have a network file system that doesn’t penalize its users by requiring that their applications know to work around its behavioural quirks.
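For readers without the slide in front of them, here is a minimal sketch of the kind of system call sequence involved. It is written in Python rather than copied from the slide, and the file name `victim` and the helper `unlink_while_open` are made up for illustration. On a local file system the unlinked name simply disappears while the open descriptor keeps working. On an NFS client mount one would expect the directory listing to still show a `.nfsXXXX` entry until the descriptor is closed.

```python
import os
import tempfile

def unlink_while_open(directory):
    """Unlink a file that is still open and report what the directory shows.

    Returns (data read through the still-open descriptor,
             names present in the directory while the descriptor is open).
    """
    path = os.path.join(directory, "victim")  # hypothetical file name
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.write(fd, b"still here")
        os.unlink(path)                  # the name goes away...
        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 64)           # ...but the open file is still usable
        names = sorted(os.listdir(directory))
    finally:
        os.close(fd)
    return data, names

if __name__ == "__main__":
    # Point this at a directory on an NFS mount to see the difference.
    with tempfile.TemporaryDirectory() as d:
        print(unlink_while_open(d))
```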

Anyway, if this stuff interests you I hope you’ll come have fun at the talk with us.

A little more CRFS detail

In my previous post about CRFS metadata performance I said that I didn’t want to go into too much detail until the source is released. I still don’t want to but Evgeniy Polyakov is tempting me! He’s having a good time learning by experimenting with network file systems and posted some theories about CRFS. I’ll respond to his theories with a series of facts about the CRFS protocol and implementation because, well, I love talking about this stuff and rarely get a chance.

The userspace server I’ve implemented (“crfsd”) is btrfs specific. It works directly with the on-disk structures in a btrfs volume. You don’t specify a file system directory tree to export, you specify a block device which contains a btrfs file system. crfsd has exclusive access to the contents of that block device while it is running.

The CRFS client kernel module (“crfs.ko”) doesn’t require kernel patches. I happen to be tracking mainline but, so far, there has been nothing significant in the implementation that restricts it to modern kernels. The use of ->write_begin() will probably be the first thing that starts to restrict the kernel versions that it will support but that hasn’t happened yet.

CRFS does perform writeback caching of metadata operations. The huge performance benefit this brings justifies the complexity of implementing it, which can’t be overestimated. Designing the protocol and then implementing the kernel client such that we can keep this complexity under control is one of the most important aspects of the CRFS system as a whole.

The CRFS network protocol could be said to batch operations, it’s true, though phrasing it that way gives the wrong impression. It’s not like some kind of explicit compound RPC mechanism. Think of it more like the batching that happens when ext3 reads in a block full of inodes as it goes to read a specific inode that it is interested in. CRFS achieves similar results from a very different organization of metadata. Think of it as reading and writing groups of items from btrfs leaf blocks because that’s exactly what it is. The opportunistic priming of client caches when they perform normal metadata read requests, at insignificant additional cost, is a natural side-effect of the way CRFS represents metadata.
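A toy model may make the cache priming effect easier to picture. This is not the CRFS protocol or the btrfs leaf format: `ToyServer`, `ToyClient`, the integer keys and the leaf size are all invented for illustration. The only point is that answering a request for one item with the whole leaf that contains it means later lookups of neighbouring items never touch the network.

```python
class ToyServer:
    """Items sorted by key and packed into fixed-size leaves (hypothetical)."""

    def __init__(self, keys, leaf_size):
        keys = sorted(keys)
        self.leaves = [keys[i:i + leaf_size]
                       for i in range(0, len(keys), leaf_size)]
        self.requests = 0

    def read_leaf_containing(self, key):
        self.requests += 1               # stands in for one network round trip
        for leaf in self.leaves:
            if key in leaf:
                return list(leaf)
        raise KeyError(key)


class ToyClient:
    def __init__(self, server):
        self.server = server
        self.cache = set()

    def lookup(self, key):
        if key not in self.cache:
            # One request primes the cache with every item in the leaf.
            self.cache.update(self.server.read_leaf_containing(key))
        return key


if __name__ == "__main__":
    server = ToyServer(range(100), leaf_size=25)
    client = ToyClient(server)
    for key in range(100):
        client.lookup(key)
    print(server.requests)               # 4 round trips for 100 lookups
```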

And with that, I should really return to a nice holiday break.

CRFS performance teaser

Friends and colleagues have been hearing me talk about CRFS for a while. CRFS is an acronym that stands for “coherent remote file system”. It’s a project that I’ve been working on to implement a networked file system that is, well, great. I haven’t been too public about it partially for fear of being accused of peddling vapourware but mostly because we’re still working in Oracle to get approval to release the code.

That said, the implementation is far enough along that I can make some meaningful performance measurements. I thought I’d share one which demonstrates what CRFS can do for metadata performance.

These tests were run between two machines. Each has an onboard e1000 chip connected to a cheap consumer-grade netlink gigabit switch. They each have 2 gig of memory and a single dual-core Intel processor of the Penryn generation.

Each test iteration is trivial. We make a new file system on the server, mount it on the client, untar a kernel source tree, purge the client’s data cache, and then read back the file data. Specifically, we run the following commands on the client:

tar -xf /dev/shm/linux-2.6.17.tar
echo 1 > /proc/sys/vm/drop_caches
find linux-2.6.17 -type f | xargs cat > /dev/null

We repeat this series first with the server storing the file system on a single SATA drive and then in ram (tmpfs) only. The CRFS numbers would be pretty baffling on their own so we also run the test over NFS (v3, TCP). We record measurements just like the time(1) command: real wall clock time, cpu time spent in userspace, cpu time spent in the kernel.
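A rough sketch of how one might collect the same three numbers from a script instead of time(1). The helper name `timed` is made up, and this is not the harness used for the numbers in this post. It relies on `os.times()`, whose children fields are updated once the child has been waited for.

```python
import os
import subprocess
import sys
import time

def timed(argv):
    """Run argv and return (real, user, sys) seconds, in the spirit of time(1)."""
    before = os.times()
    start = time.monotonic()
    subprocess.run(argv, check=True)
    real = time.monotonic() - start
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return real, user, system

if __name__ == "__main__":
    # A harmless stand-in command; for the test above one would pass
    # something like ["tar", "-xf", "/dev/shm/linux-2.6.17.tar"] instead.
    print("%.2f %.2f %.2f" % timed([sys.executable, "-c", "pass"]))
```

The find | xargs step is a shell pipeline, so it would need to be run through a shell (for example `["sh", "-c", "..."]`) to be measured the same way.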

                  seconds (real user sys)
              nfs                  crfs                command

disk:  45.12 0.12 10.22  :  12.55 0.09 2.69  :  tar -xf /dev/shm/linux-2.6.17.tar
       19.21 0.05  3.54  :  11.04 0.05 1.17  :  find linux-2.6.17 -type f | xargs cat > /dev/null

 ram:  43.83 0.13  9.91  :   7.90 0.12 2.66  :  tar -xf /dev/shm/linux-2.6.17.tar
       18.64 0.08  3.61  :  10.68 0.05 1.00  :  find linux-2.6.17 -type f | xargs cat > /dev/null

The NFS numbers are roughly the same whether it’s storing on disk or in ram because we’re using the ‘async’ option. Asking NFS to actually perform each write operation on disk wouldn’t have been sporting at all.

CRFS is limited by the disk speed because its userspace server is waiting for writes to hit disk before sending a response to the client.

CRFS is able to do the same work in less time, even when writes go all the way to disk, because its network protocol goes to great lengths to reduce conversation over the network.
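To put a number on "less time", here is the wall clock ratio for each row of the table above. The figures are copied from the table; the dictionary layout and the helper `real_time_ratio` are just for illustration.

```python
# (real, user, sys) seconds copied from the table above
results = {
    ("disk", "tar"):  {"nfs": (45.12, 0.12, 10.22), "crfs": (12.55, 0.09, 2.69)},
    ("disk", "read"): {"nfs": (19.21, 0.05, 3.54),  "crfs": (11.04, 0.05, 1.17)},
    ("ram", "tar"):   {"nfs": (43.83, 0.13, 9.91),  "crfs": (7.90, 0.12, 2.66)},
    ("ram", "read"):  {"nfs": (18.64, 0.08, 3.61),  "crfs": (10.68, 0.05, 1.00)},
}

def real_time_ratio(entry):
    """How many times longer NFS took than CRFS in wall clock time."""
    return entry["nfs"][0] / entry["crfs"][0]

for (storage, step), entry in results.items():
    print(f"{storage:4} {step:4} nfs/crfs real time: {real_time_ratio(entry):.1f}x")
# disk tar  3.6x, disk read 1.7x, ram tar 5.5x, ram read 1.7x
```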

I won’t waste everyone’s time with details until the code is out there and available for people to play with. My intention is to give people something to look forward to :).

The description of my upcoming CRFS talk at LCA ‘08 in Melbourne provides a little more detail. Do come to the talk if you can! It should be fun.

OLS 2000: we were more fun then, also less old

I was cleaning up some files this morning and ran across a set of photos from OLS 2000. I wasn’t more than five or six photos in before I found myself lost in a fit of giggles. What I came to find more interesting, though, was how many were actually decent photos of my lovely friends!

So I put the whole lot in a set on flickr entitled, wait for it, OLS 2000. I figure seven years is long enough for embarrassment to have ripened into nostalgia.

It’s funny to put funny things on your head:


And these aren’t half bad:



you can has AP9211/9606

I’ve long been a fan of the discontinued APC AP9211 distribution unit. To wit:

  1. It’s an efficient 1U, taking up just enough space for the 8 outlets.
  2. It is gloriously devoid of any moving parts, including noisy fans.
  3. powerman comes with scripts to manage its outlets from the command line over ethernet. Once you get used to the convenience of “powerman -c $host” you never go back.
  4. It’s common enough to easily be found on ebay.

That last point brings us to this post. Mark got a pair for his machines after I showed him the light. He got them from an auction that has 3 days left and, at the time of this writing, 19 units left. Each has a pretty reasonable price for immediate purchase with shipping.

APC MasterSwitch AP9211 w/ AP9606 Control Module (ebay auction 230196539398)

It’s a chance for poor kernel developers out there to stop being frustrated by having to lose time rebooting boxes in person.

inserting files into Thunderbird without reformatting

Let’s start with the observation that the Linux kernel community discourages sending patches as attachments (tpp, section 1a). Sending patches as attachments creates pain for the sender, regardless of what you or I might think of this fact.

Thus, it’s a shame that Thunderbird doesn’t have a trivial mechanism to insert a file into its plain text composing interface without altering the file. There are at least three stumbling blocks: translating tabs into spaces, trimming trailing whitespace, and wrapping lines.
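One way to check whether a mail client has tripped over any of these is to compare the patch as written with the patch as it arrived. The sketch below is a set of rough heuristics, not a real patch validator, and the function name `mangling` is made up for illustration.

```python
def mangling(original, received):
    """Return the set of ways `received` appears to differ from `original`."""
    problems = set()
    if original == received:
        return problems
    orig_lines = original.split("\n")
    recv_lines = received.split("\n")
    # Fewer tab characters than we sent: something expanded them.
    if original.count("\t") > received.count("\t"):
        problems.add("tabs translated")
    # We sent trailing whitespace and none of it survived.
    if (any(line != line.rstrip() for line in orig_lines)
            and all(line == line.rstrip() for line in recv_lines)):
        problems.add("trailing whitespace trimmed")
    # More lines came back than went out: something wrapped them.
    if len(recv_lines) > len(orig_lines):
        problems.add("lines wrapped")
    return problems

if __name__ == "__main__":
    patch = "+\tint x; \n context line\n"
    print(mangling(patch, patch.replace("\t", "        ")))
```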

One suggested hack to work around this is to use an external editor extension which lets you fire up, say, vim to insert the file into the editing buffer. This doesn’t work well in OS X, for much the same reasons that it doesn’t work well for eclipse.

After some trial and error I found Quicktext, an extension for inserting signatures, which can be abused to insert files. After installing the extension, one:

  1. sets the word wrapping preference to 0 before composing the new message
  2. inserts the file as HTML, to preserve trailing whitespace
  3. restores the previous word wrapping preference

This is violently imperfect. Inserting as HTML to preserve whitespace runs the risk of escaping HTML which might be in the file. I do this so infrequently that it’ll do for now. I usually send patches with tools (git-send-email, hg email, sendpatchset).

I was reasonably excited to find that it seems like someone who understands Mozilla extensions could build an extension to insert text without altering the input. nsIPlaintextEditor seems to have knobs to disable translation of whitespace and word wrapping. I might try my hand at this some day but would be even happier if someone beat me to it.

direct write cache invalidation failure illustrated

An associate recently berated me for not posting about work things recently. Fair enough. Here’s the start of an attempt to do more blogging about my daily work.

For me, last week ended with a thread on lkml wherein a poor user reported having actually hit the nasty case where an O_DIRECT write doesn’t invalidate the page cache after a buffered reader races to bring in stale cached data during the write.

In his case he had a writer advancing through a file writing new contents. As it wrote it’d wake a buffered reader who would read up to the point of the new content that the writer had just written.

The problem was that the reader could trigger the kernel to read ahead into the region where the writer was currently writing with O_DIRECT. The kernel was failing to invalidate the existing page cache after the O_DIRECT write completed. The buffered reader would then wake and read stale data which had been brought in by read-ahead from its previous reads. Wackiness ensues!

So, I threw together a test case. I know you, dear readers, have been just dying to see what this terrifying corner case actually looks like. Wonder no more!

[zab@hammer c]$ ./aio-dio-invalidate-check /tmp/something
( lots of time passes )
writing 2 to 3248128
setting write_pos to to 3248128
writing 2 to 3252224
reading from 2850816 to to 3248128 looking for 1
read 3248128 write 3248128
writing 2 to 3256320
writing 2 to 3260416
writing 2 to 3264512
writing 2 to 3268608
writing 2 to 3272704
setting write_pos to to 3272704
writing 2 to 3276800
writing 2 to 3280896
setting write_pos to to 3280896
writing 2 to 3284992
writing 2 to 3289088
writing 2 to 3293184
writing 2 to 3297280
reading from 3248128 to to 3280896 looking for 1
reader found old byte at pos 3252224

[zab@hammer c]$ od -A d -x /tmp/something
0000000 0202 0202 0202 0202 0202 0202 0202 0202
*
3252224 0101 0101 0101 0101 0101 0101 0101 0101
*
3256320 0202 0202 0202 0202 0202 0202 0202 0202
*
3301376 0101 0101 0101 0101 0101 0101 0101 0101
*
8388608

[root@hammer ~]# echo 1 > /proc/sys/vm/drop_caches

[zab@hammer c]$ od -A d -x /tmp/something
0000000 0202 0202 0202 0202 0202 0202 0202 0202
*
3301376 0101 0101 0101 0101 0101 0101 0101 0101
*
8388608

That last bit shows the stale data present in the cache, the cache being purged, and then the stale data vanishing as the file is read back from disk.
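The first od dump can be read as follows: below offset 3301376 everything has been overwritten with 0x02 except one 4096-byte page (3252224 to 3256320) that still holds the old 0x01 pattern. The sketch below reproduces that reading on an in-memory buffer. It is not the aio-dio-invalidate-check program, the helper `stale_runs` is made up, and the 4 KiB page size is an assumption that happens to match the offsets in the output.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages, consistent with the od offsets

def stale_runs(data, old_byte, limit):
    """Return [(start, end)] page-aligned ranges below `limit` that still
    contain `old_byte`."""
    runs = []
    start = None
    for pos in range(0, limit, PAGE_SIZE):
        stale = old_byte in data[pos:pos + PAGE_SIZE]
        if stale and start is None:
            start = pos
        elif not stale and start is not None:
            runs.append((start, pos))
            start = None
    if start is not None:
        runs.append((start, limit))
    return runs

if __name__ == "__main__":
    # Rebuild the contents shown in the first od dump.
    size, written = 8388608, 3301376
    data = bytearray(b"\x02" * written + b"\x01" * (size - written))
    data[3252224:3256320] = b"\x01" * PAGE_SIZE
    print(stale_runs(data, 1, written))  # [(3252224, 3256320)]
```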

Terrifying stuff, I know, but it is almost Halloween.

digg comments on btrfs

A friend pointed out that a reference to btrfs appeared on digg. I wasn’t sure that it merited much attention but a colleague expressed interest in learning more about btrfs.

I should first set the stage by explaining my relation to btrfs. Chris Mason, its primary developer, is my manager at Oracle. He and I started working on btrfs quite a few months ago. I fell back into a more advisory role after I moved on to work on a related project while Chris continued working diligently on the initial btrfs implementation. While I’m not intimately familiar with the code, I’m pretty familiar with the design trade-offs that it currently makes.

I’ll address some of the honest confusion expressed in the comments to that digg post by translating them into questions that one might ask while not suffering from the effects of John Gabriel’s GIF Theory.

btrfs isn’t considered stable and isn’t supported. That scares me. Why is btrfs available before it is feature-complete and stable?

Once a file system is complete and supported it becomes very hard to work in features that weren’t originally available. Adding new features that require changes to the format of persistent data on disk becomes much, much, harder. By making it available at this stage we give people the opportunity to request features that might not have occurred to us. All file systems go through this stage, we’re just exposing it to a wider group of people. One is always welcome to simply ignore btrfs until it’s supported if that’s what one desires.

I live a very busy life and couldn’t be bothered to look at the license that btrfs is released under and instead chose to imply that it wasn’t free and open. Was this not the most clever thing I’ve done recently?

Probably. btrfs is released under the GPLv2, the same license as the Linux kernel.

For whatever reason, I have a negative impression of software that is related to the word Oracle. Should I transfer that negativity to btrfs because it is also associated with the word Oracle?

Probably not. The kernel development team at Oracle that produces btrfs is made up of people who worked on the Linux kernel long before they agreed to come work on the kernel for Oracle. Never fear, we tend to work from home in distant states, countries, and continents — far from the influence of whatever magical anti-awesome sauce it is that you think Oracle puts in its developers’ food.

Oracle also developed OCFS2. Are the two projects related?

Not really, although I worked on OCFS2 for a time. The two file systems solve different problems and their development efforts have different resources at their disposal. OCFS2 is about helping multiple machines work on a shared file system without corrupting each others’ efforts. That’s incredibly difficult. btrfs is about making the best of modern file system features available to the majority of Linux installations for the simple case where there’s only one computer using it. That’s relatively less difficult.

btrfs is a new file system. I also know of another new file system, ZFS. Does btrfs make ZFS unnecessary?

I can think of no way in which a current ZFS user would be satisfied by btrfs, if for no other reason than the simple fact that btrfs is not supported anywhere and ZFS is not seriously available to Linux users. Maybe one could entertain having this conversation once btrfs is supported on Linux and Solaris and ZFS is supported on Linux.

All this talk of ZFS and btrfs reminds me that I once heard that ZFS can be slow, or something. Might that also be said of btrfs?

Yes, in as much as that can be said of each and every file system in existence. File system engineering is, at its core, a game of having to choose amongst conflicting desires. It’s often the case that implementing a feature in a particular way will benefit one usage pattern while harming some other usage pattern. btrfs and ZFS, both incorporating design elements more modern than the Reagan administration, will tend to choose to skew the trade-offs in similar directions, most of the time.

There are already lots (and lots) of file systems available for Linux. What does btrfs do that those file systems don’t?

Sometimes it can be hard for those of us who work on file systems to clearly communicate why it is that we dislike existing designs. It’s complicated stuff. There’s one property of current Linux file systems, though, that seems like it should be universally ill-received.

Almost all Linux file systems provide almost no protection against data corruption. The only protection they offer is to propagate errors from the storage system up to the application. If the storage system doesn’t realize that the data has been corrupted, perhaps because the corruption happened after the drive, these file systems can get very confused: returning bad data to applications, overwriting the wrong data on disk, crashing machines, etc.

Now, storage systems have been surprisingly reliable, it turns out. But Linux thrives on cheap commodity hardware, which is not exactly famous for being rock solid. The persistent march of hardware towards commoditization and cheaper manufacturing does not bode well for the future.

That btrfs takes strong measures to address the risk of corruption is the most exciting run-time feature for me. I want flakey hardware to result in a console message indicating data corruption, not mysterious behaviour or kernel panics that some incredibly expensive human has to diagnose.

I mean, no one would ever consider disabling checksumming in TCP. Why on earth do we allow our file systems to operate without similar protection?
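The idea is simple enough to sketch in a few lines. This is only an illustration of checksumming data on write and verifying it on read. It uses the plain CRC32 from Python's zlib for convenience and says nothing about the btrfs on-disk format; the helper names `store` and `load` are made up.

```python
import zlib

def store(block):
    """Return (block, checksum) as a file system might record them on write."""
    return block, zlib.crc32(block)

def load(block, checksum):
    """Return the block, or complain loudly if it no longer matches."""
    if zlib.crc32(block) != checksum:
        raise IOError("checksum mismatch: data corruption detected")
    return block

if __name__ == "__main__":
    block, checksum = store(b"x" * 4096)
    flaky = bytearray(block)
    flaky[100] ^= 0x01                   # a single bit flips somewhere
    try:
        load(bytes(flaky), checksum)
    except IOError as err:
        print(err)                       # a message, not mysterious behaviour
```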

Booting built kernels with PXE

Linux kernel development can be made a lot nicer by automating some steps in the process. Sadly, there is no shared software package which performs this kind of automation. Most everyone who takes kernel development seriously ends up writing their own tools tailored to their environment. I thought I’d spend a few blog posts sharing the tools I’ve thrown together over the years.

I first tried writing this in a tone which didn’t assume that the reader already knew about kernel development, but it just didn’t work. So, my apologies to those who have no idea what on earth I’m talking about here.

I thought I’d start by describing scripts which help remove the irritating step of installing a new kernel on a test machine’s local drive. Doing so speeds up the compile-boot-test cycle. It goes something like this.

First, I install extra e1000 cards in all the machines so that I can use their PXE boot ROM to boot kernels and initrds over the network. This may not be a great fix for everyone, it just happens to work for me because I was given a box of discarded e1000 cards.

Distros tailor initrds to the hardware of a specific machine (foolishly, I believe, but here we are). To get each machine booting its own initrds I configure dhcpd to point each host at a specific pxe.cfg which in turn references initrds for that host. Each pxe.cfg is built from a per-architecture stub by a script.
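A sketch of what building a per-host pxe.cfg from a stub might look like. This is not the zk-build-pxe script. The function name `render_pxe_cfg` and the stub contents are assumptions, and the hosts/<host>/ layout is inferred from the boot log further down. The `default`, `label`, `kernel` and `append` directives are standard PXELINUX configuration.

```python
def render_pxe_cfg(stub, host, versions, default):
    """Append one boot label per kernel version to a per-architecture stub."""
    if default not in versions:
        raise ValueError("default label %r has no kernel" % default)
    out = [stub.rstrip("\n"), "default %s" % default, ""]
    for version in versions:
        out += [
            "label %s" % version,
            "  kernel hosts/%s/vmlinuz-%s" % (host, version),
            "  append initrd=hosts/%s/initrd-%s" % (host, version),
            "",
        ]
    return "\n".join(out)

if __name__ == "__main__":
    stub = "prompt 1\n"                  # hypothetical stub contents
    print(render_pxe_cfg(stub, "hammer",
                         ["2.6.21.1-syslets"], "2.6.21.1-syslets"))
```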

The initrds are generated from a copy of the initrd which the distro built for a given host. A script simply replaces each kernel module in the distro’s initrd with the kernel module from the newly built kernel. It’s only a rough approximation of correct behaviour but it has worked so far.
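The module replacement step could look roughly like this, assuming the distro's initrd has already been unpacked into a directory. It is not the zk script itself: the function name `replace_modules` and the directory arguments are made up, and returning the modules that the new build lacks is my guess at where warnings like "hammer needs uhci-hcd.ko" would come from.

```python
import os
import shutil

def replace_modules(initrd_root, build_dir):
    """Overwrite each .ko under initrd_root with the same-named module from
    build_dir. Returns the sorted names the build did not provide."""
    built = {}
    for dirpath, _dirs, files in os.walk(build_dir):
        for name in files:
            if name.endswith(".ko"):
                built[name] = os.path.join(dirpath, name)

    missing = []
    for dirpath, _dirs, files in os.walk(initrd_root):
        for name in files:
            if not name.endswith(".ko"):
                continue
            if name in built:
                shutil.copyfile(built[name], os.path.join(dirpath, name))
            else:
                missing.append(name)     # the distro wanted it, we didn't build it
    return sorted(missing)
```

Repacking the initrd afterwards is left out of the sketch.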

The next problem is that the distro assumes that it will find all the modules for this kernel in the root file system. To accomplish this the script includes all the modules in the initrd. It mounts the root file system read-write and uses a statically linked rsync to copy all the modules from the recent build of the kernel into /lib/modules on the host. It then remounts the root fs read-only before continuing on with the boot.

That’s it, really. Part of this is based on the observation that it is, in fact, the 21st century. I’m not booting from 10mbit ether or floppies and don’t care one bit if the initrds are “huge”.

$ ls -hs *2.6.21.1-*
 30M initrd-2.6.21.1-syslets  1.8M vmlinuz-2.6.21.1-syslets

Here it is in action:

[zab@kaori 2.6-syslets]$  zk-install-pxe-initrd
[zk] preparing initrd for hammer
[zk] Warning: hammer needs uhci-hcd.ko
[zk] Warning: hammer needs ehci-hcd.ko
[zk] Warning: hammer needs ohci-hcd.ko
[zk] building initrd for hammer
148402 blocks
[zab@kaori 2.6-syslets]$ zk-build-pxe -r
[zk] hammer:
[zk] 2.6.21.1-syslets
[zk] making '2.6.21.1-syslets' the default pxeboot label
[zab@tetsuo ~]# powerman -1 hammer
Command completed successfully
[zab@tetsuo ~]$ console hammer
[Enter `^Ec?' for help]
PXELINUX 3.10 2005-08-24  Copyright (C) 1994-2005 H. Peter Anvin
boot:
Loading hosts/hammer/vmlinuz-2.6.21.1-syslets................................
Loading hosts/hammer/initrd-2.6.21.1-syslets.......................[many, many, dots]
Ready.
[    0.000000] Linux version 2.6.21.1-syslets [...]
[ ... ]

Fedora Core release 6 (Zod)
Kernel 2.6.21.1-syslets on an x86_64

hammer login:

The zk prefix I chose for these little scripts ostensibly stands for “zabbo kernel”, but it’s really an inside joke that refers to the ZK_ prefix that ZeroKnowledge used when reimplementing the entire world in their software. Starting with, no seriously, ZK_TRUE and ZK_FALSE.

Andrew says wise things at FOSDEM 2007

Andrew Morton’s talk at FOSDEM is available in ogg format. It’s worth a watch if you’re interested in Linux kernel development. He walks through technology that is currently consuming kernel developers. He starts with an interesting explicit link between the life cycles of products which include the kernel and the willingness of companies producing those products to fund development of the mainline kernel. He ends with the refreshing admission that the kernel development community exists to serve users.

I will admit that I was referred to this video because of his description of ext4: “the next horrible version of a horrible file system.” It’s hard for me to disagree, given this modern age of ZFS.

I was playing his video along in the background while working on an O_DIRECT bug when my ears perked up at his mention of AIO. I was pleased to hear him give a very reasonable overview of the awkward fibrils prototype I sent out which led to Ingo’s more refined syslets. I’ll even go so far as to transcribe that part of his presentation:

AIO has been a problem for a long time. Even though we have all the AIO interfaces there, they don’t actually work. They’re only actually asynchronous for direct IO. So if you’re doing a normal buffered read or write to disk via AIO, all the interfaces work, but in fact we do the IO synchronously. There are patches out there to make the buffered IO asynchronous, but I’ve been the obstacle to merging those for the past couple of years.

And amazingly, Zach Brown at Oracle and Ingo Molnar have come up with a completely different way of doing it which has made me look very smart for not merging that code. What they’re proposing is, basically, make any system call asynchronous. So, potentially, the whole range of system calls in the kernel… you could fire them off and return to your application and later on get a notification when the system call completes. That, then, just means that all the AIO support we currently have we’ll no longer need ’cause you can just use normal old read and write and you just say “Yup, I want to do this in the background, please”.

Indeed, particularly that last bit.

spell checking and vim syntax highlighting

Today I sent out some kernel patches to fix some bug. Our esteemed colleague Randy “eagle eyes” Dunlap pointed out that I had some spelling error.

<rdd> zab: darn, missed the window for s/intead/instead/

How embarrassing! That got me wondering why I don’t have my editor politely raising an eyebrow at me when I misspell things. It is the 21st century, and all. So I read up on syntax highlighting and spell checking in vim. Turning it on is easy enough.

:setlocal spell spelllang=en_us

With that, vim barely hides its true intent behind its default color scheme: to burn a hole in the back of your retina.

MY EYES

Let’s choose some colors that won’t send us into epileptic fits.


:highlight clear SpellBad
:highlight SpellBad term=standout ctermfg=1 term=underline cterm=underline
:highlight clear SpellCap
:highlight SpellCap term=underline cterm=underline
:highlight clear SpellRare
:highlight SpellRare term=underline cterm=underline
:highlight clear SpellLocal
:highlight SpellLocal term=underline cterm=underline

Now misspelled words are underlined and red while other words that it thinks are questionable, for seemingly uninteresting reasons, are simply underlined.

Phew.

Now we can go about our business. z= offers alternative spellings for the word under the cursor, zg adds a word to the list of accepted words, etc.

To round it off we add our own acceptable words list.

:set spellfile=~/.vim/spellfile.{encoding}.add

So there we go! This one’s for you, Randy!

AIO through stack scheduling

I don’t think it’s very controversial to say that Linux has lame AIO support.

From the user’s perspective you can only perform a few operations asynchronously, and it’s only in certain conditions that the submitting process won’t block. For example, you can submit an AIO disk write only if it’s O_DIRECT (which not all file descriptors support), and even then it may still block if it has to read meta-data off disk to find out where to perform the write. Want to perform other slow disk operations like open(), rename(), stat(), or unlink() asynchronously? Sorry. Networking? Nope. How about an AIO interface to hardware crypto accelerators? That’s just adorable!

This starts to make sense when you look at the AIO implementation in the kernel. The interface was implemented (I share the blame for this) as a separate subsystem. To bring AIO support to an existing interface (say, sys_write()), one has to duplicate its arguments over in the AIO subsystem and then provide a code path which implements the interface and communicates with the AIO subsystem when an operation blocks. The code path can take on the responsibility of explicitly suspending and resuming the operation as progress is made. This requires changing the code at any blocking point to return an error code (EIOCBQUEUED) and eventually call a completion routine (aio_complete()). The AIO subsystem also has the notion of retrying an operation until it finally succeeds without blocking, though to this day nothing in the mainline kernel uses this code. Imagine how robust this unused (untested) code is.

In my opinion, we’re in the current situation due to the most fundamental problem with this implementation: it asks a maintainer of already complicated code paths to destabilize their code to bring AIO support. Not only are there initial development costs, in addition one ends up with different code paths that have to be tested. All this for a small portion of the Linux installed base.

Wasn’t it sneaky how I snuck that last sentence in there? That brings us to the sad fact of Linux AIO work: it suffers from a chicken-n-egg problem. There aren’t a lot of users of AIO out there because the kernel support is poor. With few users it’s hard to fund the development and justify the risk of AIO. There are very few users who are willing to spend significant sums on initial AIO development. Once they’ve gotten their pet interface working in AIO, they move on to their next priority. This is pretty self-evident from the current code. The most actively debugged AIO interface, O_DIRECT block IO, is the one that IBM and Oracle care about for their databases. (Red Hat helps, hi Jeff!)

This all came to a head in an email thread which started with Linus suggesting that we just tear out the current AIO support. This initial extremism eventually turned into an interesting suggestion for an alternative. The scheduler knows when a code path blocks. What if we implement AIO by implementing a micro-scheduler inside a task which allows multiple stacks to execute in the kernel on behalf of a process? If a disk operation blocks, its stack is swapped out and another is allowed to run. Eventually, when that disk operation can make progress again, its stack is swapped back in.

From the user’s perspective, this would be fantastic. Any system call could be used asynchronously without altering its behaviour. An async getpid(), for example, would return the caller’s PID, not some strange PID of a thread that was secretly performing the operation in the background. The submitting call itself would never block. If one of its submitted operations went to block for any reason (semaphores, file system IO, even memory allocation), the scheduler would swap it out and return to the submitting stack.

From the Linux kernel developer’s perspective, this proposal sends shivers down the spine. There is a vast ocean of implementation detail that makes this difficult. Initially Ben and I latched on to a few of these hurdles and dismissed the idea as unworkable.

A year has passed since that email thread. I spent a good portion of that year debugging the interaction between O_DIRECT and AIO in fs/direct-io.c. What little remaining respect I had for the current implementation was chipped away. I’ve come to the conclusion that the potential benefits of scheduling stacks are significant enough to justify a proof-of-concept that would give us a concrete example of what it is that we’re talking about. My shiny new manager (the one with the beard) agreed.

For the past few weeks I’ve been spending idle cycles on that proof-of-concept. Today a very exciting thing happened: an O_DIRECT read from an IDE block device completed. A tiny example application produces the following output:

# ./dio
submit returned 2 at 1165617048.836733
completion returned 1 at 1165617048.837626, return code 3741 cookie 1234
completion returned 1 at 1165617048.844264, return code 512 cookie 5678

What’s happening here is roughly as follows:

  1. asys_submit() translates the specified system calls and arguments into stacks which, when executed, call into each system call handler with the given arguments. It marks these stacks as runnable and calls into the scheduler.
  2. The scheduler swaps out the current stack, which is executing the submission syscall, and swaps in the stack that calls getpid(). getpid() doesn’t block so the handler runs to completion. The syscall handler returns to a function that takes the return code from the syscall and puts it in a completion event which is queued. It then calls into the scheduler.
  3. The scheduler finds our stack which executes the O_DIRECT read, swaps it in, and executes it. The read prepares and issues the IO and eventually comes to wait for it by entering the scheduler.
  4. The scheduler finds that the submitting stack is still runnable and swaps it in. It returns to userspace and userspace calls back into asys_await_completion(). It first returns the waiting completion from getpid(). The completion call then waits for more completion events to arrive by calling into the scheduler. Our IO is still in flight so we don’t have any more runnable stacks in the task. The task is put to sleep.
  5. Eventually our IO completes. It marks the stack executing the read as runnable and wakes the task. The task notices the runnable read stack and swaps it in. The read system call handler now returns to that helper which takes the return code from the read and queues it in a completion event. In so doing it marks the stack in the completion call as runnable. It calls the scheduler.
  6. The scheduler swaps out the read stack and swaps in the completion gathering stack. It finds the waiting event and returns it to userspace.
  7. Fin.

This is accomplished with a set of patches whose diffstat looks like this:


$ diffstat -p1 patches/*.patch
arch/i386/kernel/asm-offsets.c | 4
arch/i386/kernel/syscall_table.S | 2
fs/direct-io.c | 18 +-
include/asm-i386/system.h | 36 +++++
include/asm-i386/unistd.h | 2
include/linux/asys.h | 2
include/linux/hrtimer.h | 2
include/linux/init_task.h | 4
include/linux/sched.h | 29 ++++
include/linux/wait.h | 15 ++
kernel/Makefile | 2
kernel/asys.c | 254 +++++++++++++++++++++++++++++++++++++++
kernel/exit.c | 7 +
kernel/fork.c | 7 +
kernel/hrtimer.c | 56 +++++---
kernel/rtmutex.c | 2
kernel/sched.c | 203 +++++++++++++++++++++++++++++++
kernel/wait.c | 41 ++++++
lib/rwsem-spinlock.c | 2
lib/rwsem.c | 50 ++++---
20 files changed, 681 insertions(+), 57 deletions(-)

I have to say, I’m finding this stuff pretty exciting!

The next step is to polish these very rough patches into something presentable. Then they’ll go off to linux-kernel to kick-start debate. There are a huge number of details and trade-offs to discuss before this idea can be seriously implemented.
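
For readers who find the numbered walkthrough above easier to follow in code, here is a toy user-space model of the same flow. It is only a sketch: Python generators stand in for kernel stacks, and every name in it (Task, fake_getpid, fake_read, the "io-1" wake key) is made up for illustration and has nothing to do with the actual patches. The return values and cookies simply echo the sample output shown earlier.

```python
from collections import deque


class Task:
    """Toy model of one task that owns several 'stacks' (generators)."""

    def __init__(self):
        self.runnable = deque()     # stacks that are ready to run
        self.blocked = {}           # wake key -> stack waiting on that key
        self.completions = deque()  # queued (return code, cookie) events

    def submit(self, calls):
        """Turn each (handler, args, cookie) into a runnable stack."""
        for handler, args, cookie in calls:
            self.runnable.append(self._stack(handler, args, cookie))
        return len(calls)

    def _stack(self, handler, args, cookie):
        # Run the handler; any key it yields means "I would block here".
        ret = yield from handler(*args)
        # The handler returned, so queue its completion event.
        self.completions.append((ret, cookie))

    def run(self):
        """Run stacks until each one has either finished or blocked."""
        while self.runnable:
            stack = self.runnable.popleft()
            try:
                key = next(stack)          # runs until the stack blocks
                self.blocked[key] = stack  # park it until wake(key)
            except StopIteration:
                pass                       # ran to completion

    def wake(self, key):
        """The event a stack was waiting for happened; make it runnable."""
        self.runnable.append(self.blocked.pop(key))


def fake_getpid():
    """Hypothetical handler that never blocks."""
    return 3741
    yield  # unreachable; only here to make this function a generator


def fake_read(nbytes):
    """Hypothetical handler that blocks once on pretend disk IO."""
    yield "io-1"   # wait for the IO to complete
    return nbytes


task = Task()
submitted = task.submit([(fake_getpid, (), 1234), (fake_read, (512,), 5678)])
task.run()                        # getpid finishes, the read blocks
first = list(task.completions)
task.completions.clear()
task.wake("io-1")                 # the pretend IO completes
task.run()                        # the read stack resumes and finishes
second = list(task.completions)
print(submitted, first, second)
```

The point of the model is that neither handler knows it is being run asynchronously: blocking is just a place where the scheduler gets to pick another stack.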

RDS socket API

Part of my job at Oracle has involved working on a project called RDS. Over the past few months I’ve found myself failing to explain it clearly to friends who have asked what exactly this is. For their edification, if nothing else, I thought I’d take a few minutes to describe the project in more detail.

We can set the stage by laying out the basic properties of a certain kind of Oracle deployment that one often finds at customer sites. Imagine a few thousand processes on a handful of nodes. Each process is doing work and sending messages to many other processes. The one-to-many relationship starts to explain why this messaging is currently implemented with UDP with acknowledgement and retransmission handled in the processes. Using TCP, for example, could mean holding a TCP connection open between each pair of communicating processes. The overhead of doing this adds up surprisingly quickly.

The attentive will quickly spot a problem with implementing reliability in the processes. If these processes are performing work that blocks waiting for IO, which they are, the acks that they send could be delayed. This could cause a sending process to spuriously retransmit a message that was in fact received but not acknowledged promptly. “Mmmm hmmm”, I might say to such an attentive person. This problem is seen under heavy load.
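
A tiny model makes the failure mode concrete. This is a sketch under simple assumptions of my own choosing: the message arrives instantly, the receiver's ack is held up for some number of milliseconds, and the sender resends on a fixed timer. The function name and the numbers are hypothetical and do not describe how the real processes are tuned.

```python
def spurious_retransmits(ack_delay_ms, retransmit_timeout_ms):
    """Count needless resends of a message that was already received.

    The sender resends every `retransmit_timeout_ms` until the ack,
    which the busy receiver only sends after `ack_delay_ms`, arrives.
    """
    resends = 0
    timer = retransmit_timeout_ms
    while timer < ack_delay_ms:
        resends += 1
        timer += retransmit_timeout_ms
    return resends


# A receiver that acks promptly causes no extra traffic.
print(spurious_retransmits(ack_delay_ms=5, retransmit_timeout_ms=20))
# A receiver stuck in IO for a quarter of a second triggers many resends.
print(spurious_retransmits(ack_delay_ms=250, retransmit_timeout_ms=20))
```

Every one of those resends carries a message the receiver already has, and it shows up exactly when the system is busiest.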

At some point Infiniband became an attractive potential solution to this problem. One of the things it can do is push reliability constructs into hardware so that the processes need not burden themselves with the task of promptly sending acks. uDAPL was attempted but didn’t work out. I get to avoid having to tell that story because I don’t know it — it was before my time. SDP is a socket API built on top of Infiniband which would take care of reliability, but it has per-process-pair overhead problems like TCP.

This is when RDS started to take shape. It was designed as a socket API which would let processes send messages from one socket to many recipient processes. Reliability is ideally provided by hardware and the cost of doing so should not increase significantly with the number of processes involved in communication. A prototype was written for the 2.4 kernel which implemented RDS on top of Infiniband.

This is when the Oracle messaging people started talking to me about getting involved. They were looking for an implementation for 2.6 that could also support RDS on top of commodity ethernet. As initially described it sounded like they wanted some ethernet level protocol. This explains my earlier blog post about RDS/eth. We hadn’t quite gotten to understanding each other at that point. It eventually became clear that they wanted a 2.6 implementation that supported RDS on top of various transports: “reliable connection queue pairs” for Infiniband and TCP for commodity ethernet.

That, in the end, is what has been built and is now available in a subversion repository found off of http://oss.oracle.com/projects/rds/. It’s a kernel socket API which maintains connections between nodes and multiplexes messages between processes down those per-node connections. There are lots of interesting (and occasionally surprising) details, but perhaps those are better saved for another post. At least now I hope folks will have a better understanding of what it is I mean when I talk about “that freaking RDS thing.”
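
To put rough numbers on why per-node connections matter, here is a bit of arithmetic. The figures are assumptions picked to match "a few thousand processes on a handful of nodes" (4 nodes with 500 processes each), the worst case where every process talks to every other is assumed, and one connection per pair of nodes is a simplification rather than a statement about the real implementation.

```python
from math import comb


def per_process_pair_connections(processes):
    """Connections needed if every pair of processes holds its own."""
    return comb(processes, 2)


def per_node_pair_connections(nodes):
    """Connections needed if each pair of nodes shares a single one."""
    return comb(nodes, 2)


nodes = 4                 # hypothetical cluster size
processes_per_node = 500  # hypothetical process count per node
processes = nodes * processes_per_node

print(per_process_pair_connections(processes))  # one per process pair
print(per_node_pair_connections(nodes))         # one per node pair
```

The first number grows with the square of the process count, while the second only depends on how many nodes there are.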

a really hoopy frood

Today I learned that Martin (mkp) has a blog. In it he mentions having accepted the offer to join our group at Oracle which means I can now talk about it without worrying about impropriety, or whatever is the moral equivalent in the free-wheeling t-shirt-and-beer Linux industry.

I’m excited. Martin and I go way back. I have some hope that we’ll get to work on fun projects together once I get the current RDS project out of the way. More about that soon.

RDS/eth round trip latency

I’ve been working on an interesting project at work for a while now. It’s an implementation of a socket API that provides reliable message delivery over ethernet. I’m being brief here because we’re still going through the process of getting approval to share the code. The intent, of course, is to see it properly maintained in the kernel.

I’m excited because I’ve been testing it on some machines at work and have some encouraging results. The test is a little application that measures the time it takes to get a response to a message that is sent to a machine that immediately just sends it right back. The machines are dual opterons with e1000 gigabit cards. The following table reports the fastest round trip time seen over a period of a few seconds.

(I apologize in advance for the awful style of this table, maybe I’ll try my hand at some CSS)

message size   round trip (µsecs)
(bytes)        TCP    UDP    RDS/eth
4              86     63     67
8              88     63     68
16             92     63     70
32             98     66     73
64             110    74     79
128            135    88     92
256            185    115    121
512            285    173    180
1024           485    287    295

This is exciting for at least two reasons.

First, the latency is lower than TCP. This shows that users who want reliable messages but are sensitive to latency might well be interested in this. Some significant pieces of Oracle are certainly interested in low latency reliable messages, hence my involvement.

Secondly, it’s awfully close to the latency of UDP. There’s still enough room for improvement in the RDS/eth sending path to gain back that difference, I think. We could well get reliable messages with better latencies than UDP, which would make me smile.
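
For anyone who wants to try a similar measurement, the round trip test described above can be sketched roughly as follows. This is not the test application behind the table: it uses plain UDP over the loopback interface from Python, so its numbers say nothing about RDS/eth or about real network hardware. The function names are made up for this sketch.

```python
import socket
import threading
import time


def echo_back(sock, count):
    """Receive `count` datagrams and send each straight back to its sender."""
    for _ in range(count):
        data, sender = sock.recvfrom(65536)
        sock.sendto(data, sender)


def fastest_round_trip(size, rounds=200):
    """Return the fastest round trip, in seconds, for `size`-byte messages."""
    server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        server.bind(("127.0.0.1", 0))   # let the OS pick a free port
        server.settimeout(5.0)          # never hang forever on a lost packet
        client.settimeout(5.0)
        target = server.getsockname()
        echo = threading.Thread(target=echo_back, args=(server, rounds),
                                daemon=True)
        echo.start()

        payload = b"x" * size
        best = float("inf")
        for _ in range(rounds):
            start = time.perf_counter()
            client.sendto(payload, target)
            reply, _sender = client.recvfrom(65536)
            elapsed = time.perf_counter() - start
            if reply != payload:
                raise RuntimeError("echoed message does not match")
            best = min(best, elapsed)   # keep the fastest time seen
        echo.join(timeout=5.0)
        return best
    finally:
        client.close()
        server.close()


for size in (4, 64, 1024):
    print(f"{size:5d} bytes: {fastest_round_trip(size) * 1e6:.0f} usecs")
```

Reporting the fastest time rather than the average filters out scheduling noise, which is the same choice the table makes.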

OCFS2

Today Linus announced 2.6.16-rc1 which is the first -rc to hit after OCFS2 was merged into the mainline kernel. OCFS2 is one of the first things I worked on after joining Oracle and I’m pretty happy that it’s come this far.

I guess I should step back and explain it a little. OCFS2 is a clustered file system. It lets a collection of nodes treat shared storage as a single file system. Nodes are equal in their use of the file system which means that a node can fail, say by losing power, and the rest of the nodes can carry on using the file system after they briefly clean up what the failed node was doing.

We tried hard to make it easy to use. The most basic setup will not come as a shock to Linux administrators. You:

  1. install an ocfs2-util rpm on all nodes
  2. make sure a config file represents nodes in the cluster and is identical on all nodes
  3. start cluster services by running an init script
  4. run mkfs.ocfs2 from a node to format the file system
  5. mount the file system from all nodes

It still has some warts, of course, but all in all I think it’s a good piece of engineering. I’d be interested in hearing if any of my friends end up playing with it. I know that Ubuntu and SLES have been including OCFS2 for a while and there is a good chance that we’ll get the tools into Fedora Core extras. If nothing else, the tools can be found in the OCFS2 Tools project on oss.oracle.com.

Xen made my laptop cry

I finally sat down and took a stab at getting Xen going on my laptop. As this entry’s title may have led you to believe, it didn’t go particularly well.

I was pretty excited to have an army of virtual machines that were suited to kernel development. Especially after foolishly wasting time working with some machines at work that I knew weren’t up to the task.

The User’s guide got me in the mood. The Fedora wiki also has a helpful Quick start page.

I installed the RPMs, rebooted, and ran immediately into an existing bug. I tried Rik’s newer Xen RPMs for FC4 and they did much better. X started up and things were going pretty well. Then it tried to talk to the network and the screen went blank and the machine was dead. Dang.

I guess it’s not quite as polished as I was hoping. I imagine it might be fun, of a sort, to hook up a serial console and debug it with some Xen guys. I’ll get right on that just after my employer stops demanding my services in exchange for letting me funnel their money to the bank that holds our mortgage.

OLS 2005

I spent last week in Ottawa for this year’s Ottawa Linux Symposium along with my fantastic Belgian boss and a flotilla of Linux aficionados. I’m told the attendance this year came in at a good eight hundred which seems like quite a bit to those of us who stumbled through the first. I think they had it right back then: minimize the casualties incurred during nonsensical talks.

This year, my first visit after a long hiatus, was also my first time attending the Kernel Summit which has preceded the main conference for the last few years. I’m still not entirely sure what to make of it. I was very encouraged by some clear-thinking shining stars and somewhat discouraged by some poo-slinging children. On balance I think I’d have to say that the people worth listening to outnumbered the destructive vocal minority. Just like any non-trivial group of people, I suppose. I’m told that this year’s was not as exciting and contentious as previous years and I could certainly buy that. I was particularly glad to see Linus sitting down to a Mercurial demo with Matt. Here’s to hoping.

As for the main conference, I really enjoyed Ian Pratt’s talk on Xen, and not just because Robert now works for Xensource. I have only been admiring Xen from the periphery so it was quite interesting to see how it really gets things done, particularly the page sharing business. Its domain migration business also looks pretty exciting. There was a great moment during his talk when he showed a graph of some web benchmark that ran while the domain with the web server was transparently being migrated between machines. Migration’s problem can be basically narrowed down to trying to keep pages in sync between the old and new host while the task is actively dirtying new pages. It employs this clever algorithm where it allows itself to consume, say, 10% of the system migrating pages, hopefully at a higher rate than the task can dirty them. Eventually there will be so few dirty pages left that the task can be stopped and finally migrated to the new host with a small amount of down time. The graph of the throughput of the web benchmark during the migration showed exactly this. There was a stable degradation of the throughput, down 10% for a few seconds, and then a painless few milliseconds of idle while the task hopped over to the new host. I only remember this, and apologize for wandering through it with you, because the audience broke into spontaneous applause at the slide. When’s the last time you heard of that happening? Not sarcastic “haha, the IA64 doesn’t have ISA” applause, but real honest-to-god “holy crap, it Just Works” applause.
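
The convergence argument is easy to see in a small simulation. This is my own sketch of the idea as I understood it from the talk, not Xen's code: the rates, the threshold and the function name are all invented, and it assumes the copy rate and the dirtying rate stay constant.

```python
def precopy_rounds(total_pages, copy_rate, dirty_rate, stop_threshold,
                   max_rounds=30):
    """Simulate copying dirty pages while the task keeps dirtying more.

    Rates are in pages per second. Returns the number of dirty pages at
    the start of each round; the last entry is what is left to copy
    while the task is stopped.
    """
    remaining = total_pages
    history = [remaining]
    while remaining > stop_threshold and len(history) <= max_rounds:
        # Copying `remaining` pages takes remaining / copy_rate seconds,
        # during which the task dirties dirty_rate pages per second.
        remaining = min(total_pages, remaining * dirty_rate // copy_rate)
        history.append(remaining)
    return history


# Hypothetical numbers: copying runs ten times faster than dirtying.
history = precopy_rounds(total_pages=100_000, copy_rate=10_000,
                         dirty_rate=1_000, stop_threshold=50)
downtime_ms = history[-1] * 1000 / 10_000   # final copy at the same rate
print(history, downtime_ms)
```

As long as pages are copied faster than they are dirtied, each round is shorter than the last and the final stop-and-copy is tiny. If the task dirties pages as fast as they can be copied, the loop never converges, which is why the sketch caps the number of rounds.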

I was more actively involved in an AIO BOF later on in the conference that seemed to go pretty well. We spent a fair amount of time worrying about how to really demonstrate the benefits of buffered filesystem AIO interfaces. Samba’s use will help, but that might not be enough. There was also a pretty interesting revival of the O_STREAMING discussion which seems to come up every few years. This time there was an interesting twist, mostly driven by sct, involving directories and alleviating dcache pressure. I might see if I can find some time to whip together an implementation; we’ll see.

Primarily, though, it was just good fun to reconnect with friends that I hadn’t seen in ages. Jes and I got to talk quite a bit about home ownership and msw made sure to point out how old and ridiculous my various techy toys are. Good times! Being in Ottawa also provided a great excuse to hang out with Deb and Rob. I got to spend some time in their lovely apartment in the Glebe hanging out with their kitties, eating fun food, and playing way too much ogame. I wish all my friends lived in one place. And that it was a good place. And that I lived there, too.

To top it off, United bumped me up into first on the long leg back from O’Hare as part of some grand seat shuffling arrangement to get a family seated together. To all the unruly families who usually make traveling on summer weekends as enjoyable as eating one’s own face, I take it all back. Cheers!