propose a talk for LPC, come enjoy Portland

This weekend marks the deadline for submitting speaker proposals for the Linux Plumbers Conf. I figure that CRFS falls under the category of future Linux storage so I submitted a proposal to talk about it.

If you have something that you’d like to discuss with your peers, and which falls under their broad list of categories, you should send in a submission.

If it’s accepted then you’ll get to come enjoy Portland with us! I have a secret plan that will make this sound like a lot more fun than it seems like it should at first glance. I want to try and provide some kind of list of the more interesting places to dine in Portland. I don’t know about you guys, but I’m pretty tired of the beer-and-mediocre-pizza meal that our community so often gravitates to.

Doesn’t that sound great? I sure hope so. Here, as a pair of teasers, are two places for after dinner treats: Cacao and Teardrop.

CRFS source available, now with kernel building, mostly

Two weeks ago we received the approval to release the CRFS source code. It can be found at http://oss.oracle.com/projects/crfs/. Last week I sent a message to linux-fsdevel letting people know CRFS was up. I suppose I should be embarrassed that both Val and Evgeniy mentioned it before I did.

Over the last few days I fixed a few bugs in CRFS that were messing up kernel builds. You can now unpack and build a linux kernel in a CRFS client mount. There are some lingering bugs, of course, but it’s enough for us to get a rough idea of the performance to expect relative to NFS.

As usual, the comparison doesn’t seem very sporting. A make -j2 in CRFS on this particular client took about 50 seconds; NFS took about 3 minutes. The crfsd process on the server was writing to a single spindle. The nfs server, linux knfsd, was writing to a ram disk. Even with the advantage of much faster storage, NFS can’t keep up because its caching model requires it to hit the network much more often than CRFS does.

The next major step in CRFS is to get the cache coherency protocol working so that multiple clients can be mounted. I think I’m about a quarter of the way into the first draft of an implementation that only covers file data. It’s looking pretty good so far.

LCA2008 CRFS talk went well

Well, it’s been almost three weeks since I gave a talk on CRFS at LCA 2008 and I’m just now getting around to sharing my thoughts on how it went. We’ll pretend that the delay makes the thoughts that much more.. thoughtful, but clearly that’s already not the case.

I was impressed by the quality of volunteers at LCA. On the morning of my talk they had two people in the room running the AV equipment and making sure that I got time cues. After the talk they had me put a PDF of the slides on to a USB flash drive. Within 24 hours they had both the slides and the video of the talk available for download from the conference’s programme page. People noticed, too. The next morning I awoke to find emails from people half-way around the world who had read the slides and had decent questions to ask about how CRFS works. That’s pretty great.

I am a little worried that these “linux.conf.au” links will break next year when the next incarnation of the conference builds their web site. I guess if I was clever I’d grab a copy now and serve up the talk materials locally.

I will admit to having some trouble deciding just which pieces of CRFS to try and squeeze into a short introductory talk. I tried to stick to the most fundamental basics but I’m not sure I can trust my judgment here. I have a tendency to misjudge the level of pre-existing knowledge in a given audience. I’d love to hear feedback from my colleagues who have different levels of experience with file systems.

I will also happily admit to going a little too far with LCA’s motto of being “fun, informal and seriously technical”. I really hammed it up in a few places. I felt like the audience enjoyed it but the video didn’t pick up the reasonably steady trickle of giggles from the audience so the viewer can be forgiven for thinking that I was just being a crazy person :). I think I’ll take Val’s positive characterization of the talk as “technical stand-up improv comedy” as an indication that I was doing something right.

The frighteningly keen LCA attendee may have noticed that one or two (or three) of us put “bonghits” in our talks. I blame the dangerous intersection of Dave Jones and conference subsidized bottles of wine.

As for CRFS, it continues on at full speed. There have been signs of life in the process of getting approval to release the source so maybe I’ll have something exciting to report soon. I’ve just doomed the process by typing those words, of course.

Last week I converted crfsd from being a confusing threaded process to a group of processes with explicit boundaries for sharing state. I should have called it the honorary Rusty Russell Hates Threads commit but I chickened out. Commit messages are forever!

At the moment I’m pushing to get the coherency protocol stumbling along such that the initial release can bear more resemblance to what the final CRFS system will look like. With luck I’ll make it in time.

greasemonkey (and firebug) made lca2008 happy

Today I noticed that the video of the lightning talks from LCA2008 is available. It has probably been available for a while but I only just noticed :).

I was going to have you download the video that includes all the talks and skip to a particular talk that Paul Fenwick gave on greasemonkey which also mentions firebug. But a bit of searching led me to Paul’s blog post which mentions the talk and which, in keeping with his apparent passion to make a web that doesn’t suck, includes an embedded youtube movie of his talk alone. Nicely done, sir!

The audio recording does a fair job of communicating how much the audience loved the talk. I’m not sure if the audience loved the tools or hated myspace, or what, but either way I had a great time being in the audience. I wore a pretty goofy grin for most of the talk because I was happy for my friends (and loved one!) at moco.

I meant to point the talk out to them but was distracted because the video didn’t appear soon after the talk. With luck some of them won’t have seen it yet and will find some joy in hearing a few hundred people cheering at pieces of the software they work so hard on.

Well played, Murphy

Alright, put yourself in the mindset of a server in my basement. You’re kind of sad that the guy who maintains you is a few thousand miles away. Then his lovely wife has the nerve to go to California. It’s pretty lonely down there. What do you do?

Yes, that’s right, you have a few fans fail. Then heat gathers in the top of your ancient PC case. Which causes the power supply, cleverly designed to sit in the top of the case where heat gathers, to fail. The faulty power supply pulls power from two drives in a four drive array which flips the array into degraded mode wherein it can only return errors. Which hangs the machine as ext3 gets IO errors in the journal. You’ll show them!

I’m quite lucky to live a few blocks from one of the most capable sysadmins that I had the pleasure of starting my career with. I gave him a call, we shared some Simpsons quotes (mostly Professor Frink), and managed to get things up and running again. He was able to transplant a power supply from a neighbouring test box. Thankfully the power drop didn’t damage the drives. Phew.

This was made that much funner by the fact that I hadn’t yet synced the most recent CRFS changes from that machine to a box at Oracle. The source that I’m giving a talk about on Friday here in Melbourne. Where the source is intended to be released.

So, I guess this means I get to play Christmas on Newegg with PC hardware when I get back. Yay, prezzies!

Melbourne bound!

Well, I’m heading down to Melbourne for LCA 2008 in a few hours. I’m not exactly excited by the length of the trip (SKW6084 and UAL839) but I’m definitely looking forward to attending LCA and to seeing Melbourne. It looks like a nice city. It’s a shame that I didn’t arrange to stay longer. Ah, well.

I’ll be giving a talk on CRFS while I’m down there. I did a practice run for a small audience of friendly Linux folks in Portland which was well received so I have high hopes that people at the conference will enjoy it. I know I certainly enjoy talking about this technology, but, well, I guess I would ;).

I thought I’d share a slide from the talk that I find geeky and satisfying:

silly-rename003.png

The slide is demonstrating a particularly weird behaviour of the Linux NFS client adorably called silly renaming. I like the slide because it’s using a relatively small set of system calls to illustrate how differently NFS can behave than “local” file systems. I use it during the talk to illustrate one of my primary motivations for working on CRFS — that we have a network file system that doesn’t penalize its users by requiring that their applications know to work around its behavioural quirks.
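For readers who haven’t run into this, here is a minimal sketch of the kind of sequence involved. It is not the contents of the slide; the function name and the file name "victim" are made up for illustration. It unlinks a file while it is still open and reports what the directory looks like afterwards. On a local file system the directory comes back empty. On an NFS client mount you would expect to see a leftover entry whose name starts with ".nfs" until the file is closed.

```python
import os


def unlink_while_open(directory):
    """Create a file, unlink it while it is still open, and report what is left.

    Returns (link_count, leftover_names) as observed right after the unlink.
    """
    path = os.path.join(directory, "victim")  # hypothetical file name
    fd = os.open(path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        os.write(fd, b"still here\n")
        os.unlink(path)
        # The open descriptor keeps the file contents alive either way.
        link_count = os.fstat(fd).st_nlink
        # Local file systems show an empty directory here; an NFS client
        # typically shows a ".nfs..." entry left behind by the silly rename.
        leftovers = sorted(os.listdir(directory))
        os.lseek(fd, 0, os.SEEK_SET)
        assert os.read(fd, 64) == b"still here\n"
    finally:
        os.close(fd)
    return link_count, leftovers
```

Run it once against a scratch directory on local disk and once against one on an NFS mount and compare the two results.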

Anyway, if this stuff interests you I hope you’ll come have fun at the talk with us.

A little more CRFS detail

In my previous post about CRFS metadata performance I said that I didn’t want to go into too much detail until the source is released. I still don’t want to but Evgeniy Polyakov is tempting me! He’s having a good time learning by experimenting with network file systems and posted some theories about CRFS. I’ll respond to his theories with a series of facts about the CRFS protocol and implementation because, well, I love talking about this stuff and rarely get a chance.

The userspace server I’ve implemented (“crfsd”) is btrfs specific. It works directly with the on-disk structures in a btrfs volume. You don’t specify a file system directory tree to export, you specify a block device which contains a btrfs file system. crfsd has exclusive access to the contents of that block device while it is running.

The CRFS client kernel module (“crfs.ko”) doesn’t require kernel patches. I happen to be tracking mainline but, so far, there has been nothing significant in the implementation that restricts it to modern kernels. The use of ->write_begin() will probably be the first thing that starts to restrict the kernel versions that it will support but that hasn’t happened yet.

CRFS does perform writeback caching of metadata operations. The huge performance benefit this brings justifies the complexity of implementing it, which can’t be overestimated. Designing the protocol and then implementing the kernel client such that we can keep this complexity under control is one of the most important aspects of the CRFS system as a whole.

The CRFS network protocol could be said to batch operations, it’s true, though phrasing it that way gives the wrong impression. It’s not like some kind of explicit compound RPC mechanism. Think of it more like the batching that happens when ext3 reads in a block full of inodes as it goes to read a specific inode that it is interested in. CRFS achieves similar results from a very different organization of metadata. Think of it as reading and writing groups of items from btrfs leaf blocks because that’s exactly what it is. The opportunistic priming of client caches when they perform normal metadata read requests, at insignificant additional cost, is a natural side-effect of the way CRFS represents metadata.
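To make the cache priming idea concrete, here is a toy model. It is emphatically not the CRFS protocol or the btrfs leaf format; the class, its fields, and the use of plain dicts as "leaves" are all assumptions made for illustration. The point is only that a miss on one item brings in its neighbours from the same leaf, so later lookups nearby cost no extra round trips.

```python
import bisect


class LeafPrimedCache:
    """Toy client cache that pulls in a whole leaf on every miss."""

    def __init__(self, leaves):
        # `leaves` stands in for the server: a list of non-empty dicts
        # mapping key -> item, with keys ascending from one leaf to the next.
        self._leaves = leaves
        self._first_keys = [min(leaf) for leaf in leaves]
        self._cache = {}
        self.round_trips = 0

    def _fetch_leaf(self, key):
        # One simulated network round trip returns every item in the leaf
        # that would contain `key`.
        self.round_trips += 1
        index = bisect.bisect_right(self._first_keys, key) - 1
        return self._leaves[max(index, 0)]

    def lookup(self, key):
        if key not in self._cache:
            # The miss primes the cache with the requested item's neighbours.
            self._cache.update(self._fetch_leaf(key))
        return self._cache.get(key)


cache = LeafPrimedCache([{1: "a", 2: "b", 3: "c"}, {4: "d", 5: "e"}])
cache.lookup(2)  # one round trip, which also brings in items 1 and 3
cache.lookup(1)  # served from the cache
cache.lookup(3)  # served from the cache
```

After those three lookups the model has made a single round trip; asking for item 5 would cost a second one.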

And with that, I should really return to a nice holiday break.

CRFS performance teaser

Friends and colleagues have been hearing me talk about CRFS for a while. CRFS is an acronym that stands for “coherent remote file system”. It’s a project that I’ve been working on to implement a networked file system that is, well, great. I haven’t been too public about it partially for fear of being accused of peddling vapourware but mostly because we’re still working in Oracle to get approval to release the code.

That said, the implementation is far enough along that I can make some meaningful performance measurements. I thought I’d share one which demonstrates what CRFS can do for metadata performance.

These tests were run between two machines. Each has an onboard e1000 chip connected to a cheap consumer-grade netlink gigabit switch. Each has 2 gig of memory and a single dual-core Intel processor of the Penryn generation.

Each test iteration is trivial. We make a new file system on the server, mount it on the client, untar a kernel source tree, purge the client’s data cache, and then read back the file data. Specifically, we run the following commands on the client:

tar -xf /dev/shm/linux-2.6.17.tar
echo 1 > /proc/sys/vm/drop_caches
find linux-2.6.17 -type f | xargs cat > /dev/null

We repeat this series first with the server storing the file system on a single SATA drive and then in ram (tmpfs) only. The CRFS numbers would be pretty baffling on their own so we also run the test over NFS (v3, TCP). We record measurements just like the time(1) command: real wall clock time, cpu time spent in userspace, cpu time spent in the kernel.

         nfs (seconds)          crfs (seconds)
        real   user    sys     real   user    sys   command

disk:  45.12   0.12  10.22    12.55   0.09   2.69   tar -xf /dev/shm/linux-2.6.17.tar
       19.21   0.05   3.54    11.04   0.05   1.17   find linux-2.6.17 -type f | xargs cat > /dev/null

 ram:  43.83   0.13   9.91     7.90   0.12   2.66   tar -xf /dev/shm/linux-2.6.17.tar
       18.64   0.08   3.61    10.68   0.05   1.00   find linux-2.6.17 -type f | xargs cat > /dev/null

The NFS numbers are roughly the same whether it’s storing on disk or in ram because we’re using the ‘async’ option. Asking NFS to actually perform each write operation on disk wouldn’t have been sporting at all.

CRFS is limited by the disk speed because its userspace server is waiting for writes to hit disk before sending a response to the client.

CRFS is able to do the same work in less time, even when writes go all the way to disk, because its network protocol goes to great lengths to reduce conversation over the network.
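If you want to collect the same kind of real/user/sys triples yourself, something along these lines would do. It is a sketch, not the harness used for the numbers above, and the function name is made up. It leans on the fact that the children’s resource usage counters only grow as child processes are waited for, so the difference across a run is the cpu time of that command.

```python
import resource
import subprocess
import time


def timed_run(command):
    """Run a shell command and return (real, user, sys) in seconds."""
    before = resource.getrusage(resource.RUSAGE_CHILDREN)
    start = time.monotonic()
    subprocess.run(command, shell=True, check=True)
    real = time.monotonic() - start
    after = resource.getrusage(resource.RUSAGE_CHILDREN)
    # cpu time accumulated by the command and the children it waited for
    user = after.ru_utime - before.ru_utime
    system = after.ru_stime - before.ru_stime
    return real, user, system
```

For example, timed_run("tar -xf /dev/shm/linux-2.6.17.tar") run from inside the client mount gives the three numbers for the untar step.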

I won’t waste everyone’s time with details until the code is out there and available for people to play with. My intention is to give people something to look forward to :).

The description of my upcoming CRFS talk at LCA ‘08 in Melbourne provides a little more detail. Do come to the talk if you can! It should be fun.

OLS 2000: we were more fun then, also less old

I was cleaning up some files this morning and ran across a set of photos from OLS 2000. I wasn’t more than five or six photos in before I found myself lost in a fit of giggles. What I came to find more interesting, though, was how many were actually decent photos of my lovely friends!

So I put the whole lot in a set on flickr entitled, wait for it, OLS 2000. I figure seven years is long enough for embarrassment to have ripened into nostalgia.

It’s funny to put funny things on your head:


And these aren’t half bad:



you can has AP9211/9606

I’ve long been a fan of the discontinued APC AP9211 distribution unit. To wit:

  1. It’s an efficient 1U, taking up just enough space for the 8 outlets.
  2. It is gloriously devoid of any moving parts, including noisy fans.
  3. powerman comes with scripts to manage its outlets from the command line over ethernet. Once you get used to the convenience of “powerman -c $host” you never go back.
  4. It’s common enough to easily be found on ebay.

That last point brings us to this post. Mark got a pair for his machines after I showed him the light. He got them from an auction that has 3 days left and, at the time of this writing, 19 units left. Each has a pretty reasonable price for immediate purchase with shipping.

APC MasterSwitch AP9211 w/ AP9606 Control Module (ebay auction 230196539398)

It’s a chance for poor kernel developers out there to stop being frustrated by having to lose time rebooting boxes in person.

fibre channel hardware sent out to pasture

I’m very excited. In a small number of hours someone from the PostgreSQL project will be coming by to take away my old fibre channel storage setup. I had used it for OCFS2 development, mostly, but haven’t touched it in ages. I do hope it works out well for them. It consists of the following:

  1. 8 incredibly long copper cables
  2. about a million (ok, 11) copper GBICs
  3. 8 qla2100 PCI cards (these are not excellent)
  4. A tray of 10k rpm Cheetahs, complete with funky NetApp firmware, which could kill a man who attempted to lift them unaided
  5. a fibre channel switch whose fans could then wake said dead man

Did I mention the excitement? This crap — er, fantastic storage infrastructure — will no longer take up space in the basement.

These days I do my storage work on an HP DS2405 which is directly connected to an Emulex LPe11002, both donated by said companies. The old setup could hit 100MB/s on a good day. The current kit hits 450MB/s without trying very hard. If you have the means, I highly recommend picking one up.

inserting files into Thunderbird without reformatting

Let’s start with the observation that the Linux kernel community discourages sending patches as attachments (tpp, section 1a). Sending patches as attachments creates pain for the sender, regardless of what you or I might think of this fact.

Thus, it’s a shame that Thunderbird doesn’t have a trivial mechanism to insert a file in to its plain text composing interface without altering the file. There are at least three stumbling blocks: translating tabs into spaces, trimming trailing whitespace, and wrapping lines.

One suggested hack to work around this is to use an external editor extension which lets you fire up, say, vim to insert the file into the editing buffer. This doesn’t work well in OS X, for much the same reasons that it doesn’t work well for eclipse.

After some trial and error I found Quicktext, an extension for inserting signatures, which can be abused to insert files. After installing the extension, one:

  1. sets the word wrapping preference to 0 before composing the new message
  2. inserts the file as HTML, to preserve trailing whitespace
  3. restores the previous word wrapping preference

This is violently imperfect. Inserting as HTML to preserve whitespace runs the risk of escaping HTML which might be in the file. I do this so infrequently that it’ll do for now. I usually send patches with tools (git-send-email, hg email, sendpatchset).

I was reasonably excited to find that it seems like someone who understands Mozilla extensions could build an extension to insert text without altering the input. nsIPlaintextEditor seems to have knobs to disable translation of whitespace and word wrapping. I might try my hand at this some day but would be even happier if someone beat me to it.
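Until such an extension exists, it’s worth mailing a patch to yourself first and checking that it survived. Here is a rough heuristic sketch that looks for the three stumbling blocks listed earlier: lost tabs, trimmed trailing whitespace, and wrapped lines. The function name is made up and the checks are deliberately crude; comparing the two files byte for byte is the real test.

```python
def mangling(original, received):
    """Return a list naming the ways `received` differs from `original`."""
    problems = []

    # Tabs translated into spaces show up as a change in the tab count.
    if original.count("\t") != received.count("\t"):
        problems.append("tabs")

    def trailing(lines):
        # number of lines that end in spaces or tabs
        return sum(1 for line in lines if line != line.rstrip(" \t"))

    orig_lines = original.split("\n")
    recv_lines = received.split("\n")
    if trailing(orig_lines) != trailing(recv_lines):
        problems.append("trailing whitespace")

    # Wrapping (or unwrapping) changes the number of lines.
    if len(orig_lines) != len(recv_lines):
        problems.append("line wrapping")

    return problems
```

An empty list means none of the three heuristics fired, which is encouraging but not proof that the patch is intact.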

direct write cache invalidation failure illustrated

An associate berated me for not posting about work things recently. Fair enough. Here’s the start of an attempt to do more blogging about my daily work.

For me, last week ended with a thread on lkml wherein a poor user reported having actually hit the nasty case where an O_DIRECT write doesn’t invalidate the page cache after a buffered reader races to bring in stale cached data during the write.

In his case he had a writer advancing through a file writing new contents. As it wrote it’d wake a buffered reader who would read up to the point of the new content that the writer had just written.

The problem was that the reader could trigger the kernel to read-ahead up into the region where the writer is currently writing with O_DIRECT. The kernel was failing to invalidate the existing page cache after the O_DIRECT write completes. The buffered reader will then wake and read stale data which was brought in with read-ahead from its previous reads. Wackiness ensues!

So, I threw together a test case. I know you, dear readers, have been just dying to see what this terrifying corner case actually looks like. Wonder no more!

[zab@hammer c]$ ./aio-dio-invalidate-check /tmp/something
( lots of time passes )
writing 2 to 3248128
setting write_pos to to 3248128
writing 2 to 3252224
reading from 2850816 to to 3248128 looking for 1
read 3248128 write 3248128
writing 2 to 3256320
writing 2 to 3260416
writing 2 to 3264512
writing 2 to 3268608
writing 2 to 3272704
setting write_pos to to 3272704
writing 2 to 3276800
writing 2 to 3280896
setting write_pos to to 3280896
writing 2 to 3284992
writing 2 to 3289088
writing 2 to 3293184
writing 2 to 3297280
reading from 3248128 to to 3280896 looking for 1
reader found old byte at pos 3252224

[zab@hammer c]$ od -A d -x /tmp/something
0000000 0202 0202 0202 0202 0202 0202 0202 0202
*
3252224 0101 0101 0101 0101 0101 0101 0101 0101
*
3256320 0202 0202 0202 0202 0202 0202 0202 0202
*
3301376 0101 0101 0101 0101 0101 0101 0101 0101
*
8388608

[root@hammer ~]# echo 1 > /proc/sys/vm/drop_caches

[zab@hammer c]$ od -A d -x /tmp/something
0000000 0202 0202 0202 0202 0202 0202 0202 0202
*
3301376 0101 0101 0101 0101 0101 0101 0101 0101
*
8388608

That last bit shows the stale data present in the cache, the cache being purged, and then the stale data vanishing as the file is read back from disk.

Terrifying stuff, I know, but it is almost Halloween.
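If you’d rather not squint at od output, the same check can be scripted. This sketch is not part of the test case; the function name is made up, and it assumes that every block in the file is filled with a single byte value and that blocks are 4096 bytes, which matches the offsets in the output above (they are all multiples of 4096). It reports the offsets at which the fill byte changes.

```python
def byte_runs(path, block_size=4096):
    """Return [(offset, byte_value), ...] marking where the fill byte changes.

    Assumes each block is uniformly filled, so only the first byte of each
    block is inspected.
    """
    runs = []
    offset = 0
    previous = None
    with open(path, "rb") as f:
        while True:
            block = f.read(block_size)
            if not block:
                break
            value = block[0]
            if value != previous:
                runs.append((offset, value))
                previous = value
            offset += len(block)
    return runs
```

These are ordinary buffered reads, so they go through the page cache just as od’s do. Against the file above, the stale region would show up as a run of 1s starting at 3252224 and ending at 3256320, which is exactly one 4096 byte block, and it would disappear after the caches are dropped.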

Bi-Mon-Sci-Fi-Con

Over the years I’ve built up a collection of badges from tech conferences. They’re an interesting record of the path my career took from UNIX sysadmin work to Linux kernel development. I noticed that the oldest of them is now 10 years old and thought it would be a fun time to show them off. I put up a Conference Badge photo set on flickr.

They come to 112MB of 600dpi scans which motivated me to shell out for a Pro flickr account. I had to crop the badge from thebazaar as its enormous speaker ribbon pushed the file size up to 17MB, compressed.

Ted vs. Infotainment

A few months ago I got an email out of the blue from a researcher at ABC News. After some prodding she admitted that she wanted to talk about background information on Hans Reiser. I happily declined, having managed to avoid that particular corner of the Linux file system world.

Today Ted mentions that he tried to help ABC understand the basics.

Let’s hope that it doesn’t end up like the disastrous piece in Wired. I’m not holding my breath.

being sneaky on ebay

They never saw me coming.

digg comments on btrfs

A friend pointed out that a reference to btrfs appeared on digg. I wasn’t sure that it merited much attention but a colleague expressed interest in learning more about btrfs.

I should first set the stage by explaining my relation to btrfs. Chris Mason, its primary developer, is my manager at Oracle. He and I started working on btrfs quite a few months ago. I fell back into a more advisory role after I moved on to work on a related project while Chris continued working diligently on the initial btrfs implementation. While I’m not intimately familiar with the code, I’m pretty familiar with the design trade-offs that it currently makes.

I’ll address some of the honest confusion expressed in the comments to that digg post by translating them into questions that one might ask while not suffering from the effects of John Gabriel’s GIF Theory.

btrfs isn’t considered stable and isn’t supported. That scares me. Why is btrfs available before it is feature-complete and stable?

Once a file system is complete and supported it becomes very hard to work in features that weren’t originally available. Adding new features that require changes to the format of persistent data on disk becomes much, much, harder. By making it available at this stage we give people the opportunity to request features that might not have occurred to us. All file systems go through this stage, we’re just exposing it to a wider group of people. One is always welcome to simply ignore btrfs until it’s supported if that’s what one desires.

I live a very busy life and couldn’t be bothered to look at the license that btrfs is released under and instead chose to imply that it wasn’t free and open. Was this not the most clever thing I’ve done recently?

Probably. btrfs is released under the GPLv2, the same license as the Linux kernel.

For whatever reason, I have a negative impression of software that is related to the word Oracle. Should I transfer that negativity to btrfs because it is also associated with the word Oracle?

Probably not. The kernel development team at Oracle that produces btrfs is made up of people who worked on the Linux kernel long before they agreed to come work on the kernel for Oracle. Never fear, we tend to work from home in distant states, countries, and continents — far from the influence of whatever magical anti-awesome sauce it is that you think Oracle puts in its developers’ food.

Oracle also developed OCFS2. Are the two projects related?

Not really, although I worked on OCFS2 for a time. The two file systems solve different problems and their development efforts have different resources at their disposal. OCFS2 is about helping multiple machines work on a shared file system without corrupting each others’ efforts. That’s incredibly difficult. btrfs is about making the best of modern file system features available to the majority of Linux installations for the simple case where there’s only one computer using it. That’s relatively less difficult.

btrfs is a new file system. I also know of another new file system, ZFS. Does btrfs make ZFS unnecessary?

I can think of no way in which a current ZFS user would be satisfied by btrfs. If for no other reason than the simple fact that btrfs is not supported anywhere and ZFS is not seriously available to Linux users. Maybe one could entertain having this conversation once btrfs is supported on Linux and Solaris and ZFS is supported on Linux.

All this talk of ZFS and btrfs reminds me that I once heard that ZFS can be slow, or something. Might that also be said of btrfs?

Yes, in as much as that can be said of each and every file system in existence. File system engineering is, at its core, a game of having to choose amongst conflicting desires. It’s often the case that implementing a feature in a particular way will benefit one usage pattern while harming some other usage pattern. btrfs and ZFS, both incorporating design elements more modern than the Reagan administration, will tend to choose to skew the trade-offs in similar directions, most of the time.

There are already lots (and lots) of file systems available for Linux. What does btrfs do that those file systems don’t?

Sometimes it can be hard for those of us who work on file systems to clearly communicate why it is that we dislike existing designs. It’s complicated stuff. There’s one property of current Linux file systems, though, that seems like it should be universally ill-received.

Almost all Linux file systems provide almost no protection against data corruption. The only protection they offer is to propagate errors from the storage system up to the application. If the storage system doesn’t realize that the data has been corrupted, perhaps because the corruption happened after the drive, these file systems can get very confused. Returning bad data to applications, overwriting the wrong data on disk, crashing machines, etc.

Now, storage systems have been surprisingly reliable, it turns out. But Linux thrives on cheap commodity hardware, which is not exactly famous for being rock solid. The persistent march of hardware towards commoditization and cheaper manufacturing does not bode well for the future.

That btrfs takes strong measures to address the risk of corruption is the most exciting run-time feature for me. I want flakey hardware to result in a console message indicating data corruption, not mysterious behaviour or kernel panics that some incredibly expensive human has to diagnose.

I mean, no one would ever consider disabling checksumming in TCP. Why on earth do we allow our file systems to operate without similar protection?
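The idea is simple enough to sketch in a few lines. This is an illustration of the principle only, not btrfs’s on-disk format or code: the function names are made up and zlib’s CRC-32 is just a convenient stand-in for whatever checksum a real file system would use. A checksum is stored alongside each block when it is written and verified when it is read back, so silent corruption turns into a loud error.

```python
import zlib


def store_block(data):
    """Return what would be written out: the checksum plus the data."""
    return (zlib.crc32(data), data)


def load_block(stored):
    """Verify the checksum before handing data back to the caller."""
    checksum, data = stored
    if zlib.crc32(data) != checksum:
        # Better a clear error than quietly returning bad data.
        raise IOError("checksum mismatch: block is corrupt")
    return data
```

A block that comes back from storage with even one byte changed fails the comparison in load_block instead of reaching the application.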

Booting built kernels with PXE

Linux kernel development can be made a lot nicer by automating some steps in the process. Sadly, there is no shared software package which performs this kind of automation. Most everyone who takes kernel development seriously ends up writing their own tools tailored to their environment. I thought I’d spend a few blog posts sharing the tools I’ve thrown together over the years.

I first tried writing this in a tone which didn’t assume that the reader already knew about kernel development, but it just didn’t work. So, my apologies to those who have no idea what on earth I’m talking about here.

I thought I’d start by describing scripts which help remove the irritating step of installing a new kernel on a test machine’s local drive. Doing so speeds up the compile-boot-test cycle. It goes something like this.

First, I install extra e1000 cards in all the machines so that I can use their PXE boot ROM to boot kernels and initrds over the network. This may not be a great fix for everyone, it just happens to work for me because I was given a box of discarded e1000 cards.

Distros tailor initrds to the hardware of a specific machine (foolishly, I believe, but here we are). To get each machine booting its own initrds I configure dhcpd to point each host at a specific pxe.cfg which in turn references initrds for that host. Each pxe.cfg is built from a per-architecture stub by a script.
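To give a feel for the pxe.cfg step, here is a rough sketch of a generator. It is not the actual script: the function name and the extra_args parameter (standing in for whatever the per-architecture stub contributes) are assumptions. The hosts/<host>/ path layout is taken from the boot output further down, and the directives are ordinary PXELINUX configuration keywords.

```python
def render_pxe_cfg(host, versions, default, extra_args=""):
    """Return PXELINUX config text with one label per kernel version."""
    if default not in versions:
        raise ValueError("default label %r is not among the versions" % default)
    lines = ["DEFAULT " + default, "PROMPT 1", ""]
    for version in versions:
        # Each label boots that version's kernel with its matching initrd.
        append = "initrd=hosts/%s/initrd-%s" % (host, version)
        if extra_args:
            append += " " + extra_args
        lines += [
            "LABEL " + version,
            "  KERNEL hosts/%s/vmlinuz-%s" % (host, version),
            "  APPEND " + append,
            "",
        ]
    return "\n".join(lines)
```

Calling render_pxe_cfg("hammer", ["2.6.21.1-syslets"], "2.6.21.1-syslets") produces a config whose default label loads hosts/hammer/vmlinuz-2.6.21.1-syslets and the matching initrd.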

The initrds are generated from a copy of the initrd which the distro built for a given host. A script simply replaces each kernel module in the distro’s initrd with the kernel module from the newly built kernel. It’s only a rough approximation of correct behaviour but it has worked so far.
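The module replacement step might look something like the following. Again, this is a sketch and not the real script: the function name is made up, and it assumes the distro’s initrd has already been unpacked into a directory and that modules can be matched by file name alone.

```python
import os
import shutil


def replace_modules(initrd_root, build_modules_dir):
    """Overwrite each .ko under initrd_root with the same-named built module.

    Returns (replaced, missing) lists of module file names. `missing` holds
    modules the initrd wants that the new build did not produce.
    """
    # Index the freshly built modules by file name.
    built = {}
    for dirpath, _dirs, files in os.walk(build_modules_dir):
        for name in files:
            if name.endswith(".ko"):
                built[name] = os.path.join(dirpath, name)

    replaced, missing = [], []
    for dirpath, _dirs, files in os.walk(initrd_root):
        for name in files:
            if not name.endswith(".ko"):
                continue
            if name in built:
                shutil.copyfile(built[name], os.path.join(dirpath, name))
                replaced.append(name)
            else:
                missing.append(name)
    return sorted(replaced), sorted(missing)
```

Returning the missing names separately lets the caller warn about them rather than failing outright.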

The next problem is that the distro assumes that it will find all the modules for this kernel in the root file system. To accomplish this the script includes all the modules in the initrd. It mounts the root file system read-write and uses a statically linked rsync to copy all the modules from the recent build of the kernel into /lib/modules on the host. It then remounts the root fs read-only before continuing on with the boot.

That’s it, really. Part of this is based on the observation that it is, in fact, the 21st century. I’m not booting from 10mbit ether or floppies and don’t care one bit if the initrds are “huge”.

$ ls -hs *2.6.21.1-*
 30M initrd-2.6.21.1-syslets  1.8M vmlinuz-2.6.21.1-syslets

Here it is in action:

[zab@kaori 2.6-syslets]$  zk-install-pxe-initrd
[zk] preparing initrd for hammer
[zk] Warning: hammer needs uhci-hcd.ko
[zk] Warning: hammer needs ehci-hcd.ko
[zk] Warning: hammer needs ohci-hcd.ko
[zk] building initrd for hammer
148402 blocks
[zab@kaori 2.6-syslets]$ zk-build-pxe -r
[zk] hammer:
[zk] 2.6.21.1-syslets
[zk] making '2.6.21.1-syslets' the default pxeboot label
[zab@tetsuo ~]# powerman -1 hammer
Command completed successfully
[zab@tetsuo ~]$ console hammer
[Enter `^Ec?' for help]
PXELINUX 3.10 2005-08-24  Copyright (C) 1994-2005 H. Peter Anvin
boot:
Loading hosts/hammer/vmlinuz-2.6.21.1-syslets................................
Loading hosts/hammer/initrd-2.6.21.1-syslets.......................[many, many, dots]
Ready.
[    0.000000] Linux version 2.6.21.1-syslets [...]
[ ... ]

Fedora Core release 6 (Zod)
Kernel 2.6.21.1-syslets on an x86_64

hammer login:

The zk prefix I chose for these little scripts ostensibly stands for “zabbo kernel”, but it’s really an inside joke that refers to the ZK_ prefix that ZeroKnowledge used when reimplementing the entire world in their software. Starting with, no seriously, ZK_TRUE and ZK_FALSE.

vim quickfix error format for sparse

I, like countless others, use vim’s quickfix mode to ease the pain of the compile-fix-compile cycle. vim parses the output of the build so that it can present a summary of errors and enable navigation between them.

sparse is a tool that knows how to find errors in C code that compilers like gcc don’t notice. It requires minimal annotation in the source but provides invaluable functionality, like warning when endian conversions are forgotten.

Which brings us to the point of this post. sparse spits out multi-line error messages that vim doesn’t completely understand:

tests/btree-stress.c:121:55: warning: incorrect type in initializer (different base types)
tests/btree-stress.c:121:55: expected restricted unsigned long long [usertype] b_offset
tests/btree-stress.c:121:55: got long long [signed] [usertype] offset

vim doesn’t know that these three messages all belong to the same error. It offers them to the user as three separate errors:

:clist
35 tests/btree-stress.c:121 col 55: warning: incorrect type in initializer (different base types)
36 tests/btree-stress.c:121 col 55: expected restricted unsigned long long [usertype] b_offset
37 tests/btree-stress.c:121 col 55: got long long [signed] [usertype] offset

This is irritating because to navigate past this one error you have to know to navigate past three quickfix entries. This has been irritating me for, I don’t know, years now. I finally sat down and spent an hour or so poisoning my brain with vim’s arcane configuration.

set efm^=%W%f:%l:%c:\ warning:\ %m,%C%f:%l:%c:\ \ \ \ %m,%Z%f:%l:%c:\ \ \ \ %m

et voila. Now vim considers those three error messages as coming from one error:

:clist
35 tests/btree-stress.c:121 col 55 warning: incorrect type in initializer (different base types) expected restricted unsigned long long [usertype] b_offset got long long [signed] [usertype] offset
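For anyone who would rather see the grouping spelled out than decode the format string, here is a rough Python sketch of the same idea. It is not how vim implements errorformat, and the rule it uses is my own assumption: a line starts a new entry if its message begins with "warning:" or "error:", and any other line at the same file:line:col is folded into the previous entry. The function name and the dictionary layout are made up for illustration.

```python
import re

# Matches "file:line:col: message". This layout is taken from the sample
# output above; it is an assumption that all sparse diagnostics look like it.
LOCATION = re.compile(
    r"^(?P<file>[^:\s]+):(?P<line>\d+):(?P<col>\d+):\s*(?P<msg>.*)$"
)


def group_sparse_output(lines):
    """Fold continuation lines into the diagnostic that precedes them."""
    entries = []
    for raw in lines:
        m = LOCATION.match(raw.rstrip("\n"))
        if m is None:
            continue  # not a diagnostic line; this sketch just skips it
        where = (m["file"], int(m["line"]), int(m["col"]))
        msg = m["msg"]
        # Hypothetical rule: only these prefixes begin a new entry.
        starts_new = msg.startswith(("warning:", "error:"))
        if not starts_new and entries and entries[-1]["where"] == where:
            entries[-1]["msg"] += " " + msg
        else:
            entries.append({"where": where, "msg": msg})
    return entries


sample = [
    "tests/btree-stress.c:121:55: warning: incorrect type in initializer (different base types)",
    "tests/btree-stress.c:121:55: expected restricted unsigned long long [usertype] b_offset",
    "tests/btree-stress.c:121:55: got long long [signed] [usertype] offset",
]

for entry in group_sparse_output(sample):
    path, line, col = entry["where"]
    print(f"{path}:{line} col {col} {entry['msg']}")
```

Run on the three lines from earlier, this yields a single entry whose message is the warning followed by the "expected" and "got" lines, which is the same shape as the combined :clist entry.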

I’m sure that format won’t catch all of sparse’s errors but it’ll be easy to derive additional formats from it.

I’m also sure that I’m not the first to do this. It would be nice if the sparse guys shipped a sourceable .vimrc along with the tools.

Andrew says wise things at FOSDEM 2007

Andrew Morton’s talk at FOSDEM is available in ogg format. It’s worth a watch if you’re interested in Linux kernel development. He walks through technology that is currently consuming kernel developers. He starts with an interesting explicit link between the life cycles of products which include the kernel and the willingness of companies producing those products to fund development of the mainline kernel. He ends with the refreshing admission that the kernel development community exists to serve users.

I will admit that I was referred to this video because of his description of ext4: “the next horrible version of a horrible file system.” It’s hard for me to disagree, given this modern age of ZFS.

I was playing his video along in the background while working on an O_DIRECT bug when my ears perked up at his mention of AIO. I was pleased to hear him give a very reasonable overview of the awkward fibrils prototype I sent out which led to Ingo’s more refined syslets. I’ll even go so far as to transcribe that part of his presentation:

AIO has been a problem for a long time. Even though we have all the AIO interfaces there, they don’t actually work. They’re only actually asynchronous for direct IO. So if you’re doing a normal buffered read or write to disk via AIO, all the interfaces work, but in fact we do the IO synchronously. There are patches out there to make the buffered IO asynchronous, but I’ve been the obstacle to merging those for the past couple of years.

And amazingly, Zach Brown at Oracle and Ingo Molnar have come up with a completely different way of doing it which has made me look very smart for not merging that code. What they’re proposing is, basically, make any system call asynchronous. So, potentially, the whole range of system calls in the kernel… you could fire them off and return to your application and later on get a notification when the system call completes. That, then, just means that all the AIO support we currently have we’ll no longer need ’cause you can just use normal old read and write and you just say “Yup, I want to do this in the background, please”.

Indeed, particularly that last bit.
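To make "normal old read, in the background, please" a little more concrete, here is a loose userspace analogy in Python. To be clear about the assumptions: this uses a plain thread pool, not syslets or fibrils, and it says nothing about how the kernel side would work. The helper names (read_in_background, demo) and the payload are invented for the illustration. The only point is the shape of the programming model: submit an ordinary blocking call, keep going, and get told when it completes.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

PAYLOAD = b"hello from a plain old read\n"


def read_in_background(pool, path, on_done):
    """Submit an ordinary blocking read and return immediately."""

    def blocking_read():
        # Nothing AIO-specific here, just the usual open and read.
        with open(path, "rb") as f:
            return f.read()

    future = pool.submit(blocking_read)
    # The "notification when the call completes" part of the model.
    future.add_done_callback(lambda fut: on_done(path, fut.result()))
    return future


def demo():
    completions = []

    # Write a small file to read back.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(PAYLOAD)
        path = tmp.name

    try:
        with ThreadPoolExecutor(max_workers=2) as pool:
            read_in_background(
                pool, path, lambda p, data: completions.append((p, len(data)))
            )
            # The caller is free to do unrelated work here while the
            # read is in flight.
        # Leaving the with block waits for the worker thread, so the
        # completion callback has run by this point.
    finally:
        os.unlink(path)
    return completions


if __name__ == "__main__":
    for path, nbytes in demo():
        print(f"read of {path} completed: {nbytes} bytes")
```

The appeal of the real proposal is that the application would not need the thread pool at all; the kernel would take care of running the call in the background.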