AIO through stack scheduling
Friday, December 8, 2006
I don’t think it’s very controversial to say that Linux has lame AIO support.
From the user’s perspective you can only perform a few operations asynchronously, and it’s only in certain conditions that the submitting process won’t block. For example, you can submit an AIO disk write only if it’s O_DIRECT (which not all file descriptors support), and even then it may still block if it has to read meta-data off disk to find out where to perform the write. Want to perform other slow disk operations like open(), rename(), stat(), or unlink() asynchronously? Sorry. Networking? Nope. How about an AIO interface to hardware crypto accelerators? That’s just adorable!
This starts to make sense when you look at the AIO implementation in the kernel. The interface was implemented (I share the blame for this) as a separate subsystem. To bring AIO support to an existing interface (say, sys_write()), one has to duplicate its arguments over in the AIO subsystem and then provide a code path which implements the interface and communicates with the AIO subsystem when an operation blocks. The code path can take on the responsibility of explicitly suspending and resuming the operation as progress is made. This requires changing the code at any blocking point to return an error code (EIOCBQUEUED) and eventually call a completion routine (aio_complete()). The AIO subsystem also has the notion of retrying an operation until it finally succeeds without blocking, though to this daynothing in the mainline kernel uses this code — imagine how robust this unused (untested) code is.
In my opinion, we’re in the current situation due to the most fundamental problem with this implementation: it asks a maintainer of already complicated code paths to destabilize their code to bring AIO support. Not only are there initial development costs, in addition one ends up with different code paths that have to be tested. All this for a small portion of the Linux installed base.
Wasn’t it sneaky how I snuck that last sentence in there? That brings us to the sad fact of Linux AIO work: it suffers from a chicken-n-egg problem. There aren’t a lot of users of AIO out there because the kernel support is poor. With few users it’s hard to fund the development and justify the risk of AIO. There are very few users who are willing to spend significant sums on initial AIO development. Once they’ve gotten their pet interface working in AIO, they move on to their next priority. This is pretty self-evident from the current code. The most actively debugged AIO interface, O_DIRECT block IO, is the one that IBM and Oracle care about for their databases. (Red Hat helps, hi Jeff!)
This all came to a head in an email thread which started with Linus suggesting that we just tear out the current AIO support. This initial extremism eventually turned into an interesting suggestion for an alternative. The scheduler knows when a code path blocks. What if we implement AIO by implementing a micro-scheduler inside a task which allows multiple stacks to execute in the kernel on behalf of a process? If a disk operation blocks its stack is swapped out and another is allowed to run. Eventually when that disk operation can make progress again its stack is swapped back in.
From the user’s perspective, this would be fantastic. Any system call could be used asynchronously without altering its behaviour. An async getpid(), for example, would return the callers PID, not some strange PID of a thread that was secretly performing the operation in the background. The submitting call itself would never block. If one of its submitted operations went to block for any reason — semaphores, file system IO, even memory allocation — the scheduler will swap it out and return to the submitting stack.
From the Linux kernel developer’s perspective, this proposal sends shivers down the spine. There is a vast ocean of implementation detail that makes this difficult. Initially Ben and I latched on to a few of these hurdles and dismissed the idea as unworkable.
A year has passed since that email thread. I spent a good portion of that year debugging the interaction between O_DIRECT and AIO in fs/direct-io.c. What little remaining respect I had for the current implementation was chipped away. I’ve come to the conclusion that the potential benefits of scheduling stacks are significant enough to justify a proof-of-concept that would give us a concrete example of what it is that we’re talking about. My shiny new manager (the one with the beard) agreed.
For the past few weeks I’ve been spending idle cycles on that proof-of-concept. Today a very exciting thing happened: an O_DIRECT read from an IDE block device completed. A tiny example application produces the following output:
# ./dio
submit returned 2 at 1165617048.836733
completion returned 1 at 1165617048.837626, return code 3741 cookie 1234
completion returned 1 at 1165617048.844264, return code 512 cookie 5678
What’s happening here is roughly as follows:
- asys_submit() translates the specified system calls and arguments into stacks which, when executed, call into the each system call handler with the given arguments. It marks these stacks as runnable and calls into the scheduler.
- The scheduler swaps out the current stack, which is executing the submission syscall, and swaps in the stack that calls getpid(). getpid() doesn’t block so the handler runs to completion. The syscall handler returns to a function that takes the return code from the syscall and puts it in a completion event which is queued. It then calls into the scheduler.
- The scheduler finds our stack which executes the O_DIRECT read, swaps it in, and executes it. The read prepares and issues the IO and eventually comes to wait for it by entering the scheduler.
- The scheduler finds that the submitting stack is still runnable and swaps it in. It returns to userspace and userspace calls back into asys_await_completion(). It first returns the waiting completion from getpid(). The completion call then waits for more completion events to arrive by calling into the scheduler. Our IO is still in flight so we don’t have anymore runnable stacks in the task. The task is put to sleep.
- Eventually our IO completes. It marks the stack executing the read as runnable and wakes the task. The task notices the runnable read stack and swaps it in. The read system call handler now returns to that helper which takes the return code from the read and queues it in a completion event. In so doing it marks the stack in the completion call as runnable. It calls the scheduler.
- The scheduler swaps out the read stack and swaps in the completion gathering stack. It finds the waiting event and returns it to userspace.
- Fin.
This is accomplished with a set of patches whose diffstat looks like this:
$ diffstat -p1 patches/*.patch
arch/i386/kernel/asm-offsets.c | 4
arch/i386/kernel/syscall_table.S | 2
fs/direct-io.c | 18 +-
include/asm-i386/system.h | 36 +++++
include/asm-i386/unistd.h | 2
include/linux/asys.h | 2
include/linux/hrtimer.h | 2
include/linux/init_task.h | 4
include/linux/sched.h | 29 ++++
include/linux/wait.h | 15 ++
kernel/Makefile | 2
kernel/asys.c | 254 +++++++++++++++++++++++++++++++++++++++
kernel/exit.c | 7 +
kernel/fork.c | 7 +
kernel/hrtimer.c | 56 +++++---
kernel/rtmutex.c | 2
kernel/sched.c | 203 +++++++++++++++++++++++++++++++
kernel/wait.c | 41 ++++++
lib/rwsem-spinlock.c | 2
lib/rwsem.c | 50 ++++---
20 files changed, 681 insertions(+), 57 deletions(-)
I have to say, I’m finding this stuff pretty exciting!
The next step is to polish these very rough patches into something presentable. Then they’ll go off to linux-kernel to kick-start debate. There are a huge number of details and trade-offs to discuss before this idea can be seriously implemented.