RDS socket API

Part of my job at Oracle has involved working on a project called RDS. Over the past few months I’ve found myself failing to explain it clearly to friends who have asked what exactly this is. For their edification, if nothing else, I thought I’d take a few minutes to describe the project in more detail.

We can set the stage by laying out the basic properties of a certain kind of Oracle deployment that one often finds at customer sites. Imagine a few thousand processes on a handful of nodes. Each process is doing work and sending messages to many other processes. The one-to-many relationship starts to explain why this messaging is currently implemented with UDP, with acknowledgement and retransmission handled in the processes. Using TCP, for example, could mean holding a TCP connection open between each pair of communicating processes. The overhead of doing this adds up surprisingly quickly.
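To put rough numbers on "adds up surprisingly quickly", here is a back-of-the-envelope sketch. The node and process counts are made up for illustration, and it assumes the worst case where every process talks to every process on every other node.

```python
def pairwise_links(endpoints: int) -> int:
    """Number of distinct pairs among `endpoints` (a full mesh)."""
    return endpoints * (endpoints - 1) // 2


# Hypothetical deployment: "a few thousand processes on a handful of nodes".
nodes = 8
procs_per_node = 500
total_procs = nodes * procs_per_node

# Pairs of processes that live on different nodes: all pairs minus the
# pairs that share a node. Each would need its own TCP connection.
cross_node_process_pairs = (
    pairwise_links(total_procs) - nodes * pairwise_links(procs_per_node)
)

# For comparison, the number of distinct node pairs.
node_pairs = pairwise_links(nodes)

print(cross_node_process_pairs)  # 7000000
print(node_pairs)                # 28
```

Real workloads won't hit the full mesh, but the gap between the per-process-pair count and the per-node-pair count is the point.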

The attentive will quickly spot a problem with implementing reliability in the processes. If these processes are performing work that blocks waiting for IO, which they are, the acks that they send could be delayed. This could cause a sending process to spuriously retransmit a message that was in fact received but not acknowledged promptly. “Mmmm hmmm”, I might say to such an attentive person. This problem is seen under heavy load.
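The spurious retransmission can be sketched with a toy model. This is not how the Oracle processes implement their timers; the fixed retransmit interval and the millisecond figures are assumptions chosen only to show the effect.

```python
def spurious_retransmits(ack_delay_ms: int, timeout_ms: int) -> int:
    """Count retransmissions of a message that was in fact received.

    Toy model: the sender retransmits every `timeout_ms` until the ack
    arrives, and the ack arrives `ack_delay_ms` after the original send.
    An ack landing exactly on a timer tick is treated as arriving in time.
    """
    return max(0, (ack_delay_ms - 1) // timeout_ms)


timeout_ms = 200  # hypothetical retransmit interval

# Hypothetical ack delays: two prompt receivers, two stuck in blocking IO.
ack_delays_ms = [5, 50, 450, 201]

wasted = sum(spurious_retransmits(d, timeout_ms) for d in ack_delays_ms)
print(wasted)  # 3
```

Every one of those retransmissions carries data the receiver already has, and under heavy load they only add to the load.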

At some point Infiniband became an attractive potential solution to this problem. One of the things it can do is push reliability constructs into hardware so that the processes need not burden themselves with the task of promptly sending acks. uDAPL was attempted but didn’t work out. I get to avoid having to tell that story because I don’t know it — it was before my time. SDP is a socket API built on top of Infiniband which would take care of reliability, but it has per-process-pair overhead problems like TCP.

This is when RDS started to take shape. It was designed as a socket API which would let processes send messages from one socket to many recipient processes. Reliability is ideally provided by hardware and the cost of doing so should not increase significantly with the number of processes involved in communication. A prototype was written for the 2.4 kernel which implemented RDS on top of Infiniband.

This is when the Oracle messaging people started talking to me about getting involved. They were looking for an implementation for 2.6 that could also support RDS on top of commodity ethernet. As initially described it sounded like they wanted some ethernet level protocol. This explains my earlier blog post about RDS/eth. We hadn’t quite gotten to understanding each other at that point. It eventually became clear that they wanted a 2.6 implementation that supported RDS on top of various transports: “reliable connection queue pairs” for Infiniband and TCP for commodity ethernet.

That, in the end, is what has been built and is now available in a Subversion repository found off of http://oss.oracle.com/projects/rds/. It’s a kernel socket API which maintains connections between nodes and multiplexes messages between processes down those per-node connections. There are lots of interesting (and occasionally surprising) details, but perhaps those are better saved for another post. At least now I hope folks will have a better understanding of what it is I mean when I talk about “that freaking RDS thing.”
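For the shape of the multiplexing idea, here is a toy in-memory model. It is emphatically not the RDS code or its API: the class, method names, and (node, port) addressing are all invented for illustration, and it ignores reliability, ordering, and everything else that makes the real thing hard.

```python
class ToyNodeFabric:
    """Toy model: many sockets share one connection per pair of nodes."""

    def __init__(self):
        self.connections = set()  # one entry per unordered node pair
        self.queues = {}          # (node, port) -> list of (sender, payload)

    def bind(self, node, port):
        """Register a socket address and return it."""
        addr = (node, port)
        self.queues.setdefault(addr, [])
        return addr

    def send(self, src, dst, payload):
        """Deliver payload from src to dst over the shared node connection."""
        if dst not in self.queues:
            raise KeyError(f"no socket bound at {dst}")
        if src[0] != dst[0]:
            # Set up (or reuse) the single connection between the two nodes.
            self.connections.add(frozenset((src[0], dst[0])))
        self.queues[dst].append((src, payload))


fabric = ToyNodeFabric()
a1 = fabric.bind("nodeA", 1)
a2 = fabric.bind("nodeA", 2)
b1 = fabric.bind("nodeB", 1)
b2 = fabric.bind("nodeB", 2)
c1 = fabric.bind("nodeC", 1)

# One socket sends to many recipients; a second socket shares the link.
for dst in (b1, b2, c1):
    fabric.send(a1, dst, b"hello")
fabric.send(a2, b1, b"hi")

print(len(fabric.connections))  # 2
print(len(fabric.queues[b1]))   # 2
```

Four messages between five sockets end up on two node-to-node connections (A to B and A to C), which is the property that keeps the cost from growing with the number of processes.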

Comments (2) to “RDS socket API”

  1. Why did RDS need to be in the kernel? Since you can only bind an RDS socket once per address, it seems there has to be a user-space message concentrator anyway.

    Also, have you seen Tibco’s rudp? (Reliable UDP).

  2. Have you looked at TIPC (tipc.sf.net)? It is similar except that it isn’t IP based (could be) and the reliability is per destination node, not per dest socket.
    It has support for connectionless and connection-oriented sockets, multicast, interesting options on process failure, and a node to node link heartbeat.

    I’d be interested to hear what you think of it.
