Passing file descriptors over Unix domain sockets

So you can send file descriptors over AF_UNIX sockets, easy right? It is done with the sendmsg/recvmsg calls, using the ancillary data mechanism. OK, the API is a bit cryptic when it comes to the ancillary message data buffer (CMSG_ALIGN what?), but fairly usable.
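To make the CMSG machinery concrete, here is a minimal sketch of the sending side, assuming an already-connected AF_UNIX stream socket. The function name send_fd is made up for this example and error handling is reduced to the bare minimum.

    /* Send one file descriptor (fd) over a connected AF_UNIX stream socket
     * (sock), riding on a single data byte. */
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    int send_fd(int sock, int fd)
    {
        char byte = 0;        /* the descriptor has to ride on at least one data byte */
        struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

        union {               /* the union guarantees proper alignment of the control buffer */
            struct cmsghdr align;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;

        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = u.buf,
            .msg_controllen = sizeof(u.buf),
        };

        struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
        cmsg->cmsg_level = SOL_SOCKET;
        cmsg->cmsg_type = SCM_RIGHTS;
        cmsg->cmsg_len = CMSG_LEN(sizeof(int));
        memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

        return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
    }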

But consider a SOCK_STREAM socket. When am I going to receive the ancillary data? Is it logically a separate stream? What happens on partial reads? What happens on partial sends? Did I write my socket code correctly? POSIX is silent, the Linux man pages likewise. Let’s see what happens on Linux and FreeBSD.

If you want to skip the Linux/FreeBSD analysis and just learn how to use the mechanism, skip to the Generalization and How to use it? sections.

Linux behavior

Even though we have a stream socket, Linux internally buffers the data as individual packets (in struct sk_buff, commonly called skb) and the “stream illusion” is maintained by the recv family of calls. Each skb has an associated list of file objects sent with it. Precisely, it works like this:

When you send data to the socket, unix_stream_sendmsg gets called and allocates one or more skbs to hold your data (or at least a part of it). It converts the file descriptors in your ancillary data into file objects and attaches the list of file objects to the cb section of the first skb. If it fails to attach the file objects (for example, there is a limit on in-flight file descriptors), it does not send anything at all and returns ETOOMANYREFS. So you can be sure that if any data were sent, your file descriptors were sent with them.

On the receiving side, unix_stream_recvmsg calls a helper routine, unix_stream_read_generic. This routine walks the socket’s queue of skbs and processes them sequentially to fulfill your request. It may consume only part of the skb at hand if there is not enough space left in your buffer, hence the “stream illusion”. If it finds an skb with ancillary data attached, it detaches the ancillary data and hands you the file descriptors. It will not continue to the next skb after that, but you do get the data from the skb carrying the ancillary data, or at least part of it if your buffer was not big enough.

So to recap, from the userspace side it looks like this:

  • The ancillary data seem to be attached to the first byte of your message; you will get them when reading that byte (see the receive sketch after this list).
  • The kernel makes sure that you will not read two different sets of ancillary data in a single recv call, by forcing partial reads, so there is no confusion there.
  • You can’t really be sure where these partial reads happen; that depends on the segmentation on the sending side.
  • For SOCK_SEQPACKET or SOCK_DGRAM, your ancillary data are simply attached to the packet as a whole.
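On the receiving side this means every recvmsg on such a stream should pass a control buffer and check it, because you cannot know in advance which read will carry the descriptors. Below is a minimal sketch; recv_some is a made-up helper that reads into buf and, if a descriptor arrived with this chunk of data, stores it into *fd_out (-1 otherwise). It handles only a single descriptor per message and ignores MSG_CTRUNC, which real code should check.

    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    ssize_t recv_some(int sock, void *buf, size_t len, int *fd_out)
    {
        struct iovec iov = { .iov_base = buf, .iov_len = len };

        union {                          /* room for exactly one descriptor */
            struct cmsghdr align;
            char buf[CMSG_SPACE(sizeof(int))];
        } u;

        struct msghdr msg = {
            .msg_iov = &iov,
            .msg_iovlen = 1,
            .msg_control = u.buf,
            .msg_controllen = sizeof(u.buf),
        };

        *fd_out = -1;
        ssize_t n = recvmsg(sock, &msg, 0);
        if (n <= 0)
            return n;                    /* EOF or error, no ancillary data */

        for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c; c = CMSG_NXTHDR(&msg, c))
            if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS)
                memcpy(fd_out, CMSG_DATA(c), sizeof(int));

        return n;                        /* number of regular data bytes read */
    }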

Note: The information was obtained by reading the 4.14 kernel sources; check net/unix/.

Note: File objects are the “things” that file descriptors point to: sockets, files, DMA-BUFs, etc.

FreeBSD behavior

The FreeBSD behavior is similar to the Linux one. Instead of struct sk_buff, struct mbuf is used. A packet consists of an optional address (not used for stream Unix sockets) and one or more control mbufs followed by one or more data mbufs. The control mbufs are used to store the ancillary data.

The sending side, uipc_send, converts all file descriptors in the ancillary data (stored in the control mbuf) to file object pointers (unp_internalize) and then simply appends all the mbufs, including the control one, to the receive buffer of the other end of the socket. Checks similar to those on Linux are done here.

The receiving side, soreceive_generic, looks into the receive queue and first tries to process the control mbufs, converting the file objects back to file descriptors (unp_externalize). After that it processes the regular data mbufs; the last one may be cut and left in the queue if it does not fit into your buffer.

The net result is similar to Linux. There is a slight difference, though. On Linux, a recv can also return data that came before the packet with the ancillary data it hands you. On FreeBSD, it is the other way around: you may get data that followed your ancillary data packet. FreeBSD actually has no way to tell that the data were not part of the ancillary data packet.

Note: The information was obtained from sys/kern/uipc_usrreq.c in FreeBSD git 127a24835dd6. It was a quick glance and my explanation of the mbuf mechanism might be wrong.

Note: OOB data and other details were left out.

Generalization

So far, we have looked at two systems, and that is two more than we need for our generalization :). So my guess is that you should be safe if you assume your OS works the following way:

  • The ancillary data are logically associated with a range of data in the stream.
  • If the sendmsg call sends at least one byte, it has also sent your ancillary data and associated them with the regular data it has sent (if it were not so, you couldn’t use the API reliably).
  • You are guaranteed that the ancillary data will be read by exactly one recvmsg call, but it might be any recvmsg that returns data from the range to which they are attached (on Linux and FreeBSD it is the first such call).
  • A single recvmsg call will never return two sets of ancillary data (from two sendmsg calls); this is enforced by partial reads.
  • Ordering of the ancillary data is preserved.

How to use it?

Well, it depends on how you are doing your networking.

If you have clearly defined packets in your communication protocol and your recvmsg calls do not span packet boundaries, you can unambiguously associate the received file descriptors with the packet. You only have to be ready to receive a file descriptor if you expect one in that packet and have not received it so far.
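As a rough sketch of that approach, assuming fixed-size messages and reusing the hypothetical recv_some helper from the Linux section above (MSG_SIZE and read_message are made up for this example):

    #define MSG_SIZE 64        /* hypothetical fixed message size */

    /* Read one whole message without spanning its boundary, collecting the
     * (at most one) descriptor that may arrive with any of the partial reads.
     * Returns 0 on success; *pkt_fd is the received descriptor, or -1. */
    int read_message(int sock, char pkt[MSG_SIZE], int *pkt_fd)
    {
        size_t got = 0;
        *pkt_fd = -1;

        while (got < MSG_SIZE) {
            int fd;
            ssize_t n = recv_some(sock, pkt + got, MSG_SIZE - got, &fd);
            if (n <= 0)
                return -1;     /* EOF or error in the middle of a message */
            if (fd != -1)
                *pkt_fd = fd;  /* at most one descriptor expected per message */
            got += n;
        }
        return 0;
    }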

If you are doing “buffered” recv, in the sense that you receive as much data as you can into some buffer and work out the structure later, it should be enough to just always be prepared to receive a file descriptor and keep a queue of these descriptors parallel to the received data buffer.
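Something like the following loop, again reusing the recv_some helper from above; sock, inbuf, fd_queue and the append_data/push_fd routines are hypothetical application-side pieces:

    for (;;) {
        char chunk[4096];
        int fd;
        ssize_t n = recv_some(sock, chunk, sizeof(chunk), &fd);
        if (n <= 0)
            break;                              /* EOF or error */
        append_data(&inbuf, chunk, (size_t)n);  /* hypothetical: grow the data buffer */
        if (fd != -1)
            push_fd(&fd_queue, fd);             /* hypothetical: parallel FD queue */
    }

The protocol layer later pops a descriptor from the queue whenever the parsed data say one should be there.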

On the sending side, make sure that the file descriptor ends up associated with the data range where you expect it. If you are not doing any buffering, this comes automatically.

If you have buffering, you might cheat by creating a queue of file descriptors parallel to your send buffer and then eagerly associating the FDs with byte-sized sendmsgs (but don’t queue more FDs than data!). A nicer option might be to remember the data ranges along with the FD queue and then split the data into sendmsg calls according to that. Just packaging an FD with each sendmsg might be a bad idea; you might end up sending the FDs late.
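A rough sketch of that nicer option, with a queue of (offset, FD) pairs kept parallel to the outgoing buffer. All the buffer and queue helpers (buf_len, buf_data, buf_consume, fdq_*) and the send_chunk wrapper (a sendmsg wrapper along the lines of send_fd above, taking a length and an optional descriptor) are hypothetical:

    size_t off = 0;                  /* absolute stream offset of outbuf[0] */

    while (buf_len(&outbuf) > 0) {
        size_t chunk = buf_len(&outbuf);
        int fd = -1;

        /* attach a descriptor if one is due exactly at this offset */
        if (!fdq_empty(&fd_queue) && fdq_front_offset(&fd_queue) == off)
            fd = fdq_pop(&fd_queue);

        /* never send past the point where the next descriptor must start */
        if (!fdq_empty(&fd_queue)) {
            size_t next = fdq_front_offset(&fd_queue) - off;
            if (next < chunk)
                chunk = next;
        }

        ssize_t n = send_chunk(sock, buf_data(&outbuf), chunk, fd);
        if (n <= 0)
            break;
        /* even on a partial send the descriptor went out with the first byte */
        buf_consume(&outbuf, (size_t)n);
        off += (size_t)n;
    }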

Note: You can have multiple FDs in one ancillary data set. Seems like trouble to me.