Passing file descriptors over Unix domain sockets
So you can send file descriptors over
AF_UNIX sockets, easy, right? It's done via the sendmsg/
recvmsg calls, using the ancillary data mechanism. OK, the API
is a bit cryptic when it comes to the ancillary message data buffer
(CMSG_ALIGN, what?), but fairly usable.
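For concreteness, here is roughly what the sending side looks like in C. This is a sketch with minimal error handling; send_fd is my own helper name, not a standard API, and it attaches the descriptor to a single byte of regular data.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Sketch: send one file descriptor over a connected AF_UNIX socket,
 * attached to a single byte of regular data. send_fd is a made-up
 * helper name; error handling is minimal. */
static int send_fd(int sock, int fd)
{
    char byte = 0;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

    union {                               /* properly aligned cmsg buffer */
        struct cmsghdr hdr;
        char buf[CMSG_SPACE(sizeof(int))];
    } u;
    memset(&u, 0, sizeof(u));

    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;         /* "I am passing file descriptors" */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}
```

The union trick is the usual way to get a correctly aligned control buffer without invoking CMSG_ALIGN by hand.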
But consider a
SOCK_STREAM socket. When am I going to receive the ancillary
data? Is it logically a separate stream? What happens on partial reads? What
happens on partial sends? Did I write my socket code correctly? POSIX is
silent, Linux manpages likewise. Let's see what happens on Linux and FreeBSD.
If you want to skip the Linux/FreeBSD analysis and just learn how to use the mechanism, skip ahead to the Generalization and How to use it? sections.
Linux
Even though we have a stream socket, Linux internally buffers the data as
individual packets (in
struct sk_buff, "skb" for short) and the "stream illusion" is maintained by the
recv family of calls. Each
skb has an associated list of file objects
sent with the
skb. Precisely, it works like this:
When you send data to the socket,
unix_stream_sendmsg gets called and
allocates (possibly many)
skbs to hold your data (or at least a part of it). It
converts the file descriptors in your ancillary data into file objects and
attaches the list of file objects to the
cb section of the first skb. If
it fails to attach the file objects (for example, there is a limit on in-flight
file descriptors), it does not send anything at all and returns an error.
So you can be sure that if any data were sent, your file descriptors were sent too.
On the receiving side,
unix_stream_recvmsg calls a helper routine,
unix_stream_read_generic. This routine has its queue of
skb packets and
processes them sequentially to fulfill your request. It may consume only part of
the packet at hand if there is not enough space left in the buffer, hence the
"stream illusion". If it finds an
skb with ancillary data attached, it
detaches the ancillary data and hands you the file descriptors. It will not
continue processing any next
skb, but you will get the data from the skb with
the ancillary data, or at least part of it if your buffer was not big enough.
So to recap, from the userspace side it looks like this:
- The ancillary data seem to be attached to the first byte of your message; you will get them when reading that byte.
- The kernel will make sure that you will not read two different sets of ancillary
data in a single
recv call, by causing partial reads, so there is no confusion there.
- You can't be really sure where these partial reads happen; this depends on the segmentation on the sending side.
- With a
SOCK_DGRAM socket, your ancillary data would simply be attached to the packet as a whole.
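On the receiving side, the recap translates into something like the following C sketch. Again, recv_with_fd is my own helper name; it only extracts a single descriptor and does minimal error handling.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Sketch: receive up to len bytes; if ancillary data arrived with them,
 * extract one file descriptor into *fd_out (left at -1 otherwise).
 * recv_with_fd is a made-up helper name. */
static ssize_t recv_with_fd(int sock, void *buf, size_t len, int *fd_out)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    union {
        struct cmsghdr hdr;
        char cbuf[CMSG_SPACE(sizeof(int))];
    } u;

    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.cbuf;
    msg.msg_controllen = sizeof(u.cbuf);

    *fd_out = -1;
    ssize_t n = recvmsg(sock, &msg, 0);
    if (n <= 0)
        return n;

    /* Walk the control messages; there may be none. */
    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c))
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS)
            memcpy(fd_out, CMSG_DATA(c), sizeof(int));
    return n;
}
```

Note that you must supply the control buffer on every call, since you cannot know in advance which recvmsg will carry the descriptor.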
Note: The information was obtained by reading the 4.14 kernel sources.
Note: File objects are the ‘things’ that file descriptors point to: sockets, files, DMA-BUFs etc.
FreeBSD
The FreeBSD behavior is similar to the Linux one. Instead of skbs,
struct mbuf is used. The packet consists of an optional address (not used for
stream unix sockets), one or more control
mbufs followed by one or more data
mbufs. The control
mbuf is used to store the ancillary data.
The sending side,
uipc_send, converts all file descriptors in the ancillary
data (stored in the control
mbuf) to file object pointers (unp_internalize)
and then simply appends all the
mbufs, including the control one, to the
receive buffer of the other end of the socket. Similar checks to Linux are done here.
The receiving side,
soreceive_generic, looks into the receive queue and first
tries to process the control
mbuf, converting the file objects back
to file descriptors (
unp_externalize). After that it processes all regular
data packets; the last one may be cut and left in the queue if there is not
enough space in your buffer.
The net result is similar to Linux. There is a slight difference, though. On
Linux, a recv can return data that come before the packet with the ancillary data
it has returned. On FreeBSD, this is the other way around: you may get data that
followed your ancillary data packet. FreeBSD actually has no way to tell you that
the data were not part of the ancillary data packet.
Note: The information was obtained from
sys/kern/uipc_usrreq.c in FreeBSD
127a24835dd6. It was a quick glance and my explanation of the
mechanism might be wrong.
Note: OOB data and other details were left out.
Generalization
So far, we have looked at two systems, and that is two more than we need for our generalization :). So my guess is you should be safe if you suppose your OS works the following way:
- The ancillary data are logically associated with a range of data in the stream.
- If
sendmsg sends at least one byte, it has also sent your ancillary data and associated it with the regular data it has sent (if it were not so, you couldn't use the API reliably).
- You are guaranteed that the ancillary data will be read by exactly one
recvmsg call, but it might be any
recvmsg that returns data from the range to which the data are attached (for Linux and FreeBSD this is the first such call).
- A single
recvmsg call will never return two sets of ancillary data (from two
sendmsg calls); this will be enforced by partial reads.
- Ordering of the ancillary data will be preserved.
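A quick way to convince yourself of the "one set of ancillary data per recvmsg" guarantee is to send two one-byte messages, each carrying a descriptor, and then try to read everything in one go. Under the behavior described above, the first recvmsg should stop early. The helper names below are made up for this sketch.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send one byte with one fd attached (made-up helper name). */
static int send_byte_with_fd(int sock, char byte, int fd)
{
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    memset(&u, 0, sizeof(u));
    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);
    struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_RIGHTS;
    c->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(c), &fd, sizeof(int));
    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive into buf, extracting at most one fd (made-up helper name). */
static ssize_t recv_some(int sock, char *buf, size_t len, int *fd_out)
{
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    union { struct cmsghdr hdr; char cbuf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.cbuf;
    msg.msg_controllen = sizeof(u.cbuf);
    *fd_out = -1;
    ssize_t n = recvmsg(sock, &msg, 0);
    for (struct cmsghdr *c = n > 0 ? CMSG_FIRSTHDR(&msg) : NULL; c;
         c = CMSG_NXTHDR(&msg, c))
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS)
            memcpy(fd_out, CMSG_DATA(c), sizeof(int));
    return n;
}

/* Two sendmsgs with one fd each; a big recv should still return them
 * one byte at a time, thanks to the enforced partial read. */
static int demo_partial_reads(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (send_byte_with_fd(sv[0], 'A', sv[0]) != 0 ||
        send_byte_with_fd(sv[0], 'B', sv[0]) != 0)
        return -1;

    char buf[64];
    int fd1, fd2;
    ssize_t n1 = recv_some(sv[1], buf, sizeof(buf), &fd1); /* 'A' + fd */
    ssize_t n2 = recv_some(sv[1], buf, sizeof(buf), &fd2); /* 'B' + fd */
    if (fd1 >= 0) close(fd1);
    if (fd2 >= 0) close(fd2);
    close(sv[0]);
    close(sv[1]);
    return (n1 == 1 && n2 == 1 && fd1 >= 0 && fd2 >= 0) ? 0 : -1;
}
```

On both Linux and FreeBSD the first recv_some should return a single byte even though two bytes are queued, because returning both would mean merging two ancillary data sets.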
How to use it?
Well, that depends on how you are doing your networking.
If you have clearly defined
packets in your communication protocol and your
recvmsg calls do not span packet boundaries, you can clearly associate the
received file descriptors with the packet. You have to be ready to receive a
file descriptor only if you expect one in that packet and you have not received
it so far.
If you are doing a "buffered" recv, in the sense that you receive as much data as possible into some buffer and work out the message structure later, it should be enough to always be prepared to receive a file descriptor and to keep a queue of these descriptors parallel to the received data buffer.
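The "buffered recv plus a parallel FD queue" idea might look something like this. All names here are mine, the queue is a fixed-size array for brevity, and each queued descriptor remembers the buffer offset it arrived at.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define MAX_FDS 16

/* Sketch of a receive buffer with a parallel queue of received
 * descriptors. All names are made up for this example. */
struct rx_buffer {
    char   data[4096];
    size_t used;
    struct { size_t offset; int fd; } fds[MAX_FDS];
    size_t nfds;
};

/* One buffered read: append whatever recvmsg returns to rb->data and,
 * if a descriptor rode along, queue it at the current offset. */
static ssize_t rx_fill(int sock, struct rx_buffer *rb)
{
    struct iovec iov = {
        .iov_base = rb->data + rb->used,
        .iov_len = sizeof(rb->data) - rb->used,
    };
    union { struct cmsghdr hdr; char cbuf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { 0 };
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.cbuf;
    msg.msg_controllen = sizeof(u.cbuf);

    ssize_t n = recvmsg(sock, &msg, 0);
    if (n <= 0)
        return n;

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c;
         c = CMSG_NXTHDR(&msg, c))
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS &&
            rb->nfds < MAX_FDS) {
            rb->fds[rb->nfds].offset = rb->used; /* fd "belongs" here */
            memcpy(&rb->fds[rb->nfds].fd, CMSG_DATA(c), sizeof(int));
            rb->nfds++;
        }
    rb->used += (size_t)n;
    return n;
}
```

When the parser later consumes bytes from the front of the buffer, it can pop any descriptor whose offset falls inside the consumed range.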
On the sending side, make sure that the file descriptor ends up associated with the data range where you expect it. If you are not doing any buffering, this comes automatically.
If you have buffering, you might cheat by creating a queue of file descriptors
parallel to your send buffer and then eagerly associating the FDs with byte-sized
sendmsgs (but don't queue more FDs than you have bytes of data!). A nicer option might be to
remember the data ranges along with the FD queue and then split the data into
sendmsg calls according to that. Just packaging a FD with each sendmsg
might be a bad idea; you might end up sending the FDs late.
Note: You can have multiple FDs in one ancillary data set. Seems like trouble to me.
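The "remember the data ranges" option could be sketched like this: keep the FD queue with byte offsets into the send buffer, and cut each sendmsg at the next FD boundary so every descriptor goes out attached to the first byte of its range. The names are made up, the flush is blocking, and distinct offsets are assumed.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/uio.h>

#define TX_MAX_FDS 16

/* Sketch: a send buffer plus a queue of fds, each pinned to a byte
 * offset in the buffer. All names are made up for this example. */
struct tx_buffer {
    const char *data;
    size_t len;
    struct { size_t offset; int fd; } fds[TX_MAX_FDS];
    size_t nfds;  /* offsets assumed distinct, ascending, < len */
};

/* Flush the whole buffer, splitting sendmsg calls at fd boundaries so
 * every fd is attached to the first byte of its range. */
static int tx_flush(int sock, const struct tx_buffer *tx)
{
    size_t pos = 0, i = 0;                 /* i indexes the fd queue */
    while (pos < tx->len) {
        int fd = -1;
        size_t end = tx->len;
        if (i < tx->nfds && tx->fds[i].offset == pos) {
            fd = tx->fds[i].fd;            /* first byte of its range */
            if (i + 1 < tx->nfds)
                end = tx->fds[i + 1].offset;
            i++;
        } else if (i < tx->nfds) {
            end = tx->fds[i].offset;       /* plain data up to next fd */
        }

        struct iovec iov = {
            .iov_base = (void *)(tx->data + pos),
            .iov_len = end - pos,
        };
        union { struct cmsghdr hdr; char cbuf[CMSG_SPACE(sizeof(int))]; } u;
        memset(&u, 0, sizeof(u));
        struct msghdr msg = { 0 };
        msg.msg_iov = &iov;
        msg.msg_iovlen = 1;
        if (fd >= 0) {
            msg.msg_control = u.cbuf;
            msg.msg_controllen = sizeof(u.cbuf);
            struct cmsghdr *c = CMSG_FIRSTHDR(&msg);
            c->cmsg_level = SOL_SOCKET;
            c->cmsg_type = SCM_RIGHTS;
            c->cmsg_len = CMSG_LEN(sizeof(int));
            memcpy(CMSG_DATA(c), &fd, sizeof(int));
        }

        ssize_t n = sendmsg(sock, &msg, 0);
        if (n <= 0)
            return -1;
        pos += (size_t)n;
        /* A short send inside an fd-carrying chunk is fine: the fd went
         * out with the first byte; we just keep sending plain data. */
    }
    return 0;
}
```

Each fd is sent exactly when its range starts, so it can never be sent late, and a short send only delays data, never descriptors.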