Passing file descriptors over Unix domain sockets
So you can send file descriptors over `AF_UNIX` sockets, easy, right? It's done by `sendmsg`/`recvmsg` calls, using the ancillary data mechanism. OK, the API is a bit cryptic when it comes to the ancillary message data buffer (`CMSG_ALIGN` what?), but it is fairly usable.
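For reference, the basic send side looks roughly like the following minimal sketch (error handling is trimmed, and `send_fd` is a made-up helper name):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Send one byte of regular data plus the descriptor 'fd' over 'sock'
 * as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd, char byte)
{
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };

    union {
        /* CMSG_SPACE() gives the aligned size for one int payload */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;   /* forces proper alignment */
    } u;
    memset(&u, 0, sizeof(u));

    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = u.buf,
        .msg_controllen = sizeof(u.buf),
    };

    struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type  = SCM_RIGHTS;
    cmsg->cmsg_len   = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}
```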
But consider a `SOCK_STREAM` socket. When am I going to receive the ancillary data? Is it logically a separate stream? What happens on partial reads? What happens on partial sends? Did I write my socket code correctly? POSIX is silent, Linux manpages likewise. Let's see what happens on Linux and FreeBSD.
If you want to skip the Linux/FreeBSD analysis and just learn how to use the mechanism, skip to the Generalization and How to use it? sections.
Linux behavior
Even though we have a stream socket, Linux internally buffers the data as individual packets (in `struct sk_buff`, abbreviated `skb`) and the “stream illusion” is maintained by the `recv` family of calls. Each `skb` has an associated list of the file objects sent with it. Precisely, it works like this:
When you send data to the socket, `unix_stream_sendmsg` gets called and allocates one or more `skb`s to hold your data (or at least a part of it). It converts the file descriptors in your ancillary data into file objects and attaches the list of file objects to the `cb` section of the first `skb`. If it fails to attach the file objects (for example, there is a limit on in-flight file descriptors), it does not send anything at all and returns `ETOOMANYREFS`. So you can be sure that if any data were sent, your file descriptors were sent with them.
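A sketch of what that guarantee means for a userspace sender (assuming a `msghdr` set up as in the `send_fd` sketch above; the retry handling is my interpretation of the behavior, not a documented contract):

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Sketch: interpreting a short send. On Linux, any progress means the
 * SCM_RIGHTS payload already went out with the first byte, so a retry
 * of the remaining bytes must not resend the control data. */
static ssize_t send_with_fds(int sock, struct msghdr *msg)
{
    ssize_t n = sendmsg(sock, msg, 0);
    if (n < 0 && errno == ETOOMANYREFS)
        return -1;  /* nothing was sent; the descriptors are still ours */
    if (n > 0) {
        msg->msg_control = NULL;    /* fds delivered; strip them from */
        msg->msg_controllen = 0;    /* any retry of the rest */
    }
    return n;
}
```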
On the receiving side, `unix_stream_recvmsg` calls a helper routine, `unix_stream_read_generic`. This routine has its queue of `skb` packets and processes them sequentially to fulfill your request. It may consume only part of the packet at hand if there is not enough space left in the buffer, hence the “stream illusion”. If it finds an `skb` with ancillary data attached, it detaches the ancillary data and hands you the file descriptors. It will not continue processing any further `skb`, but you will get the data from the `skb` with the ancillary data, or at least part of it if your buffer was not big enough.
So to recap, from the userspace side it looks like this:
- The ancillary data seem to be attached to the first byte of your message; you will get them when reading that byte.
- The kernel makes sure that you will not read two different sets of ancillary data in a single `recv` call, by causing partial reads, so there is no confusion there.
- You can't really be sure where these partial reads happen; this depends on the segmentation on the sending side.
- For `SOCK_SEQPACKET` or `SOCK_DGRAM`, your ancillary data would simply be attached to the packet as a whole.
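Put together, a receive call matching this behavior might look like the following sketch (`recv_maybe_fd` is a made-up name, and it assumes at most one descriptor per message, as in the `send_fd` sketch above):

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

/* Sketch: receive data and opportunistically pick up one descriptor.
 * Returns the byte count from recvmsg(); *fd is set to the received
 * descriptor, or -1 if this read carried none. A truncated control
 * buffer (MSG_CTRUNC) is not handled here. */
static ssize_t recv_maybe_fd(int sock, void *buf, size_t len, int *fd)
{
    union {
        char ctrl[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } u;
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    struct msghdr msg = {
        .msg_iov        = &iov,
        .msg_iovlen     = 1,
        .msg_control    = u.ctrl,
        .msg_controllen = sizeof(u.ctrl),
    };

    *fd = -1;
    ssize_t n = recvmsg(sock, &msg, 0);
    if (n <= 0)
        return n;

    for (struct cmsghdr *c = CMSG_FIRSTHDR(&msg); c != NULL;
         c = CMSG_NXTHDR(&msg, c))
        if (c->cmsg_level == SOL_SOCKET && c->cmsg_type == SCM_RIGHTS)
            memcpy(fd, CMSG_DATA(c), sizeof(int));
    return n;
}
```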
Note: The information was obtained by reading the 4.14 kernel; check `net/unix/`.
Note: File objects are the ’things’ that file descriptors point to: sockets, files, DMA-BUFs etc.
FreeBSD behavior
The FreeBSD behavior is similar to the Linux one. Instead of `struct sk_buff`, `struct mbuf` is used. A packet consists of an optional address (not used for stream Unix sockets) and one or more control `mbuf`s followed by one or more data `mbuf`s. The control `mbuf`s are used to store the ancillary data.
The sending side, `uipc_send`, converts all file descriptors in the ancillary data (stored in the control `mbuf`) to file object pointers (`unp_internalize`) and then simply appends all the `mbuf`s, including the control one, to the receive buffer of the other end of the socket. Checks similar to Linux's are done here.
The receiving side, `soreceive_generic`, looks into the receive queue and first tries to process the control `mbuf`, converting the file objects back to file descriptors (`unp_externalize`). After that it processes the regular data packets; the last one may be cut and left in the queue if it does not all fit into your buffer.
The net result is similar to Linux. There is a slight difference, though. On Linux, a `recv` that returns ancillary data may also return data that come before the packet carrying that ancillary data. On FreeBSD, it is the other way around: you may get data that followed your ancillary data packet. FreeBSD actually has no way to tell that the data were not part of the ancillary data packet.
Note: The information was obtained from `sys/kern/uipc_usrreq.c` in FreeBSD git 127a24835dd6. It was a quick glance and my explanations of the `mbuf` mechanism might be wrong.
Note: OOB data and other details were left out.
Generalization
So far, we have looked at two systems, and that is two more than we need for our generalization :). So my guess is that you should be safe if you assume your OS works the following way:
- The ancillary data are logically associated with a range of data in the stream.
- If `sendmsg` sends at least one byte, it has also sent your ancillary data and associated them with the regular data it has sent (if it were not so, you couldn't use the API reliably).
- You are guaranteed that the ancillary data will be read by exactly one `recvmsg` call, but it might be any `recvmsg` that returns data from the range to which the ancillary data are attached (for Linux and FreeBSD, this is the first such call).
- A single `recvmsg` call will never return two sets of ancillary data (from two `sendmsg` calls); this is enforced by partial reads.
- The ordering of the ancillary data will be preserved.
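In other words, a portable reader can only assume a descriptor belongs somewhere inside the byte range returned by the `recvmsg` that delivered it. A sketch of tracking that, using the hypothetical `recv_maybe_fd` helper from above:

```c
#include <stdio.h>

/* Sketch: remember the stream offsets each descriptor arrived with.
 * Portably, the fd belongs somewhere in [off, off + n); on Linux and
 * FreeBSD it is attached to the first byte the sender paired it with,
 * but the generalized contract only gives you the range. */
static void read_stream(int sock)
{
    char buf[4096];
    size_t off = 0;
    ssize_t n;
    int fd;

    while ((n = recv_maybe_fd(sock, buf, sizeof(buf), &fd)) > 0) {
        if (fd >= 0)
            printf("fd %d arrived with bytes [%zu, %zu)\n",
                   fd, off, off + (size_t)n);
        off += (size_t)n;
    }
}
```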
How to use it?
Well, it depends on how you are doing your networking.
If you have clearly defined packets in your communication protocol and your `recvmsg` calls do not span packet boundaries, you can clearly associate the received file descriptors with the packet. Be ready to receive a file descriptor only if you expect one in that packet and you have not received it so far.
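For illustration, a sketch under an invented wire format (the header layout and all names here are assumptions, and `recv_maybe_fd` is the hypothetical helper from earlier):

```c
#include <stdint.h>
#include <sys/types.h>

/* Hypothetical fixed header for a protocol with clear packet
 * boundaries: at most one descriptor per packet, announced up front. */
struct pkt_hdr {
    uint32_t len;    /* payload bytes that follow the header */
    uint8_t  has_fd; /* nonzero if one SCM_RIGHTS fd rides along */
};

/* Sketch: read exactly 'len' bytes without crossing the packet
 * boundary, accepting a descriptor on any of the reads, but only if
 * the header announced one and it has not arrived yet. */
static int read_packet_body(int sock, char *p, size_t len,
                            int expect_fd, int *fd)
{
    *fd = -1;
    while (len > 0) {
        int got = -1;
        ssize_t n = recv_maybe_fd(sock, p, len, &got);
        if (n <= 0)
            return -1;
        if (got >= 0) {
            if (!expect_fd || *fd >= 0)
                return -1;   /* protocol violation: unexpected fd */
            *fd = got;
        }
        p += n;
        len -= (size_t)n;
    }
    return 0;
}
```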
If you are doing a “buffered” `recv`, in the sense that you receive as much data as you can into some buffer and work out the message structure later, it should be enough to always be prepared to receive a file descriptor and to keep a queue of these descriptors parallel to the received data buffer.
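A sketch of that approach (all names and sizes are invented; the queue keeps descriptors in arrival order so the parser can pop them as it reaches the messages that own them):

```c
#include <sys/types.h>

/* Sketch: buffered reading with a parallel descriptor queue.
 * Overflow handling for the ring buffer is omitted. */
struct fd_queue {
    int    fds[16];
    size_t head, tail;   /* ring buffer indices */
};

static ssize_t buffered_read(int sock, char *buf, size_t cap,
                             size_t *used, struct fd_queue *q)
{
    int fd = -1;
    ssize_t n = recv_maybe_fd(sock, buf + *used, cap - *used, &fd);

    if (n > 0) {
        *used += (size_t)n;
        if (fd >= 0) {
            /* always be ready for an fd; queue it alongside the data */
            q->fds[q->tail % 16] = fd;
            q->tail++;
        }
    }
    return n;
}
```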
On the sending side, make sure that the file descriptor ends up associated with the data range where you expect it. If you are not doing any buffering, this comes automatically.
If you have buffering, you might cheat by creating a queue of file descriptors parallel to your send buffer and then eagerly associating the FDs with byte-sized `sendmsg`s (but don't queue more FDs than data bytes!). A nicer option might be to remember the data ranges along with the FD queue and then split the data into `sendmsg` calls according to that. Just packaging an FD with each `sendmsg` might be a bad idea: you might end up sending the FDs late.
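A sketch of the “cheat” variant, reusing the `send_fd` helper from the first sketch (it assumes there are at least as many buffered bytes as queued descriptors):

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Sketch: flush the buffer, eagerly pinning each queued descriptor to
 * a single byte of real data, then send the rest plain. Returns the
 * number of bytes flushed, or -1 on error. */
static ssize_t flush_buffer(int sock, const char *buf, size_t len,
                            const int *fds, size_t nfds)
{
    size_t off = 0;

    while (nfds > 0 && off < len) {
        /* one byte of payload carries one descriptor */
        if (send_fd(sock, fds[0], buf[off]) < 0)
            return -1;
        fds++; nfds--; off++;
    }

    ssize_t n = send(sock, buf + off, len - off, 0);
    return n < 0 ? -1 : (ssize_t)off + n;
}
```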
Note: You can have multiple FDs in one ancillary data set. Seems like trouble to me.