Protocol Design: How Many Bytes?
November 25, 2003
The Internet is built on protocols. Protocols take the raw, unstructured capabilities of the network and, using rules and restrictions, determine what programs can communicate and how. Choosing the right rules is important: to a large degree they determine the security, ease of implementation and performance of the protocol. This is the first in a series of articles discussing basic concepts of protocol design. The issue we will start with is how a protocol knows how much data it is going to receive. Protocols are, after all, mostly about sending and receiving data.
Before we begin, it's worth noting some basic assumptions. Unless noted otherwise, the protocols being discussed all run over a connection-oriented transport, typically TCP. There is an initiating side that starts the connection and a receiving side that accepts it. In many cases these will match the concepts of "client" and "server", and will have different behavior depending on which they are. The connection is assumed to transport a stream of bytes in an ordered, reliable fashion.
Many protocols involve sending chunks of "payload" bytes -- data which is not part of the protocol itself. An email is a structured sequence of bytes, so when an email is sent or received, the receiver side of the protocol needs to know when the email data ends and the protocol begins again. An email that contains a transcript of a POP3 session should not be able to confuse a POP3 client that is downloading it. In addition, commands and messages of the protocol itself are also structured, and the receiving side needs to know when they end and the next message begins.
The first approach that can be used is an end-of-data indicator: some special way of marking when the transfer of the data is over. For example, when sending a payload, the sending side will send a message meaning the data will now be sent, then the actual payload, and finally a message saying there is no more data. One of the Internet's oldest protocols, SMTP, uses this technique to allow clients to send emails to the server. SMTP is documented in RFC 2821, an updated version of RFC 821, which was written in 1982. In the SMTP protocol, a client connects to a server and sends a series of commands indicating from whom and to whom the email is being sent, followed by the body of the email. The server then deals with delivery of the message.
SMTP follows (or perhaps, given its age, leads) the convention of "line-based" protocols. An SMTP session is composed of a series of lines: a "line" is a sequence of bytes terminated with CRLF, the bytes with the hex values 0x0D and 0x0A. A line can be a command, a response to a command, or part of a message. Each of these lines recreates in its own small way the end-of-data indicator method for finding the end of a message, in this case CRLF. The basic units of the protocol, the lines, can be any length; the receiving side only learns how long a line is when it sees the CRLF that ends it. As a result, SMTP servers set an arbitrary limit on the length of lines they accept, otherwise a simple connection sending an infinite stream of non-CRLF characters would use up the server's memory. Here is an example of a simple SMTP session between a client and server, taken from the RFC (note that each printed line would be sent with a CRLF after it):
S: 220 foo.com Simple Mail Transfer Service Ready
C: EHLO bar.com
S: 250-foo.com greets bar.com
S: 250-8BITMIME
S: 250-SIZE
S: 250-DSN
S: 250 HELP
C: MAIL FROM:<Smith@bar.com>
S: 250 OK
C: RCPT TO:<Jones@foo.com>
S: 250 OK
C: DATA
S: 354 Start mail input; end with <CRLF>.<CRLF>
C: Blah blah blah...
C: ...etc. etc. etc.
C: .
S: 250 OK
C: QUIT
S: 221 foo.com Service closing transmission channel
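The line-length limit described before the session can be sketched in a few lines of Python. This is only an illustration, not how any particular server does it: the names read_line, LineTooLong and MAX_LINE are made up for this sketch, the cap of 1000 bytes is an assumed policy value, and the stream is assumed to be any binary file-like object (for example the result of socket.makefile("rb")).

```python
import io

MAX_LINE = 1000  # assumed cap for this sketch; a real server picks its own policy


class LineTooLong(Exception):
    """Raised when no CRLF shows up within the allowed number of bytes."""


def read_line(stream, max_line=MAX_LINE):
    """Read one CRLF-terminated line from a binary stream, without the CRLF.

    Returns None on a clean end of stream (no bytes pending).
    """
    buf = bytearray()
    while True:
        byte = stream.read(1)
        if not byte:
            if buf:
                raise EOFError("stream ended in the middle of a line")
            return None
        buf += byte
        if buf.endswith(b"\r\n"):
            return bytes(buf[:-2])
        if len(buf) > max_line:
            # Give up instead of buffering an endless line.
            raise LineTooLong(f"no CRLF within {max_line} bytes")


session = io.BytesIO(b"EHLO bar.com\r\nQUIT\r\n")
print(read_line(session))
print(read_line(session))
```

Reading one byte at a time is slow but keeps the sketch short; the point is that the reader never holds more than max_line bytes plus one, no matter what the other side sends.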
Looking at the session carefully, we'll note two more examples of the end-of-data indicator. There are multiple responses to the EHLO command, with response code 250, and the last response starts with "250 " (the code followed by a space), rather than "250-", to indicate that no more responses are forthcoming. A more interesting use is the "DATA" command, which is used by the client to send the body of the email. The email is sent as a series of lines, and a line with a single "." (a period) indicates the end of the email body.
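As a rough sketch of the first of these indicators, here is how a client might collect a multi-line reply such as the response to EHLO. The function name read_reply is hypothetical, and the input is assumed to be an iterator of already-decoded lines with the CRLF stripped.

```python
def read_reply(lines):
    """Collect one SMTP reply from an iterator of lines (CRLF already stripped).

    Returns (code, [text, ...]).
    """
    texts = []
    for line in lines:
        code, sep, text = line[:3], line[3:4], line[4:]
        texts.append(text)
        if sep != "-":  # "250 " (or a bare "250") marks the last line of the reply
            return int(code), texts
    raise EOFError("reply ended before its final line")


lines = iter([
    "250-foo.com greets bar.com",
    "250-8BITMIME",
    "250-SIZE",
    "250-DSN",
    "250 HELP",
])
print(read_reply(lines))
```

The character right after the response code is the whole end-of-data indicator here: a hyphen means "more to come", anything else means "done".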
On the face of it this is a reasonable approach, but there are some serious issues which have led modern protocols to choose other solutions. Consider what would happen if the email contained a line consisting solely of a "." character -- the server would get confused and think the email had ended, even though the period was actually part of the email, not an SMTP command. In order to prevent this, the SMTP protocol specifies that when sending the contents of a "DATA" command, any line beginning with a period must have an extra period inserted at the beginning. The receiver checks each incoming line: if it begins with a period followed by other characters, the leading period is removed; if it consists of a period and nothing else, this is the end of data.
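The quoting rule just described can be sketched as a pair of functions, one for each side. The names stuff_lines and unstuff_lines are invented for this illustration, and lines are assumed to be byte strings with the CRLF already stripped.

```python
def stuff_lines(body_lines):
    """Sender side: yield the lines to transmit after the DATA command."""
    for line in body_lines:
        # A leading period gets an extra period in front of it.
        yield b"." + line if line.startswith(b".") else line
    yield b"."  # the end-of-data indicator


def unstuff_lines(wire_lines):
    """Receiver side: return the original body, reading through the terminator."""
    body = []
    for line in wire_lines:
        if line == b".":
            return body
        body.append(line[1:] if line.startswith(b".") else line)
    raise EOFError("connection ended before the terminating '.' line")


body = [b"Hello", b".", b"..two dots", b"bye"]
wire = list(stuff_lines(body))
print(wire)
print(unstuff_lines(iter(wire)) == body)
```

Note that both sides have to inspect every line of the payload, which is the cost the following paragraph complains about.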
While this does work, it is inelegant and inefficient. A cleaner solution would be to use length prefixing. Instead of sending just "DATA", the client implementation of an imaginary improved SMTP protocol would also send the length of the message, for example "DATA 1235" for a message that is 1235 bytes long. The server would then read exactly 1235 bytes, and then revert to line-based mode. No quoting would be necessary for the client, no unquoting for the server. In practice, SMTP has an extension for sending the size of the message, but it is mostly used to allow the server to reject overlarge messages, and the server still must use the period indicator method to detect the end of the message.
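To make the imaginary "DATA 1235" command concrete, here is a minimal sketch of the server side. Nothing here is real SMTP: the command syntax is the hypothetical one from the paragraph above, and read_exact and read_sized_data are names made up for the example.

```python
import io


def read_exact(stream, n):
    """Read exactly n bytes, or raise if the stream ends early."""
    data = bytearray()
    while len(data) < n:
        piece = stream.read(n - len(data))
        if not piece:
            raise EOFError(f"expected {n} bytes, got {len(data)}")
        data += piece
    return bytes(data)


def read_sized_data(stream):
    """Handle the imaginary 'DATA <length>' command and return the payload."""
    # readline() is unbounded here; a real server would cap the line length.
    command = stream.readline()  # e.g. b"DATA 1235\r\n"
    verb, _, size = command.rstrip(b"\r\n").partition(b" ")
    if verb != b"DATA" or not size.isdigit():
        raise ValueError(f"unexpected command: {command!r}")
    return read_exact(stream, int(size))


payload = b"line one\r\n.\r\nline three\r\n"
conn = io.BytesIO(b"DATA %d\r\n" % len(payload) + payload + b"QUIT\r\n")
print(read_sized_data(conn))
print(conn.readline())  # back in line-based mode
```

The payload in the usage example contains a line with a lone period, and it passes through untouched because the receiver counts bytes rather than looking at them.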
HTTP, the protocol used for what is commonly referred to as "the Web", uses length prefixing to indicate the length of a document it is returning in response to a client request (the headers are still sent using CRLF-terminated lines). Here is a sample HTTP server response. Notice that the body is separated from the headers by an extra CRLF, and that the body can be any 12 bytes; there is no need for quoting and there are no restrictions on their values.
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 12

0123456789ab
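A client reading a response like this one might look roughly as follows. This is a simplified sketch, not a complete HTTP parser: read_response is a hypothetical helper, it assumes a Content-Length header is always present, and it ignores details such as repeated or folded headers.

```python
import io


def read_response(stream):
    """Return (status_line, headers, body) for a response with Content-Length."""
    status = stream.readline().rstrip(b"\r\n").decode("ascii")
    headers = {}
    while True:
        line = stream.readline().rstrip(b"\r\n")
        if not line:  # the empty line ends the headers
            break
        name, _, value = line.decode("latin-1").partition(":")
        headers[name.strip().lower()] = value.strip()
    length = int(headers["content-length"])
    body = stream.read(length)  # exactly this many bytes, whatever their values
    if len(body) != length:
        raise EOFError("connection ended before the full body arrived")
    return status, headers, body


raw = (b"HTTP/1.1 200 OK\r\n"
       b"Content-Type: text/plain\r\n"
       b"Content-Length: 12\r\n"
       b"\r\n"
       b"0123456789ab")
print(read_response(io.BytesIO(raw)))
```

The headers are still found by scanning for CRLF, but once the length is known the body is read by count alone.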
While quite a nice idea, length prefixing has a problem of its own: it assumes the length of the data is known in advance. This is certainly a valid assumption when sending the contents of a file, but when generating dynamic content the length of the data is not known until all the data is available. In theory it is possible to wait until all the data has been generated, and then send it along with its length. In practice this is inefficient, as it slows down the data transfer and requires extra temporary storage, either in memory or on disk. One solution, used in HTTP 1.0, is to allow omitting the "Content-Length" header, and indicating the end of the data by closing the connection. This solution is also problematic: it makes it hard to distinguish a failure in the transport (such as a broken TCP connection) from the end of the data, and it is also inefficient since multiple HTTP requests to the same server require opening multiple TCP connections.
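The ambiguity of the close-the-connection approach can be illustrated with a small sketch. An in-memory stream stands in for the connection here, which is a simplification: a real socket may sometimes report an error when a connection breaks, but it is not guaranteed to. The function names are made up for the example.

```python
import io


def read_until_close(stream):
    """HTTP 1.0 style: the body is whatever arrives before the stream ends."""
    return stream.read()


def read_with_length(stream, length):
    """Length-prefixed style: a short read is detectable."""
    body = stream.read(length)
    if len(body) != length:
        raise EOFError(f"expected {length} bytes, got {len(body)}")
    return body


full = b"0123456789ab"
cut = full[:7]  # simulate a connection that went away after 7 bytes

# The truncated body is returned as if it were complete.
print(read_until_close(io.BytesIO(cut)))

# With a known length, the same truncation is noticed.
try:
    read_with_length(io.BytesIO(cut), len(full))
except EOFError as exc:
    print("detected:", exc)
```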
The updated HTTP 1.1 presented a solution that did not have these problems, a combination of length prefixing and an end-of-data indicator. When data is generated on the fly, it is assumed to be generated as a series of "chunks", each chunk being at least 1 byte long. An HTTP response can indicate that it is returning a chunked response, in which case it returns the data as a series of length-prefixed chunks. The end of the data is indicated by sending a chunk whose length is 0. A chunk's length is encoded in hexadecimal numerals and followed by CRLF, after which the chunk itself is sent, also followed by CRLF. Here is an example HTTP response using chunked encoding (new lines indicate a CRLF). The "a" means the next chunk is 10 bytes long, the "3" means the next chunk is 3 bytes long, and the "0" indicates the end of the response.
HTTP/1.1 200 OK
Content-type: text/plain
Transfer-encoding: chunked

a
0123456789
3
abc
0
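Decoding a chunked body like the one in this example can be sketched as below. The function name read_chunked is hypothetical, the stream is assumed to be positioned just after the blank line that ends the headers, and the sketch skips over any chunk extensions or trailer lines rather than interpreting them.

```python
import io


def read_chunked(stream):
    """Decode a chunked body from a binary stream positioned after the headers."""
    body = bytearray()
    while True:
        size_line = stream.readline()
        if not size_line.endswith(b"\r\n"):
            raise EOFError("stream ended inside a chunk-size line")
        # Drop anything after ';' and parse the size as hexadecimal.
        size = int(size_line.split(b";", 1)[0].strip().decode("ascii"), 16)
        if size == 0:  # the zero-length chunk is the end-of-data indicator
            break
        chunk = stream.read(size)
        if len(chunk) != size:
            raise EOFError("stream ended inside a chunk")
        if stream.read(2) != b"\r\n":
            raise ValueError("chunk data not followed by CRLF")
        body += chunk
    # Skip optional trailer lines up to the empty line that ends the response.
    while stream.readline() not in (b"\r\n", b""):
        pass
    return bytes(body)


wire = b"a\r\n0123456789\r\n3\r\nabc\r\n0\r\n\r\n"
print(read_chunked(io.BytesIO(wire)))
```

Because every chunk carries its own length, a connection that breaks partway through is detected as an error, and because the end is marked explicitly, the connection can stay open for the next request.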
The choice between end-of-data indicators and length prefixing is just one of the issues protocol designers must deal with, but it is one which influences many other aspects. In future articles we will discuss syntax and structure, state and statelessness, handling multiple requests and more.