Protocol Design: How Many Bytes?
November 25, 2003
The Internet is built on protocols. Protocols take the raw, unstructured capabilities of the network and, using rules and restrictions, determine what programs can communicate and how. Choosing the right rules is important: they determine to a large degree the security, ease of implementation and performance of the protocol. This is the first in a series of articles discussing basic concepts of protocol design. The issue we will start with is how a protocol knows how much data it is going to receive. Protocols are, after all, mostly about sending and receiving data.
Before we begin, it's worth noting some basic assumptions. Unless noted otherwise, the protocols being discussed all run over a connection-oriented transport, typically TCP. There is an initiating side that starts the connection and a receiving side that accepts it. In many cases these will match the concepts of "client" and "server", and will have different behavior depending on which they are. The connection is assumed to transport a stream of bytes in an ordered, reliable fashion.
Many protocols involve sending chunks of "payload" bytes -- data which is not part of the protocol itself. An email is a structured sequence of bytes, so when an email is sent or received, the receiver side of the protocol needs to know when the email data ends and the protocol begins again. An email that contains a transcript of a POP3 session should not be able to confuse a POP3 client that is downloading it. In addition, commands and messages of the protocol itself are also structured, and the receiving side needs to know when they end and the next message begins.
The first approach that can be used is an end-of-data indicator: some special way of marking when the transfer of the data is over. For example, when sending a payload, the sending side will send a message meaning the data will now be sent, then the actual payload, and finally a message saying there is no more data. One of the Internet's oldest protocols, SMTP, uses this technique to allow clients to send emails to the server. SMTP is documented in RFC 2821, an updated version of RFC 821, which was written in 1982. In the SMTP protocol, a client connects to a server, sends a series of commands indicating from whom and to whom the email is being sent, the body of the email, and then the server deals with delivery of the message.
SMTP follows (or perhaps, given its age, leads) the convention of "line-based" protocols. An SMTP session is composed of a series of lines: a "line" is a sequence of bytes terminated with CRLF, the bytes with the hex values 0x0D and 0x0A. A line can be a command, a response to a command, or part of a message. Each of these lines recreates in its own small way the end-of-data indicator method for finding the end of a message, in this case CRLF. The basic units of the protocol, the lines, can be any length; the receiving side only knows a line has ended when it sees the CRLF. As a result, all SMTP servers set an arbitrary limit on the length of lines they accept, otherwise a simple connection sending an infinite stream of non-CRLF characters would use up the server's memory. Here is an example of a simple SMTP session between a client and server, taken from the RFC (note that each printed line would be sent with a CRLF after it):
S: 220 foo.com Simple Mail Transfer Service Ready
C: EHLO bar.com
S: 250-foo.com greets bar.com
S: 250-8BITMIME
S: 250-SIZE
S: 250-DSN
S: 250 HELP
C: MAIL FROM:&lt;Smith@bar.com&gt;
S: 250 OK
C: RCPT TO:&lt;Jones@foo.com&gt;
S: 250 OK
C: DATA
S: 354 Start mail input; end with &lt;CRLF&gt;.&lt;CRLF&gt;
C: Blah blah blah...
C: ...etc. etc. etc.
C: .
S: 250 OK
C: QUIT
S: 221 foo.com Service closing transmission channel
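The CRLF framing underlying the session above can be sketched in a few lines of Python. This is a simplified illustration, not a real SMTP implementation; `recv` is assumed to be a caller-supplied function returning one byte at a time (for example, a wrapper around `socket.recv(1)`), and `MAX_LINE` is an arbitrary limit of the kind the text says servers must impose:

```python
MAX_LINE = 1024  # arbitrary limit; protects the server's memory


def read_line(recv):
    """Read bytes from recv() until CRLF; refuse over-long lines.

    recv is assumed to return one byte per call (as bytes), and an
    empty result means the connection was closed.
    """
    line = bytearray()
    while not line.endswith(b"\r\n"):
        if len(line) >= MAX_LINE:
            raise ValueError("line too long")
        byte = recv()
        if not byte:
            raise ConnectionError("connection closed mid-line")
        line += byte
    return bytes(line[:-2])  # strip the CRLF terminator
```

Note that the caller gets the line back without its CRLF; the terminator is framing, not content.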
Looking at the example carefully, we'll note two more examples of the end-of-data indicator. There are multiple responses to the EHLO command, all with response code 250, and the last response starts with "250 " (the code followed by a space), rather than "250-" (the code followed by a dash), to indicate that no more responses are forthcoming. A more interesting use is the "DATA" command, which is used by the client to send the body of the email. The email is sent as a series of lines, and a line with a single "." (a period) indicates the end of the email body.
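The dash-versus-space convention can be sketched as a small reply collector. This is a hypothetical illustration: `lines` is assumed to be an iterator of already-framed reply lines (CRLF stripped), not a real socket:

```python
def read_reply(lines):
    """Collect one multi-line SMTP reply from an iterator of lines.

    A "250-" prefix means more lines follow; "250 " (with a space)
    marks the final line of the reply.
    """
    reply = []
    for line in lines:
        reply.append(line)
        if len(line) < 4 or line[3] != "-":
            break  # fourth character was not a dash: reply complete
    return reply
```

A client would call this once per command it sends, leaving any later lines in the stream untouched for the next command's reply.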
On the face of it this is a reasonable approach, but there are some serious issues that have led modern protocols to choose other solutions. Consider what would happen if an email contained a line consisting solely of a "." character -- the server would get confused and think the email had ended, even though the period was actually part of the email, not an SMTP command. In order to prevent this, the SMTP protocol specifies that when sending the contents of a "DATA" command, any line beginning with a period must have an extra period inserted at the beginning. The receiver checks each incoming line, and if it has a period followed by other characters, the leading period is removed; otherwise this is the end of the email.
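This quoting scheme (often called "dot-stuffing") can be sketched as a pair of helper functions, one per side of the connection. A minimal illustration, operating on lines that have already been framed:

```python
def quote_body(lines):
    """Sender side: prepend an extra period to any line starting with one."""
    return ["." + line if line.startswith(".") else line for line in lines]


def unquote_line(line):
    """Receiver side: return (is_end, line).

    A lone "." ends the message; any other line starting with a
    period loses the extra leading period the sender inserted.
    """
    if line == ".":
        return True, None
    if line.startswith("."):
        return False, line[1:]
    return False, line
```

Notice the cost: the sender must scan every line it sends, and the receiver must scan every line it receives, just to keep the payload from colliding with the framing.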
While this does work, it is inelegant and inefficient. A cleaner solution would be length prefixing. Instead of sending "DATA", the client implementation of an imaginary improved SMTP protocol would also send the length of the message, for example "DATA 1235" for a message that is 1235 bytes long. The server would then read exactly 1235 bytes, and then revert back to line-based mode. No quoting would be necessary for the client, and no unquoting for the server. In practice, SMTP has an extension for sending the size of the message, but it is mostly used to allow the server to deny overlarge messages, and the server still must use the period indicator to detect the end of the message.
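The server side of this imaginary "DATA 1235" variant reduces to reading an exact byte count. A sketch, where `recv` is assumed to behave like `socket.recv`: it may return fewer bytes than requested, and an empty result means the connection closed:

```python
def read_exact(recv, n):
    """Read exactly n bytes from recv(k), which may return fewer than k."""
    buf = bytearray()
    while len(buf) < n:
        piece = recv(n - len(buf))
        if not piece:
            raise ConnectionError(
                "connection closed before %d bytes arrived" % n)
        buf += piece
    return bytes(buf)
```

After the call returns, the server simply goes back to reading CRLF-terminated lines; the payload bytes needed no quoting at all.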
HTTP, the protocol used for what is commonly referred to as "the Web", uses length prefixing to indicate the length of a document it is returning in response to a client request (the headers are still sent using CRLF terminated lines). Here is a sample HTTP server response. Notice that the body is separated from the headers by an extra CRLF, and that the body can be any 12 bytes; there is no need for quoting nor any restrictions on their values.
HTTP/1.1 200 OK
Content-Type: text/plain
Content-Length: 12

0123456789ab
While quite a nice idea, length prefixing has a problem of its own: it assumes the length of the data is known in advance. This is certainly a valid assumption when sending the contents of a file, but when generating dynamic content the length of the data is unknown until all the data is available. In theory it is possible to wait until all the data has been generated, and then send it along with its length. In practice this is inefficient, as it slows down the data transfer and requires extra temporary storage, either in memory or on disk. One solution, used in HTTP 1.0, is to allow omitting the "Content-Length" header, and indicating the end of the data by closing the connection. This solution is also problematic: it makes it hard to distinguish a failure in the transport (such as a broken TCP connection) from the end of the data, and it is also inefficient since multiple HTTP requests to the same server require opening multiple TCP connections.
The updated HTTP 1.1 presented a solution that did not have these problems, a combination of length prefixing and an end-of-data indicator. When data is generated on the fly, it is assumed to be generated as a series of "chunks", each chunk being at least 1 byte long. An HTTP response can indicate that it is returning a chunked response, in which case it returns the data as a series of length-prefixed chunks. The end of the data is indicated by sending a chunk whose length is 0. A chunk's length is encoded in hexadecimal numerals on a line of its own (terminated with CRLF), after which the chunk's bytes are sent, followed by another CRLF. Here is an example HTTP response using chunked encoding (newlines indicate a CRLF). The "a" means the next chunk is 10 bytes long, the "3" means the next chunk is 3 bytes long, and the "0" indicates the end of the response.
HTTP/1.1 200 OK
Content-type: text/plain
Transfer-encoding: chunked

a
0123456789
3
abc
0
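A minimal decoder for such a chunked body might look like the sketch below. It is an illustration of the framing described above, not a complete HTTP 1.1 implementation (real responses may also carry trailing headers after the zero chunk, which this ignores). It assumes two caller-supplied helpers: `read_line()` returning one CRLF-terminated line without the CRLF, and `read_exact(n)` returning exactly `n` body bytes:

```python
def decode_chunked(read_line, read_exact):
    """Reassemble the body of a chunked HTTP response."""
    body = bytearray()
    while True:
        size = int(read_line(), 16)  # chunk length, in hexadecimal
        if size == 0:
            break  # zero-length chunk: end of data
        body += read_exact(size)
        read_line()  # consume the CRLF that trails each chunk's bytes
    return bytes(body)
```

The appeal of the scheme shows up clearly here: each chunk's bytes are copied verbatim with no scanning or unquoting, and the end of the stream is still unambiguous.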
End of data indicators versus length prefixing are just one of the issues protocol designers must deal with, but one which influences many other aspects. In future articles we will discuss syntax and structure, state and statelessness, handling multiple requests and more.