Transporting Binary Data in SOAP

August 28, 2002

You know the old saying, that a picture is worth a thousand words? There's an awful lot of binary data out there, and XML is not going to replace it all or even a significant percentage. After all, what's the benefit to xmlifying things like MPEG's or program executables?

Binary Data in XML

XML doesn't handle embedded binary data very well. Naive developers first try to embed the data directly into their document, reasoning that since Unicode uses all possible byte values, they'll be able to do this. They realize their mistake as soon as their embedded content has a byte with a special value like 0x3C (less than) or perhaps 0x26 (ampersand). The clever naïf might try to fix this by wrapping their content in a CDATA construct, but that only makes the problem less likely, rather than removing it. Suppose the content is a SAX library -- it's quite possible that the CDATA terminator string, "]]>", will show up.
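To make the failure concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree parser. The helper names (embed_raw, embed_cdata, is_well_formed) and the <data> element are made up for illustration; the point is only that raw embedding breaks on 0x3C or 0x26, and that CDATA breaks as soon as the content happens to contain the terminator.

```python
import xml.etree.ElementTree as ET

def embed_raw(payload: bytes) -> bytes:
    """Drop the payload straight into a (hypothetical) <data> element."""
    return b"<data>" + payload + b"</data>"

def embed_cdata(payload: bytes) -> bytes:
    """Wrap the payload in a CDATA section, with no further checks."""
    return b"<data><![CDATA[" + payload + b"]]></data>"

def is_well_formed(document: bytes) -> bool:
    """Return True if the parser accepts the document, False otherwise."""
    try:
        ET.fromstring(document)
    except ET.ParseError:
        return False
    return True

# Content that looks like parser source code, terminator string included.
sax_like = b'if (s.endswith("]]>") and n < 3): pass'

checks = {
    "raw, harmless bytes": is_well_formed(embed_raw(b"hello")),
    "raw, with 0x3C and 0x26": is_well_formed(embed_raw(b"a < b & c")),
    "CDATA, with 0x3C and 0x26": is_well_formed(embed_cdata(b"a < b & c")),
    "CDATA, with terminator inside": is_well_formed(embed_cdata(sax_like)),
}
for label, ok in checks.items():
    print(f"{label}: {'parses' if ok else 'rejected'}")
```

The CDATA wrapper rescues the second case but not the last one, which is the "less likely, rather than removed" situation described above.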

Having lost their innocence to the cruel master of experience, the developer bites the bullet, encodes their data as Base64, and lets XML treat it as a string. The problem with this is two-fold. First, it's not really a string, it's something else. Second, Base64 output is one-third larger than the data it encodes, since every three bytes become four characters.

Actually, the combination of those two factors will probably make the overhead penalty worse. If the developer is using a third-party XML or SOAP toolkit, it's most likely that the toolkit will return the embedded data as a string, which means the developer will then have to decode it themselves. While both the encoded string and the decoded copy are in memory, this results in a (temporary) overhead of 1 1/3 times the size of the data -- more than 100%. Unfortunately, while it can be prohibitively expensive (in terms of message size and memory use), Base64 strings have been the only approach that works and is portable.
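The arithmetic can be checked with a few lines of Python. This is only a sketch, and it rests on one assumption about the figure quoted above: that the 1 1/3 overhead counts the Base64 string and the decoded copy being held in memory together, relative to the size of the original data.

```python
import base64

payload = bytes(range(256)) * 12          # 3072 bytes of arbitrary binary data
encoded = base64.b64encode(payload)       # what travels inside the XML document
decoded = base64.b64decode(encoded)       # what the application actually wants

# Message-size penalty: four output characters for every three input bytes.
size_ratio = len(encoded) / len(payload)

# Memory penalty while the encoded string and the decoded copy both exist,
# measured as the extra bytes relative to the original payload size.
peak_bytes = len(encoded) + len(decoded)
peak_overhead = (peak_bytes - len(payload)) / len(payload)

print(f"encoded/original size: {size_ratio:.3f}")
print(f"temporary overhead:    {peak_overhead:.3f}")
```

A payload whose length is not a multiple of three comes out slightly worse because of padding, and line breaks inserted by some encoders add a little more.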

XML in XML

XML is also not good at embedding XML documents inside each other. There are a number of reasons for this. First, there can be only one prolog, so you have to force everything to be in the same encoding. While almost everything is in UTF-8 right now, it's probably only a matter of time before applications start using encodings optimized for their locale. Second, the embedded document can't have a DTD, requiring the developer to do entity expansion, fill in defaults, and so on -- onerous, but not impossible.

What is impossible is knowing which attributes are XML ID's. This means that the developer's outer document can't point into parts of the embedded document. More importantly, it's impossible to enforce the ID uniqueness constraint on the resulting compound document. For example, a medical document might define an "id" attribute to be a patient identifier, and not an XML ID at all. A SOAP-based web service that sent three medical documents about the same person would probably fool many generic parsers into marking the compound document as invalid.
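Here is a small sketch of how that goes wrong. It mimics a generic tool that, lacking a DTD, guesses that every attribute named "id" is an XML ID. The element names, the patient number, and the duplicate_ids helper are all invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Three hypothetical medical records about one patient, bundled in one body.
COMPOUND = """<Envelope><Body>
  <record id="P-1001"><note>Admission</note></record>
  <record id="P-1001"><note>Lab results</note></record>
  <record id="P-1001"><note>Discharge</note></record>
</Body></Envelope>"""

def duplicate_ids(xml_text: str, attr: str = "id") -> dict:
    """Return {value: count} for attribute values that occur more than once."""
    root = ET.fromstring(xml_text)
    values = (el.get(attr) for el in root.iter())
    counts = Counter(v for v in values if v is not None)
    return {value: n for value, n in counts.items() if n > 1}

duplicates = duplicate_ids(COMPOUND)
for value, n in duplicates.items():
    print(f'"{value}" appears {n} times; a tool guessing it is an XML ID would object')
```

Each record is fine on its own. The conflict only exists in the compound document, and only for a tool that guesses "id" means XML ID.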

In current practice developers usually cross their fingers and make the following simplifying assumptions:

Everything is in UTF-8, and nothing else from the prolog matters
There are no entities or defaults
Every XML ID attribute is named "id" (or perhaps "ID"), and they're not that common, so we'll ignore the potential for conflicts

While this method is an ugly hack, it usually works in the real world. And when it doesn't, our poor developer encodes their embedded XML as a Base64 string and pays the price described above. It's worse, of course, because now you have to rescan to create the embedded document and figure out how to properly associate the two documents to each other. Have you ever used a DOM implementation to "merge" two XML documents? It's neither pretty nor easy.

By this point, it should be clear that it's not good to try to embed arbitrary binary or XML content into another XML document. This is particularly bad news for SOAP and web services, since SOAP messages are XML documents with a thin layer -- a SOAP bubble, perhaps? -- around them.

SwA, DIME, BEEP

The right approach is to pull the embedded content out of the XML container, and replace it with a link. Fortunately, SOAP defines the href attribute that makes such linking fairly easy. For example, a stock service could easily refer to the latest SEC filing and set of indictments:

<SOAP-ENV:Envelope>
  <SOAP-ENV:Body>
    <tns:Ticker>WCOM</tns:Ticker>
    <tns:Price>0.32</tns:Price>
    <tns:Filing href="http://edgar.sec.us.gov/10k.cgi?s=wcom"/>
    <tns:Indictments href="http://alcatraz.doj.us.gov/search/wcom"/>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

(Don't waste your time trying the href values; I just made them up.)

Usually it's necessary to bundle the data with the message. When this is done, we typically call the SOAP message the payload and the data that used to be embedded the attachments. There are three common formats for doing this. In no particular order, they are:

SOAP Messages with Attachments (SwA), which uses multi-part MIME
DIME, a binary packaging format created by Microsoft
BEEP, a very powerful facility by protocol expert Marshall Rose

We'll look at each of these in turn, starting with SwA for the rest of this column, and DIME and BEEP in subsequent months. While "direct handling of binary data" was explicitly declared to be out of scope for the W3C SOAP working group, this should change once SOAP 1.2 enters the standardization track. Using one of the existing mechanisms seems the most reasonable way to move forward.

SOAP Messages with Attachments

SOAP Messages with Attachments is a W3C Note, just like SOAP 1.1. It was published in December of 2000, seven months after the SOAP Note. The name turns out to have been unfortunate, having usurped the obvious generic term.

SwA is very simple: the first part of the multipart MIME message is the XML SOAP document; the subsequent parts contain the attached data. The bulk of the document addresses URI resolution, particularly relative URI's. If we ignore them and always use absolute URI's (the current recommendation), the specification becomes even simpler. In the example below, we'll use email-like Message-ID's as our identifiers, as they have the convenient properties of being globally unique and absolute. We'll just attach a prefix to a single Message-ID to distinguish the parts.

The first bit is to properly declare the MIME content type; as is common with MIME multipart, the hardest part will probably be determining the message boundary:

Content-Type: multipart/related; type="text/xml";
    boundary="xXxXxXx";
    start="<start-AA11234455.22@www.datapower.com>"

Here is the movie you requested.
Thank you for patronizing the MPAA on-line store.

--xXxXxXx
Content-Type: text/xml; charset="UTF-8"
Content-ID: <start-AA11234455.22@www.datapower.com>

<SOAP-ENV:Envelope>
  <SOAP-ENV:Body>
    <tns:RunningTime>120</tns:RunningTime>
    <tns:Rating>PG</tns:Rating>
    <tns:Movie href="cid:part1-AA11234455.22@www.datapower.com"/>
  </SOAP-ENV:Body>
</SOAP-ENV:Envelope>

--xXxXxXx
Content-Type: application/mpeg
Content-Transfer-Encoding: binary
Content-ID: <part1-AA11234455.22@www.datapower.com>

.....

--xXxXxXx--
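For readers who want to see how little machinery this takes, here is a hand-rolled sketch in Python that writes a message of the same shape. Nothing in it comes from the SwA Note itself: build_swa, its signature, and the boundary-picking loop are my own assumptions, and it labels attachments "binary" because the bytes are written raw. It follows the convention used in this column of prefixing one Message-ID with "start-" and "partN-".

```python
import uuid

CRLF = b"\r\n"

def build_swa(soap_xml: str, attachments: list, message_id: str):
    """Return (Content-Type header value, body bytes) for an SwA-style message.

    attachments is a list of (mime_type, raw_bytes) pairs. The SOAP part gets
    Content-ID <start-message_id>; attachment N gets <partN-message_id>.
    """
    payloads = [soap_xml.encode("utf-8")] + [data for _, data in attachments]

    # Determining the boundary: keep generating until it occurs in no part.
    while True:
        boundary = "swa-" + uuid.uuid4().hex
        if not any(boundary.encode("ascii") in p for p in payloads):
            break
    delimiter = b"--" + boundary.encode("ascii")

    # First part: the SOAP envelope.
    chunks = [
        delimiter, CRLF,
        b'Content-Type: text/xml; charset="UTF-8"', CRLF,
        f"Content-ID: <start-{message_id}>".encode("ascii"), CRLF,
        CRLF,
        payloads[0], CRLF,
    ]
    # Subsequent parts: the attachments, written as raw bytes.
    for n, (mime_type, data) in enumerate(attachments, start=1):
        chunks += [
            delimiter, CRLF,
            f"Content-Type: {mime_type}".encode("ascii"), CRLF,
            b"Content-Transfer-Encoding: binary", CRLF,
            f"Content-ID: <part{n}-{message_id}>".encode("ascii"), CRLF,
            CRLF,
            data, CRLF,
        ]
    chunks += [delimiter, b"--", CRLF]

    content_type = (
        f'multipart/related; type="text/xml"; boundary="{boundary}"; '
        f'start="<start-{message_id}>"'
    )
    return content_type, b"".join(chunks)

MESSAGE_ID = "AA11234455.22@www.datapower.com"
SOAP_XML = (
    "<Envelope><Body>"
    f'<Movie href="cid:part1-{MESSAGE_ID}"/>'
    "</Body></Envelope>"
)
FAKE_MOVIE = b"\x00\x01\x02 not really an MPEG \xff"

content_type, body = build_swa(SOAP_XML, [("application/mpeg", FAKE_MOVIE)], MESSAGE_ID)
print("Content-Type:", content_type)
print(len(body), "bytes in the body")
```

The envelope here is stripped of namespaces to keep the sketch short. The href in the SOAP part and the Content-ID on the attachment are built from the same Message-ID, which is what makes the link resolvable on the other end.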

There are a couple of things to notice. First, if you follow the techniques I used here with stylized use of Message-IDs and Content-ID headers, it should be drop-dead easy to generate and parse SwA messages with the help of a MIME toolkit. Second, note that HTTP forms can be sent using MIME multipart, and if a "file upload" is involved, then they have to be. This means that all web servers probably already have the necessary MIME machinery built in. Any client with a modern mailreader (one capable of sending attachments, if not doing the whole GUI thing), should be in the same situation.
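As a rough illustration of the "help of a MIME toolkit" claim, the sketch below uses the email package from Python's standard library to take such a message apart and follow a cid: link. The function names split_swa and resolve_href are mine, the envelope is simplified (no namespaces), and the attachment is a few made-up bytes standing in for the movie.

```python
import email
import xml.etree.ElementTree as ET

MESSAGE_ID = b"AA11234455.22@www.datapower.com"
FAKE_MOVIE = b"\x00\x01\x02fake-mpeg-bytes\xff"

# A complete SwA-style message: headers, SOAP part, one binary attachment.
RAW = b"\r\n".join([
    b'Content-Type: multipart/related; type="text/xml";',
    b'    boundary="xXxXxXx";',
    b'    start="<start-' + MESSAGE_ID + b'>"',
    b"",
    b"--xXxXxXx",
    b'Content-Type: text/xml; charset="UTF-8"',
    b"Content-ID: <start-" + MESSAGE_ID + b">",
    b"",
    b'<Envelope><Body><Movie href="cid:part1-' + MESSAGE_ID + b'"/></Body></Envelope>',
    b"--xXxXxXx",
    b"Content-Type: application/mpeg",
    b"Content-Transfer-Encoding: binary",
    b"Content-ID: <part1-" + MESSAGE_ID + b">",
    b"",
    FAKE_MOVIE,
    b"--xXxXxXx--",
    b"",
])

def split_swa(raw: bytes):
    """Return (soap_bytes, parts), where parts maps bare Content-IDs to bytes."""
    msg = email.message_from_bytes(raw)
    if msg.get_content_type() != "multipart/related":
        raise ValueError("not a multipart/related message")
    subparts = msg.get_payload()
    parts = {}
    for part in subparts:
        cid = part.get("Content-ID", "").strip().strip("<>")
        parts[cid] = part.get_payload(decode=True)
    # The start parameter names the SOAP part; fall back to the first part.
    start = (msg.get_param("start") or "").strip("<>")
    soap = parts[start] if start in parts else subparts[0].get_payload(decode=True)
    return soap, parts

def resolve_href(href: str, parts: dict) -> bytes:
    """Follow a cid: reference to the attachment it names."""
    if not href.startswith("cid:"):
        raise ValueError("only cid: references are handled in this sketch")
    return parts[href[len("cid:"):]]

soap_bytes, parts = split_swa(RAW)
movie = ET.fromstring(soap_bytes).find("./Body/Movie")
movie_bytes = resolve_href(movie.get("href"), parts)
print(len(movie_bytes), "bytes recovered for", movie.get("href"))
```

A real receiver would also have to undo any URL escaping in the cid: value and cope with relative URI's, which is exactly the part of the specification we chose to ignore.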

So this is what's good about SwA: it's simple, and if the code isn't already on the platform, it's not onerous to get it. In spite of this, it doesn't seem to have taken off. There are a couple of technical reasons for this. The first is a minor one: MIME can be heavyweight, and might not be appropriate for small or embedded devices. While this is true for full-fledged MIME toolkits, a custom library for SwA-style MIME use need not be big. (For a sense of historical perspective, the same complaints used to be raised about ASN.1/DER libraries -- a binary format used by PKI -- and there seem to be no problems getting the necessary bits of those onto devices like smartcards.)

The second drawback to SwA is that it can't handle data streaming. While the ability to send the data in chunks wasn't part of our original problem statement, once you start using it in the real world for things like multi-media data, it's clear you don't want to require the sender or receiver to have to buffer the entire attachment before processing it.

In order to address the streaming and implementation footprint issues, Microsoft developed the DIME protocol, which is progressing through the IETF. MS clearly sees DIME as more useful than SwA: although MS was one of the original SwA authors, SwA is supported in only one MS toolkit, while DIME is part of MS's Global XML Architecture.

In next month's column, we'll examine DIME.