XML, SOAP and Binary Data

February 26, 2003

Editor's note: XML.com is happy to publish this white paper from Don Box and his colleagues, which addresses a long-term issue in XML, namely the coordinated transport of opaque binary data in conjunction with an XML document. Please use the forum facility at the end of this article to leave your comments and questions -- ED.

Version 1.0
February 24, 2003

Authors

Adam Bosworth, BEA Systems
Don Box, Microsoft
Martin Gudgin, Microsoft
Mark Nottingham, BEA Systems
David Orchard, BEA Systems
Jeffrey Schlimmer, Microsoft

Copyright Notice

Abstract

This white paper discusses the architectural issues encountered when using opaque non-XML data in XML applications, including (but not limited to) Web services and SOAP.

Status

This white paper is provided as-is and for review and evaluation only. Microsoft and BEA hope to solicit your contributions and suggestions in the near future. BEA and Microsoft make no warranties or representations regarding this document in any manner whatsoever.

1. Introduction
2. Current Approaches to Opaque Data in XML
2.1 Embedding
2.2 Referencing
3. When Worlds Collide
4. Flexibility in Representation, Consistency in Model
5. Conclusions
6. Acknowledgements
7. References

1. Introduction

The desire to integrate XML with pre-existing data formats has been a long-standing and persistent issue for the XML community. Users often want to leverage the structured, extensible markup conventions of XML without abandoning existing data formats that do not readily adhere to XML 1.0 syntax. Often, users want to leave their existing non-XML formats as is, to be treated as opaque sequences of octets by XML tools and infrastructure. Such an approach allows widely used formats such as JPEG and WAV to peacefully coexist with XML.

As XML is increasingly used as a message format (e.g., SOAP), the interest in integrating opaque data with XML has increased to the point where there are at least two competing proposals for doing so (SOAP With Attachments (SwA) and WS-Attachments). Because SwA was the first widely-publicized mechanism for dealing with binary data, it has had a large influence on how the community views the issues surrounding this topic.

Unfortunately, SwA (as well as WS-Attachments) conflates several orthogonal issues. Specifically, both SwA and WS-Attachments assume that a URI-based referencing mechanism by itself is sufficient for supporting opaque binary values in messages. Moreover, at least one of the proposals (SwA) attempts to solve a problem that is in no way limited to SOAP, namely how URIs that appear as XML element or attribute content are to be resolved in the presence of multipart MIME.

As field experience with both SwA and WS-Attachments has shown, the lack of an XML-focused approach to opaque data has led to solutions that are unnecessarily complex for developers and software components. This white paper attempts to present the various issues raised by dealing with opaque data in XML, without nominating a particular solution.

2. Current Approaches to Opaque Data in XML

2.1 Embedding

Traditionally, two techniques for dealing with opaque data in XML have been used: "by value" or "by reference." The former is achieved by embedding opaque data as element or attribute content. XML supports opaque data as content through the use of either base64 or hexadecimal text encoding. This approach is codified by XML Schema's two binary data types, xs:base64Binary and xs:hexBinary. The lexical representation of xs:hexBinary is a simple hexadecimal character sequence; the lexical representation of xs:base64Binary uses the base64 algorithm as defined by RFC 2045 [rfc2045]. The underlying value space of both types is identical: an ordered sequence of octets.

The following XML instance demonstrates the use of base64 in a simple XML document.


<m:data xmlns:m='http://example.org/people' >
  <photo>/aWKKapGGyQ=</photo>
  <sound>sdcfo2JTiXE=</sound>
  <hash>Faa7vROi2VQ=</hash>
</m:data>

In this example, the photo, sound, and hash elements each contain a base64 string (i.e., a sequence of characters) that represents the following octet sequences:


fd a5 8a 29 aa 46 1b 24 (photo)
b1 d7 1f a3 62 53 89 71 (sound)
15 a6 bb bd 13 a2 d9 54 (hash)

The fact that the children of the photo, sound, and hash elements are encoded as base64 is implicit (although discoverable through an XML Schema or RELAX NG schema), but can be made explicit using xsi:type or an application-specific annotation.
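The correspondence between the base64 strings and the octet sequences in this example can be checked mechanically. The following Python sketch uses only the standard library; the dictionary names `encoded` and `decoded` are illustrative and not part of any specification.

```python
import base64

# Base64 strings copied from the photo, sound, and hash elements of the example.
encoded = {
    "photo": "/aWKKapGGyQ=",
    "sound": "sdcfo2JTiXE=",
    "hash": "Faa7vROi2VQ=",
}

# Map each lexical form (characters) back to its value (an ordered sequence of octets).
decoded = {name: base64.b64decode(text) for name, text in encoded.items()}

for name, octets in decoded.items():
    # Print the octets as space-separated hexadecimal pairs.
    print(" ".join(f"{octet:02x}" for octet in octets), f"({name})")
```

Each printed line should match the corresponding octet sequence listed for the example.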

It is well known that base64-encoded data expands to roughly 1.33 times its original size, and that hexadecimal-encoded data expands to 2 times its original size (assuming an underlying UTF-8 text encoding in both cases; if the underlying text encoding is UTF-16, these numbers double). Also of concern is the overhead in processing costs (both real and perceived) for these formats, especially when decoding back into raw binary. When comparing base64 decoding to a straight-through copy of opaque data, the throughput of at least one popular programming system decreased by a factor of 3 or more.
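The size figures can be reproduced with a few lines of Python. This is a minimal sketch: the 3000-octet payload size and the helper name `ratio` are assumptions chosen for illustration, and the UTF-16 case uses the `utf-16-le` codec so that no byte-order mark is counted.

```python
import base64
import os

# Arbitrary opaque payload; a multiple of 3 octets avoids base64 padding effects.
raw = os.urandom(3000)

b64_text = base64.b64encode(raw).decode("ascii")  # 4 characters per 3 octets
hex_text = raw.hex()                              # 2 characters per octet

def ratio(text, encoding):
    """Serialized size of the text, relative to the size of the raw octets."""
    return len(text.encode(encoding)) / len(raw)

for encoding in ("utf-8", "utf-16-le"):
    print(encoding,
          "base64:", round(ratio(b64_text, encoding), 2),
          "hex:", round(ratio(hex_text, encoding), 2))
```

For this payload the ratios are about 1.33 (base64) and 2.0 (hex) under UTF-8, and about 2.67 and 4.0 under UTF-16.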

These performance concerns have discouraged many developers from using embedded data in XML. It is interesting to note, however, that XML Schema defines the value space of the base64Binary and hexBinary data types as the actual octets. This makes it possible to reduce or eliminate the size and performance costs of base64/hex decoding in many common scenarios (e.g., in-memory DOM trees, SAX pipelines, etc.). However, this is not the case when the XML is serialized as UTF-8 or equivalent, due to the nature of XML 1.0.
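One way to picture this distinction between value space and lexical form is an in-memory node that stores the octets themselves and touches base64 only at the XML 1.0 text boundary. The class below is a hypothetical sketch, not an API of any existing DOM or SAX implementation.

```python
import base64

class BinaryValue:
    """Hypothetical in-memory value for xs:base64Binary content."""

    def __init__(self, octets: bytes):
        # The value space: raw octets, with no text encoding applied.
        self.octets = octets

    @classmethod
    def from_lexical(cls, text: str) -> "BinaryValue":
        # Decoding is needed only when the value arrives as XML 1.0 text.
        return cls(base64.b64decode(text))

    def to_lexical(self) -> str:
        # Encoding is deferred until the value must be serialized as text.
        return base64.b64encode(self.octets).decode("ascii")

photo = BinaryValue(bytes.fromhex("fd a5 8a 29 aa 46 1b 24"))
relayed = BinaryValue(photo.octets)  # handed between components with no base64 work
print(relayed.to_lexical())          # text form is produced only at serialization
```

In this sketch, components that exchange `BinaryValue` objects in memory never pay the encoding cost; only the final call to `to_lexical` does.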

2.2 Referencing

XML 1.0 explicitly supports referencing external opaque data as external unparsed general entities. Considered a fairly esoteric feature of XML, unparsed entities are not widely used. The primary obstacle to using unparsed entities is their heavy reliance on DTDs, which impedes modularity as well as use of XML namespaces. They are also not available to SOAP, which explicitly prohibits document type declarations in messages.

A more common way to reference external opaque data is to simply use a URI as an element or attribute value. XML Schema supports this explicitly through the xs:anyURI type.

<?xml version="1.0" ?>
<data>
  <photo data="http://example.org/me.jpg" />
  <sound data="http://example.org/it.wav" />
  <hash data="http://example.org/my.hsh" />
</data>

An XML schema can describe the content of the data attribute:

<xs:attribute name="data" type="xs:anyURI" use="required" />

as can RELAX NG:

<rng:attribute name="data"
     datatypeLibrary="http://www.w3.org/2001/XMLSchema-datatypes">
  <rng:data type="anyURI" />
</rng:attribute>

Often (and especially in Web services), referenced opaque data is bundled alongside the XML, using a packaging format. SOAP with Attachments, for example, was inspired by MHTML [rfc2557], which describes a technique to bundle cached resource representations with HTML documents using multipart MIME. In fact, very little of SwA Section 3 (the meat of the spec) depends on SOAP in anything but name. In brief, SwA says the following:

SwA senders MUST put a SOAP envelope as the root MIME part of a multipart/related content type, independent of underlying transport.
SwA senders MAY send additional MIME parts with the root, each of which is identified by a URI.
SwA receivers/processors SHOULD consult the additional MIME parts when fetching the representations behind URIs that are contained in the message.

The following example shows a SOAP message that uses SwA:

MIME-Version: 1.0
Content-Type: Multipart/Related; boundary=MIME_boundary;
  type=text/xml; start="<mymessage.xml@example.org>"
Content-Description: A SOAP Envelope with my picture in it

--MIME_boundary
Content-Type: text/xml; charset=UTF-8
Content-Transfer-Encoding: 8bit
Content-ID: <mymessage.xml@example.org>

<s:Envelope xmlns:s='http://www.w3.org/2002/12/soap-envelope' >
  <s:Body>
    <m:data xmlns:m='http://example.org/stuff' >
      <photo data="http://example.org/me.jpg" />
      <sound data="http://example.org/it.wav" />
      <hash data="http://example.org/my.hsh" />
    </m:data>
  </s:Body>
</s:Envelope>

--MIME_boundary
Content-Type: image/jpeg
Content-Transfer-Encoding: binary
Content-Location: http://example.org/me.jpg

fd a5 8a 29 aa 46 1b 24

--MIME_boundary
Content-Type: sound/wav
Content-Transfer-Encoding: binary
Content-Location: http://example.org/it.wav

b1 d7 1f a3 62 53 89 71

--MIME_boundary
Content-Type: binary/hash
Content-Transfer-Encoding: binary
Content-Location: http://example.org/my.hsh

15 a6 bb bd 13 a2 d9 54

--MIME_boundary--
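The receiver-side rule (consult the bundled parts before going elsewhere) can be sketched in Python as follows. This is an illustration under stated assumptions, not an implementation of SwA: the multipart MIME package is assumed to have been parsed already into a dictionary keyed by Content-Location, and the names `parts`, `resolve`, and `fetch` are hypothetical.

```python
import xml.etree.ElementTree as ET

# The root MIME part: the SOAP envelope from the example above.
envelope = """<s:Envelope xmlns:s='http://www.w3.org/2002/12/soap-envelope'>
  <s:Body>
    <m:data xmlns:m='http://example.org/stuff'>
      <photo data="http://example.org/me.jpg" />
      <sound data="http://example.org/it.wav" />
      <hash data="http://example.org/my.hsh" />
    </m:data>
  </s:Body>
</s:Envelope>"""

# The additional MIME parts, keyed by Content-Location (assumed already parsed).
parts = {
    "http://example.org/me.jpg": bytes.fromhex("fd a5 8a 29 aa 46 1b 24"),
    "http://example.org/it.wav": bytes.fromhex("b1 d7 1f a3 62 53 89 71"),
    "http://example.org/my.hsh": bytes.fromhex("15 a6 bb bd 13 a2 d9 54"),
}

def resolve(uri, parts, fetch=None):
    """Consult the bundled parts first; otherwise use a caller-supplied fetch."""
    if uri in parts:
        return parts[uri]
    if fetch is None:
        raise LookupError(f"no MIME part for {uri}")
    return fetch(uri)

root = ET.fromstring(envelope)
# Resolve every URI carried in a data attribute to its octets.
resolved = {
    element.tag: resolve(element.get("data"), parts)
    for element in root.iter()
    if element.get("data") is not None
}
print(sorted(resolved))
```

Note that the octets end up in a separate dictionary, outside the parsed XML tree: the application has to stitch the two together itself.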

Referencing opaque data avoids some of the performance and bloat issues associated with base64/hex encoding, but introduces its own problem: because the data is external to the document, it isn't part of the message Infoset.

3. When Worlds Collide

The approach taken by both SwA and WS-Attachments leads to a situation in which there are two data models associated with a message: one that is based on XML and one that is not. This means that layered technologies for processing and describing XML and SOAP need to provide one set of solutions for the XML component of their data and another set of solutions for the external components (e.g., DIME, multipart MIME). Nowhere is this duplication of effort more apparent than in the area of security.

The industry is rapidly adopting XML-based security mechanisms such as XML Digital Signature, XML Encryption, and WS-Security. These technologies were designed for use with the XML data model (and in the case of WS-Security, the SOAP data model). When a second data model is present (e.g., multipart MIME, DIME), additional (and yet to be specified) measures must be taken to ensure the integrity and confidentiality of the non-XML data. For example, a digital signature over a SOAP envelope does not necessarily protect any data referenced by embedded URIs. While it is possible to protect this data through additional hashes over the referenced octets, aspects such as MIME or DIME headers are likely not covered by such additional effort. In all likelihood, securing these headers would mean resorting to transport-level security (e.g., SSL) and/or S/MIME, neither of which is robust in the face of XML that is shared by multiple parties or by means other than a simple end-to-end network connection.
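The gap can be illustrated with a toy integrity check. In the Python sketch below a plain SHA-256 digest stands in for a real XML Digital Signature (an assumption made for brevity; no canonicalization or keys are involved), and a message is modeled as a hypothetical (envelope, attachments) pair.

```python
import hashlib

envelope = (b"<m:data xmlns:m='http://example.org/stuff'>"
            b"<photo data='http://example.org/me.jpg' /></m:data>")

original = (envelope, [bytes.fromhex("fd a5 8a 29 aa 46 1b 24")])
# Same envelope, but the first octet of the attachment was altered in transit.
tampered = (envelope, [bytes.fromhex("00 a5 8a 29 aa 46 1b 24")])

def envelope_only_digest(message):
    """Stand-in for a signature computed over the XML envelope alone."""
    envelope_bytes, _attachments = message
    return hashlib.sha256(envelope_bytes).hexdigest()

def digest_with_attachments(message):
    """Also folds in a hash of each referenced octet sequence."""
    envelope_bytes, attachments = message
    digest = hashlib.sha256(envelope_bytes)
    for octets in attachments:
        digest.update(hashlib.sha256(octets).digest())
    return digest.hexdigest()

# The envelope-only check cannot tell the two messages apart.
print(envelope_only_digest(original) == envelope_only_digest(tampered))        # True
# The additional hashes over the referenced octets expose the change.
print(digest_with_attachments(original) == digest_with_attachments(tampered))  # False
```

Even the second check says nothing about packaging metadata such as MIME or DIME headers, which this sketch does not model at all.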

The disadvantages of having two data models are especially problematic for SOAP itself. The SOAP specifications were designed around the notion that a SOAP message is simply an XML-based SOAP:Envelope. In SOAP/1.1, the definition of a SOAP message is fairly simple:

The SOAP Envelope element is the top element of the XML document representing the SOAP message.

Throughout the SOAP/1.1 specification, SOAP messages are routinely referred to as the SOAP Envelope. This excerpt from Section 1.3 is one example of this:

[The] following is the response message containing the HTTP message with the SOAP message as the payload:

Following the precedent set forth by SOAP/1.1, the July 2001 SOAP/1.2 Working Draft [soap12wd] (the first public WD from the XMLP WG) states the following:

A SOAP message is an XML document that consists of a mandatory SOAP envelope, an optional SOAP Header, and a mandatory SOAP Body. This XML document is referred to as a SOAP message for the rest of this specification.

This definition has been consistent for the two-year history of the XMLP WG's drafts of the SOAP 1.2 specification. Here is the prose from the December 2002 Candidate Recommendation of SOAP/1.2 [soap12cr]:

A SOAP message is specified as an XML Infoset that consists of a document information item with exactly one member in its [children] property, which MUST be the SOAP Envelope element information item (see 5.1 SOAP Envelope).

SOAP/1.2 has achieved CR status, and therefore represents a considerable amount of consensus-building and engineering oversight, all of which was conducted in an open, public forum. This makes the prospect of redesigning SOAP to accommodate a new data model daunting, given the number of issues that would need to be revisited. Moreover, by abandoning an XML-based data model, SOAP message processors would lose the ability to take advantage of the large and growing infrastructure for describing XML (e.g., XML Schema, RELAX NG) and for processing XML (e.g., SAX, DOM, XPath, XSLT, XML Query).

The impact of two data models on SOAP is complicated by the presence of SOAP intermediaries. In both SwA and WS-Attachments, referenced data needs to be processed by SOAP nodes (including intermediaries); currently, such a processing model is undefined. Can referenced data be ignored? Must the order of its appearance be retained at all layers of the stack? How is data targeted at a node other than the ultimate recipient? Must it be forwarded by SOAP intermediaries? Can SOAP intermediaries add or remove non-Envelope data before relaying a message? Unfortunately, the current SOAP processing rules work at the level of the SOAP envelope and do not provide any guidance on these issues. As a result, every future SOAP extension in an attachments world will need to be aware of multiple data and processing models in order to ensure that it does not violate the differing requirements of each.

The problem of having two data models can also be observed by looking at the dilemma surrounding WSDL/1.1. In addition to providing XML Schema-based facilities for describing SOAP headers and payloads, WSDL/1.1 also attempts to describe multipart MIME messages (see Section 5 of WSDL/1.1). However, this feature received very little attention during the development of WSDL and frankly it shows. Very few members of the Web services community find this feature to be well thought-out or, as a result, interoperable. Furthermore, the integration of the SOAP binding with the MIME binding is underspecified, leaving organizations such as SOAPbuilders, WS-I, and others to each devise their own approaches for describing messages that are "more than SOAP."

4. Flexibility in Representation, Consistency in Model

In contrast, retaining the current XML-based data and processing models for messages considerably reduces complexity. Assuming that all message data belongs to a single Infoset based on SOAP:Envelope requires no changes in the Web services architecture; the current schema/WSDL infrastructure is sufficiently expressive and can retain its simplicity, and SOAP's extensibility mechanism, the SOAP header, is adequate to describe the entire data set. Perhaps most importantly, the work done to establish baseline interoperability at the SOAP envelope layer does not need to be repeated for a new message processing and data model.

Moreover, an Infoset-based approach scales well to a world in which not all technologies can handle new encodings or framing techniques such as SwA, DIME or WS-Attachments. By retaining the pure XML Infoset model currently in wide deployment, any message can be rendered in simple XML 1.0 using UTF-8. This means that technologies such as XML Digital Signature and XML Encryption can be used without modification. More importantly, even in the face of potential new representations of the XML Infoset, all SOAP messages can be represented as UTF-8 text, allowing any SOAP message to survive a purely text-based intermediary (e.g., low-functioning mail system, NOTEPAD.EXE, EMACS).

This suggests that the correct approach to handling opaque binary data in XML is that of encoding; that is, representing such data within the Infoset using the xs:base64Binary or xs:hexBinary types. However, as discussed, there are potential performance issues surrounding this technique.

How, then, can one keep the Infoset model consistent without encountering these drawbacks? We believe that the answer lies in standardized transformations of the Infoset, or in representations of it. In this fashion, the message can be considered as an Infoset, yet avoid the penalties of actually encoding and decoding the data with base64 or hexadecimal text.

To illustrate this, consider the XInclude [XInclude] mechanism, which allows one to create a synthetic Infoset by merging two Infosets, or by merging plain text content into an Infoset. This latter use provides for an interesting possibility: what if XInclude were to allow inclusion of binary data, to be transformed to base64- or hexadecimal-encoded data in the resulting synthetic Infoset? This would allow binary data to be integrated into the Infoset whilst still being serialized as raw octets.
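A rough Python sketch of such a transformation follows. It is purely hypothetical: XInclude as specified merges only XML and text, and the sketch uses the `data` attribute from the earlier referencing example as its inclusion marker rather than real XInclude syntax. The names `inline_binary` and `octets_by_uri` are likewise made up for illustration.

```python
import base64
import xml.etree.ElementTree as ET

document = """<data>
  <photo data="http://example.org/me.jpg" />
  <sound data="http://example.org/it.wav" />
  <hash data="http://example.org/my.hsh" />
</data>"""

# Raw octets that travel alongside the XML (for example, in separate parts).
octets_by_uri = {
    "http://example.org/me.jpg": bytes.fromhex("fd a5 8a 29 aa 46 1b 24"),
    "http://example.org/it.wav": bytes.fromhex("b1 d7 1f a3 62 53 89 71"),
    "http://example.org/my.hsh": bytes.fromhex("15 a6 bb bd 13 a2 d9 54"),
}

def inline_binary(root, octets_by_uri):
    """Replace each resolvable reference with base64 character content."""
    for element in root.iter():
        uri = element.get("data")
        if uri in octets_by_uri:
            del element.attrib["data"]
            element.text = base64.b64encode(octets_by_uri[uri]).decode("ascii")
    return root

# The synthetic tree now looks as if the data had been embedded by value.
root = inline_binary(ET.fromstring(document), octets_by_uri)
print(ET.tostring(root, encoding="unicode"))
```

Tools that operate on the resulting tree see ordinary base64 content, as in the embedding example of Section 2.1, even though the octets were never transmitted as text.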

Yet another approach would be to define an alternate serialization of the Infoset (much as WBXML [WBXML] has done) that serializes xs:base64Binary and xs:hexBinary typed data as the actual octets, rather than text-encoded octets. One can easily imagine on-disk/on-wire representations that allow opaque data to coexist as raw octets with the encoded characters of the surrounding XML structured data.

These are only two examples of how one could preserve the Infoset and avoid the performance issues associated with base64/hex text encoding. They are not fully specified here, but serve to illustrate that viable alternatives to current approaches exist and should be considered before radically changing SOAP and Web services.

5. Conclusions

Retaining SOAP's tradition of purely Infoset-based messages has various advantages:

SOAP extensions only have to be defined in terms of one data model and one processing model, both of which are already defined in SOAP/1.2.
Applications can directly take advantage of the rich set of technologies available for XML processing.
Interface description can provide a single, simple, and consistent model to the developers and tools.
Programmatic interfaces can expose a single programming model to the developer.
A single security model can be applied to the Infoset, encompassing all message content in a uniform manner.
Infoset mappings can be defined for multiple serialization formats, effectively unifying multiple messaging technologies.
Finally (and perhaps most importantly), ALL SOAP messages can be represented in pure text. Even in the face of exotic in-memory or on-disk data structures for representing XML, one can always produce a purely text-based form of the message. The utility of the installed base of text processing tools should not be overlooked. Text is our heritage and abandoning it loses a key feature of the web service architecture.

The authors of this paper believe strongly that the data and processing model for SOAP has always been and should remain purely XML-based. Literally thousands of man-years have been directed at defining and refining an architecture based on these assumptions. Moreover, the data and processing model for SOAP should deviate as little as possible from the current SOAP/1.2 Candidate Recommendation.

The authors also believe that the XMLP WG charter allows sufficient freedom in associating opaque data with SOAP Messages to define a SOAP specific processing model, including the XInclude approach, as well as participating in or leading other approaches.

6. Acknowledgements

This white paper is the result of ongoing conversations with:

Erik Christensen, Chris Fry, Yaron Goland, Chris Kaler, Andrew Layman, Hal Lockhart, and John Shewchuk.

7. References

[soap11]: "SOAP: Simple Object Access Protocol 1.1," W3C Note, May 2000.
[soap12wd]: "SOAP Version 1.2," W3C Working Draft, July 2001.
[soap12cr]: "SOAP Version 1.2 Part 1: Messaging Framework," W3C Candidate Recommendation, December 2002.
[XInclude]: "XML Inclusions (XInclude) Version 1.0," W3C Candidate Recommendation, September 2002.
[XML]: "Extensible Markup Language (XML) 1.0 (Second Edition)," W3C Recommendation, October 2000.
[WBXML]: "WAP Binary XML Content Format," W3C Note, June 1999.
[rfc2045]: "Base64 Content-Transfer-Encoding," RFC 2045, Section 6.8, IETF Draft Standard, November 1996.
[rfc2557]: "MIME Encapsulation of Aggregate Documents, such as HTML (MHTML)," RFC 2557, IETF Proposed Standard, March 1999.