XML: From Bytes to Characters

October 2, 1997

XML

From Bytes to Characters

Bert Bos

Abstract

XML is a syntax for storing hierarchically organized data such as directories, catalogues, user manuals, etc. It can store only textual data, but that is not a severe restriction. This article defines, in some detail, how text is stored in an XML file. It also describes how an XML file is encoded for transportation over the Internet, and upon arrival, decoded again. Under the Internet model for transport of text files, the encoding/decoding may result in a "different" file (i.e., a different sequence of bytes), but retains exactly the same text and structure.

Introduction

An XML file is a simple text file--just like HTML, but unlike a WordPerfect or DBase file, for example. The advantage is that you can usually make sense of an XML file even without XML software, just by viewing it as a text file in WordPad or vi. You can spell-check or grep it. The structure of an XML file is so simple that you can write one "by hand"--that is, with any editor or word processor that can write text files. Of course, with specialized XML software, it will probably be easier.

Choosing a text format for XML has a slight cost in efficiency, since every time an XML file is used by a program, it has to be parsed: i.e., the linear sequence of characters has to be converted into a complex structure in memory. And since the XML format has been kept human-readable and human-writable to some extent, there are many different ways of writing the same XML document--and the program has to understand them all.

There is also another disadvantage of text format: although most computers have a way of dealing with text files, the files aren't always portable. Here's a common problem: somebody creates text with quote marks or accented letters on one computer and gives it to a friend with a different computer. Suddenly the quotes and accents show up as completely different symbols. The problem can get much worse if you exchange files in different languages: have you ever wondered how the Russians manage to get Cyrillic letters on their screen?

In the following sections I'll show how this problem is handled on the Web and how XML itself has provisions for making it portable. But before that, I'll present a definition of an XML document and its representation as a text file that is somewhat different--more precise--than the way XML is usually described.

XML as a Data Type

You may have noticed that I use the terms XML file and XML document. XML file refers to the text file, which is a sequence of characters (letters, digits, and other symbols); and XML document refers to the abstract structure, which may be some data structure in memory. An XML document may be the result of parsing an XML file, or it may be constructed directly in memory. It is possible that several different XML files all represent the same document.

On the Web, when you retrieve a document from a server, you get a stream of bytes, which your browser then interprets. In many cases (but not always) the stream of bytes existed as a file on the server's disk. Regardless of whether a file ever existed on the server's disk, I will refer to the stream of bytes--or more precisely the stream of characters represented by those bytes--as the XML file.

The way XML is usually explained starts with a description of XML file syntax, followed by rules for parsing them. I'll start at the other end, by defining XML documents as an (abstract) data structure, and then showing how this data structure is converted to a sequence of characters. In other words, rather than explaining how to parse an XML file, I will describe how to generate one. We call the latter process marshaling.[1]

Marshaling (a.k.a. "serializing," "pickling" or "flattening") is a term used in the theory of interprocess communication. When two programs need to exchange some data structure, they have to agree on a way to represent the data as a sequence of bytes. That is exactly what is happening when two programs on different machines exchange an XML document. For XML, however, there is also an intermediate representation as a sequence of characters, before the characters are further encoded as bytes.

An XML document is a tree structure. There is a root node, called the root element, and the other nodes are all descendants of the root. All nodes are of one of four types:

Character
Processing instruction (PI)
Comment
Element

A character node contains one character and has no other structure. All characters from the Unicode/ISO-10646 repertoire are allowed. A processing instruction node has two fields: a name (also called target) and a content, which is a sequence of characters. A comment node has only one field: content, which contains a sequence of characters.

Characters, PIs, and comments have no children, so they are always leaf nodes in an XML document tree. Their parent node is always an element node. In fact, XML states that there must be at least one element in each document, viz., the root node.

Element nodes are much more complex. Not only do they have children, but they also have a name (sometimes called a "generic identifier" or GI), and a set of attributes. Attributes are keyword-value pairs. Each keyword may occur at most once in an attribute set. The value is again a sequence of characters.

Elements can have zero or more children, and the children are ordered--for example, you can talk about the "seventh child node." This is in contrast to the element's attributes, which are not ordered, but form a set. An element with zero children is called an "empty element."[2]

An element node may also have a schema (also called "doctype" or "namespace"). In XML 1.0, only the root element can have a schema (and it implicitly applies to all elements in the document), but in XML 1.1 all elements can have one.

An XML-based application can store data in all the different types of nodes and in all the fields of each node. XML places no restrictions on what each node or field may be used for, with the exception of the schema field. The schema, if it exists, must be a URL and it must point to something that contains documentation for how an application uses this element and its child nodes. Part of that documentation may be machine readable, allowing an automatic validity check of the element.[3]

An ADT for XML

This section defines the four types of nodes and their characteristics with mathematical rigor.

An ADT (Abstract Data Type) is a set of mathematical equations that define exactly what information can be stored in a certain data structure. It doesn't show an actual implementation in a computer program, but rather defines the behavior that all implementations must have in common. It relies on set theory. The data type, in this case XML, is regarded as a set. A number of functions are defined on this set to show how it maps back and forth to other well known sets, such as the set of integers and the set of characters.

Because it uses functions, an ADT may look a bit like an API (Application Program Interface), the documentation of a software library. An ADT can indeed be used as a starting point for writing a software library, but for efficiency reasons, the actual functions implemented are often a bit different from the ADT (although they can be expressed in each other). Whereas the primary goal of an ADT is to make the mathematics easy, an API tries to make programming easy (and efficient).

The structure of this ADT is

 function name : domain -> range

Equations that hold for this function are shown here. Variables implicitly have universal scope (i.e., [forall] X)

The sets that are used are:

char[]

The set of all sequences of zero or more characters. The square brackets ([]) are suggestive of the fact that this is a set consisting of sequences.

nil

The singleton set consisting only of the symbol nil.

error

The singleton set consisting only of the symbol error.

Identifier

The set of all "names." Names are in fact character sequences with a syntax defined in the XML specification. I'll not define that syntax here, except to note that a name never contains a space or angle bracket characters (< or >).

boolean

The set consisting of the two symbols "true" and "false."

int

The set of whole numbers {..., -2, -1, 0, 1, 2,...}

The notation 2^X indicates the power set of X, i.e., the collection of all possible subsets of X.

The ADT for XML is shown in Table 1.

Table 1

        create : Identifier -> XML
        setSchema : XML  URL -> XML
        getSchema : XML -> URL + nil
        getSchema(create(S)) = nil
        getSchema(setSchema(X, U)) = U
        setAttribute : XML  Identifier  char[] -> XML 
        getAttribute : XML  Identifier -> char[] + nil 
        getAttributes : XML -> 2Identifier
        getAttribute(create(S), I) = nil
        getAttribute(setAttribute(X, I, S), I) = S
        I1 <> I2 => getAttribute(setAttribute(X, I1, S), I2) = getAttribute(X, I2)
        getAttributes(create(S)) = {}
        getAttributes(setAttribute(X, I, S)) = getAttributes(X) + {I}   
        getAttribute(X, I) = getAttribute(setSchema(X, U), I)
        getAttributes(X) = getAttributes(setSchema(X, U))
        getSchema(X) = getSchema(setAttribute(X, I, S)) 
        addNode : XML  XML  char -> XML + error 
        setNode : XML  int  XML  char -> XML + error 
        getNode : XML  int -> XML + char + nil
        nrNodes : XML -> int
        isNode : XML  XML -> boolean
        isDescendant : XML  XML -> boolean
        isFamily : XML  XML -> boolean
        isNode(X1, X2) <=> E N : getNode(X1, N) = X2 
        isDescendant(X1, X2) <=> isNode(X1, X2) | (E X3 : isNode(X3, X2) &
        isDescendant(X1, X3))
        isFamily(X1, X2) <=> X1 = X2 | isDescendant(X1, X2) | isDescendant(X2, X1) 
        isFamily(X1, X2) => addNode(X1, X2) = error 
        isFamily(X1, X2) => setNode(X1, N, X2) = error 
        N < 1 => setNode(X1, N, X2) = error
        N > nrNodes(X1) => setNode(X1, N, X2) = error 
        N < 1 => getNode(X, N) = nil
        N > nrNodes(X) => getNode(X, N) = nil
        nrNodes(create(I)) = 0
        isFamily(X1, X2) => nrNodes(addNode(X1, X2)) = nrNodes(X1) + 1   
        isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getNode(setNode(X1, N, X2), N)        
        = X2
        isFamily(X1, X2) & N = nrNodes(X1) + 1 => getNode(addNode(X1, X2)), N) = X2
        isFamily(X1, X2) => getNode(X1, N) = getNode(addNode(X1, X2), N)
        isFamily(X1, X2) & 0 < M <= nrNodes(X1) & N <> M => getNode(setNode(X1, M,        
        X2), N) = getNode(X1, N)         
        getNode(setSchema(X1, U), N) = getNode(X1, N)
        getNode(setAttribute(X1, I, S), N) = getNode(X1, N)
        isFamily(X1, X2) => getAttribute(addNode(X1, X2), I) = getAttribute(X1, I)
        isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getAttribute(setNode(X1, N, X2),        
        I) = getAttribute(X1, I)
        isFamily(X1, X2) => getAttributes(addNode(X1, X1)) = getAttributes(X1)
        isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getAttributes(setNode(X1, N, X2))      
        = getAttributes(X1)
        isFamily(X1, X2) => getSchema(addNode(X1, X2)) = getSchema(X1)
        isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getSchema(setNode(X2, N, X2))      
        = getSchema(X1)
        getName : XML -> Identifier
        getName(create(I)) = I
        getName(setSchema(X, U)) = getName(X)
        getName(setAttribute(X, I, S)) = getName(X)  
        isFamily(X1, X2) => getName(addNode(X1, X2)) = getName(X1) 
        isFamily(X1, X2) & 0 < N <= nrNodes(X1) => getName(setNode(X1, N, X2))     
        = getName(X1)
        deleteAttribute : XML  Identifier -> XML 
        deleteAttribute(create(I1), I2) = create(I1)    
        getSchema(deleteAttribute(X, I)) = getSchema(X) 
        getAttribute(deleteAttribute(X, I), I) = nil
        getAttributes(deleteAttribute(X, I)) = getAttributes(X) \ {I}
        getNode(deleteAttribute(X, I), N) = getNode(X, N)
        getName(deleteAttribute(X, I)) = getName(X) 
        popNode : XML -> XML
        popNode(create(I)) = create(I)
        isFamily(X1, X2) => popNode(addNode(X1, X2)) = X1    
        isFamily(X1, X2) & 0 < N < nrNodes(X1) => popNode(setNode(X1, N, X2))      
        = setNode(popNode(X1), N, X2)
        isFamily(X1, X2) & N = nrNodes(X1) => popNode(setNode(X1, N, X2))       
        = popNode(X1)
        nrNodes(X) > 0 => nrNodes(popNode(X)) = nrNodes(X) - 1 
        N < nrNodes(X) => getNode(popNode(X), N) = getNode(X, N)
        getSchema(popNode(X)) = getSchema(X)
        getAttribute(popNode(X), I) = getAttribute(X)  
        getAttributes(popNode(X)) = getAttributes(X)
        getName(popNode(X)) = getName(X)

Linearizing the Tree

The first step in the marshaling process is a conversion from an XML document to an XML file. As I said earlier, there may be several text files that represent the same document, so this process is non-deterministic. Or, expressed differently, the decisions as to what representation to choose may come from elsewhere, such as from considerations of taste or efficiency.

Converting the tree structure to a linear sequence of characters is a recursive process. For each element the same routine is essentially followed. The process starts at the root element.

To linearize an element node, first check if it has a schema--remember that in XML 1.0 only the root element can have a schema. (It is also likely that XML 1.1 will introduce alternative ways to represent a schema, but for now, there is only one.)

If an element has a schema, write the following text to the XML file:

 <!doctype name "schema">

where name is the name of the element and schema is the schema (a URL). The schema must be quoted and certain characters in it must be escaped: if <, >, &, or " occur in the URL, they must be written as "<", ">", "&amp:", and """ respectively, or as equivalent escape codes (see the "Character encoding" section). It is also possible to use single quotes ('), in which case any single quote in the URL must be escaped. The section "An XML Marshaling Function" gives the possible variations.

Next, write the name of the element, preceded by <, like this:

 <name

XML restrict the possible names of elements, but all we need to know here is that names never contain < or >, nor do they contain spaces, tabs, or newlines.

If the element has attributes, write all of them as follows (the order of the attributes doesn't matter):

 name="value"

again escaping the characters <, >, & and ". Then write >:

The three steps above produce what is know as a "start tag." Some examples of starts tags are:

 
        <p> 
        <sect n="7" mode="public"> 
        <emph>

Next, write all the children recursively. After the children, repeat the name of the element, between </ and >:

 </name>

The reason for this repetition of the name is to make it easier to catch errors that may occur if an XML file is created "by hand," and the author forgets to open or close an element.

A complete linearization of an element "name" with children "first" and "last" may look like this:

 <name><first val="Jan"></first>
        <last val="Klaassen"></last></name>

To linearize a comment node, simply write the content field, which is a sequence of characters, enclosed in :

 <!--content-->

Obviously, the --> sequence may not occur in the content. XML doesn't define a rule to escape this occurrence. The expectation is that comments will usually contain text meant for human eyes rather than for processing by a program--so it doesn't really matter how the --> is avoided. A program that does want to make use of comments will simply have to define its own rule for escaping them.

To linearize a PI node, write the following sequence:

 <?name content?>

If the characters <, >, or & occur in the content, they must be escaped.[4]

To linearize a character node, simply write the character. The one exception is when it is a <, >, &, or a CR (Carriage Return, ASCII 13), in which case it must be written as "<", ">", "&", or "" respectively, or as an equivalent escape (see "Character encoding").

Because XML doesn't distinguish CR from LF (Line Feed, ASCII 10), the CR must be escaped. The reason for this is that text editors usually don't allow you to enter one or the other; what the editor creates depends on whether you are on a Mac (where the editor will insert a CR), on UNIX (an LF), or DOS/Windows (a CR+LF pair). XML defines that all three (CR, LF, and CR+LF) actually stand for an LF. This is also the convention used by the standard C I/O library, and similar to many other formats, such as PostScript.

Figure 1

With the above linearization rules, the structure in Figure 1 can be linearized as follows:

 
        <doc date="16/8/97" author="anonymous"><head><title>Title</title
        ></head><body><!--the body-->body</body></doc>

Though you might expect that you could write the elements below each other, on separate lines, like this:

 
        * <doc date="16/8/97" author= 
                    "anonymous"> 
            <head>
              <title> 
                Title 
              </title> 
            </head> 
            <body> 
              <!--the body--> 
              body
            </body> 
          </doc>

that is not possible.[5] Of course, there may be applications based on XML in which it is possible. One example is MathML, the mathematical markup language in which spaces will likely have no meaning [1].[6]

An XML Marshaling Function

Table 2 shows a formal definition of the above, using functions. The notation for the functions uses nondeterministic if-statements, borrowed from the programming language Simple, which in turn is based on Dijkstra's guarded commands:

        if guard1 : stmt1 
        [] guard2 : stmt2 
        ... 
        [] guardn : stmtn 
        fi

When evaluating this statement, from among all those guards that evaluate to true, one is picked nondeterministically and its statement is executed. The following statement:

        if guard1 : stmt1 
        [] guard2 : stmt2 
        ... 
        [] guardn : stmtn 
        [> stmt0 
        fi

is an abbreviation for

        if guard1 : stmt1 
        [] guard2 : stmt2 
        ... 
        [] guardn : stmtn 
        [] ~guard1 & ~guard2 &... & ~guardn : stmt0 
        fi

The linearization function in Table 2 makes use of a linearization function for Identifiers, called writeIdentifier. We assume that function doesn't produce any characters from the set {<, >, ", ', =, !, ?, <space>, <tab>, <CR>, <LF>}

Table 2

        import writeIdentifier; 
        function writeXML(sofar: char[]; X: XML): char[]; 
        var 
        s1, s2, s3, s4, s5: char; 
        begin 
        s1:= writeSchema(sofar, getSchema(X)); 
        s2:= s1 + opendelim(sofar + s1) + writeIdentifier(getName(X)) 
        + writeAttributes(X, getAttributes(X)); 
        if true : 
        s3:= s2 + closedelim(); 
        s4:= s3 + writeNodes(s3, X, 1); 
        s5:= s4 + opendelim(s3 + s4) 
        [] nrNodes = 0 : 
        s5:= s2 + whitespace(0); 
        fi; 
        writeXML:= s5 + "/" + writeIdentifier(getName(X)) + closedelim(); 
        end; 
        
        function writeSchema(sofar: char[]; url: char[]): char[]; 
        begin 
        if url <> nil : 
        writeSchema:= opendelim(sofar) + "!doctype" 
        + whitespace(1) + writeString(url) + closedelim() 
        [> 
        writeSchema:= "" 
        fi 
        end; 
        
        function whitespace(min: integer): char[]; 
        begin 
        if min <= 0 : whitespace:= "" 
        [] true     : whitespace:= " " + whitespace(min - 1) 
        [] true     : whitespace:= " " + whitespace(min - 1) 
        [] true     : whitespace:= "
        [] true     : whitespace:= "
        fi 
        end; 
        
        function space(min: integer): char[]; 
        begin 
        if min <= 0 : space:= "" 
        [] true     : space:= " " + space(min - 1) 
        [] true     : space:= " " + space(min - 1) 
        fi 
        end; 
        
        function opendelim(sofar: char[]): char[]; 
        
        begin 
        if endsWith(sofar, "
        [] endsWith(sofar, "
        [] endsWith(sofar, "
        [] endsWith(sofar, "
        [] endsWith(sofar, "
        [] endsWith(sofar, ">
        [] endsWith(sofar, ">
        [] endsWith(sofar, ">
        [] ~endsWith(sofar, "
        & ~endsWith(sofar, "
        [] ~endsWith(sofar, "
        & ~endsWith(sofar, "
        [] ~endsWith(sofar, "
        & ~endsWith(sofar, "
        fi 
        end; 
        
        function closedelim(): char[]; 
        
        begin 
        if true : closedelim:= whitespace(0) + ">" 
        [] true : closedelim:= whitespace(0) + ">" + "
        [] true : closedelim:= whitespace(0) + ">" + "
        [] true : closedelim:= whitespace(0) + ">" + "
        fi 
        end; 
        
        function writeString(s: char[]): char[]; 
        begin 
        if true : writeString:= "\"" + writeChars(s, 1, "\"") + "\"" 
        [] true : writeString:= "'" + writeChars(s, 1, "'") + "'" 
        fi 
        end; 
        
        function writeChars(s: char[]; start: integer; avoid: char): char[]; 
        begin 
        if start <= length(s) : 
        writeChars:= writeChar(s[start], avoid) 
        + writeChars(s, start + 1, avoid) 
        [> 
        writeChars:= "" 
        end; 
        
        function writeChar(c, avoid: char): char[]; 
        begin 
        if c = "\"" : writeChar:= """ 
        [] c = "'"  : writeChar:= "'" 
        [] c = "<"  : writeChar:= "<" 
        [] c = ">"  : writeChar:= ">" 
        [] c = "&"  : writeChar:= "&" 
        [] c = "
        [] c = "
        [] c = "
        [] true     : writeChar:= "&#" + decimal(ord(c)) + ";" 
        [] true     : writeChar:= "&#x" + hexadecimal(ord(c)) + ";" 
        [] c <> "<" & c <> "&gt;" & c <> "&" 
        & c <> "
        end; 
        
        function writeAttribute(X: XML; names: set of Identifier; 
        name: Identifier): char[]; 
        begin 
        writeAttribute:= whitespace(1) + writeIdentifier(name) 
        + whitespace(0) + "=" + writeString(getAttribute(X, name)) 
        + writeAttributes(X, names - {name}) 
        end; 
        
        function writeAttributes(X: XML; names: set of Identifier): char[]; 
        begin 
        if names = {} : 
        writeAttributes:= "" 
        [> 
        writeAttributes:= writeAttribute(X, names, getMember(names)) 
        fi 
        end; 
        
        function writeNodes(sofar: char[]; X: XML; n: integer): char[]; 
        var 
        s: char[]; 
        begin 
        if getNode(X, n) isa char : 
        s:= writeChar(getNode(X, n)); 
        writeNodes:= s + writeNodes(s, X, n + 1) 
        [] getNode(X, n) isa XML : 
        s:= writeXML(sofar, getNode(X, n)); 
        writeNodes:= s + writeNodes(s, X, n + 1) 
        [> 
        writeNodes:= "" 
        fi 
        end;

Character Encoding

We have now linearized the XML document. To make it a true marshaling function, however, we have to convert still further, until we reach bytes.

XML could have picked a certain character to byte encoding, but there is a better way. The Internet, and the Web with it, have developed a method for computers to keep using their own character encoding, leaving it to receiving machines (the "clients") to convert text to their own encoding.

The model for exchange of text over the Internet is described in RFC 2130 [2], which was produced after a workshop in February 1996. The model describes various levels at which textual data can be exchanged, the lowest three of which deal with coding characters as numbers (the so-called Coded Character Set), numbers as bytes (Character Encoding Scheme), and with optional further transformations such as compression (Transfer Encoding Syntax).

The model works only if the encoding functions are among the well known ones--in other words, those for which a name has been registered with the Internet Assigned Numbers Authority (IANA).

Most computers in the U.S. use ASCII as the encoding, or ASCII with some extensions. ASCII, as used on the Internet, is both a Coded Character Set and a Character Encoding Scheme. It is a function that assigns bytes to about 100 letters, digits and punctuation marks; one byte for each character. For example, the letter A is encoded as the number 65 and then further encoded as the eight bits 01000001.

Western Europe often uses ISO-8859-1, which encodes about 200 characters, also one byte per character. ISO-8859-1 includes mappings for certain accented characters, which are needed for words like the German süß or the French sucré.

In scripts with more letters, one byte per character is not enough. Japanese, for example, has tens of thousands of characters. Two bytes for each character would have worked, but for various reasons--some historical--many encoding functions are more complicated than that. An example is ISO-2022-JP. The number of bytes needed to encode a character varies, depending on the sequence of characters rather than each individual character.

So, for any computer, an XML file will be encoded with the usual character encoding of that computer (or alternatively, the one chosen by the user). When the file is transferred to another machine, the first machine has to tell the receiving end what encoding is used before it sends the file. This obviously puts some burden on the receiving programs, but this situation is acknowledged on the Internet: there are simply too many programs that produce texts in a given computer's native encoding, as well as too many text files in existence, to be able to switch to a common encoding overnight. Both the IETF and the XML Working Group, however, recommend that new programs use Unicode as the Coded Character Set and UTF-8 as the Character Encoding Scheme. The name "UTF-8" is registered to refer to this combination.

Is it always possible to convert between encodings? In general, no, but XML is a special case. First of all, it is always possible to convert to Unicode, since Unicode contains the characters of all other registered encodings. That's why Unicode has been given a special position in XML.

When characters in an XML document need to be written to an XML file that is encoded with a function that doesn't accept certain characters-- say you want to encode Greek characters in ASCII--then XML allows you to escape the Greek characters in such a way that even ASCII can encode them. This works as follows: first find the character in Unicode, then find the code point (the character's number) that Unicode assigns to the character, and then write the character with the following escape code:

 &#code;

where code is the (decimal) Unicode number. If you prefer hexadecimal numbers, you can also write

 &#xcode;

where code is the same number, but in hexadecimal. For example, the Greek lowercase alpha can be written as

 &#945; or &#x3B1;

This uses only letters that are acceptable in ASCII, and in fact (we're lucky!) in all other character encodings as well.

But there is a problem: though all characters in character nodes, the name and content fields of processing instructions, and the value parts of attributes can be escaped this way, XML doesn't allow these escapes to be used in the name fields of elements and attributes.[7]

That means that to be sure that an XML file can be parsed on all computers on the Internet, only ASCII characters (in fact, only 66 of them) are safe to use for names, which is somewhat of a limitation. Luckily, the number of computers that can process full Unicode is growing--or perhaps, in the future, XML will be changed to allow character escapes in names.

About the Author

Bert Bos

2004, route des Lucioles--B.P. 93

06902 Sophia Antipolis Cedex

France

bert@w3.org

Bert has a degree in Mathematics from the University of Groningen, The Netherlands (1987), and a Ph.D. from the Faculty of Arts of that same university, on a thesis about a rapid-prototyping language for Graphical User Interfaces (1993). Afterwards, he lead a project called PROSA, aimed at helping scholars in the Humanities make better use of the Internet and the World-Wide Web in particular. He joined W3C in France in October 1995, as coordinator for internationalization.

< >&

><