XML: From Bytes to Characters
Choosing a text format for XML has a slight cost in efficiency, since every time an XML file is used by a program, it has to be parsed: i.e., the linear sequence of characters has to be converted into a complex structure in memory. And since the XML format has been kept human-readable and human-writable to some extent, there are many different ways of writing the same XML document--and the program has to understand them all.
There is also another disadvantage of text format: although most computers have a way of dealing with text files, the files aren't always portable. Here's a common problem: somebody creates text with quote marks or accented letters on one computer and gives it to a friend with a different computer. Suddenly the quotes and accents show up as completely different symbols. The problem can get much worse if you exchange files in different languages: have you ever wondered how the Russians manage to get Cyrillic letters on their screen?
In the following sections I'll show how this problem is handled on the Web and how XML itself has provisions for making it portable. But before that, I'll present a definition of an XML document and its representation as a text file that is somewhat different--more precise--than the way XML is usually described.
On the Web, when you retrieve a document from a server, you get a stream of bytes, which your browser then interprets. In many cases (but not always) the stream of bytes existed as a file on the server's disk. Regardless of whether a file ever existed on the server's disk, I will refer to the stream of bytes--or more precisely the stream of characters represented by those bytes--as the XML file.
The way XML is usually explained starts with a description of XML file syntax, followed by rules for parsing them. I'll start at the other end, by defining XML documents as an (abstract) data structure, and then showing how this data structure is converted to a sequence of characters. In other words, rather than explaining how to parse an XML file, I will describe how to generate one. We call the latter process marshaling.[1]
Marshaling (a.k.a. "serializing," "pickling" or "flattening") is a term used in the theory of interprocess communication. When two programs need to exchange some data structure, they have to agree on a way to represent the data as a sequence of bytes. That is exactly what is happening when two programs on different machines exchange an XML document. For XML, however, there is also an intermediate representation as a sequence of characters, before the characters are further encoded as bytes.
An XML document is a tree structure. There is a root node, called the root element, and the other nodes are all descendants of the root. All nodes are of one of four types:
Characters, PIs, and comments have no children, so they are always leaf nodes in an XML document tree. Their parent node is always an element node. In fact, XML states that there must be at least one element in each document, viz., the root node.
Element nodes are much more complex. Not only do they have children, but they also have a name (sometimes called a "generic identifier" or GI), and a set of attributes. Attributes are keyword-value pairs. Each keyword may occur at most once in an attribute set. The value is again a sequence of characters.
Elements can have zero or more children, and the children are ordered--for example, you can talk about the "seventh child node." This is in contrast to the element's attributes, which are not ordered, but form a set. An element with zero children is called an "empty element."[2]
An element node may also have a schema (also called "doctype" or "namespace"). In XML 1.0, only the root element can have a schema (and it implicitly applies to all elements in the document), but in XML 1.1 all elements can have one.
An XML-based application can store data in all the different types of nodes and in all the fields of each node. XML places no restrictions on what each node or field may be used for, with the exception of the schema field. The schema, if it exists, must be a URL and it must point to something that contains documentation for how an application uses this element and its child nodes. Part of that documentation may be machine readable, allowing an automatic validity check of the element.[3]
An ADT (Abstract Data Type) is a set of mathematical equations that define exactly what information can be stored in a certain data structure. It doesn't show an actual implementation in a computer program, but rather defines the behavior that all implementations must have in common. It relies on set theory. The data type, in this case XML, is regarded as a set. A number of functions are defined on this set to show how it maps back and forth to other well known sets, such as the set of integers and the set of characters.
Because it uses functions, an ADT may look a bit like an API (Application Program Interface), the documentation of a software library. An ADT can indeed be used as a starting point for writing a software library, but for efficiency reasons, the actual functions implemented are often a bit different from the ADT (although they can be expressed in each other). Whereas the primary goal of an ADT is to make the mathematics easy, an API tries to make programming easy (and efficient).
The structure of this ADT is
function name : domain -> rangeEquations that hold for this function are shown here. Variables implicitly have universal scope (i.e., [forall] X)
The sets that are used are:
char[]
The ADT for XML is shown in Table 1.
Table 1
create : Identifier -> XML
setSchema : XML URL -> XML
getSchema : XML -> URL + nil
getSchema(create(S)) = nil
getSchema(setSchema(X, U)) = U
setAttribute : XML Identifier char[] -> XML
getAttribute : XML Identifier -> char[] + nil
getAttributes : XML -> 2Identifier
getAttribute(create(S), I) = nil
getAttribute(setAttribute(X, I, S), I) = S
I1 <> I2 => getAttribute(setAttribute(X, I1, S), I2) = getAttribute(X, I2)
getAttributes(create(S)) = {}
getAttributes(setAttribute(X, I, S)) = getAttributes(X) + {I}
getAttribute(X, I) = getAttribute(setSchema(X, U), I)
getAttributes(X) = getAttributes(setSchema(X, U))
getSchema(X) = getSchema(setAttribute(X, I, S))
addNode : XML XML char -> XML + error
setNode : XML int XML char -> XML + error
getNode : XML int -> XML + char + nil
nrNodes : XML -> int
isNode : XML XML -> boolean
isDescendant : XML XML -> boolean
isFamily : XML XML -> boolean
isNode(X1, X2) <=> E N : getNode(X1, N) = X2
isDescendant(X1, X2) <=> isNode(X1, X2) | (E X3 : isNode(X3, X2) &
isDescendant(X1, X3))
isFamily(X1, X2) <=> X1 = X2 | isDescendant(X1, X2) | isDescendant(X2, X1)
isFamily(X1, X2) => addNode(X1, X2) = error
isFamily(X1, X2) => setNode(X1, N, X2) = error
N < 1 => setNode(X1, N, X2) = error
N > nrNodes(X1) => setNode(X1, N, X2) = error
N < 1 => getNode(X, N) = nil
N > nrNodes(X) => getNode(X, N) = nil
nrNodes(create(I)) = 0
ÂisFamily(X1, X2) => nrNodes(addNode(X1, X2)) = nrNodes(X1) + 1
ÂisFamily(X1, X2) & 0 < N <= nrNodes(X1) => getNode(setNode(X1, N, X2), N) = X2
ÂisFamily(X1, X2) & N = nrNodes(X1) + 1 => getNode(addNode(X1, X2)), N) = X2
ÂisFamily(X1, X2) => getNode(X1, N) = getNode(addNode(X1, X2), N)
ÂisFamily(X1, X2) & 0 < M <= nrNodes(X1) & N <> M => getNode(setNode(X1, M, X2), N) = getNode(X1, N)
getNode(setSchema(X1, U), N) = getNode(X1, N)
getNode(setAttribute(X1, I, S), N) = getNode(X1, N)
ÂisFamily(X1, X2) => getAttribute(addNode(X1, X2), I) = getAttribute(X1, I)
ÂisFamily(X1, X2) & 0 < N <= nrNodes(X1) => getAttribute(setNode(X1, N, X2), I) = getAttribute(X1, I)
ÂisFamily(X1, X2) => getAttributes(addNode(X1, X1)) = getAttributes(X1)
ÂisFamily(X1, X2) & 0 < N <= nrNodes(X1) => getAttributes(setNode(X1, N, X2)) = getAttributes(X1)
ÂisFamily(X1, X2) => getSchema(addNode(X1, X2)) = getSchema(X1)
ÂisFamily(X1, X2) & 0 < N <= nrNodes(X1) => getSchema(setNode(X2, N, X2)) = getSchema(X1)
getName : XML -> Identifier
getName(create(I)) = I
getName(setSchema(X, U)) = getName(X)
getName(setAttribute(X, I, S)) = getName(X)
ÂisFamily(X1, X2) => getName(addNode(X1, X2)) = getName(X1)
ÂisFamily(X1, X2) & 0 < N <= nrNodes(X1) => getName(setNode(X1, N, X2)) = getName(X1)
deleteAttribute : XML Identifier -> XML
deleteAttribute(create(I1), I2) = create(I1)
getSchema(deleteAttribute(X, I)) = getSchema(X)
getAttribute(deleteAttribute(X, I), I) = nil
getAttributes(deleteAttribute(X, I)) = getAttributes(X) \ {I}
getNode(deleteAttribute(X, I), N) = getNode(X, N)
getName(deleteAttribute(X, I)) = getName(X)
popNode : XML -> XML
popNode(create(I)) = create(I)
ÂisFamily(X1, X2) => popNode(addNode(X1, X2)) = X1
ÂisFamily(X1, X2) & 0 < N < nrNodes(X1) => popNode(setNode(X1, N, X2)) = setNode(popNode(X1), N, X2)
ÂisFamily(X1, X2) & N = nrNodes(X1) => popNode(setNode(X1, N, X2)) = popNode(X1)
nrNodes(X) > 0 => nrNodes(popNode(X)) = nrNodes(X) - 1
N < nrNodes(X) => getNode(popNode(X), N) = getNode(X, N)
getSchema(popNode(X)) = getSchema(X)
getAttribute(popNode(X), I) = getAttribute(X)
getAttributes(popNode(X)) = getAttributes(X)
getName(popNode(X)) = getName(X)
Converting the tree structure to a linear sequence of characters is a recursive process. For each element the same routine is essentially followed. The process starts at the root element.
To linearize an element node, first check if it has a schema--remember that in XML 1.0 only the root element can have a schema. (It is also likely that XML 1.1 will introduce alternative ways to represent a schema, but for now, there is only one.)
If an element has a schema, write the following text to the XML file:
<!doctype name "schema">where name is the name of the element and schema is the schema (a URL). The schema must be quoted and certain characters in it must be escaped: if <, >, &, or " occur in the URL, they must be written as "<", ">", "&:", and """ respectively, or as equivalent escape codes (see the "Character encoding" section). It is also possible to use single quotes ('), in which case any single quote in the URL must be escaped. The section "An XML Marshaling Function" gives the possible variations.
Next, write the name of the element, preceded by <, like this:
<nameXML restrict the possible names of elements, but all we need to know here is that names never contain < or >, nor do they contain spaces, tabs, or newlines.
If the element has attributes, write all of them as follows (the order of the attributes doesn't matter):
name="value"again escaping the characters <, >, & and ". Then write >:
>The three steps above produce what is know as a "start tag." Some examples of starts tags are:
<p> <sect n="7" mode="public"> <emph>Next, write all the children recursively. After the children, repeat the name of the element, between </ and >:
</name>The reason for this repetition of the name is to make it easier to catch errors that may occur if an XML file is created "by hand," and the author forgets to open or close an element.
A complete linearization of an element "name" with children "first" and "last" may look like this:
<name><first val="Jan"></first><last val="Klaassen"></last></name>To linearize a comment node, simply write the content field, which is a sequence of characters, enclosed in <!-- and -->:
<!--content-->Obviously, the --> sequence may not occur in the content. XML doesn't define a rule to escape this occurrence. The expectation is that comments will usually contain text meant for human eyes rather than for processing by a program--so it doesn't really matter how the --> is avoided. A program that does want to make use of comments will simply have to define its own rule for escaping them.
To linearize a PI node, write the following sequence:
<?name content?>If the characters <, >, or & occur in the content, they must be escaped.[4]
To linearize a character node, simply write the character. The one exception is when it is a <, >, &, or a CR (Carriage Return, ASCII 13), in which case it must be written as "<", ">", "&", or " " respectively, or as an equivalent escape (see "Character encoding").
Because XML doesn't distinguish CR from LF (Line Feed, ASCII 10), the CR must be escaped. The reason for this is that text editors usually don't allow you to enter one or the other; what the editor creates depends on whether you are on a Mac (where the editor will insert a CR), on UNIX (an LF), or DOS/Windows (a CR+LF pair). XML defines that all three (CR, LF, and CR+LF) actually stand for an LF. This is also the convention used by the standard C I/O library, and similar to many other formats, such as PostScript.

Figure 1
With the above linearization rules, the structure in Figure 1 can be linearized as follows:
<doc date="16/8/97" author="anonymous"><head><title>Title</title ></head><body><!--the body-->body</body></doc>Though you might expect that you could write the elements below each other, on separate lines, like this:
* <doc date="16/8/97" author=
"anonymous">
<head>
<title>
Title
</title>
</head>
<body>
<!--the body-->
body
</body>
</doc>
that is not possible.[5] Of course, there may be applications based on XML in which it is possible. One example is MathML, the mathematical markup language in which spaces will likely have no meaning [1].[6]
if guard1 : stmt1 [] guard2 : stmt2 ... [] guardn : stmtn fiWhen evaluating this statement, from among all those guards that evaluate to true, one is picked nondeterministically and its statement is executed. The following statement:
if guard1 : stmt1 [] guard2 : stmt2 ... [] guardn : stmtn [> stmt0 fiis an abbreviation for
if guard1 : stmt1 [] guard2 : stmt2 ... [] guardn : stmtn [] ~guard1 & ~guard2 &... & ~guardn : stmt0 fiThe linearization function in Table 2 makes use of a linearization function for Identifiers, called writeIdentifier. We assume that function doesn't produce any characters from the set {<, >, ", ', =, !, ?, <space>, <tab>, <CR>, <LF>}
Table 2
import writeIdentifier;
function writeXML(sofar: char[]; X: XML): char[];
var
s1, s2, s3, s4, s5: char;
begin
s1:= writeSchema(sofar, getSchema(X));
s2:= s1 + opendelim(sofar + s1) + writeIdentifier(getName(X))
+ writeAttributes(X, getAttributes(X));
if true :
s3:= s2 + closedelim();
s4:= s3 + writeNodes(s3, X, 1);
s5:= s4 + opendelim(s3 + s4)
[] nrNodes = 0 :
s5:= s2 + whitespace(0);
fi;
writeXML:= s5 + "/" + writeIdentifier(getName(X)) + closedelim();
end;
function writeSchema(sofar: char[]; url: char[]): char[];
begin
if url <> nil :
writeSchema:= opendelim(sofar) + "!doctype"
+ whitespace(1) + writeString(url) + closedelim()
[>
writeSchema:= ""
fi
end;
function whitespace(min: integer): char[];
begin
if min <= 0 : whitespace:= ""
[] true : whitespace:= " " + whitespace(min - 1)
[] true : whitespace:= " " + whitespace(min - 1)
[] true : whitespace:= "
[] true : whitespace:= "
fi
end;
function space(min: integer): char[];
begin
if min <= 0 : space:= ""
[] true : space:= " " + space(min - 1)
[] true : space:= " " + space(min - 1)
fi
end;
function opendelim(sofar: char[]): char[];
begin
if endsWith(sofar, "
[] endsWith(sofar, "
[] endsWith(sofar, "
[] endsWith(sofar, "
[] endsWith(sofar, "
[] endsWith(sofar, ">
[] endsWith(sofar, ">
[] endsWith(sofar, ">
[] ~endsWith(sofar, "
& ~endsWith(sofar, "
[] ~endsWith(sofar, "
& ~endsWith(sofar, "
[] ~endsWith(sofar, "
& ~endsWith(sofar, "
fi
end;
function closedelim(): char[];
begin
if true : closedelim:= whitespace(0) + ">"
[] true : closedelim:= whitespace(0) + ">" + "
[] true : closedelim:= whitespace(0) + ">" + "
[] true : closedelim:= whitespace(0) + ">" + "
fi
end;
function writeString(s: char[]): char[];
begin
if true : writeString:= "\"" + writeChars(s, 1, "\"") + "\""
[] true : writeString:= "'" + writeChars(s, 1, "'") + "'"
fi
end;
function writeChars(s: char[]; start: integer; avoid: char): char[];
begin
if start <= length(s) :
writeChars:= writeChar(s[start], avoid)
+ writeChars(s, start + 1, avoid)
[>
writeChars:= ""
end;
function writeChar(c, avoid: char): char[];
begin
if c = "\"" : writeChar:= """
[] c = "'" : writeChar:= "&squot;"
[] c = "<" : writeChar:= "<"
[] c = ">" : writeChar:= ">"
[] c = "&" : writeChar:= "&"
[] c = "
[] c = "
[] c = "
[] true : writeChar:= "&#" + decimal(ord(c)) + ";"
[] true : writeChar:= "&#x" + hexadecimal(ord(c)) + ";"
[] c <> "<" & c <> ">" & c <> "&"
& c <> "
end;
function writeAttribute(X: XML; names: set of Identifier;
name: Identifier): char[];
begin
writeAttribute:= whitespace(1) + writeIdentifier(name)
+ whitespace(0) + "=" + writeString(getAttribute(X, name))
+ writeAttributes(X, names - {name})
end;
function writeAttributes(X: XML; names: set of Identifier): char[];
begin
if names = {} :
writeAttributes:= ""
[>
writeAttributes:= writeAttribute(X, names, getMember(names))
fi
end;
function writeNodes(sofar: char[]; X: XML; n: integer): char[];
var
s: char[];
begin
if getNode(X, n) isa char :
s:= writeChar(getNode(X, n));
writeNodes:= s + writeNodes(s, X, n + 1)
[] getNode(X, n) isa XML :
s:= writeXML(sofar, getNode(X, n));
writeNodes:= s + writeNodes(s, X, n + 1)
[>
writeNodes:= ""
fi
end; XML could have picked a certain character to byte encoding, but there is a better way. The Internet, and the Web with it, have developed a method for computers to keep using their own character encoding, leaving it to receiving machines (the "clients") to convert text to their own encoding.
The model for exchange of text over the Internet is described in RFC 2130 [2], which was produced after a workshop in February 1996. The model describes various levels at which textual data can be exchanged, the lowest three of which deal with coding characters as numbers (the so-called Coded Character Set), numbers as bytes (Character Encoding Scheme), and with optional further transformations such as compression (Transfer Encoding Syntax).
The model works only if the encoding functions are among the well known ones--in other words, those for which a name has been registered with the Internet Assigned Numbers Authority (IANA).
Most computers in the U.S. use ASCII as the encoding, or ASCII with some extensions. ASCII, as used on the Internet, is both a Coded Character Set and a Character Encoding Scheme. It is a function that assigns bytes to about 100 letters, digits and punctuation marks; one byte for each character. For example, the letter A is encoded as the number 65 and then further encoded as the eight bits 01000001.
Western Europe often uses ISO-8859-1, which encodes about 200 characters, also one byte per character. ISO-8859-1 includes mappings for certain accented characters, which are needed for words like the German süß or the French sucré.
In scripts with more letters, one byte per character is not enough. Japanese, for example, has tens of thousands of characters. Two bytes for each character would have worked, but for various reasons--some historical--many encoding functions are more complicated than that. An example is ISO-2022-JP. The number of bytes needed to encode a character varies, depending on the sequence of characters rather than each individual character.
So, for any computer, an XML file will be encoded with the usual character encoding of that computer (or alternatively, the one chosen by the user). When the file is transferred to another machine, the first machine has to tell the receiving end what encoding is used before it sends the file. This obviously puts some burden on the receiving programs, but this situation is acknowledged on the Internet: there are simply too many programs that produce texts in a given computer's native encoding, as well as too many text files in existence, to be able to switch to a common encoding overnight. Both the IETF and the XML Working Group, however, recommend that new programs use Unicode as the Coded Character Set and UTF-8 as the Character Encoding Scheme. The name "UTF-8" is registered to refer to this combination.
Is it always possible to convert between encodings? In general, no, but XML is a special case. First of all, it is always possible to convert to Unicode, since Unicode contains the characters of all other registered encodings. That's why Unicode has been given a special position in XML.
When characters in an XML document need to be written to an XML file that is encoded with a function that doesn't accept certain characters-- say you want to encode Greek characters in ASCII--then XML allows you to escape the Greek characters in such a way that even ASCII can encode them. This works as follows: first find the character in Unicode, then find the code point (the character's number) that Unicode assigns to the character, and then write the character with the following escape code:
&#code;where code is the (decimal) Unicode number. If you prefer hexadecimal numbers, you can also write
ode;where code is the same number, but in hexadecimal. For example, the Greek lowercase alpha can be written as
α or αThis uses only letters that are acceptable in ASCII, and in fact (we're lucky!) in all other character encodings as well.
But there is a problem: though all characters in character nodes, the name and content fields of processing instructions, and the value parts of attributes can be escaped this way, XML doesn't allow these escapes to be used in the name fields of elements and attributes.[7]
That means that to be sure that an XML file can be parsed on all computers on the Internet, only ASCII characters (in fact, only 66 of them) are safe to use for names, which is somewhat of a limitation. Luckily, the number of computers that can process full Unicode is growing--or perhaps, in the future, XML will be changed to allow character escapes in names.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.