A Guide to XML
October 2, 1997
This article provides a technical introduction to XML with an eye towards guiding the reader to appropriate sections of the XML specification when greater technical detail is desired. This introduction is geared towards a reader with some HTML or SGML experience, although that experience is not absolutely necessary. The XML Link and XML Style specifications are also briefly outlined.
Because other articles in this issue of the Web Journal describe the motivations for XML and some of its goals, this article is intended to serve as a slightly more technical introduction to XML and as an overview of the specification. Throughout this document you will find references of the form [Section 1]; these are references to the XML language specification included in this issue. If you are interested in more technical detail about a particular topic, please consult the specification.
Understanding the Specs
For the most part, reading and understanding the XML specifications does not require extensive knowledge of SGML or any of the related technologies.
One topic that may be new is the use of EBNF to describe the syntax of XML. Please consult the discussion of EBNF in Appendix A for a detailed description of how this grammar works.
What Do XML Documents Look Like?
If you are conversant with SGML or HTML, XML documents will look familiar. Here is a simple XML document:
<?XML version="1.0"?>
<oldjoke>
<burns>Say <quote>goodnight</quote>, Gracie.</burns>
<allen><quote>Goodnight, Gracie.</quote></allen>
<applause/>
</oldjoke>
A few things may stand out to you:
- The document begins with a processing instruction: <?XML ...?>. This is the XML markup declaration [Section 2.9]. While it is not required, its presence explicitly identifies the document as an XML document and indicates the version of XML to which it was authored.
- There's no document type declaration. Unlike SGML, XML does not require a document type declaration. However, a document type declaration can be supplied, and some documents will require one.
- Empty elements (<applause/> in the example above) have a modified syntax. While most elements in a document are wrappers around some content, empty elements are simply markers where something occurs (a horizontal rule for HTML's hr tag, for example, or an xref cross reference in DocBook). The trailing slash in the modified syntax, <name/>, indicates to a program processing the XML document that the element is empty and no matching end-tag should be sought. Since XML documents do not require a document type declaration, without this clue it could be impossible for an XML parser to determine which tags were intentionally empty and which had been left empty by mistake.
In a very recent modification to the specification, another alternate syntax has been introduced for empty elements which allows the end-tag to be present, if it immediately follows the start-tag. Under this syntax, <applause></applause> would be acceptable as well.
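As a minimal sketch of the points above, the example document can be handed to a present-day parser. The sketch assumes Python's standard xml.etree.ElementTree module, which implements the final XML 1.0 syntax, so the declaration is written in lowercase (<?xml ...?>) rather than as it appears in this article. The names DOC, child_names, and is_empty are illustrative only.

```python
import xml.etree.ElementTree as ET

# The example document; note that no document type declaration is supplied.
DOC = """<?xml version="1.0"?>
<oldjoke>
<burns>Say <quote>goodnight</quote>, Gracie.</burns>
<allen><quote>Goodnight, Gracie.</quote></allen>
<applause/>
</oldjoke>"""

root = ET.fromstring(DOC)
child_names = [child.tag for child in root]


def is_empty(element):
    """True if the element has no child elements and no character data."""
    return len(element) == 0 and not element.text


# Both spellings of an empty element should look the same to an application.
short_form = ET.fromstring("<applause/>")
long_form = ET.fromstring("<applause></applause>")
```

Under these assumptions the parser accepts the document without any declarations, and the two empty-element spellings are indistinguishable once parsed.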
XML documents are composed of markup and content. There are six kinds of markup that can occur in an XML document: elements, entity references, comments, processing instructions, marked sections, and document type declarations. The following sections introduce each of these markup concepts.
Elements are the most common form of markup. Delimited by angle brackets (< >), most elements identify the nature of the content they surround. Some elements may be empty, as seen above, in which case they have no content. If an element is not empty, it begins with a start-tag, <element>, and ends with an end-tag, </element>.
Attributes are name-value pairs that occur inside tags after the element name. For example, <div class="preface"> is the div element with the attribute class having the value preface. In XML, all attribute values must be quoted.
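The quoting rule can be checked with a short sketch, again assuming Python's xml.etree.ElementTree as the parser; the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET

# Read the value of the class attribute from the div example.
div = ET.fromstring('<div class="preface">...</div>')
class_value = div.get("class")


def parses(text):
    """Return True if the parser accepts the text, False if it rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


quoted_ok = parses('<div class="preface"/>')
unquoted_ok = parses('<div class=preface/>')  # HTML-style unquoted value
```

The unquoted form, which many HTML authors are used to, is rejected outright instead of being silently repaired.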
In order to introduce markup into a document, some characters have been reserved to identify the start of markup. The left angle bracket (<), for instance, identifies the beginning of an element start- or end-tag. In order to insert these characters into your document as content, there must be an alternative way to represent them. In XML, entities are used to represent these special characters. Entities are also used to refer to often repeated or varying text and to include the content of external files.
Every entity must have a unique name. Defining your own entity names is discussed in the section "Entity Declarations" below. In order to use an entity, you simply reference it by name. Entity references begin with the ampersand character (&) and end with a semicolon (;).
For example, the amp entity inserts a literal ampersand into a document. So the string "O'Reilly & Associates, Inc." can be represented in an XML document as O'Reilly &amp; Associates, Inc.
A special form of entity reference, called a character reference [Section 4.2], can be used to insert arbitrary Unicode characters into your document. This is a mechanism for inserting characters that cannot be directly typed.
Character references take one of two forms:
- Decimal references (&#8478;)
- Hexadecimal references (&#x211E;)
Both of these refer to character number U+211E from Unicode (which is the standard Rx prescription symbol).
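A small sketch shows how entity and character references reach an application once a parser has resolved them. It assumes Python's xml.etree.ElementTree; the element names p and rx are made up for the illustration.

```python
import xml.etree.ElementTree as ET

# The amp entity reference is replaced by a literal ampersand.
publisher = ET.fromstring("<p>O'Reilly &amp; Associates, Inc.</p>").text

# The decimal and hexadecimal forms name the same character, U+211E.
decimal = ET.fromstring("<rx>&#8478;</rx>").text
hexadecimal = ET.fromstring("<rx>&#x211E;</rx>").text
```

The application never sees the references themselves, only the characters they stand for.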
Comments begin with <!-- and end with -->. Comments can contain any data except the literal string "--". You can place comments between markup anywhere in your document.
Comments are not part of the textual content of an XML document; an XML processor is not required to pass them along to an application.
Processing instructions (PIs) are an escape hatch to provide information to an application. Like comments, they are not textually part of the XML document, but the XML processor is required to pass them to an application.
Processing instructions have the form: <?name pidata?>. The name, called the PI target, identifies the PI to the application. Applications should process only the targets they recognize and ignore all other PIs. Any data that follows the PI target is optional; the data is for the application that recognizes the target. The names used in PIs may be declared as notations in order to formally identify them.
PI names beginning with XML are reserved for XML standardization.
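The following sketch separates comments, processing instructions, and text as an application might receive them. It assumes Python's xml.dom.minidom, which happens to keep comments (a processor is allowed to, but is not required to); the PI target spellcheck and its data are invented for the example.

```python
from xml.dom import minidom

DOC = "<doc><!-- a comment --><?spellcheck lang=en?>text</doc>"
nodes = minidom.parseString(DOC).documentElement.childNodes

# Comments are available here, but are not part of the textual content.
comments = [n.data for n in nodes if n.nodeType == n.COMMENT_NODE]

# Each PI is delivered as a (target, data) pair for the application to act on.
instructions = [
    (n.target, n.data)
    for n in nodes
    if n.nodeType == n.PROCESSING_INSTRUCTION_NODE
]

# Only the character data counts as text.
text = "".join(n.data for n in nodes if n.nodeType == n.TEXT_NODE)
```

An application that does not recognize the spellcheck target would simply skip that pair.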
In a document, a CDATA section instructs the parser to ignore most markup characters.
Consider a source code listing in an XML document. It might contain characters that the XML parser would ordinarily recognize as markup (< and &, for example). In order to prevent this, a CDATA section can be used.
<![CDATA[
*p = &q;
b = (i <= 3);
]]>
Between the start of the section <![CDATA[ and the end of the section, ]]>, all character data is passed directly to the application. The only string that cannot occur in a CDATA section is ]]>.
Comments are not recognized in a CDATA section. If present, the literal text <!--comment--> will be passed directly to the application.
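To see what the application receives, here is a sketch that assumes Python's xml.etree.ElementTree; the element name listing is hypothetical.

```python
import xml.etree.ElementTree as ET

# Markup characters and even a comment-like string sit inside the section.
DOC = "<listing><![CDATA[*p = &q; b = (i <= 3); <!--comment-->]]></listing>"

# The section delimiters disappear; everything between them arrives verbatim.
listing = ET.fromstring(DOC).text
```

Without the CDATA section, the same content would have to be written with &amp; and &lt; in place of the literal characters.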
Document Type Declarations
A large percentage of the XML specification deals with various sorts of declarations that are allowed in XML. If you have experience with SGML, you will recognize these declarations from SGML DTDs (Document Type Definitions). If you have never seen them before, their significance may not be immediately obvious.
One of the greatest strengths of XML is that it allows you to create your own tag names. But for any given application, it is probably not meaningful for tags to occur in a completely arbitrary order. Consider the old joke example introduced earlier. Would this be meaningful?
<quote><oldjoke>Goodnight, <applause/>Gracie</oldjoke></quote>
<burns><gracie>Say <quote>goodnight</quote>, </gracie>Gracie.</burns>
It's so far outside the bounds of what we normally expect that it's nonsensical. It just doesn't mean anything.
However, from a strictly syntactic point of view, there's nothing wrong with that XML document. So, if the document is to have meaning, and certainly if you're writing a stylesheet to present it, there must be some constraint on the sequence and nesting of tags. Declarations are where these constraints can be expressed.
More generally, declarations allow a document to communicate meta-information to the parser about its content. Meta-information includes the allowed sequence and nesting of tags, attribute values and their types and defaults, the names of external files that may be referenced and whether or not they contain XML, the formats of some external (non-XML) data that may be included, and entities that may be encountered.
There are four kinds of declarations in XML: element declarations, attribute list declarations, entity declarations, and notation declarations.
Element declarations [Section 3.2] identify the names of elements and the nature of their content. A typical element declaration looks like this:
<!ELEMENT oldjoke (burns+, allen, applause?)>
This declaration identifies the element named oldjoke. Its "content model" follows the element name. The content model defines what an element may contain. In this case, an oldjoke must contain burns and allen and may contain applause. The commas (,) between element names indicate that they must occur in succession. The plus (+) after burns indicates that it may be repeated more than once but must occur at least once. The question mark (?) after applause indicates that it is optional. A name with no punctuation, such as allen, must occur exactly once.
Declarations for burns, allen, applause, and all other elements used in any content model must also be present for an XML processor to check the validity of a document.
In addition to element names, the special symbol #PCDATA is reserved to indicate character data. The moniker PCDATA stands for "parsed character data."
Elements with both element content [Section 3.2.1] and PCDATA content are said to have "mixed content" [Section 3.2.2].
For example, the definition for burns might be
<!ELEMENT burns (#PCDATA | quote)*>
The vertical bar (|) indicates an "or" relationship and the asterisk (*) indicates that the content is optional (may occur zero or more times); therefore, by this definition, burns may contain zero or more characters and quote tags. All content models that include PCDATA must have this form: PCDATA must come first, all of the elements must be separated by vertical bars, and the entire group must be optional.
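Mixed content shows up in a parsed document as character data interleaved with child elements. The sketch below assumes Python's xml.etree.ElementTree, which stores the text before the first child in text and the text after a child in that child's tail; those attribute names belong to that library, not to XML.

```python
import xml.etree.ElementTree as ET

burns = ET.fromstring("<burns>Say <quote>goodnight</quote>, Gracie.</burns>")
quote = burns[0]

# Character data before, inside, and after the quote element.
pieces = [burns.text, quote.text, quote.tail]

# All character data of burns, in document order, with the markup removed.
full_text = "".join(burns.itertext())
```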
Two other content models are possible:
- EMPTY indicates that the element has no content (and consequently no end-tag)
- ANY indicates that any content is allowed
The ANY content model is sometimes useful during document conversion, but should be avoided at almost any cost in a production environment because it disables all content checking in that element.
Here is a complete set of element declarations for the example used at the beginning of this article:
<!ELEMENT oldjoke (burns+, allen, applause?)>
<!ELEMENT burns (#PCDATA | quote)*>
<!ELEMENT allen (#PCDATA | quote)*>
<!ELEMENT quote (#PCDATA)*>
<!ELEMENT applause EMPTY>
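One way to picture how a content model constrains an element is to treat it as a regular expression over the names of the child elements. The sketch below does this for the simplest case only: a comma-separated sequence of names, each optionally followed by ?, +, or *. Choice groups, nested groups, and #PCDATA are deliberately left out, and the helpers model_to_regex and matches are hypothetical; this is not how any particular validating parser is implemented.

```python
import re
import xml.etree.ElementTree as ET


def model_to_regex(model):
    """Turn a model such as '(burns+, allen, applause?)' into a regex."""
    parts = []
    for item in model.strip("() ").split(","):
        item = item.strip()
        # Split off an occurrence indicator if one is present.
        if item[-1] in "?+*":
            name, suffix = item[:-1], item[-1]
        else:
            name, suffix = item, ""
        # Each child name is followed by one space in the string we match.
        parts.append(f"(?:{re.escape(name)} ){suffix}")
    return re.compile("".join(parts))


def matches(element, model):
    """Check the sequence of child element names against the model."""
    children = "".join(child.tag + " " for child in element)
    return model_to_regex(model).fullmatch(children) is not None


MODEL = "(burns+, allen, applause?)"
good = ET.fromstring("<oldjoke><burns/><burns/><allen/></oldjoke>")
bad = ET.fromstring("<oldjoke><allen/><burns/></oldjoke>")
good_ok = matches(good, MODEL)
bad_ok = matches(bad, MODEL)
```

The first document satisfies the model (burns twice, then allen, no applause); the second fails because allen comes before burns.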
Attribute declarations [Section 3.3] identify which elements may have attributes, what attributes they may have, what values the attributes may hold, and what default value each attribute has. A typical attribute declaration looks like this:
<!ATTLIST oldjoke
  name    ID                    #REQUIRED
  label   CDATA                 #IMPLIED
  status  ( funny | notfunny )  'funny'>
In this example, the oldjoke element has three attributes:
- name, which is an ID and is required
- label, which is a string (character data) and is not required
- status, which must be either funny or notfunny and defaults to funny if not specified.
Each attribute in a declaration has three parts: a name, a type, and a default value. You are free to select any name you wish, subject to some slight restrictions [Section 1.5, production 5], but names cannot be repeated on the same element. There are six possible attribute types:
- CDATA attributes are strings; any text is allowed. Don't confuse CDATA attributes with CDATA sections. In CDATA attributes, markup is recognized; specifically, entity references are expanded.
- The value of an ID attribute must be a name [Section 1.5, production 5]. All of the ID values used in a document must be different. IDs uniquely identify individual elements in a document. Elements can have only a single ID attribute.
- An IDREF attribute's value must be the value of a single ID attribute on some element in the document. The value of an IDREFS attribute may contain multiple IDREF values separated by whitespace.
- An ENTITY attribute's value must be the name of a single entity (see the discussion of entity declarations below). The value of an ENTITIES attribute may contain multiple ENTITY values separated by whitespace.
- Name token attributes are a restricted form of string attribute. In general, an NMTOKEN attribute must consist of a single word [Section 1.5, production 7], but there are no additional constraints on the word; it doesn't have to match another attribute or declaration. The value of an NMTOKENS attribute may contain multiple NMTOKEN values separated by whitespace.
- You can specify that the value of an attribute must be taken from a specific list of names. This is frequently called an "enumerated type" because each of the possible values is explicitly enumerated in the declaration. Additionally, you can specify that the names must match a particular notation name (see the "Notation Declarations" section below).
There are four kinds of default values:
- #REQUIRED: The attribute must have an explicitly specified value on every occurrence of the element in the document.
- #IMPLIED: The attribute value is not required, and no default value is provided. If a value is not specified, the XML processor must proceed without one.
- "value": An attribute can be given any legal value as a default. The attribute value is not required on each element in the document, but if it is not present, it will appear to be the specified default.
- #FIXED "value": An attribute declaration may specify that an attribute has a fixed value. In this case, the attribute is not required, but if it occurs, it must have the specified value. One use for fixed attributes is to associate semantics with an element. A complete discussion is beyond the scope of this article, but you can find several examples of fixed attributes in the XLL specification.
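Default values can be observed with a parser that reads declarations placed directly in the document. The sketch below assumes Python's xml.etree.ElementTree, whose underlying parser is non-validating but does apply attribute defaults declared in the internal declarations. The required name attribute is left out of the declaration because the sketch is about defaults only.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE oldjoke [
<!ATTLIST oldjoke
  label   CDATA                 #IMPLIED
  status  ( funny | notfunny )  'funny'>
]>
<oldjoke/>"""

joke = ET.fromstring(DOC)

# status was not written on the element, so the declared default appears.
status = joke.get("status")

# label has no default; the application must proceed without a value.
label = joke.get("label")
```

If the declaration were not read, status would also be missing, which is why documents that rely on defaults need at least part of their declarations processed.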
Entity declarations [Section 4.3] allow you to associate a name with some other fragment of the document. That construct can be a chunk of regular text, a chunk of the document type declaration, or a reference to an external file containing either text or binary data.
Here are a few typical entity declarations:
<!ENTITY ATI "ArborText, Inc.">
<!ENTITY boilerplate SYSTEM "/standard/legalnotice.xml">
<!ENTITY ATIlogo SYSTEM "/standard/logo.gif" NDATA GIF87A>
There are three kinds of entities:
- The first entity in the preceding example is an internal entity [Section 4.3.1] because the replacement text is stored in the declaration. Using &ATI; anywhere in the document inserts "ArborText, Inc." at that location. Internal entities allow you to define shortcuts for frequently typed text or text that is expected to change, such as the revision status of a document.
- Internal entities can include references to other internal entities, but it is an error for them to be recursive.
- The XML specification predefines five internal entities:
- &lt; produces the left angle bracket (<)
- &gt; produces the right angle bracket (>)
- &amp; produces the ampersand (&)
- &apos; produces a single quote character (')
- &quot; produces a double quote character (")
- The second and third entities are external entities [Section 4.3.2].
- Using &boilerplate; will have the effect of inserting the contents of the file /standard/legalnotice.xml at that location in the document when it is processed. The XML processor will parse the content of that file as if its content had been typed at the location of the entity reference.
- The entity ATIlogo is also an external entity, but its content is binary. The ATIlogo entity can only be used as the value of an ENTITY (or ENTITIES) attribute (on a graphic element, perhaps). The XML processor will pass this information along to an application, but it does not attempt to process the content of /standard/logo.gif.
- External entities allow an XML document to refer to an external file. External entities contain either text or binary data. If they contain text, the content of the external file is inserted at the point of reference and parsed as part of the referring document. Binary data is not parsed and may only be referenced in an attribute. Binary data is used to reference figures and other non-XML content in the document.
- Parameter entities can occur only in the document type declaration. A parameter entity is identified by placing "% " (percent-space) in front of its name in the declaration. The percent sign is also used in references to parameter entities, instead of the ampersand. Parameter entity references are immediately expanded in the document type declaration and their replacement text is part of the declaration, whereas normal entity references are not expanded.
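The behavior of internal entities can be sketched with Python's xml.etree.ElementTree, assuming the declarations are placed in the document itself. The entity maker and the element doc are invented for the example; ATI is the entity from the declarations above.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE doc [
<!ENTITY ATI "ArborText, Inc.">
<!ENTITY maker "made by &ATI;">
]>
<doc>Tools &maker;</doc>"""

# maker refers to ATI; both are expanded when &maker; is referenced.
expanded = ET.fromstring(DOC).text

# Without a declaration, a reference to ATI is an error.
try:
    ET.fromstring("<doc>&ATI;</doc>")
    undeclared_accepted = True
except ET.ParseError:
    undeclared_accepted = False
```

External entities are not covered by this sketch; whether and how a parser fetches them depends on the parser and its configuration.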
Notation declarations [Section 4.6] identify specific types of external binary data. This information is passed to the processing application, which may use it however it wishes to. A typical notation declaration is:
<!NOTATION GIF87A SYSTEM "GIF">
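How notation information might be passed along to an application can be sketched with the handler interface of Python's xml.parsers.expat module. The element name graphic and the dictionaries notations and unparsed are assumptions of the sketch; the declarations are the ones used in this article.

```python
from xml.parsers import expat

DOC = """<!DOCTYPE graphic [
<!NOTATION GIF87A SYSTEM "GIF">
<!ENTITY ATIlogo SYSTEM "/standard/logo.gif" NDATA GIF87A>
]>
<graphic/>"""

notations = {}  # notation name -> system identifier
unparsed = {}   # entity name -> (system identifier, notation name)


def on_notation(name, base, system_id, public_id):
    notations[name] = system_id


def on_unparsed_entity(name, base, system_id, public_id, notation_name):
    unparsed[name] = (system_id, notation_name)


parser = expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.Parse(DOC, True)
```

The parser never opens /standard/logo.gif; it only reports the names, leaving it to the application to decide what a GIF87A resource means.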
Do I Need a Document Type Declaration?
As we've seen, XML content can be processed without a declaration. However, there are some instances where the declaration is required:
- Most authoring environments need to read and process document type declarations in order to understand and enforce the content models of the document.
- If an XML document relies on default attribute values, at least part of the declaration must be processed in order to obtain the correct default values.
- Whitespace handling [Section 2.8] is a subtle issue. Consider the following content fragment:
<oldjoke>
<burns>Say <quote>goodnight</quote>, Gracie.</burns>
- Is the whitespace (the new line between oldjoke and burns) significant? Probably not. But how can you tell? You can only determine if whitespace is significant if you know the content model of the elements in question. In a nutshell, whitespace is significant in mixed content and is insignificant in element content.
- The rule for XML processors is that in the absence of a declaration that identifies the content model of an element, all whitespace is significant. If you need precise control over whitespace handling, you must provide a declaration.
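The whitespace rule can be observed directly. The sketch below assumes Python's xml.etree.ElementTree and supplies no declarations, so the parser has no way of knowing that oldjoke has element content.

```python
import xml.etree.ElementTree as ET

FRAGMENT = """<oldjoke>
<burns>Say <quote>goodnight</quote>, Gracie.</burns>
</oldjoke>"""

root = ET.fromstring(FRAGMENT)

# The new line between <oldjoke> and <burns> is reported as character data.
leading = root.text

# So is the new line between </burns> and </oldjoke>.
trailing = root[0].tail
```

An application that wants to discard such whitespace has to decide for itself which elements have element content, or rely on a declaration that says so.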
In applications where a person composes or edits the data (as opposed to data that may be generated directly from a database, for example), a DTD is probably going to be required if any structure is to be guaranteed.
Including a Document Type Declaration
If present, the document type declaration must be the first thing in the document after optional processing instructions and comments [Section 2.9].
The document type declaration identifies the root element of the document and may contain additional declarations. All XML documents must have a single root element that contains all of the content in the document. Additional declarations may come from an external definition (a DTD), may be included directly in the document, or both:
<?XML version="1.0" rmd="internal"?>
<!DOCTYPE chapter SYSTEM "dbook.dtd" [
<!ELEMENT ulink (#PCDATA)*>
<!ATTLIST ulink
  xml-link        CDATA   #FIXED "SIMPLE"
  xml-attributes  CDATA   #FIXED "HREF URL"
  URL             CDATA   #REQUIRED>
]>
<chapter>...</chapter>
This example references an external DTD, dbook.dtd, and includes element and attribute declarations for the ulink element. In this case, ulink is being given the semantics of a simple link from the XLL specification.
In order to determine if a document is valid, the XML processor must read the entire document type declaration (both internal and external). But for some applications, validity may not be required, and it may be sufficient for the processor to read only the internal declaration. In the example above, if validity is unimportant and the only reason to read the doctype declaration is to identify the semantics of ulink, reading the external definition is not necessary.
You can communicate this information in the required markup declaration [Section 2.10]. The required markup declaration, rmd="internal", rmd="all", or rmd="none" occurs in the XML markup declaration. A value of internal indicates that only the internal declarations need be processed. A value of all indicates that both the internal and external declarations must be processed. A value of none indicates that the document can be processed without reading either declaration.
If both internal and external declarations are present, the XML processor reads the internal declaration before the external declaration. This is important if the declarations contain duplicate ATTLIST or ENTITY declarations. In XML, the first declaration takes precedence. Duplicate ELEMENT declarations are not allowed.
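The first-declaration-wins rule can be seen even within a single set of declarations. This sketch assumes Python's xml.etree.ElementTree and keeps both declarations in the document itself, so it illustrates the precedence rule only, not the reading of an external DTD; the entity name status is made up.

```python
import xml.etree.ElementTree as ET

DOC = """<!DOCTYPE doc [
<!ENTITY status "draft">
<!ENTITY status "final">
]>
<doc>&status;</doc>"""

# The second declaration of status is ignored; the first one is used.
value = ET.fromstring(DOC).text
```

Because the internal declarations are read first, this is also what lets a document override an entity or attribute default that an external DTD would otherwise supply.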
Given the preceding discussion of type declarations, it follows that some documents are valid and some are not. There are two categories of XML documents: well-formed and valid.
A document can only be well-formed [Section 2.2] if it obeys the syntax of XML. A document that includes sequences of markup characters that cannot be parsed, or are invalid, cannot be well-formed.
In addition, the document must meet all of the following conditions (understanding some of these conditions may require experience with SGML):
- The document instance must conform to the grammar of XML documents. In particular, some markup constructs (parameter entity references, for example) are only allowed in specific places. The document is not well-formed if they occur elsewhere, even if the document is well-formed in all other ways.
- The replacement text for all parameter entities referenced inside a markup declaration consists of zero or more complete markup declarations. (No parameter entity used in the document may consist of only part of a markup declaration.)
- No attribute may appear more than once on the same start-tag.
- String attribute values cannot contain references to external entities.
- Non-empty tags must be properly nested.
- Parameter entities must be declared before they are used.
- All entities must be declared except the following: amp, lt, gt, apos, and quot.
- A binary entity cannot be referenced in the flow of content; it can only be used in an attribute declared as ENTITY or ENTITIES.
- Neither text nor parameter entities can be recursive, directly or indirectly.
By definition, if a document is not well-formed, it is not XML.
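A few of the conditions above can be checked mechanically. The sketch below assumes Python's xml.etree.ElementTree as the parser and wraps it in a hypothetical helper, is_well_formed; it covers only a handful of the listed conditions.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if the parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


checks = {
    "properly nested": is_well_formed("<a><b>text</b></a>"),
    "overlapping tags": is_well_formed("<a><b>text</a></b>"),
    "repeated attribute": is_well_formed('<a x="1" x="2"/>'),
    "undeclared entity": is_well_formed("<a>&undeclared;</a>"),
}
```

Only the first document is accepted; the other three are rejected before any application code sees their content.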
A well-formed document is valid only if it contains a proper document type declaration and if the document obeys the constraints of that declaration (element sequence and nesting is valid, required attributes are provided, attribute values are of the correct type, etc.). The XML specification identifies all of the criteria in detail.
Pulling the Pieces Together
The XML linking specification (XLL), currently under development, introduces a standard linking model for XML. In consideration of space, and the fact that the XLL draft is still developing, what follows is a survey of the features of XLL, rather than a detailed description of the specification.
In the parlance of XLL, a link expresses a relationship between resources. A resource is any location (an element identified with an ID or the content of a linking element, for example) that is addressed in a link. The exact nature of the relationship between resources depends on both the application that processes the link and the semantic information you supply.
Some highlights of XLL include the following:
- XLL gives you control over the semantics of the link.
- XLL introduces Extended Links. Extended Links can involve more than two resources.
- XLL introduces Extended Pointers (XPointers). XPointers provide a sophisticated method of locating resources.
Since XML does not have a fixed set of elements, the name of the element cannot be used to locate links. Instead, XML processors identify links by recognizing the XML-LINK attribute. Other attributes can be used to provide additional information to the XML processor. An attribute renaming facility exists to work around name collisions in existing applications.
Two of the attributes, SHOW and ACTUATE, allow you to exert some control over the linking behavior. The SHOW attribute determines whether the document that is linked-to is embedded in the current document, replaces the current document, or is displayed in a new window when the link is traversed. ACTUATE determines how the link is traversed, either automatically or when selected by the user.
Some applications will require much finer control over linking behaviors. For those applications, standard places are provided where the additional semantics may be expressed.
A Simple Link strongly resembles an HTML A link:
<LINK XML-LINK="SIMPLE" HREF="locator">Link Text</LINK>
A Simple Link identifies a link between two resources, one of which is the content of the linking element itself. This is an in-line link.
The locator identifies the other resource. The locator may be a URL, a query, or an Extended Pointer.
Extended Links allow you to express relationships between more than two resources:
<ELINK XML-LINK="EXTENDED" ROLE="ANNOTATION">
<LOCATOR XML-LINK="LOCATOR" HREF="text.loc">The Text</LOCATOR>
<LOCATOR XML-LINK="LOCATOR" HREF="annot1.loc">Annotations</LOCATOR>
<LOCATOR XML-LINK="LOCATOR" HREF="annot2.loc">More Annotations</LOCATOR>
<LOCATOR XML-LINK="LOCATOR" HREF="litcrit.loc">Literary Criticism</LOCATOR>
</ELINK>
This example shows how the relationships between a literary work, annotations, and literary criticism of that work might be expressed. Note that this link is separate from all of the resources involved. The semantics of extended links depend on the application, but another example following the discussion of Extended Pointers will demonstrate how extended links can be used to add links to read-only resources.
Extended Links can be in-line, so that the content of the linking element other than the locator elements participates in the link as a resource--but that is not necessarily the case. The example above is an out-of-line link because it does not use its content as a resource.
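Because links are recognized by attribute rather than by element name, an application can find them with a simple scan. The sketch below assumes Python's xml.etree.ElementTree and the attribute names used in this article's examples (XML-LINK, HREF, ROLE), which come from a draft that is still changing. The wrapper element doc is hypothetical, and the extended link is shortened to two locators.

```python
import xml.etree.ElementTree as ET

DOC = """<doc>
<LINK XML-LINK="SIMPLE" HREF="locator">Link Text</LINK>
<ELINK XML-LINK="EXTENDED" ROLE="ANNOTATION">
<LOCATOR XML-LINK="LOCATOR" HREF="text.loc">The Text</LOCATOR>
<LOCATOR XML-LINK="LOCATOR" HREF="annot1.loc">Annotations</LOCATOR>
</ELINK>
</doc>"""

root = ET.fromstring(DOC)

# Simple links: any element, whatever its name, marked XML-LINK="SIMPLE".
simple_links = [
    e.get("HREF") for e in root.iter() if e.get("XML-LINK") == "SIMPLE"
]

# Extended links: collect the locators of each one, keyed by its role.
extended_links = {
    e.get("ROLE"): [c.get("HREF") for c in e if c.get("XML-LINK") == "LOCATOR"]
    for e in root.iter()
    if e.get("XML-LINK") == "EXTENDED"
}
```

Renaming LINK or ELINK to any other element name would not change the result, which is the point of keying on the attribute.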
Cross references with the SGML ID/IDREF mechanism (which is similar to the #fragment mechanism in HTML) require that the document being linked-to has defined anchors where links are desired. This may not always be the case, however; sometimes it is not possible to modify the document to which you wish to link.
XML XPointers borrow concepts from HyTime and the Text Encoding Initiative (TEI). XPointers offer a syntax that allows you to locate a resource by traversing the element tree of the document containing the resource.
locates the third child (whatever it may be) of the second oldjoke in the document.
XPointers can span regions of the tree. The XPointer
selects the second and third oldjokes in the document.
In addition to selecting by elements, XPointers allow for selection by ID, attribute value, and string matching. In this article, the XPointer
selects the first occurrence of the word "Here" in the "What Do XML Documents Look Like?" section of this article. This link can be established by an extended link without modifying this document.
Note that an XPointer range can span a structurally invalid section of the document. The XLL specification does not specify how applications should deal with such ranges.
Extended Link Groups
Out-of-line links introduce the possibility that an XML processor may need to process several files in order to correctly display the hypertext document.
Following the annotated text example above, and assuming that the actual text is read-only, the XML processor must load at least the text and the document that contains the extended link.
XLL defines Extended Link Groups for this purpose. The act of loading an Extended Link Group communicates which documents must be loaded to the XML processor. Extended Link Groups can be used recursively; a STEPS attribute is provided to limit the depth of recursion.
Style and Substance
HTML browsers are largely hardcoded. A first level heading appears the way it does because the browser recognizes the H1 tag.
Again, since XML documents have no fixed tag set, this approach will not work. The presentation of an XML document is dependent on a stylesheet.
At the time of this writing, the XSL effort is just getting off the ground. XSL is likely to be focused on DSSSL, the Document Style Semantics and Specification Language. DSSSL is an international standard stylesheet language (ISO/IEC 10179:1996). Some tools already exist which can process XML with DSSSL stylesheets, but none have yet been integrated into browsers.
Other stylesheet languages, like Cascading Style Sheets, are likely to be supported as well.
In this article, most of the major features of the XML Language have been discussed, and some of the concepts behind XML Link and XML Style have been described. Although some things have been left out in the interest of the big picture (such as character encoding issues), hopefully you now have enough background to pick up and read the XML specifications in this issue without difficulty.
Appendix A: Extended Backus-Naur Form (EBNF)
One of the most significant design improvements in XML is to make it easy to use with modern compiler tools. Part of this improvement involves making it possible to express the syntax of XML in Extended Backus-Naur Form (EBNF) [Section 1.4]. If you've never seen EBNF before, think of it this way:
- EBNF is a set of rules, called "productions."
- Every rule describes a specific fragment of syntax.
- A document is valid if it can be reduced to a single, specific rule, with no input left, by repeated application of the rules.
Let's take a simple example that has nothing to do with XML (or the real rules of language):
[1] Word ::= Consonant Vowel+ Consonant
[2] Consonant ::= [^aeiou]
[3] Vowel ::= [aeiou]
Rule 1 states that a word is a consonant followed by one or more vowels followed by another consonant. Rule 2 states that a consonant is any letter other than a, e, i, o, or u. Rule 3 states that a vowel is any of the letters a, e, i, o, or u. (The exact syntax of the rules, the meaning of square brackets and other special symbols, is laid out in the XML specification.)
Using the above example, is "red" a Word? Yes, for the following reasons:
- "red" is the letter r followed by the letter e followed by the letter d: 'r' 'e' 'd'.
- r is a Consonant by rule 2, so "red" is: Consonant 'e' 'd'
- e is a vowel by rule 3, so "red" is: Consonant Vowel 'd'.
- By rule 2 again, "red" is: Consonant Vowel Consonant which, by rule 1, is a Word.
By the same analysis, "reed", "road", and "xeaiioug" are also words, but "rate" is not. There is no way to match Consonant Vowel Consonant Vowel using the EBNF above. XML is defined by an EBNF grammar of about 80 rules. Although the rules are more complex, the same sort of analysis allows an XML parser to determine that <greeting>Hello World</greeting> is a syntactically correct XML document while <greeting]Wrong Bracket!</greeting> is not.
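The toy grammar is small enough to express directly as a regular expression, which makes the analysis above easy to reproduce. The sketch uses Python's re module; the names CONSONANT, VOWEL, WORD, and is_word are illustrative, and a real XML parser would of course be generated from the full grammar rather than written this way.

```python
import re

CONSONANT = "[^aeiou]"  # rule 2
VOWEL = "[aeiou]"       # rule 3

# Rule 1: a consonant, one or more vowels, then another consonant.
WORD = re.compile(f"{CONSONANT}{VOWEL}+{CONSONANT}")


def is_word(text):
    """True if the whole text reduces to Word with no input left over."""
    return WORD.fullmatch(text) is not None


results = {w: is_word(w) for w in ["red", "reed", "road", "xeaiioug", "rate"]}
```

Note that fullmatch is used on purpose: "rate" begins with a valid Word ("rat"), but input is left over, so it is rejected.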
In very general terms, that's all there is to it. You'll find all the details about EBNF in Compilers: Principles, Techniques, and Tools or in any modern compiler textbook.
While EBNF isn't an efficient way to represent syntax for human consumption, there are programs that can automatically turn EBNF into a parser. This makes it a particularly efficient way to represent the syntax for a language that will be parsed by a computer.
- Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman, Compilers, Principles, Techniques, and Tools. Reading: Addison-Wesley, 1986, rpt. corr. 1988.
About the Author
- ArborText, Inc.
- 1000 Victors Way
- Ann Arbor, MI 48108
Norman Walsh is a Senior Application Analyst at ArborText, Inc. ArborText develops industrial strength SGML authoring and publishing tools and distributes these products worldwide. He has also developed a number of Web resources, including The Internet Font Archives, and is the author of Making TeX Work, published by O'Reilly & Associates.
Norm telecommutes from beautiful Amherst, Massachusetts where he lives with his wife Deborah, two cats, and several frogs.