XML.com: XML From the Inside Out

Entities: Handling Special Content

January 31, 2001

This month we've gleaned a couple of questions about handling special content in XML documents.

Q: I have content like <company>Harris & George</company> in my XML document -- will the "&" be seen as a special character?

A: I don't know what specific tools you're using. But if they include a fully XML-compliant parser, then the answer is yes, the ampersand will be seen as a special character because it is a special character.

At the lowest levels an XML parser is just a program that reads through an XML document a character at a time and analyzes it in one way or another, then behaves accordingly. It knows that it's got to process some content differently than other content. What distinguishes these special cases is the presence of such characters as "&" and "<". They act as flags to the parser; they delimit the document's actual content, alerting the parser to the fact that it must do something at this point other than simply pass the adjacent content to some downstream application.

In the case of the ampersand, what the parser expects when it hits one goes something like this: "Everything between the ampersand and the first semi-colon which comes after it is meant as a code standing in for something else." One such code is called an entity (with the enclosing & and ; characters, it's called an entity reference), and the "something else" it stands for can be a single character, a whole block of characters, or even non-XML data. There are two catches, though: what's permitted to go between the opening and closing & and ; of an entity reference, and what the parser can be expected to know about whatever is there.

What's allowed between the & and the ;

Everything between the ampersand and semi-colon must be either a numeric character code (a # sign followed by decimal digits, or #x followed by hexadecimal digits) or a valid XML name. A valid name must begin with a letter, an underscore, or a colon, and it cannot include any whitespace or most special characters (hyphens, underscores, periods, and colons are the exceptions). That's where the above example, "Harris & George," falls down: the very first character after the & is an illegal blank space.
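To see this for yourself, here is a minimal sketch in Python, assuming the standard library's xml.etree.ElementTree module as the parser (the question doesn't say which tools are in use, so that choice is mine). The helper name parses_ok is made up for the illustration.

```python
import xml.etree.ElementTree as ET

def parses_ok(xml_text):
    """Return True if the text is well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A bare ampersand followed by a space is not a legal entity reference.
bare_ampersand = "<company>Harris & George</company>"

# The same content without the ampersand, as a control.
no_ampersand = "<company>Harris and George</company>"

print(parses_ok(bare_ampersand))
print(parses_ok(no_ampersand))
```

The first document is rejected before any downstream application ever sees the company name; the second goes through untouched.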

What a parser knows about entities

If what appears between the & and ; is numeric (either decimal or hexadecimal) and falls in the range of acceptable values for Unicode data, there's no problem at all. Each of these numeric values stands for a single Unicode character, from the common to the exotic. (Of course, what's considered common or exotic varies depending on where you're sitting.)

Here are a handful of examples of these numeric character references:

Character reference Stands for...
&#198; Æ (capital "AE" ligature)
&#x2105; ℅ ("in care of" symbol)
&#222; Þ (capital Icelandic thorn)
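The three references in the table can be checked with a short Python sketch, again assuming the standard library's xml.etree.ElementTree as the parser. The element name sample is invented for the test document.

```python
import xml.etree.ElementTree as ET

# One decimal, one hexadecimal, one decimal character reference.
elem = ET.fromstring("<sample>&#198; &#x2105; &#222;</sample>")

# By the time the application sees the text, the references are gone
# and only the characters they stand for remain.
print(elem.text)

# Show the code point behind each character (spaces skipped).
codepoints = [hex(ord(ch)) for ch in elem.text if ch != " "]
print(codepoints)
```

Note that no declaration of any kind was needed: numeric character references work in any well-formed document.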

But if the entity reference includes a name, then there's definitely a potential problem: Given the fact that XML lets you make up your own names for just about everything, how can a parser know in advance how to interpret what every possible named entity refers to?

The first line of defense is a handful of names that any parser is supposed to recognize when it encounters them in well-formed documents.

Entity reference Stands for...
&amp; &
&apos; '
&quot; "
&lt; <
&gt; >

So one way to get around your immediate problem is to replace the ampersand in your content with the appropriate entity reference: <company>Harris &amp; George</company>
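If the content is generated by a program rather than typed by hand, the replacement can be automated. The following is a sketch, assuming Python and its standard xml.sax.saxutils.escape function (which replaces &, <, and > by default); your own toolkit very likely has an equivalent.

```python
from xml.sax.saxutils import escape
import xml.etree.ElementTree as ET

raw = "Harris & George"

# escape() turns the bare ampersand into the predefined entity reference.
escaped = escape(raw)
doc = "<company>" + escaped + "</company>"
print(doc)

# A parser turns the reference back into the original character.
round_trip = ET.fromstring(doc).text
print(round_trip)

# All five predefined entities resolve without any declaration.
five = ET.fromstring("<t>&amp;&apos;&quot;&lt;&gt;</t>").text
print(five)
```

The downstream application never sees the reference itself, only the ampersand it stands for.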

Telling the parser about your entities

You're not restricted to just the five named character entities covered above. In fact, you can use any that you want as long as you tell the parser what each name means.

Whether or not your document is based on an existing DTD, you can always declare your own entities in an internal DTD subset. This (optional) portion of a document's prolog looks something like the following.

<?xml version="1.0"?>
<!DOCTYPE rootelem [
<!ENTITY name "value">
]>
<rootelem>...</rootelem>

Here, rootelem would be replaced with the name of whatever your own document's root element is. As for the entity declaration, name will be whatever you want your entity to appear as in your document (that is, the text that appears between the & and ; characters); and value will be what you want the parser to substitute in the document when it encounters the entity reference.

For example,

<?xml version="1.0"?>
<!DOCTYPE names [
<!ENTITY ccedilla "&#231;">
]>
<names>
<name>Fran&ccedilla;ois</name>
</names>

The parser -- and any downstream application to which it passes data -- will read this name as "François."
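Here is a quick way to confirm the substitution, sketched in Python under the assumption that the standard library's xml.etree.ElementTree (built on the expat parser) is doing the parsing. Expat reads the internal DTD subset and expands entities declared there.

```python
import xml.etree.ElementTree as ET

# The document must begin with the XML declaration, with nothing before it.
DOC = """<?xml version="1.0"?>
<!DOCTYPE names [
<!ENTITY ccedilla "&#231;">
]>
<names>
<name>Fran&ccedilla;ois</name>
</names>"""

root = ET.fromstring(DOC)

# The entity reference has already been replaced with its declared value.
name = root.find("name").text
print(name)
```

Remove the ENTITY declaration and the same document fails to parse, because the parser has no way of knowing what ccedilla means.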

Q: How can I insert multimedia files into XML documents?

A: The bad news is that you can't. XML documents contain text and text only.

However, you can insert references to multimedia files in your documents. Of course, these references must also be text -- commonly in the form of a Uniform Resource Identifier (URI) like the value of an HTML img element's src attribute. If your downstream software is smart enough, you can tell it quite a bit more about your multimedia content than just where to find it (which is essentially what a URI does). The trick to doing so is to use a variation of the same entity gimmick we saw above.

If you think about what those character entity references are up to, they make accessible to XML software something that otherwise wouldn't be, even something as simple as an ampersand. And that's also what unparsed entities do. They let the non-XML world in.

You declare an unparsed entity in a DTD in roughly the same way as one of the simpler ones, but notice the differences:

<!NOTATION notationname system_public_IDs >
<!ENTITY name SYSTEM "specific_uri" NDATA notationname>

where system_public_IDs is

  • the keyword SYSTEM, followed by a general_uri (in quotation marks); or
  • the keyword PUBLIC, followed by a public_id (in quotation marks); or
  • the keyword PUBLIC, followed by both a public_id and a general_uri (both in quotation marks).

The first declaration tells the XML processor that you're going to be referring to non-XML content (that's what a notation does). This kind of non-XML content is going to go by the name of notationname, and it can be processed by the application located at general_uri or is defined by the specification known as public_id. At this point we still haven't named a specific multimedia file or other resource. We've simply declared general characteristics of a general kind of resource.

The second declaration is where we get specific. We assign a name to the entity. But instead of declaring that this name means a simple character, we associate it with a specific external file/resource, namely, the one located at specific_uri. Then we go a step further: the resource identified by this entity -- so says the NDATA keyword -- is the kind of non-XML content called notationname. The latter maps back to the NOTATION declaration that precedes it.

And then there's one more piece: actually bringing the content into the document. Here the parallel with character entity references breaks down. An unparsed entity can't be referenced with the & and ; syntax; the XML 1.0 Recommendation forbids it. Instead, you supply the entity's bare name as the value of an attribute that the DTD declares to be of type ENTITY (or ENTITIES, for a list of names).

For example, let's say we've got a GIF image that we want to incorporate into our document. The declarations, and a portion of the document itself, might look like the following. (The image element and its source attribute are names made up for this example.)

<?xml version="1.0"?>
<!DOCTYPE somedoc [
<!NOTATION gif_pix
PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN">
<!ENTITY my_logo SYSTEM "my_logo.gif" NDATA gif_pix>
<!ELEMENT image EMPTY>
<!ATTLIST image source ENTITY #REQUIRED>
]>
<somedoc>
...
<image source="my_logo"/>
...
</somedoc>

But it's not as easy as this makes it look. Figuring out what mix of SYSTEM and PUBLIC identifiers to use for notation declarations can be unpleasant enough. But the real problem is finding common software that knows what to do with an unparsed entity such as my_logo in this example. Don't expect a Web browser to be able to do anything with it anytime soon.
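To give a sense of what "knowing what to do" involves, here is a minimal sketch in Python, assuming the standard library's xml.parsers.expat module. The parser reports the notation and unparsed entity declarations through callbacks; everything after that is the application's job. In this sketch the entity is named in an attribute declared to be of type ENTITY, which is how XML 1.0 says unparsed entities are referred to. The image element, its source attribute, and the handler and variable names are all hypothetical.

```python
import xml.parsers.expat

DOC = """<?xml version="1.0"?>
<!DOCTYPE somedoc [
<!NOTATION gif_pix PUBLIC "-//ISBN 0-7923-9432-1::Graphic Notation//NOTATION CompuServe Graphic Interchange Format//EN">
<!ENTITY my_logo SYSTEM "my_logo.gif" NDATA gif_pix>
<!ATTLIST image source ENTITY #REQUIRED>
]>
<somedoc><image source="my_logo"/></somedoc>"""

notations = {}   # notation name -> public or system identifier
unparsed = {}    # entity name -> (system identifier, notation name)
used = []        # entities actually named in the document body

def on_notation(name, base, system_id, public_id):
    # Called once for each NOTATION declaration.
    notations[name] = public_id or system_id

def on_unparsed_entity(name, base, system_id, public_id, notation_name):
    # Called once for each ENTITY declaration that carries NDATA.
    unparsed[name] = (system_id, notation_name)

def on_start(tag, attrs):
    # The parser hands over only the attribute's string value; mapping
    # it back to the declared entity is left to the application.
    if tag == "image" and attrs.get("source") in unparsed:
        used.append(unparsed[attrs["source"]])

parser = xml.parsers.expat.ParserCreate()
parser.NotationDeclHandler = on_notation
parser.UnparsedEntityDeclHandler = on_unparsed_entity
parser.StartElementHandler = on_start
parser.Parse(DOC, True)

print(used)
```

All the parser delivers is a file name and a notation name. Fetching my_logo.gif and rendering it as a GIF is code you would still have to write yourself, which is exactly why off-the-shelf support is so thin.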