Entities: What are They Good For?
August 28, 1998
Q: Tell me about entities in XML documents. What are they and how do I use them?
A: After you're comfortable with XML markup, it's time to tackle entities. The term entity in the XML Recommendation is used for several related, but slightly different things. There are three things that might loosely be called entities in XML, and we'll take a detailed look at each of them:
Internal entities function as typing shortcuts or macros.
External entities allow you to incorporate content from other files.
Parameter entities, which can be internal or external, are only available within the internal and external subsets (the DTD).
Entity Declarations, Attributes and Expansion
Entities must be declared before they can be used. They may be declared in the DTD, if your XML parser processes the DTD (also known as the external subset), or the internal subset. Note: if the same entity is declared more than once, only the first declaration applies, and the internal subset is processed before the external subset. A declaration in the internal subset therefore takes precedence over a declaration of the same entity in the DTD.
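The first-declaration rule can be checked with a short sketch. It assumes Python's standard xml.etree.ElementTree module as the parser (the article does not name one), and the entity name "greeting" is invented for illustration.

```python
import xml.etree.ElementTree as ET

# The internal subset declares the same (hypothetical) entity twice.
DOCUMENT = """<!DOCTYPE doc [
  <!ENTITY greeting "first declaration">
  <!ENTITY greeting "second declaration">
]>
<doc>&greeting;</doc>"""

root = ET.fromstring(DOCUMENT)

# Only the first binding of the name is kept; the second is ignored.
first_wins = root.text
print(first_wins)
```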
All entities are declared with the "ENTITY" declaration. The exact format of the declaration distinguishes between internal, external, and parameter entities.
An internal entity declaration has the following form:
<!ENTITY entityname "replacement text">
You can use either double or single quotes to delimit the replacement text. The declaration of an entity named yoyo that holds a company name would be:
<!ENTITY yoyo 'Yoyodyne Industries, Inc.'>
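As a rough illustration, the declaration can be placed in the internal subset of a small document and expanded by a parser. The sketch below assumes Python's standard xml.etree.ElementTree; the memo element is a made-up wrapper, not part of the article.

```python
import xml.etree.ElementTree as ET

# A hypothetical memo that declares yoyo in its internal subset
# and then refers to it in the content.
DOCUMENT = """<!DOCTYPE memo [
  <!ENTITY yoyo 'Yoyodyne Industries, Inc.'>
]>
<memo>Welcome to &yoyo;</memo>"""

memo = ET.fromstring(DOCUMENT)

# The parser has already replaced the reference with the replacement text.
expanded = memo.text
print(expanded)
```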
External entity declarations come in two forms. If the external entity contains XML text, the declaration has one of the following forms, depending on whether a public identifier is supplied:
<!ENTITY entityname SYSTEM "system-identifier">
<!ENTITY entityname PUBLIC "public-identifier" "system-identifier">
The system identifier must point to an instance of a resource via a URI, most commonly a simple filename. The public identifier, if supplied, may be used by an XML system to generate an alternate URI (this provides a handy level of indirection on systems that support public identifiers).
An external entity that incorporates chap1.xml into your document might be declared like this:
<!ENTITY chap1 SYSTEM "chap1.xml">
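One way to see an external entity being pulled in is to give a parser a callback that supplies the entity's content. The sketch below assumes Python's xml.parsers.expat module and replaces the file system with a hypothetical in-memory dictionary (FILES), so the chapter text shown is invented.

```python
import xml.parsers.expat as expat

# Hypothetical stand-in for files on disk: system identifier -> content.
FILES = {"chap1.xml": "<chapter>Chapter one text.</chapter>"}

DOCUMENT = """<!DOCTYPE book [
  <!ENTITY chap1 SYSTEM "chap1.xml">
]>
<book>&chap1;</book>"""


def parse_with_external_entities(document, files):
    """Return the parse events, with external entities read from `files`."""
    events = []

    def wire(p):
        # Record element boundaries and character data as simple tuples.
        p.StartElementHandler = lambda name, attrs: events.append(("start", name))
        p.EndElementHandler = lambda name: events.append(("end", name))
        p.CharacterDataHandler = lambda data: events.append(("text", data))

    def resolve(context, base, system_id, public_id):
        # Parse the entity's content with a child parser so that its
        # events arrive at the point of the reference.
        child = parser.ExternalEntityParserCreate(context)
        wire(child)
        child.Parse(files[system_id], True)
        return 1  # a nonzero value tells the parser to carry on

    parser = expat.ParserCreate()
    wire(parser)
    parser.ExternalEntityRefHandler = resolve
    parser.Parse(document, True)
    return events


events = parse_with_external_entities(DOCUMENT, FILES)
structure = [event for event in events if event[0] != "text"]
text = "".join(data for kind, data in events if kind == "text")
```

The chapter element ends up nested inside book exactly as if its markup had been typed in place of the reference.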
Despite the growing trend to store everything in XML, there are some legacy systems that still store data in non-XML formats. Graphics are sometimes stored in odd formats like PNG and GIF, for example ;-).
External entities that refer to these files must declare that the data they contain is not XML. They accomplish this by naming the format of the external entity in a notation, introduced by the NDATA keyword:
<!ENTITY entityname SYSTEM "system-identifier" NDATA notation>
<!ENTITY entityname PUBLIC "public-identifier" "system-identifier" NDATA notation>
See the section called Entity Attributes for more detail. An external entity that refers to the GIF image pic01.gif might be declared like this:
<!ENTITY mypicture SYSTEM "pic01.gif" NDATA GIF>
Parameter entity declarations are identified by a % preceding the entity name:
<!ENTITY % pentityname1 "replacement text">
<!ENTITY % pentityname2 SYSTEM "URI">
Note the space following the % in the declaration. Parameter entities can be either internal or external, but they cannot refer to non-XML data (you can't have a parameter entity with a notation).
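A small sketch of an internal parameter entity at work follows. It assumes Python's xml.parsers.expat module with parameter entity parsing switched on explicitly; the entity name "decls" is hypothetical. The parameter entity holds a complete declaration, and referencing it between declarations in the internal subset makes that declaration take effect.

```python
import xml.parsers.expat as expat

# "decls" is a hypothetical parameter entity whose replacement text
# is itself an entity declaration.
DOCUMENT = """<!DOCTYPE doc [
  <!ENTITY % decls '<!ENTITY yoyo "Yoyodyne Industries, Inc.">'>
  %decls;
]>
<doc>&yoyo;</doc>"""

chunks = []
parser = expat.ParserCreate()

# Ask the parser to process parameter entity references.
parser.SetParamEntityParsing(expat.XML_PARAM_ENTITY_PARSING_ALWAYS)
parser.CharacterDataHandler = chunks.append
parser.Parse(DOCUMENT, True)

# Character data may arrive in several pieces, so join them.
text = "".join(chunks)
```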
External entities can be further classified as either "parsed" or "unparsed". Entities which refer to external files that contain XML are called "parsed entities;" entities which refer to other types of data, identified by a notation, are "unparsed."
The parser inserts the replacement text of a parsed entity into the document wherever a reference to that entity occurs. It is an error to insert an entity reference to an unparsed entity directly into the flow of an XML document. Unparsed entities can only be used as attribute values on elements with ENTITY attributes.
Unparsed entities are used most frequently on XML elements that incorporate graphics into a document. Consider the following brief document:
<!DOCTYPE doc [
  <!ELEMENT doc (para|graphic)+>
  <!ELEMENT para (#PCDATA)>
  <!ELEMENT graphic EMPTY>
  <!ATTLIST graphic
    image ENTITY #REQUIRED
    alt   CDATA  #IMPLIED
  >
  <!NOTATION GIF SYSTEM "CompuServe Graphics Interchange Format 87a">
  <!ENTITY mypicture SYSTEM "normphoto.gif" NDATA GIF>
  <!ENTITY norm "Norman Walsh">
]>
<doc>
  <para>The following element incorporates the image declared as "mypicture":</para>
  <graphic image="mypicture" alt="A picture of &norm;"/>
</doc>
You could also declare the image attribute as CDATA and simply type the filename, but the use of an entity offers a useful level of indirection.
There is a somewhat subtle distinction between entity attributes and entity references in attribute values. An "ordinary" (CDATA) attribute contains text. You can put internal entity references in that text, just as you can in any other content. An ENTITY attribute can only contain the name of an external, unparsed entity. In particular, note that it contains the name of the entity, not a reference to the entity.
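The difference between the two kinds of attribute can be observed from an application's side. The sketch below assumes Python's xml.parsers.expat module and uses a shortened copy of the sample document, with the unparsed entity declared using the NDATA keyword. The dictionary and list names are invented for illustration.

```python
import xml.parsers.expat as expat

DOCUMENT = """<!DOCTYPE doc [
  <!ELEMENT doc (para|graphic)+>
  <!ELEMENT para (#PCDATA)>
  <!ELEMENT graphic EMPTY>
  <!ATTLIST graphic
    image ENTITY #REQUIRED
    alt   CDATA  #IMPLIED>
  <!NOTATION GIF SYSTEM "CompuServe Graphics Interchange Format 87a">
  <!ENTITY mypicture SYSTEM "normphoto.gif" NDATA GIF>
  <!ENTITY norm "Norman Walsh">
]>
<doc>
  <graphic image="mypicture" alt="A picture of &norm;"/>
</doc>"""

unparsed = {}   # entity name -> (system identifier, notation name)
graphics = []   # attribute dictionaries of the graphic elements seen


def on_entity_decl(name, is_parameter, value, base, system_id, public_id, notation):
    # Only unparsed entities carry a notation name.
    if notation is not None:
        unparsed[name] = (system_id, notation)


def on_start(name, attrs):
    if name == "graphic":
        graphics.append(attrs)


parser = expat.ParserCreate()
parser.EntityDeclHandler = on_entity_decl
parser.StartElementHandler = on_start
parser.Parse(DOCUMENT, True)

# The ENTITY attribute holds a name; the application looks it up itself.
image_name = graphics[0]["image"]
image_file, image_format = unparsed[image_name]

# The CDATA attribute holds text in which the reference was expanded.
alt_text = graphics[0]["alt"]
```

The parser hands over the entity name untouched and leaves it to the application to find the file and decide what to do with the GIF notation.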
Character references are expanded immediately. They behave exactly as if you had typed the literal character.
Entity references in the replacement text of other entities are not expanded until the entity being declared is referenced. In other words, this is legal in the internal subset:
<!ENTITY foobar "&f;bar"> <!ENTITY f "foo">
because the entity reference "&f;" isn't expanded until "&foobar;" is expanded.
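This late expansion is easy to confirm. The sketch assumes Python's standard xml.etree.ElementTree as the parser.

```python
import xml.etree.ElementTree as ET

# foobar refers to f before f has been declared.
DOCUMENT = """<!DOCTYPE doc [
  <!ENTITY foobar "&f;bar">
  <!ENTITY f "foo">
]>
<doc>&foobar;</doc>"""

# By the time foobar is referenced in the content, f is known.
result = ET.fromstring(DOCUMENT).text
print(result)
```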
Parsed entities are recognized in the body of your document, where unparsed entities are forbidden. Unparsed entities are allowed in entity attributes, where parsed entities are forbidden.
Although you can put references to internal entities in attribute values, it is illegal to refer to an external entity in an attribute value.
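These rules can be probed with a parser. The following sketch assumes Python's standard xml.etree.ElementTree and a hypothetical helper, is_well_formed, that reports whether a document body is accepted against a fixed set of declarations.

```python
import xml.etree.ElementTree as ET

# One internal, one external parsed, and one unparsed entity.
DTD = """<!DOCTYPE doc [
  <!NOTATION GIF SYSTEM "CompuServe Graphics Interchange Format 87a">
  <!ENTITY norm "Norman Walsh">
  <!ENTITY chap1 SYSTEM "chap1.xml">
  <!ENTITY mypicture SYSTEM "pic01.gif" NDATA GIF>
]>"""


def is_well_formed(body):
    """Return True if the parser accepts `body` with the declarations above."""
    try:
        ET.fromstring(DTD + body)
        return True
    except ET.ParseError:
        return False


# Internal entity reference inside an attribute value: allowed.
internal_in_attribute = is_well_formed('<doc alt="A picture of &norm;"/>')

# External entity reference inside an attribute value: an error.
external_in_attribute = is_well_formed('<doc alt="&chap1;"/>')

# Reference to an unparsed entity in the body of the document: an error.
unparsed_in_content = is_well_formed("<doc>&mypicture;</doc>")
```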
A couple of significant caveats apply to the use of entities:
Non-validating parsers are not required to resolve entities declared outside the document (in the external subset). In fact, non-validating parsers may not perform entity expansion at all.
At this time (August, 1998), it's not clear to what extent mainstream web browsers will support entities.
Entity references, while they can perhaps be a little tricky, offer a number of benefits:
The ability to define commonly used text in a single location.
The ability to break large documents up into workable modules.
They offer one possible foundation for a reuse strategy.
Types of Entities
Do you ever get tired of typing the name of your company, "Yoyodyne Industries, Inc."? Have you ever had the pleasure of spelling it incorrectly in an important document? Internal entities offer a convenient solution to these problems.
Instead of typing the same text over and over again, you can define an internal entity to contain the text and then you only need to use the entity where you want to insert the text. Because the entity is expanded by the parser, you can be assured that you'll get the same text in every location. The parser will also catch typos if you misspell an entity name (so long as there's no entity name that matches your typo!).
To use an entity you insert an "entity reference" into your document. You're probably already familiar with some entity references because you need to use them for special characters that cannot be typed directly in an XML document, like "<" and "&". An entity reference is an ampersand (&), followed by the name of the entity, followed by a semicolon (;).
If you've defined the entity "yoyo" to contain the name of your company, then you can use it with the following entity reference "&yoyo;".
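The typo-catching behavior mentioned above can be sketched as follows, assuming Python's standard xml.etree.ElementTree. The memo element and the expand helper are hypothetical.

```python
import xml.etree.ElementTree as ET

DTD = "<!DOCTYPE memo [<!ENTITY yoyo 'Yoyodyne Industries, Inc.'>]>"


def expand(body):
    """Parse a memo containing `body`; return its text, or the parse error."""
    try:
        return ET.fromstring(DTD + "<memo>" + body + "</memo>").text
    except ET.ParseError as error:
        return error


good = expand("&yoyo;")
bad = expand("&yoyoo;")  # misspelled entity name, never declared
```

The correct reference yields the company name, while the misspelled one makes the parser reject the document instead of silently producing wrong text.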
The text that is inserted by an entity reference is called the "replacement text". The replacement text of an internal entity can contain markup (elements, attributes, processing instructions, other entity references, etc.), but the content must be balanced (any element that you start in an entity must end in the same entity) and circular entity references are not allowed.
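A brief sketch of these three rules, assuming Python's standard xml.etree.ElementTree. The company element and the entity names open, a, and b are invented for illustration.

```python
import xml.etree.ElementTree as ET


def parse(declarations, body):
    """Parse `body` with `declarations` placed in the internal subset."""
    return ET.fromstring("<!DOCTYPE doc [" + declarations + "]>" + body)


def rejected(declarations, body):
    """Return True if the parser refuses the document."""
    try:
        parse(declarations, body)
        return False
    except ET.ParseError:
        return True


# Balanced markup in replacement text becomes real elements.
doc = parse(
    '<!ENTITY yoyo "<company>Yoyodyne Industries, Inc.</company>">',
    "<doc>&yoyo;</doc>",
)
company = doc.find("company").text

# An element started inside an entity must also end inside it.
unbalanced_rejected = rejected(
    '<!ENTITY open "<company>">',
    "<doc>&open;Yoyodyne</company></doc>",
)

# Entities may not refer to themselves, directly or indirectly.
circular_rejected = rejected(
    '<!ENTITY a "&b;"><!ENTITY b "&a;">',
    "<doc>&a;</doc>",
)
```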
You create internal entities with entity declarations in the internal subset or the DTD.
Five internal entities are predefined in XML:
Table 1. Predefined Entities
| Entity Name | Replacement Text |
|---|---|
| lt | The less than sign (<) |
| gt | The greater than sign (>) |
| amp | The ampersand (&) |
| apos | The single quote or apostrophe (') |
| quot | The double quote (") |
All XML processors are required to support references to these entities, even if they are not declared.
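For instance, a document with no declarations at all can still use all five. The sketch assumes Python's standard xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET

# No DOCTYPE and no entity declarations anywhere in this document.
element = ET.fromstring("<doc>&lt; &gt; &amp; &apos; &quot;</doc>")

# Each reference has been replaced by the character it stands for.
text = element.text
print(text)
```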
Character references, which are similar in appearance to entity references, allow you to reference arbitrary Unicode characters, even if they aren't available directly on your keyboard. Character references are not properly entities at all.
Character references are numeric and can be used without any special declaration.
The basic format of a character reference is either "&#nnn;" or "&#xhhh;" where "nnn" is a decimal Unicode character number and "hhh" is a hexadecimal Unicode character number.
A character reference inserts the specified Unicode character directly into your document. Note that this does not guarantee that your processing or display system will be able to do anything useful with the character. For example, &#x236E; would insert, in the words of the Unicode standard, an "APL Functional Symbol Semicolon Underbar". Whether or not you can print that character is an entirely different issue.
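The decimal and hexadecimal forms are interchangeable. The sketch below assumes Python's standard xml.etree.ElementTree and unicodedata modules and writes the character described above (U+236E, decimal 9070) both ways.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The same character, once by decimal and once by hexadecimal number.
element = ET.fromstring("<doc>&#9070;&#x236E;</doc>")

# The content is now two characters long; unpack them.
decimal_char, hex_char = element.text

# Look up the official Unicode name of the inserted character.
name = unicodedata.name(decimal_char)
print(name)
```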
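Character references differ from other entity references in a subtle but significant way. They are expanded immediately by the parser, including when they appear in the replacement text of an entity declaration. There, writing '&#34;' yields exactly the same replacement text as a literal '"'. In particular, this means you can't use a character reference in replacement text to keep a character such as "<" or "&" from being treated as markup when the entity is later referenced. (In an attribute value typed directly into the document, '&#34;' and '&quot;' do safely stand in for the quotation character.)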
External entities offer a mechanism for dividing your document up into logical chunks. Rather than authoring a monolithic document, a book with 10 chapters for example, you can store each chapter in a separate file and use external entities to "source in" the 10 chapters.
Because external entities in different documents can refer to the same files on your file system, external entities provide an opportunity to implement reuse. Reuse of small, discrete components (figures, legal boilerplate, warning messages) is fairly easy to manage. Implementing reuse on a large scale requires an entity management system which XML, by itself, does not provide.
A few notes about external entities:
External entities do not have to consist of a single element; you can make a sequence of three paragraphs, or even a bunch of character data with embedded inline markup into an external entity. But the tags in an external entity must be well balanced (you can't start a tag in an entity and end it in your document or in another entity).
External entities can reference internal or other external entities, but you cannot have circular references.
You can refer to the same external entity several times in a single document. Note, however, that if you do this, you will have to avoid using ID attributes in the external entity if you're concerned about validity. Using an external entity that contains an ID in more than one location in your document will produce a document with duplicate IDs, which is a validity error.
It is legal to have several external entities that all refer to the same external file.
There are no additional restrictions placed on the character encodings used by external entities. In particular, external entities with differing encodings can be used in the same document.
External entities, like internal entities, have names and are referenced in the same manner, although they are declared differently.
Internal and external entity references are not expanded in the DTD or the internal subset (this allows you to use entity references in the replacement text of other entities without concern about the order of declarations). If you want to have the effect of entities and entity references in your DTD, parameter entities must be used. Parameter entity references use the "%" character instead of the "&". Parameter entities can't be used in the content of your document; they simply aren't recognized.
It is legal to have a parameter entity and an internal or external entity with the same name. They are completely different types of entities and cannot conflict with each other.
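Both points can be seen in one small document. The sketch assumes Python's standard xml.etree.ElementTree, and the shared entity name "name" is invented for illustration.

```python
import xml.etree.ElementTree as ET

# A parameter entity and a general entity share one (hypothetical) name.
DOCUMENT = """<!DOCTYPE doc [
  <!ENTITY % name "parameter entity text">
  <!ENTITY name "general entity text">
]>
<doc>&name; / %name;</doc>"""

# &name; is expanded; %name; is just ordinary text in the content.
content = ET.fromstring(DOCUMENT).text
print(content)
```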
One common use of parameter entities is in conditional sections. Conditional sections are a mechanism for parameterizing the DTD. Note, however, that you cannot use conditional sections in the internal subset of XML documents.