The Naming of Parts
Q: What are the rules for a valid XML element name?
I'm thinking, for example, of a rule like "an element name must begin with a letter (alphabet) and can be followed by alphanumeric characters." Are any special characters (like -, _, #, @, etc.) allowed in the name? Where can I find the specification that defines these rules?
A: You're actually pretty close to the real rules.
To begin to understand these official rules, you'll want to check the W3C Web site for the XML Recommendation itself. (The current version is the "second edition" of XML 1.0, basically unchanged from the version originally published in 1998, except for clearing up some ambiguities.)
A clear, useful description of what's allowed in an XML element (or other) name can be found in section 2.3, "Common Syntactic Constructs."
A Name is a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters.
This definition (like most of the XML Recommendation's definitions) is formally expressed in Extended Backus-Naur Form (EBNF) notation just a couple of lines further down. The EBNF of Name is
 [4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
 [5] Name ::= (Letter | '_' | ':') (NameChar)*
Here's a short lesson in deciphering Name's EBNF.
First, note the numbers enclosed in square brackets to the left of each definition in the spec: [4] for NameChar and [5] for Name. These numbers identify productions. It's not uncommon to find references, on XML-related mailing lists and newsgroups, to such things as "production 12" and "production 5." What these terms refer to, then, are EBNF definitions in the spec. (If someone mentions "production 12," and you don't know what it means, just open the spec in your browser and do a text search on the string "[12]".)
Second, you need to understand that all these EBNF blocks -- these productions -- do is define some term or other. Production 4 defines the term "NameChar"; whenever that term is used elsewhere in the spec, production 4 provides the, well, the definitive definition of that term. The double-colon-equals symbol, ::=, can be read as "is defined as" or "comprises," with the term being defined on its left.
Finally, what's on the right of the ::= is similar in syntax to a content model in a DTD and uses many of the same regular-expression notations. According to this syntax, you might encounter a special character such as the vertical bar -- also called the "pipe" -- character, |. This character represents logical "or". So production 4 might be rewritten in English thus:
The term "NameChar" refers to a letter OR a digit OR a period (".") OR a hyphen ("-") OR an underscore ("_") OR a colon (":") OR a CombiningChar OR an Extender.
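The same chain of ORs can be sketched in Python, where each | in the EBNF becomes an `or`. This is only a rough illustration: the helper names are made up for this example, and Unicode general categories are used as a stand-in (an assumption) for the spec's Letter, Digit, CombiningChar, and Extender classes, which are really fixed lists of code-point ranges. Results can differ from the spec for some characters.

```python
import unicodedata

# Assumption: general categories approximate the spec's character classes.
# The spec itself defines each class as explicit code-point ranges.
LETTER_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Nl"}
COMBINING_CATEGORIES = {"Mn", "Mc", "Me"}
NAME_PUNCTUATION = {".", "-", "_", ":"}


def is_letter(ch: str) -> bool:
    return unicodedata.category(ch) in LETTER_CATEGORIES


def is_digit(ch: str) -> bool:
    return unicodedata.category(ch) == "Nd"


def is_combining_char(ch: str) -> bool:
    return unicodedata.category(ch) in COMBINING_CATEGORIES


def is_extender(ch: str) -> bool:
    # Rough stand-in: modifier letters only.
    return unicodedata.category(ch) == "Lm"


def is_name_char(ch: str) -> bool:
    """Production 4, read left to right: every '|' is an 'or'."""
    return (
        is_letter(ch)
        or is_digit(ch)
        or ch in NAME_PUNCTUATION
        or is_combining_char(ch)
        or is_extender(ch)
    )


for sample in ["a", "9", "-", "#", "@"]:
    print(repr(sample), is_name_char(sample))
```

The last two samples, # and @, come from the question above; neither matches any branch of the production.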
Unicode character classes
Note that to the right of the ::= is a mixture of punctuation and terms defined elsewhere, and that in the spec the terms on the right are presented as hyperlinks to their definitions. Thus, in production 4 the terms Letter, Digit, CombiningChar, and Extender are all hyperlinks.
The four hyperlinks lead to productions 84, 88, 87, and 89, respectively. These four productions are among those in Appendix B ("Character Classes"), and their definitions -- what's on the right of each of their ::= symbols -- boil down to simple lists of Unicode value ranges, represented in hexadecimal form.
Of course "simple" is a relative term. You might imagine, for example, that the term "Digit" in production 4 equates to Unicode values #x0030 through #x0039 -- the hex representations of the digits 0 through 9. That's only a small fraction of the "digits" actually available for use as an XML name character though, as you can see from production 88:
 [88] Digit ::= [#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9] |
[#x0966-#x096F] | [#x09E6-#x09EF] | [#x0A66-#x0A6F] | [#x0AE6-#x0AEF] |
[#x0B66-#x0B6F] | [#x0BE7-#x0BEF] | [#x0C66-#x0C6F] | [#x0CE6-#x0CEF] |
[#x0D66-#x0D6F] | [#x0E50-#x0E59] | [#x0ED0-#x0ED9] | [#x0F20-#x0F29]
All the other ranges represent legitimate "digits." They're just not digits in the sense you may be accustomed to from the Western "Arabic" numbering system. (Also remember that these are not actual numeric values. An XML document may contain text representations of numeric values but not the numeric values themselves. In terms familiar to programmers, the character "9" is not the same as the number 9.)
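To make the idea of "lists of Unicode value ranges" concrete, here is a small Python sketch. It checks a character against just a handful of the ranges from production 88 (a subset chosen for illustration, not the full list), and it also shows the character-versus-number distinction. The names `SOME_DIGIT_RANGES` and `is_listed_digit` are invented for this example.

```python
# A subset of the inclusive ranges from production 88, for illustration only.
SOME_DIGIT_RANGES = [
    (0x0030, 0x0039),  # ASCII digits 0-9
    (0x0660, 0x0669),  # Arabic-Indic digits
    (0x0966, 0x096F),  # Devanagari digits
    (0x0E50, 0x0E59),  # Thai digits
]


def is_listed_digit(ch: str) -> bool:
    """True if the character's code point falls in one of the ranges above."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in SOME_DIGIT_RANGES)


print(is_listed_digit("9"))       # True: #x0039
print(is_listed_digit("\u0669"))  # True: #x0669, Arabic-Indic digit nine
print(is_listed_digit("x"))       # False: a letter, not a digit

# The character "9" is text; it only becomes a number when converted.
print("9" == 9)       # False
print(int("9") == 9)  # True
```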
If you're curious about all these hexadecimal Unicode values and the actual characters they represent, the Unicode code charts are the authoritative source, available either as PDF or as GIFs. Note the URLs addressed by the hyperlinks on these pages. The page of Arabic characters, for example, is designated as "U0600". The U stands for Unicode, and the four-digit value which follows it indicates the range of hexadecimal values covered by that PDF or Web page.
The bottom line
Back to your question. What characters may an XML name (element, etc.) contain and in what order? This is where you need to refer to production 5 listed above. To repeat:
 [5] Name ::= (Letter | '_' | ':') (NameChar)*
Note that here the EBNF expression actually consists of a couple of "sub-expressions," grouped with parentheses. This production might be rewritten in English thus:
The term "Name" refers to (a letter OR an underscore OR a colon) FOLLOWED BY (any number of the characters defined by the term "NameChar").
The asterisk in the EBNF means "0 or more".
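Because the EBNF borrows regular-expression notation, production 5 maps almost directly onto a regular expression. The Python sketch below makes one big simplifying assumption: Letter and Digit are narrowed to ASCII, and CombiningChar and Extender are left out entirely, so it rejects many names the spec allows. The function name `is_ascii_xml_name` is hypothetical.

```python
import re

# Assumption: ASCII-only stand-ins for the spec's much larger classes.
LETTER = "A-Za-z"
DIGIT = "0-9"

# (Letter | '_' | ':')
NAME_START = f"[{LETTER}_:]"
# NameChar, minus CombiningChar and Extender; the trailing '-' is literal.
NAME_CHAR = f"[{LETTER}{DIGIT}._:-]"

# Production 5: one start character, then zero or more name characters.
NAME_RE = re.compile(f"{NAME_START}{NAME_CHAR}*")


def is_ascii_xml_name(candidate: str) -> bool:
    """True if the whole string matches the ASCII-only version of Name."""
    return NAME_RE.fullmatch(candidate) is not None


print(is_ascii_xml_name("a"))      # True: the asterisk allows zero NameChars
print(is_ascii_xml_name("a-b.c"))  # True
print(is_ascii_xml_name("1abc"))   # False: a digit cannot come first
```

Using `fullmatch` rather than `match` matters here: the entire candidate has to fit the pattern, not just its beginning.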
Thus, putting together productions 4 and 5, legitimate element names include the following:
All of these names begin with a letter (defined elsewhere as certain Unicode values), an underscore, or a colon, followed by any combination of letters, digits, hyphens, underscores, colons, and periods.
The following are not legitimate XML element names:
The first three begin with something other than a letter, underscore, or colon; the last starts out all right, but falls apart because the # is not a legitimate name character.
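The two ways a name can fail (a bad first character, or a bad character later on) can be told apart with a short Python sketch. The candidate names below are hypothetical, made up for this example, and Unicode general categories are again assumed as an approximation of the spec's character classes rather than the spec's actual range lists.

```python
import unicodedata

# Assumption: general categories approximate the spec's character classes.
START_CATEGORIES = {"Lu", "Ll", "Lt", "Lo", "Nl"}
OTHER_NAME_CATEGORIES = {"Nd", "Mn", "Mc", "Me", "Lm"}


def is_name_start(ch: str) -> bool:
    return ch in {"_", ":"} or unicodedata.category(ch) in START_CATEGORIES


def is_name_char(ch: str) -> bool:
    return (
        is_name_start(ch)
        or ch in {".", "-"}
        or unicodedata.category(ch) in OTHER_NAME_CATEGORIES
    )


def explain_name(candidate: str) -> str:
    """Say whether a candidate is a Name and, if not, where it goes wrong."""
    if not candidate:
        return "invalid: a name needs at least one character"
    first = candidate[0]
    if not is_name_start(first):
        return f"invalid: cannot start with {first!r}"
    for position, ch in enumerate(candidate[1:], start=1):
        if not is_name_char(ch):
            return f"invalid: {ch!r} at position {position} is not a name character"
    return "valid"


# Hypothetical candidates, not the examples from the original column.
for name in ["order-item", "_private", "2ndItem", "-dash", ".dot", "price#usd"]:
    print(f"{name}: {explain_name(name)}")
```

Positions are counted from zero, so for "price#usd" the reported position of the # is 5.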
After you've put together some possible combinations of element names based on the above, I think you'll agree that the rules are really quite simple. What makes them seem complex is that they must be stated precisely and unambiguously, and that they must allow for "name characters" not just in Western language systems but in virtually any language representable as Unicode.