The Naming of Parts

July 25, 2001

Q: What are the rules for a valid XML element name?

I'm thinking, for example, of a rule like "an element name must begin with a letter (alphabet) and can be followed by alphanumeric characters." Are any special characters (like -, _, #, @, etc.) allowed in the name? Where can I find the specification that defines these rules?

A: You're actually pretty close to the real rules.

To begin to understand these official rules, you'll want to check the W3C Web site for the XML Recommendation itself. (The current version is the "second edition" of XML 1.0, basically unchanged from the version originally published in 1998, except for clearing up some ambiguities.)

A clear, useful description of what's allowed in an XML element (or other) name can be found in section 2.3, "Common Syntactic Constructs":

A Name is a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters.

This definition (like most of the XML Recommendation's definitions) is formally expressed in Extended Backus-Naur Form (EBNF) notation just a couple of lines further down. The EBNF of Name is

[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
[5] Name ::= (Letter | '_' | ':') (NameChar)*

Reading EBNF

Here's a short lesson in deciphering Name's EBNF.

First, note the numbers enclosed in square brackets -- [4] and [5]. Each number identifies a production, that is, one of the numbered EBNF definitions in the spec. It's not uncommon to find references, on XML-related mailing lists and newsgroups, to such things as "production 12" and "production 5"; these simply point to the correspondingly numbered definitions. (If someone mentions "production 12," and you don't know what it means, just open the spec in your browser and do a text search on the string "[12]".)

Second, you need to understand that all these EBNF blocks -- these productions -- do is define some term or other. Production 4 defines the term "NameChar"; whenever that term is used elsewhere in the spec, production 4 provides the, well, the definitive definition of that term. The double-colon-equals symbol, ::=, can be read as "is defined as" or "comprises" and so on.

Finally, what's on the right of the ::= is similar in syntax to a content model in a DTD and uses many of the same regular-expression notations. One special character you will encounter in this syntax is the vertical bar, also called the "pipe": |. This character represents logical "or". So production 4 might be rewritten in English thus:

The term "NameChar" refers to a letter OR a digit OR a period (".") OR a hyphen ("-") OR an underscore ("_") OR a colon (":") OR a CombiningChar OR an Extender.
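For readers who think more easily in code, the chain of ORs in production 4 can be sketched as a single boolean expression. This is only an approximation: the function name is_name_char is made up for illustration, and Python's Unicode general categories stand in for the spec's Letter, Digit, and CombiningChar tables, which they do not match exactly. Only one Extender (the middle dot, U+00B7) is listed explicitly; the full list is in production 89.

```python
import unicodedata

# The four punctuation characters written literally in production 4.
NAME_PUNCTUATION = {".", "-", "_", ":"}


def is_name_char(ch: str) -> bool:
    """Rough, hypothetical rendering of production 4 for one character."""
    category = unicodedata.category(ch)
    return (
        category.startswith("L")       # assumption: any Unicode letter ~ Letter
        or category == "Nd"            # assumption: decimal digits ~ Digit
        or ch in NAME_PUNCTUATION      # '.' | '-' | '_' | ':'
        or category in ("Mn", "Mc")    # assumption: combining marks ~ CombiningChar
        or ch == "\u00b7"              # one Extender; see production 89 for the rest
    )


for ch in ["a", "9", "-", "#", "@"]:
    print(repr(ch), is_name_char(ch))
```

Each "or" in the return expression corresponds to one vertical bar in the EBNF, which is really all the notation is saying.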

Unicode character classes

Note that to the right of the ::= is a mixture of punctuation and terms defined elsewhere, and that the terms on the right are presented as hyperlinks to their definitions. Thus, production 4 actually looks like

[4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender

The four hyperlinks lead to productions 84, 88, 87, and 89, respectively. These four productions are among those in Appendix B ("Character Classes"), and their definitions -- what's on the right of each of their ::= symbols -- boil down to simple lists of Unicode value ranges, represented in hexadecimal form.

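Of course "simple" is a relative term. You might imagine, for example, that the term "Digit" in production 4 equates to Unicode values #x0030 through #x0039 -- the hex representations of the characters 0 through 9. That's only a small fraction of the "digits" actually available for use as an XML name character though, as you can see from production 88:

[88] Digit ::= [#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9] | [#x0966-#x096F] | [#x09E6-#x09EF] | [#x0A66-#x0A6F] | [#x0AE6-#x0AEF] | [#x0B66-#x0B6F] | [#x0BE7-#x0BEF] | [#x0C66-#x0C6F] | [#x0CE6-#x0CEF] | [#x0D66-#x0D6F] | [#x0E50-#x0E59] | [#x0ED0-#x0ED9] | [#x0F20-#x0F29]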

All the other ranges represent legitimate "digits." They're just not digits as you may be accustomed to the term in Western "Arabic" numbering systems. (Also remember that these are not actual numeric values. An XML document may contain text representations of numeric values but not the numeric values themselves. In terms familiar to programmers, the character "9" is not the same as the number 9.)
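To make the point about "digits" concrete, here is a small sketch that tests characters against the Digit ranges and then shows the difference between the character "9" and the number 9. The ranges are transcribed from production 88 and should be checked against the Recommendation before anyone relies on them; the names DIGIT_RANGES and is_xml_digit are invented for this illustration.

```python
# Ranges transcribed from production 88 (Digit) in Appendix B of XML 1.0.
# Treat them as an assumption and verify against the spec itself.
DIGIT_RANGES = [
    (0x0030, 0x0039), (0x0660, 0x0669), (0x06F0, 0x06F9),
    (0x0966, 0x096F), (0x09E6, 0x09EF), (0x0A66, 0x0A6F),
    (0x0AE6, 0x0AEF), (0x0B66, 0x0B6F), (0x0BE7, 0x0BEF),
    (0x0C66, 0x0C6F), (0x0CE6, 0x0CEF), (0x0D66, 0x0D6F),
    (0x0E50, 0x0E59), (0x0ED0, 0x0ED9), (0x0F20, 0x0F29),
]


def is_xml_digit(ch: str) -> bool:
    """True if the character's code point falls in one of the ranges above."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in DIGIT_RANGES)


arabic_indic_nine = "\u0669"  # ARABIC-INDIC DIGIT NINE

# Both characters count as digits for naming purposes.
print(is_xml_digit("9"), is_xml_digit(arabic_indic_nine))

# A digit character is text, not a number, until a program converts it.
print("9" == 9)
print(int("9") == 9, int(arabic_indic_nine) == 9)
```

The conversion with int() is something the program does to the text; the XML document itself only ever holds the characters.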

If you're curious about all these hexadecimal Unicode values and the actual characters they represent, the Unicode code charts are the authoritative source, available either as PDF or as GIFs. Note the URLs addressed by the hyperlinks on these pages. The page of Arabic characters, for example, is designated as "U0600". The U stands for Unicode, and the four-digit value which follows it is the first hexadecimal value in the range covered by that PDF or Web page.

The bottom line

Back to your question. What characters may an XML name (element, etc.) contain and in what order? This is where you need to refer to production 5 listed above. To repeat (with hyperlinks):

[5] Name ::= (Letter | '_' | ':') (NameChar)*

Note that here the EBNF expression actually consists of a couple of "sub-expressions," grouped with parentheses. This production might be rewritten in English thus:

The term "Name" refers to (a letter OR an underscore OR a colon) FOLLOWED BY (any number of the characters defined by the term "NameChar").

The asterisk in the EBNF means "0 or more".
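Because EBNF borrows so much from regular expressions, production 5 translates almost symbol for symbol into a regex. The sketch below is an approximation under stated assumptions: Python's \w and \d classes stand in for the spec's Letter and Digit tables (they are not identical, and combining characters and extenders are left out), and the names NAME_RE and is_name are invented for the example.

```python
import re

# First group of production 5: (Letter | '_' | ':').
# [^\W\d] means "a word character that is not a digit", which covers
# letters and the underscore; the colon is added as an alternative.
NAME_START = r"[^\W\d]|:"

# Approximation of NameChar: letters, digits, '_', '.', '-', ':'.
NAME_CHAR = r"[\w.\-:]"

# (start) followed by (NameChar)*, exactly as the production is grouped.
NAME_RE = re.compile(rf"(?:{NAME_START})(?:{NAME_CHAR})*")


def is_name(candidate: str) -> bool:
    """True if the whole string fits the approximate Name pattern."""
    return NAME_RE.fullmatch(candidate) is not None


# "0 or more" means a single start character is already a complete Name.
print(is_name("a"), is_name("_"), is_name(""))
```

The trailing * is why one-character names such as "a" pass, while the empty string fails: the first group is required, the second is optional and repeatable.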

Thus, putting together productions 4 and 5, legitimate element names include the following:

axiom
_axiom_26
:axiom_veintiséis
ora:open.source

All of these names begin with a letter (defined elsewhere as certain Unicode values), an underscore, or a colon, followed by any combination of letters, digits, periods, hyphens, underscores, and colons (plus the combining characters and extenders allowed by production 4).

The following are not legitimate XML element names:

#axiom
@axiom
26th_of_month
axiom#26

The first three begin with something other than a letter, underscore, or colon; the last starts out all right, but falls apart because the # is not a legitimate name character.
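If you'd rather let a real parser be the judge, one way to try out candidate names is to wrap each one in an empty element and see whether the parser complains. The sketch below uses the expat parser bundled with Python, created without a namespace separator so that colons are treated as ordinary name characters; the helper name parser_accepts is made up for this example, and the result reflects that one parser's behavior rather than a formal proof of conformance.

```python
from xml.parsers import expat


def parser_accepts(name: str) -> bool:
    """Try to parse <name/> and report whether expat accepts it."""
    # No namespace_separator argument: namespace processing stays off,
    # so a colon is handled as an ordinary name character.
    parser = expat.ParserCreate()
    try:
        parser.Parse(f"<{name}/>", True)
    except expat.ExpatError:
        return False
    return True


CANDIDATES = [
    "axiom", "_axiom_26", ":axiom_veintiséis", "ora:open.source",
    "#axiom", "@axiom", "26th_of_month", "axiom#26",
]

for name in CANDIDATES:
    print(f"{name!r}: {parser_accepts(name)}")
```

A namespace-aware parser may treat the colon-bearing names differently (for instance, by insisting that the prefix be declared), so it is worth knowing which mode your own tools run in.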

After you've put together some possible combinations of element names based on the above, I think you'll agree that the rules are really quite simple. What makes them seem complex is that they must be stated precisely and unambiguously, and that they must allow for "name characters" not just in Western language systems but in virtually any language representable as Unicode.