Big Documents, Little Attributes

June 6, 2001

John E. Simpson

Q: How do I process a big XML document?

DOM-based XML parsers load an XML document's tree structure into memory. What if the XML document is too big -- say, 7 million records? This may give an out-of-memory error. On the other hand, SAX-based parsers read character-by-character from an XML document. If I have to search for a record at the end of a document this big, it's going to take a long time to find the matching record.

So what sort of parser should I use to parse an XML file which has around 7 million records -- taking into consideration memory and search time?

A: I don't know anything about your application other than the information you've provided above, but you know what it sounds like to me? It sounds like somewhere along the line, someone (not necessarily you) made a decision to use XML for the application without properly considering the consequences.

As a general rule (to which there are exceptions, no doubt), XML is ill-suited as a large-document storage format. There are a number of intrinsic reasons for this, such as the following.

  • The markup itself (tags, attributes, entity references, and so on) can occupy (and waste) a significant amount of the document's overall size.
  • There's nothing inherent in an XML document to accelerate searching for a node with a particular value. (If you're using XSLT to process the document, you can take some advantage of the "indexing" which most processors do when encountering functions like id() and key(). Trouble is, the index goes away as soon as the document is closed -- and of course needs to be rebuilt the next time it's opened.)
  • All you can truly store in an XML document is text data. You can sort of hack your way around this limitation, using external entities, XLink hrefs, and base64 encoding of binary data. None of which changes the fact that the document itself will contain nothing but text.
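To make the second point concrete, here's a minimal sketch in Python of searching a document with a streaming parser. It isn't taken from your application: the record element, the id attribute, and the find_record function are all made-up names, and the 1,000-record sample stands in for your 7 million. Memory stays roughly flat because each record is discarded after it's examined, but finding the last record still means reading every record before it.

```python
import io
import xml.etree.ElementTree as ET

def find_record(source, wanted_id):
    """Scan `source` record by record; return (fields, records_examined)."""
    examined = 0
    # "end" events fire once an element and all of its children are complete.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "record":      # hypothetical element name
            continue
        examined += 1
        if elem.get("id") == wanted_id:
            return {child.tag: child.text for child in elem}, examined
        elem.clear()                  # drop this record's content before moving on
    return None, examined

# A small stand-in document; the real one would be a file on disk.
records = "".join(f'<record id="{i}"><name>n{i}</name></record>' for i in range(1, 1001))
sample = io.BytesIO(f"<records>{records}</records>".encode("utf-8"))

fields, examined = find_record(sample, "1000")
print(fields, examined)
```

Nothing in the document tells the parser where record 1000 lives, so the count of records examined grows with the record's position. That's the linear search cost a database index exists to avoid.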

If you must process 7 million records, must do it repeatedly, and are determined to do it with XML rather than a DBMS, I think you should forget about "what sort of parser" to use. That's the least of your worries. Instead, consider one of the various XML databases on the market. You can start by checking out Ron Bourret's XML Database Products page.

Good luck. Plan for some long work days.

Q: I'm confused about specifying attribute values in a DTD

I'm having trouble understanding #FIXED vs. #REQUIRED vs. "default values" attribute defaults. For example:
   <!ATTLIST Publisher #REQUIRED ISBN CDATA "?????">
What does the #REQUIRED do? What does the ISBN CDATA "?????" actually mean?

A: Your example is scrambled. The general syntax for an ATTLIST declaration is
   <!ATTLIST elementname attribname attribvalueinfo>
You've got the element name (Publisher) there all right, and the attribute name (ISBN) too, but the attribute name needs to immediately follow the element name. Thus, setting aside your other questions for a moment, your declaration should read:
   <!ATTLIST Publisher ISBN attribvalueinfo>

Now let's take a look at what goes into attribvalueinfo. This is actually a kind of shorthand for two types of information: the attribute type and the attribute's default-value specification.

Attribute types

A DTD can declare an attribute as one of three general types: enumerated, string, or tokenized.

The first, enumerated, simply lists the allowable values for the attribute in enumerated form: as a series of tokens separated by "pipes" (vertical-bar or | characters). So you might have something like this:
   <!ATTLIST Publisher medium (print | online | other) defaultspec>

Then there's the string attribute type. Unlike the enumerated type, for which the author of the DTD provides a list of valid values, a string-type attribute is constrained (as the name implies) simply in that its value must be a string of characters. You indicate a string-type attribute with the keyword CDATA. For instance:
   <!ATTLIST Publisher ISBN CDATA defaultspec>

Finally, the tokenized attribute type takes as its value either a single token (roughly speaking, a "word"), or a series of tokens separated by whitespace. The value is further constrained in that each token must be of a particular declared kind: ID, IDREF, IDREFS, ENTITY, ENTITIES, NMTOKEN, or NMTOKENS. For example:
   <!ATTLIST Publisher ISBN ENTITY defaultspec>
limits the value of the ISBN attribute to the name of some unparsed entity declared in the DTD. And
   <!ATTLIST Publisher ISBN IDREFS defaultspec>
says that the ISBN attribute's value can consist of one or more (whitespace-delimited) tokens whose values match those of ID-type attributes elsewhere in the same document.
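Here's a rough sketch in Python of what an IDREFS-type value amounts to. The element and attribute names (Catalog, Book, id, titles) are invented for illustration, and the lookup is done by hand: Python's standard parser doesn't validate, so it has no idea which attributes are of type ID or IDREFS.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: "titles" plays the role of an IDREFS attribute,
# and "id" plays the role of an ID attribute.
doc = """<Catalog>
  <Book id="b1">First</Book>
  <Book id="b2">Second</Book>
  <Publisher titles="b1  b2"/>
</Catalog>"""

root = ET.fromstring(doc)

# Map every id value to the element that carries it.
by_id = {el.get("id"): el for el in root.iter() if el.get("id") is not None}

# An IDREFS value is a whitespace-delimited list of tokens.
tokens = root.find("Publisher").get("titles").split()

# Each token must match some ID in the same document.
resolved = [by_id[token].text for token in tokens]
print(tokens, resolved)
```

A validating parser would report an error for any token with no matching ID; in this sketch the same mistake would surface as a KeyError.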

Specific attribute default values

Providing a default value for the enumerated and string-type attributes is simple. First, for the enumerated type, you can just select which of the allowable values you want to be the default value, enclosing it in quotation marks. Thus,
   <!ATTLIST Publisher medium (print | online | other) "print">
In the absence of any medium attribute for a given Publisher element, the attribute will thus assume the value of print just as if it had been explicitly coded that way by the document's author.
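If you'd like to watch the default being applied, here's a small sketch in Python. It assumes the declaration sits in the document's internal DTD subset, which the expat-based parser in Python's standard library reads even though it doesn't validate. The Catalog wrapper element and the element content are made up for the example.

```python
import xml.etree.ElementTree as ET

# The ATTLIST declaration lives in the internal subset, so the parser sees it.
doc = """<!DOCTYPE Catalog [
  <!ATTLIST Publisher medium (print | online | other) "print">
]>
<Catalog>
  <Publisher>First</Publisher>
  <Publisher medium="online">Second</Publisher>
</Catalog>"""

root = ET.fromstring(doc)

# The first Publisher has no medium attribute in the source text;
# the second one supplies its own value.
media = [publisher.get("medium") for publisher in root.iter("Publisher")]
print(media)
```

The application reading the parsed document can't tell whether a value was typed by the author or supplied by the DTD, which is exactly the point of a default.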

You can also supply a simple, quotation-mark-enclosed string value to be used as the default value of a string-type attribute. This wouldn't make a lot of sense in the case of an ISBN (you wouldn't want to assign a default ISBN to every Publisher element). But it might make a lot of sense in the case of (say) an XLink href attribute, for which you'd want to supply a default location for some resource. Like this:
   <!ATTLIST Pub_page xlink:href CDATA "">
This would ensure that the XLink would point to something even if, in a particular instance, the document author failed to provide a URI for it.

However, besides assigning a specific value, you can also use a couple of other forms of the defaultspec portion of the attribute declaration. These are the forms which seem to be tripping you up.

"Generic" attribute default specifications

You've got two choices in declaring a non-specific attribute default value: #IMPLIED and #REQUIRED.

#IMPLIED is for use when you don't want your DTD to supply a default value at all and don't care if a given document's author has supplied a value for it, either. Thus,
   <!ATTLIST Publisher ISBN CDATA #IMPLIED>
asserts that a given Publisher element may or may not have an ISBN attribute. If there is no such attribute, its value is undefined.

On the other hand, #REQUIRED says some value must be supplied. Assume the declaration for the ISBN attribute looks like this:
   <!ATTLIST Publisher ISBN CDATA #REQUIRED>
Then the following Publisher element appearing in a document instance would be rejected by a validating XML parser:
   <Publisher>...</Publisher>
because there's no ISBN attribute.
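Note the word "validating." A non-validating parser, such as the one in Python's standard library, reads a #REQUIRED declaration and then cheerfully accepts a document that breaks it. The sketch below shows that, followed by a hand-rolled check that stands in for what a validating parser would do. The missing_required function, the Catalog wrapper, and the placeholder ISBN are all invented for the example.

```python
import xml.etree.ElementTree as ET

doc = """<!DOCTYPE Catalog [
  <!ATTLIST Publisher ISBN CDATA #REQUIRED>
]>
<Catalog>
  <Publisher ISBN="0-000-00000-0">Has one</Publisher>
  <Publisher>Has none</Publisher>
</Catalog>"""

# No error here: the standard-library parser does not validate.
root = ET.fromstring(doc)

def missing_required(root, tag, attr):
    """Return the text of every `tag` element that lacks `attr` (hypothetical helper)."""
    return [el.text for el in root.iter(tag) if el.get(attr) is None]

offenders = missing_required(root, "Publisher", "ISBN")
print(offenders)
```

With #IMPLIED in place of #REQUIRED the parse result would be the same, and the second Publisher would simply have no ISBN value; only the validity verdict differs.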

Special case: #FIXED-type attributes

String- and tokenized-type attributes can have their "default" values specified using an additional keyword: #FIXED. I put the "default" in quotes because what the #FIXED keyword actually specifies is the only allowable value for the given attribute. If the attribute is not supplied in a given document, it's assumed to have the value assigned by the ATTLIST declaration; if the attribute is supplied in a given document, it may have that value only. For example,
   <!ATTLIST Article author CDATA #FIXED "John E. Simpson">

This may seem like a mostly useless kind of attribute declaration to make. For example, the above asserts that every Article element will always be understood to have an author attribute whose value will always be "John E. Simpson." How many real applications might there be like this?

One common use for #FIXED-type attributes occurs when you're mixing attributes from various XML vocabularies. You might not control the range of allowable values for a given attribute as expressed in one of these vocabularies, but you need to limit it as expressed in your own vocabulary. For instance, the XLink spec permits the xlink:type attribute to have a value of simple, extended, locator, and so on. In your application you may want to guarantee that for a particular XLinking-type element, this attribute has a value of only simple, only extended, or whatever. That is, in its "native tongue" the attribute may be enumerated, string, or tokenized; but as used in your application it may have only a single specific value.
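
As a sketch of that XLink case, here's some Python in which a Pub_page element gets its xlink:type from a #FIXED declaration rather than from the document author. The declaration itself is my own illustration, not something quoted from the XLink spec, and it assumes the DTD is supplied as an internal subset so that Python's non-validating standard parser can read it.

```python
import xml.etree.ElementTree as ET

XLINK = "http://www.w3.org/1999/xlink"

# The author writes no xlink:type attribute; the DTD fixes it to "simple".
doc = f"""<!DOCTYPE Pub_page [
  <!ATTLIST Pub_page xlink:type CDATA #FIXED "simple">
]>
<Pub_page xmlns:xlink="{XLINK}"/>"""

root = ET.fromstring(doc)

# ElementTree spells namespaced attribute names as {namespace-uri}localname.
link_type = root.get(f"{{{XLINK}}}type")
print(link_type)
```

A non-validating parser only supplies the fixed value when the attribute is absent. Rejecting a Pub_page that tries to say xlink:type="extended" takes a validating parser.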