Big Documents, Little Attributes
June 6, 2001
Q: How do I process a big XML document?
DOM-based XML parsers load an XML document's tree structure into memory. What if the XML document is too big -- say, 7 million records? This may give an out-of-memory error. On the other hand, SAX-based parsers read character-by-character from an XML document. If I have to search for a record at the end of a document this big, it's going to take a long time to find the matching record.
So what sort of parser should I use to parse an XML file which has around 7 million records -- taking into consideration memory and search time?
A: I don't know anything about your application other than the information you've provided above, but you know what it sounds like to me? It sounds like somewhere along the line, someone (not necessarily you) made a decision to use XML for the application without properly considering the consequences.
As a general rule (to which there are exceptions, no doubt), XML is ill-suited as a large-document storage format. There are a number of intrinsic reasons for this, such as the following.
- The markup itself (tags, attributes, entity references, and so on) can occupy (and waste) a significant amount of the document's overall size.
- There's nothing inherent in an XML document to accelerate searching for a node with a particular value. (If you're using XSLT to process the document, you can take some advantage of the "indexing" which most processors do when encountering functions like key(). Trouble is, the index goes away as soon as the document is closed -- and of course needs to be rebuilt the next time it's opened.)
- All you can truly store in an XML document is text data. You can sort of hack your way around this limitation, using external entities, XLink hrefs, and base64 encoding of binary data. None of which changes the fact that the document itself will contain nothing but text.
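To make the second point concrete, here's a minimal sketch in Python of what "searching" a large XML document amounts to when you stream it rather than load it. Everything in it is an assumption for illustration: the records/record structure, the id attribute, and the helper names find_record and make_document are hypothetical, not anything from the question.

```python
import io
import xml.etree.ElementTree as ET


def find_record(source, wanted_id):
    """Stream through <record> elements in document order.

    Returns (records_scanned, text_of_match), or (records_scanned, None)
    if no record carries the wanted id.
    """
    scanned = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "record":
            continue
        scanned += 1
        if elem.get("id") == wanted_id:
            return scanned, elem.text
        # Discard this record's text and attributes. The root still keeps
        # an empty stub per record, so memory grows slowly, not per-content.
        elem.clear()
    return scanned, None


def make_document(n):
    """Build a small stand-in for the big file (hypothetical structure)."""
    rows = "".join(f'<record id="r{i}">value {i}</record>' for i in range(n))
    return io.BytesIO(f"<records>{rows}</records>".encode("utf-8"))


if __name__ == "__main__":
    print(find_record(make_document(1000), "r0"))    # found on the first record
    print(find_record(make_document(1000), "r999"))  # every record is scanned first
```

Memory stays modest this way, but the cost of finding a record is proportional to how far into the document it sits, and the whole scan starts over for the next lookup. That's the work an index in a database would save you.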
If you must process 7 million records, must do it repeatedly, and are determined to do it with XML rather than a DBMS, I think you should forget about "what sort of parser" to use. That's the least of your worries. Instead, consider one of the various XML databases on the market. You can start by checking out Ron Bourret's XML Database Products page.
Good luck. Plan for some long work days.
Q: I'm confused about specifying attribute values in a DTD
I'm having trouble understanding #REQUIRED vs. "default values" attribute defaults. For example:
<!ATTLIST Publisher #REQUIRED ISBN CDATA "?????">
What does the ISBN CDATA "?????" do?
A: Your example is scrambled. The general syntax for an ATTLIST declaration puts the element name first, then the attribute name, then the information about the attribute's value. You've got the element name (Publisher) there all right, and the attribute name (ISBN), but the attribute name needs to immediately follow the element name. Thus, setting aside your other questions for a moment, your declaration should read:
<!ATTLIST Publisher ISBN attribvalueinfo>
Now let's take a look at what goes into attribvalueinfo. This is actually a kind of shorthand for two types of information: the attribute type and the attribute's default-value specification.
A DTD can declare an attribute as one of three general types: enumerated, string, or tokenized.
The first, enumerated, simply lists the allowable values for the attribute in
enumerated form: as a series of tokens separated by "pipes" (vertical-bar or |
characters). So you might have something like this:
<!ATTLIST Publisher medium (print | online | other) defaultspec>
Then there's the string attribute type. Unlike the enumerated type, for which the author of the DTD provides a list of valid values, a string-type attribute is constrained (as the name implies) simply in that its value must be a string of characters. You declare a string-type attribute with the keyword CDATA. For instance:
<!ATTLIST Publisher ISBN CDATA defaultspec>
Finally, the tokenized attribute type takes as its value either a single token (roughly speaking, a "word"), or a series of tokens separated by whitespace. The value is further constrained in that it must be of some type of token. For example:
<!ATTLIST Publisher ISBN ENTITY defaultspec>
limits the value of the ISBN attribute to the name of some entity declared elsewhere in the DTD. And
<!ATTLIST Publisher ISBN IDREFS defaultspec>
says that the ISBN attribute's value can consist of one or more (whitespace-delimited) tokens whose values match those of ID-type attributes elsewhere in the same document.
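If you want to see how a parser reads the type portion of these declarations, here's a small sketch using the expat bindings in Python's standard library, which report each ATTLIST declaration found in a document's internal DTD subset. The seeAlso attribute and the declared_types helper are hypothetical names I've made up for the example; note also that expat doesn't validate, it only reports what was declared.

```python
import xml.parsers.expat

# One declaration of each general type: enumerated, string, tokenized.
# "seeAlso" is an invented attribute name, used only for illustration.
DOC = """<!DOCTYPE Publisher [
  <!ATTLIST Publisher medium (print | online | other) #IMPLIED>
  <!ATTLIST Publisher ISBN CDATA #IMPLIED>
  <!ATTLIST Publisher seeAlso IDREFS #IMPLIED>
]>
<Publisher>...</Publisher>
"""


def declared_types(document):
    """Return {(element, attribute): declared type} for every ATTLIST entry."""
    found = {}

    def on_attlist(elname, attname, type_, default, required):
        # type_ is the keyword (CDATA, IDREFS, ...) or, for an enumerated
        # attribute, a parenthesized list of the allowed tokens.
        found[(elname, attname)] = type_

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.Parse(document, True)
    return found


if __name__ == "__main__":
    for (element, attribute), type_ in declared_types(DOC).items():
        print(element, attribute, type_)
```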
Specific attribute default values
Providing a default value for the enumerated and string-type attributes is simple. For the enumerated type, you can just select which of the allowable values you want as the default value, enclosing it in quotation marks. Thus,
<!ATTLIST Publisher medium (print | online | other) "print">
In the absence of any medium attribute for a given Publisher element, the attribute will thus assume the value of print.
You can also supply a simple, quotation-mark-enclosed string value to be used as the default value of a string-type attribute. This wouldn't make a lot of sense in the case of an ISBN (you wouldn't want to assign a default ISBN to every Publisher element). But it might make a lot of sense in the case of (say) an XLink xlink:href attribute, for which you'd want to supply a default location for some resource. Like this:
<!ATTLIST Pub_page xlink:href CDATA "http://www.ora.com">
This would ensure that the XLink would point to something even if, in a particular instance, the document author failed to provide a URI for it.
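Here's a rough sketch of these specific defaults at work, again using Python's standard expat bindings. By default expat fills in attribute defaults declared in the internal DTD subset, so an element that omits the attribute is still reported with a value. The Catalog wrapper element and the reported_attributes helper are assumptions of mine; the two declarations are the ones shown above.

```python
import xml.parsers.expat

# "Catalog" is a hypothetical wrapper so several elements fit in one document.
DOC = """<!DOCTYPE Catalog [
  <!ATTLIST Publisher medium (print | online | other) "print">
  <!ATTLIST Pub_page xlink:href CDATA "http://www.ora.com">
]>
<Catalog>
  <Publisher/>
  <Publisher medium="online"/>
  <Pub_page/>
</Catalog>
"""


def reported_attributes(document):
    """List (element name, attributes) for every element that has attributes."""
    seen = []

    def on_start(name, attrs):
        # attrs includes values supplied by the DTD defaults, not just
        # the attributes typed into the document instance.
        if attrs:
            seen.append((name, dict(attrs)))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return seen


if __name__ == "__main__":
    for name, attrs in reported_attributes(DOC):
        print(name, attrs)
```

The first Publisher comes back with medium set to print and the second keeps its own online value, while the empty Pub_page picks up the default URI.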
However, besides assigning a specific value, you can also use a couple of other forms in the defaultspec portion of the attribute declaration. These are the forms which seem to be tripping you up.
"Generic" attribute default specifications
You've got two choices in declaring a non-specific attribute default value: #IMPLIED and #REQUIRED.
#IMPLIED is for use when you don't want your DTD to supply a default value at all and don't care if a given document's author has supplied a value for it, either. The declaration
<!ATTLIST Publisher ISBN CDATA #IMPLIED>
asserts that a given Publisher element may or may not have an ISBN attribute. If there is no such attribute, its value is undefined.
On the other hand, #REQUIRED says some value must be supplied. Assume the declaration for the ISBN attribute looks like this:
<!ATTLIST Publisher ISBN CDATA #REQUIRED>
Then the following Publisher element appearing in a document instance would be rejected by a validating XML parser:
<Publisher>...</Publisher>
because there's no ISBN attribute.
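To illustrate the difference between #IMPLIED and #REQUIRED, here's a hand-rolled sketch of the one check described above. It is not a validating parser: Python's standard expat bindings don't validate, so the code collects the #REQUIRED declarations itself and compares each start tag against them. The missing_required helper and the sample documents are hypothetical.

```python
import xml.parsers.expat


def missing_required(document):
    """Return a message for each element lacking a #REQUIRED attribute."""
    required = {}   # element name -> set of #REQUIRED attribute names
    problems = []

    def on_attlist(elname, attname, type_, default, required_flag):
        # expat reports #REQUIRED as: no default value, required flag set.
        # #IMPLIED also has no default, but its required flag is not set.
        if required_flag and default is None:
            required.setdefault(elname, set()).add(attname)

    def on_start(name, attrs):
        for attname in sorted(required.get(name, ())):
            if attname not in attrs:
                problems.append(f"<{name}> is missing required attribute {attname}")

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return problems


def sample(defaultspec, start_tag):
    """Build a tiny test document around one ISBN declaration."""
    return (
        "<!DOCTYPE Publisher [\n"
        f"  <!ATTLIST Publisher ISBN CDATA {defaultspec}>\n"
        "]>\n"
        f"{start_tag}...</Publisher>\n"
    )


if __name__ == "__main__":
    print(missing_required(sample("#REQUIRED", "<Publisher>")))
    print(missing_required(sample("#IMPLIED", "<Publisher>")))
```

With #REQUIRED the bare Publisher element is flagged; with #IMPLIED the same element passes without comment.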
Special case: #FIXED-type attributes
String- and tokenized-type attributes can have their "default" values specified using #FIXED. I put the "default" in quotes because what the #FIXED keyword actually specifies is the only allowable value for the given attribute. If the attribute is not supplied in a given document, it's assumed to have the value assigned by the ATTLIST declaration; if the attribute is supplied in a given document, it may have that value only. For example,
<!ATTLIST Article author CDATA #FIXED "John E. Simpson">
This may seem like a mostly useless kind of attribute declaration to make. For example, the declaration above asserts that every Article element will always be understood to have an author attribute whose value will always be "John E. Simpson." How many real applications might there be like this?
One common use for #FIXED-type attributes occurs when you're mixing attributes from various XML vocabularies. You might not control the range of allowable values for a given attribute as expressed in one of these vocabularies, but you need to limit it as expressed in your own vocabulary. For instance, the XLink spec permits the xlink:type attribute to have a value of simple, extended, locator, and so on. In your application you may want to guarantee that for a particular XLinking-type element, this attribute has a value of simple, extended, or whatever. That is, in its "native tongue" the attribute may be enumerated, string, or tokenized; but as used in your application it may have only a single specific value.
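As a last illustration, here's a sketch of the two halves of the #FIXED rule, using the Article declaration from earlier and Python's standard expat bindings. The Issue wrapper element and the check_fixed helper are hypothetical. Because expat doesn't validate, it happily reports a wrong author value, so the comparison against the fixed value is done by hand here; a validating parser would reject that element instead.

```python
import xml.parsers.expat

# "Issue" is an invented wrapper element holding three test cases.
DOC = """<!DOCTYPE Issue [
  <!ATTLIST Article author CDATA #FIXED "John E. Simpson">
]>
<Issue>
  <Article/>
  <Article author="John E. Simpson"/>
  <Article author="Someone Else"/>
</Issue>
"""


def check_fixed(document):
    """Return (reported value, matches the fixed value?) per fixed attribute use."""
    fixed = {}   # (element, attribute) -> the single allowed value
    seen = []

    def on_attlist(elname, attname, type_, default, required_flag):
        # expat signals #FIXED as: a default value AND the required flag set.
        if default is not None and required_flag:
            fixed[(elname, attname)] = default

    def on_start(name, attrs):
        for (elname, attname), allowed in fixed.items():
            if elname == name:
                value = attrs.get(attname)
                seen.append((value, value == allowed))

    parser = xml.parsers.expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return seen


if __name__ == "__main__":
    for value, ok in check_fixed(DOC):
        print(value, "ok" if ok else "violates #FIXED")
```

The first Article omits the attribute and is reported with the fixed value anyway; the third is the case that breaks the rule.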