Big Documents, Little Attributes
June 6, 2001
Q: How do I process a big XML document?
DOM-based XML parsers load an XML document's tree structure into memory. What if the XML document is too big -- say, 7 million records? This may give an out-of-memory error. On the other hand, SAX-based parsers read character-by-character from an XML document. If I have to search for a record at the end of a document this big, it's going to take a long time to find the matching record.
So what sort of parser should I use to parse an XML file which has around 7 million records -- taking into consideration memory and search time?
A: I don't know anything about your application other than the information you've provided above, but you know what it sounds like to me? It sounds like somewhere along the line, someone (not necessarily you) made a decision to use XML for the application without properly considering the consequences.
As a general rule (to which there are exceptions, no doubt), XML is ill-suited as a large-document storage format. There are a number of intrinsic reasons for this, such as the following.
- The markup itself (tags, attributes, entity references, and so on) can occupy (and waste) a significant amount of the document's overall size.
- There's nothing inherent in an XML document to accelerate searching for a node with a particular value. (If you're using XSLT to process the document, you can take some advantage of the "indexing" which most processors do when encountering functions like id() and key(). Trouble is, the index goes away as soon as the document is closed -- and of course needs to be rebuilt the next time it's opened.)
- All you can truly store in an XML document is text data. You can sort of hack your way around this limitation, using external entities, XLink hrefs, and base64 encoding of binary data. None of which changes the fact that the document itself will contain nothing but text.
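To make the second point concrete, here is a minimal Python sketch of what a lookup over a large document boils down to: stream through every record once (here with the standard library's xml.etree.ElementTree.iterparse) and build your own index as you go. The element name record, the id attribute, and the name child are all assumptions made up for illustration, since nothing is known about the actual data.

```python
import io
import xml.etree.ElementTree as ET

def build_index(source, tag="record", key="id"):
    """Stream through the document once, mapping each record's key to its <name> text."""
    index = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            index[elem.get(key)] = elem.findtext("name")
            elem.clear()  # drop the record's attributes and children once indexed
    return index

# Hypothetical two-record stand-in for the 7-million-record file.
sample = io.BytesIO(
    b"<records>"
    b"<record id='1'><name>first</name></record>"
    b"<record id='2'><name>second</name></record>"
    b"</records>"
)
index = build_index(sample)
print(index["2"])
```

The dictionary lives only as long as the process does, so the full scan has to be repeated on every run; that repeated cost is what a database index avoids.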
If you must process 7 million records, must do it repeatedly, and are determined to do it with XML rather than a DBMS, I think you should forget about "what sort of parser" to use. That's the least of your worries. Instead, consider one of the various XML databases on the market. You can start by checking out Ron Bourret's XML Database Products page.
Good luck. Plan for some long work days.
Q: I'm confused about specifying attribute values in a DTD
I'm having trouble understanding #FIXED vs. #REQUIRED vs. "default values" attribute defaults. For example:
<!ATTLIST Publisher #REQUIRED ISBN CDATA "?????">
What does the #REQUIRED do? What does the ISBN CDATA "?????" actually mean?
A: Your example is scrambled. The general syntax for an ATTLIST declaration is
<!ATTLIST elementname attribname attribvalueinfo>
You've got the element name (Publisher) there all right, and the attribute name (ISBN) too, but the attribute name needs to immediately follow the element name. Thus, setting aside your other questions for a moment, your declaration should read:
<!ATTLIST Publisher ISBN attribvalueinfo>
Now let's take a look at what goes into attribvalueinfo. This is actually a kind of shorthand for two types of information: the attribute type and the attribute's default-value specification.
Attribute types
A DTD can declare an attribute as one of three general types: enumerated, string, or tokenized.
The first, enumerated, simply lists the allowable values for the attribute in enumerated form: as a series of tokens separated by "pipes" (vertical-bar or | characters). So you might have something like this:
<!ATTLIST Publisher medium (print | online | other) defaultspec>
Then there's the string attribute type. Unlike the enumerated type, for which the author of the DTD provides a list of valid values, a string-type attribute is constrained (as the name implies) simply in that its value must be a string of characters. You indicate a string-type attribute with the keyword CDATA. For instance:
<!ATTLIST Publisher ISBN CDATA defaultspec>
Finally, the tokenized attribute type takes as its value either a single token (roughly speaking, a "word"), or a series of tokens separated by whitespace. The value is further constrained in that it must be of some type of token. For example:
<!ATTLIST Publisher ISBN ENTITY defaultspec>
limits the value of the ISBN attribute to the name of some unparsed entity declared elsewhere in the DTD. And
<!ATTLIST Publisher ISBN IDREFS defaultspec>
says that the ISBN attribute's value can consist of one or more (whitespace-delimited) tokens whose values match those of ID-type attributes elsewhere in the same document.
Specific attribute default values
Providing a default value for the enumerated and string-type attributes is simple. First, for the enumerated type, you can just select which of the allowable values you want to be the default value, enclosing it in quotation marks. Thus,
<!ATTLIST Publisher medium (print | online | other) "print">
In the absence of any medium attribute for a given Publisher element, the attribute will assume the value of print just as if it had been explicitly coded that way by the document's author.
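You can watch a parser fill in that default. The following is a minimal sketch in Python using the standard library's xml.parsers.expat module; expat doesn't validate, but it does read the internal DTD subset and supplies declared attribute defaults. The helper name start_tags and the sample document are made up for illustration.

```python
from xml.parsers import expat

def start_tags(xml_text):
    """Return (element name, attributes) for every start tag the parser reports."""
    seen = []
    parser = expat.ParserCreate()
    parser.StartElementHandler = lambda name, attrs: seen.append((name, attrs))
    parser.Parse(xml_text, True)
    return seen

# The Publisher element below carries no medium attribute of its own.
doc = """<!DOCTYPE Publisher [
<!ATTLIST Publisher medium (print | online | other) "print">
]>
<Publisher>...</Publisher>"""

tags = start_tags(doc)
print(tags)
```

The application sees a medium attribute on Publisher even though the document's author never typed one.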
You can also supply a simple, quotation-mark-enclosed string value to be used as the default value of a string-type attribute. This wouldn't make a lot of sense in the case of an ISBN (you wouldn't want to assign a default ISBN to every Publisher element). But it might make a lot of sense in the case of (say) an XLink href attribute, for which you'd want to supply a default location for some resource. Like this:
<!ATTLIST Pub_page xlink:href CDATA "http://www.ora.com">
This would ensure that the XLink would point to something even if, in a particular instance, the document author failed to provide a URI for it.
However, besides assigning a specific value, you can also use a couple of other forms of the defaultspec portion of the attribute declaration. These are the forms which seem to be tripping you up.
"Generic" attribute default specifications
You've got two choices in declaring a non-specific attribute default value: #IMPLIED and #REQUIRED.
#IMPLIED is for use when you don't want your DTD to supply a default value at all and don't care if a given document's author has supplied a value for it, either. Thus,
<!ATTLIST Publisher ISBN CDATA #IMPLIED>
asserts that a given Publisher element may or may not have an ISBN attribute. If there is no such attribute, its value is undefined.
On the other hand, #REQUIRED says some value must be supplied. Assume the declaration for the ISBN attribute looks like this:
<!ATTLIST Publisher ISBN CDATA #REQUIRED>
Then the following Publisher element appearing in a document instance would be rejected by a validating XML parser:
<Publisher>...</Publisher>
because there's no ISBN attribute.
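A non-validating parser won't make that complaint for you, but the check itself is easy to picture. Here is a rough Python sketch, again using the standard library's xml.parsers.expat: it collects #REQUIRED declarations through pyexpat's AttlistDeclHandler callback and then looks for them on each start tag. The function name missing_required is hypothetical, and this is only an illustration of the rule, not a substitute for a real validating parser.

```python
from xml.parsers import expat

def missing_required(xml_text):
    """Return (element, attribute) pairs for #REQUIRED attributes that were left out."""
    required = {}   # element name -> set of #REQUIRED attribute names
    missing = []

    def on_attlist(elname, attname, type_, default, is_required):
        # expat reports #REQUIRED as: no default value, required flag set.
        if is_required and default is None:
            required.setdefault(elname, set()).add(attname)

    def on_start(name, attrs):
        for attname in sorted(required.get(name, ())):
            if attname not in attrs:
                missing.append((name, attname))

    parser = expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return missing

DTD = """<!DOCTYPE Publisher [
<!ATTLIST Publisher ISBN CDATA #REQUIRED>
]>
"""

print(missing_required(DTD + "<Publisher>...</Publisher>"))
```

Supplying any ISBN value at all, even an empty string, satisfies the declaration; #REQUIRED says nothing about what the value should be.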
Special case: #FIXED-type attributes
String- and tokenized-type attributes can have their "default" values specified using an additional keyword: #FIXED. I put the "default" in quotes because what the #FIXED keyword actually specifies is the only allowable value for the given attribute. If the attribute is not supplied in a given document, it's assumed to have the value assigned by the ATTLIST declaration; if the attribute is supplied in a given document, it may have that value only. For example,
<!ATTLIST Article author CDATA #FIXED "John E. Simpson">
This may seem like a mostly useless kind of attribute declaration to make. For example, the above asserts that every Article element will always be understood to have an author attribute whose value will always be "John E. Simpson." How many real applications might there be like this?
One common use for #FIXED-type attributes occurs when you're mixing attributes from various XML vocabularies. You might not control the range of allowable values for a given attribute as expressed in one of these vocabularies, but you need to limit it as expressed in your own vocabulary. For instance, the XLink spec permits the xlink:type attribute to have a value of simple, extended, locator, and so on. In your application you may want to guarantee that for a particular XLinking-type element, this attribute has a value of only simple, only extended, or whatever. That is, in its "native tongue" the attribute may be enumerated, string, or tokenized; but as used in your application it may have only a single specific value.
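As a rough illustration of that last case, here is a Python sketch (standard library xml.parsers.expat once more) that pins xlink:type to simple for the Pub_page element and flags any instance that says otherwise. The declaration, the function name fixed_violations, and the sample instances are all assumptions for the sake of the example; a validating parser would do this check on its own.

```python
from xml.parsers import expat

def fixed_violations(xml_text):
    """Return (element, attribute, value) wherever a #FIXED attribute has another value."""
    fixed = {}        # (element name, attribute name) -> the one allowed value
    violations = []

    def on_attlist(elname, attname, type_, default, is_required):
        # expat reports #FIXED as: a default value plus the required flag.
        if is_required and default is not None:
            fixed[(elname, attname)] = default

    def on_start(name, attrs):
        for (elname, attname), value in fixed.items():
            if elname == name and attrs.get(attname) != value:
                violations.append((name, attname, attrs.get(attname)))

    parser = expat.ParserCreate()
    parser.AttlistDeclHandler = on_attlist
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return violations

DTD = """<!DOCTYPE Pub_page [
<!ATTLIST Pub_page xlink:type CDATA #FIXED "simple">
]>
"""

print(fixed_violations(DTD + '<Pub_page xlink:type="extended"/>'))
```

An instance that omits xlink:type passes the check, because the parser supplies the fixed value just as it would any other default.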