Big Documents, Little Attributes
DOM-based XML parsers load an XML document's tree structure into memory. What if the XML document is too big -- say, 7 million records? This may give an out-of-memory error. On the other hand, SAX-based parsers read character-by-character from an XML document. If I have to search for a record at the end of a document this big, it's going to take a long time to find the matching record.
So what sort of parser should I use to parse an XML file which has around 7 million records -- taking into consideration memory and search time?
A: I don't know anything about your application other than the information you've provided above, but you know what it sounds like to me? It sounds like somewhere along the line, someone (not necessarily you) made a decision to use XML for the application without properly considering the consequences.
As a general rule (to which there are exceptions, no doubt), XML is ill-suited as a large-document storage format. There are a number of intrinsic reasons for this, such as the following.
... id() and key(). Trouble is, the index goes away as soon as the
document is closed -- and of course needs to be rebuilt the next
time it's opened.)
... hrefs, and base64 encoding of binary data. None of
which changes the fact that the document itself will contain nothing but
text.
If you must process 7 million records, must do it repeatedly, and are determined to do it with XML rather than a DBMS, I think you should forget about "what sort of parser" to use. That's the least of your worries. Instead, consider one of the various XML databases on the market. You can start by checking out Ron Bourret's XML Database Products page.
Good luck. Plan for some long work days.
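If the job really is a one-off pass over the file, a streaming parser keeps memory use flat, though it does nothing for search time: finding a record near the end still means reading everything before it. What follows is only a sketch in Python, using the standard library's xml.etree.ElementTree.iterparse. The element name record, the id attribute, and the assumption that records are direct children of the root element are all hypothetical, since the question doesn't describe the document's structure.

```python
import io
import xml.etree.ElementTree as ET


def find_record(source, record_tag, wanted_id):
    """Scan source once and return the attributes of the first matching record."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            if elem.get("id") == wanted_id:
                return dict(elem.attrib)
            root.clear()  # drop finished records so memory use stays bounded
    return None


# A tiny in-memory stand-in for the real 7-million-record file.
sample = (
    b"<records>"
    + b"".join(b'<record id="%d">data</record>' % i for i in range(1000))
    + b"</records>"
)
print(find_record(io.BytesIO(sample), "record", "999"))
```

The scan is still linear in the size of the file, so repeated lookups pay the full cost every time. That is the part an index in a database takes care of.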
I'm having trouble understanding #FIXED vs. #REQUIRED vs. "default
values" attribute defaults. For example:
<!ATTLIST Publisher #REQUIRED ISBN CDATA "?????">
What does the #REQUIRED do? What does the ISBN CDATA "?????" actually
mean?
A: Your example is scrambled. The general syntax for an
ATTLIST declaration is:
<!ATTLIST elementname attribname attribvalueinfo>
You've got the element name (Publisher) there all right,
and the attribute name (ISBN) too, but the attribute name
needs to immediately follow the element name. Thus, setting aside your
other questions for a moment, your declaration should read:
<!ATTLIST Publisher ISBN attribvalueinfo>
Now let's take a look at what goes into
attribvalueinfo. This is actually a kind of
shorthand for two types of information: the attribute type and the
attribute's default-value specification.
A DTD can declare an attribute as one of three general types: enumerated, string, or tokenized.
The first, enumerated, simply lists the allowable values for the
attribute in enumerated form: as a series of tokens separated by
"pipes" (vertical-bar or | characters). So you might have something like
this:
<!ATTLIST Publisher medium (print | online | other) defaultspec>
Then there's the string attribute type. Unlike the
enumerated type, for which the author of the DTD provides a list of
valid values, a string-type attribute is constrained (as the name
implies) simply in that its value must be a string of characters. You
indicate a string-type attribute with the keyword CDATA. For
instance:
<!ATTLIST Publisher ISBN CDATA defaultspec>
Finally, the tokenized attribute type takes as its value
either a single token (roughly speaking, a "word"), or a series of
tokens separated by whitespace. The value is further constrained in
that it must be of some type of token. For example:
<!ATTLIST Publisher ISBN ENTITY defaultspec>
limits the value of the ISBN attribute to the name of
some entity declared elsewhere in the DTD. And
<!ATTLIST Publisher ISBN IDREFS defaultspec>
says that the ISBN attribute's value can consist of one
or more (whitespace-delimited) tokens whose values match those of
ID-type attributes elsewhere in the same document.
Providing a default value for the enumerated and string-type
attributes is simple. First, for the enumerated type, you can just
select which of the allowable values you want to be the default value,
enclosing it in quotation marks. Thus,
<!ATTLIST Publisher medium (print | online | other) "print">
In the absence of any medium attribute for a given
Publisher element, the attribute will thus assume the
value of print just as if it had been explicitly coded
that way by the document's author.
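To see a declared default take effect, here is a minimal sketch in Python using the standard library's expat bindings. Expat is a non-validating parser, but it does read attribute declarations in the internal DTD subset and reports defaulted attributes alongside the ones the author actually wrote. The Catalog wrapper element and the function name publisher_media are made up for the illustration.

```python
import xml.parsers.expat

# The DTD lives in the internal subset so the parser is sure to read it.
DOC = """<!DOCTYPE Catalog [
  <!ATTLIST Publisher medium (print | online | other) "print">
]>
<Catalog>
  <Publisher>No medium given</Publisher>
  <Publisher medium="online">Medium given</Publisher>
</Catalog>"""


def publisher_media(xml_text):
    """Return the medium value reported for each Publisher start tag."""
    media = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        if name == "Publisher":
            media.append(attrs.get("medium"))

    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return media


print(publisher_media(DOC))
```

The first Publisher element carries no medium attribute in the document, yet the handler receives one, filled in from the declaration.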
You can also supply a simple, quotation-mark-enclosed string value
to be used as the default value of a string-type
attribute. This wouldn't make a lot of sense in the case of an ISBN
(you wouldn't want to assign a default ISBN to every
Publisher element). But it might make a lot of sense in
the case of (say) an XLink href attribute, for which
you'd want to supply a default location for some resource. Like
this:
<!ATTLIST Pub_page xlink:href CDATA "http://www.ora.com">
This would ensure that the XLink would point to something even
if, in a particular instance, the document author failed to provide a
URI for it.
However, besides assigning a specific value, you can also use a
couple of other forms of the defaultspec portion
of the attribute declaration. These are the forms which seem to be
tripping you up.
You've got two choices in declaring a non-specific attribute default value:
#IMPLIED and #REQUIRED.
#IMPLIED is for use when you don't want your DTD to supply a
default value at all and don't care if a given document's author has
supplied a value for it, either. Thus,
<!ATTLIST Publisher ISBN CDATA #IMPLIED>
asserts that a given Publisher element may or may not
have an
ISBN attribute. If there is no such attribute, its value is
undefined.
On the other hand, #REQUIRED says some value must be
supplied. Assume the declaration for the ISBN attribute looks
like this:
<!ATTLIST Publisher ISBN CDATA #REQUIRED>
Then the following Publisher element appearing in a
document instance would be rejected by a validating XML parser:
<Publisher>...</Publisher>
because there's no ISBN attribute.
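Python's standard library has no validating parser, so the rejection described above can't be reproduced directly with it. As a stand-in, the sketch below hand-codes the one check that #REQUIRED implies: every listed attribute must be present on the start tag. The REQUIRED table, the Catalog wrapper, the placeholder ISBN value, and the function name are all assumptions for illustration. A real validating parser would take the rule from the DTD itself.

```python
import xml.parsers.expat

# Hypothetical table mirroring the #REQUIRED declaration in the DTD below.
REQUIRED = {"Publisher": ["ISBN"]}

DOC = """<!DOCTYPE Catalog [
  <!ATTLIST Publisher ISBN CDATA #REQUIRED>
]>
<Catalog>
  <Publisher>...</Publisher>
  <Publisher ISBN="placeholder-isbn">...</Publisher>
</Catalog>"""


def missing_required(xml_text, required):
    """Return (element, attribute) pairs for required attributes that are absent."""
    problems = []
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        for attr in required.get(name, []):
            if attr not in attrs:
                problems.append((name, attr))

    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)  # expat itself does not enforce #REQUIRED
    return problems


print(missing_required(DOC, REQUIRED))
```

Only the first Publisher element is flagged. Had the declaration said #IMPLIED instead, the same missing attribute would simply be absent from attrs, and nothing would need reporting.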
String- and tokenized-type attributes (and enumerated ones too; only
ID-type attributes are excluded) can have their "default" values
specified using an additional keyword: #FIXED. I
put the "default" in quotes because what the #FIXED
keyword actually specifies is the only allowable value for the
given attribute. If the attribute is not supplied in a given document,
it's assumed to have the value assigned by the ATTLIST
declaration; if the attribute is supplied in a given document,
it may have that value only. For example,
<!ATTLIST Article author CDATA #FIXED "John E. Simpson">
This may seem like a mostly useless kind of attribute declaration
to make. For example, the above asserts that every
Article element will always be understood to have an
author attribute whose value will always be "John
E. Simpson." How many real applications might there be like
this?
One common use for #FIXED-type attributes occurs when
you're mixing attributes from various XML vocabularies. You might not
control the range of allowable values for a given attribute as
expressed in one of these vocabularies, but you need to limit it as
expressed in your own vocabulary. For instance, the XLink spec
permits the xlink:type attribute to have a value of
simple, extended, locator, and
so on. In your application you may want to guarantee that for a
particular XLinking-type element, this attribute has a value of only
simple, only extended, or whatever. That is,
in its "native tongue" the attribute may be enumerated, string, or
tokenized; but as used in your application it may have only a
single specific value.
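The XLink case can be sketched the same way as any other defaulted attribute. In this Python example (again using the standard library's expat bindings), the DTD pins xlink:type to simple for Pub_page elements, and the parser reports that value even though the author wrote only xlink:href. The Catalog wrapper and the function name are hypothetical.

```python
import xml.parsers.expat

DOC = """<!DOCTYPE Catalog [
  <!ATTLIST Pub_page xlink:type CDATA #FIXED "simple">
]>
<Catalog>
  <Pub_page xlink:href="http://www.ora.com">O'Reilly</Pub_page>
</Catalog>"""


def pub_page_attributes(xml_text):
    """Return the attributes reported for the first Pub_page start tag."""
    found = []
    parser = xml.parsers.expat.ParserCreate()  # no namespace processing

    def on_start(name, attrs):
        if name == "Pub_page":
            found.append(dict(attrs))

    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return found[0] if found else None


print(pub_page_attributes(DOC))
```

Expat is non-validating, so it covers only the "supply the value when it's missing" half of #FIXED. Rejecting a document that spells out some other xlink:type value is the job of a validating parser.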
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.