Relax, and Take it Easy
March 2, 2000
Two technologies introduced on Wednesday at XTech 2000 promise to make developers' lives easier, giving them less work to do in processing and managing XML. Paul Prescod's EasySAX provides a middle ground between the Simple API for XML (SAX) and the Document Object Model (DOM) for Python. Murata Makoto presented the REgular LAnguage description for XML (RELAX), which is a simpler schema language being developed at the Information Technology Research and Standardization Centre (INSTAC) in Japan.
Paul Prescod of ISOGEN, whom David Megginson introduced as the "Python evangelist for XML," described the work he's done on EasySAX, a Python module. EasySAX provides a middle ground between the development-intensive approach of event handler-based parsing and the memory-intensive approach of tree-based parsing. By offering developers events, context, and the ability to build tree structures when appropriate, EasySAX lets developers choose their own balance of processing tradeoffs.
Presenting a Zen quest for the "Pythonic" way to process XML, Prescod reused parts from both the SAX and DOM APIs, while hiding their complexity and resource demands. In fact, EasySAX merges some ideas from SAX, DOM, XSLT, and DSSSL, providing a layer above the bare parser API.
Prescod sought Aristotle's golden mean in his quest for a Pythonic processing API: making it simple, but not too simple to get the job done; elegant but not cute; flexible but not at the cost of clarity; and dynamic but maintainable.
Asking "Does SAX have the Python nature?", Prescod found much in SAX that suits his Python approach, concluding that SAX's complexity is acceptable as long as it is hidden. SAX's good performance and standards conformance are Pythonic, and "reinventing wheels is not Pythonic." To hide that complexity, the EasySAX API gives character handling, event dispatching, and context management more Python-like, "friendlier" support.
The DOM presents more difficulties for the "Python nature," even apart from Prescod's opinions on its overall design elegance. Tree models have serious limitations because they can consume enormous amounts of memory rapidly, making them difficult to use with large documents. While some of Python's tools, such as the Zope Object Database (ZODB), can ease those problems by moving large trees from memory to disk, that approach opens up new performance problems. Nonetheless, avoiding the reinvention of the wheel suggests the DOM has an important role to play.
EasySAX combines material from SAX and DOM with more borrowings from XSLT, XPath, DSSSL, Omnimark, Balise, and others, to build an API that dispatches nodes rather than events. These nodes have context, and content handlers can take advantage of that context to limit their activation to particular situations.
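The idea of context-limited handlers can be sketched with nothing more than Python's standard xml.sax module. The sketch below is not the EasySAX API, which this report does not show; the ContextDispatcher class, its on() method, and the "chapter/title" pattern syntax are all hypothetical names invented for illustration. It keeps a stack of open elements and calls a handler only when the end of that stack matches a registered pattern.

```python
import xml.sax


class ContextDispatcher(xml.sax.ContentHandler):
    """Hypothetical helper (not EasySAX): fire a callback only in a given context."""

    def __init__(self):
        super().__init__()
        self.path = []   # names of the currently open elements, outermost first
        self.rules = []  # list of (pattern tuple, callback) pairs

    def on(self, pattern, callback):
        # A pattern such as "chapter/title" means "a title whose parent is chapter".
        self.rules.append((tuple(pattern.split("/")), callback))

    def startElement(self, name, attrs):
        self.path.append(name)
        for pattern, callback in self.rules:
            # Compare the tail of the current path with the registered pattern.
            if tuple(self.path[-len(pattern):]) == pattern:
                callback(list(self.path), dict(attrs.items()))

    def endElement(self, name):
        self.path.pop()


def collect_in_context(xml_text, pattern):
    """Return (path, attributes) for every element matching the pattern."""
    found = []
    handler = ContextDispatcher()
    handler.on(pattern, lambda path, attrs: found.append((path, attrs)))
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return found
```

Calling collect_in_context(document, "chapter/title") would report titles that sit directly inside a chapter while ignoring a title elsewhere in the document, and no tree is built along the way.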
Because the appropriate amount of tree-building varies from application to application, EasySAX lets developers choose how to process nodes—with or without tree-building. A "mini-DOM" provides access to the tree structures built during parsing. At the same time, the parent context is always available during the parse, and can be used even without tree-building. Namespaces can be registered before or during the parse, allowing Python programmers to reference namespace URIs with prefixes, as is done in an XML document.
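Selective tree-building of this kind can be approximated with the standard library's xml.dom.pulldom module, which streams events and builds a DOM subtree only when asked. This is an analogy rather than EasySAX itself, and the function name expand_selected and its parameters are assumptions made for the example.

```python
from xml.dom import pulldom


def expand_selected(xml_text, wanted):
    """Stream a document, building a small DOM tree only for `wanted` elements."""
    subtrees = []
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == wanted:
            # Pull the rest of this element into a subtree; everything else
            # in the document is seen only as passing events.
            events.expandNode(node)
            subtrees.append(node.toxml())
    return subtrees
```

Elements that are not selected never become part of a tree, so memory use follows the size of the expanded subtrees rather than the size of the whole document.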
EasySAX is almost complete, and should be released in the next few days on Prescod's web site, though documentation, tools for pruning tree structures, and a number of other features are still in development.
RELAX supplies a simple tool for creating grammars that describe XML-based languages, providing a lightweight alternative to XML Schema Structures. Although RELAX is built in large part on the theoretical framework of "hedge grammars," it deliberately takes a lightweight approach to document description. Like other schema proposals, it uses an XML document syntax to describe document structures.
"Classic" RELAX provides the functionality of "DTD features minus default values minus entities minus notations plus datatypes." (RELAX uses XML Schema Datatypes to describe datatypes.) "Fully Relaxed" RELAX adds Horn clauses and regular hedge grammars to the mix. These make it possible to describe ancestor-sensitive content models, equivalence classes, local scoping, mutually exclusive attributes, and content models that depend on attribute values.
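To make "ancestor-sensitive content models" concrete, here is a toy Python checker. It does not use RELAX syntax and is not a RELAX implementation; the RULES table, the label names, and the validate function are assumptions made purely for illustration. The point it shows is that one element name (title) can be given different content models depending on where it appears, something a DTD's single declaration per element name cannot express.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical grammar: label -> (element name, content model, child labels).
# The content model is a regular expression over child element names, each
# followed by one space. The last item says which label a child takes here.
RULES = {
    "doc": ("doc", r"head (section )*", {"head": "head", "section": "section"}),
    "head": ("head", r"title ", {"title": "plain_title"}),
    "section": ("section", r"title (para )*", {"title": "rich_title", "para": "para"}),
    "plain_title": ("title", r"", {}),               # title in head: no child elements
    "rich_title": ("title", r"(em )*", {"em": "em"}),  # title in section: em allowed
    "para": ("para", r"(em )*", {"em": "em"}),
    "em": ("em", r"", {}),
}


def validate(elem, label):
    """Check an element against the rule for `label`, then recurse into children."""
    name, model, child_labels = RULES[label]
    if elem.tag != name:
        return False
    child_names = "".join(child.tag + " " for child in elem)
    if re.fullmatch(model, child_names) is None:
        return False
    return all(validate(child, child_labels[child.tag]) for child in elem)


def is_valid(xml_text):
    return validate(ET.fromstring(xml_text), "doc")
```

With these rules, a title containing em is accepted inside a section but rejected inside head, even though both elements share the same name.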
RELAX can be used in the same contexts as DTDs and XML Schemas, as a document description language supporting validation and other processes building on an expected set of document structures. An alpha version of a Java-based tool for converting RELAX to DTDs is available, as is a C++-based tool for verifying that documents conform to RELAX descriptions. Another Java-based tool in alpha can generate Java classes for processing documents based on a RELAX description.
RELAX is also namespace-aware, and a RELAX Namespace version that supports mixing modules describing multiple namespaces should appear by June. The RELAX Core should appear this month, though the approval process for standardization (through JIS and ISO) will take longer. Tutorials and descriptions are currently available in Japanese and English (see links below).
Murata took a calm view of RELAX's prospects for success, accepting that "as of today, nobody knows," and that "users and developers will make the final call." Conversions from DTDs to RELAX, and from RELAX to XML Schemas, offer a middle ground that may make RELAX a useful tool for immediate development—even in cases where developers ultimately expect to migrate to XML Schemas. In any case, RELAX brings a different perspective to solving the schema definition problem.