XQuery, XSLT, and OmniMark: Mixed Content Processing
Document-oriented XML usually has a highly irregular structure in which elements may be mixed in unknown ways. Processing such XML requires advanced data-driven facilities: push-style processing enriched with transformation rules and side-effect-free updates. In this article we highlight such facilities in three XML-native languages (XQuery, XSLT, and OmniMark) and analyze the applicability of these languages and their combinations to document-oriented XML processing. As data in many practical applications comes from a database query, we also examine various approaches to combining XQuery with XSLT or OmniMark for document-oriented XML processing over a database system.
What is notable about processing document-oriented XML data is that a particular XML element can appear virtually anywhere in the content (i.e., at any level of the hierarchy of the XML document tree and intermixed with any other elements). When processing such elements, one usually wants to preserve their relative positions among other elements in the XML document tree. In other words, some elements are to be replaced while others are to be preserved. The replacement for an element may consist of nothing, another element, or a sequence of elements. Below we provide a number of particular examples of such replacements.
The primary approach to processing document-oriented XML data is data-driven transformation (where the order of the output is dictated by the order of the input) as opposed to code-driven transformation (where the order of the output is dictated by XSLT stylesheets, OmniMark rules, or XQuery queries).
Using data-driven transformation, it is very easy to preserve the relative position of elements being processed. In XSLT and OmniMark, data-driven transformations can be naturally expressed in push style using transformation rules.
Let us consider an example. Suppose we need to process a document-oriented XML document (doc.xml) as follows: replace all elements named "a" with an element named "b," which contains the content of "a" wrapped in the "*" symbol. This is how it looks in XSLT.
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:template match="a">
<b>*<xsl:value-of select="text()"/>*</b>
</xsl:template>
<xsl:template match="*">
<xsl:element name="{name(.)}">
<xsl:apply-templates/>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
The same can be expressed in OmniMark as follows.
element a
output "<b>*" || "%c" || "*</b>"
element #implied
output "<%q>%c</%q>"
process
do xml-parse
scan file "doc.xml"
output "%c"
done
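For readers who know neither XSLT nor OmniMark, the push-style idea can be illustrated with a minimal Python sketch. It is not taken from the article: the rule table RULES and the function names are hypothetical, and the standard xml.etree.ElementTree module stands in for a real transformation engine. Each input element selects the rule that handles it, and the default rule rebuilds the element while keeping the surrounding text in place. Like the listings above, the default rule does not copy attributes.

```python
import xml.etree.ElementTree as ET

def rule_a(elem, apply):
    # Rule for <a>: emit <b> holding the leading text of <a> wrapped in "*".
    b = ET.Element("b")
    b.text = "*" + (elem.text or "") + "*"
    return b

def rule_default(elem, apply):
    # Default rule: rebuild the element under the same name and push
    # its children through the rules, keeping mixed text in place.
    out = ET.Element(elem.tag)
    out.text = elem.text
    for child in elem:
        new_child = apply(child)
        new_child.tail = child.tail  # text that follows the child
        out.append(new_child)
    return out

# Hypothetical rule table: element name -> rule.
RULES = {"a": rule_a}

def apply_rules(elem):
    # The input element selects the rule, so output order follows input order.
    return RULES.get(elem.tag, rule_default)(elem, apply_rules)

source = ET.fromstring("<doc>x <a>one</a> y <p>z <a>two</a></p></doc>")
result = ET.tostring(apply_rules(source), encoding="unicode")
print(result)
```

Adding a new replacement only requires a new entry in the rule table; the traversal itself stays untouched, which is the property the push style is valued for.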
As XQuery has no support for push style (it is a pure pull-style language), the only way to express such a transformation in XQuery is to write a recursive function that dispatches on the node type. The function traverses the source document and reconstructs it, replacing only the required elements. The following recursive function implements the same transformation as in the previous XSLT example.
declare function local:traverse-replace($n as node())
as node()
{
typeswitch($n)
case $a as element(a)
return
<b>*{$a/text()}*</b>
case $e as element()
return element
{ fn:local-name($e) }
{ for $c in $e/(* | text())
return local:traverse-replace($c) }
case $d as document-node()
return document
{ for $c in $d/* return local:traverse-replace($c) }
default return $n
};
The transformation can be applied to a whole document by invoking the local:traverse-replace function on the root node of the document, as follows:
local:traverse-replace(doc("doc.xml"))
Another way to accomplish such a transformation in XQuery has recently been introduced by the W3C in the "XQuery Update Facility." It extends XQuery with the transform operator, which allows performing data-driven XML transformations in a way that is very different from all previous approaches.
With transform, you can avoid the reconstruction of the elements that remain unchanged. In addition, transform can be implemented using a random access execution model that avoids sequentially scanning all the data and instead employs alternative ways to access the required data (mainly via indices). This possibility makes transform suitable for efficient support in database systems.
The main idea of the XQuery transform operator is to employ traditional in-place updates for data transformations. The semantics of in-place updates are modified to avoid side effects. That is why we refer to transform as a side-effect-free update. Semantically, instead of modifying the document, updates are evaluated on a new copy of it. Operationally, transform can be implemented without actual data copying (e.g., using a shadow mechanism as proposed in [Rekouts2006]).
The above example can be expressed via XQuery transform as follows.
transform
copy $new:=doc("doc.xml")
modify
for $a in $new//a
do replace $a with <b>*{$a/text()}*</b>
return $new
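The copy, modify, and return steps can be mimicked in a few lines of Python to make the side-effect-free semantics concrete. This is only an illustration of the semantics described above, not an implementation of the XQuery operator: the function names are hypothetical, and the sketch really does copy the data, whereas the text notes that an engine could avoid that.

```python
import copy
import xml.etree.ElementTree as ET

def transform(doc, modify):
    # copy: updates are evaluated on a new copy, never on the input.
    new = copy.deepcopy(doc)
    # modify: ordinary in-place updates, applied to the copy only.
    modify(new)
    # return: the updated copy.
    return new

def replace_a_with_b(root):
    # Replace every <a> below root with <b>*text*</b>, in place.
    # A snapshot of the elements is taken first so that the tree can be
    # changed safely. The root element itself is never replaced here.
    for parent in list(root.iter()):
        for i, child in enumerate(list(parent)):
            if child.tag == "a":
                b = ET.Element("b")
                b.text = "*" + (child.text or "") + "*"
                b.tail = child.tail  # keep the text that followed <a>
                parent[i] = b

original = ET.fromstring("<doc>x <a>one</a> y <p><a>two</a></p></doc>")
updated = transform(original, replace_a_with_b)
before = ET.tostring(original, encoding="unicode")
after = ET.tostring(updated, encoding="unicode")
print(before)
print(after)
```

Elements that are not touched by the update are never rebuilt by hand, and the input document is left unchanged, which is what distinguishes this approach from the recursive reconstruction.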
It is worth noting that currently we do not know of any implementations of the XQuery transform. Its efficient support is still an open research issue.
Comparing the approaches discussed above, we conclude that the push-style approach powered by transformation rules, as provided by XSLT and OmniMark, is a better choice for processing document-oriented XML data. The XQuery recursive function approach is usually harder to code and maintain than the push-style approach. As for XQuery transform, it remains to be seen how effective the approach is; its usability is especially questionable for complex transformations.
If XSLT and OmniMark seem to be well-suited for document-oriented XML data transformation, why not use them and forget about XQuery? Growing volumes of XML data (especially in an enterprise environment) often require a database for XML data management. For example, replacing XML elements that represent references (or placeholders) might require querying a database that maps references (or placeholders) to their substitutes. This means that XQuery is still required as a database query language, and we need to find the right way to combine the transformation languages (XSLT and OmniMark) with the query language (XQuery). In the following sections we analyze two approaches to the combination.
Before we analyze different ways of combining XQuery/XSLT and XQuery/OmniMark, it is worth noting that OmniMark and XQuery can be integrated directly, as OmniMark supports APIs to XQuery database systems (i.e., OmniMark plays the role of a host language for XQuery). XSLT/XQuery integration requires a third language to glue them together. If we are using command-line XSLT and XQuery processors, then a scripting language like Bash can play the role of the glue language. If we use XSLT and XQuery libraries, a general-purpose programming language like Java can be used to glue them.
Let us consider the following example. Suppose that there is an XML document that includes placeholders, which refer to fragments of a book stored in an XML database. We need to publish the document, replacing each placeholder with a rendered version of the corresponding fragment. Below is the document.
Example: document.xml
<page>
...
<fragmentref>378</fragmentref>
...
<fragmentref>835</fragmentref>
...
</page>
The simple solution is to query the database each time we come across a reference to a fragment during the XSLT or OmniMark transformation. Below is an example in OmniMark and XQuery. We use the OmniMark API to the Sedna XML database in this example.
Example: process.xom
import "omdb.xmd" prefixed by db.
global db.database moviedb
define string source function div-render (value string source s)
as
; rendering code is here
element fragmentref
local db.field result variable
db.query moviedb
statement "doc('book.xml')//div[@id='%c']"
into result
do when db.record-exists result
output div-render(db.reader of result)
done
element #implied
output "<%q>%c</%q>"
process
set moviedb to db.open-sedna "localhost" dbname "moviedb"
user "SYSTEM" password "MANAGER"
do xml-parse
scan file "document.xml"
output "%c"
done
db.close moviedb
This solution is tightly coupled due to the following properties:
While the size of the query result, which is a book fragment, is not known in advance, streaming processing of the fragment in OmniMark allows for processing it regardless of its size. As XSLT engines do not support streaming, the size of the query result that can be processed by XSLT engines is restricted by the size of available memory.
Another problem with the XSLT implementation of this solution is that XSLT engines usually do not support APIs to XML database systems. This means that an XSLT-based implementation has to call the database via an extension function implemented in a programming language that has an XQuery API, which overcomplicates the implementation.
To conclude, we would like to emphasize that while this solution can suffer from poor performance because of the many query calls, it does not impose any limitations on the size of the query result and allows using streaming transformation. The solution described in the next section has different properties.
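The shape of the tightly coupled solution, one database call per placeholder met during the transformation, can be sketched in Python as follows. Everything here is an assumption made for illustration: the dictionary FRAGMENTS stands in for the XML database, query_fragment stands in for one round trip to it, and render is a placeholder for the real rendering code.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the XML database: fragment id -> stored markup.
FRAGMENTS = {
    "378": "<div id='378'>First fragment</div>",
    "835": "<div id='835'>Second fragment</div>",
}
query_count = 0

def query_fragment(fragment_id):
    # Stands in for one round trip to the database.
    global query_count
    query_count += 1
    return ET.fromstring(FRAGMENTS[fragment_id])

def render(div):
    # Placeholder rendering: turn the stored div into a section.
    section = ET.Element("section")
    section.text = div.text
    return section

def publish(elem):
    # A placeholder triggers a query at the point where it is met.
    if elem.tag == "fragmentref":
        return render(query_fragment(elem.text.strip()))
    # Any other element is rebuilt with its text kept in place.
    out = ET.Element(elem.tag)
    out.text = elem.text
    for child in elem:
        new_child = publish(child)
        new_child.tail = child.tail
        out.append(new_child)
    return out

page = ET.fromstring(
    "<page>A <fragmentref>378</fragmentref> B <fragmentref>835</fragmentref></page>"
)
html = ET.tostring(publish(page), encoding="unicode")
print(html)
print(query_count)
```

The query counter grows with the number of placeholders in the document, which is the source of the performance cost mentioned above.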
Let us consider a popular example of document-oriented XML processing known as dynamic linkage. The idea is that 1) the content is marked up with semantically meaningful XML elements that represent media-neutral links, and 2) the elements are then replaced with the media-specific links at the time of content delivery (rendering). Dynamic linking is especially useful in the context of single source publishing, when the author focuses on content creation and does not have to worry about how content is delivered.
Consider a project to create a collection of movie reviews with associated information and to create output to various media. Movie reviews are full of references to other movies, actors, directors, places, times, and themes. All these references are good places to create links to other resources, such as biographies, maps, or histories. Instead of using direct HTML links, which are media-specific, the author marks up references with XML tags. These tags are named so that they describe the type of the reference (e.g. movie, actor, director). These tags have an attribute, name, which allows for the retrieval of information required to construct the media-specific links. When we publish reviews on the Web, we might link them to Wikipedia using HTML links. When we publish reviews on CD, we put a link to local resources.
Here is an example of a movie review with a reference to a director.
Example: reviews.xml
<reviews>
<review>
<title>Titanic</title>
<genre>romance</genre>
<text>
...
<p><director name="James Cameron">James Cameron's</director>
194-minute, $200 million film
of the tragic voyage is in the tradition of the great
Hollywood epics.</p>
...
</text>
</review>
...
</reviews>
Below is the corresponding fragment of the links mapping (people.xml). The document people.xml contains person elements, which have id attributes and contain biography elements with biography references for various media. The url element contains the URL to the director's biography intended for publishing on the web page. The file element has a path to the biography stored locally on the CD-ROM. The text element provides a brief biography for publishing on the print media.
Example: people.xml
<people>
<person id="James Cameron">
<biography>
<url>http://en.wikipedia.org/wiki/James_Cameron</url>
<file>/biography/james_cameron.html</file>
<text>
James Francis Cameron (born August 16, 1954) is
a Canadian-born American film director noted for
his action/science fiction films, which are often
extremely successful financially...
</text>
</biography>
...
</person>
...
</people>
This application can be implemented using the tightly coupled approach, but we will try to improve the performance by minimizing the number of database queries. This may be achieved by decomposition of the application into two separate tasks: database querying and reference processing. This approach allows for minimizing the inter-environment communication to just one data transmission, and as a pleasant side effect, it does not require an API from the transformation language to the database. This is why we refer to this solution as loosely coupled.
In general, the loosely coupled solution can be implemented as follows: the query should return each document augmented with all the information that is needed to process (render) it. In our particular example, it means that an XQuery query should return each review augmented with the corresponding subset of the mapping that is required to render the links within the review. Below is an example of how an augmented review can be represented.
Example: review-with-mapping.xml
<catalog>
<reviews>
<review>
<title>Titanic</title>
<map:mapping xmlns:map="www.linkmapping.com">
<map:record>
<map:name>James Cameron</map:name>
<map:link>http://en.wikipedia.org/wiki/James_Cameron</map:link>
</map:record>
</map:mapping>
<text>
...
<p><director name="James Cameron">James Cameron's</director>
194-minute, $200 million film of the tragic voyage is in
the tradition of the great Hollywood epics.</p>
...
</text>
</review>
</reviews>
</catalog>
This fragment contains a review extended by the mapping from director names mentioned in the review to the corresponding links. To combine the review with the mapping, XML elements in the www.linkmapping.com namespace are used. You can see that this fragment contains all the required information to render the review with no need to query the database.
In XQuery we use element constructors to combine the review text with the corresponding mapping. The XQuery query is as follows:
declare namespace map = "www.linkmapping.com";
<catalog>
<reviews>
{for $r in doc("reviews.xml")/reviews/review
return
<review>
<title>{$r/title/text()}</title>
<map:mapping xmlns:map="www.linkmapping.com">
{for $dir-name in distinct-values($r//director/@name)
let $dir:=doc("people.xml")//person[@id=$dir-name]
return
<map:record>
<map:name>{$dir-name}</map:name>
<map:link>{$dir/biography/url/text()}</map:link>
</map:record>
}
</map:mapping>
<text>
{$r/text/node()}
</text>
</review>}
</reviews>
</catalog>
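The augmentation step performed by the query can also be sketched outside the database. The Python fragment below is a rough equivalent under stated assumptions: augment and q are hypothetical helper names, both documents are small in-memory samples, and a dictionary plays the role of the index on the id attribute.

```python
import copy
import xml.etree.ElementTree as ET

MAP_NS = "www.linkmapping.com"
ET.register_namespace("map", MAP_NS)

def q(tag):
    # Qualified name in the mapping namespace (ElementTree notation).
    return "{%s}%s" % (MAP_NS, tag)

def augment(review, people):
    # Index person elements by id, playing the role of the database index.
    by_id = {p.get("id"): p for p in people.iter("person")}
    out = ET.Element("review")
    ET.SubElement(out, "title").text = review.findtext("title")
    mapping = ET.SubElement(out, q("mapping"))
    names = []  # distinct director names, in document order
    for director in review.iter("director"):
        if director.get("name") not in names:
            names.append(director.get("name"))
    for name in names:
        record = ET.SubElement(mapping, q("record"))
        ET.SubElement(record, q("name")).text = name
        person = by_id.get(name)
        url = person.findtext("biography/url") if person is not None else None
        ET.SubElement(record, q("link")).text = url or ""
    text = review.find("text")
    if text is not None:
        out.append(copy.deepcopy(text))  # the review body is carried over as is
    return out

people = ET.fromstring(
    "<people><person id='James Cameron'><biography>"
    "<url>http://en.wikipedia.org/wiki/James_Cameron</url>"
    "</biography></person></people>"
)
review = ET.fromstring(
    "<review><title>Titanic</title><text><p>"
    "<director name='James Cameron'>Cameron</director> and again "
    "<director name='James Cameron'>Cameron</director></p></text></review>"
)
augmented = augment(review, people)
records = augmented.findall(q("mapping") + "/" + q("record"))
print(ET.tostring(augmented, encoding="unicode"))
```

Two references to the same director yield a single mapping record, mirroring the use of distinct-values in the query.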
The fragment can then be processed in XSLT as follows:
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="2.0"
xmlns:map="www.linkmapping.com">
<xsl:template match="review">
<review>
<xsl:apply-templates>
<xsl:with-param name="mapping" select="./map:mapping"/>
</xsl:apply-templates>
</review>
</xsl:template>
<xsl:template match="director">
<xsl:param name="mapping"/>
<xsl:variable name="dname" select="@name"/>
<a href="{$mapping/map:record[map:name=$dname]/map:link/text()}">
<xsl:value-of select="."/>
</a>
</xsl:template>
<xsl:template match="map:mapping"/>
<xsl:template match="*">
<xsl:param name="mapping"/>
<xsl:element name="{node-name(.)}">
<xsl:apply-templates>
<xsl:with-param name="mapping" select="$mapping"/>
</xsl:apply-templates>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
The same transformation can be expressed in OmniMark as follows:
global string locname
global string locref
global string locmapping variable
group "reference-processing"
element #base mapping
output "%c"
element #base record
output "%c"
do when locmapping hasnt key locname
set new locmapping{locname} to locref
done
element #base name
set locname to "%c"
element #base link
set locref to "%c"
group #implied
process
do xml-parse
scan file "review-with-mapping.xml"
output "%c"
done
xmlns-change when xmlns-name = "www.linkmapping.com"
using group "reference-processing" output "%c"
element director
output "<a href='" || locmapping{"%v(name)"} || "'>%c</a>"
element #implied
output "<%q>%c</%q>"
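The rendering half of the loosely coupled solution needs no database access at all, because the mapping travels with the review. A minimal Python sketch of that step is given below; render_review is a hypothetical name, and the input is a shortened version of the augmented review shown earlier.

```python
import xml.etree.ElementTree as ET

MAP = "{www.linkmapping.com}"

def render_review(review):
    # Read the embedded mapping once: director name -> link.
    links = {
        record.findtext(MAP + "name"): record.findtext(MAP + "link")
        for record in review.iter(MAP + "record")
    }

    def walk(elem):
        if elem.tag == "director":
            # Replace the media-neutral reference with an HTML link.
            a = ET.Element("a", href=links.get(elem.get("name"), ""))
            a.text = "".join(elem.itertext())
            return a
        out = ET.Element(elem.tag)
        out.text = elem.text
        for child in elem:
            if child.tag == MAP + "mapping":
                continue  # the mapping itself is not part of the output
            new_child = walk(child)
            new_child.tail = child.tail
            out.append(new_child)
        return out

    return walk(review)

augmented = ET.fromstring("""<review><title>Titanic</title>
<map:mapping xmlns:map="www.linkmapping.com"><map:record>
<map:name>James Cameron</map:name>
<map:link>http://en.wikipedia.org/wiki/James_Cameron</map:link>
</map:record></map:mapping>
<text><p><director name="James Cameron">James Cameron's</director> film</p></text>
</review>""")
rendered = ET.tostring(render_review(augmented), encoding="unicode")
print(rendered)
```

Swapping the link for a local file path or a short biography would only change which field of the mapping the query places in the record, not the rendering code.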
The properties of the loosely coupled approach may be described as follows:
According to the above, the general rule for choosing between loosely coupled and tightly coupled solutions is as follows. When the size of the query result is unpredictable, the only solution that is guaranteed to work properly is tightly coupled, as it allows for processing the query result as a stream. When the size of the query result is known to be quite small, both tightly coupled and loosely coupled solutions can be used, but the loosely coupled one should work faster.
Let us demonstrate the latter statement by experiments. We will compare tightly coupled and loosely coupled solutions for the movie review example introduced in the previous section. Both solutions are implemented using OmniMark version 8.0 and Sedna version 1.0. The experiments were conducted on Windows XP on a computer with the following configuration: Pentium M 1.8GHz with a hard disk of 4200 RPM. Sedna buffers were set to 100MB. There are 3000 movie reviews stored in the database. Each review is about 4KB in size and includes six director references, on average. The mapping (people.xml) is 2.22GB in size and includes 50,6000 people. It is also stored in the database. person elements are indexed by the id attribute. The table below contains average total execution time for five runs.
| Solution | Cold Buffers | Hot Buffers |
|---|---|---|
| Tightly coupled | 7 min 40 sec | 7 min 10 sec |
| Loosely coupled | 21 sec | 12 sec |
This table demonstrates that the loosely coupled solution is an order of magnitude faster, as it minimizes the number of queries. The fact that loosely coupled solutions usually work faster is also discussed in the literature for database practitioners; for instance, see Section 5.4.2, "Minimize the Number of Round-Trips Between the Application and the Database Server," in Database Tuning: Principles, Experiments and Troubleshooting Techniques by Dennis E. Shasha and Philippe Bonnet, published by Morgan Kaufmann in 2002.
Processing document-oriented XML in modern content management applications is a challenging task, as it often requires both content transformation and database querying. Domain-specific XML transformation languages (e.g., XSLT and OmniMark) are very good at document-oriented XML processing but require a query language (e.g., XQuery) to access a database. In XQuery, document-oriented XML processing can be implemented via the transform mechanism, but this mechanism is suitable only for simple transformation tasks performed on the database side. To build elegant and efficient document-oriented XML processing applications, we have to combine transformation and query languages. We have described two possible approaches to combining the languages, which we call tightly coupled and loosely coupled, and discussed the pros and cons of these approaches.
The last thing worth mentioning is that XQuery-based systems have an advantage over SQL-based ones for loosely coupled solutions. Chunks that include all the data required to process them have quite a complex (hierarchical) structure. SQL does not provide adequate construction facilities to build such structures, as it is designed to deal with simpler (flat) structures. Thanks to XML node constructors, such structures can be easily built in XQuery.
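The figures in the table can be turned into speedup factors with a few lines of arithmetic. The Python sketch below uses only the numbers reported above; the query count for the tightly coupled run is an estimate that assumes one database query per director reference.

```python
reviews = 3000
refs_per_review = 6  # average reported for the test data

# Assumption: the tightly coupled run issues one query per reference.
tight_queries = reviews * refs_per_review

# Total execution times from the table, in seconds.
tight = {"cold": 7 * 60 + 40, "hot": 7 * 60 + 10}
loose = {"cold": 21, "hot": 12}

speedup = {k: round(tight[k] / loose[k], 1) for k in tight}
print(tight_queries)  # 18000
print(speedup)        # {'cold': 21.9, 'hot': 35.8}
```

Under that assumption, roughly 18,000 round trips are replaced by a single query, which is consistent with the measured speedup of about 22 times with cold buffers and about 36 times with hot buffers.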
The authors would like to thank Maria Grineva and Patrick Baker for valuable discussions and comments.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.