XML.com: XML From the Inside Out

XQuery, XSLT, and OmniMark: Mixed Content Processing

In general, the loosely coupled solution can be implemented as follows: the query returns each document augmented with all the information needed to process (render) it. In our example, this means that an XQuery query returns each review together with the subset of the mapping required to render the links within that review. Below is an example of how an augmented review can be represented.

Example: review-with-mapping.xml
<catalog>
<reviews>
 <review>
  <title>Titanic</title>
  <map:mapping xmlns:map="www.linkmapping.com">
   <map:record>
    <map:name>James Cameron</map:name>
    <map:link>http://en.wikipedia.org/wiki/James_Cameron</map:link>
   </map:record>
  </map:mapping>
  <text>
   ...
  <p><director name="James Cameron">James Cameron's</director>
  194-minute, $200 million film of the tragic voyage is in
  the tradition of the great Hollywood epics.</p>

  ...
  </text>
 </review>
</reviews>
</catalog>

This fragment contains a review extended by the mapping from director names mentioned in the review to the corresponding links. To combine the review with the mapping, XML elements in the www.linkmapping.com namespace are used. You can see that this fragment contains all the required information to render the review with no need to query the database.
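
To make the self-sufficiency of the fragment concrete, here is a minimal Python sketch that renders one augmented review using nothing but the fragment itself. It is an illustration only, not part of the solutions discussed in this article: the sample string is the fragment above reduced to the single review with the elided parts dropped, and the function names read_mapping and render are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

MAP_NS = "www.linkmapping.com"  # namespace URI used in the article's example

# The article's fragment, reduced to one review (wrappers and "..." dropped).
AUGMENTED_REVIEW = """\
<review>
  <title>Titanic</title>
  <map:mapping xmlns:map="www.linkmapping.com">
    <map:record>
      <map:name>James Cameron</map:name>
      <map:link>http://en.wikipedia.org/wiki/James_Cameron</map:link>
    </map:record>
  </map:mapping>
  <text><p><director name="James Cameron">James Cameron's</director>
  194-minute, $200 million film of the tragic voyage is in
  the tradition of the great Hollywood epics.</p></text>
</review>
"""

def read_mapping(review):
    """Collect name -> link pairs from the embedded map:mapping element."""
    ns = {"map": MAP_NS}
    return {
        rec.findtext("map:name", namespaces=ns): rec.findtext("map:link", namespaces=ns)
        for rec in review.findall("map:mapping/map:record", ns)
    }

def render(elem, mapping):
    """Serialize elem, turning <director> into <a href=...> and dropping the mapping."""
    if elem.tag == "{%s}mapping" % MAP_NS:
        return ""  # the mapping is lookup data, not content
    # Mixed content: the element's own text, then each child followed by its tail.
    inner = escape(elem.text or "") + "".join(
        render(child, mapping) + escape(child.tail or "") for child in elem
    )
    if elem.tag == "director":
        href = mapping.get(elem.get("name"), "")
        return "<a href=%s>%s</a>" % (quoteattr(href), inner)
    return "<%s>%s</%s>" % (elem.tag, inner, elem.tag)

review = ET.fromstring(AUGMENTED_REVIEW)
rendered = render(review, read_mapping(review))
print(rendered)
```

No database handle appears anywhere in the sketch; the link for each director comes from the mapping carried inside the review.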

In XQuery, we use element constructors to combine the review text with the corresponding mapping. The XQuery query is as follows:

declare namespace map = "www.linkmapping.com";
<catalog>
 <reviews>
  {(: one augmented review per stored review :)
   for $r in doc("reviews.xml")/reviews/review
   return
    <review>
     <title>{$r/title/text()}</title>
     <map:mapping xmlns:map="www.linkmapping.com">
      {(: one record per distinct director named in the review :)
       for $dir-name in distinct-values($r//director/@name)
       let $dir := doc("people.xml")//person[@id = $dir-name]
       return
        <map:record>
         <map:name>{$dir-name}</map:name>
         <map:link>{$dir/biography/url/text()}</map:link>
        </map:record>}
     </map:mapping>
     <text>
      {$r/text/node()}
     </text>
    </review>}
 </reviews>
</catalog>

The fragment can then be processed in XSLT as follows:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                version="2.0"
                xmlns:map="www.linkmapping.com">

<xsl:template match="review">
 <review>
  <xsl:apply-templates>
     <xsl:with-param name="mapping" select="./map:mapping"/>
  </xsl:apply-templates>
 </review>
</xsl:template>

<xsl:template match="director">
  <xsl:param name="mapping"/>
  <xsl:variable name="dname" select="@name"/>
  <a href="{$mapping/map:record[map:name=$dname]/map:link/text()}">
     <xsl:value-of select="."/>
  </a>
</xsl:template>

<xsl:template match="map:mapping"/>

<xsl:template match="*">
 <xsl:param name="mapping"/>
 <xsl:element name="{node-name(.)}">
   <xsl:apply-templates>
     <xsl:with-param name="mapping" select="$mapping"/>
   </xsl:apply-templates>
 </xsl:element>
</xsl:template>

</xsl:stylesheet>

The same transformation can be expressed in OmniMark as follows:

global string locname
global string locref
global string locmapping variable

group "reference-processing"
element #base mapping
    output "%c"

element #base record
    output "%c"
    do when locmapping hasnt key locname
        set new locmapping{locname} to locref
    done

element #base name
     set locname to "%c"

element #base link
     set locref to "%c"

group #implied
process
        do xml-parse
            scan file "review-with-mapping.xml"
            output "%c"
        done

xmlns-change when xmlns-name = "www.linkmapping.com"
    using group "reference-processing" output "%c"

element director
    output "<a href='" || locmapping{"%v(name)"} || "'>%c</a>"

element #implied
    output "<%q>%c</%q>"

The properties of the loosely coupled approach may be described as follows:

  1. As all the relevant data are fetched in advance and there is no need to access the database during the transformation process, a loosely coupled solution may be implemented without a database API for the transformation language used. While a tightly coupled solution requires such an API, a loosely coupled solution can employ a standalone tool to fetch the data and apply a separate, database-agnostic tool to transform the pre-fetched data.
  2. The loosely coupled approach minimizes the number of database queries, but it may restrict the size of the processed data. In the example considered, the mapping for a given review is limited by buffer size because we have to keep the mapping in memory while we process the review. As a single review cannot contain a lot of references and a single link cannot be too large, a loosely coupled solution will work for this application.
  3. Combining transformation and query languages in a loosely coupled fashion improves modular design. The de-coupled tasks may be implemented as separate reusable modules and utilized in different content processing pipelines.
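
The separation described in the list above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: FakeStore is a hypothetical in-memory stand-in for the database that counts lookups, and Chunk, fetch, and transform are hypothetical names. The fetch stage is the only one that touches the store; the transform stage sees nothing but the chunk.

```python
from dataclasses import dataclass, field

@dataclass
class FakeStore:
    """Hypothetical stand-in for the database; counts lookups made against it."""
    people: dict
    lookups: int = 0

    def link_for(self, name):
        self.lookups += 1
        return self.people[name]

@dataclass
class Chunk:
    """A review plus the subset of the mapping needed to render it."""
    title: str
    directors: list
    mapping: dict = field(default_factory=dict)

def fetch(store, reviews):
    """Stage 1: the only stage that touches the store."""
    chunks = []
    for title, directors in reviews:
        # Look up each distinct director once, as distinct-values does in the query.
        mapping = {name: store.link_for(name) for name in sorted(set(directors))}
        chunks.append(Chunk(title, directors, mapping))
    return chunks

def transform(chunk):
    """Stage 2: database-agnostic; uses only data carried by the chunk."""
    links = ", ".join(
        '<a href="%s">%s</a>' % (chunk.mapping[d], d) for d in chunk.directors
    )
    return "%s: %s" % (chunk.title, links)

store = FakeStore({"James Cameron": "http://en.wikipedia.org/wiki/James_Cameron"})
chunks = fetch(store, [("Titanic", ["James Cameron", "James Cameron"])])
lookups_after_fetch = store.lookups
out = [transform(c) for c in chunks]  # no store argument: nothing left to query
print(out[0], store.lookups)
```

Because transform takes no store argument, it could be reused in a pipeline whose chunks come from a file or another service rather than from a database.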

Comparison of Loosely Coupled and Tightly Coupled Solutions

According to the above, the general rule for choosing between loosely coupled and tightly coupled solutions is as follows. When the size of the query result is unpredictable, the only solution that is guaranteed to work properly is tightly coupled, as it allows for processing the query result as a stream. When the size of the query result is known to be quite small, both tightly coupled and loosely coupled solutions can be used, but the loosely coupled one should work faster.

Let us support the latter claim with an experiment. We compare tightly coupled and loosely coupled solutions for the movie review example introduced in the previous section. Both solutions are implemented using OmniMark version 8.0 and Sedna version 1.0. The experiments were run under Windows XP on a computer with a 1.8GHz Pentium M processor and a 4200 RPM hard disk. Sedna buffers were set to 100MB. The database stores 3000 movie reviews; each review is about 4KB in size and includes six director references, on average. The mapping (people.xml), which is also stored in the database, is 2.22GB in size and includes 50,6000 people. The person elements are indexed by the id attribute. The table below shows the average total execution time over five runs.

Solution          Cold Buffers    Hot Buffers
Tightly coupled   7 min 40 sec    7 min 10 sec
Loosely coupled   21 sec          12 sec

This table demonstrates that the loosely coupled solution is more than an order of magnitude faster, as it minimizes the number of queries. The fact that loosely coupled solutions usually work faster is also discussed in the literature for database practitioners; for instance, see Section 5.4.2, "Minimize the Number of Round-Trips Between the Application and the Database Server," in Database Tuning: Principles, Experiments and Troubleshooting Techniques by Dennis E. Shasha and Philippe Bonnet, published by Morgan Kaufmann in 2002.
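
The size of the gap can be worked out directly from the reported figures. The short Python calculation below uses only the timings and data set sizes given above; the variable names are ours, and the remark about one query per reference is an assumption made for illustration, not a measured property of the tightly coupled implementation.

```python
def seconds(minutes, secs):
    """Convert a 'min sec' timing from the table to seconds."""
    return 60 * minutes + secs

# Average total execution times from the table above.
timings = {
    "cold": {"tight": seconds(7, 40), "loose": seconds(0, 21)},
    "hot": {"tight": seconds(7, 10), "loose": seconds(0, 12)},
}

speedup = {k: v["tight"] / v["loose"] for k, v in timings.items()}
for buffers, ratio in speedup.items():
    print(f"{buffers} buffers: {ratio:.1f}x")
# cold buffers: 21.9x
# hot buffers: 35.8x

# 3000 reviews with six director references each, on average.
# ASSUMPTION: if every reference triggered its own lookup, this would be
# the number of round trips a per-reference strategy has to pay for.
references = 3000 * 6
print(references)  # 18000
```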

Conclusion

Processing document-oriented XML in modern content management applications is a challenging task, as it often requires both content transformation and database querying. Domain-specific XML transformation languages (e.g., XSLT and OmniMark) are very good at document-oriented XML processing but require a query language (e.g., XQuery) to access a database. In XQuery, document-oriented XML processing can be implemented via the transform mechanism, but this mechanism is suitable only for simple transformation tasks performed on the database side. To build elegant and efficient document-oriented XML processing applications, we have to combine transformation and query languages. We have described two possible approaches to combining the languages, which we call tightly coupled and loosely coupled, and discussed their pros and cons.

The last thing worth mentioning is that XQuery-based systems have an advantage over SQL-based ones for loosely coupled solutions. Chunks, each of which includes all the data required to process it, have quite a complex (hierarchical) structure. SQL does not provide adequate construction facilities to build such structures, as it is designed to deal with simpler (flat) structures. Thanks to XML node constructors, such structures can be easily built in XQuery.
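
To illustrate the difference, the Python sketch below starts from a flat result set of the kind a relational join would typically return and regroups it into per-review chunks on the client side. It is a hypothetical example: only the first row is taken from this article, the other rows are placeholders, and nest is a name introduced here. In XQuery, the nested element constructors produce the hierarchical shape directly, so no such regrouping step is needed.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical flat result set: one row per (review title, director name, link).
# Only the first row comes from the article; the others are placeholders.
rows = [
    ("Titanic", "James Cameron", "http://en.wikipedia.org/wiki/James_Cameron"),
    ("Review B", "Director X", "http://example.org/x"),
    ("Review B", "Director Y", "http://example.org/y"),
]

def nest(flat_rows):
    """Regroup flat rows into one chunk per review, each with its own mapping."""
    ordered = sorted(flat_rows, key=itemgetter(0))  # groupby needs sorted input
    return [
        {"title": title, "mapping": {name: link for _, name, link in group}}
        for title, group in groupby(ordered, key=itemgetter(0))
    ]

chunks = nest(rows)
for chunk in chunks:
    print(chunk["title"], len(chunk["mapping"]))
```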

The authors would like to thank Maria Grineva and Patrick Baker for valuable discussions and comments.