Menu

XQuery and Data Abstraction

July 12, 2007

Kurt Cagle

Suppose, for a moment, that you had the whole computer revolution to do over again, knowing what you know now about things such as the shape of languages, the emergence of databases, the rise of networking and the Web, and so forth. For instance, while networking took shape first in the 1960s, it wasn't until the '80s that Ethernet forced a consolidation of the networking market that made it possible to create a new abstraction layer, one called TCP/IP, which in turn made the Internet viable. Meanwhile, Ted Codd's database mathematics laid the foundation for the relational database a few years earlier, leading to the adoption of the vendor-driven SQL specification in the late '80s.

Yet, what would have happened if the distributed programming paradigm and the foundations for REST had emerged prior to the rise of SQL, and XML had become established earlier? There are, admittedly, some flaws in this supposition, not least of which being the fact that XSD took as its starting point the same algebras that SQL did a couple of decades earlier, but for now, let's turn a blind eye to such anachronisms.

The Structured Query Language (SQL) arose due to a very specific need. The relational database model represented a significant step forward in the development of databases, because it made it possible for the first time to create non-linear databases in a consistent and cohesive fashion. However, with everyone and their VC out creating databases, the risk of fragmentation of the market became so high that the vendors realized they were better off establishing a formal standard (SQL 87). Therefore, if programmer A knew SQL, he could learn to program database DB without having to spend a year in training courses. Put another way, one of the initial draws of SQL as a technology was that it provided a database abstraction layer on all of the myriad relational databases in use at the time.

Database abstraction is a powerful concept, one that many programmers today tend to take for granted. The idea that you shouldn't need to know anything about the structure of your databases means several things:

  • Portability: You're not as locked into a specific vendor for a given technology, and if a different vendor offers better performance but otherwise uses the same underlying standard, then you can make the migration from one database to another with little to no work.
  • Knowledge universality: If your DBA is an Oracle guru and you need to work with SQL Server or IBM DB2, the same DBA can probably handle the other databases. This factor becomes more significant the larger the number of database vendors.
  • Heterogeneity: Most companies have more than one type of database, because different departments may have control over different aspects of a given IT budget. If you can abstract the database access layer, you can pull data from each of these resources and work with them together from the same application.
  • Application decoupling: If you have to build applications that require special formats for accessing different kinds of databases, this increases the degree to which that database makes your code more difficult to write and maintain.

For its time, SQL was a profound success because, while it didn't completely tame the database world (it is, of course, in the interest of the vendor to lock in developers to its particular product, at least in the short run). It managed to create a near universal standard for working with relational databases. This helped to launch relational databases as the preferred form for data storage for a couple of decades.

However, even as SQL-based databases became the norm, the realization arose that there are, in fact, other ways of describing data, each of which are more suited for particular domains of data. For high speed access of content with a stable schema and limited updating, LDAP is considerably faster than the standard RDBMS. For large-scale analysis of extremely large datasets, OLAP-based cubes actually worked better than the RDBMS model, and while SQL has frequently been used with OLAP cubes, it's not really well optimized for it. Devices and even filesystems can also be seen as databases holding system-oriented information, to the extent that most contemporary file systems (such as Reiser or ext3) include sophisticated journaling support that seems more the province of databases than of traditional file systems.

Finally, of course, there's the XML explosion and the rise of XML databases. An XML database differs from a relational one in a few key respects:

  • Semi-structured data: Because XML may have multiple possible options (elements included or not, abstract elements that can be replaced by different distinct instances, and so forth), modelling XML in a relational database can prove to be an expensive and ultimately self-defeating proposition, especially for information like a web page that tends to be very unstructured.
  • Externally defined schemas: An XML document does not necessarily need a formal schema, whereas such a schema is an implicit part of a relational database. Thus, XML databases usually make no tacit assumptions about data type.
  • Multiple namespaces: A relational database makes an implicit assumption about the atomicity of namespaces: all tables (and hence all fields) within a given database occupy the same namespace. This assumption is generally not true with regards to XML documents, where a given instance may potentially have dozens of namespaces associated with various elements and attributes.
  • Weakness of keys: A significant amount of the power in a relational database comes from the use of keys, primary and foreign, in order to bind various sets of records together. Thus, in an RDBMS, there should generally be a one-to-many relationship between a primary key and a foreign key, such that all foreign keys are internally referenceable against a primary key within the system. Otherwise, the system loses data integrity. In an XML document, on the other hand, IDRefs (which are analogous to foreign keys) may very well point to resources that are outside of the scope of the database. Moreover, because of this association with external resources, primary keys in general need to be not only locally unique, but potentially globally unique, compared to the frequent use of sequential integers as the keys of choice within a database.
  • Element multiplicity: In an RDBMS, fields are unique within a given table; there can only be one field with that name in the table itself. In an XML document, a given set of child elements may in fact all have the same name, some may have the same name, or none of them may have the same name (think of paragraphs in an XHTML document as one example). This is incredibly difficult to capture without giving up the notion that an element is analogous to a field. Curiously enough, JavaScript Object Notation (JSON) also runs into this problem, which means that, short of inducing complex semantics on JSON objects, JSON objects work fine for the capture of relational data, but stumble when attempting to capture document-centric material. And this is why JSON is, at best, a weak substitute for XML.

In general, XML databases get around these limitations by using the names of each element in a given document as table names with relationships kept as pointers to the various child, attribute, text, and, perhaps, parent nodes. By indexing each element with an eye toward its text content, this makes it possible to perform searches based on proximity between terms and to then retrieve the nodes that are the nearest ancestor to the node containing the text. In some cases, they also optimize this process by working with the schemas associated with the various XML documents, in other cases they use other indexing algorithms, but in general the result is that, while it is very easy to reconstruct the XML from the databases, the information is neither kept in a serial format, nor is it in the relational table model preferred by SQL.

This has some interesting implications. First of all, it's not really possible to query XML with SQL, because SQL relies on the presence of explicitly defined primary and foreign keys in order to retrieve a set of records in all but the simplest of cases, and for the most part these don't exist in XML. This realization was made back in the late 1990s, which was one of the motivating factors in the development of XPath; it was a way to perform selection operations on a set of nodes within an XML tree.

Now XPath is not terribly well loved in a lot of circles. Part of the reason for this is that it's not SQL. It has a syntax that is considerably more compact, it uses a concept called context that doesn't really exist in SQL, and it doesn't break the world into JOINS, which, since XML doesn't really rely on the presence of PK/FK pairs, is not surprising. Web developers (and key people in the HTML spec world, unfortunately) don't like it because it's fairly complex, and it has a syntax that is similar to, but not quite the same as, CSS selectors. While the Microsoft implementation started out with a fairly simple set of XPath methods on document and element objects (selectNodes() and selectSingleNode()), in more recent years it and Mozilla have created XPath objects that are considerably more verbose and certainly more painful to write, largely because they were implemented in the W3C DOM documentation.

Yet for all of that, XPath is still (with good reason) the established language for selecting nodes for XSLT and for specific DOM operations, to the extent that the W3C made a concerted effort to push the XPath 2.0 specification and the XQuery specification towards development as part of the same process. Indeed, when you get right down to it, XQuery can be thought of as XPath 2.0 with a framework for variable assignment, control and conditional statements, function and module declarations, and some extended syntax.

Why XPath 2.0? If XPath 1.0 has become so firmly established, what was wrong with it that it required a fairly significant upgrade? The answer again comes back to the bigger issue of what exactly data is. XPath 1.0 concentrated on XML structure, which meant that XPath could only manipulate objects (or collections of objects) such as strings, numbers, or other types if they were described as text nodes. This proved a significant limitation for XSLT (which is one of the reasons why the node-set() function is one of the most universal extensions to the XSLT 1.0 implementations) and also meant that an XPath 1.0 expression could only work on a single document at a time.

XPath 2.0, on the other hand, introduces four new concepts that fairly radically change what you can do with the language.

Sequences

The first, sequence, is a list of, well, pretty much anything. It can be a list of nodes, but it can also be a list of numbers, strings, dates, or mixed collections of these. With such a structure in place, you can create iterators and enumerators, perform set operations, and, as a consequence, you can build node sets on the fly without needing to resort to an extension function.

The following are all typical of sequences:

(3,5,8,10)

("Monday","Tuesday","Wednesday","Thursday","Friday")

(3 to 6)  -- Same as (3,4,5,6)

(//p,//h1,//h2)

(3,4,"abc",//li)

Basic Schema Types

The second change is the formal introduction of primitive schema data-types into the language. This both makes it possible to more accurately determine the behavior of XPath (and XSLT and XQuery) extensions and makes it easier for both XPath and XQuery processors to optimize their searches. While there are some downsides to integrating typed content, for the most part this addition will likely provide better, more type-safe queries and transformations. I'll cover the use of datatypes more formally in a bit.

Conditional and Iterative Operators

The third change is the addition of both conditional and iteration operators in XPath. For instance, the following XPath expression will return a sequence consisting of the length of all paragraphs in a document (this article, alas) that have more than 300 characters per paragraph:

for $para in //p return 

    if(string-length($para)>800) then string-length($para) else ( )

This should be eye-opening for a couple of reasons: the result set is not a set of XML nodes, but rather a sequence of numbers, and you can create temporary variables within XPath that can in turn be used to make expressions clearer. In this case, the $para variable holds each paragraph in the document in turn. Note that if the filter expression (the if() conditional) is not met, then the result is an empty sequence and that sequence within sequences collapses, i.e.,

(a,b,(c,d),e,( ),f)

becomes

(a,b,c,d,e,f)

While XPath still places a prohibition on creating new nodes (that role is handed to XQuery, XSLT, or whatever is hosting the XPath engine), you can create fairly sophisticated result sets using the native XPath operators.

XPath Extensions

The final major change is the creation of an explicitly defined extension mechanism for XPath. What this means in practice is that for any XPath 2.0 engines, so long as you are conformant with the method declaration mechanism, you can create XPath extensions using Java, JavaScript, PHP, .NET, or other traditional languages. You can also create XPath extension functions using XSLT and XQuery, each of which opens up a fairly radical means for creating complex APIs that are still XML-oriented.

By creating a common interface mechanism, the XPath 2.0 group both sanctioned the idea of building extensions in the first place, and did so in such a way that even if you moved from one platform or programming language, the only changes you would need to introduce would be related to reimplementing the extensions in the new language; the XQuery or XSLT code remains the same.

Extensions in XPath are handled by the hosting environment within separate namespaces and are later referenced via namespace prefixes. For instance, suppose that you wanted to define a function (to-title-case()) which would take an expression, such as "This is a test," and turn it into a single variable-like expression in title case notation, like "ThisIsATest" (a useful prelude for turning labels into element names). In XSLT 2.0, such a function would be defined by:

<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"

    xmlns:str="http://www.metaphoricalweb.org/xmlns/string-utilities"

    xmlns:xs="http://www.w3.org/2001/XMLSchema"

    version="2.0">

    <xsl:function name="str:title-case" as="xs:string">

        <xsl:param name="expr"/>

        <xsl:variable name="tokens" select="tokenize($expr,' ')"/>

        <xsl:variable name="titledTokens" select="for $token in $tokens return

            concat(upper-case(substring($token,1,1)),

                      lower-case(substring($token,2)))"/>

        <xsl:value-of select="string-join($titledTokens,'')"/>

    </xsl:function>

    <xsl:template match="/">

        <data><xsl:value-of select="str:title-case('This is a test')"/></data>

    </xsl:template>

</xsl:stylesheet>

In XQuery, the definition is a little more compact, but follows a similar structure:

declare namespace str="http://www.metaphoricalweb.org/xmlns/string-utilities";

declare namespace xs="http://www.w3.org/2001/XMLSchema";



declare function str:title-case($expr as xs:string) as xs:string {

    let $tokens := tokenize($expr,' ')

    let $titledTokens := for $token in $tokens return

          concat(upper-case(substring($token,1,1)),lower-case(substring($token,2)))

    return string-join($titledTokens,'')

    };

    

<data>{str:title-case("This is a test")}</data>

In both cases, a namespace for the function is first declared. It is represented here by the namespace declared as:

xmlns:str="http://www.metaphoricalweb.org/xmlns/string-utilities"

Data types are assigned to input parameters and resulting output value, here using the as attribute and operator respectively. The specific function declaration can of course be quite different between languages, but in general, the signature should remain consistent across platforms. Finally, in both cases, the specific invocation of the function and the passing of parameters is the same, because at the point where the functions are invoked, they are essentially considered XPath 2.0 functions:

XSLT: <data><xsl:value-of select="str:title-case('This is a test')"/></data>

XQuery: <data>{str:title-case("This is a test")}</data>

Note that this principle holds true regardless of the host language; a Java class called from Saxon (or Xalan) would still be declared as a namespace, except in this case the namespace corresponds to a Java class defined in the classpath, and the protocol is the java: protocol:

xmlns:str = "java:org.metaphoricalweb.stringUtilities"

Similar mechanisms exist for .NET and PHP 5, making it possible to associate functions (or classes of functions) in those languages, with the associated namespaces.

XQuery Abstraction and Object-Oriented Programming

In many ways, this extension mechanism is one of the most important aspects of XPath 2.0. Typically, you do not associate only one function with a given namespace, rather you associate a number of related functions. For instance, a function such as str:from-title-case() that performs the inverse operation to str:title-case(), converting a title case string into a delimiter-separated one, may very well be a part of the same string utilities class that title-case() is. This idea is used fairly extensively by an XML databases like eXist for providing access mechanisms into the database proper from within the context of the XQuery code, and from a design standpoint shifts the deployment of XQuery statements more into an object-oriented approach, rather than simple procedural scripts.

This ability to both extend XPath and to create an object-like mechanism for manipulating that Xpath has a mirror in the SQL concept of stored procedures, with the primary difference being that most stored procedure implementations do not have a notion of namespaces or classes. The implications that arise from that namespace differentiation, however, are not insignificant.

XML has long resisted fitting neatly into an object-oriented paradigm, both because an XML structure is hard to encapsulate and generally doesn't have functional methods associated with it (polymorphism, on the other hand, it's got down cold). On the other hand, if you have a functional namespace within a language such as XQuery, then the route to object-oriented programming comes down to the ability to create a unique key for extracting some form of virtual record, then applying that key as the first argument in any XQuery namespaced function.

For instance, suppose that you have a set of invoice records rendered in XML and an invoice namespace called (appropriately enough) invoice:. In that case, you can define a method called invoice:new(args) that will create a new invoice record, and most signicantly, return a unique GUID identifying that record:

let $invoice := invoice:new()

return $invoice

-- "A1DFCDEA22318165226AB10DAA411442D439"

Once you've done this, this key becomes a parameter to any other method within the namespace:

invoice:addItemX($invoice,

     <item>

          <name>Pencil</name>

          <quantity>12</quantity>

          <sku>C11D-A</sku>

          <cost>0.89</cost>

     </item>

     )

(: or :)

invoice:addItem($invoice,"Pencil",12,"C11d-A",0.89)

invoice:getObject($invoice)

In other words, XPath 2.0 functionality (through XQuery or XSLT) opens up the possibility of working with objects where the data is not encapsulated within the object, but is instead stored within an external data store or even exists only in virtual form.

Data abstraction is the process whereby you make the access and update mechanisms for a given database (regardless of its implementation) transparent to the host system. SQL takes this process to a certain point, but many of the precepts that are central to object-oriented programming didn't really gain a foothold until after SQL had been standardized. This has meant that most application development can be seen as the systematic association of a SQL view of the universe and an object oriented view, usually at the cost of extreme coupling. Most frameworks have evolved some hook to synchronize between the two world views, but they typically necessitate fairly extensive bindings of SQL content into code.

XPath 2.0 and XQuery, on the other hand, have the potential to abstract not only beyond a single SQL database implementation but beyond SQL itself. If you can create an XML bridge to both query and update a given type of data store (one where XML is either produced or consumed, as appropriate, even if the core operations deal with non-XML data) then you can use XQuery both to manipulate the XML side of the bridge and to build an object abstraction layer. Thus, LDAP interactions, text files, OLAP cubes, JSON, and other data protocols can be abstracted with an XQuery object layer so that their differences become invisible to the processes that utilize these services.

Note that this doesn't mean that LDAP is going to be implemented in XML, that JSON isn't going to start being mapped to XML (though see E4X), or that SQL is going away any time soon. There are obviously situations where the specific features of these systems necessitate that work be done in the form that is most well optimized for that environment. However, a surprisingly large number of cases exist where what is important is that the data be accessible in a way that is most transparent to the overall development process, regardless of the container of that data. In those cases, I see XQuery gaining a huge mindshare.

Summary

Just as it is not easy to write about all aspects of SQL in a single article, the implications that XQuery holds for both data abstraction and distributed object-oriented programming can be tough to more than hint at in a single article. In the next part of this series, I'm going to get into the basics of how XQuery itself works; in the article after that I will look at current XQuery implementations in contemporary XML databases and related engines.

When I started working with XQuery, I was initially skeptical about it. Recently, as I see the implications that are opened up by the modularization of the language, the emergence of XQuery aware data repositories, and the shift towards rich client applications that are increasingly XML-centered, I'll admit that I was wrong in my initial assessments. XQuery is an important technology, and it will very likely have a huge impact upon the way that applications are built, especially in distributed environments.

Kurt Cagle is a systems developer and information architect living in Victoria, British Columbia. He is also the webmaster for XForms.org, a news and code portal for the XForms and XQuery community.