XQuery and Data Abstraction
Suppose, for a moment, that you had the whole computer revolution to do over again, knowing what you know now about things such as the shape of languages, the emergence of databases, the rise of networking and the Web, and so forth. For instance, while networking first took shape in the 1960s, it wasn't until the '80s that Ethernet forced a consolidation of the networking market, which made it possible to create a new abstraction layer, TCP/IP, which in turn made the Internet viable. Meanwhile, Ted Codd's database mathematics had laid the foundation for the relational database years earlier, leading eventually to the adoption of the vendor-driven SQL specification in the late '80s.
Yet what would have happened if the distributed programming paradigm and the foundations for REST had emerged prior to the rise of SQL, and XML had become established earlier? There are, admittedly, some flaws in this supposition, not least of which is the fact that XSD took as its starting point the same algebras that SQL did a couple of decades earlier, but for now, let's turn a blind eye to such anachronisms.
The Structured Query Language (SQL) arose from a very specific need. The relational database model represented a significant step forward in the development of databases, because it made it possible, for the first time, to create non-linear databases in a consistent and cohesive fashion. However, with everyone and their VC out creating databases, the risk of market fragmentation became so high that the vendors realized they were better off establishing a formal standard (SQL-87). With a standard in place, a programmer who knew SQL could learn to program against a new database without having to spend a year in training courses. Put another way, one of the initial draws of SQL as a technology was that it provided a database abstraction layer over all of the myriad relational databases in use at the time.
Database abstraction is a powerful concept, one that many programmers today tend to take for granted. The idea that you shouldn't need to know anything about the structure of your databases means several things:
- Portability: You're not as locked into a specific vendor for a given technology, and if a different vendor offers better performance but otherwise uses the same underlying standard, then you can make the migration from one database to another with little to no work.
- Knowledge universality: If your DBA is an Oracle guru and you need to work with SQL Server or IBM DB2, the same DBA can probably handle the other databases. This factor becomes more significant the larger the number of database vendors.
- Heterogeneity: Most companies have more than one type of database, because different departments may have control over different aspects of a given IT budget. If you can abstract the database access layer, you can pull data from each of these resources and work with them together from the same application.
- Application decoupling: If each kind of database requires its own special access format, your applications become coupled to those databases, and the code becomes correspondingly harder to write and maintain. A common abstraction layer keeps database specifics out of the application proper.
For its time, SQL was a profound success. While it didn't completely tame the database world (it is, of course, in the interest of each vendor to lock developers into its particular product, at least in the short run), it managed to create a near-universal standard for working with relational databases. This helped to establish the relational database as the preferred form of data storage for a couple of decades.
However, even as SQL-based databases became the norm, the realization arose that there are, in fact, other ways of describing data, each of which is better suited to a particular domain. For high-speed access to content with a stable schema and infrequent updates, LDAP is considerably faster than the standard RDBMS. For analysis of extremely large datasets, OLAP cubes work better than the RDBMS model, and while SQL has frequently been used with OLAP cubes, it's not really well optimized for them. Devices and even filesystems can also be seen as databases holding system-oriented information, to the extent that most contemporary file systems (such as ReiserFS or ext3) include sophisticated journaling support that seems more the province of databases than of traditional file systems.
Finally, of course, there's the XML explosion and the rise of XML databases. An XML database differs from a relational one in a few key respects:
- Semi-structured data: Because XML may have multiple possible options (elements included or not, abstract elements that can be replaced by distinct instances, and so forth), modeling XML in a relational database can prove an expensive and ultimately self-defeating proposition, especially for information, such as a web page, that is only loosely structured.
- Externally defined schemas: An XML document does not necessarily need a formal schema, whereas such a schema is an implicit part of a relational database. Thus, XML databases usually make no tacit assumptions about data type.
- Multiple namespaces: A relational database makes an implicit assumption about the atomicity of namespaces: all tables (and hence all fields) within a given database occupy the same namespace. This assumption is generally not true with regards to XML documents, where a given instance may potentially have dozens of namespaces associated with various elements and attributes.
- Weakness of keys: A significant amount of the power in a relational database comes from the use of keys, primary and foreign, to bind various sets of records together. In an RDBMS, there should generally be a one-to-many relationship between a primary key and a foreign key, such that every foreign key is internally resolvable against a primary key within the system; otherwise, the system loses data integrity. In an XML document, on the other hand, IDREFs (which are analogous to foreign keys) may very well point to resources outside the scope of the database. Moreover, because of this association with external resources, primary keys generally need to be not only locally unique but potentially globally unique, in contrast to the sequential integers that are so often the keys of choice within a database.
- Element multiplicity: In an RDBMS, fields are unique within a given table; there can be only one field with a given name in the table itself. In an XML document, a given set of child elements may all have the same name, some may share a name, or none may (think of the paragraphs in an XHTML document, as in the fragment after this list). This is incredibly difficult to capture without giving up the notion that an element is analogous to a field. Curiously enough, JavaScript Object Notation (JSON) runs into the same problem, which means that, short of imposing complex semantics on JSON objects, JSON works fine for capturing relational data but stumbles when attempting to capture document-centric material. And this is why JSON is, at best, a weak substitute for XML.
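To make the multiplicity point concrete, consider a fragment like the following (the content is invented for illustration):

<section>
  <h2>Overview</h2>
  <p>First paragraph of prose.</p>
  <p>Second paragraph, with the same element name as the first.</p>
  <p>Third paragraph.</p>
</section>

The three sibling p elements share a name, and their document order matters; neither property has a natural analogue in a table whose column names must be unique.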
In general, XML databases get around these limitations by using the names of the elements in a given document as table names, with relationships kept as pointers to the various child, attribute, text, and, perhaps, parent nodes. By indexing each element with an eye toward its text content, they make it possible to perform searches based on proximity between terms and then retrieve the nodes that are the nearest ancestors of the nodes containing the text. In some cases they also optimize this process by working with the schemas associated with the various XML documents; in other cases they use different indexing algorithms. In general, though, the result is that, while it is very easy to reconstruct the XML from the database, the information is kept neither in a serial format nor in the relational table model preferred by SQL.
This has some interesting implications. First of all, it's not really possible to query XML with SQL, because SQL relies on the presence of explicitly defined primary and foreign keys in order to retrieve a set of records in all but the simplest of cases, and for the most part these don't exist in XML. This realization was made back in the late 1990s, which was one of the motivating factors in the development of XPath; it was a way to perform selection operations on a set of nodes within an XML tree.
Now XPath is not terribly well loved in a lot of circles. Part of the reason for this is that it's not SQL. It has a syntax that is considerably more compact, it uses a concept called context that doesn't really exist in SQL, and it doesn't break the world into JOINs, which, since XML doesn't really rely on the presence of PK/FK pairs, is not surprising. Web developers (and key people in the HTML spec world, unfortunately) don't like it because it's fairly complex and has a syntax that is similar to, but not quite the same as, CSS selectors. While the Microsoft implementation started out with a fairly simple set of XPath methods on document and element objects (selectNodes() and selectSingleNode()), in more recent years both it and Mozilla have created XPath objects that are considerably more verbose and certainly more painful to write, largely because they follow the W3C DOM Level 3 XPath interface.
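That similar-but-not-identical overlap is easy to demonstrate. Compare this hedged pair of selections (the class name is hypothetical):

//div[@class = 'note']//p    (: XPath: @class must equal 'note' exactly :)
div.note p                   (: CSS: matches 'note' as one token among several in the class attribute :)

The two look interchangeable but diverge as soon as an element carries more than one class, which is exactly the sort of mismatch that frustrates web developers.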
Yet for all of that, XPath is still (with good reason) the established language for selecting nodes for XSLT and for specific DOM operations, to the extent that the W3C made a concerted effort to push the XPath 2.0 specification and the XQuery specification towards development as part of the same process. Indeed, when you get right down to it, XQuery can be thought of as XPath 2.0 with a framework for variable assignment, control and conditional statements, function and module declarations, and some extended syntax.
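As a rough illustration of that relationship, here is a minimal XQuery FLWOR expression; strip away the for/let/where/order by/return scaffolding and what remains is ordinary XPath (the document URI is hypothetical):

for $p in doc("article.xml")//p
let $len := string-length($p)
where $len > 300
order by $len descending
return $len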
Why XPath 2.0? If XPath 1.0 has become so firmly established, what was wrong with it that it required a fairly significant upgrade? The answer again comes back to the bigger issue of what exactly data is. XPath 1.0 concentrated on XML structure: it could handle individual strings, numbers, and booleans, but the only collection type it understood was the node-set, so values could be manipulated in bulk only if they were represented as text nodes. This proved a significant limitation for XSLT (which is one of the reasons the node-set() function is one of the most universal extensions to XSLT 1.0 implementations) and also meant that an XPath 1.0 expression could only work on a single document at a time.
XPath 2.0, on the other hand, introduces four new concepts that fairly radically change what you can do with the language.
Sequences
The first, sequence, is a list of, well, pretty much anything. It can be a list of nodes, but it can also be a list of numbers, strings, dates, or mixed collections of these. With such a structure in place, you can create iterators and enumerators, perform set operations, and, as a consequence, you can build node sets on the fly without needing to resort to an extension function.
The following are all typical of sequences:
(3,5,8,10)
("Monday","Tuesday","Wednesday","Thursday","Friday")
(3 to 6)    (: same as (3,4,5,6) :)
(//p,//h1,//h2)
(3,4,"abc",//li)
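Node sequences also support set operators, as this hedged handful of examples shows (the class name is hypothetical):

(//h1, //h2)    (: concatenation: all h1 elements, then all h2 elements :)
//h1 | //h2     (: union: the same nodes, but in document order with duplicates removed :)
//p except //p[@class = 'aside']    (: difference: every p not carrying class 'aside' :)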
Basic Schema Types
The second change is the formal introduction of primitive schema data-types into the language. This both makes it possible to more accurately determine the behavior of XPath (and XSLT and XQuery) extensions and makes it easier for both XPath and XQuery processors to optimize their searches. While there are some downsides to integrating typed content, for the most part this addition will likely provide better, more type-safe queries and transformations. I'll cover the use of datatypes more formally in a bit.
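For a quick taste of what typing buys you, XPath 2.0 lets you construct, compare, and compute with typed values directly (the values and the @price attribute here are invented):

xs:date("2005-03-01") + xs:dayTimeDuration("P7D")    (: date arithmetic: yields 2005-03-08 :)
//item[xs:decimal(@price) gt 10.00]    (: a typed value comparison rather than a string test :)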
Conditional and Iterative Operators
The third change is the addition of both conditional and iteration operators in XPath. For instance, the following XPath expression will return a sequence consisting of the lengths of all paragraphs in a document (this article, alas) that run to more than 300 characters:
for $para in //p return
    if (string-length($para) > 300) then string-length($para) else ()
This should be eye-opening for a couple of reasons: the result set is not a set of XML nodes but rather a sequence of numbers, and you can create temporary variables within XPath that can in turn be used to make expressions clearer. In this case, the $para variable holds each paragraph in the document in turn. Note that if the filter expression (the if() conditional) is not met, the result is an empty sequence, and sequences nested within sequences collapse; i.e.,
(a,b,(c,d),e,( ),f)
becomes
(a,b,c,d,e,f)
While XPath still places a prohibition on creating new nodes (that role is handed to XQuery, XSLT, or whatever is hosting the XPath engine), you can create fairly sophisticated result sets using the native XPath operators.
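Node construction is precisely where XQuery picks up. As a minimal sketch (the toc-entry element name is invented), here is an XQuery expression that builds new elements from the headings that XPath alone could only select:

for $h in //h2
return <toc-entry>{ string($h) }</toc-entry>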