Going Native, Part 2

April 13, 2005

Ronald Bourret

Data Integration

The second major use of native XML databases is data integration. XML is well-suited to data integration because of its flexible data model and machine-neutral text format. In addition, XQuery is a good data integration language because of its ease of use, support for transformations, and ability to join data from different documents (data sources). Finally, there are a large number of tools for converting data from various formats to XML.

Data Integration in the Real World

Let's start by looking at some of the ways in which native XML databases are used to integrate data:

  • Business data. A very common data integration problem is, in the words of Michael Champion, formerly of Software AG, to "get a coherent view of the mess in the back office." For example, Vodafone uses Tamino in a system that lets customers and internal users view billing information over the Web; the system integrates data from SAP, Lotus Notes, and Microsoft Exchange. Snapbridge FDX XML Server and X-Hive/DB are used in similar fashion, integrating data from such diverse platforms as relational databases, flat files, weblogs, Web services, LDAP, and HRM systems to give a single view of entities like customers or products.

  • Order analysis. Hewlett Packard uses Ipedo to integrate internal financial data, such as order, shipment, and revenue data; expense data; and financial projections and goals. Data is gathered from a variety of sources, including relational databases, spreadsheets, flat files, PDF documents, and Word documents. It is queried and displayed through an internal Web portal.

    AZM, a Swiss chain of convenience stores, uses TEXTML Server to archive and search purchase orders and invoices from suppliers and customers. The primary data sources are ERP (Enterprise Resource Planning) systems that use relational databases, but it turned out to be faster to search the data as XML than to perform the joins in the relational databases.

  • News production. RTL, a German broadcaster, uses MaxiMedia Mpower for newsroom production. Mpower uses Tamino to integrate stories from a variety of news feeds using a subset of NewsML. The system currently stores more than 300,000 articles, comprising several gigabytes of data. Similarly, IXIASOFT's TEXTML Server is used to integrate data from multiple news feeds, each with its own format, into a single news production system.

  • Financial data. Commerzbank uses Tamino to integrate data about customized-financial-derivative trades from 18 different systems. Data is first converted to XML and stored in Tamino. It is then transformed to a common schema for automatic evaluation and reporting by several back-office systems.

  • Flight information. The central information system at Schiphol Airport in Amsterdam uses Tamino to integrate data from more than 38 systems in real time. The data describes flights--arrival and departure times and gates, airplane types, baggage carousels, and so on--and is used by more than 50 parties involved in flight handling--from catering to baggage handling to gate assignment--as well as being displayed to the public. Data is stored as XML, using an envelope schema derived from ebXML and predefined message types, such as for flight updates, airplane movements, and timetable changes.

  • Health care. The TigerLogic XML Life Sciences Data Server (based on TigerLogic XDMS) can integrate medical and health care records from more than 20 data sources. Such integration problems are common in health care, where data usually resides on many legacy systems.

  • Manufacturing. A production control system uses QuiLogic's SQL/XML-IMDB to integrate data from different machines and sensors, which is then queried through a Web services interface. Each machine and sensor publishes data in its own ASCII or binary format; the data is converted to XML documents that share a common schema.

  • Instrument data. Renault uses X-Hive/DB to integrate data from sensors on its Formula 1 race cars, along with weather and track conditions and audio and video signals. The data is then made available via a Web portal to mechanics, engineers, and designers, who use it to perform "what-if" analyses.

  • Customer referrals. TIBCO uses Xpriori's NeoCore XMS to integrate customer referral information from spreadsheets and various documents according to a common schema. The system is used to coordinate customer referrals worldwide.

  • Customer support. A pharmaceutical clearinghouse uses Snapbridge's Cross Media Server to store existing RSS feeds from a number of relational databases. Customer support representatives query these feeds (primarily through full-text searches), then use an embedded URI to get back to the original data. While the solution is not elegant--a schema more targeted to the data than RSS would have been better--it had no impact on the backend databases and was very easy to write.

  • Law enforcement. Raining Data's TigerLogic XDMS is used to search data gathered from government and law enforcement agencies, both domestically and internationally.

Inside the Applications

Data integration applications must solve a number of problems, from data access to security to change management. In this section, we will look at the architectures used to solve two of these problems: queries and mapping schemas.

Query architectures

There are two query architectures for integrating data with a native XML database: local and distributed. In a local query architecture, data is imported into the database as XML and queried locally. In a distributed query architecture, data resides in remote sources and the query engine distributes queries across those data sources. The engine then compiles results and returns them to the application.
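
To make the distinction concrete, consider the two queries below: a rough sketch with hypothetical collection names and source URL. Note that access to remote sources varies by product; many engines expose them through vendor-specific functions rather than doc().

    (: Local query: both inputs were imported into the database in advance. :)
    for $c in collection("customers")//Customer,
        $o in collection("orders")//Order[CustomerID = $c/ID]
    return <Result>{$c/Name, $o/Total}</Result>

    (: Distributed-style query: one input is fetched from a remote source at
       query time, so the data is live but the call is slower. :)
    for $c in collection("customers")//Customer,
        $o in doc("http://erp.example.com/orders.xml")//Order[CustomerID = $c/ID]
    return <Result>{$c/Name, $o/Total}</Result>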

The main advantage of local queries is that they are faster, since no remote calls are made. They are also simpler to optimize, and the engine is simpler to implement, as all queries are local. Their main disadvantage is that data may be stale. A secondary problem is access control, as the local store must enforce controls previously handled by each source. Distributed queries have the opposite advantages and disadvantages: data is live, but queries are slower and harder to optimize and the engine is more complex.

Which architecture to use depends on a number of factors:

  • Support for distributed queries. Only about one sixth of native XML databases, all of them commercial, support distributed queries.

  • Support for your data sources. Most distributed query engines support a limited number of data sources, usually relational, hierarchical, or object-oriented databases.

  • Run-time availability of data sources. Not all data sources are available at run time. For example, they might be disconnected for security reasons (a document database in an intelligence agency), connections to them might be unreliable (a Web site or a system in another company), or they might not allow external queries (a heavily loaded production system).

  • Number of data sources. Some vendors report that distributed queries perform well with a small number of data sources, but slow noticeably as the number of sources increases. While this is partially due to the number of sources, it may also be because of incorrect optimization choices made by the engine.

  • Update strategy. A few native XML databases support distributed updates. While some vendors report that customers find it easier to write updates against a single view in the native XML database, others say that distributed updates have performance problems. They also note that updates done through a native XML database circumvent integrity checks and notifications performed by the applications normally used to update the data.

If a data source cannot be included in a distributed query, its data must be imported and queried locally. About a third of the commercial native XML databases, including most of the popular ones, can import data. Support ranges from a few sources, such as relational databases and Microsoft Word, to hundreds of sources. Some databases also have APIs so you can write your own import modules. And a few databases can refresh data in response to triggers or timeouts. If your database cannot import data, you must convert and load the data yourself, perhaps with the help of a third-party converter.

Handling differences in schemas

The biggest problem in integrating data is handling differences in schemas. With structural differences, the same concept is represented differently, such as a name using one or multiple fields. With semantic differences, slightly different concepts are represented; these can be straightforward (a price is in US dollars or Euros) or subtle (a price includes a discount). Handling schema differences is mostly just hard work, although some differences cannot be completely resolved. Since the actual resolutions depend on the differences, this section looks at where differences can be resolved.

If data is grouped by schema, such as in relational databases or (sometimes) in native XML database collections, three architectures are common:

  • Address differences in the query. For example, the following query uses different functions to retrieve species names from collections of Bioinformatic Sequence Markup Language (BSML) and MicroArray and Gene Expression Markup Language (MAGE-ML) documents:

    
     (: Extract species names from BSML documents. :)
     declare function local:bsml_species() as element()*
     {
        for $o in collection("bsml")//Organism
        let $g := fn:string($o/@genus)
        let $s := fn:string($o/@species)
        return <Species>{$g} {$s}</Species>
     };

     (: Extract species names from MAGE-ML documents. :)
     declare function local:mage-ml_species() as element()*
     {
        for $s in collection("mage-ml")//Species
        let $v := $s/OntologyEntry[@category="NCBI:Taxonomy"]/@value
        return <Species>{fn:string($v)}</Species>
     };

     (: Merge the two lists and remove duplicates. :)
     <SpeciesList>
     {
        let $s1 := local:bsml_species()
        let $s2 := local:mage-ml_species()
        for $s in fn:distinct-values(($s1, $s2))
        order by $s
        return <Species>{$s}</Species>
     }
     </SpeciesList>
    
    
  • Convert all documents to the same schema. This allows applications to ignore differences between schemas. Data can be converted at load time, such as when data is stored and queried locally, or at run time, such as when distributed queries are used. In the latter case, conversions are built into the XML views over the data. Note that the common schema only needs to include those fields to be queried. While this limits potential queries, it is a good way to simplify development of limited systems. (A sketch of this approach appears after this list.)

  • Build common indexes over documents. A few native XML databases allow users to specify how to build index values. These are used to resolve queries and return pointers to the original documents. For example, suppose one collection of academic papers uses multiple Author elements while another uses a single Authors element. A single index might use the Author elements directly and parse the Authors elements. This allows applications to query papers by author.
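
As a minimal sketch of the conversion approach promised above, the following query reuses the academic-paper example from the last item (the element and collection names are hypothetical). Both author representations are normalized into one common schema at load time:

    (: Papers in "collA" use repeated Author elements; papers in "collB"
       use a single Authors element with comma-separated names. :)
    declare function local:to-common($p as element()) as element()
    {
       <Paper>
          <Title>{fn:string($p/Title)}</Title>
          {
             if ($p/Author)
             then for $a in $p/Author return <Author>{fn:string($a)}</Author>
             else for $a in fn:tokenize(fn:string($p/Authors), ",\s*")
                  return <Author>{$a}</Author>
          }
       </Paper>
    };

    for $p in (collection("collA")//Paper, collection("collB")//Paper)
    return local:to-common($p)

In databases that support user-defined indexes, similar logic could compute the index values instead, leaving the original documents untouched.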

If data is not grouped by schema, application logic is more complex, since it is no longer possible to write location-based queries. If documents are limited to a set of schemas, it may be possible to base queries on document structure. For example, our species name query could check if the root element is Bsml or MAGE-ML before handing a document off to a function that extracts the species name.
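
For example, assuming variants of the earlier species-extraction functions that take a document node as an argument, the dispatch might look like this:

    (: "mixed-species" is a hypothetical collection holding both
       BSML and MAGE-ML documents. :)
    for $doc in collection("mixed-species")
    return
       typeswitch ($doc/*)
          case element(Bsml) return local:bsml_species($doc)
          case element(MAGE-ML) return local:mage-ml_species($doc)
          default return ()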

If documents can have any schema, such as when a law office subpoenas all documents pertaining to a particular company and automatically converts them to XML, the best that can usually be done is to explore the documents with a browser to determine if there are common structures or fields. If so, it may be possible to use these in production queries or convert documents so that such queries are possible. As a last resort, users can always perform XML-aware full-text searches.
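
When nothing is known in advance, even a simple exploratory query can reveal which element names occur and are therefore worth querying. A rough sketch, with a hypothetical collection name:

    (: List the distinct element names that occur anywhere in the collection. :)
    for $name in fn:distinct-values(
       for $e in collection("subpoenaed-docs")//* return fn:local-name($e))
    order by $name
    return <element-name>{$name}</element-name>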

It may also be possible to ignore differences. For example, suppose an application displays information about a customer from multiple sources. Since a human is reading the information, they can resolve many differences, such as whether a name is stored in one or two fields.

Why You Need a Native XML Database

Vendors report that most of their data integration customers were not able to solve their problems without a native XML database. The problem was that other solutions, such as federated relational databases and data integration toolkits, could not model the types of data involved (documents, semi-structured data, hierarchical data, and so on), could not handle data whose schema was unknown at design time, or could not handle data whose schema changed frequently.

Native XML databases solve the first two problems with the XML data model, which is considerably more flexible than the relational model and can handle schemaless data. While native XML databases do not provide a complete solution for schema evolution, they can at least store documents with rapidly evolving schemas, as we'll discuss later.

Another advantage of native XML databases is that many support XQuery which, as was mentioned earlier, is a good data integration language.

A Peek into the Future

An interesting use of native XML databases in data integration is as a repository for metadata and semantic information. For example, the XML Business Information Portfolio from Software AG uses Tamino to store metadata and semantic information, as well as to retrieve data from backend sources. While nontrivial to set up, this allows applications to execute queries against a common schema.

CompuCredit used an early version of this repository to integrate customer data from more than 100 systems and databases, each of which is exposed as a Web service. In response to a query, the repository constructs an XML document from backend data and ships it across an Enterprise Service Bus to the customer service representative, who receives a single view of the customer's data.

Working with Semi-Structured Data

Managing semi-structured data is the third major use case for native XML databases. Semi-structured data has some structure, but isn't as rigidly structured as relational data. While there is no formal definition for semi-structured data, some common characteristics are listed below, followed by a contrived fragment that illustrates them:

  • Data can contain fields not known at design time. For example, the data comes from a source over which the database designer has no control.

  • Data is self-describing. That is, metadata is associated with individual data values (as with element and attribute names in XML) rather than a group of values of the same type (as with column names in a relational database). Self-descriptions are used to interpret fields not known at design time.

  • The same kind of data may be represented in multiple ways. For example, an address might be represented by one field or by multiple fields, even within a single set of data.

  • Data may be sparse. That is, among fields known at design time, many fields will not have values.
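
The contrived fragment below illustrates these characteristics: both records describe customers, yet they represent names differently, carry their field names with the data, and populate different subsets of the possible fields.

    <Customer>
       <Name>Chen Wei</Name>
       <FrequentFlierNumber>XX1234567</FrequentFlierNumber>
    </Customer>

    <Customer>
       <FirstName>Maria</FirstName>
       <LastName>Lopez</LastName>
       <FoodPreference>vegetarian</FoodPreference>
    </Customer>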

Semi-Structured Data in the Real World

Semi-structured data occurs in many fields. For example, here are some of the types of semi-structured data that are being stored in native XML databases today:

  • Data integration. Integration data is semi-structured because the same concept is often represented differently in different data sources and changes to remote data sources can result in fields unknown to the integrator. Data integration was discussed earlier.

  • Schema evolution. Rapidly evolving schemas result in semi-structured data because they introduce new fields and may change the way in which data is represented. These problems occur most commonly when data crosses organizational boundaries. Schema evolution is discussed separately.

  • Biological data. Biological data, especially molecular and genetic data, is semi-structured because the field itself is evolving rapidly. As a result, the schemas used in these fields generally allow user-defined data. For example, much of the data in MAGE-ML is stored as hierarchies of user-defined property-value pairs. Similarly, BSML allows users to add arbitrary metadata in the form of property-value pairs.

  • Metadata. Metadata is often semi-structured because users define their own types of metadata. For example, the Metadata Encoding and Transmission Standard (METS), which is used to provide metadata for objects in digital libraries, defines only basic metadata, such as the name of the person who created the METS document, and allows users to define the rest. For instance, a user might use Dublin Core to provide information about the title, author, and publisher of a book whose digital image is in a library, and NISO MIX to provide technical data about how the image was created. On the other hand, while the Encoded Archival Description (EAD) schema does not allow user-defined metadata, it is extremely flexible and will likely result in documents that sparsely populate the available fields.

  • Financial data. Financial data is semi-structured because new financial instruments are constantly being invented and because it is often the result of integrating data from many proprietary systems. An additional source of change in the XML world is the rapid development of standards like the Financial Information eXchange Markup Language (FIXML) and Financial products Markup Language (FpML). (Of interest, the FIXML specification explicitly discusses how to customize FIXML.)

  • Health data. Health data is semi-structured because it is sparsely populated, it is often the result of integrating data from many proprietary systems, and user-defined data is common. For example, HL7 has hundreds of elements (it is unlikely that the description of any patient or organization will use all of them) and makes frequent use of the xsd:any element.

  • Business documents. Business documents are semi-structured because the real world is a highly variable place. Most documents contain a core set of fields--name, address, date, and so on--as well as user-defined fields. For example, while insurance claims have a number of fixed fields (name, policy number, date, and so on), the bulk of the information is free-form (accident description, police reports, photographs, and so on).

  • Catalogs. Catalogs are hierarchies of product descriptions. While some catalogs are rigidly structured (that is, a single set of fields can be used to describe each node in the catalog), other catalogs are semi-structured. One reason is that different parts, such as a piston, a tire, and a carburetor, are described by different fields. Another reason is that some catalogs integrate data from different vendors, each of whom uses their own schema.

  • Entertainment data. Entertainment data is semi-structured because the services being described (films, restaurants, hotels, and so on) vary tremendously. As a result, data is sparsely populated and schemas change frequently. Entertainment data also comes from a variety of sources (movie theatres, newspaper reviews, hotel chains, and so on), which may result in integration problems.

  • Customer profiles. Customer profiles are semi-structured for two reasons. They are sparsely populated because few customers have data for all fields (frequent flier numbers, food preferences, preferred travel time, and so on). They evolve rapidly because the ways in which people are described (contact information, exercise preferences, medical conditions, and so on) change constantly.

  • Laboratory data. Laboratory data is semi-structured because it is sparsely populated--different measurements apply to different substances--and because there is ample room for user-defined data. For example, in the pharmaceutical approval process, a single application might handle all of the documentation for applying for drug approval, yet different drugs are likely to require different sets of data.

Inside the Applications

Applications that work with semi-structured data that has a known schema are not significantly different from applications that work with other kinds of data. For example, they use queries defined at design time to retrieve and update data. The main difference is that they often must handle data represented in different ways in different parts of the data set. While this may be unpleasant, it is usually feasible as long as the number of variations is limited.

Applications that work with semi-structured data containing fields not known at design time are fundamentally different. As a general rule, such applications pass unknown fields to humans for processing. For example, suppose a catalog has a basic structure defined by a central authority and uses vendor-specific XML to describe individual items. A catalog browser might be hard-coded to navigate the known structure and use XML-aware full-text searches or // searches to search the unknown structure. Product data might be displayed as raw XML or converted to XHTML with a stylesheet that displays data based on nesting level.
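
As a sketch (the element names and the collection are hypothetical), such a browser might combine navigation of the known structure with a // search over the vendor-specific portion:

    (: Navigate the known catalog structure, then search the unknown
       vendor-specific description with a // path. :)
    for $item in collection("catalog")//Item[@category = "tires"]
    where some $e in $item/VendorData//*
          satisfies fn:contains(fn:lower-case(fn:string($e)), "radial")
    return $item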

Similar applications are found in molecular biology, genetics, health care, and library science. In each case, the data describes something--a molecule, a gene, a patient, an archive--and many of the fields are known. The application uses these fields, such as to allow the user to drill into the data, and then displays the unknown fields. The person reading the data can interpret it and take further action, such as reading a scientific paper, making a diagnosis, or adding comments.

Another common solution is for the application to evolve with the data. For example, incoming documents can be examined with a generic browser to decide what kinds of queries are possible. In some cases, it might be possible to write specific queries, such as //address to search for addresses; in other cases, the only choice might be full-text searches. While this kind of development is likely to be repugnant to programmers accustomed to working with well-defined schemas, it is a huge improvement for users whose previous choice was to wade through reams of paper or search files in a variety of formats using a variety of tools.

Why You Need a Native XML Database

XML is a good way to represent semi-structured data: it does not require a schema; it is self-describing (albeit minimally so); and it represents sparse data efficiently. Thus, native XML databases are a good way to store semi-structured data. They support the XML data model, they can index all fields (even those unknown at design time), they support XML query languages and XML-aware full-text searches, and some support node-based updates.
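
For example, in a database that indexes all fields, an ordinary path expression can efficiently find a field that appeared only after deployment. A sketch, with hypothetical names:

    (: Find customers with a frequent-flier number, a field that was not
       in the original design, wherever it appears in the document. :)
    collection("customers")//Customer[.//FrequentFlierNumber]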

Relational databases, on the other hand, do not handle semi-structured data well. The main problem is that they require rigidly defined schemas. Thus, fields not known at design time must be stored abstractly, such as with property-value pairs, which are difficult to query. They are also difficult to change as the schema evolves. A secondary problem is that they do not handle sparse data efficiently: the choices are a single table with lots of NULLs, which wastes space, or many sparsely populated tables, which are expensive to join.

According to vendors, many customers couldn't handle their semi-structured data until they used a native XML database. Other customers used a variety of tools, such as grep, full-text search engines, and proprietary applications, or stored some data in a relational database and complete documents as flat files, CLOBs, or even Word documents. As a general rule, these solutions worked in the initial stages, but had limited query capabilities, didn't scale well, and were difficult to maintain as schemas evolved.

(A notable exception occurred in the field of biology. The AceDB database was initially written to store data about the worm C. elegans. It has since evolved into a generic, object-oriented database with its own schema, query languages, and data browsers. Other databases, such as UniProt/Swiss-Prot and GenBank, are (apparently) available in relational and flat-file formats, but are generally queried through proprietary tools such as SRS and Entrez.)

A Peek into the Future

Semi-structured data is still straddling the boundary between academia and industry, so the near term is more likely to consist of gaining experience--managing data, writing applications, handling evolution, and so on--than of creating definitive tools.

In our next installment, we will look at how native XML databases are used in schema evolution, for long-running transactions, for handling large documents, and in a number of other cases, as well as how relational databases are evolving to handle XML.