
Going Native, Part 2

April 13, 2005

Data Integration

The second major use of native XML databases is data integration. XML is well-suited to data integration because of its flexible data model and machine-neutral text format. In addition, XQuery is a good data integration language because of its ease of use, support for transformations, and ability to join data from different documents (data sources). Finally, there are a large number of tools for converting data from various formats to XML.
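
For example, the following sketch joins customer and invoice data from two sources in a single query. The collection names "crm" and "billing" and the element names are hypothetical assumptions for illustration; the point is that the join takes a few lines of XQuery rather than custom integration code.

    (: A minimal sketch of a cross-source join in XQuery. The
       collections "crm" and "billing" and their element names are
       assumptions, not part of any particular product. :)
    for $c in collection("crm")//Customer
    for $i in collection("billing")//Invoice[CustNo = $c/CustNo]
    return
       <CustomerInvoice>
          {$c/Name}
          {$i/Amount}
       </CustomerInvoice>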

Data Integration in the Real World

Let's start by looking at some of the ways in which native XML databases are used to integrate data:

  • Business data. A very common data integration problem is, in the words of Michael Champion, formerly of Software AG, to "get a coherent view of the mess in the back office." For example, Vodafone uses Tamino in a system that lets customers and internal users view billing information over the Web; the system integrates data from SAP, Lotus Notes, and Microsoft Exchange. Snapbridge FDX XML Server and X-Hive/DB are used in similar fashion, integrating data from such diverse platforms as relational databases, flat files, weblogs, Web services, LDAP, and HRM systems to give a single view of entities like customers or products.

  • Order analysis. Hewlett-Packard uses Ipedo to integrate internal financial data, such as order, shipment, and revenue data; expense data; and financial projections and goals. Data is gathered from a variety of sources, including relational databases, spreadsheets, flat files, PDF documents, and Word documents. It is queried and displayed through an internal Web portal.

    AZM, a Swiss chain of convenience stores, uses TEXTML Server to archive and search purchase orders and invoices from suppliers and customers. The primary data sources are ERP (Enterprise Resource Planning) systems that use relational databases, but it turned out to be faster to search the data as XML than to perform the joins in the relational databases.

  • News production. RTL, a German broadcaster, uses MaxiMedia Mpower for newsroom production. Mpower uses Tamino to integrate stories from a variety of news feeds using a subset of NewsML. The system currently stores more than 300,000 articles, comprising several gigabytes of data. Similarly, IXIASOFT's TEXTML Server is used to integrate data from multiple news feeds, each with its own format, into a single news production system.

  • Financial data. Commerzbank uses Tamino to integrate data about trades of customized financial derivatives from 18 different systems. Data is first converted to XML and stored in Tamino. It is then transformed to a common schema for automatic evaluation and reporting by several back-office systems.

  • Flight information. The central information system at Schiphol Airport in Amsterdam uses Tamino to integrate data from more than 38 systems in real time. The data describes flights--arrival and departure times and gates, airplane types, baggage carousels, and so on--and is used by more than 50 parties involved in flight handling--from catering to baggage handling to gate assignment--as well as being displayed to the public. Data is stored as XML, using an envelope schema derived from ebXML and predefined message types, such as for flight updates, airplane movements, and timetable changes.

  • Health care. The TigerLogic XML Life Sciences Data Server (based on TigerLogic XDMS) can integrate medical and health care records from more than 20 data sources. Such integration problems are common in health care, where data usually resides on many legacy systems.

  • Manufacturing. A production control system uses QuiLogic's SQL/XML-IMDB to integrate data from different machines and sensors, which is then queried through a Web services interface. Each machine and sensor publishes data in its own ASCII or binary format, which is converted to XML documents that share a common schema.

  • Instrument data. Renault uses X-Hive/DB to integrate data from sensors on its Formula 1 race cars, along with weather and track conditions and audio and video signals. The data is then made available via a Web portal to mechanics, engineers, and designers, who use it to perform "what-if" analyses.

  • Customer referrals. TIBCO uses Xpriori's NeoCore XMS to integrate customer referral information from spreadsheets and various documents according to a common schema. The system is used to coordinate customer referrals worldwide.

  • Customer support. A pharmaceutical clearinghouse uses Snapbridge's Cross Media Server to store existing RSS feeds from a number of relational databases. Customer support representatives query these feeds (primarily through full-text searches), then use an embedded URI to get back to the original data. While the solution is not elegant--a schema more targeted to the data than RSS would have been better--it had no impact on the backend databases and was very easy to write.

  • Law enforcement. Raining Data's TigerLogic XDMS is used to search data gathered from government and law enforcement agencies, both domestically and internationally.

Inside the Applications

Data integration applications must solve a number of problems, from data access to security to change management. In this section, we will look at the architectures used to solve two of these problems: queries and mapping schemas.

Query architectures

There are two query architectures for integrating data with a native XML database: local and distributed. In a local query architecture, data is imported into the database as XML and queried locally. In a distributed query architecture, data resides in remote sources and the query engine distributes queries across those data sources. The engine then compiles results and returns them to the application.
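
In many products the two architectures look the same to the application: queries are written against collections, and the engine decides whether a collection is backed by locally imported XML or by a live view over a remote source. The following sketch is hypothetical; "orders" stands for whichever kind of collection the engine provides.

    (: The same XQuery can serve either architecture. Whether
       "orders" is a locally imported copy or a distributed view over
       a remote source is an engine configuration detail, not part of
       the query itself. :)
    for $o in collection("orders")//Order[Status = "OPEN"]
    order by $o/OrderDate
    return <OpenOrder>{$o/OrderNo, $o/Total}</OpenOrder>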

The main advantage of local queries is that they are faster, since no remote calls are made. They are also simpler to optimize, and the engine is simpler to implement, as all queries are local. Their main disadvantage is that data may be stale. A secondary problem is access control, as the local store must enforce controls previously handled by each source. Distributed queries have the opposite advantages and disadvantages: data is live, but queries are slower and harder to optimize and the engine is more complex.

Which architecture to use depends on a number of factors:

  • Support for distributed queries. Only about one sixth of native XML databases, all of them commercial, support distributed queries.

  • Support for your data sources. Most distributed query engines support a limited number of data sources, usually relational, hierarchical, or object-oriented databases.

  • Run-time availability of data sources. Not all data sources are available at run time. For example, they might be disconnected for security reasons (a document database in an intelligence agency), connections to them might be unreliable (a Web site or a system in another company), or they might not allow external queries (a heavily loaded production system).

  • Number of data sources. Some vendors report that distributed queries perform well with a small number of data sources, but slow noticeably as the number of sources increases. While this is partially due to the number of sources, it may also be because of incorrect optimization choices made by the engine.

  • Update strategy. A few native XML databases support distributed updates. While some vendors report that customers find it easier to write updates against a single view in the native XML database, others say that distributed updates have performance problems. They also note that updates done through a native XML database circumvent integrity checks and notifications performed by the applications normally used to update the data.

If a data source cannot be included in a distributed query, its data must be imported and queried locally. About a third of the commercial native XML databases, including most of the popular ones, can import data. Support ranges from a few sources, such as relational databases and Microsoft Word, to hundreds of sources. Some databases also have APIs so you can write your own import modules. And a few databases can refresh data in response to triggers or timeouts. If your database cannot import the data, you must convert it yourself, perhaps with the help of a third-party converter.

Handling differences in schemas

The biggest problem in integrating data is handling differences in schemas. With structural differences, the same concept is represented differently, such as a name using one or multiple fields. With semantic differences, slightly different concepts are represented; these can be straightforward (a price is in US dollars or Euros) or subtle (a price includes a discount). Handling schema differences is mostly just hard work, although some differences cannot be completely resolved. Since the actual resolutions depend on the differences, this section looks at where differences can be resolved.

If data is grouped by schema, such as in relational databases or (sometimes) in native XML database collections, three architectures are common:

  • Address differences in the query. For example, the following query uses different functions to retrieve species names from collections of Bioinformatic Sequence Markup Language (BSML) and MicroArray and Gene Expression Markup Language (MAGE-ML) documents:

    declare function local:bsml_species() as element()*
    {
       for $o in collection("bsml")//Organism
       let $g := fn:string($o/@genus)
       let $s := fn:string($o/@species)
       return <Species>{$g, $s}</Species>
    };
    
    declare function local:mage-ml_species() as element()*
    {
       for $s in collection("mage-ml")//Species
       let $v := $s/OntologyEntry[@category = "NCBI:Taxonomy"]/@value
       return <Species>{fn:string($v)}</Species>
    };
    
    <SpeciesList>
    {
       (: distinct-values() atomizes the elements to strings,
          so the Species elements are rebuilt after deduplication :)
       for $name in fn:distinct-values((local:bsml_species(), local:mage-ml_species()))
       order by $name
       return <Species>{$name}</Species>
    }
    </SpeciesList>
    
  • Convert all documents to the same schema. This allows applications to ignore differences between schemas. Data can be converted at load time, such as when data is stored and queried locally, or at run time, such as when distributed queries are used. In the latter case, conversions are built into the XML views over the data. Note that the common schema only needs to include those fields to be queried. While this limits potential queries, it is a good way to simplify development of limited systems. (A sketch of load-time conversion follows this list.)

  • Build common indexes over documents. A few native XML databases allow users to specify how to build index values. These are used to resolve queries and return pointers to the original documents. For example, suppose one collection of academic papers uses multiple Author elements while another uses a single Authors element. A single index might use the Author elements directly and parse the Authors elements. This allows applications to query papers by author.
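
To illustrate the second approach, the following sketch converts customer names from two hypothetical sources to a common schema at load time: "source-a" stores a name in a single Name field, while "source-b" splits it into First and Last fields. The collection URIs and element names are assumptions for illustration.

    (: A minimal sketch of load-time conversion to a common schema.
       "source-a" keeps whole names; "source-b" splits them. :)
    <Customers>
    {
       for $c in collection("source-a")//Customer
       return <Customer><Name>{fn:string($c/Name)}</Name></Customer>
       ,
       for $c in collection("source-b")//Cust
       return <Customer>
                 <Name>{fn:concat($c/First, " ", $c/Last)}</Name>
              </Customer>
    }
    </Customers>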

If data is not grouped by schema, application logic is more complex, since it is no longer possible to write location-based queries. If documents are limited to a set of schemas, it may be possible to base queries on document structure. For example, our species name query could check whether the root element is Bsml or MAGE-ML before handing a document off to a function that extracts the species name.
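
As a sketch of that check, assuming both document types are stored in a single hypothetical collection named "mixed":

    (: Dispatch on the root element of each document. Element and
       attribute names follow the BSML and MAGE-ML fragments above;
       the collection name "mixed" is an assumption. :)
    for $d in collection("mixed")
    return
       if ($d/Bsml) then
          for $o in $d//Organism
          return <Species>{fn:string($o/@genus), fn:string($o/@species)}</Species>
       else if ($d/MAGE-ML) then
          for $s in $d//Species
          return <Species>{fn:string($s/OntologyEntry[@category = "NCBI:Taxonomy"]/@value)}</Species>
       else ()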

If documents can have any schema, such as when a law office subpoenas all documents pertaining to a particular company and automatically converts them to XML, the best that can usually be done is to explore the documents with a browser to determine if there are common structures or fields. If so, it may be possible to use these in production queries or convert documents so that such queries are possible. As a last resort, users can always perform XML-aware full-text searches.

It may also be possible to ignore differences. For example, suppose an application displays information about a customer from multiple sources. Since a human is reading the information, they can resolve many differences, such as whether a name is stored in one or two fields.

Why You Need a Native XML Database

Vendors report that most of their data integration customers were not able to solve their problems without a native XML database. The problem was that other solutions, such as federated relational databases and data integration toolkits, could not model the types of data involved (documents, semi-structured data, hierarchical data, and so on), could not handle data whose schema was unknown at design time, or could not handle data whose schema changed frequently.

Native XML databases solve the first two problems with the XML data model, which is considerably more flexible than the relational model and can handle schemaless data. While native XML databases do not provide a complete solution for schema evolution, they can at least store documents with rapidly evolving schemas, as we'll discuss later.

Another advantage of native XML databases is that many support XQuery, which, as mentioned earlier, is a good data integration language.

A Peek into the Future

An interesting use of native XML databases in data integration is as a repository for metadata and semantic information. For example, the XML Business Information Portfolio from Software AG uses Tamino to store metadata and semantic information, as well as to retrieve data from backend sources. While nontrivial to set up, this allows applications to execute queries against a common schema.

CompuCredit used an early version of this repository to integrate customer data from more than 100 systems and databases, each of which is exposed as a Web service. In response to a query, the repository constructs an XML document from backend data and ships it across an Enterprise Service Bus to the customer service representative, who receives a single view of the customer's data.
