XML.com: XML From the Inside Out

Going Native, Part 2
by Ronald Bourret

Working with Semi-Structured Data

Managing semi-structured data is the third major use case for native XML databases. Semi-structured data has some structure, but isn't as rigidly structured as relational data. While there is no formal definition for semi-structured data, some common characteristics are:

  • Data can contain fields not known at design time. For example, the data might come from a source over which the database designer has no control.

  • Data is self-describing. That is, metadata is associated with individual data values (as with element and attribute names in XML) rather than a group of values of the same type (as with column names in a relational database). Self-descriptions are used to interpret fields not known at design time.

  • The same kind of data may be represented in multiple ways. For example, an address might be represented by one field or by multiple fields, even within a single set of data.

  • Data may be sparse. That is, among fields known at design time, many fields will not have values.
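To make these characteristics concrete, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The data set and every element name in it (customers, customer, address, street, city, nickname) are invented for illustration and do not come from any real schema. The sketch shows one address stored as a single field and another as multiple fields, a customer with no address at all (sparse data), and a field that was not known at design time but can still be interpreted through its element name.

```python
import xml.etree.ElementTree as ET

# Hypothetical data set: all element names are invented for illustration.
DOC = """
<customers>
  <customer>
    <name>Ann</name>
    <address>12 Main St, Springfield</address>
  </customer>
  <customer>
    <name>Bob</name>
    <address><street>9 Elm Rd</street><city>Shelbyville</city></address>
    <nickname>Bobby</nickname>
  </customer>
  <customer>
    <name>Cy</name>
  </customer>
</customers>
"""

KNOWN_FIELDS = {"name", "address"}  # fields known at design time (assumed)


def address_text(customer):
    """Return the address as one string, whichever way it is represented."""
    addr = customer.find("address")
    if addr is None:          # sparse data: this field has no value
        return None
    if len(addr) == 0:        # single-field representation
        return (addr.text or "").strip()
    # Multi-field representation: join the values of the child elements.
    return ", ".join((child.text or "").strip() for child in addr)


def unknown_fields(customer):
    """Return {element name: text} for children not known at design time."""
    return {
        child.tag: (child.text or "").strip()
        for child in customer
        if child.tag not in KNOWN_FIELDS
    }


customers = ET.fromstring(DOC).findall("customer")
for c in customers:
    print(c.findtext("name"), "|", address_text(c), "|", unknown_fields(c))
```

The self-describing element name (nickname) is what lets the code report the unexpected field at all; in a relational table, there would be no column to hold it.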

Semi-Structured Data in the Real World

Semi-structured data occurs in many fields. For example, here are some of the types of semi-structured data that are being stored in native XML databases today:

  • Data integration. Integration data is semi-structured because the same concept is often represented differently in different data sources and changes to remote data sources can result in fields unknown to the integrator. Data integration was discussed earlier.

  • Schema evolution. Rapidly evolving schemas result in semi-structured data because they introduce new fields and may change the way in which data is represented. These problems occur most commonly when data crosses organizational boundaries. Schema evolution is discussed separately.

  • Biological data. Biological data, especially molecular and genetic data, is semi-structured because the field itself is evolving rapidly. As a result, the schemas used in these fields generally allow user-defined data. For example, much of the data in MAGE-ML is stored as hierarchies of user-defined property-value pairs. Similarly, BSML allows users to add arbitrary metadata in the form of property-value pairs.

  • Metadata. Metadata is often semi-structured because users define their own types of metadata. For example, the Metadata Encoding and Transmission Standard (METS), which is used to provide metadata for objects in digital libraries, defines only basic metadata, such as the name of the person who created the METS document, and allows users to define the rest. For example, a user might use Dublin Core to provide information about the title, author, and publisher of a book whose digital image is in a library, and NISO MIX to provide technical data about how the image was created. On the other hand, while the Encoded Archival Description (EAD) schema does not allow user-defined metadata, it is extremely flexible and will likely result in documents that sparsely populate the available fields.

  • Financial data. Financial data is semi-structured because new financial instruments are constantly being invented and because it is often the result of integrating data from many proprietary systems. An additional source of change in the XML world is the rapid development of standards like the Financial Information eXchange Markup Language (FIXML) and Financial products Markup Language (FpML). (Of interest, the FIXML specification explicitly discusses how to customize FIXML.)

  • Health data. Health data is semi-structured because it is sparsely populated, it is often the result of integrating data from many proprietary systems, and user-defined data is common. For example, HL7 has hundreds of elements (it is unlikely that the description of any patient or organization will use all of them) and makes frequent use of the xsd:any element.

  • Business documents. Business documents are semi-structured because the real world is a highly variable place. Most documents contain a core set of fields--name, address, date, and so on--as well as user-defined fields. For example, while insurance claims have a number of fixed fields (name, policy number, date, and so on), the bulk of the information is free-form (accident description, police reports, photographs, and so on).

  • Catalogs. Catalogs are hierarchies of product descriptions. While some catalogs are rigidly structured (that is, a single set of fields can be used to describe each node in the catalog) other catalogs are semi-structured. One reason is that different parts, such as a piston, a tire, and a carburetor, are described by different fields. Another reason is that some catalogs integrate data from different vendors, each of whom uses their own schema.

  • Entertainment data. Entertainment data is semi-structured because the services being described (films, restaurants, hotels, and so on) vary tremendously. As a result, data is sparsely populated and schemas change frequently. Entertainment data also comes from a variety of sources (movie theatres, newspaper reviews, hotel chains, and so on), which may result in integration problems.

  • Customer profiles. Customer profiles are semi-structured for two reasons. They are sparsely populated because few customers have data for all fields (frequent flier numbers, food preferences, preferred travel time, and so on). They evolve rapidly because the ways in which people are described (contact information, exercise preferences, medical conditions, and so on) change constantly.

  • Laboratory data. Laboratory data is semi-structured because it is sparsely populated--different measurements apply to different substances--and because there is ample room for user-defined data. For example, in the pharmaceutical approval process, a single application might handle all of the documentation for applying for drug approval, yet different drugs are likely to require different sets of data.

Inside the Applications

Applications that work with semi-structured data that has a known schema are not significantly different from applications that work with other kinds of data. For example, they use queries defined at design time to retrieve and update data. The main difference is that they often must handle data represented in different ways in different parts of the data set. While this may be unpleasant, it is usually manageable as long as the number of variations is limited.

Applications that work with semi-structured data containing fields not known at design time are fundamentally different. As a general rule, such applications pass unknown fields to humans for processing. For example, suppose a catalog has a basic structure defined by a central authority and uses vendor-specific XML to describe individual items. A catalog browser might be hard-coded to navigate the known structure and use XML-aware full-text searches or // searches to search the unknown structure. Product data might be displayed as raw XML or converted to XHTML with a stylesheet that displays data based on nesting level.
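The catalog browser described above might be sketched roughly as follows, using Python's standard xml.etree.ElementTree module. This is an illustration under stated assumptions, not how any particular product works: the catalog layout, the element names (catalog, item, name, vendorData, and everything inside vendorData), and the helper functions items_mentioning and outline are all hypothetical. The substring match is only a crude stand-in for an XML-aware full-text search, and it looks at element text only, not attribute values.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog: <catalog>/<item>/<name> is the centrally defined
# structure; everything inside <vendorData> is vendor-specific and unknown.
CATALOG = """
<catalog>
  <item id="p1">
    <name>Piston</name>
    <vendorData><bore unit="mm">86</bore><material>aluminium alloy</material></vendorData>
  </item>
  <item id="t1">
    <name>Tire</name>
    <vendorData><size><width>205</width><rim>16</rim></size><season>winter</season></vendorData>
  </item>
</catalog>
"""


def items_mentioning(root, word):
    """Return ids of items whose vendor-specific text contains `word`."""
    word = word.lower()
    hits = []
    for item in root.findall("item"):        # hard-coded, known structure
        vendor = item.find("vendorData")
        if vendor is None:
            continue
        # Unknown structure: gather all text, whatever the element names are.
        text = " ".join(vendor.itertext()).lower()
        if word in text:
            hits.append(item.get("id"))
    return hits


def outline(elem, depth=0):
    """Render an element of unknown structure, indented by nesting level."""
    text = (elem.text or "").strip()
    lines = ["  " * depth + elem.tag + (": " + text if text else "")]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(CATALOG)
print(items_mentioning(root, "winter"))
for line in outline(root.find("item[@id='t1']/vendorData")):
    print(line)
```

The outline function plays the role of the nesting-level stylesheet: it knows nothing about the vendor's vocabulary and leaves interpretation to the person reading the output.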

Similar applications are found in molecular biology, genetics, health care, and library science. In each case, the data describes something--a molecule, a gene, a patient, an archive--and many of the fields are known. The application uses the known fields, for example to let the user drill into the data, and then displays the unknown fields. The person reading the data can interpret it and take further action, such as reading a scientific paper, making a diagnosis, or adding comments.

Another common solution is for the application to evolve with the data. For example, incoming documents can be examined with a generic browser to decide what kinds of queries are possible. In some cases, it might be possible to write specific queries, such as //address to search for addresses; in other cases, the only choice might be full-text searches. While this kind of development is likely to be repugnant to programmers accustomed to working with well-defined schemas, it is a huge improvement for users whose previous choice was to wade through reams of paper or search files in a variety of formats using a variety of tools.
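As a rough sketch of this exploratory style, the following Python code (standard library only) first summarizes which element paths occur in a batch of incoming documents and then runs the //address query mentioned above. The batch content and the helper path_counts are hypothetical. Note that ElementTree supports only a subset of XPath, in which //address is written ".//address" relative to the root element.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical batch of incoming documents with no shared schema.
INCOMING = """
<batch>
  <order><customer><address>1 Oak Ave</address></customer><total>10</total></order>
  <claim><address>2 Pine St</address><notes>Rear bumper damaged</notes></claim>
  <memo>No structured content here</memo>
</batch>
"""


def path_counts(elem, prefix=""):
    """Count how often each element path occurs at or below `elem`."""
    path = prefix + "/" + elem.tag
    counts = Counter({path: 1})
    for child in elem:
        counts.update(path_counts(child, path))
    return counts


root = ET.fromstring(INCOMING)

# Step 1: look at what is there, to decide which queries are possible.
for path, n in sorted(path_counts(root).items()):
    print(n, path)

# Step 2: the summary shows address elements at two different depths,
# so a descendant search finds both without knowing their parents.
addresses = [a.text for a in root.findall(".//address")]
print(addresses)
```

For the memo element, which has no internal structure, the summary makes it clear that a full-text search is the only option.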

Why You Need a Native XML Database

XML is a good way to represent semi-structured data: it does not require a schema; it is self-describing (albeit minimally so); and it represents sparse data efficiently. Thus, native XML databases are a good way to store semi-structured data. They support the XML data model, they can index all fields (even those unknown at design time), they support XML query languages and XML-aware full-text searches, and some support node-based updates.

Relational databases, on the other hand, do not handle semi-structured data well. The main problem is that they require rigidly defined schemas. Thus, fields not known at design time must be stored abstractly, such as with property-value pairs, which are difficult to query. Relational schemas are also difficult to change as the data evolves. A secondary problem is that they do not handle sparse data efficiently: the choices are a single table with lots of NULLs, which wastes space, or many sparsely populated tables, which are expensive to join.
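To illustrate why property-value pairs are awkward to query, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. The table name item_property, its columns, and the sample rows are assumptions made up for this example. The point is only that each additional condition on an item requires another self-join, and that every value is stored as text regardless of its real type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical property-value table: one row per (item, property) pair.
conn.execute("CREATE TABLE item_property (item_id TEXT, name TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO item_property VALUES (?, ?, ?)",
    [
        ("t1", "season", "winter"),
        ("t1", "rim", "16"),
        ("t2", "season", "summer"),
        ("t2", "rim", "16"),
        ("p1", "bore", "86"),
    ],
)

# "Items whose season is winter AND whose rim is 16" needs one self-join
# per condition, because each property lives in its own row.
rows = conn.execute(
    """
    SELECT s.item_id
    FROM item_property AS s
    JOIN item_property AS r ON r.item_id = s.item_id
    WHERE s.name = 'season' AND s.value = 'winter'
      AND r.name = 'rim'    AND r.value = '16'
    """
).fetchall()
print(rows)
```

With the same data stored as XML, the equivalent question can be asked as a single path expression with two predicates (for example, something like //item[season='winter'][rim='16'], assuming those element names).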

According to vendors, many customers couldn't handle their semi-structured data until they used a native XML database. Other customers used a variety of tools, such as grep, full-text search engines, and proprietary applications, or stored some data in a relational database and complete documents as flat files, CLOBs, or even Word documents. As a general rule, these solutions worked in the initial stages, but had limited query capabilities, didn't scale well, and were difficult to maintain as schemas evolved.

(A notable exception occurred in the field of biology. The AceDB database was initially written to store data about the worm C. elegans. It has since evolved into a generic, object-oriented database with its own schema, query languages, and data browsers. Other databases, such as UniProt/Swiss-Prot and GenBank, are (apparently) available in relational and flat-file formats, but are generally queried through proprietary tools such as SRS and Entrez.)

A Peek into the Future

Semi-structured data is still straddling the boundary between academia and industry, so the near term is more likely to consist of gaining experience--managing data, writing applications, handling evolution, and so on--than of creating definitive tools.

In our next installment, we will look at how native XML databases are used in schema evolution, for long-running transactions, for handling large documents, and in a number of other cases, as well as how relational databases are evolving to handle XML.


