XML.com: XML From the Inside Out
oreilly.comSafari Bookshelf.Conferences.

advertisement

Mapping DTDs to Databases
by Ronald Bourret | Pages: 1, 2, 3, 4, 5

3.5. Mapping Attributes

As was seen earlier, attributes are mapped to scalar properties. This section will discuss the details of that mapping, along with other issues.

3.5.1. Mapping Single- and Multi-Valued Attributes

There are two different classes of attributes: single-valued (CDATA, ID, IDREF, NMTOKEN, ENTITY, NOTATION, and enumerated) and multi-valued (IDREFS, NMTOKENS, and ENTITIES). As might be expected, these map to single-valued properties (and then to columns) and multi-valued properties (and then to property tables). For example:

               DTD                               Classes                    Tables
   ============================                ============               ===========

   <!ELEMENT A (B, C)>                         class A {                  Table A
   <!ATTLIST A                                    String b;                  Column B
             D CDATA #REQUIRED>      ==>          String c;      ==>         Column C
   <!ELEMENT B (#PCDATA)>                         String d;                  Column D
   <!ELEMENT C (#PCDATA)>                      }

and:

               DTD                            Classes                    Tables
   ========================               ==============               ==============

   <!ELEMENT A (B, C)>                    class A {                    Table A
   <!ATTLIST A                               String   b;                  Column a_pk
         D IDREFS #IMPLIED>      ==>         String   c;      ==>         Column b
   <!ELEMENT B (#PCDATA)>                    String[] d;                  Column c
   <!ELEMENT C (#PCDATA)>                 }                            Table D
                                                                          Column a_fk
                                                                          Column d

The order in which attributes occur is not significant according to the XML Information Set. For example, the following two XML documents are considered identical. Because of this, there is no need to use order properties to maintain the order in which attributes occur, although it would certainly be possible to do so.

   <A B="bbb"
      C="ccc"
      D="ddd"/>

   <A C="ccc"
      B="bbb"
      D="ddd"/>

On the other hand, the order in which values occur in multi-valued attributes is considered significant. As was the case with sibling elements and PCDATA, order properties can be used to maintain information about the order in which values in a multi-valued attribute occur. However, there is one important difference between the order properties used for sibling elements and PCDATA and those used for multi-valued attributes: each order property used for a multi-valued attribute has its own order space. This can be seen in the following example:

        XML                       Objects
   ===============             =========================

                               object a {
   <A B="dd ee ff"                b = {"dd", "ee", "ff"}
      C="gg hh"/>      ==>        bOrder = {1, 2, 3}
                                  c = {"gg", "hh"}
                                  cOrder = {1, 2}
                               }

Alert readers will note that order properties are not strictly necessary at the object level; array order could be used instead. However, they are needed in relational databases, since there is no concept of row order there.

3.5.2. Mapping ID/IDREF(S) Attributes

ID attributes are used to uniquely identify elements in an XML document. IDREF and IDREFS attributes are used to associate one element with another by referring to the latter element's ID. This is usually done when it is not possible to form this association by nesting one element inside another. For example, consider the following directed graph:

          A
         / \
        B   C
         \ /
          D

This can be represented in an XML document as:

   <A>
      <B ref_d="1">
         ...
      </B>
      <C ref_d="1">
         ...
      </C>
      <D id="1">
         ...
      </D>
   </A>

ID/IDREF(S) attributes map to primary key, foreign key relationships. For example, the above document could be stored in the database with the following tables and columns:

               Table A
                  Column a_pk
                  ...
                /     \
               /       \
   Table B              Table C
      Column a_fk          Column a_fk
      Column ref_d         Column ref_d
      ...                  ...
              \         /
               \       /
               Table D
                  Column a_fk
                  Column id
                  ...

One thing that data transfer software needs to be careful of when storing ID/IDREF(S) attributes in a database is that IDs are only guaranteed to be unique inside a given XML document. Thus, if the data from more than one document is stored in the same table, there is no guarantee that the IDs will be unique. The solution to this problem is to somehow "decorate" the IDs. This could be done by mapping the attributes to two columns, one of which contains a value that is unique to each document and the other of which contains the ID, or by decorating the ID itself, such as prefixing it with a unique value.

A similar problem exists when transferring data from the database to a XML document. If the retrieved data originates from more than one document, the data transfer software needs to ensure that the ID values are unique. This might involve changing one or more values, along with the values of any IDREF(S) attributes that reference them.

Currently, most products do not support ID/IDREF attributes as distinct from other attributes.

3.5.3. Mapping Notation Attributes

Notations are used in XML to alert the application of how an element or unparsed entity is to be treated. For example, the following "xhtml" notation might tell the application that the element contains XHTML and should be displayed by a browser:

   <Desc type="xhtml">
      <html>
      <head><title>Turkey wrench</title></title></head>
      <body>
      <p>A very <b>big</b> turkey wrench.</p>
      </body>
      </html>
   </Desc>

Notation attributes and their values are generally of no interest to the object-relational mapping; they're treated as simply another attribute.

The only exception to this occurs when the notation indicates the data type of the contained text. For example, the notation "base64" might tell the application that an element contains binary data encoded as Base64 (a MIME encoding that maps binary data to a subset of US-ASCII). In most cases, this information will only be of interest to software that generates mappings. It could use the information to map the element type to a binary-valued property and then to a BLOB (Binary Large OBject). In such cases, the mapping itself does not use this information. The mapping from element to BLOB is independent of the fact that a notation contains data type information.

The only exception to this is when the data transfer software is sophisticated enough to switch mappings at run time based on the notation value. In this case, each possible notation is mapped to a data type, which is then used to convert the data.

3.5.4. Mapping ENTITY/ENTITIES Attributes

ENTITY and ENTITIES attributes are used to associate unparsed, external data (such as a binary file) with an XML document. These are mapped the same as any other attribute, except that, when the data is transferred, the entity may be substituted for the attribute value (when transferring data from XML to the database) or a new entity may be created and its identifier stored as the attribute value (when transferring data from the database to XML). Because unparsed entity values can be generated dynamically, it is a good idea for the mapping to specify whether the value or the entity URI is to be stored in the database.

Because unparsed entities always have associated notations, it is possible to use those notations when determining the data type of the entity (either at map time or run time).

3.6. Alternate Mappings

In the previous sections, we have described how to map DTDs to databases. In fact, this description has not been complete, as there are a number of other ways we could have done the mapping. In this section, we will discuss two of the most significant alternatives.

3.6.1. Mapping Complex Element Types to Scalar Types

Although complex element types are normally mapped to classes and then to tables, it is possible to map them to scalar types. In other words, references to complex element types can be mapped to scalar properties. The value of such properties is generally the element content, serialized as XML. This is useful when the element's value makes sense only as a whole and shouldn't be broken into smaller parts.

For example, consider an XML document that gives information about a part. If one of the child elements is a part description in XHTML, it probably makes no sense to fragment it further. As we have already seen, this will result in the data being scattered across numerous tables; one for italic words, one for bold words, one for words used in hyperlinks, etc. Therefore, it is better to store this description in a single column:

             DTD                               Classes                      Tables
   ===========================              ===============             ==============

   <!ELEMENT Part (Num, Desc)>              class Part {                Table Part
   <!ELEMENT Number (#PCDATA)>      ==>        String num;      ==>        Column num
   <!-- Use Inline entity                      String desc;                Column desc
        from XHTML -->                      }
   <!ELEMENT Desc (%Inline;)>

For example, the following description is stored as follows:

   <Part>
      <Number>127</Number>      Table Part
      <Desc>                    Num  Desc
         A very <b>big</b>  =>  ---  ----------------
         turkey wrench.         127  A very <b>big</>
      </Desc>                        turkey wrench.
   </Part>

Note that storing data as XML does cause problems for data transfer software. In particular, the software cannot distinguish between markup and data. For example, how does the application determine if the <b> in the following text is a <b> element or text?

   An example of the <b> element is <b>this element</b>.

One solution to this is to store actual elements in the database using tags and character data using entity references:

   An example of the &lt;b&gt; element is <b>this element</b>.

The problem with this is that non-XML applications can't then search the database as they might expect.

3.6.2. Mapping Scalar Properties to Property Tables

Although single, scalar-valued properties are usually mapped to columns, they can be mapped to property tables. This is useful, for example, for storing BLOBs or infrequently used properties in a separate table from the main table. For example:

      Classes                        Tables
   ===============             ==================

   class Part {                Table Parts
      String num;      ==>        Column num
      String desc;
                               Table Descriptions
   }                              Column num
                                  Column desc

3.6. Conclusions

The object-relational mapping works for all XML documents, maps well to objects, and allows non-XML applications to use the data in the database. Because of this, it is a good choice for storing data-centric documents and (not surprisingly) is used as the underlying model in some middleware, most XML-enabled relational databases, and most XML-enabled object servers.

It should be noted that all of these products implement slightly different versions of the object-relational mapping and none implements all possibilities in the mapping. Of the more common parts of the mapping, all map complex element types to classes and references to element types to properties, as well as using primary key, foreign key pairs to join tables. However, some map columns only to PCDATA-only elements, others map columns only to attributes, and still others allow both. Similarly, most of these products do not support sibling order or mixed content, and many do not allow users to change names during the mapping.

The object-relational mapping is not a good choice for ordinary documents. First, it is inefficient when used with mixed content. Second, like the table-based mapping, it does not preserve physical structure, comments, or processing instructions.

4. Generating Schema

We now consider how to generate relational database schema from a DTD according to the object-relational mapping and vice versa. Since there are a number of possible paths through the object-relational mapping, the algorithms here simply choose the most commonly used branch each time there is a choice. For example, single references to PCDATA-only elements can be mapped to a column or a separate property table. Since the most common choice is to map them to a column, the following algorithm generates a column from such references.

For object-oriented databases, the generation process is similar.

4.1. Generating Relational Database Schema from DTDs

Relational schemas are generated by reading through the DTD and processing each element type:

  • Complex element types generate class tables with primary key columns.
  • Simple element types are ignored except when processing content models.

To process a content model:

  • Single references to simple element types generate columns; if the reference is optional (? operator), the column is nullable.
  • Repeated references to simple element types generate property tables with foreign keys.
  • References to complex element types generate foreign keys in remote class tables.
  • PCDATA in mixed content generates a property table with a foreign key.
  • Optionally generate order columns for all referenced element types and PCDATA.

To process attributes:

  • Single-valued attributes generate columns; if the attribute is optional, the column is nullable.
  • Multi-valued attributes generate property tables with foreign keys.
  • If an attribute has a default, it is used as the column default.

The following example shows how this process works. Consider the following DTD:

                          DTD                                           Tables
   =================================================               =================

   <!ELEMENT Order (OrderNum, Date, CustNum, Item*)>
   <!ELEMENT OrderNum (#PCDATA)>
   <!ELEMENT Date (#PCDATA)>
   <!ELEMENT CustNum (#PCDATA)>
                                                          
   <!ELEMENT Item (ItemNum, Quantity, Part)>
   <!ELEMENT ItemNum (#PCDATA)>
   <!ELEMENT Quantity (#PCDATA)>
                                                          
   <!ELEMENT Part (PartNum, Price)>
   <!ELEMENT PartNum (#PCDATA)>
   <!ELEMENT Price (#PCDATA)>

In the first step, we generate tables for complex element types and primary keys for these tables:

                          DTD                                           Tables
   =================================================               =================

   <!ELEMENT Order (OrderNum, Date, CustNum, Item*)>      ==>      Table Order
   <!ELEMENT OrderNum (#PCDATA)>                                      Column OrderPK
   <!ELEMENT Date (#PCDATA)>
   <!ELEMENT CustNum (#PCDATA)>
                                                         
   <!ELEMENT Item (ItemNum, Quantity, Part)>              ==>      Table Item
   <!ELEMENT ItemNum (#PCDATA)>                                       Column ItemPK
   <!ELEMENT Quantity (#PCDATA)>
                                                          
   <!ELEMENT Part (PartNum, Price)>                       ==>      Table Part
   <!ELEMENT PartNum (#PCDATA)>                                       Column PartPK
   <!ELEMENT Price (#PCDATA)>

In the second step, we generate columns for references to simple element types:

                          DTD                                           Tables
   =================================================               =================

   <!ELEMENT Order (OrderNum, Date, CustNum, Item*)>      ==>      Table Order
   <!ELEMENT OrderNum (#PCDATA)>                                      Column OrderPK
   <!ELEMENT Date (#PCDATA)>                                          Column OrderNum
   <!ELEMENT CustNum (#PCDATA)>                                       Column Date
                                                                      Column CustNum
                                                          
   <!ELEMENT Item (ItemNum, Quantity, Part)>              ==>      Table Item
   <!ELEMENT ItemNum (#PCDATA)>                                       Column ItemPK
   <!ELEMENT Quantity (#PCDATA)>                                      Column ItemNum
                                                                      Column Quantity

   <!ELEMENT Part (PartNum, Price)>                       ==>      Table Part
   <!ELEMENT PartNum (#PCDATA)>                                       Column PartPK
   <!ELEMENT Price (#PCDATA)>                                         Column PartNum
                                                                      Column Price

In the final step, we generate foreign keys for references to complex element types:

                          DTD                                           Tables
   =================================================               =================

   <!ELEMENT Order (OrderNum, Date, CustNum, Item*)>      ==>      Table Order
   <!ELEMENT OrderNum (#PCDATA)>                                      Column OrderPK
   <!ELEMENT Date (#PCDATA)>                                          Column OrderNum
   <!ELEMENT CustNum (#PCDATA)>                                       Column Date
                                                                      Column CustNum
                                                          
   <!ELEMENT Item (ItemNum, Quantity, Part)>              ==>      Table Item
   <!ELEMENT ItemNum (#PCDATA)>                                       Column ItemPK
   <!ELEMENT Quantity (#PCDATA)>                                      Column ItemNum
                                                                      Column Quantity
                                                                      Column OrderFK
                                                          
   <!ELEMENT Part (PartNum, Price)>                       ==>      Table Part
   <!ELEMENT PartNum (#PCDATA)>                                       Column PartPK
   <!ELEMENT Price (#PCDATA)>                                         Column PartNum
                                                                      Column Price
                                                                      Column PartFK

A generated schema isn't going to be the same as a human would have written. In addition to naming problems (for example, a person might have called the tables Orders, Items, and Parts), the generation algorithm was unable to determine that the relationship between Items and Parts was many-to-one. The algorithm was also unable to recognize that OrderNum and PartNum could be used as primary keys, and it could not determine data types and column lengths, although XML Schemas will solve the latter problem. Although no naming collisions occurred or illegal names were generated, both are possible.

Pages: 1, 2, 3, 4, 5

Next Pagearrow