XML.com: XML From the Inside Out
oreilly.comSafari Bookshelf.Conferences.

advertisement

Mapping DTDs to Databases
by Ronald Bourret | Pages: 1, 2, 3, 4, 5

3.1. The Basic Mapping

An object-relational mapping is performed in two steps. First, an XML schema (a DTD in this case) is mapped to an object schema, then the object schema is mapped to the database schema. The two mappings can optionally be combined for a direct DTD-to-database mapping, as is done in most software today.

In considering this mapping, it is important to understand that the objects involved are specific to each DTD and not objects from the DOM. In particular, these objects model the data in the XML document, while the DOM models the structure of the XML document. For example, the following shows the object trees for data-specific objects and the DOM that would be created for the XML document in the example above:

    SalesOrder                  Document
     /      \                       |
   Item    Item                  Element_______
                          ______/ /  \ \______ \_________
                         /       /    \       \          \
                    Element  Element  Element  Element   Element
                       |        |        |    /  |  \ \    etc.
                     Text     Text     Text  /   |   \ \_______
                                            /    |    \        \
                                      Element Element Element  Attr
                                         |       |       |       |
                                       Text    Text    Text    Text

This distinction is particularly important when you consider how the data is stored in the database. To store the data-specific objects, you will need SalesOrders and Items tables; to store the DOM objects, you will need Document, Element, Text, and Attr tables. The most important difference is that non-XML applications can use the data-specific tables but can't use the DOM-specific tables.

3.1.1. Mapping DTDs to Object Schemas

The mapping starts by recognizing that element types are data types. Element types that have PCDATA-only content are called simple element types; the term is borrowed from W3C XML Schema. These hold a single data value and are equivalent to scalar data types in an object-oriented programming language. (Note that the word "scalar" is used here to mean "consisting of a single data value". In some languages, "scalar" data types -- in this sense of the word -- are represented by objects. The most notable example of this is the String data type in Java.) Attribute types are also simple types.

Element types that have element or mixed content, or that have attributes, are called complex data types; the term is again borrowed from XML Schema. These hold a structured value and are equivalent to classes in an object-oriented programming language or structs in C. Note that an element type that has empty content and attributes is still "complex". The reason for this is that the attributes also provide structure and are roughly equivalent to child PCDATA-only elements.

The object-relational mapping first maps simple types to scalar data types. For example, the element type Title might be mapped to a String, and the element type Price might be mapped to a float. It then maps complex types to classes, with each element type in the content model of the complex type mapped to a property of that class. The data type of each property is the data type to which the referenced element type is mapped. For example, a reference to a Title element would be mapped to a String property, and a reference to a Price element would be mapped to a float property. References to complex element types are mapped to pointers/references to an object of the class to which the complex element type is mapped.

The last part of the mapping maps attributes to properties, with the data type of the property determined from the data type of the attribute. Note that attributes are equivalent to references to element types in a content model. This is because, like references in a content model, they are local to a given element type. The only conceptual difference is that the attribute type is defined locally, rather than at a global (DTD-wide) level, as is the case with element types.

For example, in the following the simple element types B, D, E, and the attribute F are all mapped to Strings and the complex element types A and C are mapped to classes A and C. The content models and attributes of A and C are mapped to properties of classes A and C. The references to B, D, and E in the content models of A and C are mapped to String properties (because the types are mapped to Strings) and the attribute F is also mapped to a String property. The reference to C in the content model of A is mapped to a property with the type pointer/reference to an object of class C because element type C is mapped to class C.

              DTD                         Classes
   =========================           =============

   <!ELEMENT A (B, C)>                 class A {
   <!ELEMENT B (#PCDATA)>                 String b;
   <!ATTLIST A                  ==>       C      c;
          F CDATA #REQUIRED>              String f;
                                       }


   <!ELEMENT C (D, E)>                 class C {
   <!ELEMENT D (#PCDATA)>       ==>       String d;
   <!ELEMENT E (#PCDATA)>                 String e;
                                       }

One important point to reiterate here is that mapping references to element types in a content model is different than mapping the element types themselves. Element types map to data types, while references to element types map to properties in structured data types (classes). The difference is clear when you consider an element type that is referenced in two different content models. In this case, each reference must be mapped separately, with the data types of the resulting properties determined by the data types to which the element types themselves (not the references) are mapped.

For example, consider the Title and Section element types in the following. Each of these element types is referenced in the content models of both Chapter and Appendix. Each reference is mapped separately for each parent element type. The data type of the properties to which the references to Title are mapped is String, since Title contains only PCDATA and is mapped to a String. The data type of the properties to which the references to Section are mapped is a pointer/reference to a Section object, since the Section element type is complex and is mapped to a Section class.

                     DTD                                      Classes
   =========================================               =============

   <!ELEMENT Chapter (Title, Section+)>                    class Chapter {
   <!ELEMENT Appendix (Title, Section+)>          ==>         String    title;
   <!ELEMENT Title (#PCDATA)>                                 Section[] sections;
   <!ELEMENT Section (#PCDATA | p | b | i)*>               }

                                                           class Appendix {
                                                              String    title;
                                                              Section[] sections;
                                                           }

The other important point to reiterate is that simple element types and attributes can be mapped to data types other than String. For example, an element type named Quantity might be mapped to an integer. When mapping from a DTD, this requires human intervention, since there is no way to predict the target data type from a PCDATA-only element type. However, when mapping from an XML Schema, the target type is known since XML Schemas have data types.

3.1.2. Mapping Object Schemas to Database Schemas

In the second part of the object-relational mapping, classes are mapped to tables (known as class tables), scalar properties are mapped to columns, and pointer/reference properties are mapped to primary key/foreign key relationships. For example:

     Classes                     Tables
   ============             =================
   class A {                Table A:
      String b;                Column b
      C      c;     ==>        Column c_fk
      String f;                Column f
   }

   class C {                Table C:
      String d;     ==>        Column d
      String e;                Column e
   }                           Column c_pk

Note that the tables are joined by a primary key (C.c_pk) and a foreign key (A.c_fk). Because the relationship between the parent and child elements is one-to-one, the primary key can be in either table. If the relationship is one-to-many, the primary key must be on the "one" side of the relationship, regardless of whether this is the parent or child. For example, if a SalesOrder element contains multiple Item elements, then the primary key must be in the SalesOrder table (the parent). But if each Item element contains a Part element, the primary key must be in the Part table (the child), since one part can occur in many items.

A primary key column can be created as part of the mapping, as is the case for the column c_pk, or an existing column or columns can be used as the primary key. For example, if a SalesOrder element type has a Number child, this might be mapped to the primary key column.

If a primary key column is created as part of the mapping, its value must be generated by the database or the data transfer software. While this is generally considered better database design than using data columns as primary keys, it has a disadvantage when used with XML, and this is that the generated key is meaningless outside the source database. Thus, when data with a generated key is transferred to an XML document, it will either contain a meaningless primary key (if the primary key value is transferred) or no primary key at all (if the key is not transferred). In the latter case, it may be impossible to re-identify the source of the data, which is a problem if the data is modified and returned to the database as an XML document.

3.1.3. Miscellany

Before we continue with the more complicated parts of the mapping, we need to mention two things. First, names can be changed during the mapping. For example, the DTD, object schema, and relational schema can all use different names. For example, the following DTD uses different names than the following class:

                DTD                                   Classes
   ===============================             =====================

   <!ELEMENT Part (Number, Price)>             class PartClass {    
   <!ELEMENT Number (#PCDATA)>         ==>        String numberProp;
   <!ELEMENT Price (#PCDATA)>                     float  priceProp;
                                               }

which uses different names than the following table:

         Classes                           Tables
   =====================             ==================

   class PartClass {                 Table PRT
      String numberProp;     ==>        Column PRTNUM
      float  priceProp;                 Column PRTPRICE
   }

Second, the objects involved in the mapping are conceptual. That is, there is no need to actually instantiate them when transferring data between an XML document and a relational database. (This is not to say that the objects can't be instantiated. Whether it is useful to instantiate the objects depends on the actual application.)

3.2. Mapping Complex Content Models

The content models we have seen so far are relatively simple. What happens with more complex content models, such as the following?

   <!ELEMENT A (B?, (C | ((D | E | F | G)*, (H | I)+, J?)))>

In this section, we will consider the various parts of the content model. Mapping the above example will be left as an exercise to the reader. (I've always wanted to say that.)

3.2.1. Mapping Sequences

As has already been seen, each element type referenced in a sequence is mapped to a property, which is then mapped either to a column or to a primary key, foreign key relationship. For example:

            DTD                          Classes                    Tables
   ======================              ============             ==============

   <!ELEMENT A (B, C)>                 class A {                Table A
   <!ELEMENT B (#PCDATA)>     ==>         String b;     ==>        Column b
                                          C      c;                Column c_fk
                                       }

   <!ELEMENT C (D, E)>                 class C {                Table C
   <!ELEMENT D (#PCDATA)>     ==>         String d;     ==>        Column d
   <!ELEMENT E (#PCDATA)>                 String e;                Column e
                                       }                           Column c_pk

3.2.2. Mapping Choices

As with sequences, each element type referenced in a choice is also mapped to a property, then either to a column or a primary key, foreign key relationship. The only difference from the way sequences are mapped is that the properties and columns can be null. For example, suppose we changed the sequence in the content model of A to a choice. The mapping from DTD to object schema would then be

            DTD                                Classes
   ======================              ========================

   <!ELEMENT A (B | C)>                class A {
   <!ELEMENT B (#PCDATA)>     ==>         String b; // Nullable
                                          C      c; // Nullable
                                       }
   <!ELEMENT C (D, E)>                 class C {
   <!ELEMENT D (#PCDATA)>     ==>         String d;
   <!ELEMENT E (#PCDATA)>                 String e;
                                       }

and the mapping from object schema to database schema would be

           Classes                               Tables
   ========================             ===========================

   class A {                            Table A (
      String b; // Nullable     ==>        Column b     // Nullable
      C      c; // Nullable                Column c_fk  // Nullable
   }
   class C {                            Table C (
      String d;                 ==>        Column d     // Not nullable
      String e;                            Column e     // Not nullable
   }                                       Column c_pk  // Not nullable

To see why this is true, consider the following XML document, which conforms to the DTD above. Because the choice in the content model of A requires that either B or C (but not both) be present as a child, one of the two corresponding properties (and columns) will always be null.

       XML                       Objects                    Tables
   =============               ============               ===========

                                                            Table A
                                                           ---------
   <A>                         object a {                  b    c_fk
      <B>bbb</B>      ==>         b = "bbb"      ==>      ---   ----
   </A>                           c = null                bbb   NULL
                               }                          ...   ...

Note that if the primary key used to join the tables was in table A, the corresponding foreign key column in C would not be nullable. If A did have a C child, the column would have to have a value to join it to the correct row in table A. If A did not have a C child, there would simply be no row in table C.

3.2.3. Mapping Repeated Children

Children that can occur multiple times in their parent, which are known as repeated children, are mapped to multi-valued properties and then either to multiple columns in a table or to a separate table, known as a property table.

If a content model contains repeated references to an element type, the references are mapped to a single property, which is an array of known size. This can be mapped either to multiple columns in a table or to a property table. For example, the following shows how to map a repeated reference to multiple columns in a table:

              DTD                              Classes                      Tables
   =========================                 ==============               ===========

   <!ELEMENT A (B, B, B, C)>                 class A {                    Table A
   <!ELEMENT B (#PCDATA)>         ==>           String[] b;      ==>         Column b1
   <!ELEMENT C (#PCDATA)>                       String   c;                  Column b2
                                             }                               Column b3
                                                                             Column c

If the + or * operator is applied to a reference, the reference is again mapped to a single property, which this time is an array of unknown size. Since the number of values can be arbitrarily large, the property must be mapped to a property table, which will contain one row for each value. The property table is linked to the class table by a primary key, foreign key relationship, where the primary key is in the class table. For example,

            DTD                            Classes                       Tables
   ======================               ==============               ==============

   <!ELEMENT A (B+, C)>                 class A {                    Table A
   <!ELEMENT B (#PCDATA)>      ==>         String[] b;      ==>         Column a_pk
   <!ELEMENT C (#PCDATA)>                  String   c;                  Column c
                                        }
                                                                     Table B
                                                                        Column a_fk
                                                                        Column b

3.2.4. Mapping Optional Children

Children that are optional in their parent are mapped to nullable properties, then to nullable columns. This has already seen for children that appear in a choice group, as in the following mapping from DTD to object schema:

              DTD                               Classes
   =========================            ========================

   <!ELEMENT A (B | C | D)>             class A {
   <!ELEMENT B (#PCDATA)>       ==>        String b; // Nullable
   <!ELEMENT C (#PCDATA)>                  String c; // Nullable
                                           D      d; // Nullable
                                        }
   <!ELEMENT D (E, F)>                  class D {
   <!ELEMENT E (#PCDATA)>       ==>        String e;
   <!ELEMENT F (#PCDATA)>                  String f;
                                        }

and object schema to database schema:

            Classes                               Tables
   ========================               ========================

   class A {                              Table A
      String b; // Nullable      ==>         Column b     // Nullable
      String c; // Nullable                  Column c     // Nullable
      D      d; // Nullable                  Column d_fk  // Nullable
   }
   class D {                              Table D
      String e;                  ==>         Column e     // Not nullable
      String f;                              Column f     // Not nullable
   }                                         Column d_pk  // Not nullable

It also applies when the ? or * operator is applied to a reference, as in the following mapping from DTD to object schema:

            DTD                                  Classes
   ======================               ==========================

   <!ELEMENT A (B?, C*)>                class A {
   <!ELEMENT B (#PCDATA)>      ==>         String b;   // Nullable
   <!ELEMENT C (#PCDATA)>                  String c[]; // Nullable
                                        }

and object schema to database schema:

            Classes                                    Tables
   ==========================               ===============================

   class A {                                Table A
      String b;   // Nullable      ==>         Column b     // Nullable
      String c[]; // Nullable                  Column a_pk  // Not nullable
   }
                                            Table C
                                               Column a_fk  // Not nullable
                                               Column c     // Not nullable

Note that the column used to store C (in property table C) is not nullable. This is because, if no C children are present in A, there are simply no rows in table C.

3.2.5. Mapping Subgroups

References in subgroups are mapped to properties of the parent class, then to columns in the class table, as in the following mapping from DTD to object schema:

            DTD                                     Classes
   =========================              ============================

   <!ELEMENT A (B, (C | D))>              class A {
   <!ELEMENT B (#PCDATA)>        ==>         String b; // Not nullable
   <!ELEMENT C (#PCDATA)>                    String c; // Nullable
   <!ELEMENT D (E, F)>                       D      d; // Nullable
                                          }
   <!ELEMENT D (E, F)>                    class D {
   <!ELEMENT E (#PCDATA)>        ==>         String e;
   <!ELEMENT F (#PCDATA)>                    String f;
                                          }

and object schema to database schema:

           Classes                                     Tables
   ============================               ==========================

   class A {                                  Table A
      String b; // Not nullable      ==>         Column b    // Not nullable
      String c; // Nullable                      Column c    // Nullable
      D      d; // Nullable                      Column d_fk // Nullable
   }
   class D {                                  Table D (
      String e;                       ==>        Column e
      String f;                                  Column f
   }                                             Column d_pk

You might be wondering how this is possible. What happened to the structure imposed by the subgroup? In fact, this structure appears only in the content model, not in instance documents. For example, both of the following documents conform to the above content model:

   <A>
      <B>bbbbbb</B>
      <C>cccccc</C>
   </A>

   <A>
      <B>bbbbbb</B>
      <D>
         <E>eee</E>
         <F>fff</F>
      </D>
   </A>

There's no way to determine the presence of the subgroup from these documents. Structurally, C and D are indistinguishable from B; they are just children of A. Thus, they can be mapped like children of A that are not in a subgroup.

One consequence of mapping references in subgroups directly to properties of the parent class is that repeatability and optionality can be indirect. For example, in the following content model C, D, and E are both optional and repeatable. They are repeatable because the + operator applies to them indirectly. C is optional because it is directly in a choice group, but D and E are optional because they are indirectly in a choice group.

               DTD                                       Classes
   ===============================            =============================

   <!ELEMENT A (B, (C | (D, E))+)>            class A {
   <!ELEMENT B (#PCDATA)>                        String   b;
   <!ELEMENT C (#PCDATA)>            ==>         String[] c; // May be null  
   <!ELEMENT D (E, F)>                           D[]      d; // May be null
   <!ELEMENT E (#PCDATA)>                        String[] e; // May be null
                                              }

Pages: 1, 2, 3, 4, 5

Next Pagearrow