XML.com: XML From the Inside Out

Mapping DTDs to Databases
by Ronald Bourret

3.3. Mapping Mixed Content

Mixed content is just a choice group to which the * operator applies indirectly, except that it can contain PCDATA mixed between child elements. Thus, the element type references in mixed content can be mapped first to properties that are nullable arrays of unknown size, then to property tables.

To see how to map mixed content, consider the following XML document:

   <A>
   This text <c>cc</c> makes
   <b>bbbb</b> no sense
   <c>cccc</c> except as
   <b>bb</b> an example.
   </A>

and then notice that it is essentially the same as the following document, in which PCDATA has been wrapped in <pcdata> elements:

   <A>
   <pcdata>This text </pcdata><c>cc</c><pcdata> makes
   </pcdata><b>bbbb</b><pcdata> no sense
   </pcdata><c>cccc</c><pcdata> except as
   </pcdata><b>bb</b><pcdata> an example.</pcdata>
   </A>

From this, it is easy to see that PCDATA can be treated like any other child element. Thus, PCDATA in mixed content is mapped to a nullable array of unknown size, then to a property table. The following shows how to map mixed content from a DTD to an object schema:

               DTD                                Classes
   ===============================           ===================

                                             class A {
   <!ELEMENT A (#PCDATA | B | C)*>              String[] pcdata;
   <!ELEMENT B (#PCDATA)>            ==>        String[] b;
   <!ELEMENT C (#PCDATA)>                       String[] c;
                                             }

and an object schema to a database schema:

         Classes                                 Tables
   ===================              ===================================

                                                       Table PCDATA
                                                    ------Column a_fk
   class A {                                       /      Column pcdata
      String[] pcdata;              Table A       /    Table B
      String[] b;           ==>        Column a_pk--------Column a_fk
      String[] c;                                 \       Column b
   }                                               \   Table C
                                                    \-----Column a_fk
                                                          Column c

To see what is actually stored in the database, consider the document shown at the start of this section, which is mapped to the following object, then to rows in the following tables. (We assume that the system generates a primary key of value 1 for the row in the table for A. This is used to link the row in table A to the rows in the other tables.)

            Objects                                       Tables
   ============================               ===============================

                                                            Table PCDATA
                                                            a_fk  pcdata
                                                            ----  -----------
                                                             1    This text 
                                                             1    makes
   object a {                                                1    no sense
      pcdata = {"This text ",                                1    except as
                " makes ",                    Table A        1    an example.
                " no sense ",                   a_pk       
                " except as ",       ==>        ----        Table B
                " an example."}                  1          a_fk   b
      b      = {"bbbb", "bb"}                               ----  ----
      c      = {"cc", "cccc"}                                1    bbbb
   }                                                         1    bb

                                                            Table C
                                                            a_fk   c
                                                            ----  ----
                                                             1    cc
                                                             1    cccc
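The mapping above can be sketched in a few lines of Python. This is only an illustration, not a description of any particular product: the function name `to_object`, the use of SQLite, and the single-line form of the document (line breaks replaced by spaces) are assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# The sample document, written on one line to keep whitespace simple (an assumption).
DOC = ("<A>This text <c>cc</c> makes <b>bbbb</b> no sense "
       "<c>cccc</c> except as <b>bb</b> an example.</A>")

def to_object(root):
    """Collect PCDATA and child element values into one list per type."""
    obj = {"pcdata": [], "b": [], "c": []}
    if root.text:
        obj["pcdata"].append(root.text)
    for child in root:
        obj[child.tag].append(child.text or "")
        if child.tail:  # text that follows the child is PCDATA of the parent
            obj["pcdata"].append(child.tail)
    return obj

a = to_object(ET.fromstring(DOC))

# One class table plus one property table per array-valued property.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (a_pk INTEGER PRIMARY KEY);
    CREATE TABLE PCDATA (a_fk INTEGER REFERENCES A(a_pk), pcdata TEXT);
    CREATE TABLE B (a_fk INTEGER REFERENCES A(a_pk), b TEXT);
    CREATE TABLE C (a_fk INTEGER REFERENCES A(a_pk), c TEXT);
""")
a_pk = conn.execute("INSERT INTO A DEFAULT VALUES").lastrowid
for table, column in (("PCDATA", "pcdata"), ("B", "b"), ("C", "c")):
    conn.executemany(
        f"INSERT INTO {table} (a_fk, {column}) VALUES (?, ?)",
        [(a_pk, value) for value in a[column]],
    )
```

Note that nothing in these tables records where each value appeared relative to the others.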

This example makes it clear that the object-relational mapping is not very efficient at storing mixed content: a single short element is spread across four tables and ten rows. Because of this, it is more commonly used in data-centric applications, which tend to have little mixed content.

There are two ways to solve this problem. The first is to use a mapping other than the object-relational mapping. For example, if the document is modeled using the DOM or a similar structure, and this is mapped to the database with an object-relational mapping, there are far fewer tables in the database -- Document, Element, Attr, Text, etc. -- although a similar number of joins are required to retrieve a document. The second strategy is to not break documents into their smallest possible components but instead to break them into larger pieces, such as chapters or sections. This strategy can be used with the object-relational mapping; for more information, see section 3.6.1, "Mapping Complex Element Types to Scalar Types".

3.4. Mapping Order

This section discusses how the object-relational mapping handles order.

3.4.1. Sibling Order, Hierarchical Order, and Document Order

Sibling means "brother or sister". Thus, sibling elements or PCDATA are elements or PCDATA that have the same parent. In other words, they appear in the same content model. For example, if the document from the previous section is represented as a tree, it is readily apparent which elements are siblings: those elements at the second level of the hierarchy, which all have A as their parent.

                                   A
        ___________________________|______________________
       |      |    |    |     |      |      |      |      |
   This text  C  makes  B  no sense  C  except as  B  an example
              |         |            |             |
              cc       bbbb         cccc           bb 

Note that the elements at the third level of the hierarchy are not siblings because they don't share the same parent. This also points out the difference between sibling order, which is the order in which children occur in their parent, and hierarchical order, which is the level at which children appear in a tree representing the document. Different still is document order, which is the order in which elements and text appear in an XML document. For example:

Sibling order (order not shown where there is only one sibling):

                                   A
        ___________________________|______________________
       |      |    |    |     |      |      |      |      |
   This text  C  makes  B  no sense  C  except as  B  an example
       1      2    3    4     5      6      7      8      9
              |         |            |             |
              cc       bbbb         cccc           bb 

Hierarchical order:

   1                                  A
           ___________________________|______________________
          |      |    |    |     |      |      |      |      |
   2  This text  C  makes  B  no sense  C  except as  B  an example
                 |         |            |             |
   3             cc       bbbb         cccc           bb 

Document order:

                                   A
                                   1
        ___________________________|______________________
       |      |    |    |     |      |      |      |      |
   This text  C  makes  B  no sense  C  except as  B  an example
       2      3    5    6     8      9      11     12     14
              |         |            |             |
              cc       bbbb         cccc           bb
              4         7            10            13
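The three kinds of order can also be computed mechanically. The following Python sketch walks the sample document and records, for each element and each piece of text, its sibling order, its level in the hierarchy, and its document order. The function name `number_nodes` and the single-line form of the document are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

DOC = ("<A>This text <c>cc</c> makes <b>bbbb</b> no sense "
       "<c>cccc</c> except as <b>bb</b> an example.</A>")

def number_nodes(root):
    """Return (label, sibling_order, level, document_order) for every node."""
    rows = []
    counter = 0

    def visit(label, sibling, level, element=None):
        nonlocal counter
        counter += 1  # document order: nodes are numbered as they are encountered
        rows.append((label, sibling, level, counter))
        if element is None:  # a text node has no children
            return
        position = 0  # sibling order restarts inside each parent
        if element.text:
            position += 1
            visit(element.text.strip(), position, level + 1)
        for child in element:
            position += 1
            visit(child.tag, position, level + 1, child)
            if child.tail:
                position += 1
                visit(child.tail.strip(), position, level + 1)

    visit(root.tag, 1, 1, root)
    return rows

for label, sibling, level, document in number_nodes(ET.fromstring(DOC)):
    print(f"{label:12} sibling={sibling} level={level} document={document}")
```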

According to the XML specification, sibling order is significant. In practice, this depends on the application. For example, in a data-centric application, where an XML document is used to populate an object or a table, sibling order usually does not matter because object-oriented languages have no concept of order among their properties. Similarly, relational databases have no concept of order among their columns. Thus, the sibling order is not significant in either of the following documents:

   <Part>
      <Number>123</Number>
      <Desc>Turkey wrench</Desc>
      <Price>10.95</Price>
   </Part>

   <Part>
      <Price>10.95</Price>
      <Desc>Turkey wrench</Desc>
      <Number>123</Number>
   </Part>

both of which can be mapped to the following object and row in a table:

         Objects                                         Tables
   =========================               ===================================
                                                     Table Parts
   object part {                           -------------------------------
      number = 123                ==>      Number  Desc           Price
      desc = "Turkey wrench"               ------  -------------  -----
      price = 10.95                         123    Turkey wrench  10.95
   }
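As a rough illustration of why sibling order does not matter here, the following Python sketch maps each child element to a named field. The function name `to_row` is hypothetical. Because fields are looked up by name, both documents produce the same result.

```python
import xml.etree.ElementTree as ET

PART_1 = ("<Part><Number>123</Number><Desc>Turkey wrench</Desc>"
          "<Price>10.95</Price></Part>")
PART_2 = ("<Part><Price>10.95</Price><Desc>Turkey wrench</Desc>"
          "<Number>123</Number></Part>")

def to_row(xml_text):
    """Map each child element to a named field; sibling order is discarded."""
    return {child.tag: child.text for child in ET.fromstring(xml_text)}

row_1 = to_row(PART_1)
row_2 = to_row(PART_2)
print(row_1 == row_2)  # dictionaries compare by key and value, not by position
```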

(A major exception to this is when a data-centric document must match a specific DTD. This occurs when an application must validate documents, such as when they come from an unknown or untrusted source. Although "all groups" in XML Schemas help in this situation by allowing a set of children to appear in any order, they do not support repeated children.)

On the other hand, in document-centric applications, in which documents are generally designed for human consumption, sibling order is very important. For example, I am likely to like the first review and not the second:

   <Review>
      <p>Ronald Bourret is an
      <b>excellent writer</b>.
      Only an <b>idiot</b>
      wouldn't read his work.</p>
   </Review>

   <Review>
      <p>Ronald Bourret is an
      <b>idiot</b>. Only an
      <b>excellent writer</b>
      wouldn't read his work.</p>
   </Review>

The object-relational mapping can preserve sibling order, as will be seen below, although in practice few products support this. It inherently preserves hierarchical order by mapping references to simple element types to columns in a table and by mapping references to complex element types to primary key/foreign key relationships. It preserves document order when both hierarchical and sibling order are preserved.

3.4.2. Mapping Sibling Order

Because object-oriented languages have no concept of order among their properties, and relational databases have no concept of order among their columns, it is necessary to store sibling order values separately from data values. One way to do this is to introduce separate properties and columns in which to store order values. Another way to do this is to store the order values in the mapping itself.

3.4.2.1. Order Properties and Columns

Order properties and order columns are used to store order values. They are separate from data properties and data columns. One property or column is needed for each referenced element type or PCDATA for which order is deemed important. For example, consider the above mixed content example. The following maps the sibling order in a DTD to order properties:

                DTD                                      Classes
   ===============================               ========================

                                                 class A {
                                                    String[] pcdata;
                                                    int[]    pcdataOrder;
   <!ELEMENT A (#PCDATA | B | C)*>                  String[] b;
   <!ELEMENT B (#PCDATA)>               ==>         int[]    bOrder;
   <!ELEMENT C (#PCDATA)>                           String[] c;
                                                    int[]    cOrder;
                                                 }

and then to order columns:

           Classes                                         Tables
   ========================               ========================================

                                                             Table PCDATA
   class A {                                               -----Column a_fk
      String[] pcdata;                                    /     Column pcdata
      int[]    pcdataOrder;                              /      Column pcdataOrder
      String[] b;                         Table A       /    Table B
      int[]    bOrder;           ==>         Column a_pk--------Column a_fk
      String[] c;                                       \       Column b
      int[]    cOrder;                                   \      Column bOrder
   }                                                      \  Table C
                                                           \----Column a_fk
                                                                Column c
                                                                Column cOrder

Notice that each order property is stored as a column in the same property table as the data property that it orders.

The following example shows order properties being used to preserve sibling order in the "makes-no-sense" example. One important thing to notice here is that all the order properties share the same order space. An order value that appears in one order property won't appear in another order property.

           Objects                                                Tables
   =================================               =====================================

                                                            Table PCDATA
                                                            a_fk pcdata      pcdataOrder
                                                            ---- ----------- -----------
                                                             1   This text   1
   object a {                                                1   makes       3
      pcdata      = {"This text ",                           1   no sense    5
                     " makes ",                              1   except as   7
                     " no sense ",                 Table A   1   an example. 9
                     " except as ",                a_pk
                     " an example."}      ==>      ----     Table B
      pcdataOrder = {1, 3, 5, 7, 9}                 1       a_fk  b   bOrder
      b           = {"bbbb", "bb"}                          ---- ---- ------
      bOrder      = {4, 8}                                   1   bbbb 4
      c           = {"cc", "cccc"}                           1   bb   8
      cOrder      = {2, 6}          
   }                                                        Table C
                                                            a_fk  c   cOrder
                                                            ---- ---- ------
                                                             1   cc   2
                                                             1   cccc 6
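Because the order values share one order space, the original content can be rebuilt by merging the three property tables and sorting on the order value. The following Python sketch shows one way this might be done with SQLite. The query, the column aliases `tag`, `value`, and `ord`, and the stored whitespace are assumptions made for the example.

```python
import sqlite3
from xml.sax.saxutils import escape

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PCDATA (a_fk INTEGER, pcdata TEXT, pcdataOrder INTEGER);
    CREATE TABLE B (a_fk INTEGER, b TEXT, bOrder INTEGER);
    CREATE TABLE C (a_fk INTEGER, c TEXT, cOrder INTEGER);
    INSERT INTO PCDATA VALUES
        (1, 'This text ', 1), (1, ' makes ', 3), (1, ' no sense ', 5),
        (1, ' except as ', 7), (1, ' an example.', 9);
    INSERT INTO B VALUES (1, 'bbbb', 4), (1, 'bb', 8);
    INSERT INTO C VALUES (1, 'cc', 2), (1, 'cccc', 6);
""")

# Merge the property tables; the shared order space makes one sort sufficient.
rows = conn.execute("""
    SELECT NULL AS tag, pcdata AS value, pcdataOrder AS ord
      FROM PCDATA WHERE a_fk = ?
    UNION ALL
    SELECT 'b', b, bOrder FROM B WHERE a_fk = ?
    UNION ALL
    SELECT 'c', c, cOrder FROM C WHERE a_fk = ?
    ORDER BY ord
""", (1, 1, 1)).fetchall()

# A NULL tag marks PCDATA; anything else is wrapped in its element.
content = "".join(
    escape(value) if tag is None else f"<{tag}>{escape(value)}</{tag}>"
    for tag, value, _ in rows
)
document = f"<A>{content}</A>"
print(document)
```

The explicit ORDER BY is what makes the result deterministic; without it, the database is free to return the rows in any order.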

Although order properties are most commonly used to maintain order in mixed content, they can be used with element content as well. For example, consider the following element type definition. Because B can appear an arbitrary number of times in A, it is stored in a separate property table. Without order properties, there would be no way to determine how to order the B children. (Note that row order cannot be used here, as relational databases are not guaranteed to return rows in any particular order.)

   <!ELEMENT A (B*, C)>

3.4.2.2. Storing Order in the Mapping

In many cases, sibling order is important only because of validation; the application itself does not care about sibling order except to be able to validate a document. This is especially true of element content in data-centric documents. In such cases, it may be sufficient to store order information in the mapping itself.

For example, given the following content model, the mapping could store the information that the children of A are ordered B, then C, then D:

   <!ELEMENT A (B, C, D)>

In practice, there are limitations to storing order information in a mapping. For example, consider the following content model:

   <!ELEMENT A (B?, C, B)>

Constructing a document that matches this content model requires software to decide first how much data is available for constructing B elements. If there is only enough data to construct one B element, it won't construct the first B element, since the second B element is required.
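To make the decision concrete, here is a minimal Python sketch of the logic such software would need for this one content model. The function name `build_a` and the representation of children as (name, value) pairs are hypothetical.

```python
def build_a(b_values, c_value):
    """Build the children of A for the content model (B?, C, B).

    The trailing B is required, so it is filled first; the optional
    leading B is built only when a second B value is available.
    """
    if not 1 <= len(b_values) <= 2:
        raise ValueError("(B?, C, B) needs one or two B values")
    children = [("C", c_value), ("B", b_values[-1])]
    if len(b_values) == 2:
        children.insert(0, ("B", b_values[0]))
    return children

print(build_a(["only"], "c"))
print(build_a(["first", "second"], "c"))
```

Even this small case needs logic specific to the content model, and a general solution would have to derive such logic from arbitrary content models.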

It is unlikely that most software will go to the trouble of doing this. Instead, a reasonable limitation is to support only those content models that group all siblings of the same element type together. This is sufficient for many data-centric content models and can be implemented by storing the position of each element in the content model in the mapping.

For example, the order of siblings in the following content models can be mapped this way. Note that in the third content model, Author and Editor can both be assigned the same order value or different values; if they are assigned different values, all elements of one type will occur before any elements of the other type.

   <!ELEMENT Part (Number, Description, Price)>
   <!ELEMENT Order (Number, CustNum, Date, Item*)>
   <!ELEMENT Book (Title, (Author | Editor)+, Price, Review*)>
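As an illustration, the mapping for the Book content model might store one position per element type, which the software then uses to sort children when constructing a document. In this Python sketch, the names `BOOK_ORDER` and `order_children` and all of the data values are hypothetical.

```python
# Hypothetical mapping: element type name -> position in the content model.
# Author and Editor share a position, so they may be interleaved.
BOOK_ORDER = {"Title": 1, "Author": 2, "Editor": 2, "Price": 3, "Review": 4}

def order_children(children, positions):
    """Sort (name, value) pairs by the position stored in the mapping.

    sorted() is stable, so children with equal positions keep the order
    in which they were supplied.
    """
    return sorted(children, key=lambda child: positions[child[0]])

# Children as they might come back from the database, in no particular order.
unordered = [
    ("Price", "29.95"),
    ("Editor", "E. Two"),
    ("Review", "Good"),
    ("Author", "A. One"),
    ("Title", "An Example Book"),
]
ordered = order_children(unordered, BOOK_ORDER)
print([name for name, _ in ordered])
```

Assigning Author and Editor different positions instead would force all elements of one type ahead of the other.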

When order information is stored only in the mapping, round-tripping of documents is not possible whenever the content model contains more than one element of the same type. For example, consider the following content model:

   <!ELEMENT A (B+, C)>

Although the mapping can tell the software that all B elements must occur before the C element, it cannot specify the order of the B elements. Thus, if data is transferred from a document containing this content model to the database and back again, there is no guarantee that the B elements will occur in the same order as in the original document. Fortunately, this is not often a problem for data-centric documents.
