The Object Model-based approach

November 17, 1999

Object model approaches

Contents

Part 1: XML Programming with C++
Part 2: The SAX-based approach
Part 3: The Object Model-based approach
Part 4: Uses and Tradeoffs

The previous section presented the event-driven approach to handling XML documents. There is another option: the "object model" approach. This approach is also known as the "tree-based approach," and it is based on the idea of parsing the whole document and constructing an object representation of it in memory.

There is a standard, language-independent specification, written in OMG's IDL, for the constructed object model. It is called the Document Object Model, or DOM.

DOM

This section presents the basic ideas behind the DOM, and the typical steps involved when writing an XML DOM-based module in your C++ application.

Expressing a document as a structure and making it available to the application is not new: all major browsers have done so for years in their own proprietary way. The important idea behind the XML DOM is that it standardizes the model to use when representing any XML document in memory. DOM-based C++ parsers produce a DOM representation of the document instead of informing the application when they encounter elements, attributes, and so on.

The Document Object Model is a language- and platform-independent interface that allows programs and scripts to dynamically access and update the content, structure, and style of documents. There is a core set of interfaces that every DOM 1.0-compliant implementation must provide. Here we concentrate on those core interfaces. Currently, anything in a document can be accessed using the DOM (1.0), except for the internal and external DTD subsets, for which no API exists yet.

The DOM, as the name implies, is an object model as opposed to a data model (see a complete UML class diagram here).

The object-oriented interfaces define the semantics of a structural model, independently of the implementation chosen for it. That means that DOM parser implementations are free to choose whatever internal representation they like, as long as they comply with the DOM interfaces. The next section will show the basic DOM core interfaces; then we will look at the steps you will use with the DOM approach.

DOM Interfaces

The DOM level 1 core defines a basic set of interfaces that allow the manipulation of XML documents. It provides methods for the access and population of the document. These methods are encapsulated in two sets of interfaces: the fundamental core interfaces and the extended interfaces.

Here is a basic presentation of the main interfaces. For a complete description and all the methods, you will need to download a DOM library. Again, xml4c2 is a good choice because of its excellent documentation.

Fundamental Interfaces

  • Node: This interface is the primary datatype for the entire Document Object Model. It represents a single node in the document tree. This is the base interface for everything in the model, so all objects implementing the Node interface expose the methods defined by it. Be careful here: some derivatives of Node, like the text node, expose Node methods they do not really support. Adding a child to a text node, for example, results in an exception, since a text node cannot have children.

  • Document: The class used to refer to XML Document nodes in the DOM. Conceptually, a DOM document node is the root of the document tree, and provides the primary access to the document's data.

  • DocumentFragment: A "lightweight" Document object. This object encapsulates a portion of the document, which is very useful in applications that need to rearrange or modify portions of the tree, for example an editor doing a cut/paste. Note that the fragment contained is not necessarily (or even often) a valid XML document.

  • Element: The majority of objects, apart from text, that one may find in the DOM tree are DOM Element nodes. They represent elements in the document object model, and since they can have other Element nodes as children, their structure reflects the arrangement of the original XML document.
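The caveat about text nodes can be illustrated with a small sketch. The classes below are hypothetical toy types written for this illustration and are not the xml4c2 classes; HierarchyError stands in for the DOMException a real implementation would raise. Every node type inherits appendChild from Node, but the text node rejects the call at run time.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the DOMException raised by a real implementation.
struct HierarchyError : std::runtime_error {
    HierarchyError() : std::runtime_error("node type cannot have children") {}
};

// Toy base interface: every node exposes the same methods.
class Node {
public:
    virtual ~Node() = default;
    virtual std::string nodeName() const = 0;
    virtual void appendChild(std::shared_ptr<Node> child) {
        children_.push_back(std::move(child));
    }
    const std::vector<std::shared_ptr<Node>>& childNodes() const { return children_; }
private:
    std::vector<std::shared_ptr<Node>> children_;
};

class Element : public Node {
public:
    explicit Element(std::string tag) : tag_(std::move(tag)) {}
    std::string nodeName() const override { return tag_; }
private:
    std::string tag_;
};

class Text : public Node {
public:
    explicit Text(std::string data) : data_(std::move(data)) {}
    std::string nodeName() const override { return "#text"; }
    const std::string& data() const { return data_; }
    // The method is part of the interface, but a text node cannot honor it.
    void appendChild(std::shared_ptr<Node>) override { throw HierarchyError(); }
private:
    std::string data_;
};

// Returns true if `parent` refuses to accept a child node.
bool rejectsChildren(Node& parent) {
    try {
        parent.appendChild(std::make_shared<Text>("x"));
    } catch (const HierarchyError&) {
        return true;
    }
    return false;
}
```

With these toy types, rejectsChildren returns false for an Element and true for a Text node, which is why code that walks a tree generically should check the node type before modifying children.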

Other fundamental interfaces

The rest of the fundamental interfaces are: DOMImplementation, NodeList, NamedNodeMap, CharacterData, DOMException (which in IDL is not an interface but an exception), Attr, Text, Comment. Again, for more details on these, you are encouraged to download the complete documentation included in toolkits like xml4c2.

Extended Interfaces

The extended interfaces also form part of the core DOM. These interfaces are:

  • CDATASection

  • DocumentType

  • EntityReference

  • ProcessingInstruction

Important details about DOM level 1

In order to get a complete view of the DOM, you should read the W3C recommendation. Nevertheless, here are some important points to keep in mind:

Limitations of Level 1

DOM Level 1 is strictly limited to those methods needed to represent and manipulate the document structure. At the time of writing, DOM Level 2 had not yet been endorsed as a W3C Recommendation, and no C++ implementation was available, so for the sake of usability, I decided to focus on DOM Level 1. Future DOM levels may provide:

  • Thread safety

  • Events

  • Control for rendering via stylesheets

  • Access control

Persistence

Saving the DOM representation is left to the implementation.

Repercussions of changes to nodes

Changes to a Node are reflected in any NodeList or NamedNodeMap that refers to it. (This translates to the use of references or pointers in your C++ implementation.)
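A minimal sketch of this "live" behavior follows, assuming a toy Node struct and a toy NodeList class invented for this illustration (the method names getLength and item mirror the DOM interface, but this is not the xml4c2 implementation). The list keeps a pointer to the parent rather than a copy of its children, so later changes show up automatically.

```cpp
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy node, not the xml4c2 API.
struct Node {
    std::string name;
    std::vector<std::shared_ptr<Node>> children;
};

// A live list: it shares the parent instead of copying its children.
class NodeList {
public:
    explicit NodeList(std::shared_ptr<Node> parent) : parent_(std::move(parent)) {}

    std::size_t getLength() const { return parent_->children.size(); }

    // Returns a null pointer when the index is out of range.
    std::shared_ptr<Node> item(std::size_t index) const {
        if (index >= parent_->children.size()) {
            return nullptr;
        }
        return parent_->children[index];
    }

private:
    std::shared_ptr<Node> parent_;
};
```

If a NodeList is created for an empty parent and a child is appended afterwards, getLength on the existing list reports one item without the list being rebuilt.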

Memory management

The DOM API is memory-management-philosophy independent (i.e., the recommendation does not specify any memory management policy). This is the responsibility of the implementation.

DOMString type

DOM defines the DOMString type in IDL as

typedef sequence<unsigned short> DOMString;

That is, a sequence of 16-bit units using the UTF-16 encoding.
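One practical consequence is that the number of 16-bit units is not always the number of characters. The sketch below assumes std::u16string as a C++ stand-in for DOMString (char16_t playing the role of the IDL unsigned short); the helper names codeUnits and codePoints are hypothetical, and a real toolkit defines its own string class.

```cpp
#include <cstddef>
#include <string>

// Assumption: char16_t stands in for the IDL "unsigned short".
using DOMString = std::u16string;

// Number of 16-bit units stored in the string.
std::size_t codeUnits(const DOMString& s) {
    return s.size();
}

// Number of Unicode characters, counting each surrogate pair once.
std::size_t codePoints(const DOMString& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        // Trailing surrogates (0xDC00..0xDFFF) complete a pair; skip them.
        if (unit < 0xDC00 || unit > 0xDFFF) {
            ++count;
        }
    }
    return count;
}
```

For plain ASCII text both functions agree, while a character outside the Basic Multilingual Plane, such as U+1D11E, occupies two units but counts as one character.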

Case sensitivity

Even though some HTML processors may expect normalization to uppercase, the DOM performs string matching in a purely case-sensitive way. Nevertheless, the recommendation acknowledges that some normalizations may occur before DOM construction.
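In practice this means that a lookup by tag name only finds names spelled exactly as requested. The helper below is a hypothetical illustration over plain strings and does not use any DOM library.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: counts the tag names that match `wanted` exactly.
// No case folding is applied, mirroring the DOM's case-sensitive matching.
std::size_t countExactMatches(const std::vector<std::string>& tagNames,
                              const std::string& wanted) {
    std::size_t matches = 0;
    for (const std::string& name : tagNames) {
        if (name == wanted) {
            ++matches;
        }
    }
    return matches;
}
```

Given the names title, Title, TITLE, and title, a search for "title" matches only the two lowercase entries.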

The basic structure of an XML DOM-based module

In contrast to the role of the application during event-based parsing, the focus of the activity in a DOM-based application is post-parsing. The basic layout of an XML module using DOM comprises three parts: the DOM parser, a DOM manipulator, and the rest of the application.

Here the main XML application class is no longer a handler but a manipulator of the DOM representation produced. Note that this is an important change of focus: your application's processing is no longer done at parsing time but rather in an extra module that will manipulate nodes.

The basic steps for the creation of your XML module would be:

  • Create and register handlers for errors and other implementation-dependent activities.

  • Create a DOM manipulator that will have the responsibility of (1) issuing parsing requests to the parser and (2) manipulating the results of such requests.

  • Implement the necessary interaction between the manipulator and the rest of the application.
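The three steps can be sketched as a skeleton. Every name below (ErrorCollector, StubParser, Document, DomManipulator, describe) is hypothetical, and StubParser is a deliberately trivial stand-in that only accepts input of the form <name/>; a real module would wrap the parser and error-handler interfaces of your toolkit instead.

```cpp
#include <string>
#include <vector>

// Step 1: an error handler that the parser reports to.
class ErrorCollector {
public:
    void error(const std::string& message) { messages_.push_back(message); }
    bool sawErrors() const { return !messages_.empty(); }
private:
    std::vector<std::string> messages_;
};

// Toy result of a parse: only the root element name is kept.
struct Document {
    std::string rootName;
};

// Stand-in for a DOM parser. It accepts only "<name/>".
class StubParser {
public:
    explicit StubParser(ErrorCollector& errors) : errors_(errors) {}

    Document parse(const std::string& text) {
        Document doc;
        const bool wellFormed = text.size() >= 4 && text.front() == '<' &&
                                text.compare(text.size() - 2, 2, "/>") == 0;
        if (!wellFormed) {
            errors_.error("not a well-formed empty element: " + text);
            return doc;
        }
        doc.rootName = text.substr(1, text.size() - 3);
        return doc;
    }

private:
    ErrorCollector& errors_;
};

// Step 2: the manipulator issues parse requests and works on the results.
class DomManipulator {
public:
    // Step 3: the single entry point that the rest of the application calls.
    bool describe(const std::string& text, std::string& out) {
        ErrorCollector errors;
        StubParser parser(errors);
        Document doc = parser.parse(text);
        if (errors.sawErrors()) {
            return false;
        }
        out = "root element: " + doc.rootName;
        return true;
    }
};
```

The point of the layout is that the rest of the application talks only to the manipulator: it never sees the parser or the error handler.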

In order to complete the picture and give you a real idea of how all this fits together, the best thing to do is a complete walk-through of an example DOM-based implementation. Again, we consider an XML "pretty printer" as an example. A full example can be found in the DOMPrint sample distributed with IBM's xml4c2.
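To give a feeling for what a pretty printer does once the tree exists, here is a minimal sketch over a toy node type. The Node struct and prettyPrint function are hypothetical and far simpler than the DOMPrint sample: there are no attributes, comments, or escaping, and an empty tag marks a text node.

```cpp
#include <cstddef>
#include <memory>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Hypothetical toy node; an empty `tag` marks a text node.
struct Node {
    std::string tag;
    std::string text;
    std::vector<std::unique_ptr<Node>> children;
};

// Recursively writes the tree, indenting two spaces per nesting level.
void prettyPrint(const Node& node, std::ostream& out, int depth = 0) {
    const std::string indent(static_cast<std::size_t>(depth) * 2, ' ');
    if (node.tag.empty()) {
        out << indent << node.text << '\n';
        return;
    }
    out << indent << '<' << node.tag << ">\n";
    for (const auto& child : node.children) {
        prettyPrint(*child, out, depth + 1);
    }
    out << indent << "</" << node.tag << ">\n";
}
```

All the work happens after parsing, by walking the finished tree, which is the change of focus described above.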

Note that for bigger projects, the correct encapsulation of the DOM processing activities in the above scheme is meant to keep your design clean and your program manageable.

The translation of the DOMPrint example to the DOM manipulator/DOM Parser/Rest of the application scheme is left as a simple exercise for the reader. In this case, all it requires is to take the methods of the original and encapsulate them in a class. So the main function can be something like:

#include "DOMBeautifier.h"  // assumed to pull in the xml4c2 headers and to define ERRORS

// --------------------------------------------------------------
//  Main - very simplified version -- for guidance purposes only.
//  Note that main no longer takes responsibility for creating
//  the parser or printing the file. Try encapsulating that
//  in the DOMBeautifier class yourself, based on the original
//  DOMPrint (it's very easy and will give you a better feeling
//  for the library).
// --------------------------------------------------------------

int main(int argC, char* argV[])
{
    // Check command line: the only parameter must be the file name.
    if (argC != 2)
        return 1;
    char* xmlFile = argV[1];  // The name of the file to parse

    // Now initialize the XML4C2 platform.
    try {
        XMLPlatformUtils::Initialize();
    }
    catch (const XMLException&) {
        return 1;
    }

    // A stack object is destroyed on every return path,
    // so the error branch below no longer leaks it.
    DOMBeautifier myBeautifier;

    if (myBeautifier.Beautify(xmlFile) == ERRORS) {
        return 2;
    }

    return 0;
}

Now that we have seen both the event-based and DOM approaches to C++ XML document handling, we will examine the different considerations that will help you decide between each approach.