XML.com: XML From the Inside Out

XML Programming with C++

November 17, 1999

C++ is a popular programming language for which many XML-related efforts already exist. The aim of this article is to introduce and analyze the different options available when using C++ for your XML applications.

We will examine two things: the main APIs and strategies for parsing and manipulating XML in your C++ application, and the practical uses and tradeoffs of approaches to XML parsing.

To get the most from this article, a basic understanding of the C++ language is required. Static model diagrams are illustrated in UML: the diagrams used show mainly inheritance and simple relationships and may not require previous UML knowledge. Nevertheless, we provide a basic UML guide containing all you need to know in order to understand the examples.

Contents

Part 1: XML Programming with C++
Part 2: The SAX-based approach
Part 3: The Object Model-based approach
Part 4: Uses and Tradeoffs

Different approaches to XML processing

Several toolkits and libraries have been produced for C++-based XML manipulation. These toolkits mainly fall into two categories: event-driven processors and object model construction processors. We will examine both.

Event-driven approaches

In an event-driven approach to processing XML data, a parser reads the data and notifies specialized handlers that undertake the desired actions. Note that here the term event-driven means a process that calls specific handlers as the contents of the XML document are encountered. For instance, the parser calls endDocument() when it reaches the end of the XML document.

The various XML parser implementations differ in their application program interfaces. For example, one parser could notify a handler of the start of an element, passing it only the name of the element and then requiring another call for the handling of attributes. Another parser could notify a handler when it finds the same start-element tag and pass it not only the name of the element, but a list of the attributes and values of that element.

Another important difference between XML parsers is the representation they use to pass data from the parser to the application: for example, one parser could use an STL list of strings, while another could use a specially made class to hold attributes and values. The methods for handling a start-element tag with each approach would be very different and would certainly affect how you program them.

// Example: different ways of communicating data to handlers

// STL-based attribute passing.
// With an STL-based event-driven handler, a startElementHandler
// method might look like this (declared inside a hypothetical
// class HypotheticalHandler):
virtual void startElementHandler
(const String& name, const list<String>& attributes) = 0;

// Special attribute list class.
// Some event-driven APIs pass a special AttributeList object
// containing the attribute information. This is the case with
// IBM's xml4c2 parser (declared inside class DocumentHandler):
virtual void startElement(const XMLCh* const name,
    AttributeList& attrs) = 0;

As you can see, the way processors notify applications about elements, attributes, character data, processing instructions and entities is parser-specific and can greatly influence the programming style behind the XML-related modules of your system.

Efforts to create a standard event-driven XML processing API have produced SAX (the Simple API for XML). A standard interface for SAX in C++ has not yet been developed. Nevertheless, the importance and growing use of SAX in C++ XML-based applications is unquestionable, and makes it an important topic in our discussion.

In the next two sections, we examine the ideas behind both non-SAX and SAX-based event-driven approaches to parsing. For our examples, we will be using expatpp (C++ wrapper of James Clark's expat parser) and xml4c2 (IBM's C++ XML parser), respectively. IBM's parser will be re-released at the end of this year as "Xerces," part of the new Apache XML Project.

Non-SAX event-driven approaches

Expat is a C parser developed and maintained by James Clark. It is event-driven, in the sense that it calls handlers as parts of the document are encountered by the parser. User-defined functions can be registered as handlers.

Here is a sample of typical expat usage in C:

/* 
This is a simple demonstration of how to use expat. This program
reads an XML document from standard input and writes a line 
with the name of each element to standard output, indenting 
child elements by one tab stop more than their parent element. 
[Taken from the standard expat distribution] */

#include <stdio.h>
#include "xmlparse.h"

void startElement
(void *userData, const char *name, const char **atts)
{
  int i;
  int *depthPtr = userData;
  for (i = 0; i < *depthPtr; i++)
    putchar('\t');
  puts("I found the element:");
  puts(name);
  *depthPtr += 1;
}

void endElement(void *userData, const char *name)
{
  int *depthPtr = userData;
  *depthPtr -= 1;
}

int main()
{
  char buf[BUFSIZ];
  XML_Parser parser = XML_ParserCreate(NULL);
  int done;
  int depth = 0;
  XML_SetUserData(parser, &depth);
  XML_SetElementHandler(parser, startElement, endElement);
  do {
    size_t len = fread(buf, 1, sizeof(buf), stdin);
    done = len < sizeof(buf);
    if (!XML_Parse(parser, buf, len, done)) {
      fprintf(stderr,
       "%s at line %d\n",
       XML_ErrorString(XML_GetErrorCode(parser)),
       XML_GetCurrentLineNumber(parser));
      return 1;
    }
  } while (!done);
  XML_ParserFree(parser);
  return 0;
}
   

This program simply prints the string "I found the element:" followed by the element name for each element found. Note the void *userData parameter, which expat provides so that you can carry your own information across calls. In the previous example, the userData is used to keep track of the nesting depth, which determines the indentation applied when printing to the standard output.

Expat has many advantages: it is very fast and very portable. It is also under the GPL (the GNU General Public License), which means you can freely use and distribute it. But it is just plain C, so some strategy must be chosen in order to integrate it with your OO C++ project.

One strategy would be simply to create global functions to register with expat. Those functions can receive a pointer to the data you want to modify while reading the file (e.g., a Count object that will store the number of characters in the file), and then all you have to do is register them with expat. This is a straightforward approach, but it brings several undesirable consequences into the picture:

  • It decreases the modularity of your program.

  • It makes your program less cohesive (i.e., related methods are not bundled together).

  • It may ruin your OO design on a fundamental level (e.g., XML serialization of your objects).

All of the above will probably result in a less maintainable program with an error-prone design. A better option would be to wrap expat using a C++ class that will encapsulate the C details and provide you with a clean list of methods that you can override to suit your particular needs. This is how wrappers like expatpp work.

Expatpp is a C++ wrapper for expat. It was developed by Andy Dent with this basic idea: the constructor of expatpp creates an instance of an expat parser and registers static callback functions as handlers; each callback simply calls the corresponding overridable expatpp method.

Some code will make things clearer:

// In its class definition, expatpp declares the callbacks
// that it will register with the parser:
static void startElementCallback(void *userData,
    const XML_Char* name, const XML_Char** atts);
static void endElementCallback(void *userData, const XML_Char* name);
static void charDataCallback(void *userData, const XML_Char* s,
    int len);
//... and so on for the other handlers, like
//    processingInstructionCallback

// In the constructor, expatpp creates an expat parser and registers
// the callbacks
expatpp::expatpp()
{
    mParser = XML_ParserCreate(0);
    XML_SetUserData(mParser, this); // Note that the user data is
                                    // the object itself
    XML_SetElementHandler(mParser, startElementCallback,
                          endElementCallback);
    XML_SetCharacterDataHandler(mParser, charDataCallback);
    //... and so on with the other callbacks
}

// Now, for each callback there is a partner overridable member,
// like
virtual void startElement(const XML_Char* name,
                          const XML_Char** atts)
{
    // Note that the default behavior is to do nothing. In your
    // derived class you can override this and, for example, print
    // the name of the element with
    // cout << name;
}

// All a callback does is call its partner method
inline void
expatpp::startElementCallback(void *userData, const XML_Char* name,
                              const XML_Char** atts)
{
    ((expatpp*)userData)->startElement(name, atts);
}

// At run time, when the parser begins calling the callback, the
// appropriate method overridden in your derived class will be
// called. For the complete code look at expatpp.[h|c]

As you can see, the userData is used to maintain a pointer to your expatpp object. When the object is constructed, the callbacks are registered as handlers to expat. When parsing events occur, the handlers call the appropriate methods in the class. The default behavior of these methods is to do nothing, but you can override them for your own purposes.

expatpp interface

The expatpp interface defines wrappers for all the functions in expat and includes the following members:

 
virtual void startElement(const XML_Char* name, const XML_Char** atts);
virtual void endElement(const XML_Char* name);
virtual void charData(const XML_Char* s, int len);
virtual void processingInstruction(const XML_Char* target,
    const XML_Char* data);
virtual void defaultHandler(const XML_Char* s, int len);
virtual void unparsedEntityDecl(const XML_Char* entityName,
    const XML_Char* base, const XML_Char* systemId,
    const XML_Char* publicId, const XML_Char* notationName);
virtual void notationDecl(const XML_Char* notationName,
    const XML_Char* base, const XML_Char* systemId,
    const XML_Char* publicId);

// XML interfaces
int XMLPARSEAPI XML_Parse(const char* s, int len, int isFinal);
XML_Error XMLPARSEAPI XML_GetErrorCode();
int XMLPARSEAPI XML_GetCurrentLineNumber();
     

This interface defines a handler base for expatpp (look at the source code for details). Along with the example included below, it should be enough to get you started with expatpp in an XML project.

expatpp sample

The following example uses expatpp to create a tree view of the elements of the document. Unlike the rest of the examples in this article, this particular program was built with Inprise C++Builder, and depends on it.

   
// We declare a handler that will be capable of
// constructing the tree. In order to do so it will override
// the startElement and endElement methods (the others can
// be ignored)

// [from myParser.h]
class myParser : public expatpp
{
private:
    // The tree view in which the elements will be shown
    TTreeView *mTreeView;
    // The node of the element currently being parsed
    TTreeNode *lastNode;
public:
    inline myParser(TTreeView *treeToUse);
    inline void startElement(const XML_Char* name,
                             const XML_Char** atts);
    inline void endElement(const XML_Char* name);
};
  
// Now, for the implementation, all we have to do is
inline
void myParser::startElement(const XML_Char* name,
                            const XML_Char** atts)
{
    // Add the new element as a child of the current node
    lastNode = mTreeView->Items->AddChild(lastNode, name);
}

// and
inline
void myParser::endElement(const XML_Char* name)
{
    // Step back up to the parent of the element just closed
    lastNode = lastNode->Parent;
}

For the complete, and more thoroughly documented, code of the expatpp sample, download this file: expatppExample.zip. When given an XML file, the application displays the elements of the document as a tree of nested nodes.

The same approach can be, and has been, used for wrapping other C parsers for use in C++ code. These other parsers include, most notably, the Gnome Project libxml parser by Daniel Veillard.

The next section will cover the Simple API for XML—SAX.
