XML Programming with C++
November 17, 1999
C++ is a popular programming language for which many XML-related
efforts already exist. This article introduces and
analyzes the options available when using
C++ for your XML applications.
We will examine two things: the main APIs and strategies for
parsing and manipulating XML in your C++ application, and
the practical uses and tradeoffs of approaches to XML parsing.
To get the most from this article, a basic
understanding of the C++ language is required. Static model diagrams are
illustrated in UML: the diagrams used show mainly inheritance and
simple relationships and may not require previous UML
knowledge. Nevertheless, we provide a basic UML guide
containing all you
need to know in order to understand the
examples.
Several toolkits and libraries have been produced for C++-based XML manipulation. These
toolkits fall mainly into two categories: event-driven processors and object-model construction
processors. We will examine both.
In an event-driven approach to processing XML data, a parser reads the data and notifies specialized handlers, which undertake the
desired actions. Here, the term event-driven means that
the parser calls specific handlers as it encounters the contents of the
XML document: for instance, it calls endDocument() when it reaches the end of the
document.
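The notification pattern can be sketched in a few lines of C++. This is only an illustration under stated assumptions: the EventHandler and RecordingHandler classes and the emitSampleEvents function are hypothetical names invented here, and emitSampleEvents merely stands in for a parser by issuing the calls a real parser might make for the document <a><b/></a>.

```cpp
#include <string>
#include <vector>

// Hypothetical handler interface: one method per kind of parsing event.
// The default implementations do nothing.
class EventHandler {
public:
    virtual ~EventHandler() {}
    virtual void startDocument() {}
    virtual void startElement(const std::string&) {}
    virtual void endElement(const std::string&) {}
    virtual void endDocument() {}
};

// A handler that records every notification it receives, in order.
class RecordingHandler : public EventHandler {
public:
    std::vector<std::string> log;
    void startDocument() override { log.push_back("startDocument"); }
    void startElement(const std::string& name) override { log.push_back("start:" + name); }
    void endElement(const std::string& name) override { log.push_back("end:" + name); }
    void endDocument() override { log.push_back("endDocument"); }
};

// Stand-in for a parser: emits the events a parser might produce
// for <a><b/></a>. No XML is actually parsed here.
void emitSampleEvents(EventHandler& handler) {
    handler.startDocument();
    handler.startElement("a");
    handler.startElement("b");
    handler.endElement("b");
    handler.endElement("a");
    handler.endDocument();
}
```

The application never asks the parser for data. It only supplies a handler object and reacts to whatever calls arrive, in document order.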
The various XML parser implementations differ in their application
program interfaces. For example, one parser could
notify a handler of the start of an element, passing it only the name
of the element and then requiring another call for the handling of
attributes. Another parser could notify a handler when it finds
the same start-element tag and pass it not only the name of the
element, but a list of the attributes and values of that
element. Another important difference between XML parsers is the representation
they use to pass data from the parser to the application:
for example, one parser could use an STL list of strings, while another
could use a specially made class to hold attributes and values. The
methods for handling a start-element tag under each
approach would be very different, and would certainly affect how
you write your handlers.
// Example: different ways of communicating data to handlers
// STL based attribute passing
// with an STL based event-driven handler a startElementHandler
// method might look like this
virtual void HypotheticalHandler::startElementHandler
(const String name,const list<String> attributes) = 0;
// Special Attribute List Class provided
// With some event-driven APIs a special AttributeList object
// containing attribute information
// is used. This is the case with IBM's xml4c2 parser.
virtual void DocumentHandler::startElement(const XMLCh* const name,
AttributeList& attrs) = 0;
As you can see, the way processors notify applications about
elements, attributes, character data, processing instructions and
entities is parser-specific and can greatly influence the
programming style behind the XML-related modules of your
system. Efforts to create a standard event-driven XML processing API
have produced SAX (the Simple API for XML).
A standard interface for SAX in C++ has not yet been
developed. Nevertheless, the importance and growing use
of SAX in C++ XML based applications is unquestionable, and makes it
an important topic in our discussion.
In the next two sections, we examine the ideas behind both non-SAX
and SAX-based event-driven approaches to parsing. For our examples, we
will be using expatpp (C++ wrapper of James Clark's expat parser) and
xml4c2 (IBM's C++ XML parser), respectively. IBM's parser will be
re-released at the end of this year as "Xerces," part of the new Apache XML Project.
Expat is a C parser developed and maintained by James
Clark. It is event-driven, in the sense that it calls handlers
as parts of the document are encountered by the parser.
User-defined functions can
be registered as handlers.
Here is a sample of a typical expat use in C:
/*
This is a simple demonstration of how to use expat. This program
reads an XML document from standard input and writes a line
with the name of each element to standard output, indenting
child elements by one tab stop more than their parent element.
[Taken from the standard expat distribution] */
#include <stdio.h>
#include "xmlparse.h"
void startElement
(void *userData, const char *name, const char **atts)
{
int i;
int *depthPtr = userData;
for (i = 0; i < *depthPtr; i++)
putchar('\t');
puts("I found the element:");
puts(name);
*depthPtr += 1;
}
void endElement(void *userData, const char *name)
{
int *depthPtr = userData;
*depthPtr -= 1;
}
int main()
{
char buf[BUFSIZ];
XML_Parser parser = XML_ParserCreate(NULL);
int done;
int depth = 0;
XML_SetUserData(parser, &depth);
XML_SetElementHandler(parser, startElement, endElement);
do {
size_t len = fread(buf, 1, sizeof(buf), stdin);
done = len < sizeof(buf);
if (!XML_Parse(parser, buf, len, done)) {
fprintf(stderr,
"%s at line %d\n",
XML_ErrorString(XML_GetErrorCode(parser)),
XML_GetCurrentLineNumber(parser));
return 1;
}
} while (!done);
XML_ParserFree(parser);
return 0;
}
This program simply prints the string "I
found the element:" followed by the element name for each
element found. Note the void *userData parameter,
which expat passes to every handler so that you can maintain your own
information across calls. In the previous example, userData
keeps track of the indentation level to use
when printing element names to standard output.
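The startElement handler above ignores its atts argument. Expat passes attributes as an array of alternating names and values, terminated by a null pointer. A minimal sketch of how a handler might copy them into an STL container follows. The helper name collectAttributes is hypothetical, and plain char is assumed in place of XML_Char, as in the listing above.

```cpp
#include <map>
#include <string>

// Hypothetical helper: copies expat-style attributes into a std::map.
// atts is expected to hold name, value, name, value, ..., then a null pointer.
std::map<std::string, std::string> collectAttributes(const char** atts) {
    std::map<std::string, std::string> result;
    if (atts == nullptr) return result;
    // Step through the array two entries at a time: name, then value.
    for (int i = 0; atts[i] != nullptr && atts[i + 1] != nullptr; i += 2) {
        result[atts[i]] = atts[i + 1];
    }
    return result;
}
```

A handler could call collectAttributes(atts) on entry and then look attributes up by name, at the cost of one copy per element.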
Expat has many advantages: it is very fast and very portable. It is also available under the GPL
(the GNU General Public License), which means you can freely use and distribute it. But it is plain C, so
you must choose a strategy for integrating it with your object-oriented C++ project.
One strategy is simply to create global functions and register them with expat. Those functions can receive
a pointer to the data you want to modify while reading the file
(e.g., a Count object that will store the number of characters in
the file). This is a straightforward approach, but
it has several undesirable consequences:
It decreases the modularity of your program.
It makes your program less cohesive (i.e., related methods are not bundled together).
It may ruin your OO design on a fundamental level (e.g., XML serialization of your objects).
All of the above will probably result in a less maintainable program with an error-prone design.
A better option is to wrap expat in a C++ class that encapsulates the C details and
provides a clean set of methods that you can
override to suit your particular needs. This is how
wrappers like expatpp work.
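Before turning to wrappers, the global-function strategy can be made concrete with a short sketch. The Count structure and the countCharacters function are hypothetical names. The function has the shape of an expat character-data handler, with plain char assumed in place of XML_Char.

```cpp
#include <cstddef>

// Hypothetical state object shared between the application and the callback.
struct Count {
    std::size_t characters = 0;
};

// Global function with the shape of an expat character-data handler.
// userData is whatever pointer the application registered; s is NOT
// null-terminated, so only len says how much text arrived.
void countCharacters(void* userData, const char* /*s*/, int len) {
    Count* count = static_cast<Count*>(userData);
    if (len > 0) count->characters += static_cast<std::size_t>(len);
}
```

With expat, such a function would be registered through XML_SetUserData(parser, &count) and XML_SetCharacterDataHandler(parser, countCharacters). The state and the logic that updates it live in unrelated places, connected only by an untyped pointer, which is the loss of cohesion described above.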
Expatpp is a C++ wrapper for expat. It was developed by
Andy Dent with this basic idea: the constructor of expatpp
creates an instance of an expat parser and registers static callback functions as handlers; each callback simply calls the corresponding overridable expatpp method.
Some code will make things clearer:
// In its class definition, expatpp declares the callbacks
// that it will register with the parser:
static void startElementCallback
(void *userData, const XML_Char*
name, const XML_Char** atts);
static void endElementCallback
(void *userData, const XML_Char* name);
static void charDataCallback(void *userData, const XML_Char* s,
int len);
//... and so on for the other handlers like
// processingInstructionCallback
// In the constructor, expatpp creates an expat parser and registers
// the callbacks
expatpp::expatpp()
{
mParser = XML_ParserCreate(0);
XML_SetUserData(mParser, this); //Note that the user data is
//the object itself
XML_SetElementHandler(mParser, startElementCallback,
endElementCallback);
XML_SetCharacterDataHandler(mParser, charDataCallback);
//... and so on with the other callbacks
}
// Now, for each callback there is a partner override-able member
// like
virtual void startElement(const XML_Char* name, const
XML_Char** atts){
// Note that the default behavior is to do nothing. In your
// derived class you can override this and for example print
// the name of the element like
// cout << name;
}
// All a callback does is to call its partner method
inline void
expatpp::startElementCallback
(void *userData, const XML_Char* name,
const XML_Char** atts)
{
((expatpp*)userData)->startElement(name, atts);
}
// At run time, when the parser invokes the callback, the
// appropriate method overridden in your derived class will be
// called. For the complete code look at expatpp.[h|c]
As you can see, the userData is used to hold a pointer
to your expatpp object. When the object is
constructed, the callbacks are registered as handlers with expat.
When parsing events occur,
the handlers call the appropriate methods of the class. The default behavior of
these methods is to do nothing, but you can override them for your
own purposes. The expatpp interface defines wrappers for all the functions
in expat and includes the following members:
virtual void startElement
(const XML_Char* name, const XML_Char** atts);
virtual void endElement(const XML_Char* name);
virtual void charData(const XML_Char *s, int len);
virtual void processingInstruction(const XML_Char* target, const
XML_Char* data);
virtual void defaultHandler(const XML_Char *s, int len);
virtual void unparsedEntityDecl(const XML_Char *entityName, const
XML_Char* base, const XML_Char* systemId,
const XML_Char* publicId,
const XML_Char* notationName);
virtual void notationDecl(const XML_Char* notationName, const
XML_Char* base, const XML_Char* systemId,
const XML_Char* publicId);
// XML interfaces
int XMLPARSEAPI XML_Parse
(const char *s, int len, int isFinal);
XML_Error XMLPARSEAPI XML_GetErrorCode();
int XMLPARSEAPI XML_GetCurrentLineNumber();
This interface defines a handler base for expatpp (see the
source code for details). Along with the
example below, it
should be enough to get you started with expatpp in an XML project.
The following example uses expatpp to create a tree view of the elements of the document. Unlike the rest of the examples in this article, this particular
program was built with Inprise C++ Builder and depends on it.
// We declare a handler that will be capable of
// constructing the tree. In order to do so it will override
// the startElement and endElement methods (the others can
// be ignored)
// [from myParser.h]
class myParser : public expatpp
{
private:
TTreeView *mTreeView;
// The tree view in which the elements will be shown
TTreeNode *lastNode;
public:
inline myParser(TTreeView *treeToUse);
inline void startElement
(const XML_Char* name, const XML_Char** atts);
inline void endElement(const XML_Char* name);
};
// Now, for the implementation, all we have to do is
inline
void myParser::startElement
(const XML_Char* name, const XML_Char** atts)
{
lastNode = mTreeView->Items->AddChild(lastNode, name);
}
// and
inline
void myParser::endElement(const XML_Char* name)
{
lastNode = lastNode->Parent;
}
For the complete, and more verbosely documented, code,
download this file: expatppExample.zip.
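For readers without C++ Builder, the same depth bookkeeping can be sketched portably by building an indented text outline instead of a tree view. This is an illustration only: HandlerBase is a hypothetical stand-in for the expatpp base class, reduced to the two methods used here and with char in place of XML_Char, and OutlineBuilder is likewise an invented name.

```cpp
#include <string>

// Hypothetical stand-in for the expatpp base class: only the two
// overridable methods used here, with char in place of XML_Char.
class HandlerBase {
public:
    virtual ~HandlerBase() {}
    virtual void startElement(const char* /*name*/, const char** /*atts*/) {}
    virtual void endElement(const char* /*name*/) {}
};

// Builds an indented outline of element names, one per line.
class OutlineBuilder : public HandlerBase {
public:
    void startElement(const char* name, const char** /*atts*/) override {
        outline_.append(depth_, '\t');   // one tab per nesting level
        outline_ += name;
        outline_ += '\n';
        ++depth_;                        // children go one level deeper
    }
    void endElement(const char* /*name*/) override {
        if (depth_ > 0) --depth_;        // step back to the parent level
    }
    const std::string& outline() const { return outline_; }
private:
    std::string outline_;
    std::string::size_type depth_ = 0;
};
```

The depth counter plays the role that lastNode plays in myParser: startElement descends one level and endElement returns to the parent.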
When given an XML file, the C++ Builder application displays the document's
element hierarchy in the tree view, with each element shown as a child of its parent element.
The same approach can be, and has been, used for wrapping
other C parsers for use in C++ code. These
other parsers include, most notably, the Gnome Project libxml
parser by Daniel Veillard.
The next section will cover the Simple API for XML (SAX).