The SAX-based approach

November 17, 1999

SAX event-driven approaches

Contents

Part 1: XML Programming with C++
Part 2: The SAX-based approach
Part 3: The Object Model-based approach
Part 4: Uses and Tradeoffs

The Simple API for XML (SAX) is an event-driven API for parsing XML documents. It defines several handler classes that encapsulate the methods needed for specific parsing tasks, such as external entity handling. As with other event-driven parsers, adding an XML module to your project involves the following steps:

  • Subclass the required handler base classes. (In the previous section you did so only from expatpp; now you have more classes available for subclassing, which we will explore below.)

  • Override the desired methods.

  • Register your handler with the parser.

  • Start parsing.
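The four steps above can be sketched without any real parser. In the minimal example below, ToyHandlerBase, ToyParser, TraceHandler and runTrace are hypothetical stand-ins invented for illustration (they are not part of xml4c2 or of SAX), and the "parser" simply replays a fixed sequence of events instead of reading a file.

```cpp
#include <sstream>
#include <string>

// Hypothetical stand-in for a SAX-style handler base class (not the xml4c2 API).
// Every method has a do-nothing default, so subclasses override only what they need.
class ToyHandlerBase {
public:
    virtual ~ToyHandlerBase() = default;
    virtual void startElement(const std::string& /*name*/) {}
    virtual void endElement(const std::string& /*name*/) {}
};

// Hypothetical stand-in for a parser: it replays the events for <a><b></b></a>.
class ToyParser {
public:
    // Step 3: register the handler with the parser.
    void setDocumentHandler(ToyHandlerBase* handler) { handler_ = handler; }

    // Step 4: start parsing; events are pushed to the registered handler.
    void parse() {
        if (!handler_) return;
        handler_->startElement("a");
        handler_->startElement("b");
        handler_->endElement("b");
        handler_->endElement("a");
    }

private:
    ToyHandlerBase* handler_ = nullptr;
};

// Steps 1 and 2: subclass the base class and override the desired methods.
class TraceHandler : public ToyHandlerBase {
public:
    void startElement(const std::string& name) override { out_ << "<" << name << ">"; }
    void endElement(const std::string& name) override { out_ << "</" << name << ">"; }
    std::string text() const { return out_.str(); }

private:
    std::ostringstream out_;
};

// Ties the four steps together and returns what the handler recorded.
std::string runTrace() {
    TraceHandler handler;
    ToyParser parser;
    parser.setDocumentHandler(&handler);
    parser.parse();
    return handler.text();
}
```

Calling runTrace() returns the tags in the order the events arrived. A real SAX parser plays the role of ToyParser, and the handler code stays structurally the same.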

These steps can be seen in the following example, which prints an XML file while keeping track of the correct indentation. The interfaces used are explained further below. You can also look at the SAXCount example from IBM's xml4c2 documentation.


// We declare a handler of our own that remembers the current
// indentation level in order to "pretty" print the file. To do so
// we override the startElement, endElement, characters and
// processingInstruction handlers.
//
// Take time to compare this solution with the expatpp solution to
// the same problem (in the previous section).

void PrettyPrint::startElement(const XMLCh* const name,
                               AttributeList& attributes)
{
  // A new element started: indent it one level further
  // than the current level.
  indent++;
  for (int i = 0; i < indent; i++)
    outStrm << "\t";
  outStrm << "<" << name;

  // Write each attribute as name="value".
  unsigned int len = attributes.getLength();
  for (unsigned int j = 0; j < len; j++)
    {
      outStrm << " " << attributes.getName(j)
              << "=\"" << attributes.getValue(j) << "\"";
    }
  outStrm << ">";
}

void PrettyPrint::endElement(const XMLCh* const name)
{
  // Close the element at its own level, then step back out.
  for (int i = 0; i < indent; i++)
    outStrm << "\t";
  outStrm << "</" << name << ">";
  indent--;
}

void PrettyPrint::characters(const XMLCh* const chars,
                             const unsigned int length)
{
  // Replace markup characters with the predefined entity
  // references so that the output remains well-formed.
  for (unsigned int index = 0; index < length; index++)
    {
      switch (chars[index])
        {
        case chAmpersand :
          outStrm << "&amp;";
          break;

        case chOpenAngle :
          outStrm << "&lt;";
          break;

        case chCloseAngle :
          outStrm << "&gt;";
          break;

        case chDoubleQuote :
          outStrm << "&quot;";
          break;

        default:
          outStrm << chars[index];
          break;
        }
    }
}

void PrettyPrint::processingInstruction(const XMLCh* const target,
                                        const XMLCh* const data)
{
  for (int i = 0; i < indent; i++)
    outStrm << "\t";
  outStrm << "<?" << target;
  if (data)
    outStrm << " " << data;
  outStrm << "?>\n";
}

Download the full source here: saxExample.zip.
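The escaping performed by the characters handler above can be tried in isolation. The following is a minimal sketch that uses std::string instead of XMLCh so that it compiles without xml4c2; escapeXml is a hypothetical helper name, not part of the downloadable example.

```cpp
#include <string>

// Hypothetical helper: replaces the four markup characters handled in
// PrettyPrint::characters with the predefined XML entity references.
std::string escapeXml(const std::string& in) {
    std::string out;
    out.reserve(in.size());
    for (char c : in) {
        switch (c) {
            case '&':  out += "&amp;";  break;
            case '<':  out += "&lt;";   break;
            case '>':  out += "&gt;";   break;
            case '"':  out += "&quot;"; break;
            default:   out += c;        break;
        }
    }
    return out;
}
```

The ampersand must be replaced like any other character in a single pass, as here; running a second pass over already escaped text would escape the entity references again.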

What makes SAX important is not the idea behind the parsing—the essence of the event-driven approach is the same as with expatpp or any other event-oriented parser—but the standardization of the interfaces and classes that are used to communicate with the application during the parsing process.

These classes and interfaces (abstract classes in C++) are divided thus:

  • Classes implemented by the parser: Parser, AttributeList, Locator (optional class used to track the location of an event)

  • Classes implemented by the application: DocumentHandler (very important—this is the one you will subclass in nearly all applications), ErrorHandler, DTDHandler, EntityResolver.

  • Standard SAX classes: InputSource, SAXException, SAXParseException and HandlerBase. (HandlerBase might be your starting point in many applications, since it inherits from all the handlers and provides default behavior for all non-overridden methods.)
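The role of HandlerBase described in the list above can be illustrated with simplified stand-ins. In this sketch DocEvents, ErrorEvents, DefaultHandler, ErrorCollector and demoDefaults are all hypothetical names, and the method signatures are reduced to std::string for brevity; the real SAX classes have more methods and different parameter types.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-ins for two handler interfaces.
struct DocEvents {
    virtual ~DocEvents() = default;
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
};

struct ErrorEvents {
    virtual ~ErrorEvents() = default;
    virtual void warning(const std::string& message) = 0;
    virtual void error(const std::string& message) = 0;
};

// Plays the role of HandlerBase: it inherits from every handler
// interface and implements each method as a no-op.
class DefaultHandler : public DocEvents, public ErrorEvents {
public:
    void startElement(const std::string&) override {}
    void endElement(const std::string&) override {}
    void warning(const std::string&) override {}
    void error(const std::string&) override {}
};

// An application class overrides only the events it cares about.
class ErrorCollector : public DefaultHandler {
public:
    void error(const std::string& message) override { errors.push_back(message); }
    std::vector<std::string> errors;
};

// Drives the handler through both interfaces and reports how many
// errors were recorded.
std::size_t demoDefaults() {
    ErrorCollector handler;
    DocEvents& doc = handler;    // the object seen as a document handler
    ErrorEvents& err = handler;  // the same object seen as an error handler
    doc.startElement("a");       // default behavior: ignored
    err.warning("ignored");      // default behavior: ignored
    err.error("unexpected end tag");
    doc.endElement("a");         // default behavior: ignored
    return handler.errors.size();
}
```

Only the overridden error method does any work here; every other event falls through to the empty defaults, which is the convenience a base class of this kind is meant to provide.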

SAX was initially developed for Java, but it has been ported to other languages like Python, Perl and C++. In C++, you have several representations and strategies to choose from when porting the original SAX API. Since there is no common C++ SAX interface, the different implementations might have some small, and not-so-small, differences.

In this article, we'll stick with IBM's xml4c2 SAX implementation. In order to write your own XML modules, you will need to inherit from the application classes of the API and override the methods for the events on which you want to perform special actions.

Here is an overview of the handlers you will inherit from. More complete documentation of them can be found with the xml4c2 distribution.

| Handler | Description |
|---|---|
| DocumentHandler | This is the main interface that SAX applications implement. It defines methods to let the parser inform the application about basic parsing events. In order to use it, the application should use a class that implements DocumentHandler and then register an instance with the parser, which will later feed it with the appropriate events. |
| ErrorHandler | This interface is provided in order to allow the SAX application to implement customized error handling. It is registered using the setErrorHandler method. The parser will then report all errors and warnings through this interface. |
| DTDHandler | Objects of a class that implements the DTDHandler interface receive information about notations and unparsed entities. They are registered using the parser's setDTDHandler method. |
| EntityResolver (less commonly used) | If the application needs to intercept any external entities before their inclusion, it must make use of a class that implements this interface, registering it via the setEntityResolver method. Any external entities (including the external DTD subset and external parameter entities) will be reported through it. |

Note that by making use of the multiple inheritance support of C++, a user-defined handler can implement several of these interfaces at once (e.g., error handling and document handling).

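     // ...
     // Both bases must be public so that the handler pointer can be
     // converted to DocumentHandler* and to ErrorHandler*.
     class MyHandler : public DocumentHandler, public ErrorHandler
     // ...
     parser = new NonValidatingSAXParser;

     MyHandler* handler = new MyHandler();
     parser->setDocumentHandler(handler);
     parser->setErrorHandler(handler);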

 

The XML part of your application will probably take the following form:

If you are familiar with patterns, you will see this is similar to a simple Builder pattern, i.e., we detach the XML responsibility from the client objects and delegate it to a collection of objects (the parser itself and your handlers) that will know how to incrementally construct some product. For a complete description of the Builder pattern, see the book "Design Patterns" by Gamma et al. (the "Gang of Four").

Note that this product can be expressed as another object, a simple return value, or even as some transformation of the attributes of your handler object.
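As a minimal sketch of a handler that incrementally builds a product, the example below accumulates simple statistics while events arrive and hands them back when parsing is done. DocStats, StatsBuilder and buildStats are hypothetical names; the method names mirror SAX, but the signatures are simplified and the events are fed by hand rather than by a parser.

```cpp
#include <algorithm>
#include <string>

// Hypothetical product that is built up event by event.
struct DocStats {
    int elements = 0;  // number of elements seen
    int maxDepth = 0;  // deepest nesting level reached
};

// Hypothetical handler acting as the builder.
class StatsBuilder {
public:
    void startElement(const std::string& /*name*/) {
        ++stats_.elements;
        ++depth_;
        stats_.maxDepth = std::max(stats_.maxDepth, depth_);
    }

    void endElement(const std::string& /*name*/) { --depth_; }

    // The client retrieves the finished product after parsing.
    const DocStats& result() const { return stats_; }

private:
    DocStats stats_;
    int depth_ = 0;
};

// Feeds the events for <a><b/><c><d/></c></a> by hand.
DocStats buildStats() {
    StatsBuilder builder;
    builder.startElement("a");
    builder.startElement("b");
    builder.endElement("b");
    builder.startElement("c");
    builder.startElement("d");
    builder.endElement("d");
    builder.endElement("c");
    builder.endElement("a");
    return builder.result();
}
```

Here the product is a separate object returned by result(), but the same handler could just as well expose a single return value or keep the outcome in its own attributes.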

This concludes the SAX review, and can serve as a starting point for your C++ XML modules. Please review the documentation of your chosen implementation for further examples. IBM's xml4c2 is recommended because of its comprehensive documentation.