RAX: An XML Database API

April 26, 2000

Sean McGrath


Table of Contents

 • Example: Customer Database

 • RAX Implementation

 • SAX Implementation

 • DOM Implementation

 • Conclusion

 • Python Implementation of RAX

XML is finding its way into applications beyond those that traditionally utilize markup languages. In particular, XML is becoming popular as a data interchange notation for database-oriented applications.

The two mainstream XML APIs (SAX and DOM) are highly document-centric. They are designed to handle XML in all its generality, including deeply nested structures, mixed content, external entities, and so on. Although these APIs can be used to process the simple, flat XML that databases typically generate, they are not well-suited to doing so. Moreover, the processing paradigms they employ (event processing and tree processing, respectively) are unfamiliar to most database programmers.

This article presents a simple, record-oriented API for XML (RAX) that is suitable for processing database-style XML. The RAX, SAX, and DOM APIs are contrasted with a worked example. An implementation of RAX in Python that is based on the PYX notation introduced in the Pyxie project is also presented. (See also my introduction to Pyxie published previously on XML.com.)
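
For readers who have not met PYX: it is a line-oriented rendering of parser events in which the first character of each line identifies the event: ( for a start-tag, ) for an end-tag, A for an attribute, and - for character data, with a newline inside data written as the two characters \n. What follows is only a rough sketch, in present-day Python 3 rather than the Python of the listings in this article, of how such lines might be produced from SAX events. The names PyxWriter and to_pyx are hypothetical and are not part of Pyxie or the xmln utility, and the escaping is deliberately simplified to newlines only.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PyxWriter(ContentHandler):
    """Collect SAX events as PYX-style lines (hypothetical helper, not xmln)."""

    def __init__(self):
        super().__init__()
        self.lines = []

    def startElement(self, name, attrs):
        # Start-tag line, followed by one "A" line per attribute
        self.lines.append("(" + name)
        for key in attrs.getNames():
            self.lines.append("A%s %s" % (key, attrs.getValue(key)))

    def endElement(self, name):
        self.lines.append(")" + name)

    def characters(self, content):
        # One event per line, so newlines in data are written as "\n"
        # (simplified: no other escaping is attempted here)
        self.lines.append("-" + content.replace("\n", "\\n"))


def to_pyx(xml_text):
    """Return the PYX-style lines for a string of XML."""
    handler = PyxWriter()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.lines


lines = to_pyx('<Record id="42"><Phone>555-424242</Phone></Record>')
for line in lines:
    print(line)
```

The parser is free to deliver character data in more than one chunk, so a single run of text may appear as several consecutive - lines.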

Example: Customer Database

Let's take the well-worn example of a customer database to illustrate RAX. Here is the DTD for the data:


<!ELEMENT Table (Record)*>
<!ELEMENT Record (ID,Name,Phone,EMail,Address1,Address2,Address3)>
<!ELEMENT ID (#PCDATA)>
<!ELEMENT Name (#PCDATA)>
<!ELEMENT Phone (#PCDATA)>
<!ELEMENT EMail (#PCDATA)>
<!ELEMENT Address1 (#PCDATA)>
<!ELEMENT Address2 (#PCDATA)>
<!ELEMENT Address3 (#PCDATA)>

Here is a sample database dump that conforms to this DTD:


<!DOCTYPE Table SYSTEM "Customer.dtd">
<Table>
 <Record>
   <ID>
    42
   </ID>
   <Name>
     Sean McGrath
   </Name>
   <Phone>
    555-424242
   </Phone>
   <EMail>
    sean@digitome.com
   </EMail>
   <Address1>
    Enniscrone
   </Address1>
   <Address2>
    County Sligo
   </Address2>
   <Address3>
    Ireland
   </Address3>
 </Record>
</Table>

We will take the simple problem of extracting all the phone numbers (Phone elements) as an illustrative example.

We will start with the RAX implementation and then develop SAX and DOM implementations.

RAX Implementation

Here is the RAX implementation in Python. Excluding comments, it runs to a mere eight lines of code:


# Import RAX
from RAX import *

# Open a stream connection to the PYX generated from
# the customers.xml file.
fo = os.popen ("xmln customers.xml")

# Create a RAX object
R = RAX(fo)

# Tell RAX that the element type "Record" is the basic unit of data
R.SetRecord("Record")

# Read a record
rec = R.ReadRecord()

# While we have more records...
while rec:
  # Print the contents of the Phone field
  print "Phone=%s" % rec.GetField("Phone")

  # Read the next record
  rec = R.ReadRecord()

The principal thing to note about the structure of the code is what old-timers like me might refer to as the central read ahead/read replace loop. In general, RAX applications take this basic shape:


rec = R.ReadRecord()
while rec:
  # Do something with the record using rec.GetField() to retrieve
  # field contents
  rec = R.ReadRecord()

It goes without saying that this is a pretty trivial and easily understood control mechanism! The code drives the data reading process, record by record, in a way that is very familiar to database programmers. At any given point, the contents of particular elements making up the record structure are available with simple calls to GetField(), specifying the name of the field required.

Not only is the paradigm trivial, it scales well too. At any given moment, only one record from a potentially multi-gigabyte XML database dump need be in memory. This processing paradigm is a good fit for database-style XML, in which records are typically processed independently of each other.
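
To make the one-record-at-a-time point concrete, here is a minimal sketch of the same read loop written as a generator in present-day Python 3. It is not the RAX module itself: read_records is a hypothetical name, the sketch assumes flat records whose fields hold only text, and it is fed a hand-written list of PYX lines rather than the output of xmln.

```python
def read_records(pyx_lines, record_name):
    """Yield one dict (field name -> text) per record in an iterable of PYX lines."""
    record = None   # None while we are outside a record element
    field = None    # name of the field currently being filled, if any
    for raw in pyx_lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        kind, rest = line[0], line[1:]
        if kind == "(" and rest == record_name:
            # Start of a new record
            record, field = {}, None
        elif record is None:
            # Ignore everything between records
            continue
        elif kind == ")" and rest == record_name:
            # End of the record: hand it to the caller and forget it
            yield record
            record = None
        elif kind == "(":
            field = rest
            record[field] = ""
        elif kind == ")":
            field = None
        elif kind == "-" and field is not None:
            # Drop escaped newlines and surrounding white space
            record[field] += rest.replace("\\n", "").strip()


# Hand-written PYX for a single customer record (illustrative data only)
SAMPLE_PYX = [
    "(Table",
    "(Record",
    "(Name",
    "-Sean McGrath",
    ")Name",
    "(Phone",
    "-\\n",
    "-555-424242",
    "-\\n",
    ")Phone",
    ")Record",
    ")Table",
]

phones = [rec["Phone"] for rec in read_records(SAMPLE_PYX, "Record")]
for phone in phones:
    print("Phone=%s" % phone)
```

Because read_records is a generator, only the record currently being assembled is held in memory. Any iterable of lines will do as input, including a pipe opened on a program that emits PYX.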

Let us now look at SAX- and DOM-based solutions to the same problem. We will use Python for illustrative purposes, but the essential shape of the code would be the same in any OO language (Java, C++, etc.).

SAX Implementation


# Import the SAX modules
from xml.sax import saxexts, saxlib, saxutils
import string

# Create a class to handle document events
class Handler1 (saxlib.DocumentHandler):
  def __init__(self):
    # initialize some storage
    self.Storage = ""
    self.StoreIt = 0

  def startElement(self,Element,Attributes):
    # Get here for every start-tag
    if Element == "Phone":
      # If a Phone element has started, begin data collection
      self.Storage = ""
      self.StoreIt = 1

  def endElement(self,Element):
    # Get here for every end-tag
    if Element == "Phone":
      # If a Phone element has ended, print the collected data
      # and end data collection
      print "Phone=%s" % string.strip (self.Storage)
      self.StoreIt = 0

  def characters(self,Data,start,end):
    if self.StoreIt:
      # If we are collecting data, append the new data
      # to what we already have
      self.Storage = self.Storage + Data[start:end]

# Instantiate our document event handling class
h = Handler1()

# Create a parser object
parser = saxexts.make_parser()

# Tell the parser about the document handler
parser.setDocumentHandler(h)

# Yield control to the parser
parser.parse ("customers.xml")

SAX is an event-oriented API for XML. To use it you need to set up handlers for various events and then yield control to the parser, which will call back to your handlers as required.

To solve the problem in hand we need three handlers:

  • A handler to detect the start of a Phone element
  • A handler to detect the end of a Phone element
  • A handler for character data

When a start-tag for a Phone element comes along, we initialize storage space for it and use a Boolean variable, StoreIt, to record the fact that we are in a data storing mode.

Until such time as the end-tag for a Phone element appears, calls to the character data handler check to see if we are in storing mode. If so, the data is appended to the existing storage. When the end-tag for a Phone element finally appears, the accumulated storage is stripped of any white space and then printed out.

The event-oriented nature of SAX is a paradigm well-suited to dealing with nested and hierarchical XML, such as that commonly found in document-oriented XML applications. However, for record-oriented XML applications that do not use these features of XML, a SAX implementation requires close coupling between a number of event handlers (that is, handlers that change their behavior based on shared Boolean variables and storage buffers). Although not too difficult in this simple application, for more complex applications the number of Boolean variables and intermediate storage buffers required can lead to hard-to-read, hard-to-maintain code.
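
The listing above uses the SAX bindings available when this article was written. As a rough sketch only, here are the same three cooperating handlers expressed with the xml.sax module in today's Python 3 standard library. The class name PhoneHandler, its attribute names, and the inline SAMPLE document are illustrative assumptions; the shared flag and buffer are exactly the coupling described above.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class PhoneHandler(ContentHandler):
    """Collect the text of every Phone element (hypothetical example class)."""

    def __init__(self):
        super().__init__()
        self.buffer = []         # chunks of character data for the current Phone
        self.collecting = False  # shared flag: are we inside a Phone element?
        self.phones = []         # finished phone numbers

    def startElement(self, name, attrs):
        if name == "Phone":
            self.buffer = []
            self.collecting = True

    def endElement(self, name):
        if name == "Phone":
            self.phones.append("".join(self.buffer).strip())
            self.collecting = False

    def characters(self, content):
        # Character data may arrive in several chunks, so accumulate it
        if self.collecting:
            self.buffer.append(content)


# A cut-down version of the customer data, inlined for illustration
SAMPLE = b"""<Table>
 <Record>
   <Name>Sean McGrath</Name>
   <Phone>
    555-424242
   </Phone>
 </Record>
</Table>"""

handler = PhoneHandler()
xml.sax.parseString(SAMPLE, handler)
for phone in handler.phones:
    print("Phone=%s" % phone)
```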

Let us move on to the DOM implementation.

DOM Implementation


# Import the DOM modules
from xml.dom import core
from xml.dom.sax_builder import SaxBuilder

import string

# Use SAX as an event source for the DOM tree building
from xml.sax import saxexts,saxlib

# Create a parser
p = saxexts.make_parser()

# Create a DOM tree builder
dh = SaxBuilder()

# Tell the parser about the document handler (the tree builder)
p.setDocumentHandler(dh)

# Parse the XML
p.parse("customers.xml")
p.close()

# Function to print out Phone elements appearing as children of
# a specified node
def PrintPhone(node):
  value = ""
  children = node.childNodes
  # for each field
  for n in children:
    if n.nodeType == core.ELEMENT_NODE and n.tagName == "Phone":
      for n1 in n.childNodes:
        value = value + n1.nodeValue
  print "Phone=%s" % string.strip(value)

# Grab the document node of the DOM tree
doc = dh.document

# Walk the table element, looking at each child (Record) element
for node in doc.documentElement.childNodes:
  # for each Record element print out the Phone element
  if node.nodeType == core.ELEMENT_NODE:
    PrintPhone(node)



DOM is a tree-oriented API to XML. The code works by first building a tree representation of the XML document. Then it navigates through the tree, looking for element nodes with an element type name of Phone. Once found, the code walks any child nodes accumulating the data in the Phone element. When the list of children is exhausted, the phone number data is printed out.

The tree-oriented paradigm of the DOM is well-suited to handling nested and hierarchical XML such as that commonly found in document-oriented XML applications. It is particularly handy for dealing with complex formatting problems that may require "looking-ahead" into the data yet to be processed in order to work.

Most DOM implementations are memory-bound. That is to say, the entire document is read into memory before processing begins. For database dumps of any substantial size, a memory-bound DOM is at best inefficient and at worst unusable.

A lot of the machinery in the DOM is concerned with allowing full navigation around the tree structure. For record-oriented XML data, this tree structure is very flat and does not gain any great benefit from this functionality.
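
For comparison, the same tree walk can be sketched with xml.dom.minidom from the present-day Python 3 standard library. This is an illustrative rewrite rather than the code used for this article: the function name phones_in and the inline SAMPLE document are assumptions. Note that the whole document is parsed into a tree before the first phone number can be read.

```python
from xml.dom import Node, minidom


def phones_in(xml_bytes):
    """Return the text of each Phone child of each record (hypothetical helper)."""
    # The entire document is built into a tree in memory here
    doc = minidom.parseString(xml_bytes)
    phones = []
    # Walk the children of the root (Table) element
    for record in doc.documentElement.childNodes:
        if record.nodeType != Node.ELEMENT_NODE:
            continue  # skip white-space text nodes between records
        # Walk the fields of this record
        for field in record.childNodes:
            if field.nodeType == Node.ELEMENT_NODE and field.tagName == "Phone":
                text = "".join(
                    n.data for n in field.childNodes
                    if n.nodeType == Node.TEXT_NODE
                )
                phones.append(text.strip())
    return phones


# A cut-down version of the customer data, inlined for illustration
SAMPLE = b"""<Table>
 <Record>
   <Name>Sean McGrath</Name>
   <Phone>
    555-424242
   </Phone>
 </Record>
</Table>"""

result = phones_in(SAMPLE)
for phone in result:
    print("Phone=%s" % phone)
```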

Conclusion

Both the SAX and DOM implementations of the simple record-oriented XML processing example presented in this article are significantly more complex than the RAX implementation. This is not because there is anything intrinsically wrong with SAX or DOM; it is simply a reflection of the fact that they are designed to handle many features of XML typically not used in record-oriented XML applications.

The RAX API provides a simple and scalable alternative when working with record-oriented XML. The next section of this article presents the Python implementation of the RAX API in full. Although I have used Python, RAX can easily be implemented in other languages such as Java and Perl.

Python Implementation of RAX


"""
RAX = Record API for XML

A simple, record-oriented API for XML. Provides a simple, efficient
interface for processing the sort of XML often generated from
databases

Sean McGrath
http://www.digitome.com
"""

import os,sys,string



class Record:
  """
  A record drawn from an XML file is a collection of
  elements accessed by element type name
  """
  def __init__(self,ElementTypeName):
    self.ElementTypeName = ElementTypeName
    self.ElementStack = []
    self.Elements = {}
    self.CurrentChild = None
    self.StartChild (ElementTypeName)

  def GetField(self,ElementTypeName):
    return self.Elements[ElementTypeName]

  def StartChild(self,ElementTypeName):
    """
    Start accumulating data for a new child element
    """
    self.ElementStack.append (self.CurrentChild)
    self.Elements[ElementTypeName] = ""
    self.CurrentChild = ElementTypeName

  def EndChild(self,ElementTypeName):
    """
    End accumulating data for child element.
    Subsequent content will be associated with enclosing element
    """
    self.CurrentChild = self.ElementStack.pop()

  def AddData(self,Data):
    """
    Associate data with currently active element
    """
    Data = string.strip (string.replace (Data,"\\n",""))
    self.Elements[self.CurrentChild] = self.Elements[self.CurrentChild] + Data

      

    

class RAX:
  """
  Record API for XML - base class
  """
  def __init__(self,fo):
    # Store file object from which PYX is read
    self.fo = fo
    # Default "record" element
    self.RecordElementTypeName = ""

  def SetRecord(self,ElementTypeName):
    """
    Set the "record" element
    """
    self.RecordElementTypeName = ElementTypeName

  def ReadRecord(self):
    """
    Read a record.
    """
    # Match whole PYX lines so that a data line that happens to
    # contain the record name is not mistaken for a tag
    StartTag = "(" + self.RecordElementTypeName
    EndTag = ")" + self.RecordElementTypeName

    # Skip forward to the required start-tag event
    line = self.fo.readline()[:-1]
    while line and line != StartTag:
      line = self.fo.readline()[:-1]
    if not line:
      return None

    # Create a new record
    R = Record(line[1:])
    # Trundle through accumulating info in the Record
    # until the end-tag event occurs
    line = self.fo.readline()[:-1]
    while line and line != EndTag:
      if line[0] == "-":
        if len(line)>1:
          R.AddData(line[1:])
      elif line[0] == "(":
        R.StartChild(line[1:])
      elif line[0] == ")":
        R.EndChild(line[1:])
      else:
        sys.stderr.write ("Unsupported Event '%s'\n" % line[0])
      line = self.fo.readline()[:-1]

    return R

  



def test():
  """
  Test function for RAX

  Process customers.xml outputting various elements
  """

  # This code trundles through the customer records, printing
  # out the "Phone" element of each.
  fo = os.popen ("xmln customers.xml")
  R = RAX(fo)
  R.SetRecord("Record")
  rec = R.ReadRecord()
  while rec:
    print "Phone=%s" % rec.GetField("Phone")
    rec = R.ReadRecord()

if __name__ == "__main__":
  test()