The Document is the Database

July 9, 2003

Jon Udell

When you need to store and display a modest amount of structured or semistructured data, it's tempting to store it directly in an HTML file. I've used this strategy many times; undoubtedly you have too. The advantages and disadvantages of working directly with a presentation format are pretty clear. It's handy that the "database" is a self-contained package that can be updated using any text editor, emailed, read directly from a file system, or served by any web server. But it's awkward to share the work of updating with other people or to isolate and edit parts of the file as it grows. When we convert to a database-backed web application in order to solve these problems, we trade away the convenience of the file-oriented approach. Can we have our cake and eat it too? This month's column explores the idea that a complete web application can be wrapped around an XHTML document, using XSLT for search, insert, and update functions.

I've been developing the idea in the context of the Zope application server, so the first order of business was to come up with an XSLT wrapper for Zope. Since Zope is written in Python, my first inclination was to use a Python binding to an XSLT library. But which one? libxslt is a popular choice, but on one particular FreeBSD system -- where I lack the root privileges needed to install libxslt -- I use Sablotron instead. On Windows, meanwhile, MSXML is the incumbent. So I settled on the more basic strategy of wrapping a command-line XSLT processor -- such as libxslt's xsltproc, Sablotron's sabcmd, or MSXML's msxsl.exe -- in a Zope method. Because this method calls OS functions to create and remove temporary files, it has to be deployed in Zope as an External Method rather than a Python Script. To do that, I put the code in a file called xslt.py, put the file in Zope's Extensions directory, added an External Method called 'xslt' to a folder, and used 'xslt' as both its module name and function name.

xsltproc = 'xsltproc'  # or 'sabcmd' or 'msxsl.exe'

# see http://www.linuxjournal.com/article.php?sid=5821
def formatExceptionInfo(maxTBlevel=5):
    import traceback, sys
    cla, exc, trbk = sys.exc_info()
    excName = cla.__name__
    try:
        excArgs = exc.__dict__["args"]
    except KeyError:
        excArgs = "<no args>"
    excTb = traceback.format_tb(trbk, maxTBlevel)
    return (excName, excArgs, excTb)

def xslt(self,xsl,idXml,update=0):
    try:
        import os, tempfile
        xmldata    = self.findFileInFolder(idXml)
        xslfile    = tempfile.mktemp()
        xmlfile    = tempfile.mktemp()
        outfile    = tempfile.mktemp()
        errfile    = tempfile.mktemp()
        open(xslfile,"w").write(xsl)
        open(xmlfile,"w").write(xmldata.data)
        cmd = "( %s %s %s) > %s 2> %s" % (
            xsltproc, xslfile, xmlfile, outfile, errfile)
        errlev=os.system(cmd) >> 8
        out=open(outfile,"r").read()
        err=open(errfile,"r").read()
        os.remove(xslfile)
        os.remove(xmlfile)
        os.remove(outfile)
        os.remove(errfile)
        if ( update == 1 ):
            self.manage_delObjects(idXml)
            self.manage_addFile(idXml,out,'')

    except:
        return formatExceptionInfo()

    if ( errlev > 0 ):
        return err
    else:
        return out

The method has three required arguments and an optional fourth, but the first argument, self, is supplied automatically by Zope. It's the folder from which the method is called and, through the magic of Zope acquisition, it can be any folder below the one containing the External Method. The second argument is the XSLT data which, as we'll see, is produced by other scripts that interpolate values into templates. The third argument is the name (in Zope lingo, the id) of the Zope File object containing the XML data to be transformed. In this case, that File has an .html extension, contains XHTML, and is served with a text/html content type. The optional fourth argument, update, defaults to false, but when true causes the output of the XSLT transformation to overwrite the XML data.
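The same wrap-a-command-line-processor idea can be sketched in present-day Python. This is only an illustration, not the External Method above: the name run_xslt and its (status, output, errors) return value are made up for the example, and it assumes a processor that takes the stylesheet path followed by the document path, as xsltproc does. Other processors may order their arguments differently.

```python
import os
import subprocess
import tempfile


def run_xslt(processor, xsl_text, xml_text):
    """Run a command-line XSLT processor; return (status, output, errors).

    `processor` is an argument list such as ["xsltproc"]; the stylesheet
    path and the document path are appended to it, in that order.
    """
    paths = []
    try:
        for text in (xsl_text, xml_text):
            # mkstemp creates the file securely, unlike tempfile.mktemp
            fd, path = tempfile.mkstemp()
            paths.append(path)
            with os.fdopen(fd, "w", encoding="utf-8") as handle:
                handle.write(text)
        result = subprocess.run(
            list(processor) + paths,
            capture_output=True,
            text=True,
        )
        return result.returncode, result.stdout, result.stderr
    finally:
        # Remove the temporary files even if the processor fails to start
        for path in paths:
            os.remove(path)
```

Passing the command as a list avoids building a shell command line, and capturing stdout and stderr directly removes the need for the separate output and error files.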

To set the stage, let's suppose we're collecting and displaying data about speakers at a conference. Here's the shell of our XHTML data:

<?xml version="1.0"?>
<body>
<style>
.speaker { margin-bottom: 10px }
.speakername { font-weight: bold }
.speakerTitle { font-style: italic }
</style>
<speakers>
</speakers>
</body>

And let's assume that we're dealing with multiple conferences, so the Zope namespace looks like this:

/Conferences/OSCON
/Conferences/ETech

Our xslt External Method, installed in the /Conferences folder, can be acquired by any subfolder, as can the other scripts we'll use to add, find, and update speaker data. If the data are stored in a file called speakers.html, there can be multiple instances of it -- for example, /Conferences/OSCON/speakers.html and /Conferences/ETech/speakers.html.

Now let's add a speaker to /Conferences/OSCON/speakers.html. This script, called add, kicks off the process:

form = '''
<script>
function insertSpeaker(){
speaker = document.insertSpeaker.speaker.value;
location = 'insert?speakerEmail=' + speaker;
}
</script>
<form name="insertSpeaker" method="post" action="javascript:insertSpeaker()">
<div>new speaker's email address: <input name="speaker"/> </div>
<div><input type="submit" value="insertSpeaker"/></div>
</form>
'''
return context.showMenu() + form

The add script is a Python Script, not an External Method, which means that it's subject to security restrictions but is more convenient to update. It's located in /Conferences, but when called as /Conferences/OSCON/add it sets up a context that will cause /Conferences/OSCON/speakers.html to be updated. The script simply displays a form that collects the speaker's email address -- which will serve as the key into our XHTML database -- and passes it (by way of JavaScript) to another Python Script, insert:

speakerEmail = context.REQUEST.form['speakerEmail']

xsl = '''%s

%s

<xsl:template match="//speakers">
<speakers>
<xsl:if test="count(//div[@email='%s'])=0" >
<xsl:text>&#10;</xsl:text>
<div class="speaker" email="%s"><xsl:text>&#10;</xsl:text>
<div class="speakerTitle"/><xsl:text>&#10;</xsl:text>
<div class="speakerName"/><xsl:text>&#10;</xsl:text>
<div class="speakerTitle"/><xsl:text>&#10;</xsl:text>
<div class="speakerBio"><p>bio</p></div><xsl:text>&#10;</xsl:text>
</div>
</xsl:if>
<xsl:apply-templates />
</speakers>
</xsl:template>

</xsl:stylesheet>
''' % (context.xsltPreamble(), context.xsltIdentityTransform(), 
       speakerEmail, speakerEmail)

try:
    context.acquireLock()
except:
    return "insert: exception acquiring lock"

try:
    context.xslt(xsl, 'speakers.html', update=1)
except:
    return "insert: exception updating"

try:
    context.releaseLock()
except:
    return "insert: exception releasing lock"

return context.REQUEST.RESPONSE.redirect('select?key='+speakerEmail)

In a Zope Python Script, all the interesting stuff hangs off the context variable. In this case, we'll use it to get to the HTTP request with the caller's form data, to locate some convenience scripts that supply XSLT boilerplate, and to locate our xslt External Method.
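The xsltPreamble and xsltIdentityTransform convenience scripts aren't listed in this column. Here is a minimal sketch of what they might return, written as ordinary Python functions with made-up names. It assumes the preamble does nothing more than open the stylesheet element, since each template in the scripts supplies its own closing </xsl:stylesheet> tag; the real preamble may well contain more, such as output settings.

```python
import xml.dom.minidom

XSL_NS = "http://www.w3.org/1999/XSL/Transform"


def xslt_preamble():
    # Assumed content: opens the stylesheet; the caller's template closes it
    return ('<?xml version="1.0"?>\n'
            '<xsl:stylesheet version="1.0" xmlns:xsl="%s">' % XSL_NS)


def xslt_identity_transform():
    # Copies every node and attribute that no other template matches
    return '''<xsl:template match="node() | @*">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>'''


# Assemble a stylesheet the way the insert script does, then confirm
# that the interpolated result is well-formed XML
stylesheet = '''%s

%s

</xsl:stylesheet>
''' % (xslt_preamble(), xslt_identity_transform())
document = xml.dom.minidom.parseString(stylesheet)
```

Parsing the assembled string is a cheap way to catch interpolation mistakes before the stylesheet is handed to the XSLT processor.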

The XSLT script that's created is a filter for speakers.html. It locates the <speakers> node in that file. If no speaker node (a <div> whose email attribute matches the given address) exists, it inserts one. The XSLT identity transform, i.e.:

<xsl:template match="node() | @*">
  <xsl:copy>
    <xsl:apply-templates select="@* | node()"/>
  </xsl:copy>
</xsl:template>

passes the rest of the XML data through the filter unchanged. When the insert script makes this call:

 context.xslt(xsl, 'speakers.html', update=1) 

the xslt External Method receives an implicit first argument, self, which represents the context folder, in this case /Conferences/OSCON. It uses that handle three times:

self.findFileInFolder: Converts the name (e.g. speakers.html) to a ZODB object reference. The findFileInFolder function, a Python Script with a single parameter, fname, is:

# return the first File object in this folder whose id matches fname
for f in context.objectValues(['File']):
    if f.getId() == fname:
        return f
return None

self.manage_delObjects: Deletes speakers.html.

self.manage_addFile: Recreates speakers.html.

The result is a modified speakers.html with a newly-added speaker:

<?xml version="1.0"?>
<body>
<style>
.speaker { margin-bottom: 10px }
.speakername { font-weight: bold }
.speakerTitle { font-style: italic }
</style>
<speakers>
<div class="speaker" email="dj.adams@pobox.com">
<div class="speakerTitle"/>
<div class="speakerName"/>
<div class="speakerTitle"/>
<div class="speakerBio"><p>bio</p></div>
</div>
</speakers>
</body>
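For comparison, the same insert-if-absent logic can be written directly against the document tree. This sketch uses Python's xml.etree.ElementTree rather than XSLT, so it is not how the insert script works; the function name add_speaker and the cut-down document shell are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SHELL = """<body>
<speakers>
</speakers>
</body>"""


def add_speaker(root, email):
    """Add an empty speaker node unless one with this email exists."""
    speakers = root.find("speakers")
    for div in speakers.findall("div"):
        if div.get("email") == email:
            return False  # already present, leave the document alone
    speaker = ET.Element("div", {"class": "speaker", "email": email})
    for css_class in ("speakerName", "speakerTitle", "speakerBio"):
        ET.SubElement(speaker, "div", {"class": css_class})
    # Seed the biography with a placeholder paragraph
    ET.SubElement(speaker[-1], "p").text = "bio"
    # New nodes go first, so they push down the older ones
    speakers.insert(0, speaker)
    return True


root = ET.fromstring(SHELL)
first = add_speaker(root, "dj.adams@pobox.com")
second = add_speaker(root, "dj.adams@pobox.com")
```

The second call finds the existing node and changes nothing, which is the job the count() test does in the stylesheet.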

Finally, the insert script redirects to another script, select, which locates the newly-added node and presents it for editing. Here's that script:

xsl = '''%s

<xsl:template match="//div[@email='%s']">
<xsl:for-each select=".">
<xsl:sort/>
<form method="post" action="update">
<div><input type="hidden" name="speakerEmail" value="{@email}"/>
  <xsl:value-of select="@email"/></div>
<div>speakerName: 
  <input name="speakerName" value="{normalize-space(*[@class='speakerName'])}"/> 
</div>
<div>speakerTitle: 
  <input name="speakerTitle" value="{normalize-space(*[@class='speakerTitle'])}"/> 
</div>
<div>speakerBio: <textarea name="speakerBio">
<xsl:copy-of select="*[@class='speakerBio']/*"/>
</textarea></div>
<div><input value="update" type="submit"/></div>
</form>
</xsl:for-each>
</xsl:template>

<xsl:template match="text()">
</xsl:template>

</xsl:stylesheet>'''

xsl = xsl % (context.xsltPreamble(),context.REQUEST.form['key'])

return context.showMenu() + context.xslt(xsl,'speakers.html')

The select script uses XSLT to find a speaker node and generate an update form. The work of updating is handled by another script, update:

speakerEmail = context.REQUEST.form['speakerEmail']
speakerName  = context.REQUEST.form['speakerName']
speakerTitle = context.REQUEST.form['speakerTitle']
speakerBio   = context.REQUEST.form['speakerBio']

xsl = '''%s

%s

<xsl:template match="//div[@email='%s']">
<div class="speaker" email="%s">
<xsl:text>&#10;</xsl:text>
<div class="speakerName">
%s
</div>
<xsl:text>&#10;</xsl:text>
<div class="speakerTitle">
%s
</div>
<xsl:text>&#10;</xsl:text>
<div class="speakerBio">
%s
</div>
<xsl:text>&#10;</xsl:text>
</div>
</xsl:template>

</xsl:stylesheet>
''' % (context.xsltPreamble(), context.xsltIdentityTransform(),
         speakerEmail, speakerEmail, speakerName, speakerTitle, speakerBio)

try:
    context.acquireLock()
except:
    return "update: exception acquiring lock"

try:
    context.xslt(xsl, 'speakers.html', update=1)
except:
    return "update: exception updating"

try:
    context.releaseLock()
except:
    return "update: exception releasing lock"

return context.REQUEST.RESPONSE.redirect('select?key='+speakerEmail)
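The acquireLock and releaseLock scripts aren't shown here. Note that in the sequence above, an exception raised during the update returns before releaseLock is called. A try/finally arrangement is one way to guard against that. In this sketch, threading.Lock and the transform functions are stand-ins chosen for illustration, not the Zope scripts themselves.

```python
import threading

document_lock = threading.Lock()  # stand-in for acquireLock/releaseLock


def locked_update(transform, data):
    """Apply transform to data while holding the lock, always releasing it."""
    document_lock.acquire()
    try:
        return transform(data)
    finally:
        # Runs on success and on failure, so the lock cannot be left held
        document_lock.release()


def failing_transform(data):
    raise ValueError("simulated XSLT failure")


updated = locked_update(str.upper, "speakers")
try:
    locked_update(failing_transform, "speakers")
except ValueError:
    pass
still_locked = document_lock.locked()
```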

After the update, speakers.html might look like this:

<?xml version="1.0"?>
<body>
<style>
.speaker { margin-bottom: 10px }
.speakername { font-weight: bold }
.speakerTitle { font-style: italic }
</style>
<speakers>
<div class="speaker" email="dj.adams@pobox.com">
<div class="speakerName">
DJ Adams
</div>
<div class="speakerTitle">
SAP hacker
</div>
<div class="speakerBio">
<p>
DJ Adams is an old SAP hacker who still thinks JCL and S/370 assembler
is pretty cool. In recent years he's been successfully combining Open
Source software with R/3 to produce hybrid systems that show off the
power of free software.
</p>
<p>
He is the author of O'Reilly's <a
href="http://www.oreilly.com/catalog/jabber/"><i>Programming
Jabber</i></a>, contributes <a
href="http://www.oreillynet.com/pub/au/139">articles</a> to
O'ReillyNet's P2P site, and has to own up to being responsible for the
Jabber::Connection, Jabber::RPC and Jabber::Component::Proxy modules
on CPAN.
</p>
</div>
</div>
</speakers>
</body>

As new speaker nodes are added to the file, they push down the older ones. In this naive implementation, there's no effort to sort the nodes stored in the XHTML file. But here's another script, find, that uses XSLT to produce an HTML SELECT element whose options are sorted by speakers' email addresses. The selected item is fed to the select script for updating.

xsl = '''%s

<xsl:template match="//speakers">
<script>
function chooseSpeaker(){
var list = document.chooseSpeaker.speakers;
speaker = list[list.selectedIndex].value;
location = 'select?key=' + speaker;
}
</script>
<form name="chooseSpeaker" method="post" action="javascript:chooseSpeaker()">
<select name="speakers">
<xsl:apply-templates select="./div[@class='speaker']">
<xsl:sort select="@email"/>
</xsl:apply-templates>
</select>
<div><input value="chooseSpeaker" type="submit" /></div>
</form>
</xsl:template>

<xsl:template match="//div[@class='speaker']">
<option value="{@email}"><xsl:value-of select="@email"/></option>
</xsl:template>

<xsl:template match="text()">
</xsl:template>

</xsl:stylesheet>''' % (context.xsltPreamble())

return context.showMenu() + context.xslt(xsl,'speakers.html')
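The ordering that xsl:sort produces here can be mirrored in plain Python, which may be handy when checking the stylesheet's output. This sketch uses xml.etree.ElementTree instead of XSLT, and the sample document with its example.com addresses is invented for the illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; only the email attributes matter here
SAMPLE = """<body>
<speakers>
<div class="speaker" email="zoe@example.com"/>
<div class="speaker" email="adam@example.com"/>
<div class="speaker" email="mia@example.com"/>
</speakers>
</body>"""


def speaker_options(xhtml):
    """Return <option> markup for each speaker, sorted by email address."""
    root = ET.fromstring(xhtml)
    emails = sorted(
        div.get("email")
        for div in root.findall("./speakers/div[@class='speaker']")
    )
    # No escaping is done here; the sample addresses contain no markup
    return ['<option value="%s">%s</option>' % (e, e) for e in emails]


options = speaker_options(SAMPLE)
```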
    

As speakers are added and updated, the speakers.html file remains immediately viewable in the browser. The file can also be searched in a structured way, using the technique I explored last month. Here, for example, is a query that finds speakers whose biographies contain 'JCL':

//div[@class='speaker'][contains(./div[@class='speakerBio'], 'JCL')]
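ElementTree's XPath subset has no contains() function, so running this exact query from Python's standard library isn't possible. The filter is easy to approximate, though. In this sketch the function name and the abbreviated sample data are assumptions; it tests the concatenated text of each speaker's speakerBio node.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
<speakers>
<div class="speaker" email="dj.adams@pobox.com">
<div class="speakerBio"><p>An old SAP hacker who still thinks JCL is cool.</p></div>
</div>
<div class="speaker" email="someone@example.com">
<div class="speakerBio"><p>bio</p></div>
</div>
</speakers>
</body>"""


def speakers_with_bio_text(xhtml, needle):
    """Return emails of speakers whose biography text contains needle."""
    root = ET.fromstring(xhtml)
    matches = []
    for speaker in root.iter("div"):
        if speaker.get("class") != "speaker":
            continue
        bio = speaker.find("div[@class='speakerBio']")
        # itertext() joins text from nested elements, like XPath string()
        if bio is not None and needle in "".join(bio.itertext()):
            matches.append(speaker.get("email"))
    return matches


found = speakers_with_bio_text(SAMPLE, "JCL")
```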

Is this really a practical way to manage a collection of semistructured data? Frankly, I'm undecided. But it's an interesting preview of how things will be when native XML storage, and node-level update capability, are standard features of all databases. Meanwhile, the ability to use Python to generate and run XSLT transformations, in a Zope context, seems like a useful pattern.