XML.com: XML From the Inside Out
Functional Programming and XML
by Bijan Parsia

XMLambda

XMLambda is one of a number of special-purpose languages (or, rather, language designs, since there isn't a publicly available implementation) for dealing with XML to come out of the FP community; other examples include XDuce and various query languages. In particular, XMLambda is a language optimized for composing "dynamic" XML documents, i.e., documents where some portion is computed rather than entered. The neat thing about XMLambda is that it's basically the result of asking: "What would a type-obsessed functional language look like if an XML document simply were a program?" This means that the minimally XML-savvy person can already write XMLambda programs, such as

<html>
<head>
        <title>Hello world in XMLambda</title>
</head>
<body>
        <h1>Hello world</h1>
</body>
</html>

It's important to recognize that this is a fairly complex program in XMLambda, not just a string literal. Each element is a program construct with an associated type. Furthermore, validation of the XML is accomplished by type checking: validating the document is just a matter of compiling the program.

Compare this, for a moment, with various embedded scripting language template systems, such as JSP or ASP. In most of these systems, the surrounding XML is just text, element tags and content alike -- semantically speaking, you could replace it with a series of write statements (or one big one or some concatenations). More importantly, the write statements wouldn't have to respect, for example, tag boundaries. It'd be perfectly legitimate to embed something like

out.println("<title" + ">Hello world<" + "</title>");

In the degenerate case of a pure XML document, the XMLambda compiler simply is a validating XML parser, and we have plenty of those. However, since XMLambda's validation is actually part of a more general type-checking system, even documents ("templates") with embedded computations can be "validated":

myTitle :: string
myTitle = "Hello world in XMLambda"

<html>
<head>
        <title><%= myTitle %></title>
</head>
<body>
        <h1>Hello world</h1>
</body>
</html>
Variable myTitle is of type string and thus can go inside a title element.
myTitle :: title
myTitle = <title>Hello world in XMLambda</title>

<html>
<head>
        <%= myTitle %>
</head>
<body>
        <h1>Hello world</h1>
</body>
</html>
Now, myTitle is of type title element, and so can appear inside the head element...
myTitle :: title
myTitle = <title>Hello world in XMLambda</title>

<html>
<head>
        <%= myTitle %>
</head>
<body>
        <h1><%= myTitle %></h1>
</body>
</html>
... but not inside an h1 element. This would be caught by the compiler.

If I'd declared myTitle to be a string in the third program, there still would have been a compile time type error, since what gets assigned to myTitle is an element. To subvert the type checks, I would have to make the right hand side of the assignment a string literal, as opposed to an element literal. That case would be pretty much equivalent to a standard ASP or JSP template (although JSP tag libraries make this a bit more complicated). In standard ASP/JSP templates you tend to be working at the fairly low level of string munging, which not only forgoes automatic validation, but also forces the programmer to keep three separate models in mind at once: the XML world of elements, the programming language's datatypes and constructs, and the string hacking interface between the two.
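The contrast drawn above can be sketched in Haskell (the general-purpose language HaXML builds on, discussed below). The sketch is illustrative only, not XMLambda: the type names are invented, but the idea is the same, one type per element kind, so mis-nesting becomes a compile-time error rather than a runtime validation failure.

```haskell
-- Illustrative sketch only (not XMLambda): one Haskell type per element
-- kind, so the type checker enforces what may nest where.
newtype Title = Title String
newtype H1    = H1 String
data Head = Head Title
data Body = Body H1
data Html = Html Head Body

hello :: Html
hello = Html (Head (Title "Hello world in XMLambda"))
             (Body (H1 "Hello world"))
-- Body's H1 holds a string, not a Title; writing
--   Body (H1 (Title ...))  -- does not typecheck
-- is rejected at compile time, the analogue of failing validation.

-- Serializing the typed value back to markup:
render :: Html -> String
render (Html (Head (Title t)) (Body (H1 h))) =
  "<html><head><title>" ++ t ++ "</title></head><body><h1>" ++ h
  ++ "</h1></body></html>"
```

Here the "template" is an ordinary typed value, so the string-munging layer disappears entirely.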

The standard solution is to use element node objects, or element generating procedures, to provide a unified, more abstract, view of the dynamic document. This can work, and you can even build in some sorts of validation (e.g., have nodes check the classes of their children, etc.). However, it's beastly awkward, mainly because (as the XMLambda authors point out) it forces you to too high a level of abstraction, not to mention a much more cumbersome notation. XMLambda lets you stay XML-centered while providing a consistent and fairly natural way to build complex, generated documents.

HaXML

Like XMLambda, the HaXML utilities treat a DTD as a series of type declarations in a functional language, thus conflating validation with type checking. In the case of HaXML, the language is Haskell, a general purpose FP language with many interesting features and implementations. While lacking the XML-centeredness of XMLambda, HaXML's mapping libraries allow one to smoothly work with XML data and documents from within a complete programming language with an economical and perspicuous notation.

One interesting claim I've seen in the literature is that the base Haskell type system -- while very expressive, especially in comparison with C, C++, or Java type systems -- may prove either insufficient or just unwieldy for XML documents. And it's this insufficiency that's one reason for the design of XML-specific languages with special type systems (such as XMLambda and XDuce). However, the Haskell type system is well-understood, well-documented, and well-implemented, all of which make it ideal for exploring more comprehensive type schemes for XML.

The HaXML utilities introduce another, distinctively FP, mechanism for manipulating XML: combinators. Combinators, or higher-order functions, are used to build complex functions out of simpler ones. They are "higher-order" because they take functions as arguments and return a function as a result. Conceptually, they are much like Unix pipes in that they let you build up more complex computational sequences by flexibly arranging highly specialized tools. Indeed, the HaXML package defines the "Irish composition" combinator (represented by "o") which applies one function to the results of another (just the way a pipe "applies" one program to the results of another). However, there are many other combinators which do such useful things as append the results of one function to those of another in various ways or use one function to select results from another. The nifty thing about combinators is that they're just functions defined and definable in the host language. Thus, as you find yourself using a certain pattern, it's easy to abstract it into a new combinator. Indeed, many of the predefined combinators in HaXML are such abstractions.

The processing model should be clear. First, HaXML defines a number of basic filtering functions, including elm which returns the input if it is an (arbitrary) XML element, children which, if the input is an element with children, returns those children, and txt which passes only plain text. Then, it adds a few filter constructors for building specialized filters, e.g., tag for building filters for specific tags or mkElem for transforming a string into an element. Finally, HaXML defines a combinator library for stringing these basic filters -- and any generated ones, of course -- together in various ways. The result? A domain specific language for parsing, filtering, transforming, and generating XML documents. Given the side-effect free nature of these functions, you can treat them as very strongly encapsulated components that can be mixed and matched freely. In fact, the HaXML paper describes various algebraic laws for their combinators which you could use to transform complex expressions manually into something more readable or could form the basis of a filter optimizer.
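The processing model can be made concrete with a self-contained miniature. The definitions below are simplified stand-ins, written over a toy tree type rather than the real HaXML library, but the names (`elm`, `txt`, `children`, `tag`, `o`) and the filter idea, a function from one node to zero or more result nodes, mirror HaXML's.

```haskell
-- A self-contained miniature in the style of HaXML (not the real library).
-- A filter maps one node to zero or more result nodes.
data Content = Elem String [Content]  -- element: name and children
             | Text String            -- plain text
             deriving (Eq, Show)

type CFilter = Content -> [Content]

elm, txt, children :: CFilter
elm c@(Elem _ _) = [c]         -- pass elements through, drop everything else
elm _            = []
txt c@(Text _)   = [c]         -- pass only plain text
txt _            = []
children (Elem _ cs) = cs      -- yield an element's children
children _           = []

tag :: String -> CFilter       -- pass only elements with the given name
tag n c@(Elem m _) | n == m = [c]
tag _ _                     = []

-- "Irish composition": feed g's results through f, like a Unix pipe.
o :: CFilter -> CFilter -> CFilter
o f g = concatMap f . g

doc :: Content
doc = Elem "head" [ Elem "title" [Text "Hello"]
                  , Elem "style" [Text "body {}"] ]

titles :: [Content]
titles = (tag "title" `o` children) doc   -- [Elem "title" [Text "Hello"]]
```

Because each filter is a pure function, pipelines like `txt `o` children `o` (tag "title" `o` children)` can be rearranged freely; in particular `o` is associative, which is one of the algebraic laws the HaXML paper exploits.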

There's a very cumulative flavor to this approach: each bit can be achieved via some more verbose technique that trades notational economy for a reduced number of constructs. But there's no need, since the extra constructs are strictly defined in terms of the core set. The algebra lets you use derivations to figure out equivalences, rather than trying to keep a mental model of the processing flow.

Some folks might find the non-XML syntax of these combinators a welcome alternative to XSLT. Personally, I find the trend toward XMLizing everything -- especially of end-programmer languages -- to be rather a step backwards. After all, the designers of XML went out of their way to eliminate all the author support features of SGML.

These examples were taken from the first "Transforming XML" column. I don't promise that the semantics are exactly alike, for several reasons, including the need to pretty print the output of the combinators, but they should help give some of the flavor of the combinators.

Copying Elements to the Output
XSLT:

<xsl:template match="title">
  <xsl:copy>
    <xsl:apply-templates/>
  </xsl:copy>
</xsl:template>

HaXML:

multi(tag "title")

--"multi" is one of the recursive combinators
--it collects a list of the results of applying
--its argument to each node in the tree

Deleting Elements
XSLT:

<xsl:template match="nickname">
</xsl:template>

HaXML:

foldXml(keep `without` tag "nickname")

--"foldXml" is another recursive combinator
--it applies its argument to each node starting
--from the bottom up

Changing Element Names
XSLT:

<xsl:template match="article">
  <html>
    <xsl:apply-templates/>
  </html>
</xsl:template>

HaXML:

foldXml(mkElem "html" [children] `when` tag "article")

--This is the recursive version. It's harmless if there's only
--a top level "article". It's probably better to just do:

mkElem "html" [children] `when` tag "article"

--Of course, there's the nicer:

replaceTag "html" `o` tag "article"
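The recursive deletion example can be sketched in the same miniature style. Again, `keep`, `without`, and `foldXml` below are simplified stand-ins for the HaXML combinators of the same names, defined over a toy tree type rather than the real library.

```haskell
-- Simplified stand-ins for HaXML's keep / without / foldXml (not the
-- real library), over a toy tree type.
data Content = Elem String [Content] | Text String deriving (Eq, Show)
type CFilter = Content -> [Content]

keep :: CFilter
keep c = [c]                       -- the identity filter

tag :: String -> CFilter           -- pass only elements with the given name
tag n c@(Elem m _) | n == m = [c]
tag _ _                     = []

-- keep f's results, but only those on which g yields nothing
without :: CFilter -> CFilter -> CFilter
without f g = filter (null . g) . f

-- bottom-up recursion: rebuild each element from its filtered children,
-- then apply the filter to the rebuilt node itself
foldXml :: CFilter -> CFilter
foldXml f (Elem n cs) = f (Elem n (concatMap (foldXml f) cs))
foldXml f c           = f c

person :: Content
person = Elem "person" [ Elem "name" [Text "Bijan"]
                       , Elem "nickname" [Text "BJ"] ]

-- deleting nickname elements, as in the example above:
pruned :: [Content]
pruned = foldXml (keep `without` tag "nickname") person
```

Because `foldXml` rebuilds the tree from the leaves up, the `nickname` element is discarded wherever it occurs, not just at the top level.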

Aside from supporting transformations written directly, these combinators also make a good compilation target for other query, style, and transformation languages. (For example, see Xtract, an XML query language similar to XPath.)
