What's Wrong with Perl and XML?
October 11, 2000
The idea for this column came from a talk Nathan Torkington gave at YAPC, in which he described areas where Perl was weak, one of which is XML.
Although there are many excellent Perl modules dealing with many aspects of XML (among which a good dozen offer various ways of transforming XML documents), the languages that seem to be favored by XML developers are Java, C/C++, and maybe even Python. For example, questions on the XML-DEV list mostly involve Java, C++, and XSLT. Sun, IBM, and Microsoft all push Java or C++ implementations.
Even the opening lines of the report on The Perl Conference (TPC) on perl.com make the same point: "My flight to San Jose wasn't delayed too long. During my wait I did get to down some Sam Adams and Jack Daniels while reading Java and XML. Yeah, I know it's probably blasphemous to read a Java book on the way to TPC, but I was looking for some good ideas on using XML, and it is an O'Reilly book."
So what's wrong with Perl? Or "what's wrong with XML?" as Nathan would ask. This article explores what could be done to make Perl and XML a compelling combination, which niche of the XML field Perl could fill better than any other language, which modules should be written to achieve that aim, and what action could be taken to promote Perl usage for XML processing.
Why isn't Perl, despite its widespread adoption for text processing (and especially for the HTML and CGI worlds), the language of choice for XML?
XML doesn't need regular expressions.
In the first place, there is a peculiarity of XML: it is "pre-parsed," in the sense that its semantic units are already demarcated by tags.
One of Perl's biggest strengths is the unparalleled integration of a powerful regular expression engine within the language. Unfortunately, regexp processing is not very useful for the majority of XML processing tasks. XML instances are, by their nature, structured, so there is little need for the power, or the complexity, of pattern matching, and no reason to pay the computational and developer costs of using regexes.
In eight years of processing SGML and XML I don't think I've written more than a couple of scripts that have made heavy use of regular expressions. Most of the time I do tree processing, cut a node here, paste it there, add a prefix based on an attribute value. My Perl XML code would look exactly like C code if it weren't for the funny characters in front of variable names.
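To make the kind of tree processing described above concrete, here is a minimal sketch. It assumes the XML::LibXML module, which is my choice for illustration and is not named in this article; the element names (doc, body, para, appendix) and the rework_tree function are hypothetical. It adds a prefix to each paragraph based on its type attribute, then cuts the notes out of the body and pastes them into the appendix.

```perl
use strict;
use warnings;
use XML::LibXML;    # assumption: any tree-based XML module would do

# Hypothetical sample document, invented for this example.
my $source = '<doc><body><para type="note">Check the logs.</para>'
           . '<para>Plain text.</para></body><appendix/></doc>';

sub rework_tree {
    my ($xml) = @_;
    my $doc = XML::LibXML->load_xml(string => $xml);

    # Add a prefix based on an attribute value.
    for my $para ($doc->findnodes('//para[@type]')) {
        my $prefix = uc($para->getAttribute('type')) . ': ';
        $para->insertBefore($doc->createTextNode($prefix), $para->firstChild);
    }

    # Cut a node here, paste it there: move every note into the appendix.
    my ($appendix) = $doc->findnodes('/doc/appendix');
    for my $note ($doc->findnodes('/doc/body/para[@type="note"]')) {
        $note->unbindNode;                # cut
        $appendix->appendChild($note);    # paste
    }

    return $doc->documentElement->toString;
}

print rework_tree($source), "\n";
```

Note that the only regular-expression-like feature in the sketch is an XPath expression; everything else is plain navigation and method calls, which is the point of the paragraph above.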
So I don't think using Perl to process XML gives developers the usual productivity advantage they enjoy when processing regular text with regexes. Paranoid Perl addicts might even make the point that XML was probably created just to overcome the shortcomings of less advanced languages, in terms of cleaning up messy data that can usually be accommodated by Perl magic.
XML's Development Model
XML's development model does not fit well with the normal Perl development model.
Perl's success with CGI stems from the availability of a couple of widely used modules, CGI.pm and Text::Template, which have become informal industry standards. Many users start writing simple scripts using these modules and then wind up using advanced features of the language. Additionally, a good deal of CGI development is done in an ad hoc way, and this kind of development is certainly more suited to Perl than to C.
On the other hand, XML software is usually written in a more controlled way. It tends to be written by corporate IT staffs rather than 'Net types, and those staffs want to employ their usual methodology and their usual languages, typically Visual Basic, C++, and Java.
In addition, there's the availability of free Java and C++ XML libraries developed by IBM, Sun, or Microsoft. In comparison to a bunch of Perl modules, written by no-names with no corporate support, there is little doubt about which languages will be chosen by CTOs all over the world.
In a way, XML might be seen as the revenge of IT departments over the free-wheeling development style generated by early ad hoc web development.
XML Processing Standards
DOM and XSLT have little appeal for Perl hackers.
As much as I admire the Document Object Model, and with due respect for the people who spent so much time and effort writing it, I find the API very painful to use. Who but a Java-intoxicated contortionist would want to use such a constraining corset? I have heard people who write DOM tutorials complain about its ugliness. Why would any freedom-loving Perl poet submit to this insanity?
The only thing that makes the DOM unavoidable is that it's likely to become the standard API to interface to XML databases. For example, XDBM offers DOM interfaces in Java and Perl.
As for XSLT, I have reservations about the degree to which it encourages the mixing of code and data in one file. While XSLT might be appropriate for light, template-driven transformations, I have no idea how I could use it to develop and maintain the kind of heavyweight processing I am used to doing.
Further evidence that DOM and XSLT are not an easy sell to the Perl crowd is the number of XML transformation modules on CPAN, over a dozen at the last count. Why would people keep writing those things if they were happy with DOM and XSLT? The problem here is that with so many modules available newcomers have no idea which one to use, undermining the overall appeal of Perl for them.
I see little reason for Perl hackers to use either DOM or XSLT to process XML as they seem to oppose many things we like about software creation.
Where is Perl without a doubt the best tool for an XML programming task? What could be done to make Perl more useful for XML processing? And for those Perl addicts among us, how can we enjoy working with XML without needing to switch to some other ugly, less fun, uncool language?
Converting Data to XML
A common task is the conversion of legacy data formats, often unstructured or semi-structured, into XML. It is here that the power of Perl's regular expressions can be used to good effect. Modules like RTF-Parser or the venerable Mif.pl, along with a number of HTML processing modules, can be useful, along with XML creation modules like XML::Writer.
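As an illustration of this kind of conversion, here is a minimal sketch that uses a regular expression to pick fields out of a flat text format and XML::Writer to emit the result. The legacy format (one "key: value" pair per line, records separated by blank lines), the element names, and the legacy_to_xml function are all assumptions made up for this example. The sketch also assumes that every key is a valid XML element name.

```perl
use strict;
use warnings;
use XML::Writer;

# Hypothetical legacy data, invented for this example.
my $legacy = <<'END';
name: Widget & Co
price: 12.50

name: Gadget
price: 7
END

sub legacy_to_xml {
    my ($text) = @_;
    my $xml    = '';
    # Passing a scalar reference as OUTPUT captures the XML in $xml.
    my $writer = XML::Writer->new(OUTPUT => \$xml);

    $writer->startTag('records');
    for my $record (split /\n\s*\n/, $text) {
        $writer->startTag('record');
        # The regex does the unstructured part of the job...
        while ($record =~ /^(\w+):[ \t]*(.*?)[ \t]*$/mg) {
            my ($key, $value) = ($1, $2);
            # ...and XML::Writer handles tags and character escaping.
            $writer->dataElement($key, $value);
        }
        $writer->endTag('record');
    }
    $writer->endTag('records');
    $writer->end;

    return $xml;
}

print legacy_to_xml($legacy), "\n";
```

The division of labor is the design choice worth noticing: the regex copes with the messy input, while the writer module guarantees that the output is well-formed, including the escaped ampersand in the first record.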
But when it comes to manipulating the XML that has been created, it's trickier in Perl to start reorganizing the nodes in an XML document. Modules doing higher-level parsing, allowing the definition of rules to add structure to flat XML, would be a real help. As it is, the flexibility and power of Perl already allow for some really powerful processing.
Perl as Glue
Perhaps Perl's greatest strength is CPAN, which contains an astonishing number of modules, all of which make Perl an excellent glue language, even among diverse or unusual systems. Perl allows easy access to databases, through the DBI module, access to remote data through the LWP module, fuzzy spelling through soundex, credit card validity checking, LDAP access, etc. This makes XML-based software written in Perl able to do a lot more, and a lot more easily, than software written in any other language. Also, the number of Perl XML modules makes it easy to add XML integration to existing non-XML software.
Perl has many modules for handling XML, yet it seems that either they are not widely used, or they compete directly with more established libraries in other languages. So what could we do to improve existing modules, and which new ones should be written?
Making Simple Things Easy
Providing simple layers on top of XML standards.
Making simple things easy is part of one of Perl's many mottos. Writing a simple standard was also one of the goals of XML. Since then it seems that related XML standards may have abandoned this goal.
Layering modules on top of the DOM that give the user a more Perlesque interface to an XML document seems like a natural way to preserve compatibility with future XML databases, while allowing developers to enjoy hacking XML with a more familiar "Perl" feel to the APIs. The XML::EasyObj module, for example, is clearly a step in the right direction.
Making Complex Things Possible
Providing Perl escapes from XML standard implementations.
Another way to provide additional power to users of Perl is to provide ways to use it from within modules that implement XML standards (the easiest analogue is the way that mod_perl provides that escape for web pages). XPath, XSLT, XQL, and DOM could benefit from allowing developers to use Perl even within standard expressions. If the language is not powerful enough, just add calls to pure Perl functions accessing the underlying XML structure, or even throw a Perl eval in the mix and any shortcoming will be quickly (at least in terms of development time) overcome.
I know this would make Perl implementations stray from industry-standard specifications but at least it would work. And when I hear Microsoft, in their "embrace and extend" fashion, admit that their DOM processors implement all of the DOM but add other functionality, it makes me feel less concerned about sticking strictly to the standard.
Give newcomers help in finding their way around CPAN modules.
At the moment it's very difficult to know what each of the 35 XML modules on CPAN does. It's just as difficult to know who uses them, how robust they are, and how well supported they are, which obviously makes it very hard for newcomers to figure out which ones to use.
While waiting for the planned CPANTS, a central Perl-XML site could provide information on how many people use a module, what they think about it, and examples of its use. Some reviews have already been posted on Perl Monks' Module Reviews area, and these should make it easier to find the right tool for the job.
When one looks at all the work that has been done in Perl to accommodate XML, the Unicode support, and the number of existing XML modules, it would seem that Perl is already an excellent tool in the XML field. It just seems that not enough people have noticed.
What would help others notice is if Perl advocates would publicize the work that has been done, whether on XML modules or on XML tools such as AxKit (a Cocoon-like XML publishing system), DBIx::XML_RDB (which extracts XML data from a relational database), or the Biztalk toolkit for Perl. In other words, publicizing the significant XML applications built using Perl will counterbalance the Java or XSLT hype.
Writing better documentation, tutorials, and code examples could also ease the learning curve for Perl programmers who want to start hacking XML. A regular Perl XML column on XML.com will also help. (Watch this space -- E.D.)
I think that even though the XML world is in some ways the antithesis of the Perl Way, there are still plenty of ways for Perl to be used effectively for XML processing.
Sun, Microsoft, and IBM, probably the main forces pushing XML and driving the W3C at the moment, are big Java supporters. Nevertheless I think they should realize that Perl appeals to a different audience than Java. So supporting and promoting Perl XML development, even bringing Perl people into the standards working groups, would most likely increase XML's overall usage, converting Perl HTML/CGI developers to XML more easily than if they had to learn a new language.
To conclude on a more optimistic note, the situation is not beyond remedy. I conducted keyword searches for XML and various programming languages over the W3C's site, the XML Cover Pages, XML-DEV, and XML.com. Though behind Java and XSLT, Perl did give quite a strong showing, usually above Python.