Perl and XML on the Command Line
April 17, 2002
Over the last several months we have explored some of the ways that Perl's XML modules can be used to create complex, modern Web publishing systems. The growing success of projects like AxKit, Bricolage, and others also shows the combination of Perl and XML to be quite capable for creating large-scale applications. However, our recent focus on more conceptual topics, together with the fact that the Perl/XML combination is most often seen in complex systems, seems to have given the larger Perl community the impression that processing XML with Perl tools is somehow complex and only worth the effort for big projects.
The truth is that putting Perl's XML processing facilities to work is no harder than using any other part of Perl; and if the applications that feature Perl/XML in a visible way are complex, it is because the problems that those applications are designed to solve are complex. To drive this point home, this month we will get back to our Perlish roots by examining how Perl can be used on the command line to perform a range of common XML tasks.
For our first few examples we will focus on those modules that ship with command line tools as part of their distributions.
XML::XPath and the xpath Utility
Requires: XML::XPath, XML::Parser
Matt Sergeant's fine XML::XPath module provides a way to access the contents of XML documents using the W3C-recommended XPath language. This module also installs xpath, a command line utility that allows XPath expressions to be used to examine the contents of XML documents. The XML document can be specified either by passing in a path to the file as the first argument or by piping the document in via STDIN.
Find all section titles in a DocBook XML document:
xpath mybook.xml //section/title
The same command using a pipe:
cat files/mybook.xml | xpath //section/title
Retrieve just the significant text (not including nodes containing all-whitespace) from a given document:
xpath somefile.xml "//text()[string-length(normalize-space(.)) > 0 ]"
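To try these queries without a real DocBook file, something like the following sketch may help. The sample document and its contents are invented for illustration. The fallback to `-e` is an assumption: some newer packaged releases of the xpath script appear to expect the expression after `-e`, with the file name last, rather than the file-first form shown above. Quoting the expression keeps the shell from interpreting characters such as `>`, `(`, and `[`.

```shell
# Build a throwaway DocBook-style file to query (contents are illustrative only).
workdir=$(mktemp -d)
sample="$workdir/mybook.xml"
cat > "$sample" <<'EOF'
<?xml version="1.0"?>
<book>
  <section><title>Getting Started</title><para>First steps.</para></section>
  <section><title>Going Further</title><para>Next steps.</para></section>
</book>
EOF

# Run the query only if the xpath utility is actually on the PATH.
# First try the file-first form used in this article; if that fails, try the
# "-e expression file" form (an assumption about newer releases).
if command -v xpath >/dev/null 2>&1; then
    xpath "$sample" '//section/title' 2>/dev/null ||
        xpath -e '//section/title' "$sample" ||
        echo "xpath query failed" >&2
else
    echo "xpath not found; install XML::XPath to try this" >&2
fi
```

If the utility is installed, both title elements from the sample file should be reported.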
DBIx::XML_RDB and the sql2xml Utility
Requires: DBIx::XML_RDB, DBI
Fans of Matt's popular DBIx::XML_RDB module will be pleased to know that it too ships with a command line tool, sql2xml.pl, that returns an entire database table as a single XML document.
Save the data stored in the 'users' table as the file users.xml:
sql2xml.pl -sn myserver -driver Oracle -uid user -pwd seekrit -table user -output users.xml
Or, to send data to STDOUT,
sql2xml.pl -sn myserver -driver Oracle -uid user -pwd seekrit -table user -output -
XML::Handler::YAWriter and the xmlpretty Utility
Requires: XML::Handler::YAWriter, XML::Parser::PerlSAX
No matter how carefully XML documents are edited, they often need reformatting before they can reasonably be called "human-readable". Michael Koehne's XML::Handler::YAWriter installs an XML pretty-printer called xmlpretty, which reduces this task to a quick one-liner.
Passing a file name:
xmlpretty overwrought.xml > new.xml
Reading from STDIN:
cat overwrought.xml | xmlpretty > new.xml
XML::SemanticDiff and the xmlsemdiff Utility
Requires: XML::SemanticDiff, XML::Parser
Unfortunately, standard command line text-processing tools like diff fall short when dealing with XML documents. My XML::SemanticDiff was designed to make comparing the relevant parts of two XML documents (while ignoring things like whitespace, or having the same namespace URI bound to different prefixes) easy and straightforward. Newer versions of this module install the xmlsemdiff utility, which allows simple access from the shell.
Print the semantic differences between two XML documents to STDOUT:
xmlsemdiff file1.xml file2.xml
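The sketch below illustrates why a line-oriented tool falls short here. The two sample files are invented for illustration: they carry the same data but differ in whitespace and in the prefix bound to the same namespace URI. Plain diff flags them as different, while xmlsemdiff, if it is installed, is designed to ignore exactly these kinds of changes. What xmlsemdiff actually prints is not shown, since that depends on the installed version.

```shell
# Two documents with the same content, different layout and namespace prefix.
workdir=$(mktemp -d)
cat > "$workdir/file1.xml" <<'EOF'
<doc xmlns:a="http://example.com/ns"><a:item id="1">camel</a:item></doc>
EOF
cat > "$workdir/file2.xml" <<'EOF'
<doc xmlns:b="http://example.com/ns">
  <b:item id="1">camel</b:item>
</doc>
EOF

# A plain text diff compares bytes line by line, so it reports a difference.
if diff "$workdir/file1.xml" "$workdir/file2.xml" >/dev/null; then
    diff_verdict="identical"
else
    diff_verdict="different"
fi
echo "plain diff says: $diff_verdict"

# Run the semantic comparison only if xmlsemdiff is on the PATH.
if command -v xmlsemdiff >/dev/null 2>&1; then
    xmlsemdiff "$workdir/file1.xml" "$workdir/file2.xml" || true
else
    echo "xmlsemdiff not found; install XML::SemanticDiff to try this" >&2
fi
```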
The Apache Software Foundation's Xerces-Perl project offers a Perl interface to the Xerces C++ XML parser. Xerces-Perl ships with several sample scripts that can be copied into your bin directory. The most notable difference between Xerces and the other XML parsers available to Perl is that it provides a way to validate XML documents against W3C XML Schemas.
Calculate the time needed to process an XML document while validating it against an XML Schema:
DOMCount.pl -v=auto -s mydoc.xml
A Visitor From Planet C -- xmllint
Users of XML::LibXML often aren't aware of the feature-rich command line XML processing tool, xmllint, which is installed with the C libraries that XML::LibXML depends upon. No, xmllint is not a Perl tool, but its many features, and the fact that it can be easily piped together with other tools, make it more than worthy of mention here.
Use the built-in HTML parser to convert ill-formed HTML to XML before further processing:
xmllint --html khampton_perl_xml_17.html | xpath "//a[@href]"
Or the same thing, but using the DocBook SGML parser:
xmllint --sgml ye-olde.sgml | xpath "//chapter[@id='chapt4']"
Use xmllint as a pretty-printer (the trailing - tells xmllint to read the document from STDIN):
cat some.xml | xmllint --format -
Use xmllint to validate a document against an external DTD:
cat some.xml | xmllint --postvalid --dtdvalid my.dtd -
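As a worked example of DTD validation, the following sketch builds a small document and a matching DTD, then checks one against the other. The file contents are invented for illustration. It adds `--noout` so that only validation messages are printed, leaves out `--postvalid` for brevity, and relies on xmllint's exit status to tell valid from invalid.

```shell
# Create a small DTD and a document that should conform to it.
workdir=$(mktemp -d)
cat > "$workdir/my.dtd" <<'EOF'
<!ELEMENT camelids (species+)>
<!ELEMENT species (#PCDATA)>
<!ATTLIST species name CDATA #REQUIRED>
EOF
cat > "$workdir/some.xml" <<'EOF'
<?xml version="1.0"?>
<camelids>
  <species name="Camelus dromedarius">Dromedary</species>
</camelids>
EOF

# Validate from STDIN ("-") against the external DTD; --noout suppresses
# the echo of the parsed document so only errors (if any) are shown.
if command -v xmllint >/dev/null 2>&1; then
    if xmllint --noout --dtdvalid "$workdir/my.dtd" - < "$workdir/some.xml"; then
        validity="valid"
    else
        validity="invalid"
    fi
else
    validity="unchecked"   # xmllint is not installed on this machine
fi
echo "some.xml is $validity"
```

Removing the name attribute from the species element is a quick way to see what a validation failure looks like.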
Devel::TraceSAX and XML::SAX::Machines
Requires: Devel::TraceSAX, XML::SAX, XML::SAX::Machines
While the syntax may be a bit verbose, it is entirely possible to use
XML::SAX::Machines to bring the power of Perl SAX2 to the command line.
Use XML::SAX::Machines to write an XML document to STDOUT after applying a custom SAX filter:
perl -MXML::SAX::Machines=Pipeline -e 'Pipeline("XML::MyFilter", \*STDOUT)->parse_uri("files/camelids.xml");'
Or, reading from STDIN,
cat files/camelids.xml | perl -MXML::SAX::Machines=Pipeline -e 'Pipeline("XML::MyFilter", \*STDOUT)->parse_string(join "", <STDIN>);'
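XML::MyFilter in the commands above is a placeholder for whatever filter class you have written. As a rough sketch of what such a class might look like, the script below writes a hypothetical XML::MyFilter that upper-cases all character data, assuming it is built on XML::SAX::Base, and then runs it through the same Pipeline one-liner. The filter's behavior and the sample document are invented for illustration.

```shell
# Write a hypothetical filter class where perl -I can find it.
workdir=$(mktemp -d)
mkdir -p "$workdir/XML"
cat > "$workdir/XML/MyFilter.pm" <<'EOF'
package XML::MyFilter;
use strict;
use warnings;
use base 'XML::SAX::Base';

# Upper-case the text of each characters event, then pass it downstream.
sub characters {
    my ($self, $data) = @_;
    $self->SUPER::characters({ %$data, Data => uc $data->{Data} });
}

1;
EOF

# A tiny input document to push through the filter.
printf '<camelids><species>llama</species></camelids>\n' > "$workdir/camelids.xml"

# Run the pipeline only if XML::SAX::Machines is installed.
if perl -MXML::SAX::Machines -e 1 2>/dev/null; then
    perl -I"$workdir" -MXML::SAX::Machines=Pipeline \
        -e 'Pipeline("XML::MyFilter", \*STDOUT)->parse_uri($ARGV[0]);' \
        "$workdir/camelids.xml"
else
    echo "XML::SAX::Machines not installed; filter saved in $workdir/XML" >&2
fi
```

Because the filter only overrides characters, every other SAX event passes through XML::SAX::Base's default handlers unchanged.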
When writing custom SAX filters, it is often very helpful to be able to examine which events are being generated and which classes they are being forwarded to. Barrie Slaymaker's Devel::TraceSAX makes this painless.
Debugging SAX events by tracing them through multiple filters:
perl -d:TraceSAX -MXML::SAX::Machines=Pipeline -e 'Pipeline("XML::Filter1", "XML::Filter2")->parse_uri("file.xml");'
Processing XML with Perl does not have to mean buying into a huge XML-centric application with a steep learning curve or departing from Perl's long history as a command line tool. You may not use all of the tools or techniques described here, but it is nice to know that they are available when and if you need them.