Getting Started with XOM
November 27, 2002
Elliotte Rusty Harold's new XML Object Model (XOM) is a simple, tree-based API for XML, written in Java. XOM attempts to build on good ideas from other Java XML APIs -- SAX, DOM, and JDOM -- and to leave behind some of their frustrations. The result is a high-level open-source API that is easy to learn and use, assuming that you are already familiar with Java and XML.
Unlike SAX, XOM is written with classes instead of interfaces, making it more straightforward to use. With SAX you must first implement interfaces before you can get it to work. This work is eased somewhat by helper classes like DefaultHandler; but overall, interfaces make programming in SAX somewhat more complex, even though they also make SAX uniform and flexible. XOM's classes provide some flexibility by offering a number of check methods that may be overridden in subclasses.
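To make the contrast concrete, here is a minimal sketch of the SAX style just described. It uses only the SAX classes bundled with the JDK, not XOM, and the class name ElementCounter and the sample string are made up for illustration. The point is that nothing useful happens until you subclass DefaultHandler and override a callback.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: counts start tags with a SAX handler.
class ElementCounter extends DefaultHandler {

    private int count = 0;

    // SAX calls this once for every start tag it encounters.
    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        count++;
    }

    // Parses the given XML text and returns the number of elements seen.
    static int countElements(String xml) {
        ElementCounter handler = new ElementCounter();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap the checked parser exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
        return handler.count;
    }

    public static void main(String[] args) {
        System.out.println(countElements("<instant><date/><time/></instant>"));
    }
}
```

DefaultHandler supplies empty implementations of the other callbacks, which is the "easing" mentioned above; without it you would have to implement every method of ContentHandler yourself.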
XOM does not stand by itself. It depends on an underlying SAX parser, such as a recent version of Xerces, to handle well-formedness checking and validation. XOM provides a simple interface to the parser, in effect hiding the parser setup code without much of a performance hit.
I like XOM for the same reasons I like RELAX NG: you can pick it up in a snap if you already have a reasonable familiarity with Java idioms. And, like RELAX NG, the more I use XOM, the more I like it. It is well considered and doesn't try to do everything or please everybody. For more information on XOM's relationship to other XML APIs, you can read a presentation that Elliotte gave at the New York XML SIG meeting on 17 September 2002.
On the other hand, if you are underwhelmed by XOM's simplicity, you can go back to your favorite old API or mix APIs, taking what you like from each. But if simplicity, openness, and ready availability are keys to the wide adoption of software, XOM has little problem measuring up to that standard.
Bear in mind that XOM is still a work in progress. This article only walks through part of the interface, but it should give you enough example code to get you well on your way.
The sample programs and documents discussed in this article are available for download in ZIP archive form. And you can read the Javadocs for nu.xom.* online.
To run the examples, your system must have:
- Java version 1.2 or later. I have tested the examples with Java version 1.4 in a Windows 2000 environment.
- Xerces version 2.1 or later. I have tested appropriate examples with Xerces 2.2.
- The latest XOM JAR file. The latest version at this writing is xom-1.0d8.jar.
Parsing a Document with XOM
Create a working directory and unzip the program archive there. Copy the Xerces and XOM JAR files there, too. The program Wf.java checks a document for XML 1.0 well-formedness:
import java.io.IOException;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.ParseException;

public class Wf {

    public static void main(String[] args)
        throws IOException, ParseException {

        // Parse the document named on the command line
        Builder builder = new Builder();
        Document doc = builder.build(args[0]);

        // Echo the parsed document
        System.out.println(doc.toXML());
    }
}
To compile the program, type the command:
javac -classpath xom.jar Wf.java
I've renamed the latest XOM JAR from "xom-1.0d8.jar" to "xom.jar" for simplicity. After you successfully compile the program, you can run it with the command that follows. It explicitly places the Xerces and XOM JARs on the classpath, making it evident what is going on. The semicolons are the Windows separator; use colons to separate the JAR files if you are working on a UNIX platform.
java -cp .;xercesImpl.jar;xom.jar Wf file:///wrk/inst.xml
The fully qualified file path for the argument to Wf may work more reliably than the filename alone, depending on your platform. If the program runs successfully and inst.xml proves to be well-formed (it should), the program will echo the input and add an XML declaration:
<?xml version="1.0"?>
<instant>
 <date month="December" day="1" year="2002"/>
 <time hour="10" minute="17" second="33" zone="PST"/>
</instant>
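If you switch between Windows and UNIX, the two platform-dependent parts of the run command (the classpath separator and the file URL) can be worked out by Java itself. The following is only a sketch using standard JDK calls; the class name RunCommand and its helper methods are hypothetical and are not part of the article's download.

```java
import java.io.File;

// Hypothetical helper: prints a platform-appropriate command line for Wf.
class RunCommand {

    // Joins classpath entries with ';' on Windows and ':' on UNIX.
    static String classpath(String... entries) {
        return String.join(File.pathSeparator, entries);
    }

    // Turns a plain filename into an absolute file: URL.
    static String fileUrl(String filename) {
        return new File(filename).toURI().toString();
    }

    public static void main(String[] args) {
        String cp = classpath(".", "xercesImpl.jar", "xom.jar");
        System.out.println("java -cp " + cp + " Wf " + fileUrl("inst.xml"));
    }
}
```

The URL is resolved against the current working directory, so run the helper from the directory that holds inst.xml.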
Wf.java imports three XOM classes: nu.xom.Builder, nu.xom.Document, and nu.xom.ParseException. Builder creates a document object by reading an XML document. It can pick up the document from a file (as shown), a URL, or an input stream. Builder's build() method actually reads the document. Document represents the document, including its document element and prolog. XML output is delivered by Document's toXML() method, with help from System.out.println(). The entire document is echoed using this mechanism, with an XML declaration thrown in as part of the parcel.
IOException and ParseException are checked exceptions, so the compiler requires the program to either declare or catch them. They are declared the easy way in Wf.java, that is, with a throws clause on main(). Wf2.java uses a try/catch statement instead.
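The same choice between declaring and catching applies to any Java parser. As a rough illustration of the try/catch style, the sketch below catches the checked exceptions instead of declaring them. It is not Wf2.java, and it uses the SAX parser built into the JDK rather than XOM. The class name WellFormed is hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: a well-formedness check that catches
// its checked exceptions rather than declaring them with throws.
class WellFormed {

    // Returns true if the text parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)),
                       new DefaultHandler());
            return true;
        } catch (SAXException e) {
            // The parser reports well-formedness errors as SAX exceptions.
            System.err.println("Not well-formed: " + e.getMessage());
            return false;
        } catch (IOException | ParserConfigurationException e) {
            // Reading the input or setting up the parser failed.
            System.err.println("Could not parse: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed("<instant><date/></instant>"));
        System.out.println(isWellFormed("<instant><date></instant>"));
    }
}
```

Catching the exceptions lets the program print a short message of its own instead of ending with a stack trace, at the cost of a few more lines.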
Validating a Document with XOM
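With just a few changes, you can add validation support. The following program (Val.java) has three additions: the import of nu.xom.ValidityException, the ValidityException entry in the throws clause, and the true argument to the Builder constructor.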
import java.io.IOException;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.ParseException;
import nu.xom.ValidityException;

public class Val {

    public static void main(String[] args)
        throws IOException, ParseException, ValidityException {

        // Passing true asks the builder to validate as it parses
        Builder builder = new Builder(true);
        Document doc = builder.build(args[0]);

        System.out.println(doc.toXML());
    }
}
When you add true as an argument to the Builder constructor, you create a builder that validates each document it parses. When this is the case, you also need to check for validity exceptions by importing nu.xom.ValidityException and either declaring it on main() or catching it in a try/catch statement (see Val2.java).
Compile this program and then run it against instant.xml with this command:
java -cp .;xercesImpl.jar;xom.jar Val file:///wrk/instant.xml
The document is validated against the DTD asserted in the document type declaration, instant.dtd:
<!ELEMENT instant (date, time)>
<!ELEMENT date EMPTY>
<!ATTLIST date month NMTOKEN #REQUIRED
               day NMTOKEN #REQUIRED
               year NMTOKEN #REQUIRED>
<!ELEMENT time EMPTY>
<!ATTLIST time hour NMTOKEN #REQUIRED
               minute NMTOKEN #REQUIRED
               second NMTOKEN #REQUIRED
               zone NMTOKEN #REQUIRED>
When you run Val, success is indicated when the program echoes its input:
<?xml version="1.0"?>
<!DOCTYPE instant SYSTEM "instant.dtd">
<instant>
 <date month="December" day="1" year="2002"/>
 <time hour="10" minute="17" second="33" zone="PST"/>
</instant>
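To see what the underlying parser does when validation is switched on, here is a sketch that uses the JDK's own SAX parser rather than XOM's Builder. For the sake of a self-contained example it assumes a cut-down DTD (element declarations only) placed in an internal subset instead of instant.dtd. The class name DtdCheck is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: DTD validation with the JDK's SAX parser.
class DtdCheck {

    // Assumption: a simplified DTD, carried in the internal subset.
    static final String DOCTYPE =
        "<!DOCTYPE instant [\n"
      + "<!ELEMENT instant (date, time)>\n"
      + "<!ELEMENT date EMPTY>\n"
      + "<!ELEMENT time EMPTY>\n"
      + "]>\n";

    // Returns true only if the body is well-formed and valid.
    static boolean isValid(String body) {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true);  // ask the parser to check the DTD
        try {
            factory.newSAXParser().parse(
                new InputSource(new StringReader(DOCTYPE + body)),
                new DefaultHandler() {
                    // SAX treats validity errors as recoverable and ignores
                    // them by default, so rethrow to stop the parse.
                    @Override
                    public void error(SAXParseException e)
                            throws SAXParseException {
                        throw e;
                    }
                });
            return true;
        } catch (Exception e) {
            // Any parse, validity, or setup problem counts as "not valid".
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<instant><date/><time/></instant>"));
        System.out.println(isValid("<instant><date/></instant>"));
    }
}
```

The second document is well-formed but omits the required time element, so only a validating parse rejects it.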
Adding Elements and Attributes
Suppose you picked up a copy of inst.xml and wanted to add an element with an attribute to it. The program AddUtc.java does just that:
import java.io.IOException;
import nu.xom.Attribute;
import nu.xom.Builder;
import nu.xom.Document;
import nu.xom.Element;
import nu.xom.ParseException;

public class AddUtc {

    public static void main(String[] args)
        throws IOException, ParseException {

        Builder builder = new Builder();
        Document doc = builder.build("inst.xml");
        Element root = doc.getRootElement();

        // Build the utc element and its offset attribute
        Element utc = new Element("utc");
        Attribute att = new Attribute("offset", "-08:00");
        utc.addAttribute(att);

        // Insert whitespace and the new element at the top of the root
        root.insertChild(0, "\n ");
        root.insertChild(1, utc);

        // Remove the whitespace before time, then time itself
        root.removeChild(4);
        root.removeChild(4);

        System.out.println(doc.toXML());
    }
}
This program imports the Element and Attribute classes from the nu.xom package. Instead of using a command line argument to pick up a file to parse, it is hardcoded to grab inst.xml. It uses the getRootElement() method from the Document class to determine the document element of inst.xml.
A utc element is created along with an offset attribute using the Attribute class. The addAttribute() method from Element adds this attribute to the utc element. Calling insertChild() inserts a text child (a newline and a space) at position 0, immediately after the start tag of the root element. Following that, insertChild() places the utc element at position 1.
The code also removes the time element (and preceding whitespace) by using the removeChild() method twice with the same argument value. (The argument represents a node position.) After XOM removes the first node (two contiguous whitespace characters), the following node (the time element) moves up in the tree to the position previously occupied by the whitespace.
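The index arithmetic is easier to follow on a plain list. The sketch below models the children of instant as strings in a java.util.List; it is not XOM code, and it assumes that inst.xml has whitespace text nodes before, between, and after its two elements, as the positions used in AddUtc.java imply. The class name ChildShift is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical example class: models child positions with a plain list.
class ChildShift {

    static List<String> edit() {
        // Assumed children of <instant>; "ws" stands for a whitespace node.
        List<String> children = new ArrayList<>(
            Arrays.asList("ws", "date", "ws", "time", "ws"));

        children.add(0, "ws");   // like insertChild(0, "\n ")
        children.add(1, "utc");  // like insertChild(1, utc)

        children.remove(4);      // removes the whitespace before time
        children.remove(4);      // time has moved up to index 4, so it goes next
        return children;
    }

    public static void main(String[] args) {
        System.out.println(edit());  // prints [ws, utc, ws, date, ws]
    }
}
```

Because every later child shifts down by one after a removal, removing position 4 twice deletes two adjacent nodes.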
The result looks like this (utc.xml):
<?xml version="1.0"?>
<instant>
 <utc offset="-08:00" />
 <date month="December" day="1" year="2002" />
</instant>
Serializing Output
You can use the Serializer class to encode output, format it, or send it to a file, among other things. The Time.java program shows you how to do this.
import java.io.FileOutputStream;
import java.io.IOException;
import nu.xom.Attribute;
import nu.xom.Builder;
import nu.xom.Element;
import nu.xom.Document;
import nu.xom.Serializer;
import nu.xom.ParseException;

public class Time {

    public static void main(String[] args)
        throws IOException, ParseException {

        Builder builder = new Builder();
        Document doc = builder.build("inst.xml");
        Element root = doc.getRootElement();

        // Add utc and remove the old time element, as in AddUtc.java
        Element utc = new Element("utc");
        Attribute att = new Attribute("offset", "-08:00");
        utc.addAttribute(att);
        root.insertChild(0, "\n ");
        root.insertChild(1, utc);
        root.removeChild(4);
        root.removeChild(4);

        // Build a new time element with element children
        Element time = new Element("time");
        Element hr = new Element("hour");
        time.appendChild(hr);
        hr.appendChild("10");
        Element min = new Element("minute");
        time.appendChild(min);
        min.appendChild("17");
        Element sec = new Element("second");
        time.appendChild(sec);
        sec.appendChild("33");
        Element zone = new Element("zone");
        time.appendChild(zone);
        zone.appendChild("PST");
        root.appendChild(time);

        // Write the document to a file, encoded and indented
        FileOutputStream out = new FileOutputStream("inst-new.xml");
        Serializer ser = new Serializer(out, "ISO-8859-1");
        ser.setIndent(1);
        ser.write(doc);
    }
}
The program creates five elements (time and its four children) and appends the time element after the last remaining element child of instant, which happens to be the date element (the old time element having been removed). The FileOutputStream class is also imported and an output file is created (inst-new.xml). The constructor for Serializer specifies an output stream and a character encoding (ISO-8859-1). Serializer also supports encoding for UTF-8, UTF-16, ISO-10646-UCS-2, and ISO-8859-2 through ISO-8859-16. The setIndent() method indents child nodes by a line feed plus one space character. The write() method writes the document to the file inst-new.xml:
<?xml version="1.0" encoding="ISO-8859-1"?>
<instant>
 <utc offset="-08:00"/>
 <date month="December" day="1" year="2002"/>
 <time>
  <hour>10</hour>
  <minute>17</minute>
  <second>33</second>
  <zone>PST</zone>
 </time>
</instant>
Without Serializer, the output of the new elements would appear without indentation, as in time2.xml (see Time2.java):
<?xml version="1.0"?>
<instant>
 <utc offset="-08:00" />
 <date month="December" day="1" year="2002" />
<time><hour>10</hour><minute>17</minute><second>33</second>
<zone>PST</zone></time></instant>
You could also send the new XML document to standard output instead of a file (see the Serializer constructor in Time3.java).
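For comparison, the same two knobs (an output encoding and indentation) exist in the JDK's own serialization path. The sketch below is not XOM: it uses a DOM tree and javax.xml.transform.Transformer as a stand-in for Serializer, and writes to an in-memory stream rather than a file. The class name EncodedOutput is hypothetical, and the exact whitespace the Transformer emits when indenting depends on the JDK version.

```java
import java.io.ByteArrayOutputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class: serializes a small DOM tree in a given encoding.
class EncodedOutput {

    static byte[] write(String encoding) {
        try {
            // Build <time><hour>10</hour></time> in memory.
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element time = doc.createElement("time");
            Element hour = doc.createElement("hour");
            hour.setTextContent("10");
            time.appendChild(hour);
            doc.appendChild(time);

            // Ask for a specific encoding and for indented output.
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.ENCODING, encoding);
            t.setOutputProperty(OutputKeys.INDENT, "yes");

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            t.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            // Wrap the checked JAXP exceptions to keep the sketch short.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.write(write("ISO-8859-1"));
        System.out.flush();
    }
}
```

As with Serializer, the encoding you request is both applied to the bytes and recorded in the XML declaration, so the two cannot drift apart.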
One More Program
This last program, Final.java, adds several other common structures to the XML document:
import java.io.FileOutputStream;
import java.io.IOException;
import nu.xom.Attribute;
import nu.xom.Builder;
import nu.xom.Comment;
import nu.xom.DocType;
import nu.xom.Element;
import nu.xom.Document;
import nu.xom.Serializer;
import nu.xom.Text;
import nu.xom.ProcessingInstruction;
import nu.xom.ParseException;

public class Final {

    public static void main(String[] args)
        throws IOException, ParseException {

        Builder builder = new Builder();
        Document doc = builder.build("inst.xml");
        Element root = doc.getRootElement();

        // Add a document type declaration and a processing instruction
        DocType dtd = new DocType("instant", "final.dtd");
        ProcessingInstruction pi = new ProcessingInstruction("xml-stylesheet",
            "href=\"final.xsl\" type=\"text/xsl\"");
        doc.insertChild(0, dtd);
        doc.insertChild(1, pi);

        // Add a comment and a namespaced utc element; remove the old time
        Element utc = new Element("utc", "http://www.wyeast.net/utc");
        Comment gmt = new Comment(" Greenwich Mean Time ");
        Attribute att = new Attribute("offset", "-08:00");
        utc.addAttribute(att);
        root.insertChild(0, "\n ");
        root.insertChild(1, gmt);
        root.insertChild(2, "\n ");
        root.insertChild(3, utc);
        root.removeChild(6);
        root.removeChild(6);

        // Build the new time element
        Element time = new Element("time");
        Element hr = new Element("hour");
        time.appendChild(hr);
        Text h = new Text("11");
        h.setData("10");
        hr.appendChild(h);
        Element min = new Element("minute");
        time.appendChild(min);
        min.appendChild("17");
        Element sec = new Element("second");
        time.appendChild(sec);
        sec.appendChild("33");
        Element zone = new Element("zone", "urn:wyeast-net:utc");
        zone.setNamespaceURI("http://www.wyeast.net/utc");
        time.appendChild(zone);
        zone.appendChild("PST");
        root.appendChild(time);

        // Serialize as UTF-8 with a three-space indent
        FileOutputStream out = new FileOutputStream("final.xml");
        Serializer ser = new Serializer(out, "UTF-8");
        ser.setIndent(3);
        ser.write(doc);
    }
}
This program creates both a document type declaration and a processing instruction, then inserts them into the prolog of final.xml. A namespace is declared for the utc element and a comment is inserted just above it. The text child or content of the hour element is set with the Text class; then it's changed with the setData() method of Text. Another namespace is set for zone in its Element constructor and then altered with the setNamespaceURI() method.
Here is the file that this program outputs:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE instant SYSTEM "final.dtd">
<?xml-stylesheet href="final.xsl" type="text/xsl"?>
<instant>
   <!-- Greenwich Mean Time -->
   <utc xmlns="http://www.wyeast.net/utc" offset="-08:00"/>
   <date month="December" day="1" year="2002"/>
   <time>
      <hour>10</hour>
      <minute>17</minute>
      <second>33</second>
      <zone xmlns="http://www.wyeast.net/utc">PST</zone>
   </time>
</instant>
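If you know DOM, two of these operations have close counterparts there, which may help place XOM's versions. The sketch below uses only the JDK's DOM classes, not XOM, and the class name DomNodes is hypothetical. It changes a text node's content with setData() and creates an element in a namespace; unlike XOM's setNamespaceURI(), it fixes the namespace when the element is created.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;

// Hypothetical example class: DOM counterparts of Text and namespaces.
class DomNodes {

    // Builds <time><hour>10</hour><zone xmlns="...">PST</zone></time>.
    static Element build() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().newDocument();
            Element time = doc.createElement("time");

            // Create a text node, then replace its content.
            Element hour = doc.createElement("hour");
            Text h = doc.createTextNode("11");
            h.setData("10");
            hour.appendChild(h);
            time.appendChild(hour);

            // Create an element that belongs to a namespace.
            Element zone = doc.createElementNS(
                "http://www.wyeast.net/utc", "zone");
            zone.setTextContent("PST");
            time.appendChild(zone);
            return time;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element time = build();
        System.out.println(time.getFirstChild().getTextContent());
        System.out.println(time.getLastChild().getNamespaceURI());
    }
}
```

Note that DOM nodes must be created through a Document factory method, whereas the XOM program above constructs Element and Text objects directly with new.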
Wrapping Up
It's worth noting that XOM avoids convenience methods like the plague. But it is flexible enough to allow users to write their own methods in subclasses. XOM also has several other packages, which I haven't discussed in this article: nu.xom.canonical, a serializer for outputting canonical XML; nu.xom.xslt, supporting XSLT transformations for TrAX-aware processors, such as Saxon; and nu.xom.xinclude, an implementation of XML Inclusions.
I've found XOM to be simple and straightforward. It offers me a lot of functionality without much fuss. If you have any suggestions for XOM's development, contribute them by subscribing to the XOM-interest mailing list.