REXML: Processing XML in Ruby
November 9, 2005
REXML (Ruby Electric XML) is the XML processor of choice for Ruby programmers. It comes bundled with the standard Ruby distribution. It's fast, written in Ruby, and can be used in two ways: tree parsing and stream parsing. In this article, we show some basic constructs for XML processing with REXML. We also introduce irb, Ruby's interactive shell, as a tool for exploring XML documents with the help of REXML.
We'll be using a DocBook bibliography file as our example XML document. You will learn how to parse the document with the tree parsing API, how to access elements and attributes, and how to create and insert elements. We'll also look into the peculiarities of text nodes and entity processing. Finally, we will show an example of the stream parsing API. This is our DocBook file:
Listing 1: The bibliography.xml file
Beginning with Tree Parsing
We start with the tree parsing API, which is very DOM-like, but more intuitive. This is our first code example:
Listing 2: Showing an XML File (code1.rb)
# Load the REXML library and mix in its namespace.
require 'rexml/document'
include REXML
# Parse the file into a Document and print it.
file = File.new("bibliography.xml")
doc = Document.new(file)
puts doc
The require statement loads the REXML library. Then we include the REXML namespace, so we don't have to use names like REXML::Document all the time. We open an existing file named bibliography.xml and parse the XML source code, with the result in a Document object. Finally we show the document on the screen. When you execute the command ruby code1.rb, the source code of our bibliography XML document is shown.
You may get this error message:
code1.rb:1:in `require': No such file to load -- rexml/document (LoadError)
from code1.rb:1
In that case, your Ruby installation doesn't include REXML: some package managers, such as Debian's APT, install the libraries as separate packages. Install the rexml package and try again.
The Document.new method takes an IO, Document or String object as its argument. The argument specifies the source from which we want to read an XML document. In our first example, we used an IO object, namely the File object, which inherits from the IO class. Another descendant of IO is the Socket class, which can be used with Document.new to get an XML document over a network connection.
If the Document constructor takes a Document as its argument, the new Document is a shallow clone: the context and attributes of the original are copied, but its child nodes are not. If the constructor takes a String argument, the string is expected to contain an XML document. An example:
Listing 3: Showing an XML "Here Document" (code2.rb)
require 'rexml/document'
include REXML
string = <<EOF
<?xml version="1.0" encoding="ISO-8859-15"?>
<!DOCTYPE bibliography PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"
  "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">
<bibliography>
  <biblioentry id="FHIW13C-1234">
    <author>
      <firstname>Godfrey</firstname>
      <surname>Vesey</surname>
    </author>
    <title>Personal Identity: A Philosophical Analysis</title>
    <publisher>
      <publishername>Cornell University Press</publishername>
    </publisher>
    <pubdate>1977</pubdate>
  </biblioentry>
</bibliography>
EOF
doc = Document.new(string)
puts doc
We use a "here document" string: all characters between <<EOF and EOF, with newlines included, are part of the string.
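The two most common kinds of source can be compared side by side. The following is a minimal sketch, not part of the original listings: the XML snippet is a shortened, hypothetical stand-in for the bibliography, and StringIO is used here only as a convenient IO-like object in place of a File or a socket.

```ruby
require 'rexml/document'
require 'stringio'

# Hypothetical, shortened stand-in for the bibliography file.
xml = <<EOF
<bibliography>
  <biblioentry id="FHIW13C-1234">
    <title>Personal Identity: A Philosophical Analysis</title>
  </biblioentry>
</bibliography>
EOF

# Source 1: a String that holds the XML text.
from_string = REXML::Document.new(xml)

# Source 2: an IO-like object; StringIO stands in for a File or socket.
from_io = REXML::Document.new(StringIO.new(xml))

puts from_string.root.name
puts from_io.root.elements["biblioentry/title"].text
```

Both calls build the same kind of Document, so the rest of the tree API works identically whichever source was used.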
Accessing Elements and Attributes
From now on, we will use irb, Ruby's interactive shell, for examples of the use of the REXML library. At the irb prompt, we load the file bibliography.xml into a document. After that, we can execute commands to access the elements and attributes of the document interactively.
koan$ irb
irb(main):001:0> require 'rexml/document'
=> true
irb(main):002:0> include REXML
=> Object
irb(main):003:0> doc = Document.new(File.new("bibliography.xml"))
=> <UNDEFINED> ... </>
Now you can explore the document very easily. Let's look at a typical irb session with our XML file:
irb(main):004:0> root = doc.root
=> <bibliography id='personal_identity'> ... </>
irb(main):005:0> root.attributes['id']
=> "personal_identity"
irb(main):006:0> puts root.elements[1].elements["author"]
<author>
  <firstname>Godfrey</firstname>
  <surname>Vesey</surname>
</author>
irb(main):007:0> puts root.elements["biblioentry[1]/author"]
<author>
  <firstname>Godfrey</firstname>
  <surname>Vesey</surname>
</author>
irb(main):008:0> puts root.elements["biblioentry[@id='FHIW13C-1260']"]
<biblioentry id='FHIW13C-1260'>
  <author>
    <firstname>Sydney</firstname>
    <surname>Shoemaker</surname>
  </author>
  <author>
    <firstname>Richard</firstname>
    <surname>Swinburne</surname>
  </author>
  <title>Personal Identity</title>
  <publisher>
    <publishername>Basil Blackwell</publishername>
  </publisher>
  <pubdate>1984</pubdate>
</biblioentry>
=> nil
irb(main):009:0> root.each_element('//author') {|author| puts author}
<author>
  <firstname>Godfrey</firstname>
  <surname>Vesey</surname>
</author>
<author>
  <firstname>René</firstname>
  <surname>Marres</surname>
</author>
<author>
  <firstname>James</firstname>
  <surname>Baillie</surname>
</author>
<author>
  <firstname>Brian</firstname>
  <surname>Garrett</surname>
</author>
<author>
  <firstname>John</firstname>
  <surname>Perry</surname>
</author>
<author>
  <firstname>Geoffrey</firstname>
  <surname>Madell</surname>
</author>
<author>
  <firstname>Sydney</firstname>
  <surname>Shoemaker</surname>
</author>
<author>
  <firstname>Richard</firstname>
  <surname>Swinburne</surname>
</author>
<author>
  <firstname>Jonathan</firstname>
  <surname>Glover</surname>
</author>
<author>
  <firstname>Harold</firstname>
  <othername>W.</othername>
  <surname>Noonan</surname>
</author>
=> [<author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>, <author> ... </>]
First we use the name root to access the document root, which here is the bibliography element. Each Element object has an Attributes object named attributes, which acts as a hash map with the attribute names as keys and the attribute values as values. So with root.attributes['id'] we get the value of the id attribute of the root element. In the same manner, each Element object has an Elements object named elements, with each and [] methods that give access to the subelements. The [] method takes an index or an XPath expression as its argument and returns the first child element that matches it. The XPath expression acts like a filter, deciding which elements will be returned. Note that root.elements[1] is the first child element, because XPath indexes start from 1, not from 0. In fact, root.elements[1] is equivalent to root.elements["*[1]"], where *[1] is the XPath expression for the first child. The each method of the Elements class iterates through all the child elements, optionally filtering them by a given XPath expression. The code block is executed once for each matching element. In addition, the method Element.each_element is a shorthand notation for Element.elements.each.
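The same access patterns also work outside irb, in an ordinary script. The sketch below is illustrative only: the inline XML is a cut-down, hypothetical stand-in for bibliography.xml, and the variable names are ours.

```ruby
require 'rexml/document'

# Cut-down, hypothetical stand-in for bibliography.xml.
doc = REXML::Document.new(<<EOF)
<bibliography id="personal_identity">
  <biblioentry id="FHIW13C-1234">
    <author><firstname>Godfrey</firstname><surname>Vesey</surname></author>
    <title>Personal Identity: A Philosophical Analysis</title>
  </biblioentry>
  <biblioentry id="FHIW13C-1260">
    <author><firstname>Sydney</firstname><surname>Shoemaker</surname></author>
    <author><firstname>Richard</firstname><surname>Swinburne</surname></author>
    <title>Personal Identity</title>
  </biblioentry>
</bibliography>
EOF

root = doc.root

# Attributes behave like a hash keyed by attribute name.
root_id = root.attributes['id']

# An integer index and the XPath expression "*[1]" both select the first child.
first_by_index = root.elements[1]
first_by_xpath = root.elements["*[1]"]

# An XPath predicate picks an entry by attribute value.
entry = root.elements["biblioentry[@id='FHIW13C-1260']"]

# each_element yields every element that matches the XPath expression.
surnames = []
root.each_element('//author/surname') { |surname| surnames << surname.text }

puts root_id
puts first_by_index.attributes['id']
puts first_by_xpath.attributes['id']
puts entry.elements["title"].text
puts surnames.join(", ")
```

Collecting the text values into an array, as done with surnames here, is often more convenient in a script than printing inside the block.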
Creating and Inserting Elements and Attributes
Now we will create a small bibliography document, consisting of one biblioentry, from scratch. Here's how it goes:
irb(main):010:0> doc2 = Document.new
=> <UNDEFINED/>
irb(main):011:0> doc2.add_element("bibliography", {"id" => "philosophy"})
=> <bibliography id='philosophy'/>
irb(main):012:0> doc2.root.add_element("biblioentry")
=> <biblioentry/>
irb(main):013:0> biblioentry = doc2.root.elements[1]
=> <biblioentry/>
irb(main):014:0> author = Element.new("author")
=> <author/>
irb(main):015:0> author.add_element("firstname")
=> <firstname/>
irb(main):016:0> author.elements["firstname"].text = "Bertrand"
=> "Bertrand"
irb(main):017:0> author.add_element("surname")
=> <surname/>
irb(main):018:0> author.elements["surname"].text = "Russell"
=> "Russell"
irb(main):019:0> biblioentry.elements << author
=> <author> ... </>
irb(main):020:0> title = Element.new("title")
=> <title/>
irb(main):021:0> title.text = "The Problems of Philosophy"
=> "The Problems of Philosophy"
irb(main):022:0> biblioentry.elements << title
=> <title> ... </>
irb(main):023:0> biblioentry.elements << Element.new("pubdate")
=> <pubdate/>
irb(main):024:0> biblioentry.elements["pubdate"].text = "1912"
=> "1912"
irb(main):025:0> biblioentry.add_attribute("id", "ISBN0-19-285423-2")
=> "ISBN0-19-285423-2"
irb(main):026:0> puts doc2
<bibliography id='philosophy'>
  <biblioentry id='ISBN0-19-285423-2'>
    <author>
      <firstname>Bertrand</firstname>
      <surname>Russell</surname>
    </author>
    <title>The Problems of Philosophy</title>
    <pubdate>1912</pubdate>
  </biblioentry>
</bibliography>
=> nil
As you see, we create a new, empty document and add one element to it. This element becomes the root. The add_element method takes the name of the element as its first argument and, optionally, a hash map of attribute name/value pairs as its second. So this method adds a new child to the document or to an element, optionally setting attributes of the new element.
You can also make a new element, like we do with the author element, and then add it afterwards to another element: if the add_element method gets an Element object, the object will be added to the parent element. Instead of the add_element method, you can also use the << method on Element.elements. The two methods return the added element.
In addition, with the method add_attribute, you can add an attribute to an existing element. The first parameter is the attribute name, the second is the attribute value. The method returns the attribute that was added. The text value of an element can easily be set with Element.text, or alternatively added with the add_text method.
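Because add_element returns the element it adds, the construction steps can be chained in a script. This is a compact sketch of the same ideas rather than a replacement for the session above; it uses only the calls already introduced (add_element, add_text, text and <<).

```ruby
require 'rexml/document'

doc = REXML::Document.new
root = doc.add_element("bibliography", {"id" => "philosophy"})

# add_element returns the new child, so we can keep a reference to it.
entry = root.add_element("biblioentry", {"id" => "ISBN0-19-285423-2"})

# Build a detached element first, then attach it with <<.
author = REXML::Element.new("author")
author.add_element("firstname").text = "Bertrand"
author.add_element("surname").add_text("Russell")
entry << author

entry.add_element("title").text = "The Problems of Philosophy"

puts doc
```

Passing the attribute hash directly to add_element avoids a separate add_attribute call when the attributes are known up front.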
If you want to insert an element at a specific location, you can use the methods insert_before and insert_after:
irb(main):027:0> publisher = Element.new("publisher")
=> <publisher/>
irb(main):028:0> publishername = Element.new("publishername")
=> <publishername/>
irb(main):029:0> publishername.add_text("Oxford University Press")
=> <publishername> ... </>
irb(main):030:0> publisher << publishername
=> <publishername> ... </>
irb(main):031:0> doc2.root.insert_before("//pubdate", publisher)
=> <bibliography id='philosophy'> ... </>
irb(main):032:0> puts doc2
<bibliography id='philosophy'>
  <biblioentry id='ISBN0-19-285423-2'>
    <author>
      <firstname>Bertrand</firstname>
      <surname>Russell</surname>
    </author>
    <title>The Problems of Philosophy</title>
    <publisher>
      <publishername>Oxford University Press</publishername>
    </publisher>
    <pubdate>1912</pubdate>
  </biblioentry>
</bibliography>
=> nil
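The session above uses insert_before; insert_after works the same way with the new element placed behind the reference node. A small sketch, assuming a reduced, hypothetical biblioentry as the starting point:

```ruby
require 'rexml/document'

# Hypothetical, reduced entry to keep the example short.
doc = REXML::Document.new(
  "<biblioentry><title>The Problems of Philosophy</title><pubdate>1912</pubdate></biblioentry>")
entry = doc.root

publisher = REXML::Element.new("publisher")
publisher.add_element("publishername").text = "Oxford University Press"

# Place the publisher right after the title. The first argument is an
# XPath string here, evaluated relative to entry.
entry.insert_after("title", publisher)

names = entry.elements.to_a.map { |e| e.name }
puts names.join(" ")
```

The resulting child order is the same as in the session: title, then publisher, then pubdate.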
Deleting Elements and Attributes
The add_element and add_attribute methods have their counterparts for deleting elements and attributes, respectively. This is how it goes with attributes:
irb(main):033:0> doc2.root.delete_attribute('id')
=> <bibliography> ... </>
irb(main):034:0> puts doc2
<bibliography>
  <biblioentry id='ISBN0-19-285423-2'>
    <author>
      <firstname>Bertrand</firstname>
      <surname>Russell</surname>
    </author>
    <title>The Problems of Philosophy</title>
    <publisher>
      <publishername>Oxford University Press</publishername>
    </publisher>
    <pubdate>1912</pubdate>
  </biblioentry>
</bibliography>
=> nil
The delete_attribute method returns the removed attribute. The delete_element method can take an Element object, a string or an index as its argument:
irb(main):034:0> doc2.delete_element("//publisher")
=> <publisher> ... </>
irb(main):035:0> puts doc2
<bibliography>
  <biblioentry id='ISBN0-19-285423-2'>
    <author>
      <firstname>Bertrand</firstname>
      <surname>Russell</surname>
    </author>
    <title>The Problems of Philosophy</title>
    <pubdate>1912</pubdate>
  </biblioentry>
</bibliography>
=> nil
irb(main):036:0> doc2.root.delete_element(1)
=> <biblioentry id='ISBN0-19-285423-2'> ... </>
irb(main):037:0> puts doc2
<bibliography/>
=> nil
The first delete_element invocation in our example uses an XPath expression to locate the element that has to be deleted. The second time we use the index 1, meaning the first element in the document root will be deleted. The delete_element method returns the removed element.
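The third form, passing an Element object, is not shown in the session. The sketch below illustrates it next to the XPath form, using a small hypothetical document of our own:

```ruby
require 'rexml/document'

# Hypothetical, minimal document for illustration.
doc = REXML::Document.new(
  "<bibliography id='philosophy'><biblioentry><title>A</title><pubdate>1912</pubdate></biblioentry></bibliography>")
root = doc.root

# Remove the id attribute of the root element.
root.delete_attribute("id")

# Delete by XPath string; the removed element is returned.
removed = root.delete_element("//pubdate")

# Delete by Element object: look the element up first, then pass it in.
entry = root.elements["biblioentry"]
title = entry.elements["title"]
entry.delete_element(title)

puts removed.name
puts doc
```

Deleting by object is handy when the element is already at hand, for example inside an each_element block, and no second XPath lookup is wanted.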
Text Nodes and Entity Processing
We already used text nodes in the previous examples. In this section we look at a more advanced topic: how REXML handles entities. REXML is a non-validating parser, and therefore is not required to expand external entities. So external entities aren't replaced by their values, but internal entities are: when REXML parses an XML document, it processes the DTD and creates a table of the internal entities with their values. When one of these entities occurs in the document, REXML replaces it with its value. An example:
irb(main):038:0> doc3 = Document.new('<!DOCTYPE testentity [
irb(main):039:1' <!ENTITY entity "test">]>
irb(main):040:1' <testentity>&entity; the entity</testentity>')
=> <UNDEFINED> ... </>
irb(main):041:0> puts doc3
<!DOCTYPE testentity [
<!ENTITY entity "test">]>
<testentity>&entity; the entity</testentity>
=> nil
irb(main):042:0> doc3.root.text
=> "test the entity"
You see that the XML document, when printed, correctly contains the entity. When you access the text, the entity &entity; gets expanded correctly to "test".
However, REXML uses lazy evaluation of the entities. As a result, the following problem occurs:
irb(main):043:0> doc3.root.text = "test the &entity;"
=> "test the &entity;"
irb(main):044:0> puts doc3
<!DOCTYPE testentity [
<!ENTITY entity "test">
]>
<testentity>&entity; the &entity;</testentity>
=> nil
irb(main):045:0> doc3.root.text
=> "test the test"
As you see, the text "test the &entity;" is changed to "&entity; the &entity;": on output, every occurrence of the entity's value is written as a reference to the entity. If you later change the value of the entity, the result will differ from what you expect: more will change in the document than you intended. If this is problematic for your application, you can set the :raw flag on any Text or Element node, even on the Document node. The entities in that node won't be processed, so you have to deal with entities yourself. An example:
irb(main):046:0> doc3 = Document.new('<!DOCTYPE testentity [
irb(main):047:1' <!ENTITY entity "test">]>
irb(main):048:1' <testentity>test the &entity;</testentity>', {:raw => :all})
=> <UNDEFINED> ... </>
irb(main):049:0> puts doc3
<!DOCTYPE testentity [
<!ENTITY entity "test">
]>
<testentity>test the &entity;</testentity>
=> nil
irb(main):050:0> doc3.root.text
=> "test the test"
The entities &amp;amp;, &amp;lt;, &amp;gt;, &amp;quot;, and &amp;apos; (for &, <, >, ", and ') are processed automatically. Moreover, if you write one of these characters in a Text node or attribute, REXML converts it to its entity equivalent, e.g. &amp;amp; for &.
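The round trip for the predefined entities can be checked in a few lines. This is a minimal sketch with a hypothetical note element; it only exercises & and <.

```ruby
require 'rexml/document'

# Writing: special characters in text are escaped when serialized.
note = REXML::Element.new("note")
note.text = "Fish & chips < 5"

serialized = note.to_s   # markup, with & and < written as entities
value = note.text        # plain value, as originally assigned

# Reading: predefined entities in parsed text are expanded.
parsed = REXML::Document.new("<note>1 &lt; 2 &amp; 3</note>")
parsed_value = parsed.root.text

puts serialized
puts value
puts parsed_value
```

So application code can work with plain characters throughout and leave the escaping to REXML.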
Stream Parsing
Stream parsing is faster than tree parsing. If speed matters, the stream parser can be handy. However, features such as XPath are not available. You have to supply a listener class and each time REXML encounters an event (start tag, end tag, text, etc.), the listener will be notified of the event. An example program:
Listing 4: The Stream Parser in Action (code3.rb)
require 'rexml/document'
require 'rexml/streamlistener'
include REXML

# The listener receives a callback for each parsing event.
class Listener
  include StreamListener

  # Called for every start tag.
  def tag_start(name, attributes)
    puts "Start #{name}"
  end

  # Called for every end tag.
  def tag_end(name)
    puts "End #{name}"
  end
end

listener = Listener.new
parser = Parsers::StreamParser.new(File.new("bibliography2.xml"), listener)
parser.parse
The file bibliography2.xml is the following:
Listing 5: The bibliography2.xml File
Running code3.rb gives this output:
koan$ ruby code3.rb
Start bibliography
Start biblioentry
Start author
Start firstname
End firstname
Start surname
End surname
End author
Start title
End title
Start publisher
Start publishername
End publishername
End publisher
Start pubdate
End pubdate
End biblioentry
End bibliography
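A listener usually needs to keep some state between events, because the stream parser itself keeps no tree. The following sketch, with a hypothetical TitleCollector class and an inline one-entry document, gathers the text of every title element by also handling the text event:

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

# Hypothetical listener: collects the text of every <title> element.
class TitleCollector
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles = []
    @in_title = false
  end

  # Start a new, empty title when a <title> tag opens.
  def tag_start(name, attributes)
    if name == "title"
      @in_title = true
      @titles << String.new
    end
  end

  # Append character data only while inside a <title>.
  def text(value)
    @titles.last << value if @in_title
  end

  def tag_end(name)
    @in_title = false if name == "title"
  end
end

xml = "<bibliography><biblioentry><title>Personal Identity</title>" \
      "<pubdate>1984</pubdate></biblioentry></bibliography>"

collector = TitleCollector.new
REXML::Parsers::StreamParser.new(xml, collector).parse
puts collector.titles
```

The @in_title flag plays the role that an XPath expression would play in the tree API: it decides which text events are of interest.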
In Conclusion
Ruby and XML make a great team. The REXML XML processor allows one to create, access, and modify XML documents in a very intuitive way. With the help of irb, Ruby's interactive shell, we can also explore XML documents very easily.
Related Links
- Code for this article: code.tgz
- REXML website
- REXML API documentation
- Ruby-lang website
- Some IRB tips