REXML: Processing XML in Ruby

November 9, 2005

Koen Vervloesem

REXML (Ruby Electric XML) is the XML processor of choice for Ruby programmers. It comes bundled with the standard Ruby distribution. It's fast, written in Ruby, and can be used in two ways: tree parsing and stream parsing. In this article, we show some basic constructs for XML processing with REXML. We also introduce the use of irb, Ruby's interactive shell, for exploring XML documents with the help of REXML.

We'll be using a DocBook bibliography file as our example XML document. You will learn how to parse the document with the tree parsing API, how to access elements and attributes, and how to create and insert elements. We'll also look into the peculiarities of text nodes and entity processing. Finally, we will show an example use of the stream parsing API. This is our DocBook file:

Listing 1: The bibliography.xml file

bibliography.xml

Beginning with Tree Parsing

We start with the tree parsing API, which is very DOM-like, but more intuitive. This is our first code example:

Listing 2: Showing an XML File (code1.rb)

require 'rexml/document'

include REXML

file = File.new("bibliography.xml")

doc = Document.new(file)

puts doc

The require statement loads the REXML library. Then we include the REXML namespace, so we don't have to use names like REXML::Document all the time. We open an existing file named bibliography.xml and parse the XML source code, with the result in a Document object. Finally we show the document on the screen. When you execute the command ruby code1.rb, the source code of our bibliography XML document is shown.

It's possible that you may get this error message:

code1.rb:1:in `require': No such file to load 

  -- rexml/document (LoadError)

        from code1.rb:1

In that case, your Ruby installation doesn't include REXML: some package managers, such as Debian's APT, install the libraries as separate packages. Install the rexml package and try again.

The Document.new method takes an IO, Document or String object as its argument. The argument specifies the source from which we want to read an XML document. In our first example, we used an IO object, namely a File object, whose class inherits from IO. Another descendant of IO is the Socket class, which can be used with Document.new to get an XML document over a network connection.

If the Document constructor takes a Document as its argument, the context and attributes of that document are cloned to the new Document object, but its child nodes are not copied. If the constructor takes a String argument, the string is expected to contain an XML document. An example:

Listing 3: Showing an XML "Here Document" (code2.rb)

require 'rexml/document'

include REXML

string = <<EOF

<?xml version="1.0" encoding="ISO-8859-15"?>

<!DOCTYPE bibliography PUBLIC "-//OASIS//DTD DocBook XML V4.2//EN"

    "http://www.oasis-open.org/docbook/xml/4.2/docbookx.dtd">

<bibliography>

    <biblioentry id="FHIW13C-1234">

      <author>

        <firstname>Godfrey</firstname>

        <surname>Vesey</surname>

      </author>

      <title>Personal Identity: A Philosophical Analysis</title>

      <publisher>

        <publishername>Cornell University Press</publishername>

      </publisher>

      <pubdate>1977</pubdate>

   </biblioentry>

</bibliography>

EOF

doc = Document.new(string)

puts doc

We use a "here document" string: all characters between <<EOF and EOF, with newlines included, are part of the string.
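The two most common source types can be compared side by side. The following is a minimal sketch, not part of the original listings: the temporary file, the one-entry document and the variable names are made up for illustration, and it uses the fully qualified REXML::Document name instead of include REXML.

```ruby
require 'rexml/document'
require 'tempfile'

# A tiny stand-in document (hypothetical content, not bibliography.xml).
xml = "<bibliography><biblioentry id='demo-1'/></bibliography>"

# Source 1: an IO object. We write the XML to a temporary file first,
# rewind it, and hand the open File to Document.new.
doc_from_io = Tempfile.create('bib') do |f|
  f.write(xml)
  f.rewind
  REXML::Document.new(f)
end

# Source 2: a String that contains the XML document.
doc_from_string = REXML::Document.new(xml)

puts doc_from_io.root.name
puts doc_from_string.root.elements[1].attributes['id']
```

Both calls build the same tree, so the choice of source only depends on where the XML comes from.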

Accessing Elements and Attributes

From now on, we will use irb, Ruby's interactive shell, for our REXML examples. At the irb prompt, we load the file bibliography.xml into a document. After that, we can interactively execute commands to access the elements and attributes of the document.

koan$ irb

irb(main):001:0> require 'rexml/document'

=> true

irb(main):002:0> include REXML

=> Object

irb(main):003:0> doc = Document.new(File.new("bibliography.xml"))

=> <UNDEFINED> ... </>

Now you can explore the document very easily. Let's look at a typical irb session with our XML file:

irb(main):004:0> root = doc.root

=> <bibliography id='personal_identity'> ... </>

irb(main):005:0> root.attributes['id']

=> "personal_identity"

irb(main):006:0> puts root.elements[1].elements["author"]

<author>

  <firstname>Godfrey</firstname>

  <surname>Vesey</surname>

</author>

irb(main):007:0> puts root.elements["biblioentry[1]/author"]

<author>

  <firstname>Godfrey</firstname>

  <surname>Vesey</surname>

</author>

irb(main):008:0> puts root.elements["biblioentry[@id='FHIW13C-1260']"]

<biblioentry id='FHIW13C-1260'>

      <author>

        <firstname>Sydney</firstname>

        <surname>Shoemaker</surname>

      </author>

      <author>

        <firstname>Richard</firstname>

        <surname>Swinburne</surname>

      </author>

      <title>Personal Identity</title>

      <publisher>

        <publishername>Basil Blackwell</publishername>

      </publisher>

      <pubdate>1984</pubdate>

    </biblioentry>

=> nil

irb(main):009:0> root.each_element('//author') {|author| puts author}

<author>

  <firstname>Godfrey</firstname>

  <surname>Vesey</surname>

</author>

<author>

  <firstname>René</firstname>

  <surname>Marres</surname>

</author>

<author>

  <firstname>James</firstname>

  <surname>Baillie</surname>

</author>

<author>

  <firstname>Brian</firstname>

  <surname>Garrett</surname>

</author>

<author>

  <firstname>John</firstname>

  <surname>Perry</surname>

</author>

<author>

  <firstname>Geoffrey</firstname>

  <surname>Madell</surname>

</author>

<author>

  <firstname>Sydney</firstname>

  <surname>Shoemaker</surname>

</author>

<author>

  <firstname>Richard</firstname>

  <surname>Swinburne</surname>

</author>

<author>

  <firstname>Jonathan</firstname>

  <surname>Glover</surname>

</author>

<author>

  <firstname>Harold</firstname>

  <othername>W.</othername>

  <surname>Noonan</surname>

</author>

=> [<author> ... </>, <author> ... 

  </>, <author> ... </>, <author> ... 

  </>, <author> ... </>, <author> ... 

  </>, <author> ... </>, <author> ... 

  </>, <author> ... </>, <author> ... </>]

First we bind the name root to the document root, which here is the bibliography element. Each Element object has an Attributes object named attributes, which acts as a hash map with the attribute names as keys and the attribute values as values. So root.attributes['id'] gives us the value of the id attribute of the root element. In the same manner, each Element object has an Elements object named elements, with each and [] methods that give access to the subelements. The [] method takes an index or an XPath expression as its argument and returns the first child element that matches it; the XPath expression acts as a filter that decides which element is returned. Note that root.elements[1] is the first child element, because XPath indexes start from 1, not from 0. In fact, root.elements[1] is equivalent to root.elements["*[1]"], where *[1] is the XPath expression for the first child.

The each method of the Elements class iterates through all the child elements, optionally filtering them by a given XPath expression, and executes the code block once for each element. The each_element method of Element is shorthand for elements.each.
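The same access patterns also work outside irb. Here is a small sketch as a standalone script; the two-entry document and its A1 and A2 ids are invented for the example and are not taken from bibliography.xml.

```ruby
require 'rexml/document'

# Hypothetical miniature bibliography, just large enough to show the calls.
xml = <<~XML
  <bibliography id="personal_identity">
    <biblioentry id="A1"><author><surname>Vesey</surname></author></biblioentry>
    <biblioentry id="A2"><author><surname>Perry</surname></author></biblioentry>
  </bibliography>
XML

root = REXML::Document.new(xml).root

root_id        = root.attributes['id']                 # attribute lookup by name
first_by_index = root.elements[1]                      # index, counted from 1
first_by_xpath = root.elements['*[1]']                 # the equivalent XPath form
second         = root.elements["biblioentry[@id='A2']"] # filter on an attribute

# each_element with an XPath filter visits every matching element.
surnames = []
root.each_element('//surname') { |s| surnames << s.text }

puts root_id
puts first_by_index.attributes['id']
puts second.elements['author/surname'].text
puts surnames.join(', ')
```

Whitespace between the tags becomes text nodes, but these are not counted by elements[1], which only looks at child elements.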

Creating and Inserting Elements and Attributes

Now we will create a small bibliography document, consisting of one biblioentry, from scratch. Here's how it goes:

irb(main):010:0> doc2 = Document.new

=> <UNDEFINED/>

irb(main):011:0> doc2.add_element("bibliography", 

                    {"id" => "philosophy"})

=> <bibliography id='philosophy'/>

irb(main):012:0> doc2.root.add_element("biblioentry")

=> <biblioentry/>

irb(main):013:0> biblioentry = doc2.root.elements[1]

=> <biblioentry/>

irb(main):014:0> author = Element.new("author")

=> <author/>

irb(main):015:0> author.add_element("firstname")

=> <firstname/>

irb(main):016:0> author.elements["firstname"].text = "Bertrand"

=> "Bertrand"

irb(main):017:0> author.add_element("surname")

=> <surname/>

irb(main):018:0> author.elements["surname"].text = "Russell"

=> "Russell"

irb(main):019:0> biblioentry.elements << author

=> <author> ... </>

irb(main):020:0> title = Element.new("title")

=> <title/>

irb(main):021:0> title.text = "The Problems of Philosophy"

=> "The Problems of Philosophy"

irb(main):022:0> biblioentry.elements << title

=> <title> ... </>

irb(main):023:0> biblioentry.elements << Element.new("pubdate")

=> <pubdate/>

irb(main):024:0> biblioentry.elements["pubdate"].text = "1912"

=> "1912"

irb(main):025:0> biblioentry.add_attribute("id", "ISBN0-19-285423-2")

=> "ISBN0-19-285423-2"

irb(main):026:0> puts doc2

<bibliography id='philosophy'>

  <biblioentry id='ISBN0-19-285423-2'>

    <author>

      <firstname>Bertrand</firstname>

      <surname>Russell</surname>

    </author>

    <title>The Problems of Philosophy</title>

    <pubdate>1912</pubdate>

  </biblioentry>

</bibliography>

=> nil

As you see, we create an empty new document and we add one element to it. This element becomes the root. The add_element method takes the name of the element as argument and an optional argument which is a hash map of name/value pairs of the attributes. So this method adds a new child to the document or an element, optionally setting attributes of the element.

You can also make a new element, like we do with the author element, and then add it afterwards to another element: if the add_element method gets an Element object, the object will be added to the parent element. Instead of the add_element method, you can also use the << method on Element.elements. The two methods return the added element.

In addition, the add_attribute method adds an attribute to an existing element. The first parameter is the attribute name, the second is the attribute value; as the session above shows, the call returns the value that was set. The text of an element can be set with the text= method, or text can be appended with the add_text method.
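For readers who prefer a script over an irb transcript, the construction steps can be condensed as follows. This is only a sketch of the same calls; it chains text= onto the element returned by add_element, which is a stylistic choice of this example rather than something the session above does.

```ruby
require 'rexml/document'

doc = REXML::Document.new

# add_element returns the new element, so we can keep a reference to it.
root  = doc.add_element('bibliography', { 'id' => 'philosophy' })
entry = root.add_element('biblioentry')

# Build the author subtree separately, then attach it with <<.
author = REXML::Element.new('author')
author.add_element('firstname').text = 'Bertrand'
author.add_element('surname').text   = 'Russell'
entry.elements << author

# Attributes can be added after the element exists.
entry.add_attribute('id', 'ISBN0-19-285423-2')

puts doc
```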

If you want to insert an element at a specific location, you can use the methods insert_before and insert_after:

irb(main):027:0> publisher = Element.new("publisher")

=> <publisher/>

irb(main):028:0> publishername = Element.new("publishername")

=> <publishername/>

irb(main):029:0> publishername.add_text("Oxford University Press")

=> <publishername> ... </>

irb(main):030:0> publisher << publishername

=> <publishername> ... </>

irb(main):031:0> doc2.root.insert_before("//pubdate", publisher)

=> <bibliography id='philosophy'> ... </>

irb(main):032:0> puts doc2

<bibliography id='philosophy'>

  <biblioentry id='ISBN0-19-285423-2'>

    <author>

      <firstname>Bertrand</firstname>

      <surname>Russell</surname>

    </author>

    <title>The Problems of Philosophy</title>

    <publisher>

      <publishername>Oxford University Press</publishername>

    </publisher>

    <pubdate>1912</pubdate>

  </biblioentry>

</bibliography>

=> nil
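The session above only exercises insert_before. As a complement, here is a brief sketch of insert_after on a made-up biblioentry; the element contents are placeholders chosen for the example.

```ruby
require 'rexml/document'

# Hypothetical entry with a title and a pubdate, but no publisher yet.
entry = REXML::Document.new(
  '<biblioentry><title>T</title><pubdate>1912</pubdate></biblioentry>'
).root

publisher = REXML::Element.new('publisher')

# The first argument is an XPath expression that locates the reference
# element; the new element is placed directly after it.
entry.insert_after('title', publisher)

names = entry.elements.to_a.map(&:name)
puts names.join(' ')
```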

Deleting Elements and Attributes

The add_element and add_attribute methods have their counterparts for deleting elements and attributes, respectively. This is how it goes with attributes:

irb(main):033:0> doc2.root.delete_attribute('id')

=> <bibliography> ... </>

irb(main):034:0> puts doc2

<bibliography>

  <biblioentry id='ISBN0-19-285423-2'>

    <author>

      <firstname>Bertrand</firstname>

      <surname>Russell</surname>

    </author>

    <title>The Problems of Philosophy</title>

    <publisher>

      <publishername>Oxford University Press</publishername>

    </publisher>

    <pubdate>1912</pubdate>

  </biblioentry>

</bibliography>

=> nil

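According to its documentation, the delete_attribute method returns the removed attribute; in the session above, irb echoes the element the attribute was removed from.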

The delete_element method can take an Element object, a string or an index as its argument:

irb(main):034:0> doc2.delete_element("//publisher")

=> <publisher> ... </>

irb(main):035:0> puts doc2

<bibliography>

  <biblioentry id='ISBN0-19-285423-2'>

    <author>

      <firstname>Bertrand</firstname>

      <surname>Russell</surname>

    </author>

    <title>The Problems of Philosophy</title>

    <pubdate>1912</pubdate>

  </biblioentry>

</bibliography>

=> nil

irb(main):036:0> doc2.root.delete_element(1)

=> <biblioentry id='ISBN0-19-285423-2'> ... </>

irb(main):037:0> puts doc2

<bibliography/>

=> nil

The first delete_element invocation in our example uses an XPath expression to locate the element that has to be deleted. The second time we use the index 1, meaning the first element in the document root will be deleted. The delete_element method returns the removed element.
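The three argument types accepted by delete_element can be tried in one short script. This sketch uses an invented one-entry document, and the variable names are ours.

```ruby
require 'rexml/document'

doc = REXML::Document.new(
  '<bibliography><biblioentry><title>T</title><pubdate>1912</pubdate></biblioentry></bibliography>'
)
entry = doc.root.elements[1]

# 1. A string is interpreted as an XPath expression.
removed_pubdate = entry.delete_element('pubdate')

# 2. An Element object is removed directly.
removed_title = entry.delete_element(entry.elements['title'])

# 3. An index removes the n-th child element (counted from 1).
removed_entry = doc.root.delete_element(1)

puts removed_pubdate.name
puts removed_title.name
puts removed_entry.name
puts doc.root.elements.size
```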

Text Nodes and Entity Processing

We already used text nodes in the previous examples. In this section we look at a more advanced topic: how does REXML handle entities? REXML is a non-validating parser, and therefore is not required to expand external entities. So external entities aren't replaced by their values, but internal entities are: when REXML parses an XML document, it processes the DTD and creates a table of the internal entities with their values. When one of these entities occurs in the document, REXML replaces it with its value. An example:

irb(main):038:0> doc3 = Document.new('<!DOCTYPE testentity [

irb(main):039:1' <!ENTITY entity "test">]>

irb(main):040:1' <testentity>&entity; the entity</testentity>')

=> <UNDEFINED> ... </>

irb(main):041:0> puts doc3

<!DOCTYPE testentity [

<!ENTITY entity "test">]>

<testentity>&entity; the entity</testentity>

=> nil

irb(main):042:0> doc3.root.text

=> "test the entity"

You see that the XML document, when printed, correctly contains the entity. When you access the text, the entity &entity; gets expanded correctly to "test".

However, REXML uses lazy evaluation of the entities. As a result, the following problem occurs:

irb(main):043:0> doc3.root.text = "test the &entity;"

=> "test the &entity;"

irb(main):044:0> puts doc3

<!DOCTYPE testentity [

<!ENTITY entity "test">

]>

<testentity>&entity; the &entity;</testentity>

=> nil

irb(main):045:0> doc3.root.text                      

=> "test the test"

As you see, the text "test the &entity;" is stored as "&entity; the &entity;": because the literal word "test" matches the value of the entity, REXML writes it out as an entity reference too. If you later change the value of the entity, more of the document changes than you intended. If this is problematic for your application, you can set the :raw flag on any Text or Element node, even on the Document node. The entities in that node won't be processed, so you have to deal with entities yourself. An example:

irb(main):046:0> doc3 = Document.new('<!DOCTYPE testentity [

irb(main):047:1' <!ENTITY entity "test">]>

irb(main):048:1' <testentity>test the &entity;</testentity>', 

                 {:raw => :all})

=> <UNDEFINED> ... </>

irb(main):049:0> puts doc3

<!DOCTYPE testentity [

<!ENTITY entity "test">

]>

<testentity>test the &entity;</testentity>

=> nil

irb(main):050:0> doc3.root.text

=> "test the test"

The entities for &, <, >, ", and ' are processed automatically. Moreover, if you write one of these characters in a Text node or attribute, REXML converts them to their entity equivalent, e.g. &amp; for &.
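Both behaviours, expansion of an internal entity and escaping of the predefined characters, can be checked in a short script. The sketch below reuses the testentity document from the session; the element name note and its text are assumptions added for illustration.

```ruby
require 'rexml/document'

xml = <<~XML
  <!DOCTYPE testentity [
  <!ENTITY entity "test">]>
  <testentity>&entity; the entity</testentity>
XML

doc = REXML::Document.new(xml)

expanded   = doc.root.text   # the entity is replaced by its value
serialized = doc.root.to_s   # the entity reference is kept in the markup

# Predefined entities: special characters are escaped on output
# and given back unescaped when the text is read.
note = REXML::Element.new('note')
note.text = 'a & b < c'
escaped   = note.to_s
read_back = note.text

puts expanded
puts serialized
puts escaped
puts read_back
```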

Stream Parsing

Stream parsing is faster than tree parsing, so the stream parser can be handy if speed matters. However, features such as XPath are not available. You supply a listener class, and each time REXML encounters an event (start tag, end tag, text, etc.), it notifies the listener. An example program:

Listing 4: The Stream Parser in Action (code3.rb)

require 'rexml/document'

require 'rexml/streamlistener'

include REXML



class Listener

  include StreamListener

  def tag_start(name, attributes)

    puts "Start #{name}"

  end

  def tag_end(name)

    puts "End #{name}"

  end

end



listener = Listener.new

parser = Parsers::StreamParser.new(File.new("bibliography2.xml"), listener)

parser.parse

The file bibliography2.xml is the following:

Listing 5: The bibliography2.xml File

bibliography2.xml

Running code3.rb gives this output:

koan$ ruby code3.rb 

Start bibliography

Start biblioentry

Start author

Start firstname

End firstname

Start surname

End surname

End author

Start title

End title

Start publisher

Start publishername

End publishername

End publisher

Start pubdate

End pubdate

End biblioentry

End bibliography
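Since XPath is not available in stream mode, the listener has to keep track of where it is in the document. The following sketch shows one way to do that: a hypothetical TitleCollector listener that remembers whether it is inside a title element and gathers the text it sees there. The class name, the flag and the inline XML string are all assumptions made for this example.

```ruby
require 'rexml/document'
require 'rexml/streamlistener'

class TitleCollector
  include REXML::StreamListener

  attr_reader :titles

  def initialize
    @titles   = []
    @in_title = false
  end

  # Called for every start tag.
  def tag_start(name, _attributes)
    @in_title = true if name == 'title'
  end

  # Called for every run of character data.
  def text(content)
    @titles << content if @in_title
  end

  # Called for every end tag.
  def tag_end(name)
    @in_title = false if name == 'title'
  end
end

xml = '<bibliography><biblioentry><title>Personal Identity</title>' \
      '<pubdate>1984</pubdate></biblioentry></bibliography>'

collector = TitleCollector.new
REXML::Parsers::StreamParser.new(xml, collector).parse

puts collector.titles
```

Here the parser reads from a String instead of a File; the callbacks that the listener does not override keep the empty default implementations from StreamListener.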

In Conclusion

Ruby and XML make a great team. REXML lets you create, access, and modify XML documents in a very intuitive way. With the help of irb, Ruby's interactive shell, we can also explore XML documents very easily.
