Getting Started with XML Programming

April 21, 1999

Norman Walsh

If you're new to programming with XML, you may be wondering how to get started. The benefits of using XML to store structured data may be obvious, but once you've got some data in XML, how do you get it back out? In this article, we'll explore several alternatives and look at some concrete solutions in Perl. (The process and the alternatives are much the same in Python, Java, C++, or your favorite programming language.)

We're going to build a simple text processing application that uses XML to store user preferences and other configuration data. It's the sort of thing that's typically been done with plain text files in the past, and it's probably familiar to most readers.

Reading configuration files

Many applications need to store user preferences and other sorts of configuration information. One common way to do this is to use text files. There are probably nearly as many conventions for the format of these files as there are programmers who've created them, but one common style is the Windows INI file format (see Example 1). The format is very simple: The file is divided into named sections and within each section, names and values are associated by assignment.

Most programmers have probably written code to process text files like this at one time or another. In some languages, this is easy; in others it's more difficult. But the algorithm is always about the same: you loop over the lines of the file and parse the strings that you get back. On Windows systems, there are convenient functions for accessing data in the INI file format: GetProfileString() and WriteProfileString().

Example 1. A Simple INI file.

[section1]
name1=value1
name2=value2
[section2]
someothername=someothervalue
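
For comparison, here is a minimal sketch of that line-by-line algorithm in Perl, for files like Example 1. The function name get_ini_string and its argument order are assumptions made for illustration; real INI files also allow comments and other conventions that this sketch ignores.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the article): read an INI-style file
# line by line and return the value stored for a section/name pair.
sub get_ini_string {
    my ($inifile, $section, $variable, $default) = @_;

    open(my $fh, '<', $inifile) or return $default;

    my $current = '';                        # section we are currently inside
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;                # strip the line ending
        if ($line =~ /^\s*\[([^\]]*)\]\s*$/) {
            $current = $1;                   # a "[section]" header
        } elsif ($current eq $section
                 && $line =~ /^\s*([^=]+?)\s*=\s*(.*?)\s*$/) {
            my ($name, $value) = ($1, $2);   # a "name=value" assignment
            return $value if $name eq $variable;
        }
    }
    return $default;                         # $fh is closed when it goes out of scope
}
```

Every decision in this loop depends on a line being a complete unit of meaning, which is exactly the property an XML file does not have.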

In this article, I'm going to propose an XML version of the configuration file format (see Example 2), and explore several ways to get information out of files in this format using Perl. We'll end up with our own versions of getProfileString() and setProfileString() that provide transparent access to XML configuration files.

A simple configuration file format in XML

Example 2 shows a simple XML version of the configuration file shown in Example 1.

Example 2. A Simple INI file in XML.

  <configuration-file>
    <section name="section1">
      <entry name="name1" value="value1"/>
      <entry name="name2" value="value2"/>
    </section>
    <section name="section2">
      <entry name="someothername" value="someothervalue"/>
    </section>
  </configuration-file>

A DTD for this format can be seen in Figure 1.

Figure 1. A DTD for the XML INI File Format.

 <!ELEMENT configuration-file (section+)>
 
 <!ELEMENT section (entry*)>
 <!ATTLIST section
 name    CDATA    #REQUIRED>
 
 <!ELEMENT entry EMPTY>
 <!ATTLIST entry
 name    CDATA    #REQUIRED
 value   CDATA    #REQUIRED>
        

The DTD serves mainly as documentation for the intended format; in practice, we're going to treat XML INI files as simple well-formed documents.

Simplifying assumptions

The examples presented in this article make some simplifying assumptions about configuration files:

  • There are no repeated section names.

  • The underlying language will transparently deal with encoding issues.

  • The configuration files are properly structured. We won't worry about validation and we'll try to be forgiving if there's a little variation in the files (extra attributes, for example).

XML files aren't lines of text

Your first temptation, especially if you're in a hurry, may be to process the file just as you would the non-XML version, reading it one line at a time. This is a very fragile way to handle XML data, because line breaks are insignificant in the XML INI file. The example would be just as valid if the file were organized as shown in Example 4:

Example 4. INI file with different line breaks.

  <configuration-file><section name="section1">
    <entry
      name="name1"
      value="value1"/><entry name="name2"
        value="value2"/>
  </section><section name="section2">
    <entry name="someothername" value="someothervalue"/>
  </section>
  </configuration-file>

If you're willing to reinvent the lexical analysis of XML files, you could process this one line at a time, but it's not worth the effort.
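
To see how fragile the line-at-a-time approach is, consider this small sketch. The helper count_entries_by_line is hypothetical, invented for this demonstration; it assumes each entry tag sits on a single line, which holds for the layout in Example 2 but not for the one in Example 4.

```perl
use strict;
use warnings;

# Naive, line-oriented matcher: expects a whole <entry .../> tag on one line.
# (count_entries_by_line is a made-up name used only for this demo.)
sub count_entries_by_line {
    my ($xml) = @_;
    my $count = 0;
    for my $line (split /\n/, $xml) {
        $count++ while $line =~ /<entry\s[^>]*\/>/g;
    }
    return $count;
}

# The same three entries, laid out as in Example 2 ...
my $tidy = <<'XML';
<configuration-file>
  <section name="section1">
    <entry name="name1" value="value1"/>
    <entry name="name2" value="value2"/>
  </section>
  <section name="section2">
    <entry name="someothername" value="someothervalue"/>
  </section>
</configuration-file>
XML

# ... and as in Example 4.
my $reflowed = <<'XML';
<configuration-file><section name="section1">
  <entry
    name="name1"
    value="value1"/><entry name="name2"
      value="value2"/>
</section><section name="section2">
  <entry name="someothername" value="someothervalue"/>
</section>
</configuration-file>
XML

# prints "tidy: 3, reflowed: 1"
printf "tidy: %d, reflowed: %d\n",
    count_entries_by_line($tidy), count_entries_by_line($reflowed);
```

Both strings describe the same configuration, yet the line-based matcher finds only one of the three entries in the second layout.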

Brute force: Using regular expressions

XML was consciously designed so that it could be effectively processed by "low-tech" solutions, in particular text processors like Perl using regular expressions. As we've just seen, you can't process the file one line at a time, but if your data files are small enough to load entirely into memory, you can parse them with regular expressions. Example 5 shows a version of getProfileString() that uses regular expressions:

   1 | sub getProfileString {
     |     my ($xmlfile, $section, $variable, $default) = @_;
     |     local (*F, $_);
     |
   5 |     open (F, $xmlfile) || return undef;                 # Load the document
     |     read (F, $_, -s $xmlfile);
     |     close (F);
     |
     |     while (/(<section[^>]*>)(.*?)<\/section\s*>/s) {    # Process sections
  10 |         my($sectstart) = $1;                            # Save start tag
     |         my($sectdata) = $2;
     |         $_ = $';
     |
     |         if ($sectstart =~ /name=([\"\'])\Q$section\E\1/s) {   # Process entries
  15 |             while ($sectdata =~ /<entry[^>]*?\/>/) {
     |                 my($entry) = $&;
     |                 $sectdata = $';
     |                 if ($entry =~ /name=([\"\'])\Q$variable\E\1/s) {
     |                     if ($entry =~ /value=([\"\'])(.*?)\1/s) {
  20 |                         return $2;
     |                     } else {
     |                         return "";
     |                     }
     |                 }
  25 |             }
     |             return $default;
     |         }
     |     }
     |
  30 |     return $default;
     | }

This code has four important features:

Line 5  

The entire document is loaded into memory.

This avoids the problems associated with reading an XML file line-by-line, and allows us to match elements that contain newlines in “unexpected” places.

Line 9  

The document is processed a section at a time.

The regular expression /(<section[^>]*>)(.*?)<\/section\s*>/s matches an entire section, from start-tag to end-tag. Note how the regular expressions for the start- and end-tags allow for the possibility of whitespace, and the use of the /s modifier to allow Perl to match across newlines. (Beware that this code would not work properly if <section> tags could be nested.)

Matching large regular expressions can have a performance impact, although the impact is probably insignificant for small files such as these. If you're tempted to use this method on larger files, the impact may be worth considering. For a complete discussion of regular expressions, see Mastering Regular Expressions by Jeffrey Friedl.

Line 10  

The start-tag for the section is stored in a variable.

By storing the start-tags in variables, we can examine them (using another regular expression) for the attributes we're after. Since attributes can occur in any order, matching in one step quickly leads to extremely hairy regular expressions.

Line 15  

Once we find a section with the name we're after, we perform an analogous parse of the entries within that section.
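
One way to keep the attribute matching manageable, sketched here under some assumptions, is to collect every attribute of a start-tag into a hash and then look up the ones you need. The helper name tag_attributes is made up for this illustration, and the sketch does not decode entity references such as &amp; in attribute values.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the article's code): pull every
# name="value" or name='value' pair out of one start-tag into a hash.
sub tag_attributes {
    my ($tag) = @_;
    my %attr;
    # \2 refers back to whichever quote character opened the value.
    while ($tag =~ /\s([\w:.-]+)\s*=\s*(["'])(.*?)\2/gs) {
        $attr{$1} = $3;
    }
    return %attr;
}

# Attribute order and quote style no longer matter.
my %attr = tag_attributes(q{<entry value="value1" name='name1'/>});
print "$attr{name} = $attr{value}\n";    # prints "name1 = value1"
```

This is still brute force, but it isolates the "any order" problem in one place instead of spreading it across several patterns.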

Let a parser do the work

Using regular expressions to process XML files will work for many simple cases, but handling most XML documents this way is difficult. This is a task best left to a specialized tool—the XML parser. Luckily, XML parsers are available for most languages. One of the most popular XML parsers for Perl is Clark Cooper's XML::Parser module, currently at version 2.22.

XML::Parser, built on top of James Clark's expat, is an event-based parser. This simply means that you tell the parser what you're interested in and then let the parser do the work. Each time the parser encounters something that you've registered an interest in, it calls back into your code.

Here's another version of getProfileString(); this one uses a parser to do the hard work:

   1 | use XML::Parser;                                       # Use the parser module
     |
     | my $target_section = "";                               # Set up global variables
     | my $target_entry = "";
   5 | my $current_section = "";
     | my $entry_value = "";
     |
     | sub getProfileString {
     |     my($cfgfile, $section, $variable, $default) = @_;
  10 |     my $parser = new XML::Parser(ErrorContext => 2);   # Create a parser
     |     $parser->setHandlers(Start => \&start_handler);    # Report start-tag events
     |
     |     $target_section = $section;
     |     $target_entry = $variable;
  15 |     $current_section = "";
     |     $entry_value = $default;
     |
     |     $parser->parsefile($cfgfile);                      # Run the parser
     |
  20 |     return $entry_value;
     | }
     |
     | sub start_handler {                                    # The start-tag handler
     |     my $parser_context = shift;
  25 |     my $element = shift;
     |     my %attr = @_;
     |
     |     if ($element eq 'section') {
     |         $current_section = $attr{'name'};
  30 |     } elsif ($element eq 'entry' && $current_section eq $target_section) {
     |         if ($attr{'name'} eq $target_entry) {
     |             $entry_value = $attr{'value'};
     |             $parser_context->finish();
     |         }
  35 |     }
     | }
      

Let's take a closer look at what's going on. Using the parser requires several steps:

Line 1  

Tell Perl that we're using the parser module.

Line 3  

Set up some package-global variables; we'll use these to keep track of the things we've seen. It's not particularly elegant, but it keeps the example fairly simple.

Line 10  

The XML::Parser module exposes the parser as an object. This line creates a new instance of the parser object. (ErrorContext tells the parser to report errors in context, showing the given number of lines of the XML file on either side of the offending line, if the XML document is not well-formed.)

Line 11  

The next step is to tell the parser what events we're interested in. For the purposes of this example, we only care about start-tags, so we tell the parser to call the start_handler function each time a start-tag is encountered.

Line 18  

After establishing which events we're interested in, we run the parser. This function won't actually return until the entire document has been parsed.

Line 23  

Each time the parser encounters a start-tag, it calls the start_handler function, passing it several arguments: the parser context, which provides access to the underlying parser (here it is used only to call finish() once the value has been found), the element name, and the element's attributes as a list of name/value pairs, which the handler copies into an associative array.
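
If the global variables bother you, one alternative is to make the handler a closure. The following is a sketch, not the article's code: get_profile_string_closure is a hypothetical name, and it assumes the same file format as Example 2. It uses the Handlers option of the XML::Parser constructor instead of a separate setHandlers() call.

```perl
use strict;
use warnings;
use XML::Parser;

# Sketch: the same lookup, but the state lives in lexical variables that
# the handler closes over, instead of in file-level globals.
# (get_profile_string_closure is a hypothetical name.)
sub get_profile_string_closure {
    my ($cfgfile, $section, $variable, $default) = @_;

    my $current_section = '';
    my $value           = $default;

    my $parser = XML::Parser->new(
        ErrorContext => 2,
        Handlers     => {
            Start => sub {
                my ($expat, $element, %attr) = @_;
                if ($element eq 'section') {
                    $current_section = defined $attr{name} ? $attr{name} : '';
                } elsif ($element eq 'entry'
                         && $current_section eq $section
                         && defined $attr{name}
                         && $attr{name} eq $variable) {
                    $value = $attr{value};
                    $expat->finish;          # no more handler calls needed
                }
            },
        },
    );

    $parser->parsefile($cfgfile);
    return $value;
}
```

Because each call gets its own $current_section and $value, two lookups can no longer interfere with each other through shared state.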

A more complete solution

Both of the solutions presented so far have some serious drawbacks. Most significantly, they read the entire XML document each time a string is requested. A more complete solution can be found in cfgfile.pm. This module loads the entire file into memory and provides methods for setting as well as getting profile strings.
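
The load-once idea can be sketched as follows. This is not the cfgfile.pm module, whose contents aren't reproduced here; the package name ConfigCache and its load, get, and set methods are assumptions chosen for illustration. Changes made with set live only in memory; writing them back out as XML is left out of the sketch.

```perl
use strict;
use warnings;
use XML::Parser;

{
    # Hypothetical package; NOT the article's cfgfile.pm.
    package ConfigCache;

    # Parse the file once and keep every entry in a hash of hashes:
    # $data{section name}{entry name} = entry value
    sub load {
        my ($class, $cfgfile) = @_;
        my (%data, $current);

        my $parser = XML::Parser->new(Handlers => {
            Start => sub {
                my ($expat, $element, %attr) = @_;
                if ($element eq 'section' && defined $attr{name}) {
                    $current = $attr{name};
                    $data{$current} ||= {};
                } elsif ($element eq 'entry'
                         && defined $current
                         && defined $attr{name}) {
                    $data{$current}{ $attr{name} } = $attr{value};
                }
            },
        });
        $parser->parsefile($cfgfile);        # the only time the file is read

        return bless { data => \%data }, $class;
    }

    # Look up a value without touching the file again.
    sub get {
        my ($self, $section, $variable, $default) = @_;
        my $entries = $self->{data}{$section} or return $default;
        return exists $entries->{$variable} ? $entries->{$variable} : $default;
    }

    # Change or add a value in memory only.
    sub set {
        my ($self, $section, $variable, $value) = @_;
        $self->{data}{$section}{$variable} = $value;
        return $value;
    }
}
```

With something like this in place, a program would call ConfigCache->load once at startup and then use get and set on the returned object as often as it likes, at the cost of holding the whole configuration in memory.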

Conclusion

In this article, we've looked at several ways to process XML documents and demonstrated the benefits of using an XML parser to do the hard work. With these examples in hand, I hope you're ready to tackle your next XML project. Let me know how it goes, and remember to send your XML questions to xmlqna@xml.com.

Appendix A. Getting the code

The code samples mentioned in this article can be retrieved separately: