Getting Started with XML Programming
If you're new to programming with XML, you may be wondering how to get started. The benefits of using XML to store structured data may be obvious, but once you've got some data in XML, how do you get it back out? In this article, we'll explore several alternatives and look at some concrete solutions in Perl. (The process and the alternatives are much the same in Python, Java, C++, or your favorite programming language.)
We're going to build a simple text processing application that uses XML to store user preferences and other configuration data. It's the sort of thing that's typically been done with plain text files in the past, and it's probably familiar to most readers.
Most programmers have probably written code to process text files like the one in Example 1 at one time or another. In some languages, this is easy; in some it's more difficult. But it's always about the same algorithm: You loop over the lines of the file and parse the strings that you get back. On Windows systems, there are convenient functions for accessing data in the INI file format: GetProfileString() and WriteProfileString().
Example 1. A Simple INI file.
[section1]
name1=value1
name2=value2
[section2]
someothername=someothervalue
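The line-at-a-time algorithm for plain INI files can be sketched in a few lines of Perl. This is only an illustration: parse_ini is a hypothetical helper, not part of the article's code, and it assumes comments start with a semicolon and that every name=value pair sits on a single line.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the article): parse INI text into a
# hash of hashes, so that $config->{section}{name} holds the value.
sub parse_ini {
    my ($text) = @_;
    my (%config, $section);
    for my $line (split /\r?\n/, $text) {
        next if $line =~ /^\s*(?:;.*)?$/;          # skip blank lines and ; comments
        if ($line =~ /^\s*\[([^\]]+)\]\s*$/) {     # [section] header
            $section = $1;
        } elsif (defined $section && $line =~ /^\s*([^=]+?)\s*=\s*(.*?)\s*$/) {
            $config{$section}{$1} = $2;            # name=value pair
        }
    }
    return \%config;
}

my $ini = "[section1]\nname1=value1\nname2=value2\n"
        . "[section2]\nsomeothername=someothervalue\n";
my $config = parse_ini($ini);
print $config->{section1}{name2}, "\n";    # prints "value2"
```

Each line can be understood on its own here, which is exactly the property an XML file does not guarantee.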
In this article, I'm going to propose an XML version of the configuration file format (see Example 2), and explore several ways to get information out of files in this format using Perl. We'll end up with our own versions of getProfileString() and setProfileString() that provide transparent access to XML configuration files.
Example 2. A Simple INI file in XML.
<configuration-file>
  <section name="section1">
    <entry name="name1" value="value1"/>
    <entry name="name2" value="value2"/>
  </section>
  <section name="section2">
    <entry name="someothername" value="someothervalue"/>
  </section>
</configuration-file>
A DTD for this format can be seen in Figure 1.
Figure 1. A DTD for the XML INI File Format.
<!ELEMENT configuration-file (section+)>
<!ELEMENT section (entry*)>
<!ATTLIST section
name CDATA #REQUIRED
>
<!ELEMENT entry EMPTY>
<!ATTLIST entry
name CDATA #REQUIRED
value CDATA #REQUIRED
>
The DTD serves mainly as documentation for the intended format; in practice, we're going to treat XML INI files as simple well-formed documents.
The examples presented in this article make some simplifying assumptions about configuration files:
There are no repeated section names.
The underlying language will transparently deal with encoding issues.
The configuration files are properly structured. We won't worry about validation and we'll try to be forgiving if there's a little variation in the files (extra attributes, for example).
Example 4. INI file with different line breaks.
<configuration-file><section name="section1">
<entry name="name1" value="value1"/><entry name="name2" value="value2"/>
</section><section name="section2" >
<entry name="someothername" value="someothervalue"/>
</section>
</configuration-file>
If you're willing to reinvent the lexical analysis of XML files, you could process this one line at a time, but it's not worth the effort.
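To see why a line at a time is awkward, here is a small sketch that runs the same pattern over some XML twice: once per line, and once over the whole text. The input is a made-up fragment (not one of the article's sample files) in which one start-tag is split across two lines, which is perfectly legal XML.

```perl
use strict;
use warnings;

# Hypothetical input: the second entry's tag is split across two lines.
my $xml = <<'XML';
<section name="section1"><entry name="name1" value="value1"/><entry
  name="name2" value="value2"/></section>
XML

# Pattern for a complete <entry .../> tag; captures the name attribute.
my $entry_re = qr/<entry\s[^>]*name="([^"]*)"[^>]*\/>/;

# Line at a time: no single line contains the whole second tag.
my @by_line;
for my $line (split /\n/, $xml) {
    push @by_line, $line =~ /$entry_re/g;
}

# Whole document at once: \s and [^>] both match newlines,
# so the split tag is found as well.
my @whole = $xml =~ /$entry_re/g;

print "by line: @by_line\n";    # prints "by line: name1"
print "whole:   @whole\n";      # prints "whole:   name1 name2"
```

A line-oriented reader would have to buffer partial tags itself, which amounts to writing the lexical analyzer mentioned above.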
sub getProfileString {
    my ($xmlfile, $section, $variable, $default) = @_;

    # Load the document
    open(my $fh, '<', $xmlfile) || return undef;
    local $/;                       # slurp mode: read the whole file at once
    my $doc = <$fh>;
    close($fh);

    # Process sections
    while ($doc =~ /(<section[^>]*>)(.*?)<\/section\s*>/sg) {
        my $sectstart = $1;         # Save start tag
        my $sectdata  = $2;

        # \Q...\E keeps regex metacharacters in the names from being interpreted
        next unless $sectstart =~ /name=(["'])\Q$section\E\1/s;

        # Process entries
        while ($sectdata =~ /(<entry[^>]*?\/>)/g) {
            my $entry = $1;
            next unless $entry =~ /name=(["'])\Q$variable\E\1/s;
            # An entry without a value attribute yields the empty string
            return $entry =~ /value=(["'])(.*?)\1/s ? $2 : "";
        }
        return $default;
    }

    return $default;
}
This code has four important features:
The entire document is loaded into memory. This avoids the problems associated with reading an XML file line-by-line, and allows us to match elements that contain newlines in “unexpected” places.
The document is processed a section at a time. The regular expression /(<section[^>]*>)(.*?)<\/section\s*>/s matches an entire section, from start-tag to end-tag. Note how the regular expressions for the start- and end-tags allow for the possibility of whitespace, and the use of the /s modifier to allow Perl to match across newlines. (Beware that this code would not work properly if <section> tags could be nested.) Matching large regular expressions can have a performance impact, although the impact is probably insignificant for small files such as these. If you're tempted to use this method on larger files, the impact may be worth considering. For a complete discussion of regular expressions, see Mastering Regular Expressions by Jeffrey Friedl.
The start-tag for the section is stored in a variable. By storing the start-tags in variables, we can examine them (using another regular expression) for the attributes we're after. Since attributes can occur in any order, matching in one step quickly leads to extremely hairy regular expressions.
Once we find a section with the name we're after, we perform an analogous parse of the entries within that section.
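One way to sidestep the attribute-order problem is to pull every attribute out of a saved start-tag into a hash and then look up the ones you want. The following is a rough sketch under simple assumptions: tag_attributes is a hypothetical helper, it expects quoted attribute values, and it makes no attempt to decode entities such as &amp;amp;.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the article): collect the attributes of a
# start-tag into a hash, so the order they appear in no longer matters.
sub tag_attributes {
    my ($tag) = @_;
    my %attr;
    # name, optional whitespace around "=", then a single- or double-quoted value
    while ($tag =~ /([\w:.-]+)\s*=\s*(["'])(.*?)\2/sg) {
        $attr{$1} = $3;
    }
    return %attr;
}

my %a = tag_attributes(q{<entry value="value1" name='name1'/>});
print "$a{name} = $a{value}\n";    # prints "name1 = value1"
```

Extra attributes simply end up as extra hash keys, which fits the goal of being forgiving about small variations in the files.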
XML::Parser, built on top of James Clark's expat, is an event-based parser. This simply means that you tell the parser what you're interested in and then let the parser do the work. Each time the parser encounters something that you've registered an interest in, it makes calls back to your code.
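Before looking at a real application, it may help to watch the callbacks fire. This minimal sketch assumes XML::Parser is installed; it registers Start and End handlers, parses a short string (made up for this illustration), and records each event in the order the parser reports it.

```perl
use strict;
use warnings;
use XML::Parser;

my @events;    # filled in by the callbacks below

my $parser = XML::Parser->new(
    Handlers => {
        # Called for every start-tag: parser object, element name, attributes
        Start => sub {
            my ($expat, $element, %attr) = @_;
            push @events, "start $element";
        },
        # Called for every end-tag (an empty element reports both events)
        End => sub {
            my ($expat, $element) = @_;
            push @events, "end $element";
        },
    },
);

$parser->parse('<section name="section1"><entry name="name1" value="value1"/></section>');
print "$_\n" for @events;
```

Our code never loops over the document itself; it just reacts to whatever the parser announces.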
Here's another version of getProfileString(); this one uses a parser to do the hard work:
use XML::Parser;                 # Use the parser module

my $target_section  = "";        # Set up global variables
my $target_entry    = "";
my $current_section = "";
my $entry_value     = "";

sub getProfileString {
    my ($cfgfile, $section, $variable, $default) = @_;
    my $parser = XML::Parser->new(ErrorContext => 2);   # Create a parser
    $parser->setHandlers(Start => \&start_handler);     # Report start-tag events

    $target_section  = $section;
    $target_entry    = $variable;
    $current_section = "";
    $entry_value     = $default;

    $parser->parsefile($cfgfile);                       # Run the parser

    return $entry_value;
}

sub start_handler {                                     # The start-tag handler
    my $parser_context = shift;
    my $element        = shift;
    my %attr           = @_;

    if ($element eq 'section') {
        $current_section = $attr{'name'};
    } elsif ($element eq 'entry' && $current_section eq $target_section) {
        if ($attr{'name'} eq $target_entry) {
            $entry_value = $attr{'value'};
            $parser_context->finish();
        }
    }
}
Let's take a closer look at what's going on. Using the parser requires several steps:
Tell Perl that we're using the parser module.
Set up some global variables at file scope; we'll use these to keep track of the things we've seen. It's not particularly elegant, but it keeps the example fairly simple.
The XML::Parser module exposes the parser as an object. This line creates a new instance of the parser object. (ErrorContext tells the parser how many lines of the XML file it should display if the XML document is not well-formed.)
The next step is to tell the parser what events we're interested in. For the purposes of this example, we only care about start-tags, so we tell the parser to call the start_handler function each time a start-tag is encountered.
After establishing which events we're interested in, we run the parser. This function won't actually return until the entire document has been parsed.
Each time the parser encounters a start-tag, it calls the start_handler function, passing it several arguments: the parser context, which provides access to the underlying parser (our handler uses it only to call finish() once the entry has been found), the element name, and an associative array holding the attributes.
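If the global variables bother you, one possible alternative is to keep the state in lexical variables captured by an anonymous handler. This is a sketch rather than the article's implementation: get_profile_string_from_xml is a hypothetical name, it parses a string instead of a file to keep the example self-contained, and it uses a simple flag to ignore events after the first match.

```perl
use strict;
use warnings;
use XML::Parser;

# Hypothetical variant (name and string input are assumptions of this
# sketch): all state lives in lexicals that the handler closes over.
sub get_profile_string_from_xml {
    my ($xml, $section, $variable, $default) = @_;
    my $current_section = '';
    my $value           = $default;
    my $found           = 0;

    my $parser = XML::Parser->new(
        Handlers => {
            Start => sub {
                my ($expat, $element, %attr) = @_;
                return if $found;                  # ignore events after a match
                if ($element eq 'section') {
                    $current_section = defined $attr{name} ? $attr{name} : '';
                } elsif ($element eq 'entry'
                         && $current_section eq $section
                         && defined $attr{name}
                         && $attr{name} eq $variable) {
                    # A missing value attribute yields the empty string
                    $value = defined $attr{value} ? $attr{value} : '';
                    $found = 1;
                }
            },
        },
    );
    $parser->parse($xml);
    return $value;
}

my $xml = <<'XML';
<configuration-file>
  <section name="section1">
    <entry name="name1" value="value1"/>
  </section>
  <section name="section2">
    <entry name="someothername" value="someothervalue"/>
  </section>
</configuration-file>
XML

print get_profile_string_from_xml($xml, 'section2', 'someothername', 'none'), "\n";
# prints "someothervalue"
```

Because nothing is shared between calls, two lookups can't interfere with each other, at the cost of building a new handler for each call.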
sample.xml, the sample configuration file.
iniregexp.pl, the INI parser that uses regular expressions, and tregexp.pl, a test program that uses the regular expression version.
iniparser.pl, the INI parser that uses XML::Parser, and tparser.pl, another test program.
cfgfile1.pm, a more complete INI parsing application, and tcfgfile.pl, yet another tester.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.