XML: Can the Desperate Perl Hacker Do It?

October 2, 1997

Michael Leventhal

Abstract

Is Perl a suitable language for programming XML? The use of Perl with XML is illustrated in this article with a program that checks to see if an XML document is well-formed. The relative simplicity of the program demonstrates that lightweight Perl programs may be used with XML, although Unicode and the use of entities make it difficult for Perl programmers to handle some XML files.

Perl Meets XML

This article presents a little program written in Perl (4) which checks the well-formedness of an XML document. We'll discuss what well-formedness is exactly in a couple of paragraphs. The main purpose of this article is to show that Perl hackers can do it; that is, that little programs, utility scripts, and CGI stuff, can be hacked in the hacking language of choice, Perl.

Java is all sizzle and sex these days, and while I proclaim my liking for Java in the extreme, it is not terribly controversial to assert that Java is:

NOT a quick and dirty hacking language
BEYOND the ken of most people who write computer programs, professional programmers being a minority within this group

I think XML is going to become the ubiquitous standard for the encoding of text, and people a lot smarter than me have already said that--so it stands a reasonable chance of being approximately true. Not only is Perl the pre-eminent, practically uncontested hacking language, but it is doubly pre-eminent in the hacking of text where its regular expression facilities outshine every competitor. People will hack: they will hack text, they will hack text encoding in XML, they will hack in Perl, they will hack XML in Perl. Java will be the language of choice for pristine computer programs and billions of lines of code will be handy-randy shoved into miserable little Perl programs that do all the dirty work.

The Catch: Perl and Unicode

Lovely or unlovely as my proposition of a harmonious coexistence between Java and Perl may be, there is at least one rather important impediment to this vision: Unicode. Perl regular expressions assume 8-bit characters. It is neither terribly difficult to read and write wide characters, nor is it impossible to perform regular expressions on wide character strings, but such acts may prove to be a "royal pain in the butt-inski." You can get away with saying that most 8-bit character files are Unicode--for the time being that is as close to being Unicode as Perl will get. XML, on the other hand, has taken the hard line on Unicode: XML processors must be able to read both UTF-8 and UCS-2 documents and use the Byte Order Mark in UCS-2 to distinguish which is which. It is more or less the case that Perl programs cannot, by this definition, be XML processors. However, the fact that Perl programs cannot be XML processors does not mean that Perl can't and won't be written to do useful things with 8-bit XML documents.
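To make the Byte Order Mark business concrete, here is a minimal sketch of how a Perl script might sniff the first two bytes of a document. The subroutine name guess_encoding is made up for this illustration, and treating "no Byte Order Mark" as UTF-8 is an assumption of the sketch, not a rule quoted from the standard. It only identifies the encoding; it does nothing to help the regular expressions cope with wide characters.

```perl
use strict;
use warnings;

# Guess the encoding family of a document from its leading bytes.
# guess_encoding is a hypothetical helper, not part of any library.
sub guess_encoding {
    my ($bytes) = @_;
    my $lead = substr($bytes, 0, 2);
    return 'UCS-2BE' if $lead eq "\xFE\xFF";   # big-endian Byte Order Mark
    return 'UCS-2LE' if $lead eq "\xFF\xFE";   # little-endian Byte Order Mark
    return 'UTF-8 (assumed)';                  # assumption: no mark means 8-bit text
}

print guess_encoding("\xFE\xFF\x00<"), "\n";
print guess_encoding("<?XML version='1.0'?>"), "\n";
```

A script could call this on the slurped file and simply refuse to go further when it sees a UCS-2 document.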

Well-Formedness in a Nutshell

The program described in this article (Example 1) does something useful with XML documents; it checks to see if they are "well-formed." The essential objective in well-formedness checking is to make sure the document can be properly handled by an application that does not use or need the Document Type Definition (DTD). An application that does need to know the DTD would require a different kind of validation--a validity check--to ensure that the document instance follows the grammar expressed in the DTD. The standard requires all XML processors to, at a minimum, check well-formedness. At least one omnipresent application, browsing, is expected to not require anything beyond well-formedness; but it is my guess that in fact most applications will not use the DTD to process XML documents.

"Well-formed" is defined precisely in the XML standard, and more succinctly than I could manage, as follows:

A textual object is said to be a well-formed XML document if, first, it matches the production labeled document, and if for each entity reference which appears in the document, either the entity has been declared in the document type declaration or the entity name is one of: amp, lt, gt, apos, quot.

Matching the document production implies that:

It contains one or more elements.
It meets all the well-formedness constraints (WFCs) given in the grammar.
There is exactly one element, called the root, or document element, for which neither the start-tag nor the end-tag is in the content of any other element. For all other elements, if the start-tag is in the content of another element, the end-tag is in the content of the same element. More simply stated, the elements, delimited by start- and end-tags, nest within each other.

As a consequence of this, for each non-root element C in the document, there is one other element P in the document such that C is in the content of P, but is not in the content of any other element that is in the content of P. Then P is referred to as the parent of C, and C as a child of P.

There is a bit more to "matching the document production" than stated; one has to actually look at the document production to get every bit of it. There are various things other than elements that may appear in an XML document, as well as some well-formed conditions to check with respect to their syntax and order.
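The nesting rule is easy to picture with a stack. The following is a rough sketch of the idea only, not the checker presented later in Example 1: the name tags_nest is invented for this illustration, and it deliberately ignores comments, processing instructions, CDATA sections, the single-root rule, and attribute values that contain ">".

```perl
use strict;
use warnings;

# Return 1 if start- and end-tags nest properly, 0 otherwise.
# tags_nest is a hypothetical helper covering the nesting rule only.
sub tags_nest {
    my ($text) = @_;
    my @stack;                                   # names of currently open elements
    while ($text =~ /<(\/?)([A-Za-z_:][\w.:-]*)[^>]*?(\/?)>/g) {
        my ($is_end, $name, $is_empty) = ($1, $2, $3);
        if ($is_end) {
            # An end-tag must close the most recently opened element.
            return 0 unless @stack && pop(@stack) eq $name;
        }
        elsif (!$is_empty) {
            push @stack, $name;                  # start-tag opens a new parent
        }
    }
    return @stack ? 0 : 1;                       # anything left open is an error
}

print tags_nest('<a><b/><c>text</c></a>') ? "nests\n" : "does not nest\n";
print tags_nest('<a><b></a></b>')         ? "nests\n" : "does not nest\n";
```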

Caveats

Now I have to admit that I've lied. My program is not a well-formedness checker because I have not done anything with entities. An entity is, well, lots of things--but generally a placeholder for a text or data object that is stored elsewhere. Entities are a good idea in theory since they allow you to store a bit of text that may be used in, say, thousands of files in a single location. When it is time to modify that bit of text this is performed once rather than thousands of times in each file. The idea sounds so good it is hard to believe that entities have proven to be of relatively little use in SGML practice. Among the reasons for this are that other, more robust and complete tools are usually used to manage text fragments outside of SGML's internal mechanisms. Only slightly tongue-in-cheek do we point out the following: had the designers of entities known Perl they would not have invented them in the first place--why bother when you can write a script to replace text in 50,000 files in less than 30 seconds.
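For the skeptical, the kind of script I have in mind looks roughly like this. It is a sketch under simple assumptions: the name replace_in_files is invented here, the text to replace is taken literally (not as a pattern), each file fits in memory, and no backup copies are made.

```perl
use strict;
use warnings;

# Replace every occurrence of a literal string in each named file, in place.
# Returns the total number of replacements made.
# replace_in_files is a hypothetical helper written for this illustration.
sub replace_in_files {
    my ($old, $new, @files) = @_;
    my $total = 0;
    foreach my $path (@files) {
        open(my $in, '<', $path) or die "Cannot read $path: $!";
        my $text = do { local $/; <$in> };       # slurp the whole file
        close($in);
        $text = '' unless defined $text;
        my $count = ($text =~ s/\Q$old\E/$new/g);
        next unless $count;                      # leave untouched files alone
        open(my $out, '>', $path) or die "Cannot write $path: $!";
        print $out $text;
        close($out);
        $total += $count;
    }
    return $total;
}

# Usage (hypothetical): perl replace.pl OLD NEW file1.xml file2.xml ...
if (@ARGV >= 3) {
    my $n = replace_in_files(@ARGV);
    print "$n replacement(s) made\n";
}
```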

Along with text replacement, there are approximately 11 other uses for entities in SGML. XML has fewer than SGML but still more than the single use described above, so the entity picture isn't nearly as simple as one may think. Parameter entities, used to modularize DTDs, particularly complicate matters.

One of the goals of XML, as actually stated in the standard, is the following:

It shall be easy to write programs which process XML documents.

Of course, this isn't exactly an objective statement. A few more objective criteria have been proposed as modifications of this statement, but no consensus has been reached. To some, "easy" means that a Perl hacker[1] supplied with enough Jolt cola can write an XML application in a single sitting. To others, it means a computer science graduate student writing an XML parser over the weekend. Another point of view is that the complexity of XML is relatively unimportant in determining whether or not it is easy to write programs. Programs will simply use an XML API that hides the complexity of XML, and applications will be constructed with component architectures such as JavaBeans.

Anyway, my objective is to answer the question--is XML a format that Perl programmers will come to love because of its speed and ease in hacking together applications? The answer is a qualified yes--the two doubtful areas are Unicode and entities.

The following is a list of things I would need to do to handle entities according to the last draft of the XML standard:

I must read both the internal and external DTD subsets. I could have entity declarations in either.

I'm not sure that I'm helped any by the fact that the rules are different for the internal and external subsets. In fact, maybe the fact that they are different just makes it harder. (Internal: No marked sections and "integral" parameter entity declarations.)

Of course it would help if I could use the declaration RMD=internal. But then I would only have a well-formedness checker for documents with RMD=internal or RMD=none.
In view of the above, I must process marked sections in order to correctly interpret the external DTD and possible entity declarations declared within them.
I must handle parameter entities. They are used in the declaration of marked sections and also may be used within entity declarations.
I must completely expand nested entities to ensure that a general entity isn't recursive, as well as to instantiate it in the document for the document production check.
Since it is a WF error to use an entity reference to binary data, I must note which general entities refer to binary data.
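To give a flavor of just one item on this list, here is a sketch of expanding nested general entities while watching for recursion. It assumes the hard part is already done: the declarations have somehow been collected into a hash, which is exactly what reading the DTD subsets, marked sections, and parameter entities would have to produce. The subroutine name expand_entities and the sample entity names are made up for the illustration.

```perl
use strict;
use warnings;

# Expand general entity references (&name;) in $text using the declarations
# in %$entities. Dies if an entity is undeclared or refers to itself,
# directly or indirectly. expand_entities is a hypothetical helper.
sub expand_entities {
    my ($text, $entities, $active) = @_;
    $active ||= {};                              # entities currently being expanded
    $text =~ s{&([A-Za-z_:][\w.:-]*);}{
        my $name = $1;
        die "Undeclared entity: $name\n" unless exists $entities->{$name};
        die "Recursive entity: $name\n"  if $active->{$name};
        my %seen = (%$active, $name => 1);
        expand_entities($entities->{$name}, $entities, \%seen);
    }ge;
    return $text;
}

# Hypothetical declarations, as if already read from a DTD.
my %entities = (
    legal => '&owner; 1997',
    owner => 'Grif, S.A.',
);
print expand_entities('Copyright &legal;', \%entities), "\n";
```

Even this much leaves out binary entities and the internal/external subset rules, which is rather the point.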

After I checked how many cans of Jolt I had in stock, I said forget it.

So here is what my program does do: it checks the document production without expanding entities and without looking at the DTD, either external or internal. It is perfectly adequate for checking the well-formedness of any document with or without entities, although the consequences of expanding the entity references are not taken into account.

Real-World Experience

This program started out as an XML transformation engine; I used it to teach XML Web programming to the students in my U.C. Berkeley Extension class. The document is treated as an event stream--each time a start tag, end tag, empty tag, or content is seen, a processing routine is invoked. The name of the processing routine was generated from the tag name and concatenated with the names of ancestor tags to achieve context sensitivity. The core program--that is, everything except the processing routines themselves--took thirty lines. Although the processing routine invocation has been removed, most of the program is intact in the process_element subroutine (see Example 1). My students, 90% of whom were non-programmers, were able to use this framework to write fairly complete CGI programs for transforming XML into HTML; for many it was the first time they had ever written a program in their lives. The program is also used on several Web sites in real-life XML applications.
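The dispatching idea can be sketched as follows. This is a reconstruction of the general technique, not the code used in the class: the handler naming scheme (start_PARENT_TAG falling back to start_TAG), the sample handlers, and the subroutine names dispatch and transform are all assumptions of this illustration, and only the immediate parent is used for context rather than the full list of ancestors.

```perl
use strict;
use warnings;

my @ancestors;      # stack of open element names
my $output = '';    # accumulated HTML

# Hypothetical handlers: named after the tag, optionally prefixed by its parent.
sub start_title         { $output .= '<h1>' }
sub end_title           { $output .= '</h1>' }
sub start_section_title { $output .= '<h2>' }
sub end_section_title   { $output .= '</h2>' }

# Call the most specific handler that exists for this event, if any.
sub dispatch {
    my ($event, $tag) = @_;
    my $parent = @ancestors ? $ancestors[-1] : '';
    foreach my $name ("${event}_${parent}_${tag}", "${event}_${tag}") {
        my $handler = __PACKAGE__->can($name);
        return $handler->() if $handler;
    }
    return;
}

# Walk the document as a stream of start tags, end tags, and content.
sub transform {
    my ($text) = @_;
    @ancestors = ();
    $output    = '';
    foreach my $token (split /(<[^>]+>)/, $text) {
        if ($token =~ /^<\/(\w+)>$/) {           # end tag
            my $tag = $1;
            pop @ancestors;
            dispatch('end', $tag);
        }
        elsif ($token =~ /^<(\w+)>$/) {          # start tag
            my $tag = $1;
            dispatch('start', $tag);
            push @ancestors, $tag;
        }
        else {
            $output .= $token;                   # character data passes through
        }
    }
    return $output;
}

print transform('<doc><title>Top</title><section><title>Sub</title></section></doc>'), "\n";
```

The appeal for non-programmers is that writing a transformation comes down to writing one tiny subroutine per tag, with a more specific name wherever context matters.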

The original program would catch errors in tag nesting but lacked error reporting and recovery. It would "report" errors by crashing! It did not check for all possible error conditions dictated by the document production. It cost a couple of hundred lines to make the program more complete but it is still a pretty lightweight piece of work.

A point to be gleaned from the above is that the approach used in this program is useful. Although it does not process entities and does not handle UCS-2 Unicode documents, it works very well indeed for the projects for which it was intended.

Example 1

#!/usr/bin/perl
#####################################
# iswf - checks an XML file to see if it is well-formed.
#
#     iswf < XMLINPUT
#
# writes error messages to STDOUT
#
# M.Leventhal, Grif, S.A.
# michael@grif.fr
# 1 Sept 1997
#
# Notes: based on 07-Aug-97 XML Working draft. Not complete, does no
# entity checks, ASCII-only, among other omissions, but catches lots of
# stuff.
#
# Unrestricted use is hereby granted as long as the author is credited or
# discredited as the case may be.
#####################################

# The first two lines cause the entire document to be read into the
# $file variable. This spares me certain complications which arise from
# reading it line by line and Perl is able to do this sort of thing fairly
# efficiently.

undef($/);
$file = <>;

# I loop through the file, processing each start or end
# tag when it is seen.

while ($file =~ /[^<]*<(\/)?([^>]+)>/) {

    $st_or_et = $1;
    $gi = $2;
    $file = $';

    # I recognize the following kinds of objects: XML declaration
    # (a particular type of processing instruction), processing
    # instructions, comments, doctype declaration, cdata marked
    # sections, and elements. Since the document production has
    # order rules I set a flag when a particular type of object
    # has been processed. I invoke a subroutine to process each
    # type of object.

    if ($gi =~ /^\?XML/) {
        &process_decl;
        $decl_seen = 1;
    }
    elsif ($gi =~ /^\?/) {
        &process_pi;
        $misc_seen = 1;
    }
    elsif ($gi =~ /^!\-\-/) {
        &process_comment;
        $misc_seen = 1;
    }
    elsif ($gi =~ /^!DOCTYPE/) {
        &process_doctype;
        $doctype_seen = 1;
    }
    elsif ($gi =~ /^\!\[CDATA\[/) {
        &process_cdata;
    }
    else {
        &process_element;
        $element_seen = 1;
    }
}

# There are some checks to catch various errors at the end. I
# make sure I have emptied the stack of all parents and I
# make sure there is no uncontained character data hanging
# around.

&check_empty_stack;
&check_uncontained_pcdata;

# Print a happy message if there are no errors.

&check_error_count;

#--------------------------------------------------------------------------#

sub check_error_count {
    if ($error_count == 0) {
        print "This document appears to be well-formed.\n";
    }
}

#--------------------------------------------------------------------------#

# Check to see if the ancestor stack containing all parents up to the
# root is empty.

sub check_empty_stack {
    if ($#ancestors > -1) {
        &print_error_at_context;
    }
}

#--------------------------------------------------------------------------#

# Check to see if there is any uncontained PCDATA lying around (white space
# at the end of the document doesn't count). I check also to see that
# a root to the document was found which catches a null file error.

sub check_uncontained_pcdata {
    if ($file !~ /^\s*$/ || $ROOT eq "") {
        $error_count++;
        print "\nNot well formed uncontained #PCDATA or null file\n";
    }
}

#--------------------------------------------------------------------------#

# Check that the XML declaration is coded properly and in the correct
# position (before any other object in the file and occurring only
# once.)

sub process_decl {

    if ($decl_seen || $misc_seen || $doctype_seen || $element_seen) {
        $error_count++;
        print "XML declaration can only be at the head of the document.\n";
    }

    # No checks are performed on processing instructions but the following
    # will be used to store the PI in the $gi variable and advance the
    # file pointer.

    &process_pi;

    # This is slightly lazy since we allow version='1.0". It is quite simple
    # to fix just by making an OR of each parameter with either ' ' or " "
    # quote marks.

    if ($gi !~ /\?XML\s+version=[\'\"]1\.0[\'\"](\s+encoding=[\'\"][^\'\"]*[\'\"])?(\s+RMD=[\'\"](NONE|INTERNAL|ALL)[\'\"])?\s*\?/) {
        $error_count++;
        print "Format of XML declaration is wrong.\n";
    }
}

#--------------------------------------------------------------------------#

# Check that the Doctype statement is in the right position and, otherwise,
# make no attempt to parse its contents, including the root element. The
# root element will be determined from the element production itself and
# the "claim" of the Doctype won't be verified.

sub process_doctype {

    if ($doctype_seen || $element_seen) {
        $error_count++;
        print "Doctype can only appear once and must be within prolog.\n";
    }

    # If an internal subset was opened but not closed within this tag,
    # read ahead to the closing "]>".

    if ($gi =~ /\[/ && $gi !~ /\]$/) {
        $file =~ /\]>/;
        $file = $';
        $gi = $gi.$`.$&;
    }
}

#--------------------------------------------------------------------------#

# Performs the well-formed check necessary to verify that CDATA is not
# nested. We will pick up the wrong end of CDATA marker if this is the
# case so the error message is critical.

sub process_cdata {

    if ($gi !~ /\]\]$/) {
        $file =~ /\]\]>/;
        $file = $';
        $gi = $gi.$`."]]";
    }

    $gi =~ /\!\[CDATA\[(.*)\]\]/;
    $body = $1;

    if ($body =~ /<\!\[CDATA\[/) {
        print "Nested CDATA.\n";
        &print_error_at_context;
    }
}

#--------------------------------------------------------------------------#

# Performs the well-formed check of ensuring that '--' is not nested
# in the comment body which would cause problems for SGML processors.

sub process_comment {

    if ($gi !~ /\-\-$/) {
        $file =~ /\-\->/;
        $file = $';
        $gi = $gi.$`."--";
    }

    $gi =~ /\!\-\-((.|\n)*)\-\-/;
    $body = $1;

    if ($body =~ /\-\-/) {
        $error_count++;
        print "Comment contains --.\n";
    }
}

#--------------------------------------------------------------------------#

# This is the main subroutine which handles the ancestor stack (in an
# array) checking the proper nesting of the element part of the document
# production.

sub process_element {

    # Distinguish between empty elements which do not add a parent to the
    # ancestor stack and elements which can have content.

    if ($gi =~ /\/$/) {
        $xml_empty = 1;
        $gi =~ s/\/$//;

        # XML well-formedness says every document must have a container so an
        # empty element cannot be the root, even if it is the only element in
        # the document.

        if (!$element_seen) {
            $error_count++;
            print "Empty element <$gi/> cannot be the root.\n";
        }
    }
    else {
        $xml_empty = 0;
    }

    # Check to see that attributes are well-formed.

    if ($gi =~ /\s/) {
        $gi = $`;
        $attrline = $';
        $attrs = $attrline;

        # This time we properly check to see that either ' ' or " " is
        # used to surround the attribute values.

        while ($attrs =~ /\s*([^\s=]*)\s*=\s*(("[^"]*")|('[^']*'))/) {

            # An end tag may not, of course, have attributes.

            if ($st_or_et eq "\/") {
                print "Attributes may not be placed on end tags.\n";
                &print_error_at_context;
            }
            $attrname = $1;

            # Check for a valid attribute name.

            &check_name($attrname);
            $attrs = $';
        }
        $attrs =~ s/\s//g;

        # The above regex should have processed all the attributes. If anything
        # is left after getting rid of white space it is because the attribute
        # expression was malformed.

        if ($attrs ne "") {
            print "Malformed attributes.\n";
            &print_error_at_context;
        }
    }

    # If XML is declared case-sensitive the following line should be
    # removed. At the moment it isn't so I set everything to lower
    # case so we can match start and end tags irrespective of case
    # differences.

    $gi =~ tr/A-Z/a-z/;

    if (!$element_seen) {
        $ROOT = $gi;
    }

    # Check to see that the generic identifier is a well-formed name.

    &check_name($gi);

    # If I have an end tag I just check the top of the stack, the
    # end tag must match the last parent or it is an error. If I
    # find an error I could either pop or not pop the stack.
    # What I want is to perform some manner of error recovery so
    # I can continue to report well-formed errors on the rest of
    # the document. If I pop the stack and my problem was caused
    # by a missing end tag I will end up reporting errors on every
    # tag thereafter. If I don't pop the stack and the problem
    # was caused by a misspelled end tag name I will also report
    # errors on every following tag. I happened to choose the latter.

    if ($st_or_et eq "\/") {
        $parent = $ancestors[$#ancestors];
        if ($parent ne $gi) {
            if (@ancestors eq $ROOT) {
                @ancestors = "";
            }
            else {
                &print_error_at_context;
            }
        }
        else {
            pop @ancestors;
        }
    }
    else {

        # This is either an empty tag or a start tag. In the latter case
        # push the generic identifier onto the ancestor stack.

        if (!$xml_empty) {
            push (@ancestors, $gi);
        }
    }
}

#--------------------------------------------------------------------------#

# Skip over processing instructions.

sub process_pi {
    if ($gi !~ /\?$/) {
        $file =~ /\?>/;
        $gi = $gi.$`."?";
        $file = $';
    }
}

#--------------------------------------------------------------------------#

sub print_error_at_context {

    # This routine prints out an error message with the contents of the
    # ancestor stack so the context of the error can be identified.

    # It would be most helpful to have line numbers. In principle it
    # is possible but more difficult since we choose to not process the
    # document line by line. We could still count line break characters
    # as we scan the document.

    # Nesting errors can cause every tag thereafter to generate an error
    # so stop at 10.

    if ($error_count == 10) {
        print "More than 10 errors ...\n";
        $error_count++;
    }
    else {
        $error_count++;
        print "Not well formed at context ";

        # Just cycle through the ancestor stack.

        foreach $element (@ancestors) {
            print "$first$element";
            $first = "->";
        }
        $first = "";
        print " tag: <$st_or_et$gi $attrline>\n";
    }
}

#--------------------------------------------------------------------------#

# Check for a well-formed Name as defined in the Name production.

sub check_name {
    local($name) = @_;

    if ($name !~ /^[A-Za-z_:][\w\.\-:]*$/) {
        print "Invalid element or attribute name: $name\n";
        &print_error_at_context;
    }
}

#--------------------------------------------------------------------------#

Conclusion

I've been writing Perl programs to process SGML for the last several years, aware of, but blithely undisturbed by, the fact that

My SGML files may not have followed all the rules, and
My programs could only handle my private SGML variants.

On the other hand, I've rewritten lots of programs over and over again because I did not and could not adhere to a fixed standard. XML is attractive, in principle, because it gives me an attainable target for my Perl programs. And if the promise holds we should start seeing hundreds, maybe thousands, of reusable Perl modules appearing on the Web which will make XML as easy to process as ordinary strings. It will be the ubiquitous text format.

I think the program in Example 1 illustrates that Perl programs can do XML. It would be a blessing for all if those working on the XML standard would simplify entity processing a bit more and fight like the devil against any and all attempts to restuff the relatively Spartan design of XML with padding and fluff from SGML's historical legacy. From the perspective of the Desperate Perl Hacker, XML would do well to simplify a bit more, and cannot afford to add complications of relatively little value.

About the Author

Michael Leventhal

Grif, S.A.

Vice-President, Technology

1800 Lake Shore Avenue

Suite 14

Oakland, California 94606

Michael.Leventhal@grif.fr

Michael Leventhal is Vice-President, Technology for GRIF and is responsible for the definition and planning of GRIF's XML products. Before joining GRIF he ran his own consulting company and has worked for Oracle and other Silicon Valley firms in software architecture and development. He has taught an SGML class for U.C. Berkeley Extension and is writing a book on XML and Intranets which will be published by Prentice-Hall next year.