XML.com: XML From the Inside Out

XML.comWebServices.XML.comO'Reilly Networkoreilly.com
  Articles | Weblogs | Newsletter | Safari Bookshelf
advertisement

Article:
 XML Isn't Too Hard
Subject: XML can NOT be parsed with regexp
Date: 2003-04-21 06:57:03
From: Dirk Herr-Hoyman

One point that should be raised in this discussion
on XML parsing is that from a theoretical perspective, XML can NOT be parsed with only regular expressions.


This comes from the theory of Formal Languages and
was proven in a formal sense in the 1960's by linguists, mathematicians and computer scientists.
This is the basis for programming language design and compilers. Anyone who gets trained as a
Computer Scientist gets this background in undergrad or possibly grad level courses. I happen to have taught some of those very courses.


The basic classes of formal languages in this theory are:


regular expressions (RE)
context free (CF)
context sensitive (CS


with RE being contained by CF and CF being contained by CS. A CF language is one that has
so-called block or nested structures. Pascal and
C, for example. And XML too, of courses.
A RE describes constructs that do NOT have nested
structures, and in this theory it's how you get
as base terms or words.


To handle an RE, one constructs a state machine.
All the regexp engines do this, with the finer
touches being things like lookahead. To handle
a CF, you need a state machine and a stack, the
stack being how you deal with nested structures.
There are many textbooks and references in the scientific literature that will show you the gory details, oh like Aho, Sethi, and Ullman's "Dragon Book".


It's a basic mistake in this realm to think you can parse a CF with REs. This would be a homework
assignment I'd give in a Programming Languages course, just to make a point, allowing the students to struggle with it. You end up with a series a really ugly RE that attempts to look for begin/end constructs. I see our industry going thru the same learning curve. I see individuals I work with trying to do the same thing too. It's pretty frustrating to see this, as the answer is to use a CF grammar parser, not an RE.


Software tools to do a CF parse have been around since the 1970's. In the 1st wave, these were called compiler-compilers and had a declarative syntax. Things like YACC in unix. More recently,
we've seen a similar approach in the Apache Jetspeed project for Velocity, which uses javacc.
I ask myself, why this approach is not more pervasive in the world of XML?


Actually, it IS, it's just that the tools that do this have some learning curve. A DOM XML parser does deal with a CF language. SAX is related, not quite the same thing, but still to GET it you have to understand about a CF here too.


No Previous Message Previous Message   Next Message No Next Message

Sponsored By:


Contact Us | Our Mission | Privacy Policy | Advertise With Us | | Submissions Guidelines
Copyright © 2008 O'Reilly Media, Inc. | (707) 827-7000 / (800) 998-9938