An Introduction to XML Processing with Lark
Personal gratification
As for Java: <OldFart>well, maybe there's something to this "Object-Oriented" fad.</OldFart> But it ain't fast.
On the design-goal-compliance front, the good news is that if you wanted a program to process XML that you knew was well-formed, you could probably bash it out in Perl (don't know about Java) in a day or so. On the other hand, if you want to build a general purpose tool that does all of XML and provides helpful error messages and a useful API, the nominal week is not nearly enough. The development of Lark has consumed several weeks. I started sketching out Lark at SGML '96 sometime around the 20th of November, and it took into the first week of January. Mind you, in the interim I returned from Boston, bought a house, traveled to Australia to get married and to Saskatchewan for Christmas, and did some other work. As 1997 has worn on, I have put in a day here and there patching missing pieces and tracking changes in the XML spec.
I do think that Lark will be useful for exploring API designs. Of course, none of this will happen unless there are people out there who want to use an XML processor for something or other. Among other things, Lark currently has no user interface at all; while I don't mind editing the Driver.java file and recompiling to run tests, presumably a UI would be a good thing to have.
As for the financial aspects, I'm kind of gloomy. I think most XML processors are going to be purpose-built for the needs of particular applications, and will thus hide inside them. Which is good; XML's simplicity makes this approach cost-effective. Failing that, processors will be full-dress validating parsers with incremental parsing for authoring support. So I'm not sure that there's all that much need for a standalone processor; but I'd love to be wrong.
Just in case, for the moment I'm going to be giving away all the .class files, and some of the Java source code, but not the source code for the two classes with the hard bits. In any case, they're sure to be buggy at this stage, and I wouldn't want to be letting them out of my hands without a bit more polishing. If you can see a way to get a little revenue out of this project, give me a call.
There is some more scope for compression, when some useful facilities appear in the Java class libraries that ought to be there; e.g., a usable symbol table and better Unicode support.
Lark's error-handling is draconian. After encountering the first well-formedness error, no further internal data structures are built or returned to the application. However, Lark does continue processing the document looking for more syntax errors (and in fact performing some fairly aggressive heuristics on the tag stack in order to figure out what's going on), and calling doSyntaxError application callback (see below) to report further such errors.
Example 1
-rwxrwxrwa 810 Jun 26 15:07 Attribute.class |
-rwxrwxrwa 1756 Jun 26 15:07 Element.class |
-rwxrwxrwa 1022 Jun 26 15:07 Entity.class |
-rwxrwxrwa 1698 Jun 26 15:07 Handler.class |
-rwxrwxrwa 32774 Jun 26 15:07 Lark.class |
-rwxrwxrwa 1686 Jun 26 15:07 Namer.class |
-rwxrwxrwa 1934 Jun 26 15:07 Segment.class |
-rwxrwxrwa 855 Jun 26 15:07 Text.class |
-rwxrwxrwa 997 Jun 26 15:07 XmlInputStream.class |
Those named Lark and Namer have all the "tricky bits" of parsing and text management logic; they are black boxes for the moment.
The class XmlInputStream is a place to put logic to get access to streams of Unicode characters in various encodings. It is to be hoped that it can soon be replaced by equivalent functions in the Java base libraries.
The Text and Segment classes do Lark's character data management; details are below.
From an application's point of view, the Lark and Handler classes are central. Handler has methods such as doPI, doSTag, doEntityReference, doEtag; the application passes a Handler instance to Lark's readXML method, and Lark calls the routines as it recognizes significant objects in the XML document. The base class provides do-nothing methods; the intent is that an application would subclass Handler to provide the appropriate processing semantics.
Along with presenting this event stream to the application, Lark can optionally build a parse tree, and if so doing, can optionally save copies of the XML document's character data, all in parallel with providing the event stream. Lark provides methods to toggle these behaviors. These methods may be used in the Handler callbacks while Lark is running, to build the parse tree or save the text for only a subset of the document.
In building the parse tree, Lark is guaranteed to update only elements which are still open; i.e., for which the end-tag has not been seen. All other sections of the tree, and the entire tree once Lark has hit end-of-file, may be manipulated freely by the application.
An instance of Lark may be initialized with an optional list of element GIs which are to be considered as those of empty elements, whether or not the XML /> syntax is used. A typical set might begin: "HR", "BR", "IMG", ....
Another initializer allows a set of entities to be predefined.
Lark also provides the application access to the "entity tree"; there is a method that toggles whether Lark attempts to retrieve and parse external entities.
Handler has a method named element, which the internal logic uses to construct a new Element object every time this is required (rather than calling new Element() directly). The idea is that this can be subclassed to return, rather than an Element object, a custom-built instance of a subclass of Element.
The first argument of all the other Handler methods is an Entity object, which contains information about the location of the object just encountered and the current state of the entity stack. All of the Handler methods return booleans. If any handler method returnstrue, Lark's readXML method returns control to whatever called it. In the base Handler class, all the methods return false by default, except for the doSyntaxError method, which generates a message via System.err.println(). Example 2 summarizes the methods; the names and types of the arguments should make their meaning self-explanatory.
Example 2
// Detected Processing Instruction
public boolean doPI(Entity e, String PI)
// Detected a text segment
public boolean doText(Entity ent, Element el, char[] text, int length)
// Doctype declaration; version for SYSTEM-only
public boolean doDoctype(Entity e, String rootID, String externalSubsetID)
// version for no SYSTEM or PUBLIC keyword
public boolean doDoctype(Entity e, String rootID)
// Internal (text) entity declaration
public boolean doInternalEntity(Entity e, String name, char[] value)
// External text entity declaration; external ID is a URL
public boolean doSystemTextEntity(Entity e, String name, String extID)
// NDATA entity declaration
public boolean doSystemBinaryEntity(Entity e, String name, String extID,
String notation)
// End-tag detected; the element should be complete
public boolean doETag(Entity e, Element element)
// Start tag detected; the element will in general be
// incomplete, unless element.empty() is true
public boolean doSTag(Entity e, Element element)
// Entity reference detected; e is the containing entity,
// not the one that was detected
public boolean doEntityReference(Entity e, String name)
// Error condition detected; the entity e will have the
// current locational information filled in
public boolean doSyntaxError(Entity e, String message)Notes:
public boolean doSyntaxError(Entity e, String message){System.err.println("Syntax error at line " +e.lineCount() + ":" + e.lineOffset() + ": " + message);return false;}
public Lark() throws Exception
public Lark(String[] emptyElementTypes) throws
Exception
public Lark(String[] entityNames, String[] entityValues) throws
Exception
public Lark(String[] emptyElementTypes,
String[] entityNames,
String[] entityValues) throws Exception
/> XML empty-element syntax. Lark ignores case in GIs, and always reports GIs to the handler in uppercase form.
public void buildTree(boolean build)public void saveText(boolean save)public void processExternalEntities(boolean process)
In each case, the caller must provide a Handler (or more likely an instance of a class subclassed from Handler). Lark can read from an XmlInputStream (subclassed from BufferedInputStream--see below) or from a byte array. Perhaps Lark should also be able to read from a character array.
As noted above, Lark returns from readXML whenever one of the Handler methods returns true; in which case it maintains the parsing state and readXML can be reinvoked. It also returns when it encounters end of input on the document entity.
The Element object returned by readXML is not meaningful if tree-building has not been turned on. If tree-building has been turned on, it is topmost node in the tree built by Lark--this may not be the node for the root element if tree-building was not turned on until later in the tree.
Example 3
// no input stream provided - assume current state is valid
public Element readXML(Handler handler)
throws java.io.IOException
// XmlInputStream provided - assume it's new
public Element readXML(Handler handler, XmlInputStream input)
throws java.io.IOException
// byte-buffer provided - make an inputstream and put it on stack
public Element readXML(Handler handler, byte[] buffer)
throws java.io.IOException
Example 4
class Attribute
{
String mName;
Text mValue;
// Constructor
Attribute(String name, String value)
public String name() { return mName; }
public void setName(String name) { mName = name; }
public String value() { return mValue; }
public void setValue(String value) { mValue = value; }
}
public int getXmlChar() throws IOException
This is intended to retrieve enough bytes from the BufferedInputStream to constitute one Unicode character, and return it. Obviously, the logic is highly dependent on the input encoding.XmlInputStream has all the same constructors as BufferedInputStream, plus one extra:
XmlInputStream(char[] charBuf)
This allows getXmlChar() to work out of a character array. Lark uses this facility to support internal text entities.It is remarkable that Java's built-in BufferedInputStream has no getCharacter() method; with any luck, this oversight will soon be corrected, allowing XmlInputStream to be retired. Java does have facilities for reading characters from a DataInputStream, but their design makes them unusable by XML (or by any other application I can think of). Or perhaps I just didn't understand the documentation.
Element is more complex. Readers are urged to look at the source code, which is heavily commented. A few fields and methods of special interest are excerpted in Example 5.
Example 5
public class Element
{
Attribute[] mAttrs; // Attributes
String mType;
// Children are a Vector because they might be Elements
// and they might be Text objects
Vector mChildren = new Vector();
// Parent element. null does not mean this is the root,
// it just means this is where Lark started constructing
// the tree.
Element mParent;
public String type();
public Attribute[] allAttributes()
public void setAllAttributes(Attribute[] attributes)
// get & set attribute by name
public Attribute attribute(String name)
public void setAttribute(String name, String value)
public Vector children()
public Element parent()
}
Note that Lark only creates one String object for each unique element type. This means that the String objects returned by the type() method may be simply compared for equality to establish whether two elements are the same type.
The Entity class's fields are shown in Example 6: it has only trivial get and set methods for these fields.
Note that Entities differ from Elements in one important respect: Lark always constructs the Entity tree.
The Segment class is much more complex, and tries to provide a means of storing information about character data, and optionally the character data itself, efficiently while keeping track of source offset information. Its implementation, particularly the constructors, is fairly complex, but the methods that an application might use are straightforward:
Example 6
// locational info int mOffset; int mLineCount; int mLineOffset; // how far into this line // not true for internal entities boolean mTextIsReal; XmlInputStream mInput; // read characters for this entity String mDescription; // metadata Entity mParent; // entity tree; null means doc entity
public Entity entity()public int sourceOffset()public int sourceLength()public int dataOffset()public int dataLength()public String string()
Example 7
class Text
{
// This is a Text object, which can appear in the Children()
// of an element. The text itself is actually stored in a
// Vector of Segments; this object really only exists to
// provide a nice singular object to sit in the parse
// tree, and to avoid recalculating the String value
// of a multi-segment object.
Vector mSegments = new Vector();
String mString;
public void addSegment(Segment segment)
{
mSegments.addElement(segment);
mString = null; // invalidate string
}
public Vector segments() { return mSegments; }
// construct string only on request
public String string()
{
int i;
Segment seg;
if (mString == null)
{
mString = new String();
for (i = 0; i < mSegments.size(); i++)
{
// there must be a better way
seg = (Segment) mSegments.elementAt(i);
mString = mString.concat(seg.string());
}
}
return mString;
}
}
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.