XML.com: XML From the Inside Out

The Evolution of JAXP

July 06, 2005

Introduction

After the first release of the W3C XML 1.0 recommendation in early 1998, XML started gaining huge popularity. Sun Microsystems, Inc. had at that time just formalized the Java Community Process (JCP), and the first version of JAXP (JSR-05) was made public in early 2000, supported by industry majors such as (in alphabetical order) BEA Systems, Fujitsu Limited, Hewlett-Packard, IBM, Netscape Communications, Oracle, and Sun Microsystems, Inc.

JAXP 1.0, then called the Java API for XML Parsing, was a hit in the developer community because of the pluggability layer it provides, which is the essence of JAXP. By using the JAXP APIs, developers can write programs that are independent of the underlying XML processor, and can replace that processor without changing a single line of application code.

So what exactly is JAXP? First of all, there has been some confusion in the past about the P in JAXP: Parsing or Processing? Because JAXP 1.0 supported only parsing, it was called the Java API for XML Parsing. JAXP 1.1 (JSR-63) introduced XML transformation using XSLT. The W3C XSLT specification, however, does not define any API for transformation, so the JAXP 1.1 Expert Group (EG) introduced a set of APIs called the Transformation API for XML (TrAX). Since then, JAXP has been called the Java API for XML Processing. It has since evolved to support much more than parsing an XML document: validation against a schema while parsing, validation against a preparsed schema, evaluation of XPath expressions, and so on.

So, JAXP is a lightweight API for processing XML documents that is agnostic of the underlying XML processor, which is pluggable.
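
The pluggability layer can be observed directly. The following is a minimal sketch, assuming a standard JDK with its bundled XML processor; the class name PluggabilityDemo is made up for illustration. The application only ever names the abstract factory classes, and JAXP decides which concrete implementation backs them.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;

// Hypothetical demo class: reports which concrete XML processor JAXP selected.
class PluggabilityDemo {

    // Name of the concrete class behind the abstract SAX factory.
    static String saxFactoryImpl() {
        return SAXParserFactory.newInstance().getClass().getName();
    }

    // Name of the concrete class behind the abstract DOM factory.
    static String domFactoryImpl() {
        return DocumentBuilderFactory.newInstance().getClass().getName();
    }

    public static void main(String[] args) {
        // The printed names depend on the runtime and on what is on the classpath.
        System.out.println("SAX factory: " + saxFactoryImpl());
        System.out.println("DOM factory: " + domFactoryImpl());
    }
}
```

One way to plug in a different processor is to set the system property javax.xml.parsers.SAXParserFactory (or javax.xml.parsers.DocumentBuilderFactory) to the name of another factory class when launching the JVM; the application code above stays unchanged.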

XML Parsing Using JAXP

JAXP supports object-based and event-based parsing. For object-based parsing, only W3C DOM is supported so far; future versions of JAXP might support JDOM as well. For event-based parsing, only SAX is supported. Another event-based mechanism, pull parsing, arguably should have been made part of JAXP. However, it is covered by a separate JSR (JSR-173), the Streaming API for XML (StAX), and not much can be done about that now.

Figure 1: Various mechanisms for parsing XML.

Simple API for XML (SAX) Parsing

The SAX APIs were proposed by David Megginson in early 1998 as an effort towards a standard API for event-based parsing of XML. Even though SAX is not a W3C Recommendation, it is the de facto industry standard for event-based parsing of XML documents.

SAX parsing is an event-based, push-parsing mechanism that generates events for the <opening> tags, the </closing> tags, the character data, and so on. A SAX parser parses an XML document in a streaming (forward-only) fashion and reports the events, in the sequence encountered, to the registered content handler, org.xml.sax.ContentHandler (not to be confused with java.net.ContentHandler). It reports errors, if any, to the registered error handler, org.xml.sax.ErrorHandler.

If you don't register an error handler, recoverable errors (validation errors, for example) and warnings can go unreported, and you may never learn that anything was wrong with the XML or what it was. It is therefore important to always register a meaningful error handler when SAX parsing an XML document.

If the application needs to be informed of the parsing events (and to process them), it must implement the org.xml.sax.ContentHandler interface and register the implementation with the SAX parser. A typical sequence of events reported through the callbacks is startDocument, startElement, characters, endElement, endDocument, in that order. startDocument is called only once, before any other event is reported. Similarly, endDocument is called only once, after the entire XML document has been parsed successfully. See the javadocs for more details.

Figure 2: SAX Parsing XML

Snippet to SAX parse an XML document using JAXP:

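        import java.io.File;
        import javax.xml.parsers.SAXParser;
        import javax.xml.parsers.SAXParserFactory;
        import org.xml.sax.helpers.DefaultHandler;

        //create a namespace-aware SAX parser through the JAXP factory
        SAXParserFactory spfactory = SAXParserFactory.newInstance();
        spfactory.setNamespaceAware(true);
        SAXParser saxparser = spfactory.newSAXParser();

        //write your handler for processing events and handling errors
        DefaultHandler handler = new MyHandler();

        //parse the XML and report events and errors (if any) to the handler
        saxparser.parse(new File("data.xml"), handler);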
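
The snippet leaves MyHandler to the reader. One possible shape for it is sketched below, under a few assumptions: the handler simply records each callback as a string, the parse helper and the sample document are invented for illustration, and checked exceptions are wrapped in an unchecked one to keep the sketch short. It extends DefaultHandler, which implements both ContentHandler and ErrorHandler.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Illustrative handler: records the events in the order the parser reports them.
class MyHandler extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("startDocument"); }
    @Override public void endDocument() { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        events.add("startElement:" + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement:" + localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one text run in several chunks; each chunk is recorded.
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) events.add("characters:" + text);
    }

    // ErrorHandler callbacks: record warnings, fail on errors and fatal errors.
    @Override public void warning(SAXParseException e) { events.add("warning:" + e.getMessage()); }
    @Override public void error(SAXParseException e) throws SAXParseException { throw e; }
    @Override public void fatalError(SAXParseException e) throws SAXParseException { throw e; }

    // Hypothetical helper: parses an in-memory document and returns the recorded events.
    static List<String> parse(String xml) {
        try {
            SAXParserFactory spfactory = SAXParserFactory.newInstance();
            spfactory.setNamespaceAware(true);
            SAXParser saxparser = spfactory.newSAXParser();
            MyHandler handler = new MyHandler();
            saxparser.parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
            return handler.events;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        parse("<greeting><to>World</to></greeting>").forEach(System.out::println);
    }
}
```

Rethrowing from error() turns every recoverable error into a parse failure, which is a strict policy; an application that wants to collect all problems in one pass could record them instead.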
        
Document Object Model (DOM) Parsing

DOM parsing is an object-based parsing mechanism that generates an XML object model: an inverted, tree-like data structure that represents the XML document. Every element node in the object model represents a pair of <opening> and </closing> tags in the XML. A DOM parser reads the entire XML file and creates an in-memory data structure called a DOM. If the DOM parser is W3C compliant, the DOM it creates is a W3C DOM, which can be traversed or modified using the org.w3c.dom APIs.

Most DOM parsers also allow you to create an in-memory DOM structure from scratch, rather than only by parsing an XML document into a DOM.

Figure 3: DOM Parsing XML

Snippet to DOM parse an XML document using JAXP:

        import java.io.File;
        import javax.xml.parsers.DocumentBuilder;
        import javax.xml.parsers.DocumentBuilderFactory;
        import org.w3c.dom.Document;

        //create a namespace-aware DOM parser through the JAXP factory
        DocumentBuilderFactory dbfactory = DocumentBuilderFactory.newInstance();
        dbfactory.setNamespaceAware(true);
        DocumentBuilder domparser = dbfactory.newDocumentBuilder();

        //parse the XML and create the DOM
        Document doc = domparser.parse(new File("data.xml"));

        //to create a new DOM from scratch -
        //Document doc = domparser.newDocument();

        //once you have the Document handle, then you can use
        //the org.w3c.dom.* APIs to traverse or modify the DOM...
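
To make the last comment concrete, here is a small sketch of both paths: parsing into a DOM and traversing it, and building a DOM from scratch. The class name DomDemo, the element names, and the in-memory sample document are assumptions made for illustration, and checked exceptions are wrapped to keep the example short.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical demo class for traversing and creating a W3C DOM through JAXP.
class DomDemo {

    static DocumentBuilder newBuilder() {
        try {
            DocumentBuilderFactory dbfactory = DocumentBuilderFactory.newInstance();
            dbfactory.setNamespaceAware(true);
            return dbfactory.newDocumentBuilder();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    // Parses the XML and returns the text of every <item> element, comma-separated.
    static String itemTexts(String xml) {
        try {
            Document doc = newBuilder().parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList items = doc.getElementsByTagName("item");
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < items.getLength(); i++) {
                if (i > 0) sb.append(',');
                sb.append(items.item(i).getTextContent());
            }
            return sb.toString();
        } catch (SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Builds <list><item>new</item></list> in memory without parsing anything.
    static Document buildFromScratch() {
        Document doc = newBuilder().newDocument();
        Element root = doc.createElement("list");
        doc.appendChild(root);
        Element item = doc.createElement("item");
        item.setTextContent("new");
        root.appendChild(item);
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(itemTexts("<list><item>a</item><item>b</item></list>"));
        System.out.println(buildFromScratch().getDocumentElement().getTagName());
    }
}
```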
        

Parsing in Validating Mode

Validation Against DTD

A DTD is a grammar for XML documents. People often think of DTDs as something alien because their syntax differs from XML's, but DTDs are an integral part of W3C XML 1.0. If an XML instance document has a DOCTYPE declaration, you turn on validation against the DTD while parsing by calling setValidating(true) on the appropriate factory. For example:

        DocumentBuilderFactory dbfactory = DocumentBuilderFactory.newInstance();
        dbfactory.setValidating(true);

        OR

        SAXParserFactory spfactory = SAXParserFactory.newInstance();
        spfactory.setValidating(true);
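
Turning validation on is only half of the story, because validity errors are reported to the registered error handler rather than thrown. The sketch below, which assumes an invented DTD declared in the internal subset and a made-up class name DtdValidationDemo, collects those reports in a list.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class: validates a document body against a small inline DTD.
class DtdValidationDemo {

    // Illustrative DTD: a <note> must contain exactly one <to> with text content.
    static final String DOCTYPE =
            "<!DOCTYPE note [\n"
            + "  <!ELEMENT note (to)>\n"
            + "  <!ELEMENT to (#PCDATA)>\n"
            + "]>\n";

    // Returns the warnings and validity errors reported while parsing.
    static List<String> validate(String body) {
        List<String> problems = new ArrayList<>();
        try {
            DocumentBuilderFactory dbfactory = DocumentBuilderFactory.newInstance();
            dbfactory.setValidating(true);
            DocumentBuilder builder = dbfactory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) {
                    problems.add("warning: " + e.getMessage());
                }
                @Override public void error(SAXParseException e) {
                    problems.add("error: " + e.getMessage());
                }
                @Override public void fatalError(SAXParseException e) throws SAXParseException {
                    throw e; // not well-formed: give up
                }
            });
            builder.parse(new InputSource(new StringReader(DOCTYPE + body)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return problems;
    }

    public static void main(String[] args) {
        System.out.println(validate("<note><to>World</to></note>"));
        System.out.println(validate("<note><from>Me</from></note>"));
    }
}
```

Because error() only records the problem, parsing continues after a validity error and the DOM is still built; whether that is acceptable depends on the application.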
        

Note that even if validation is turned off, the parser still tries to load the DTD when the XML instance has a DOCTYPE declaration that refers to an external DTD. This ensures that any entity references in the XML instance (whose entity declarations are in the DTD) are expanded properly; an unresolved entity reference would otherwise make the document malformed. The exception is when the standalone attribute of the XML declaration is set to "yes", in which case the external DTD is ignored completely. For example:

        <?xml version="1.1" encoding="UTF-8" standalone="yes"?>
        
Validation Against W3C XMLSchema (WXS)

XMLSchema is another grammar for XML documents, and it has gained huge popularity because it uses XML syntax and offers a rich vocabulary for defining fine-grained validation constraints. If an XML instance document points to an XMLSchema using the "schemaLocation" or "noNamespaceSchemaLocation" hints, then to turn on validation against XMLSchema, you need to do the following:

  1. Set the validating feature to true using the setValidating method on SAXParserFactory or DocumentBuilderFactory, as mentioned above.
  2. Set the property "http://java.sun.com/xml/jaxp/properties/schemaLanguage" to the value "http://www.w3.org/2001/XMLSchema", using setProperty on the SAXParser or setAttribute on the DocumentBuilderFactory.

Note that in this case, even if a DOCTYPE exists in the XML instance, the instance is not validated against the DTD. As mentioned earlier, however, the DTD is still loaded so that any entity references can be expanded properly.

Since "schemaLocation" and "noNamespaceSchemaLocation" are just hints, the schemas can also be provided externally, overriding these hints, through the property "http://java.sun.com/xml/jaxp/properties/schemaSource". The value of this property must be one of the following:

  • java.lang.String that points to the URI of the schema
  • java.io.InputStream with the contents of the schema
  • org.xml.sax.InputSource
  • java.io.File
  • an array of java.lang.Object with the contents being one of the types defined above.

For example:

        SAXParserFactory spfactory = SAXParserFactory.newInstance();
        spfactory.setNamespaceAware(true);

        //turn the validation on
        spfactory.setValidating(true);

        //create the parser; the properties below are set on the parser
        SAXParser saxparser = spfactory.newSAXParser();

        //set the validation to be against WXS
        saxparser.setProperty(
            "http://java.sun.com/xml/jaxp/properties/schemaLanguage",
            "http://www.w3.org/2001/XMLSchema");

        //set the schema against which the validation is to be done
        saxparser.setProperty(
            "http://java.sun.com/xml/jaxp/properties/schemaSource",
            new File("myschema.xsd"));
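
A complete run of these steps might look like the following sketch. It assumes a JAXP implementation that supports the two properties (the processor bundled with the JDK does), and it supplies the schema as a java.io.InputStream, one of the accepted value types listed above. The tiny schema, the sample documents, and the class name SchemaValidationDemo are invented for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class: SAX parsing with validation against an in-memory schema.
class SchemaValidationDemo {
    static final String SCHEMA_LANGUAGE =
            "http://java.sun.com/xml/jaxp/properties/schemaLanguage";
    static final String SCHEMA_SOURCE =
            "http://java.sun.com/xml/jaxp/properties/schemaSource";

    // Illustrative schema: the document is a single <age> element holding an integer.
    static final String XSD =
            "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>"
            + "<xs:element name='age' type='xs:int'/>"
            + "</xs:schema>";

    // Returns the validation errors reported for the given document.
    static List<String> validate(String xml) {
        List<String> errors = new ArrayList<>();
        try {
            SAXParserFactory spfactory = SAXParserFactory.newInstance();
            spfactory.setNamespaceAware(true);
            spfactory.setValidating(true);
            SAXParser saxparser = spfactory.newSAXParser();

            // Set the schema language first, then the schema source.
            saxparser.setProperty(SCHEMA_LANGUAGE, "http://www.w3.org/2001/XMLSchema");
            saxparser.setProperty(SCHEMA_SOURCE,
                    new ByteArrayInputStream(XSD.getBytes(StandardCharsets.UTF_8)));

            saxparser.parse(new InputSource(new StringReader(xml)), new DefaultHandler() {
                @Override public void error(SAXParseException e) {
                    errors.add(e.getMessage());
                }
            });
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return errors;
    }

    public static void main(String[] args) {
        System.out.println(validate("<age>42</age>"));
        System.out.println(validate("<age>old</age>"));
    }
}
```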
        
