org.brownell.xml
Class HtmlParser

java.lang.Object
  |
  +--org.brownell.xml.HtmlParser

public final class HtmlParser
extends java.lang.Object
implements org.xml.sax.Parser, org.xml.sax.Configurable

This is a wrapper around the javax.swing.text.html.parser.* HTML parser, implementing the 1-June-1999 draft SAX2 interfaces. Given valid HTML, and much invalid or malformed HTML, it produces the stream of SAX parsing events that would result from parsing the equivalent well-formed XHTML document. Element and attribute names are uniformly presented in lower case. (The Level 1 HTML DOM spec seems to be exotic in not adopting that convention.)
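As an illustration of consuming that event stream, here is a minimal sketch of a SAX1 document handler that relies on the lower-case naming. The class name LinkCollector is hypothetical and not part of this package. The main method feeds it hand-built events rather than a real parse; in real use the handler would be passed to setDocumentHandler().

```java
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.helpers.AttributeListImpl;

import java.util.ArrayList;
import java.util.List;

/** Hypothetical handler: records every hyperlink target seen in a document. */
class LinkCollector extends HandlerBase {
    private final List<String> links = new ArrayList<>();

    @Override
    public void startElement(String name, AttributeList atts) {
        // Names arrive in lower case, so "a" and "href" need no case folding.
        if ("a".equals(name)) {
            String href = atts.getValue("href");
            if (href != null) {
                links.add(href);
            }
        }
    }

    List<String> links() {
        return links;
    }

    public static void main(String[] args) {
        // Simulates the event a SAX1 parser would deliver for
        // <a href="index.html">; no real parser is involved here.
        LinkCollector collector = new LinkCollector();
        AttributeListImpl atts = new AttributeListImpl();
        atts.addAttribute("href", "CDATA", "index.html");
        collector.startElement("a", atts);
        System.out.println(collector.links()); // prints [index.html]
    }
}
```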

Only one type of lexical event is reported: comments. In HTML these are generally used to reach inlined CSS that has been wrapped in comments to hide it from browsers too old to understand the "style" tag. Expansions of built-in entities (such as "&nbsp;") and character references are accordingly not visible.
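A rough sketch of collecting such comment text follows. It assumes a callback with the shape of comment(char[], int, int) from the final SAX2 org.xml.sax.ext.LexicalHandler, reached here through the JDK's DefaultHandler2; the 1999 draft interface this class actually accepts may differ, and the property identifier used to register a lexical handler is not stated on this page, so registration is not shown. CommentCollector is a hypothetical name.

```java
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical collector for style rules hidden inside HTML comments. */
class CommentCollector extends DefaultHandler2 {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void comment(char[] ch, int start, int length) {
        // In final SAX2 the comment body arrives without its delimiters;
        // whether the draft interface behaves the same way is an assumption.
        text.append(ch, start, length);
    }

    /** Returns everything collected so far, with outer whitespace removed. */
    String text() {
        return text.toString().trim();
    }

    public static void main(String[] args) {
        // Simulated callback for a comment containing " p { color: red } ".
        CommentCollector collector = new CommentCollector();
        char[] body = " p { color: red } ".toCharArray();
        collector.comment(body, 0, body.length);
        System.out.println(collector.text()); // prints p { color: red }
    }
}
```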

This parser does not support dynamic modification of the input stream to the parser, needed to fully support <script> tags which use the DOM to splice new page content into documents as they load.

Current (Swing 1.1) HTML parsing issues of note include:

This driver adds ignorable newlines at various locations where they won't be confused with HTML content. These may of course be ignored. If they are kept, the output of this parser is easier to print and view: without them, HTML files of any size appear with no line breaks at all, which causes trouble for most text editors.
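The choice between keeping and dropping those newlines can be sketched with a small echo handler. EchoHandler and its keepIgnorable flag are hypothetical names; the sketch omits attributes and character escaping, and the events in main are simulated rather than produced by this parser.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.HandlerBase;
import org.xml.sax.helpers.AttributeListImpl;

/** Hypothetical echo handler: rebuilds markup text from SAX1 events. */
class EchoHandler extends HandlerBase {
    private final StringBuilder out = new StringBuilder();
    private final boolean keepIgnorable;

    EchoHandler(boolean keepIgnorable) {
        this.keepIgnorable = keepIgnorable;
    }

    @Override
    public void startElement(String name, AttributeList atts) {
        out.append('<').append(name).append('>'); // attributes omitted for brevity
    }

    @Override
    public void endElement(String name) {
        out.append("</").append(name).append('>');
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        out.append(ch, start, length); // not escaped; a real serializer would escape
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) {
        // Keeping these characters is what puts line breaks into the output.
        if (keepIgnorable) {
            out.append(ch, start, length);
        }
    }

    String output() {
        return out.toString();
    }

    public static void main(String[] args) {
        // Simulated events; a real run would receive these from the parser.
        EchoHandler echo = new EchoHandler(true);
        echo.startElement("p", new AttributeListImpl());
        echo.characters("one".toCharArray(), 0, 3);
        echo.endElement("p");
        echo.ignorableWhitespace("\n".toCharArray(), 0, 1);
        echo.startElement("p", new AttributeListImpl());
        echo.characters("two".toCharArray(), 0, 3);
        echo.endElement("p");
        System.out.println(echo.output());
    }
}
```

Constructing the handler with false yields the same markup on a single line.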

Version:
1.2 (4 September 1999)
Author:
David Brownell (db@post.harvard.edu)

Constructor Summary
HtmlParser()
          Constructs a new HTML parser.
 
Method Summary
 boolean getFeature(java.lang.String featureId)
          SAX2: Tells whether this parser supports the specified feature.
 java.lang.Object getProperty(java.lang.String propertyId)
          SAX2: Returns the specified property.
 void parse(org.xml.sax.InputSource input)
          SAX1: Parse the HTML text in the given input source.
 void parse(java.lang.String uri)
          SAX1: Parse the HTML text at the given input URI.
 void setDocumentHandler(org.xml.sax.DocumentHandler handler)
          SAX1: Provides an object which receives callbacks for the most significant document information.
 void setDTDHandler(org.xml.sax.DTDHandler handler)
          SAX1: Provides an object which may be used to intercept declarations related to notations and unparsed entities.
 void setEntityResolver(org.xml.sax.EntityResolver resolver)
          SAX1: Provides an object which may be used when resolving external entities during parsing (both general and parameter entities).
 void setErrorHandler(org.xml.sax.ErrorHandler handler)
          SAX1: Provides an object which receives callbacks for HTML errors of all levels (fatal, nonfatal, warning).
 void setFeature(java.lang.String featureId, boolean state)
          SAX2: Sets the state of features supported in this parser.
 void setLocale(java.util.Locale locale)
          SAX1: Identifies the locale which the parser should use for the diagnostics it provides.
 void setProperty(java.lang.String propertyId, java.lang.Object property)
          SAX2: Assigns the specified property.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Constructor Detail

HtmlParser

public HtmlParser()
Constructs a new HTML parser.
Method Detail

setErrorHandler

public void setErrorHandler(org.xml.sax.ErrorHandler handler)
SAX1: Provides an object which receives callbacks for HTML errors of all levels (fatal, nonfatal, warning).

Note that this parser does not provide a consistent categorization of errors according to the categories defined in the SAX API. Most problems are reported at the "warning" level, and even those few validity-related errors reported at the "nonfatal" level may not be viewed as issues in all HTML environments. No errors are reported as "fatal".

Throwing an exception from an error handler may not work well.

Specified by:
setErrorHandler in interface org.xml.sax.Parser
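
Given the caveats above, one cautious approach is a handler that only records what it is told and never throws. The following is a minimal sketch under that assumption; TallyingErrorHandler is a hypothetical name, and the message layout is an arbitrary choice.

```java
import org.xml.sax.ErrorHandler;
import org.xml.sax.SAXParseException;

import java.util.ArrayList;
import java.util.List;

/** Hypothetical handler: counts and records diagnostics, never throws. */
class TallyingErrorHandler implements ErrorHandler {
    private final List<String> messages = new ArrayList<>();
    private int warnings;
    private int errors;
    private int fatals;

    @Override
    public void warning(SAXParseException e) {
        warnings++;
        record("warning", e);
    }

    @Override
    public void error(SAXParseException e) {
        errors++;
        record("error", e);
    }

    @Override
    public void fatalError(SAXParseException e) {
        // Not expected from this parser, but the interface requires it.
        fatals++;
        record("fatal", e);
    }

    private void record(String level, SAXParseException e) {
        messages.add(level + " " + e.getSystemId() + ":" + e.getLineNumber()
                + ":" + e.getColumnNumber() + ": " + e.getMessage());
    }

    int warnings() { return warnings; }
    int errors() { return errors; }
    int fatals() { return fatals; }
    List<String> messages() { return messages; }

    public static void main(String[] args) {
        // Simulated diagnostic; a real one would come from the parser.
        TallyingErrorHandler handler = new TallyingErrorHandler();
        handler.warning(new SAXParseException(
                "unknown tag", null, "http://example.org/a.html", 3, 7));
        System.out.println(handler.messages().get(0));
    }
}
```

The caller can inspect the tallies after parse() returns and decide for itself which reports matter in its HTML environment.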

setDocumentHandler

public void setDocumentHandler(org.xml.sax.DocumentHandler handler)
SAX1: Provides an object which receives callbacks for the most significant document information.
Specified by:
setDocumentHandler in interface org.xml.sax.Parser

setDTDHandler

public void setDTDHandler(org.xml.sax.DTDHandler handler)
SAX1: Provides an object which may be used to intercept declarations related to notations and unparsed entities.

Not used by this parser.

Specified by:
setDTDHandler in interface org.xml.sax.Parser

setEntityResolver

public void setEntityResolver(org.xml.sax.EntityResolver resolver)
SAX1: Provides an object which may be used when resolving external entities during parsing (both general and parameter entities).

Not used by this parser.

Specified by:
setEntityResolver in interface org.xml.sax.Parser

setLocale

public void setLocale(java.util.Locale locale)
               throws org.xml.sax.SAXException
SAX1: Identifies the locale which the parser should use for the diagnostics it provides.

Not used by this parser.

Specified by:
setLocale in interface org.xml.sax.Parser
Throws:
org.xml.sax.SAXException - as defined in the specification for org.xml.sax.Parser.setLocale()

parse

public void parse(org.xml.sax.InputSource input)
           throws org.xml.sax.SAXException,
                  java.io.IOException
SAX1: Parse the HTML text in the given input source.
Specified by:
parse in interface org.xml.sax.Parser
Throws:
org.xml.sax.SAXException - as defined in the specification for org.xml.sax.Parser.parse()
java.io.IOException - as defined in the specification for org.xml.sax.Parser.parse()

parse

public void parse(java.lang.String uri)
           throws org.xml.sax.SAXException,
                  java.io.IOException
SAX1: Parse the HTML text at the given input URI.
Specified by:
parse in interface org.xml.sax.Parser
Throws:
org.xml.sax.SAXException - as defined in the specification for org.xml.sax.Parser.parse()
java.io.IOException - as defined in the specification for org.xml.sax.Parser.parse()
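
Both parse() methods can be driven through the SAX1 org.xml.sax.Parser interface. The sketch below wraps in-memory text in an InputSource and hands it to whatever Parser it is given; HtmlInputSketch and its method names are hypothetical. Because this class is not assumed to be on the classpath, main uses the JDK's XMLReaderAdapter on well-formed XHTML as a stand-in; with this package available, a new HtmlParser() would be passed instead.

```java
import org.xml.sax.AttributeList;
import org.xml.sax.DocumentHandler;
import org.xml.sax.HandlerBase;
import org.xml.sax.InputSource;
import org.xml.sax.Parser;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLReaderAdapter;

import java.io.IOException;
import java.io.StringReader;

/** Hypothetical helper for parsing markup held in a String. */
class HtmlInputSketch {
    /** Wraps text in an InputSource; systemId is used for diagnostics only. */
    static InputSource fromText(String html, String systemId) {
        InputSource source = new InputSource(new StringReader(html));
        source.setSystemId(systemId);
        return source;
    }

    /** Registers the handler, then parses the text with the given SAX1 parser. */
    static void parseText(Parser parser, String html, String systemId,
                          DocumentHandler handler)
            throws SAXException, IOException {
        parser.setDocumentHandler(handler);
        parser.parse(fromText(html, systemId));
    }

    public static void main(String[] args) throws Exception {
        // Stand-in SAX1 parser from the JDK; it accepts only well-formed XML.
        Parser standIn = new XMLReaderAdapter();
        parseText(standIn, "<html><body><p>hi</p></body></html>",
                "http://example.org/page.html", new HandlerBase() {
                    @Override
                    public void startElement(String name, AttributeList atts) {
                        System.out.println("start: " + name);
                    }
                });
    }
}
```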

getFeature

public boolean getFeature(java.lang.String featureId)
                   throws org.xml.sax.SAXException
SAX2: Tells whether this parser supports the specified feature.
Specified by:
getFeature in interface org.xml.sax.Configurable
Throws:
org.xml.sax.SAXException - as defined in the specification for org.xml.sax.Configurable.getFeature()

getProperty

public java.lang.Object getProperty(java.lang.String propertyId)
                             throws org.xml.sax.SAXException
SAX2: Returns the specified property. At this time only lexical handlers are supported.
Specified by:
getProperty in interface org.xml.sax.Configurable
Throws:
org.xml.sax.SAXException - as defined in the specification for org.xml.sax.Configurable.getProperty()

setFeature

public void setFeature(java.lang.String featureId,
                       boolean state)
                throws org.xml.sax.SAXException
SAX2: Sets the state of features supported in this parser. As of this writing, no feature's state may be changed from its default value.
Specified by:
setFeature in interface org.xml.sax.Configurable
Throws:
org.xml.sax.SAXException - as defined in the specification for org.xml.sax.Configurable.setFeature()

setProperty

public void setProperty(java.lang.String propertyId,
                        java.lang.Object property)
                 throws org.xml.sax.SAXException
SAX2: Assigns the specified property. At this time only lexical handlers are supported, and these must not be changed to values of the wrong type. Like SAX1 handlers, these may be changed at any time.
Specified by:
setProperty in interface org.xml.sax.Configurable
Throws:
org.xml.sax.SAXException - as defined in the specification for org.xml.sax.Configurable.setProperty()