XML.com: XML From the Inside Out

HTML and XSLT

August 30, 2000


HTML Web pages have played a big part in electronic publishing for some time now, and will continue to for several years. If you use XSLT as a system development tool, you may work on an application that needs to read or write HTML.

If your application is reading or writing the HTML flavor known as XHTML, a W3C Recommendation that describes itself in its spec as "a reformulation of HTML 4 as an XML 1.0 application," then there's nothing special to worry about: XHTML is perfectly good XML, just like anything else that XSLT can read or write. If your application is reading older legacy HTML or outputting HTML for use in older browsers, however, there are a few small problems to keep in mind and some simple techniques for getting around these problems.

HTML as Input

XSLT processors expect their input to be well-formed XML. HTML documents can be well-formed, but most aren't. For example, any Web browser would understand the following HTML document, but a number of things prevent it from being well-formed XML:

<html>
<body>
<h1>My Heading</H1>
<p>Here is the first paragraph.
<P>Here is the second.<br>
Second line of the second paragraph.
<IMG SRC=somepic.jpg>
</BODY>

  • The html start-tag lacks a corresponding end-tag.

  • The tags enclosing the h1 and body elements are not in a consistent case.

  • The value of the IMG element's SRC attribute isn't quoted.

  • The br and IMG elements' tags have no closing slash or matching end-tags to show that they're empty elements.

If a browsing program can still figure out what's what in this document and display it on a screen, it was inevitable that someone would write a utility to parse such a document and output a proper well-formed version. Dave Raggett of the W3C turned out to be that person, and luckily, his "HTML Tidy" program is free and available for many different platforms from the W3C's website. With the -asxml option added to the tidy program's command line, which tells it to include closing slashes in empty elements, it turns the example above into the following well-formed XHTML document:

<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/strict.dtd">
<html xmlns="http://www.w3.org/TR/xhtml1">
<head>
<title></title>
</head>
<body>
<h1>My Heading</h1>

<p>Here is the first paragraph.</p>

<p>Here is the second.<br />
Second line of the second paragraph. <img src="somepic.jpg" /></p>
</body>
</html>

Writing an XSLT stylesheet to process this tidied-up version is no different from writing an XSLT stylesheet to process any other well-formed XML.
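
One detail worth watching is the default namespace that Tidy declares on the html element: match patterns and XPath expressions only find the elements if the stylesheet binds a prefix to the same namespace URI. The following is a minimal sketch, not part of the original example, that lists the src value of every img element in the tidied document. The prefix h is an arbitrary choice, and the namespace URI is copied from the Tidy output shown above; documents tidied by other versions may declare a different URI (current XHTML uses http://www.w3.org/1999/xhtml), so check the html start-tag of your own input.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:h="http://www.w3.org/TR/xhtml1">

<!-- The prefix "h" is arbitrary; what matters is that its URI is
     identical to the xmlns value on the input's html element. -->

<xsl:output method="text"/>

<!-- Write the src attribute of each img element, one per line. -->
<xsl:template match="/">
  <xsl:for-each select="//h:img">
    <xsl:value-of select="@src"/>
    <xsl:text>&#10;</xsl:text>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>
```

Without the h prefix, a pattern such as //img would look for img elements in no namespace and match nothing in the tidied document.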

HTML as Output

A basic rule of XSLT is that your stylesheets must be well-formed XML. Every tag must be either part of a start- and end-tag pair or an empty-element tag with its closing slash. Since much of the XSLT processor's output will reflect the structure of the stylesheet, this could be a problem when creating HTML to be read by older browsers.

While Internet Explorer 5 and Netscape Navigator 4 have no problem with a closing slash in empty HTML elements such as br, hr, and img, a browser like Navigator 3 won't know what to do with elements that have this closing slash. After all, when Navigator 3 was released, XML hadn't been invented yet.

If you're going to the trouble of converting your XML to HTML, you probably want to ensure that the widest selection of browsers can read the Web pages that you're creating, and the xsl:output element lets you do this. With its method attribute set to a value of "html", the xsl:output element in the following stylesheet tells the XSLT processor to represent empty HTML elements (area, base, basefont, br, col, frame, hr, img, input, isindex, link, meta and param, according to the XSLT spec) as a single tag with no closing slash.

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="html"/>

<xsl:template match="poem">
  <html><body>
    <xsl:apply-templates/>
  </body></html>
</xsl:template>

<xsl:template match="title">
  <h1><xsl:apply-templates/></h1>
</xsl:template>

<xsl:template match="excerpt">
  <p><xsl:apply-templates/></p>
  <hr></hr>
</xsl:template>

<xsl:template match="verse">
  <xsl:apply-templates/><br/>
</xsl:template>

</xsl:stylesheet>

An XSLT processor using the stylesheet shown above will convert the following XML document:

<poem><title>From Book I</title>
<excerpt>
<verse>Then with expanded wings he steers his flight</verse>
<verse>Aloft, incumbent on the dusky Air</verse>
<verse>that felt unusual weight, till on dry Land</verse>
<verse>He lights, if it were Land that ever burn'd</verse>
<verse>With solid, as the Lake with liquid fire;</verse>
</excerpt>
<excerpt>
<verse>For who can yet believe, though after loss</verse>
<verse>That all these puissant Legions, whose exile</verse>
<verse>Hath emptied Heav'n, shall fail to re-ascend</verse>
<verse>Self-rais'd, and repossess their native seat.</verse>
</excerpt>
</poem>

to this:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<body>
<h1>From Book I</h1>
<p>
Then with expanded wings he steers his flight<br>
Aloft, incumbent on the dusky Air<br>
that felt unusual weight, till on dry Land<br>
He lights, if it were Land that ever burn'd<br>
With solid, as the Lake with liquid fire;<br>
</p>
<hr>
<p>
For who can yet believe, though after loss<br>
That all these puissant Legions, whose exile<br>
Hath emptied Heav'n, shall fail to re-ascend<br>
Self-rais'd, and repossess their native seat.<br>
</p>
<hr>
</body>
</html>

After converting each excerpt element to a p element, the stylesheet adds an HTML hr element for a horizontal rule. After outputting the contents of each verse element, it outputs an HTML br "break" element. Because the stylesheet itself must be well-formed, it includes both the start- and end-tags for the hr element, and it writes the br element with a closing slash to show that it is an empty element.

Because of the xsl:output element's method value in the stylesheet, the hr and br elements in the output above are single tags with no closing slash, just as they were in pre-XML styles of HTML. The document should pose no problem to any release of any Web browser.
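
The xsl:output element accepts other attributes besides method. As a hedged sketch, the trimmed-down variant of the poem stylesheet below adds the XSLT 1.0 doctype-public attribute, which asks the processor to write a DOCTYPE declaration like the one at the top of the output above; the method attribute alone is not required to produce one, so whether you see a DOCTYPE without it may depend on your processor. The public identifier is copied from the output shown earlier, and only two of the original templates are kept to keep the example short.

```xml
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<!-- method="html": empty elements such as br and hr are written
     as a single tag with no closing slash.
     doctype-public: requests a DOCTYPE declaration with this
     public identifier before the html start-tag. -->
<xsl:output method="html"
    doctype-public="-//W3C//DTD HTML 4.0 Transitional//EN"/>

<xsl:template match="poem">
  <html><body>
    <xsl:apply-templates/>
  </body></html>
</xsl:template>

<!-- title and excerpt have no templates in this sketch, so the
     built-in rules simply pass their text content through. -->
<xsl:template match="verse">
  <xsl:apply-templates/><br/>
</xsl:template>

</xsl:stylesheet>
```

Changing the method value to "xml" in the same xsl:output element would typically make the processor write the br element with a closing slash again, which is the form that older browsers may not handle.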

Between Dave Raggett's Tidy program and the xsl:output element, you should be all set to incorporate old-fashioned HTML into your new XSLT-based systems!