XML.com: XML From the Inside Out

Mastering DocBook Indexes
by Jirka Kosek

Generating an Index

The DocBook XSL stylesheets generate indexes automatically. The only thing we have to do is place an empty index element into a location where a real index should appear. This is usually somewhere near the end of a document.

The stylesheets adapt the index appearance to the output format. An index on an HTML page does not contain page numbers; instead it uses section or chapter titles that link back to the place where the index term occurs in the document flow. If the output format is HTML Help, an HTML Help index is built instead of a simple HTML page with links.

However, print output is not without obstacles. Generating an index in XSL is a two-phase process. The first phase is an XSLT transformation that converts a source DocBook document into a set of abstract formatting objects. Page numbers for the index entries are not known at this point. The actual rendering and page-number evaluation take place during the second phase, formatting, which is performed by an FO processor such as FOP, XEP, or XSL Formatter.

Problems arise when one index term occurs more than once on a page. In this case, the index contains duplicate page numbers for this entry. We will see how to deal with this later in the article.

Indexes for non-English languages represent another issue. Generating the index consists of grouping the index terms with the same initial letter and then alphabetically sorting the entries within each letter group. The stylesheets implement exactly this algorithm, which is unfortunately insufficient for many languages.

For example, some languages treat "ch" as a single letter: it sorts between "c" and "d" in traditional Spanish and between "h" and "i" in Czech. Diacritics add further complexity. Some languages ignore them completely, while others apply complex rules. In Czech, for example, words starting with "u" and "ú" belong to the same index group, but words starting with "c" and "č" belong to two different groups. And we don't even want to start thinking about the CJKV languages.
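The gap between first-letter grouping and a language-aware grouping can be sketched in a few lines of Java. This is only an illustration of the Czech rules mentioned above, not what the stylesheets actually do; the class and method names (IndexGrouping, naiveKey, czechKey) are invented for this example, and terms are assumed to be non-empty.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

class IndexGrouping {

    // Language-neutral rule: the group is the upper-cased first character.
    static String naiveKey(String term) {
        return term.substring(0, 1).toUpperCase(Locale.ROOT);
    }

    // Czech rules taken from the text: "ch" is a letter of its own,
    // "u" and "ú" share a group, "c" and "č" stay in separate groups.
    static String czechKey(String term) {
        String lower = term.toLowerCase(Locale.ROOT);
        if (lower.startsWith("ch")) {
            return "Ch";
        }
        if (lower.startsWith("\u00FA")) {   // ú
            return "U";
        }
        return naiveKey(term);              // "č" maps to "Č", not to "C"
    }

    // Assign each term to a group. The TreeMap orders the groups by code
    // point, which is NOT the Czech collating order; ordering the groups
    // properly is a separate problem.
    static Map<String, List<String>> group(List<String> terms,
                                           Function<String, String> keyOf) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String term : terms) {
            groups.computeIfAbsent(keyOf.apply(term), k -> new ArrayList<>()).add(term);
        }
        return groups;
    }

    public static void main(String[] args) {
        // chyba, cesta, čas, únor, ucho
        List<String> terms =
            Arrays.asList("chyba", "cesta", "\u010Das", "\u00FAnor", "ucho");
        System.out.println(group(terms, IndexGrouping::naiveKey));
        System.out.println(group(terms, IndexGrouping::czechKey));
    }
}
```

With the naive key, "chyba" lands in group C and "únor" gets a group of its own; with the Czech key, "chyba" moves to Ch and "únor" joins "ucho" under U.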

XSLT offers very poor support for grouping, which makes index generation very difficult. If you want to implement locale-aware indexing in XSLT, you will reach the limits of the language. Fortunately, many XSLT processors offer extensions to the XSLT core -- so we will next see how internationalized indexing is supported in the DocBook XSL stylesheets.

Removing Duplicate Page Numbers from a Printed Index

As I mentioned earlier, the current combination of the XSLT and XSL-FO standards does not provide a mechanism for removing duplicate page numbers from a printed index. This serious drawback can be overcome in two ways. The first solution is to use an FO processor that implements a vendor extension for index generation. The other possibility is to use multiple passes over a document to detect and remove the duplicates.

Vendor extensions are supported in the two best-known commercial FO processors -- XEP and XSL Formatter. The DocBook XSL stylesheets contain support for these FO implementations; we just tell the stylesheets to use the extensions by turning on the appropriate parameter. For instance, XEP should be invoked with the following command line:


xep -xml document.xml -xsl .../fo/docbook.xsl -param xep.extensions=1

In the real world we usually change the behavior of the stylesheets by customizing more than one parameter. The best practice is then to create a customization layer, which imports the stock stylesheets and sets all necessary parameters.


<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:import href="http://docbook.sourceforge.net/release/xsl/current/fo/docbook.xsl"/>

<xsl:param name="paper.type" select="'A4'"/>
<xsl:param name="xep.extensions" select="1"/>

</xsl:stylesheet>
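When the transformation is driven from Java code rather than from a command line, a parameter such as xep.extensions can also be supplied through the standard JAXP API, where it overrides the default declared in the stylesheet. The following is a minimal sketch under stated assumptions: the inline stylesheet is a tiny stand-in that only echoes the parameter (it is not the DocBook stylesheet), and the ParamDemo class is hypothetical. In a real run, the stylesheet source would be the customization layer and the input would be the DocBook document.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class ParamDemo {

    // Stand-in stylesheet (NOT the DocBook one): it declares the parameter
    // with a default of 0 and writes its effective value as plain text.
    static final String XSL =
        "<xsl:stylesheet xmlns:xsl='http://www.w3.org/1999/XSL/Transform' version='1.0'>"
      + "<xsl:output method='text'/>"
      + "<xsl:param name='xep.extensions' select='0'/>"
      + "<xsl:template match='/'>"
      + "xep.extensions=<xsl:value-of select='$xep.extensions'/>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    // Run the transformation; a null paramValue leaves the default in place.
    static String run(String paramValue) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(XSL)));
            if (paramValue != null) {
                // Same effect as passing the parameter on a command line.
                transformer.setParameter("xep.extensions", paramValue);
            }
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader("<book/>")),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Transformation failed", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(run(null)); // stylesheet default
        System.out.println(run("1"));  // overridden by the caller
    }
}
```

Keeping the settings in a customization layer is still the more maintainable choice; passing a parameter from the caller is mainly useful for one-off overrides.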

If you prefer XSL Formatter over XEP, you can use a similar parameter, axf.extensions, to turn on the XSL Formatter support. With either parameter, duplicate page numbers are removed and runs of consecutive page numbers are collapsed into a page range. For example, if a single index entry occurs on the following pages:

5, 5, 8, 9, 10, 37

The index will then show them in a more compact and readable form:


5, 8–10, 37
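The clean-up itself is simple once the page numbers are known. The following Java sketch shows the logic in isolation; it is not the code used by XEP or XSL Formatter, and the PageRanges class is a name invented for this illustration. It also assumes that two adjacent pages are already written as a range, which a real formatter may handle differently.

```java
import java.util.TreeSet;

class PageRanges {

    // Collapse page numbers into a sorted, de-duplicated list in which
    // runs of consecutive pages are written as "first–last".
    static String collapse(int... pages) {
        // A TreeSet drops duplicates and keeps the pages in ascending order.
        TreeSet<Integer> unique = new TreeSet<>();
        for (int page : pages) {
            unique.add(page);
        }

        StringBuilder out = new StringBuilder();
        boolean open = false;   // true while a run is being collected
        int start = 0;
        int end = 0;
        for (int page : unique) {
            if (open && page == end + 1) {
                end = page;     // extend the current run
                continue;
            }
            if (open) {
                appendRun(out, start, end);
            }
            start = page;       // begin a new run
            end = page;
            open = true;
        }
        if (open) {
            appendRun(out, start, end);
        }
        return out.toString();
    }

    // Append a single page or a range, separated from earlier output by ", ".
    private static void appendRun(StringBuilder out, int start, int end) {
        if (out.length() > 0) {
            out.append(", ");
        }
        out.append(start);
        if (end > start) {
            out.append('\u2013').append(end);   // en dash between first and last page
        }
    }

    public static void main(String[] args) {
        System.out.println(collapse(5, 5, 8, 9, 10, 37));
    }
}
```

For the page list from the example, collapse returns the string 5, 8–10, 37.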
          

In the future, such vendor extensions will no longer be necessary, because the upcoming version 1.1 of XSL-FO has direct index support.

When using another FO processor, we must employ a more difficult procedure. This is also the case for the open-source FOP processor. We must process the document twice. The first pass is done with the make.index.markup parameter set.

The resulting PDF will contain XML markup for the index entries and page numbers. This PDF can be converted to plain text, from which the XML markup is extracted. The duplicates are then removed, and the modified XML fragment of the index is used in the second pass to produce the proper PDF. This process is a real hack, and it does not work very well for languages that use characters outside ISO Latin 1 -- FOP does not insert the proper Unicode mapping vector for embedded fonts. This technique was invented by G. Ken Holman.

Internationalized Indexes

The DocBook XSL stylesheets adapt their output to the document language. The document language can be specified by using the lang attribute.


<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE book PUBLIC '-//OASIS//DTD DocBook XML V4.3//EN'
'http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd'>

<book lang="de">
  ... German book ...
</book>

Due to the previously mentioned limitations of XSLT, the stylesheets cannot use different grouping criteria for each language. Fortunately, several XSLT processors offer extensions to XSLT that can be used to overcome this limitation. As these extensions are not backward compatible with pure XSLT, they cannot be included in the default stylesheet. If we want to generate an internationalized index, we must use an EXSLT-aware XSLT processor that supports user-defined functions. These functions can then be used in the definition of a lookup key (xsl:key). These criteria are met by Saxon; xsltproc still has some unresolved issues at the time of this writing.

If we want to use the internationalized indexing features of the stylesheets, we must create a customization layer that overrides the default index-generating templates by including the small autoidx-ng.xsl stylesheet.


<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:import href="http://docbook.sourceforge.net/release/xsl/current/fo/docbook.xsl"/>

<xsl:include href="http://docbook.sourceforge.net/release/xsl/current/fo/autoidx-ng.xsl"/>

<!-- 
Parameter settings and other modifications of stylesheet
-->

</xsl:stylesheet>

Internationalized indexing is implemented for both HTML and print (FO) output. Each output format has its own autoidx-ng.xsl file in the corresponding directory. The stylesheets currently support internationalized indexing for the following languages: Czech, Danish, German, English, Spanish, French, and Turkish.

The described method of internationalization places each index term into the correct letter group and the groups are sorted in proper collating order. Sorting of entries within one group is left to the XSLT processor, which may be a problem because many XSLT processors support only the English sort order out-of-the-box. Saxon 6.5.3 (the recommended version for use with the DocBook stylesheets) can be easily extended to support user-defined collation.

We first create a simple implementation of a TextComparer, which must be named after the language code. For example, for German we must create a class named Compare_de.


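package com.icl.saxon.sort;

import java.text.Collator;
import java.util.Locale;

// The class name carries the language code: Compare_de for German.
public class Compare_de extends TextComparer
{

  // Create the German collator once instead of on every comparison.
  private final Collator deCollator =
      Collator.getInstance(new Locale("de", "DE"));

  int caseOrder = UPPERCASE_FIRST;

  // Compare two sort keys (strings) using German collation rules.
  public int compare(Object a, Object b)
  {
      return deCollator.compare(a, b);
  }

  // Remember the requested case order; the ordering itself is
  // decided by the collator.
  public Comparer setCaseOrder(int caseOrder)
  {
      this.caseOrder = caseOrder;
      return this;
  }

}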

Then we must compile this class into Java bytecode:


javac -classpath /path/to/saxon.jar Compare_de.java

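The resulting file Compare_de.class must be on the CLASSPATH when Saxon is invoked in order to get the proper German sorting. Because the class belongs to the package com.icl.saxon.sort, it must sit in a matching com/icl/saxon/sort directory below a CLASSPATH entry. The same procedure applies to other languages.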
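Before wiring the class into Saxon, it can be useful to check what the collator does on its own. The following stand-alone sketch compares plain string ordering with ordering by a German java.text.Collator; the GermanSortDemo class and the sample words are assumptions made for this illustration and are not part of Saxon or the stylesheets.

```java
import java.text.Collator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

class GermanSortDemo {

    // Plain String ordering compares character codes, so "Ä" (U+00C4)
    // sorts after every unaccented ASCII letter, including "Z".
    static List<String> sortByCharacterCode(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collections.sort(copy);
        return copy;
    }

    // A German collator treats "Ä" as a variant of "A".
    static List<String> sortGerman(List<String> words) {
        List<String> copy = new ArrayList<>(words);
        Collator collator = Collator.getInstance(Locale.GERMANY);
        Collections.sort(copy, collator);
        return copy;
    }

    public static void main(String[] args) {
        // Zebra, Ärger, Apfel
        List<String> words = Arrays.asList("Zebra", "\u00C4rger", "Apfel");
        System.out.println(sortByCharacterCode(words));
        System.out.println(sortGerman(words));
    }
}
```

Plain ordering puts "Ärger" after "Zebra", while the collator places it between "Apfel" and "Zebra", which is the order a German reader expects in an index.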
