Mastering DocBook Indexes

July 14, 2004

Jirka Kosek

These days DocBook is considered to be a standard documentation format. Good documentation should be accompanied by a good index. This article will show you how to create professional indexes in DocBook and how to deal with indexes in languages other than English.

DocBook succeeded because it is supported by plenty of tools, many of them free. XML editors have matured over the years and now offer comfortable editing environments. There are even free WYSIWYG editors available, such as XMLmind XML Editor. The freely available DocBook XSL stylesheets allow conversion from DocBook to a variety of output formats including HTML, XHTML, print (PDF, PostScript), HTML Help, and JavaHelp. Moreover, output from the stylesheets can be easily customized through a large number of parameters.

The usability of a document, especially a printed one, can be boosted by a good index. Creating an index is a laborious task often performed by specialists. Unfortunately, in the area of open-source projects, samizdat publications, or books targeted at small language markets, you often have to cope with very limited resources, both financial and human. For that reason I am going to show you how to create and process an index in a DocBook document yourself. In the remainder of this article you will see how to generate indexes for non-English languages, how to put several indexes into a single document, and finally how to turn semantic markup into index entries quite easily.

Marking up Index Entries

The most difficult part of creating an index must be done manually and consists of marking up index entries in a document. In DocBook this is done by placing indexterm elements wherever you write about the given topic. The content of an indexterm is not displayed as part of the document flow; it is used later when building the index.

<para>Wealth of modern societies is built upon information
<indexterm><primary>information</primary></indexterm>.</para>

The indexterm element can also hold multilevel entries:


<indexterm>
  <primary>information</primary>
</indexterm>
...
<indexterm>
  <primary>information</primary>
  <secondary>retrieval</secondary>
</indexterm>
...
<indexterm>
  <primary>information</primary>
  <secondary>dissemination</secondary>
</indexterm>
...
<indexterm>
  <primary>information</primary>
  <secondary>dissemination</secondary>
  <tertiary>oral</tertiary>
</indexterm>

Such index terms will result in the following index output (the page numbers are, of course, for illustration only):

 information, 13
  dissemination, 17
    oral, 25
  retrieval, 15

If a large chunk of a document corresponds to a certain topic, we can use a special mode of indexterm to assign a range in the document to the topic. In this scenario, two indexterm elements mark the start and the end of the range. A unique identifier sets up the relation between these two elements.


<indexterm class="startofrange" id="ix.xml.history">
  <primary>XML</primary>
  <secondary>history</secondary>
</indexterm>

  ... other DocBook markup and text describing history of XML ...

<indexterm class="endofrange" startref="ix.xml.history"/>

In the resulting index we will see something like this:


XML
  history, 27–42

If an entry should be sorted differently from the way it is displayed, we can use the sortas attribute. During index grouping and sorting, the text of the entry is ignored and the value of the sortas attribute is used instead. This can be useful when an index entry contains special symbols that should sort differently, for example based on their phonetic representation. The following example creates an index entry that displays the Greek letter Ω in the index but sorts it as if it were the word "Omega."


<indexterm>
  <primary sortas="Omega">&Omega;</primary>
</indexterm>

If some occurrences of a term should be emphasized in the index (e.g., the page number of the term's definition should be bold), we can specify the significance of each entry.


<indexterm significance="preferred">
  <primary>information</primary>
</indexterm>

If an index term should not point to a particular page number, or an anchor in HTML output, but rather to a different term, we can utilize the see and seealso elements.


<indexterm>
  <primary>DTD</primary>
</indexterm>

<indexterm>
  <primary>document type definition</primary>
  <see>DTD</see>
</indexterm>

<indexterm>
  <primary>XML Schema</primary>
  <seealso>DTD</seealso>
</indexterm>

Which results in:


- D -
document type definition, see DTD
DTD, 42

- X -
XML Schema, 81, see also DTD

Up to this point we have covered most of DocBook's capabilities for marking up index entries. I left out the zone attribute, which can be used to place index entries outside the document flow. I personally do not consider this method useful for handmade indexes, but if you are interested you can read more about it in the documentation.
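
For completeness, here is a minimal sketch of how the zone attribute might be used. The id value ch.xml.history is a hypothetical name chosen for illustration. The zone attribute lists the ids of the elements that the entry covers, so the indexterm itself can be kept elsewhere in the document.

```xml
<!-- The chapter that the index entry should point to (id is hypothetical) -->
<chapter id="ch.xml.history">
  <title>History of XML</title>
  <para>...</para>
</chapter>

<!-- Placed elsewhere in the document; zone refers to the chapter by id -->
<indexterm zone="ch.xml.history">
  <primary>XML</primary>
  <secondary>history</secondary>
</indexterm>
```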

Generating an Index

The DocBook XSL stylesheets generate indexes automatically. The only thing we have to do is place an empty index element into a location where a real index should appear. This is usually somewhere near the end of a document.
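
As a minimal sketch, a book with an automatically generated index could be laid out as follows. The titles are placeholders, not taken from a real document.

```xml
<book>
  <title>Sample Book</title>
  <chapter>
    <title>First Chapter</title>
    <para>Text containing indexterm elements ...</para>
  </chapter>
  <!-- Empty index element: the stylesheets fill it in during processing -->
  <index/>
</book>
```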

The stylesheets adapt the index appearance to the output format. An index on an HTML page does not contain page numbers, but instead uses section or chapter titles that link back to the index term occurrence in the document flow. If the output format is HTML Help, then the HTML Help index is built instead of a simple HTML page with links.

However, print output is not without obstacles. Generating a printed index with XSL is a two-phase process. The first phase is an XSLT transformation that converts a source DocBook document into a set of abstract formatting objects. Page numbers for the index entries are not known at this point. The actual rendering and page-number evaluation take place during the second, formatting phase, which is performed by an FO processor like FOP, XEP, or XSL Formatter.

Problems arise when one index term occurs twice within a page. In this case, the index contains duplicate page numbers for this entry. We will see how to deal with it in the following parts of this article.

Indexes for non-English languages represent another issue. Generating the index consists of grouping the index terms with the same initial letter and then alphabetically sorting the entries within each letter group. The stylesheets implement exactly this algorithm, which is unfortunately insufficient for many languages.

For example, some languages treat "ch" as a single letter that sorts between "c" and "d" in traditional Spanish or between "h" and "i" in Czech. Diacritics add further complexity. Some languages ignore them completely; others use complex rules. In Czech, for example, words starting with the letters "u" and "ú" belong to the same index group, but words starting with "c" and "č" belong to two different groups. And we don't even want to start thinking about the CJKV languages.
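
To make the limitation concrete, the following standalone XSLT 1.0 stylesheet is a simplified sketch of first-letter grouping using the well-known Muenchian technique. It is not the code of the DocBook XSL stylesheets, and the key name by-letter is made up for this illustration. It also does not merge repeated terms or handle secondary entries. Because the group is derived from a single character with substring(), a Czech term starting with "ch" necessarily lands in the C group.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:output method="text"/>

<xsl:variable name="lower" select="'abcdefghijklmnopqrstuvwxyz'"/>
<xsl:variable name="upper" select="'ABCDEFGHIJKLMNOPQRSTUVWXYZ'"/>

<!-- Group index terms by the upper-cased first character of primary.
     (XSLT 1.0 does not allow variable references inside xsl:key.) -->
<xsl:key name="by-letter" match="indexterm[primary]"
         use="translate(substring(primary, 1, 1),
                        'abcdefghijklmnopqrstuvwxyz',
                        'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>

<xsl:template match="/">
  <!-- Visit only the first index term of each letter group -->
  <xsl:for-each select="//indexterm[primary][generate-id() =
        generate-id(key('by-letter',
          translate(substring(primary, 1, 1), $lower, $upper))[1])]">
    <xsl:sort select="translate(substring(primary, 1, 1), $lower, $upper)"/>
    <xsl:variable name="letter"
        select="translate(substring(primary, 1, 1), $lower, $upper)"/>
    <!-- Group heading -->
    <xsl:text>- </xsl:text>
    <xsl:value-of select="$letter"/>
    <xsl:text> -&#10;</xsl:text>
    <!-- All terms of this group, sorted by their text -->
    <xsl:for-each select="key('by-letter', $letter)">
      <xsl:sort select="primary"/>
      <xsl:value-of select="primary"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:for-each>
</xsl:template>

</xsl:stylesheet>
```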

XSLT offers very poor support for grouping, which is why the index generation is very difficult. If you want to implement locale-aware indexing in XSLT you will reach the limits of the language. Fortunately, many XSLT processors offer extensions to the XSLT core -- so we will next see how internationalized indexing is supported in the DocBook XSL stylesheets.

Removing Duplicate Page Numbers from a Printed Index

As I mentioned earlier, the current combination of the XSLT and XSL-FO standards does not provide a mechanism for removing duplicate page numbers from a printed index. This serious drawback can be overcome in two ways. The first solution relies on an FO processor that implements a vendor extension for index generation. The other possibility is to use multiple passes over a document to detect and remove the duplicates.

The vendor extensions are supported in the two best-known commercial FO processors -- XEP and XSL Formatter. The DocBook XSL stylesheets contain support for these FO implementations; we just tell the stylesheets to use these extensions by turning on the appropriate parameter. For instance, XEP can be invoked with the following command line:


xep -xml document.xml -xsl .../fo/docbook.xsl -param xep.extensions=1

In the real world we usually change the behavior of the stylesheets by customizing more than one parameter. The best practice is then to create a customization layer, which imports the stock stylesheets and sets all necessary parameters.


<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:import href="http://docbook.sourceforge.net/release/xsl/current/fo/docbook.xsl"/>

<xsl:param name="paper.type" select="'A4'"/>
<xsl:param name="xep.extensions" select="1"/>

</xsl:stylesheet>

If you prefer XSL Formatter over XEP, you can use the similar parameter axf.extensions to turn on XSL Formatter support. Either parameter results in duplicate page numbers being removed and in continuous sequences of page numbers being collapsed into page ranges. For example, if a single index entry occurs on the following pages:

5, 5, 8, 9, 10, 37

the output is condensed into the more readable form:

5, 8–10, 37

In the future, such vendor extensions will no longer be necessary, because the upcoming version 1.1 of XSL-FO has direct index support.
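
For XSL Formatter, the customization layer would presumably look the same as the XEP one, with only the extension parameter swapped. A sketch based on the axf.extensions parameter name mentioned above:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:import href="http://docbook.sourceforge.net/release/xsl/current/fo/docbook.xsl"/>

<xsl:param name="paper.type" select="'A4'"/>
<!-- Turn on the XSL Formatter extensions instead of the XEP ones -->
<xsl:param name="axf.extensions" select="1"/>

</xsl:stylesheet>
```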

When using another FO processor, we must employ a more difficult procedure. This is also the case for the open-source FOP processor. We must process the document twice. The first pass is done with the make.index.markup parameter set.
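
One way to set the parameter for the first pass is a small customization layer like the following sketch. The value 1 is an assumption that follows the convention used for the other on/off parameters in this article.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:import href="http://docbook.sourceforge.net/release/xsl/current/fo/docbook.xsl"/>

<!-- First pass only: emit index markup into the output instead of a finished index -->
<xsl:param name="make.index.markup" select="1"/>

</xsl:stylesheet>
```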

The resulting PDF will contain XML markup for the index entries and page numbers. This PDF can be converted to plain text, from which the XML markup is extracted. The duplicates are then removed, and the modified XML fragment of the index is used in a second pass to produce the final PDF. This process is a real hack, and it does not work very well for languages that use characters outside ISO Latin 1, because FOP does not insert the proper Unicode mapping vector for embedded fonts. This technique was invented by G. Ken Holman.

Internationalized Indexes

The DocBook XSL stylesheets adapt their output to the document language. The document language can be specified with the lang attribute.


<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE book PUBLIC '-//OASIS//DTD DocBook XML V4.3//EN'
  'http://www.oasis-open.org/docbook/xml/4.3/docbookx.dtd'>

<book lang="de">
  ... German book ...
</book>

Due to the previously mentioned limitations of XSLT, the stylesheets cannot use different grouping criteria for each language. Fortunately, several XSLT processors offer extensions to XSLT that can be used to overcome this limitation. As these extensions are not backward compatible with pure XSLT, they cannot be included in the default stylesheet. If we want to generate an internationalized index, we must use an EXSLT-aware XSLT processor that supports user-defined functions. These functions can then be used in the definition of a lookup key (xsl:key). These criteria are met by Saxon; xsltproc still has some unresolved issues at the time of this writing.
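
The following sketch illustrates the general idea of calling a user-defined EXSLT function from xsl:key. It is not the actual content of autoidx-ng.xsl. The my namespace, the function name my:group-letter, and the key name by-letter are hypothetical, and the function only handles the "ch" digraph, not diacritics.

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:func="http://exslt.org/functions"
                xmlns:my="http://example.org/index-sketch"
                extension-element-prefixes="func"
                exclude-result-prefixes="my"
                version="1.0">

<!-- Hypothetical helper: returns the index group of a term,
     treating "ch" as a single letter -->
<func:function name="my:group-letter">
  <xsl:param name="term"/>
  <xsl:variable name="upper"
      select="translate($term, 'abcdefghijklmnopqrstuvwxyz',
                               'ABCDEFGHIJKLMNOPQRSTUVWXYZ')"/>
  <xsl:choose>
    <xsl:when test="starts-with($upper, 'CH')">
      <func:result select="'CH'"/>
    </xsl:when>
    <xsl:otherwise>
      <func:result select="substring($upper, 1, 1)"/>
    </xsl:otherwise>
  </xsl:choose>
</func:function>

<!-- The user-defined function is called from the key definition -->
<xsl:key name="by-letter" match="indexterm[primary]"
         use="my:group-letter(primary)"/>

</xsl:stylesheet>
```

A plain XSLT 1.0 processor without EXSLT function support would reject this stylesheet, which is why such code cannot be part of the default stylesheets.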

If we want to use internationalized indexing features of the stylesheets we must create a customization layer that will override default index-generating templates by including a small autoidx-ng.xsl stylesheet.


<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<xsl:import href="http://docbook.sourceforge.net/release/xsl/current/fo/docbook.xsl"/>

<xsl:include href="http://docbook.sourceforge.net/release/xsl/current/fo/autoidx-ng.xsl"/>

<!--
Parameter settings and other modifications of stylesheet
-->

</xsl:stylesheet>

The internationalized indexing is implemented for both HTML and print (FO) output. Each output format has its own autoidx-ng.xsl file in the corresponding directory. The stylesheets currently support the internationalized indexing for the following languages: Czech, Danish, German, English, Spanish, French, and Turkish.
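
For HTML output, the customization layer would presumably follow the same pattern. In this sketch the two html/ URLs are assumptions derived from the FO example by swapping the directory name:

```xml
<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<!-- Assumed locations: same layout as the FO stylesheets, but in the html directory -->
<xsl:import href="http://docbook.sourceforge.net/release/xsl/current/html/docbook.xsl"/>

<xsl:include href="http://docbook.sourceforge.net/release/xsl/current/html/autoidx-ng.xsl"/>

</xsl:stylesheet>
```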

The described method of internationalization places each index term into the correct letter group and the groups are sorted in proper collating order. Sorting of entries within one group is left to the XSLT processor, which may be a problem because many XSLT processors support only the English sort order out-of-the-box. Saxon 6.5.3 (the recommended version for use with the DocBook stylesheets) can be easily extended to support user-defined collation.

We first create a simple implementation of a TextComparer, which must be named after the language code. For example, for German we must create a class named Compare_de.


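package com.icl.saxon.sort;

import java.text.Collator;
import java.util.Locale;

// The class name carries the language code ("de") so that Saxon can find it.
public class Compare_de extends TextComparer
{
  // Build the German collator once instead of on every comparison.
  private final Collator deCollator =
      Collator.getInstance(new Locale("de", "DE"));

  int caseOrder = UPPERCASE_FIRST;

  // Compare two sort keys using German collation rules.
  public int compare(Object a, Object b)
  {
      return deCollator.compare(a, b);
  }

  // Remember the requested case order; the collator above does not consult it.
  public Comparer setCaseOrder(int caseOrder)
  {
      this.caseOrder = caseOrder;
      return this;
  }
}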

Then we must compile this class into the Java bytecode:


javac -classpath /path/to/saxon.jar Compare_de.java

The resulting file Compare_de.class must be on the CLASSPATH when Saxon is invoked in order to get the proper German sorting. The same procedure applies to other languages.

Multiple Indexes in a Document

Occasionally we have to deal with documents that contain more than one index. A combination of author and subject indexes is quite common. The stylesheets are ready for this situation. The only thing we have to do is distinguish index entries by specifying an index identifier in the type attribute.


<para>
  Wealth of modern societies is built upon information
  <indexterm type="subj">
    <primary>information</primary>
  </indexterm>.
  Information theory was evolved in the Forties by Claude
  Shannon.
  <indexterm type="name">
    <primary>Shannon, Claude</primary>
  </indexterm>
</para>

Then we must place two index elements at the end of the document, each denoting one specialized index by its type.


<index type="subj"/>

<index type="name">
  <title>Name index</title>
</index>

Generating multiple indexes is turned on by default, but can be suppressed by the index.on.type parameter. DocBook 4.2 and earlier versions do not support the new type attribute. In that case we can use the universal role attribute for index typing. The stylesheets also contain the corresponding index.on.role parameter.
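
A sketch of the role-based variant for DocBook 4.2 and earlier might look as follows. The role values mirror the type values used above, and setting index.on.role to 1 in the customization layer is assumed to be how the feature is switched on.

```xml
<!-- In the document: role takes the place of type -->
<indexterm role="name">
  <primary>Shannon, Claude</primary>
</indexterm>

<index role="name">
  <title>Name index</title>
</index>

<!-- In the customization layer (assumed on/off value) -->
<xsl:param name="index.on.role" select="1"/>
```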

From the Semantic Markup to the Index

In DocBook you can use dozens of different elements to distinguish between file names, function names, commands, etc. The following paragraph demonstrates how to use semantic markup.


<para>
  <command>rm</command> command can be very useful, but be
  careful when you are using it. There are several files in
  your system like <filename>/etc/passwd</filename> which
  are quite important.
</para>

Adding semantically distinguished terms into an index is important since readers often use indexes for quick lookups. In order to place terms from the previous example into the index we must use quite a lot of markup.


<para>
  <command>rm</command>
  <indexterm><primary>rm</primary></indexterm>
  <indexterm>
    <primary>commands</primary>
    <secondary>rm</secondary>
  </indexterm>
  command can be very useful, but be careful when you are
  using it. There are several files in your system like
  <filename>/etc/passwd</filename>
  <indexterm><primary>/etc/passwd</primary></indexterm>
  which are quite important.
</para>

This markup will produce the following output in the index:


- Symbols -
/etc/passwd, 42

- C -
commands,
  rm, 42

- R -
rm, 42

The resulting index is useful, isn't it? But to be honest, no one wants to type all these redundant index terms manually. Fortunately, mapping from semantic markup to index entries is simple and unambiguous in this situation and can be easily automated. The following standalone stylesheet takes an arbitrary DocBook document and adds index entries for each command and filename element.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<!-- By default copy the whole document -->
<xsl:template match="node()|@*">
  <xsl:copy>
    <xsl:apply-templates select="node()|@*"/>
  </xsl:copy>
</xsl:template>

<!-- Each command is placed twice into index -->
<xsl:template match="command">
  <!-- Copy original element -->
  <xsl:copy-of select="."/>
  <!-- Create new index entries -->
  <indexterm>
    <primary><xsl:value-of select="."/></primary>
  </indexterm>
  <indexterm>
    <primary>commands</primary>
    <secondary><xsl:value-of select="."/></secondary>
  </indexterm>
</xsl:template>

<!-- Each filename is placed into index -->
<xsl:template match="filename">
  <!-- Copy original element -->
  <xsl:copy-of select="."/>
  <!-- Create new index entry -->
  <indexterm>
    <primary><xsl:value-of select="."/></primary>
  </indexterm>
</xsl:template>

</xsl:stylesheet>

The result of applying this stylesheet to a document is a temporary document with added index entries for all commands and filenames. We can process this temporary document like any other DocBook document. The whole process can be easily automated using make, shell scripting, or a similar technique.
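
As one possible "similar technique," the two steps could be chained in an Apache Ant build file. Ant is my choice for this illustration and is not mentioned in the original workflow. All file names here (document.xml, add-index.xsl, document-indexed.xml, custom-fo.xsl, document.fo) are hypothetical.

```xml
<?xml version="1.0" encoding="utf-8"?>
<project name="docbook-index" default="fo" basedir=".">

  <!-- Step 1: add index entries for command and filename elements -->
  <target name="add-index">
    <xslt in="document.xml" out="document-indexed.xml" style="add-index.xsl"/>
  </target>

  <!-- Step 2: run the usual DocBook transformation on the temporary document -->
  <target name="fo" depends="add-index">
    <xslt in="document-indexed.xml" out="document.fo" style="custom-fo.xsl"/>
  </target>

</project>
```

Here custom-fo.xsl stands for a local customization layer that imports the stock FO stylesheet, and the resulting document.fo still has to be rendered by an FO processor.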

The DocBook stylesheets also offer a more sophisticated solution. Index terms can be added automatically during normal stylesheet processing, without the need for a temporary file and two transformations. The idea is implemented on top of the profiling stylesheets. The profiling stylesheets are special versions of the standard stylesheets that can filter content before the real transformation starts.

This can be used for conditional documents where different parts of a document are presented to different target audiences. The internal implementation of profiling performs a special copying-and-filtering phase before processing. During this phase, a temporary profiled document is created in memory. We can alter this process to add index terms for semantic elements, as these elements are rarely used for profiling.


<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">

<!-- Import of the original stylesheet -->
<xsl:import href="http://docbook.sourceforge.net/release/xsl/current/fo/profile-docbook.xsl"/>

<!-- Each command is placed twice into index -->
<xsl:template match="command" mode="profile">
  <!-- Copy original element -->
  <xsl:copy-of select="."/>
  <!-- Create new index entries -->
  <indexterm>
    <primary><xsl:value-of select="."/></primary>
  </indexterm>
  <indexterm>
    <primary>commands</primary>
    <secondary><xsl:value-of select="."/></secondary>
  </indexterm>
</xsl:template>

<!-- Each filename is placed into index -->
<xsl:template match="filename" mode="profile">
  <!-- Copy original element -->
  <xsl:copy-of select="."/>
  <!-- Create new index entry -->
  <indexterm>
    <primary><xsl:value-of select="."/></primary>
  </indexterm>
</xsl:template>

</xsl:stylesheet>

Conclusion

DocBook in conjunction with the DocBook XSL stylesheets offers a comprehensive solution for creating and processing indexes. This article has shown how easily you can create and process indexes in DocBook. The stylesheets are also ready to fulfill challenging requirements for internationalized indexes and easy semantic-markup indexing.
