Electronic Publishing with XML
June 27, 2001
John McKeown and Benjamin Jung
Introduction
In this article, we describe the process of creating electronic publications using XML and related standards. This publishing procedure has been used to generate conference proceedings for the XML Europe 2001 Conference. We will describe the most important steps in this XML-based publishing process and highlight some of its advantages.
XML Europe 2001
Now in its seventeenth year, the XML Europe Conference was held this year in Berlin (May 21-25, 2001). Formerly known as SGML Europe, the conference was renamed SGML/XML Europe in 1998 and subsequently became XML Europe.
In the past, the proceedings for XML Europe have been available in both paper and electronic formats. For various reasons, the conference organizers, GCA, discarded the paper version this year and opted for an electronic publication only. This was distributed on CD-ROM to each of the conference delegates. Additionally, the GCA used this publication as the basis for an online version on their web site. XML technologies were used throughout the creation process.
An XML-based Publishing Process
Producing a publication using XML technologies involves a number of distinct steps: content creation, validation, and publication. These steps are discussed in the following sections and are applicable to the production of any publication (electronic or print) with XML.
Step 1: XML Content Creation
The first step in an XML-based publishing process is the creation or acquisition of content in an appropriate XML vocabulary. The vocabulary should be flexible enough to represent all common features (e.g., headings, sections, sub-sections, paragraphs, links) and advanced features (e.g., tables, figures, and bibliographies) of a publication. One possible vocabulary is DocBook XML, which is used to mark up documents such as books, articles, and technical documentation in logical sections.
For the XML Europe conference, an XML DTD was developed that defines the structure of a generic conference paper. This is known as the GCAPaper DTD. Each author whose presentation abstract was accepted by the conference program committee was requested to submit the final paper in XML according to the GCAPaper DTD. The use of this DTD ensures a similar structure for each paper. Thus, all papers can be processed in an identical manner by the publishing process. Here is an example document.
<gcapaper id="s01-1" day="Tuesday" attendee="All">
  <front>
    <title>The power of XML</title>
    <author refid="s01-1auth1">
      <fname>John</fname>
      <surname>Smith</surname>
      <jobtitle>Senior Consultant</jobtitle>
      <address>
        <affil>Global Enterprises</affil>
        <city>Dublin</city>
        <cntry>Ireland</cntry>
        <email>john.smith4@globent.com</email>
      </address>
      <bio id="s01-1auth1">
        <para> <highlight>John Smith</highlight> - John is a senior consultant for Global Enterprises </para>
      </bio>
    </author>
    <abstract>
      <para>XML is a powerful language for defining markup languages for specific application domains. The XML Specification has been a W3C recommendation since February 1998.</para>
    </abstract>
  </front>
  <body>
    <para>Paper unavailable at press time.</para>
  </body>
</gcapaper>
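Because every paper shares this structure, the same few lines of code can read the same fields from any submission. The following is a minimal sketch in Python using only the standard library. The function name `paper_summary` and the trimmed sample are assumptions made for illustration and are not part of the GCA tooling.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the example paper above (illustrative only).
SAMPLE = """<gcapaper id="s01-1" day="Tuesday" attendee="All">
  <front>
    <title>The power of XML</title>
    <author refid="s01-1auth1">
      <fname>John</fname>
      <surname>Smith</surname>
      <address>
        <affil>Global Enterprises</affil>
        <city>Dublin</city>
        <cntry>Ireland</cntry>
      </address>
    </author>
  </front>
  <body><para>Paper unavailable at press time.</para></body>
</gcapaper>"""


def paper_summary(xml_text):
    """Return the id, title, and author names of one GCAPaper-style document."""
    root = ET.fromstring(xml_text)
    authors = [
        f"{a.findtext('fname', '')} {a.findtext('surname', '')}".strip()
        for a in root.iter("author")
    ]
    return {
        "id": root.get("id"),
        "title": root.findtext("front/title"),
        "authors": authors,
    }


print(paper_summary(SAMPLE))
```

The point of the sketch is that nothing in `paper_summary` is specific to one paper: a shared DTD is what lets one routine serve every submission.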
To support authors and facilitate the creation of papers in XML, a variety of tools were provided. These included dedicated XML editors (Epic by Arbortext and XMetal by SoftQuad) and extensions to Microsoft Word that allow content to be exported to XML (WorX by HyperVision and S4/Text by i4i). Each of these tools was made available under an evaluation license and was customized to produce XML content adhering to the GCAPaper DTD.
Step 2: Input Validation
Once the content for a publication is in XML, it needs to be validated against the publication DTD. This type of structural validation is a core feature of XML and can easily be performed using any validating XML parser. In addition to structural validation, it is also necessary to validate the contents of the publication logically. This ensures that elements in the DTD have been used in a consistent and correct manner (e.g. "Dublin" is marked as a city and not as a country). The content validation step is particularly important when the content originates from many sources.
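Structural validation needs a validating parser, but logical checks can be expressed as ordinary code over the parsed tree. The sketch below, in Python with the standard library only, shows two such checks. Both rules are assumptions made for illustration: that an author's `refid` is meant to match the `id` of a `bio` element (as it does in the example paper), and that country names can be tested against a reference list. `KNOWN_COUNTRIES` and `logical_problems` are hypothetical names.

```python
import xml.etree.ElementTree as ET

# Hypothetical reference list; a real check would use a complete country list.
KNOWN_COUNTRIES = {"Germany", "Ireland", "United Kingdom", "United States"}


def logical_problems(xml_text):
    """Return a list of human-readable problems found in one paper."""
    root = ET.fromstring(xml_text)
    problems = []

    # Assumed rule: every author/@refid should match the id of some bio element.
    bio_ids = {bio.get("id") for bio in root.iter("bio")}
    for author in root.iter("author"):
        refid = author.get("refid")
        if refid not in bio_ids:
            problems.append(f"author refid {refid!r} has no matching bio")

    # A city name placed in <cntry> can pass the DTD but fails this check.
    for cntry in root.iter("cntry"):
        name = (cntry.text or "").strip()
        if name not in KNOWN_COUNTRIES:
            problems.append(f"unrecognized country {name!r}")

    return problems


GOOD = ('<gcapaper><front><author refid="a1"><address><city>Dublin</city>'
        '<cntry>Ireland</cntry></address><bio id="a1"/></author></front></gcapaper>')
BAD = ('<gcapaper><front><author refid="a1"><address><city>Ireland</city>'
       '<cntry>Dublin</cntry></address></author></front></gcapaper>')

print(logical_problems(GOOD))
print(logical_problems(BAD))
```

Returning a list of problems rather than stopping at the first one suits content from many sources, since each author can be sent a complete report in one pass.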
Almost all papers submitted to XML Europe 2001 adhered to the GCAPaper DTD. The exceptions included Microsoft PowerPoint presentations, which had to be converted to the GCAPaper DTD structure before they could be included in the conference proceedings publication. Further validation of all papers was then required to ensure they adhered to specific authoring guidelines for the DTD.
The authoring guidelines accompanying the GCAPaper DTD specify the correct usage of elements in the DTD and also define naming conventions for cross-references and images used within each paper. Validation of authoring guidelines is especially important for conference proceedings as a variety of authoring tools are used to produce papers. Once all conference papers were received and validated, they were imported into a master document representing the conference proceedings publication.
Step 3: Producing electronic publication formats from XML
XML is predominantly used to define markup vocabularies for a specific application domain and, unlike HTML, generally has no default formatting styles. Instead, stylesheets are used to associate presentational information with XML documents. The W3C has developed a stylesheet language specifically for XML known as the eXtensible Stylesheet Language (XSL).
XSL consists of a transformation language (XSLT) and a language for high-quality formatting and layout of XML documents known as XSL Formatting Objects (XSL-FO). The XSLT Specification has been a W3C Recommendation since 1999 and is widely used as a means of transforming content from XML to other formats (including, but not limited to, XML). XSLT processors are available in various programming languages. Recent versions of certain Web browsers also support XSLT processing.
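To give a feel for what such a transformation does, the sketch below performs by hand, in Python, the kind of element-to-element mapping that an XSLT template rule expresses declaratively. It is not XSLT and not the stylesheet used for the proceedings: the `TAG_MAP` table, the `to_html` function, and the choice of HTML elements are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from source element names to HTML element names.
TAG_MAP = {"title": "h1", "para": "p", "highlight": "b"}


def to_html(node):
    """Recursively copy node, renaming mapped elements and using <div> otherwise."""
    out = ET.Element(TAG_MAP.get(node.tag, "div"))
    out.text = node.text  # text before the first child
    out.tail = node.tail  # text following this element inside its parent
    for child in node:
        out.append(to_html(child))
    return out


source = ET.fromstring(
    "<front><title>The power of XML</title>"
    "<para><highlight>John Smith</highlight> is a senior consultant.</para></front>"
)
html = ET.tostring(to_html(source), encoding="unicode")
print(html)
```

An XSLT stylesheet states the same mapping as a set of template rules, which keeps the presentation logic in a separate file that can be swapped per output format.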
The XSL Specification, which defines XSL-FOs, is considerably larger than XSLT and is currently a W3C Candidate Recommendation. As a result, software support for XSL-FOs is not yet as widespread as support for XSLT. A number of software tools are under development that will support XSL-FOs. One tool already available is FOP, an open source print formatter driven by XSL-FOs, from the Apache XML Project. Although FOP is still in development and does not currently support the full XSL specification, it can be used to create PDF documents from XML content.
For this year's XML Europe conference, the GCA chose deepX (the authors' Ireland-based company, specializing in electronic publishing with XML) to create a version of the conference proceedings for distribution on CD-ROM and publication on the GCA web site. These proceedings included HTML as well as printable PDF versions of each conference paper. The creation of these formats (and others such as eBook) was entirely achieved through the use of XSL.
XSLT was used in the production of the XML Europe conference proceedings to generate an HTML version of each conference paper. Additional information pages were generated for efficient navigation within the proceedings, including a table of contents, index pages, and a biography page for each author. FOP was used to create a PDF version of each paper and a single PDF document containing the entire publication. The XSL stylesheets used to create the PDF documents were based on the DocBook XSL stylesheets developed by Norman Walsh.
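Navigation pages such as a table of contents can be derived from the same source as the papers themselves. The following Python sketch illustrates the idea with the standard library. It is not the XSLT used for the proceedings, and the `build_toc` function, the `<id>.html` file naming, and the second sample title are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from html import escape


def build_toc(papers):
    """Build an HTML list linking to one page per paper.

    papers: iterable of XML strings, each a GCAPaper-style document.
    The "<id>.html" file naming is an assumption for this sketch.
    """
    items = []
    for xml_text in papers:
        root = ET.fromstring(xml_text)
        paper_id = root.get("id", "")
        title = root.findtext("front/title", "Untitled")
        items.append(
            f'<li><a href="{escape(paper_id)}.html">{escape(title)}</a></li>'
        )
    return "<ul>\n" + "\n".join(items) + "\n</ul>"


PAPERS = [
    '<gcapaper id="s01-1"><front><title>The power of XML</title></front></gcapaper>',
    # Hypothetical second paper, included to show escaping of "&".
    '<gcapaper id="s01-2"><front><title>Markup &amp; publishing</title></front></gcapaper>',
]
toc = build_toc(PAPERS)
print(toc)
```

Because the list is generated rather than written by hand, adding or retitling a paper updates the navigation on the next build.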
To demonstrate further the potential of an XML-based publishing process, deepX produced an eBook version of the conference proceedings, based upon the Open eBook (OEB) publication structure. The OEB format describes the content and structure of an electronic publication and is supported by most eBook hardware and software readers. In some cases an OEB publication is compiled into a hardware/software specific eBook format. XSLT was used to generate the OEB version of the conference publication. A variety of eBook implementations were demonstrated at the deepX booth at XML Europe 2001 using eBook reader software on a desktop computer and PocketPC (Microsoft Reader), as well as on a dedicated eBook device (CyBook).
Advantages
A publishing process based on XML has a number of significant advantages. For XML Europe, the GCA adopted such a process not only because of the conference's XML focus but also for these practical benefits. Content defined in XML is platform- and software-independent. It is also independent of a particular display format, since XML separates content from presentational information. This simplifies the generation of multiple formats from a single source using technologies like XSLT. In addition, the content remains compatible with emerging publication formats: supporting a new format only requires defining an appropriate transformation to it.
Challenges
XML is much easier to learn, use, and process than its parent SGML, and it has been adopted by a wider range of application domains than SGML. However, current support for XML as a publishing format is only provided by specialized software or as an export format within more common publishing tools. Facilitating the creation of content in XML requires tools that allow authors to produce structured information without having to change their current practices.
Although submissions for XML Europe 2001 were requested in an XML format, a small number did not adhere to this guideline. The submissions received included non-XML documents (e.g. Microsoft PowerPoint) as well as well-formed but invalid XML documents. To keep the publication process consistent, a significant amount of manual work had to be undertaken to correct several submissions. The process for generating the conference proceedings could be made more flexible by providing import filters from common authoring environments such as PowerPoint.
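A first automated pass can at least separate the kinds of submission described above before any manual correction begins. The Python sketch below is one possible triage step using only the standard library. The `classify_submission` function and its category names are assumptions for illustration, and it only tests well-formedness and the root element name; checking validity against the GCAPaper DTD would still require a validating parser.

```python
import xml.etree.ElementTree as ET


def classify_submission(data):
    """Classify raw submission text as 'not-xml', 'wrong-root', or 'candidate'.

    'candidate' only means well-formed with the expected root element;
    validity against the DTD still needs a validating parser.
    """
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return "not-xml"
    return "candidate" if root.tag == "gcapaper" else "wrong-root"


print(classify_submission("<gcapaper id='s01-1'><front/></gcapaper>"))
print(classify_submission("<article><title>Slides</title></article>"))
print(classify_submission("binary slide deck bytes"))
```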
Conclusions
XML is well suited as the technology around which to build a publishing process. It is a platform- and software-independent language that can be transformed into a variety of common publishing formats, including content formats used on the Web. The publishing process adopted by the GCA has proven very successful, and it is fitting that a conference promoting XML demonstrates one of its many useful applications.
Links
- Apache - http://xml.apache.org/
- deepX - http://www.deepX.com/
- DocBook XML - http://www.docbook.org/
- DocBook XSL Stylesheets - http://www.nwalsh.com/
- GCA - http://www.gca.org/
- OEBForum - http://www.openebook.com/
- W3C - http://www.w3.org/
- XML Europe 2001 - http://www.gca.org/papers/xmleurope2001/
Acknowledgments
The authors would like to thank Pam Gennusa for her efforts in the coordination and production of the conference proceedings for XML Europe 2001. Thanks also to Hewlett Packard Ireland and Cytale, who provided devices for demonstration purposes at the deepX booth at XML Europe 2001. More information about these devices (Jornada 548 and 720, CyBook) can be found on the companies' respective web sites.