XML.com: XML From the Inside Out
oreilly.comSafari Bookshelf.Conferences.

advertisement

ScrollKeeper: Open Source Document Management

November 28, 2001

The Document Collection Problem

Operating systems are very complex these days, composed of many parts and pieces. Linux, like other Unix-like, free software operating systems, is really just a collection of autonomous and dependent software packages. On my workstation there are about 850 packages at last count. A moderately busy, production Internet server might have as many as 650 packages. And a development server, supporting diverse activities of a complex development team, might have as many as 1,000 packages.

Each of these packages, in turn, contains files numbering from a few to several hundred. My workstation's 850 packages contain about 120,000 files, which means each package contains an average of 140 files. Some of these files, thankfully, are package documentation. If 5% of each package's files are documentation, that's 5,000 documentation files. Together these files form the system documentation, which is one of the many virtues of a free software operating system.

Table of Contents

The Document Collection Problem

ScrollKeeper

OMF: Free Software's Dublin Core Lite

Hacking ScrollKeeper

The Future of ScrollKeeper

Not only is there a lot of system documentation, but it exists in a wide variety of formats, conventions, and standards. For example, on a Linux system you can find documentation in command-line switches (-h, --help); man pages (often produced by groff or similar "first generation" Unix documentation systems); plain text README files, which may or may not follow a folkish layout or structural convention; info and texinfo files, an explicitly tree-shaped, node-based documentation system intended as a "next generation" man system; TeX and LaTeX files, from which DVI files are generated; Adobe PDF and Postscript files with or without the source documents from which they were derived; and lest we overlook the angle-bracket world: DocBook (perhaps versions 2, 3, and 4, in SGML and XML instances), HTML (of various vintage and contemporary DTDs), XHTML, and ad hoc, project-specific markup languages; and then there are many less well-known, less well-adopted documentation systems, with various capabilities and conventions and formats.

Each of these formats has one or several viewing contexts -- applications that are ideal or merely passable for viewing them -- and, perhaps, a compiler-like application that's used to create them.

In short, there is, in principle, a non-trivial document collection problem inside every server or workstation on your network.

The great temptation is to throw it all away in favor of Google.com. And that might actually work in many cases; sometimes you'll want to look at the documentation for a new version of some package in order to decide whether you want to upgrade. It doesn't make much sense to install the package, read the new documentation, only to learn you didn't want to install the package after all, and then remove the package. It's simpler to find the new documentation on the Web first.

But some package maintainers do not use the Web exclusively. And there are virtues, depending on the context of usage, across the variety of document flavors: in some settings, a man page is precisely what you want. Throwing it all away in favor of the Web isn't a real solution.

System administrators and users need a framework for document collection that's evolved within the ecological niche of a free software Unix-like operating system. And that's exactly what the ScrollKeeper project provides. ScrollKeeper is "a cataloging system for documentation on open systems," which "manages documentation metadata...and provides a simple API to allow help browsers to find, sort, and search the document catalog." ScrollKeeper uses the Open Source Metadata Framework (hereafter, OMF) -- a subset adaptation of Dublin Core -- to describe document metadata.

Over the course of its evolution, ScrollKeeper has been guided by the document collection needs of the GNOME project in particular, with Dan Mueth, lead of the GNOME Documentation Project, and Sun's Laszlo Kovacs contributing design ideas and code. This should come as no surprise since one aim of GNOME is to provide a consistent, unified interface for Unix-like systems, and that means not only providing consistent help and documentation tools for GNOME applications, but for the underlying system documentation as well.

ScrollKeeper

The current version of ScrollKeeper (0.2) provides basic support for two different kinds of user: package maintainers who provide system documentation are encouraged to create an OMF file to describe their documentation resources; system integrators and document application developers are encouraged to use ScrollKeeper's metadata API to create a variety of "help browsers" and other collection tools, including integrating help and document functions into existing systems, like the Nautilus file browser or GNOME control panel. ScrollKeeper thus provides a kind of "middleware" between document producers and consumers.

In practice, ScrollKeeper is a tool chain which can be used to create, store, and manages trees of document metadata, especially metadata represented as OMF instances. It serves as a concrete means to promote the use of OMF as a metadata representation. These two goals are mutually reinforcing. Without some standard metadata representation, it is extremely difficult to create a general metadata management API. Imagine, for example, writing metadata extractors for each of the document formats above, some of which don't have any, to say nothing of a standard, way to represent metadata.

A document collection tool that's going to survive in this niche really needs a generalized metadata representation, which is what the OMF provides. But the flip side is equally true. Without some promise that metadata description efforts, however minimal, will bear fruit (by being well-integrated at the user level), it's difficult for independent (often non-commercial) package maintainers to see the point of exerting even minimal effort to describe documentation resources at all.

Since document collections can be conceptualized as trees, and since XML/SGML is very good at representing data as trees, it's unsurprising to learn that ScrollKeeper uses XML extensively. There are three central parts of ScrollKeeper currently -- a contents list, a table of contents, and an extended contents list -- which it creates at document install or uninstall time and stores as XML.

The contents list is a system-wide tree of every document known to ScrollKeeper, often sorted on the OMF subject element, which is ideally constrained by means of a controlled vocabulary of subjects (i.e., an authoritative classification of subject values, in canonical form, which is used to normalize subject data).

At this point the conceptual division of labor is clear. People who write help browsers and other user applications aren't necessarily interested in creating controlled vocabularies. Further, different communities employing OMF may well need to use different domain-specific controlled vocabularies. For example, the controlled subject vocabulary suitable for GNOME application documents wouldn't necessarily be well-suited to describe other kinds of documentation resources. The various users of a metadata representation scheme like OMF may need several controlled vocabularies, without which metadata will, over time, become fragmented, unreliable, and less useful.

The contents list is created as ScrollKeeper examines OMF instances, which are stored in a directory, $OMFDIR, say, /usr/share/omf. Thus, in order for package maintainers to register their resource metadata with ScrollKeeper, they merely have to ensure that an OMF instance is copied to $OMFDIR. There are plans for future versions of ScrollKeeper to create OMF instances on the fly by extracting metadata from document resources that store metadata in predictable, sane ways. DocBook is a good example of a format from which, in principle, metadata may be automatically extracted. In order to avoid name collisions, ScrollKeeper specifies a template for the name of a file-based OMF instances -- [document_title]-[locale].omf.

The table of contents is a per-document tree representing the main structural contents of a document (i.e., sections and subsections). ScrollKeeper creates the table of contents automatically for DocBook resources by extracting section and subsection elements.

The extended contents list is another system-wide tree created by merging the contents list and the table of contents for each document in the contents list. It's simple to imagine a fairly useful system-wide help browser which just gives users a way to navigate a graphical representation of the extended contents list tree.

If you're using ScrollKeeper in an application, locating the various XML representations of the contents list, extended contents list, and tables of contents is as simple as calling scrollkeeper-get-contents-list [language], which returns the file system path of the contents list XML document; scrollkeeper-get-extended-contents-list [language], which returns the file system path of the contents list XML document; scrollkeeper-get-toc-from-docpath [docpath], which returns the file system path of the table of contents of a document; and scrollkeeper-get-toc-from-id [doc_id], which also returns the table of contents path, given a document id.

Pages: 1, 2, 3

Next Pagearrow