ScrollKeeper: Open Source Document Management
November 28, 2001
The Document Collection Problem
Operating systems are very complex these days, composed of many parts and pieces. Linux, like other Unix-like, free software operating systems, is really just a collection of autonomous and dependent software packages. On my workstation there are about 850 packages at last count. A moderately busy, production Internet server might have as many as 650 packages. And a development server, supporting diverse activities of a complex development team, might have as many as 1,000 packages.
Each of these packages, in turn, contains files numbering from a few to several hundred. My workstation's 850 packages contain about 120,000 files, which means each package contains an average of 140 files. Some of these files, thankfully, are package documentation. If 5% of each package's files are documentation, that's about 6,000 documentation files. Together these files form the system documentation, which is one of the many virtues of a free software operating system.
Not only is there a lot of system documentation, but it exists in a wide variety of formats, conventions, and standards. For example, on a Linux system you can find documentation in command-line switches; man pages (often produced by groff or similar "first generation" Unix documentation systems); plain text README files, which may or may not follow a folkish layout or structural convention; info and texinfo files, an explicitly tree-shaped, node-based documentation system intended as a "next generation" man system; TeX and LaTeX files, from which DVI files are generated; Adobe PDF and PostScript files, with or without the source documents from which they were generated; and, lest we overlook the angle-bracket world, DocBook (perhaps versions 2, 3, and 4, in SGML and XML instances), HTML (of various vintage and contemporary DTDs), XHTML, and ad hoc, project-specific markup languages. And then there are many less well-known, less well-adopted documentation systems, with various capabilities and conventions.
Each of these formats has one or several viewing contexts -- applications that are ideal or merely passable for viewing them -- and, perhaps, a compiler-like application that's used to create them.
In short, there is, in principle, a non-trivial document collection problem inside every server or workstation on your network.
The great temptation is to throw it all away in favor of Google.com. And that might actually work in many cases; sometimes you'll want to look at the documentation for a new version of some package in order to decide whether you want to upgrade. It doesn't make much sense to install the package, read the new documentation, only to learn you didn't want to install the package after all, and then remove the package. It's simpler to find the new documentation on the Web first.
But some package maintainers do not use the Web exclusively. And there are virtues, depending on the context of usage, across the variety of document flavors: in some settings, a man page is precisely what you want. Throwing it all away in favor of the Web isn't a real solution.
System administrators and users need a framework for document collection that's evolved within the ecological niche of a free software Unix-like operating system. And that's exactly what the ScrollKeeper project provides. ScrollKeeper is "a cataloging system for documentation on open systems," which "manages documentation metadata...and provides a simple API to allow help browsers to find, sort, and search the document catalog." ScrollKeeper uses the Open Source Metadata Framework (hereafter, OMF) -- a subset adaptation of Dublin Core -- to describe document metadata.
Over the course of its evolution, ScrollKeeper has been guided by the document collection needs of the GNOME project in particular, with Dan Mueth, lead of the GNOME Documentation Project, and Sun's Laszlo Kovacs contributing design ideas and code. This should come as no surprise since one aim of GNOME is to provide a consistent, unified interface for Unix-like systems, and that means not only providing consistent help and documentation tools for GNOME applications, but for the underlying system documentation as well.
The current version of ScrollKeeper (0.2) provides basic support for two different kinds of user: package maintainers who provide system documentation are encouraged to create an OMF file to describe their documentation resources; system integrators and document application developers are encouraged to use ScrollKeeper's metadata API to create a variety of "help browsers" and other collection tools, including integrating help and document functions into existing systems, like the Nautilus file browser or GNOME control panel. ScrollKeeper thus provides a kind of "middleware" between document producers and consumers.
In practice, ScrollKeeper is a tool chain which can be used to create, store, and manage trees of document metadata, especially metadata represented as OMF instances. It also serves as a concrete means to promote the use of OMF as a metadata representation. These two goals are mutually reinforcing. Without some standard metadata representation, it is extremely difficult to create a general metadata management API. Imagine, for example, writing metadata extractors for each of the document formats above, some of which have no way to represent metadata at all, to say nothing of a standard way.
A document collection tool that's going to survive in this niche really needs a generalized metadata representation, which is what the OMF provides. But the flip side is equally true. Without some promise that metadata description efforts, however minimal, will bear fruit (by being well-integrated at the user level), it's difficult for independent (often non-commercial) package maintainers to see the point of exerting even minimal effort to describe documentation resources at all.
Since document collections can be conceptualized as trees, and since XML/SGML is very good at representing data as trees, it's unsurprising to learn that ScrollKeeper uses XML extensively. There are three central parts of ScrollKeeper currently -- a contents list, a table of contents, and an extended contents list -- which it creates at document install or uninstall time and stores as XML.
The contents list is a system-wide tree of every document known to ScrollKeeper, often sorted on the OMF subject element, which is ideally constrained by means of a controlled vocabulary of subjects (i.e., an authoritative classification of subject values, in canonical form, which is used to normalize subject data).
At this point the conceptual division of labor is clear. People who write help browsers and other user applications aren't necessarily interested in creating controlled vocabularies. Further, different communities employing OMF may well need to use different domain-specific controlled vocabularies. For example, the controlled subject vocabulary suitable for GNOME application documents wouldn't necessarily be well-suited to describe other kinds of documentation resources. The various users of a metadata representation scheme like OMF may need several controlled vocabularies, without which metadata will, over time, become fragmented, unreliable, and less useful.
The contents list is created as ScrollKeeper examines OMF instances, which are stored in a directory, $OMFDIR, say, /usr/share/omf. Thus, in order for package maintainers to register their resource metadata with ScrollKeeper, they merely have to ensure that an OMF instance is copied to $OMFDIR. There are plans for future versions of ScrollKeeper to create OMF instances on the fly by extracting metadata from document resources that store metadata in predictable, sane ways. DocBook is a good example of a format from which, in principle, metadata may be automatically extracted. In order to avoid name collisions, ScrollKeeper specifies a template for the names of file-based OMF instances -- [document_title]-[locale].omf.
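The naming template and the copy into $OMFDIR can be sketched in a few lines of Python. This is only an illustration, not ScrollKeeper's own code: the function names omf_filename and register_omf are hypothetical, and replacing whitespace in the title with underscores is an assumption, since the template does not say how titles are normalized.

```python
import os
import shutil
import tempfile


def omf_filename(document_title, locale):
    """Build a name following the [document_title]-[locale].omf template.

    Replacing whitespace with underscores is an assumption of this sketch;
    the template itself does not say how titles are normalized.
    """
    safe_title = "_".join(document_title.split())
    return f"{safe_title}-{locale}.omf"


def register_omf(omf_path, document_title, locale, omf_dir):
    """Copy an OMF instance into omf_dir under the templated name."""
    os.makedirs(omf_dir, exist_ok=True)
    target = os.path.join(omf_dir, omf_filename(document_title, locale))
    shutil.copyfile(omf_path, target)
    return target


if __name__ == "__main__":
    # Demonstrate against a scratch directory rather than a real $OMFDIR.
    with tempfile.TemporaryDirectory() as scratch:
        source = os.path.join(scratch, "manual.omf")
        with open(source, "w", encoding="utf-8") as handle:
            handle.write("<omf/>\n")
        print(register_omf(source, "Operations Manual", "C",
                           os.path.join(scratch, "omf")))
```

Because the locale is part of the file name, translations of the same document can sit side by side in one directory without overwriting each other.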
The table of contents is a per-document tree representing the main structural contents of a document (i.e., sections and subsections). ScrollKeeper creates the table of contents automatically for DocBook resources by extracting section and subsection elements.
The extended contents list is another system-wide tree created by merging the contents list and the table of contents for each document in the contents list. It's simple to imagine a fairly useful system-wide help browser which just gives users a way to navigate a graphical representation of the extended contents list tree.
If you're using ScrollKeeper in an application, locating the various XML representations of the contents list, extended contents list, and tables of contents is as simple as calling scrollkeeper-get-contents-list [language], which returns the file system path of the contents list XML document; scrollkeeper-get-extended-contents-list [language], which returns the file system path of the extended contents list XML document; scrollkeeper-get-toc-from-docpath [docpath], which returns the file system path of the table of contents of a document; and scrollkeeper-get-toc-from-id [doc_id], which also returns the table of contents path, given a document id.
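From a script, calling one of these commands might look roughly like the following Python sketch. It assumes the command prints the path on standard output, which the description above implies but does not state; the function names and the default language value "C" are likewise illustrative choices, not part of ScrollKeeper.

```python
import subprocess
import xml.etree.ElementTree as ET


def contents_list_path(language="C", run=subprocess.run):
    """Ask ScrollKeeper for the path of the contents list XML document.

    `run` is injectable so the function can be exercised without
    ScrollKeeper installed; by default it is subprocess.run.
    """
    result = run(
        ["scrollkeeper-get-contents-list", language],
        capture_output=True,
        text=True,
        check=True,
    )
    # The command is described as returning a file system path; reading
    # it from standard output is an assumption of this sketch.
    return result.stdout.strip()


def load_contents_list(language="C", run=subprocess.run):
    """Parse the contents list into an ElementTree for further processing."""
    return ET.parse(contents_list_path(language, run=run))
```

The same pattern should carry over to the other three commands by swapping the command name and argument.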
OMF: Free Software's Dublin Core Lite
OMF, a domain-specific subset of Dublin Core, is the result of work done by members of University of North Carolina's Open Source Research Team, most of whom are affiliated with UNC's School of Information and Library Sciences and with UNC's ibiblio (formerly UNC MetaLab).
The team that produced OMF includes experts in information science, metadata, electronic archives, and digital libraries. The project evolved independently of ScrollKeeper during the most active phase of development; it was meant to serve as an upgrade of the metadata tools used to create MetaLab's Linux Software Maps. Thus it was not originally intended to represent metadata about documentation resources per se but, rather, open source software resources generally. It is a testament to the foresight and ability of both the Dublin Core and OMF teams that ScrollKeeper and the Linux Documentation Project are both able to represent document metadata with OMF.
The OMF is made up of the following 16 elements.
Author or Creator
The person or organization primarily responsible for creating the intellectual content of the resource. Preferred format: firstname.lastname@example.org (Full Name)
Maintainer
The person or organization responsible for publishing the resource in its current form. If left blank, this value defaults to CREATOR.
Contributor
A person or organization not specified in a CREATOR or MAINTAINER element who has made significant intellectual contributions to the resource but whose contribution is secondary to any person or organization specified in a CREATOR or MAINTAINER element.
Title
The name given to the resource by the CREATOR or MAINTAINER.
Date
The date on which the resource was made available in its current form. (Recommended best practice is an 8 digit number in the form YYYY-MM-DD as defined in http://www.w3.org/TR/NOTE-datetime, a profile of ISO 8601.)
Version
VERSION is a multifaceted element, consisting of three attributes. VERSION.identifier consists of a string or number that distinguishes the current revision of the resource from other revisions. VERSION.date records the date the resource was made available in the form specified by VERSION.identifier. VERSION.description summarizes revisions that distinguish VERSION.identifier from other versions of the resource. Repeated instances of VERSION constitute the revision history of a resource.
Subject and Keywords
The topic of the resource. Typically, this element employs keywords that summarize the subject or content of the resource.
Description
A description of the content of the resource (e.g., an abstract, contents note).
Type
The category of the resource. Contents of this element should conform to a domain-specific controlled vocabulary.
Format
FORMAT is a multifaceted element, which describes the implementation of the resource. FORMAT.DTD describes the document type definition used in the resource (if any). FORMAT.MIME should be expressed as a MIME type, as defined in RFC 2046.
Identifier
A specification of a unique ID by which the resource may be identified and from which the resource may be retrieved. Entries for this field should contain a valid URL which returns the resource in question.
Source
A specification of any previous or alternative publication of the resource in its current form (e.g., a larger work from which the resource is extracted, such as a chapter taken from a book). SOURCE may include a URL, ISBN, or similar device.
Language
Language(s) of the content of the resource. Where practical, the content of this field should coincide with RFC 1766.
Relation
A URL that points to the IDENTIFIER element of another resource. Each instance of RELATION links the resource to other resources of similar domain or style.
Coverage
A multifaceted description of the resource's intellectual scope that consists of five attributes. COVERAGE.geographic identifies regional specificity of the resource. Where practical, COVERAGE.geographic should be expressed as an ISO 3166-compliant string of two characters. COVERAGE.distribution identifies a Linux distribution explicitly specified in the resource. COVERAGE.kernel identifies the kernel version treated in the resource. COVERAGE.architecture identifies hardware described in the resource. COVERAGE.os identifies an operating system explicitly specified in the resource.
Rights
A multifaceted element indicating the policy under which the resource is distributed. Four attributes form the RIGHTS element. RIGHTS.type identifies the name of the resource's distribution license. Where possible, the value of RIGHTS.type should be selected from a controlled vocabulary. RIGHTS.license identifies the URL for the license referenced in RIGHTS.type, where applicable. RIGHTS.license.version identifies the version number of the resource's license. RIGHTS.holder identifies the person or organization who holds the rights for the resource described in RIGHTS.license.
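A couple of the rules in these descriptions are mechanical enough to check in code. The Python sketch below is illustrative only: representing a record as a plain dict, the helper names, and the upper-case spellings of elements that the descriptions do not name explicitly (TITLE, DATE, SUBJECT, and so on) are all assumptions, not part of OMF or ScrollKeeper.

```python
import re

# The 16 element names, written in the upper-case style the descriptions use.
OMF_ELEMENTS = (
    "CREATOR", "MAINTAINER", "CONTRIBUTOR", "TITLE", "DATE", "VERSION",
    "SUBJECT", "DESCRIPTION", "TYPE", "FORMAT", "IDENTIFIER", "SOURCE",
    "LANGUAGE", "RELATION", "COVERAGE", "RIGHTS",
)

# The recommended YYYY-MM-DD date form.
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")


def check_record(record):
    """Return a list of problems found in a dict of OMF element values."""
    problems = []
    for name in record:
        if name not in OMF_ELEMENTS:
            problems.append(f"unknown element: {name}")
    date = record.get("DATE")
    if date is not None and not DATE_PATTERN.fullmatch(date):
        problems.append(f"DATE not in YYYY-MM-DD form: {date}")
    return problems


def with_defaults(record):
    """Apply the stated default: a blank MAINTAINER falls back to CREATOR."""
    filled = dict(record)
    if not filled.get("MAINTAINER") and "CREATOR" in filled:
        filled["MAINTAINER"] = filled["CREATOR"]
    return filled
```

A check like this only catches shape errors; whether a SUBJECT or TYPE value is acceptable still depends on the controlled vocabulary a given community adopts.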
System administrators often need to write quick and dirty scripts and tiny applications to automate some onerous task or to support some local policy or convention. Most package management frameworks, like RPM and DEB, provide an API for manipulating the installed base of packages. ScrollKeeper provides a similar facility for working with system documentation.
ScrollKeeper's use of lowest common denominator technology makes it ideal for sysadmin hacking, as well as for integrating with a variety of Internet and intranet web tools. If your intranet uses XSLT to render documents for web browsing, presenting a catalog of system documentation is as simple as calling scrollkeeper-get-extended-contents-list and applying a stylesheet to the result.
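For sites without an XSLT pipeline, much the same catalog page can be produced with an ordinary XML parser. The Python sketch below uses plain recursion in place of a stylesheet. The element names (sect, title, doc, doctitle) and the sample document are assumptions made for illustration; check them against the files ScrollKeeper actually generates on your system.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical sample; the element names are assumptions of this sketch.
SAMPLE = """<ScrollKeeperContentsList>
  <sect>
    <title>System</title>
    <doc><doctitle>Operations Manual</doctitle></doc>
    <sect>
      <title>Other</title>
      <doc><doctitle>Safety Manual</doctitle></doc>
    </sect>
  </sect>
</ScrollKeeperContentsList>"""


def render(node):
    """Render the sect/doc children of node as a nested HTML list."""
    items = []
    for child in node:
        if child.tag == "sect":
            heading = escape(child.findtext("title", "").strip())
            items.append(f"<li>{heading}{render(child)}</li>")
        elif child.tag == "doc":
            label = escape(child.findtext("doctitle", "").strip())
            items.append(f"<li>{label}</li>")
    return f"<ul>{''.join(items)}</ul>" if items else ""


if __name__ == "__main__":
    print(render(ET.fromstring(SAMPLE)))
```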
Likewise, if your organization uses DocBook for documentation, integrating locally produced documents -- for example, a departmental policies and procedures or operations manual -- into the help browser of GNOME desktops becomes a fairly trivial matter. This technique is especially helpful in document-rich organizations -- university academic departments, research laboratories, policy think-tanks -- where easy and ready access to standard documents is crucial.
As a routine part of installing and updating desktop and applications software on user machines, a sysadmin merely creates, say, a Debian package, which contains, for example, the latest operations and safety manuals. She also creates an OMF metadata instance for each manual included in the package. As part of the package installation routine, the package manager registers each manual with ScrollKeeper, which it can do by calling
or, if the package is upgrading an existing package, by calling
It's a good idea to use a directory under /usr/local to store local OMF instances, which can be done by setting the OMF_DIR environment variable.
An OMF instance for a locally-produced documentation resource might look something like the following.
<omf>
  <resource>
    <title>US DOD BLU-82 "Daisy Cutter" Operations Manual</title>
    <creator>
      <person>
        <firstName>Dr.</firstName>
        <lastName>Strangelove</lastName>
        <email>email@example.com</email>
      </person>
    </creator>
    <subject>
      <category>System|Other</category>
    </subject>
    <description>
      This document describes the BLU-82 mass area demolition and
      anti-personnel munition, aka, the "Daisy Cutter". BLU-82 combines a
      watery mixture of ammonium nitrate and aluminum with air, then ignites
      the mist for a huge explosion that incinerates everything within 600
      yards. The shock wave can be felt miles away. First created during the
      Vietnam War to quickly clear jungle landing zones, the daisy cutter was
      used against Iraqi troops during the Gulf War. Recent reports from the
      ground in Afghanistan indicate the huge bombs have been used against
      front-line Taliban positions. The BLU-82 costs about $27,000 each. They
      are dropped from a C-130 cargo plane flying at least 6,000 feet off the
      ground, to avoid the bomb's massive shock wave. Each is more than 17
      feet long and 5 feet in diameter - about the size of a VW Beetle but
      far heavier.
    </description>
    <type>manual</type>
    <format mime="text/sgml"/>
    <identifier url="/usr/local/share/really/big/bombs/daisycutter-manual.sgml"/>
    <language code="C"/>
  </resource>
</omf>
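Because an OMF instance is ordinary XML, a local script can read one with any XML parser. As a rough sketch, the Python below pulls a few fields out of an instance shaped like the example above. The summarize function and the shortened sample (including its title and path) are hypothetical, and the element and attribute names are taken from that one example rather than from a formal OMF definition.

```python
import xml.etree.ElementTree as ET

# A shortened, hypothetical instance in the same shape as the example.
SAMPLE_OMF = """\
<omf>
  <resource>
    <title>Operations Manual</title>
    <creator>
      <person>
        <firstName>Dr.</firstName>
        <lastName>Strangelove</lastName>
      </person>
    </creator>
    <type>manual</type>
    <format mime="text/sgml"/>
    <identifier url="/usr/local/share/docs/manual.sgml"/>
    <language code="C"/>
  </resource>
</omf>
"""


def attribute(resource, tag, name):
    """Return an attribute of a child element, or None if either is absent."""
    element = resource.find(tag)
    return element.get(name) if element is not None else None


def summarize(omf_text):
    """Return one summary dict per <resource> in an OMF instance."""
    summaries = []
    for resource in ET.fromstring(omf_text).iter("resource"):
        person = resource.find("creator/person")
        names = []
        if person is not None:
            names = [person.findtext("firstName", "").strip(),
                     person.findtext("lastName", "").strip()]
        summaries.append({
            "title": resource.findtext("title", "").strip(),
            "creator": " ".join(n for n in names if n),
            "mime": attribute(resource, "format", "mime"),
            "url": attribute(resource, "identifier", "url"),
            "language": attribute(resource, "language", "code"),
        })
    return summaries


if __name__ == "__main__":
    for summary in summarize(SAMPLE_OMF):
        print(summary)
```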
The Future of ScrollKeeper
Like most open source projects, ScrollKeeper has an ambitious plan for future expansion and, from the looks of it, could use more help. In the short term, internationalization and localization improvements are planned, as well as improved searching and indexing functionality. In the longer term, ScrollKeeper may expand to deal with non-local documents and resources, including using a remote OMF server for synchronizing LAN, WAN, and Internet-wide document collections.
The way ScrollKeeper uses XML is neither novel nor cutting-edge. In fact, it's rather ordinary, even pedestrian. And that's exactly the point. ScrollKeeper's utility lies not in the way it uses XML, but that, by using XML, it allows developers, admins, and others to leverage existing XML tools and knowledge to manage document collections; that it makes possible the creation and growth of useful metadata, which aids both document producers and consumers; and that it does so in a relatively simple and easy to understand way. In the final estimation, surely that is what makes XML a good and useful thing.