
JUMBO: An Object-Based XML Browser

October 2, 1997

Peter Murray-Rust


Abstract

JUMBO (Java Universal Markup Language) is an object-oriented XML browser/editor and transformation tool, written in Java. It has been developed as a tool for exploring the emerging XML-LANG and XML-LINK specifications,[1] and implements most of the current proposals. Its emphasis is on the management of structured documents, specifically their interpretation as trees. It gives ELEMENTs behavior through Java classes for rendering or transformation. It is particularly aimed at nontextual applications where ELEMENTs (such as those in technical disciplines) require complex processing. JUMBO also implements much of the current XML-LINK spec, including TEI extended pointers and simple aspects of EXTENDED XML-LINKs.

Introduction

The management and "publication" of information in scientific disciplines such as molecular science is difficult; current approaches involve a large number of incompatible legacy files. Because HTML was not developed for managing nontextual information, its success has further highlighted this problem--one that can be solved only by offering specific, machine-dependent plug-ins for each legacy type. What is required is a single information architecture that is independent of platform and easily extensible. I have developed such a system: Chemical Markup Language (CML), with a generic core Technical Markup Language (TecML), originally using SGML. Distributing and processing the SGML version requires the following package of components:

sgmls

James Clark's validating parser [1]. This is excellent software, but it is not trivial to distribute SGML systems to a community with no SGML experience. (Most applications require an SGML declaration, a CATALOG, a DTD, possibly some entity sets.) Moreover, a different *.exe is needed for each platform.

CoST

Joe English's rewrite of the original CoST for transforming and searching the output of sgmls. [2] CoST is an excellent tcl-based tool, but again has to be distributed as an *.exe.

costwish

I wrote a series of tcl/tk-based scripts to render the output of CoST and to package the SGML environment for non-SGML users. This was successful for UNIX systems, but foundered on the difficulties of a simple port of tk to a Windows-based environment.

costwish was successful in that a virtual collaborator was able to use it for a novel application--the tree-structured display of Chomskian analysis of sentence structure. However, it was clear that the absence of platform-independent graphics and the difficulty of packaging SGML applications made this too complex for general use.

This problem has become radically simpler over the last 12 months through the development of two new technologies, which interoperate extremely easily. Both are designed for use over the Inter- and Intranets, and to be platform independent.

Java

A platform-independent object-oriented language supporting graphics at GUI level. Of particular importance is its availability under modern browsers such as Netscape and Microsoft Internet Explorer, which now come with a Java Virtual Machine. This means that the effort of installing Java applications can be reduced to effectively nil.[2] Although Java is often promoted as a language for distributing graphical APPLETs, it is also a clean, powerful, easy, fully object-oriented language. Unlike C++, it comes with a very large library of classes that provide functions for textual management, WWW operations, and much more.

XML

XML is described at length in this volume. One feature that is particularly valuable for CML is that it is much simpler than SGML--XML applications should hold few terrors for experienced HTML developers. The SGML declaration is not required, and many applications can dispense with the DTD. This eliminates many of the problems that newcomers face with SGML applications and makes packaging documents much easier. Very importantly, since XML is now catching the imagination of the world, the incentive for authors and implementers to learn it will be much greater. We can expect shrink-wrapped tools on which particular applications like CML can be built.

Rationale for JUMBO

The traditional SGML market is very heavily based on processing textual documents, although there are many examples (such as technical manuals) where non-textual objects (diagrams, parts, etc.) occur. Most SGML applications involve customizing tools for a particular purpose and are often site-specific (i.e., for a particular customer). In general, an SGML application usually has a defined "purpose." Because CML covers a much wider range of object types (see Table 1), it needs software that is generic, abstract, graphically oriented, and freely available. The simplest solution was to port the ideas from costwish to Java.

The port coincided with the development of XML-LANG, and offered an opportunity to re-design SGML browsing software from scratch. (TecML and CML are now fully conformant DTDs of XML.) The fundamental design decision was that XML elements should be closely linked to Java classes. Since JUMBO was designed to deal with any DTD, there must be a mechanism to locate classes appropriate for it. For any DTD in the DOCTYPE statement, a class should be available that loads the appropriate classes for that DTD. JUMBO also anticipates that a document may refer to more than one DTD and has a mechanism for hierarchical loading of classes for DTDs. Thus, CML requires the loading of TecML classes, and TecML (which uses HTML2.0) loads classes related to that.
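The hierarchical loading described above can be sketched as a small dependency walk. This is only an illustration under stated assumptions, not JUMBO's actual mechanism: the class `DtdLoadOrder` and its methods are hypothetical names, and the DTD names are plain strings standing in for whatever identifier the DOCTYPE statement supplies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical registry recording which DTDs each DTD builds on. */
final class DtdLoadOrder {
    private final Map<String, List<String>> requires = new LinkedHashMap<>();

    /** Declares that the classes for 'dtd' need the classes of 'parents' loaded first. */
    void declare(String dtd, String... parents) {
        requires.put(dtd, Arrays.asList(parents));
    }

    /** Returns DTD names in loading order, with dependencies before dependents. */
    List<String> loadOrder(String dtd) {
        Set<String> ordered = new LinkedHashSet<>();
        visit(dtd, ordered, new LinkedHashSet<>());
        return new ArrayList<>(ordered);
    }

    private void visit(String dtd, Set<String> ordered, Set<String> inProgress) {
        if (ordered.contains(dtd)) {
            return; // already scheduled for loading
        }
        if (!inProgress.add(dtd)) {
            throw new IllegalStateException("Circular DTD dependency at " + dtd);
        }
        for (String parent : requires.getOrDefault(dtd, Collections.<String>emptyList())) {
            visit(parent, ordered, inProgress);
        }
        inProgress.remove(dtd);
        ordered.add(dtd);
    }

    /** The chain named in the text: CML needs TecML, which in turn needs HTML. */
    static DtdLoadOrder example() {
        DtdLoadOrder registry = new DtdLoadOrder();
        registry.declare("CML", "TecML");
        registry.declare("TecML", "HTML");
        return registry;
    }
}
```

Calling `DtdLoadOrder.example().loadOrder("CML")` yields the HTML classes first, then TecML, then CML, so that each layer finds the classes it depends on already in place.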

Technical Markup Language

Although the main motivation for CML was to manage molecular data, much of the material is generic to a wide range of scientific disciplines. After browsing publications from a wide range of journals I compiled Table 1. To support this I developed a generic scientific language, TecML, and wrote a number of tcl and Java-based classes for it. As XML continues to develop, however, standard tools may support some of these components, and I hope that much of TecML may eventually become redundant.


Table 1

| Component | TecML ELEMENT | XML/WWW Solutions |
|---|---|---|
| Text | (X)HTML | HTML, TEI, DocBook, 12083 |
| Hyperlinks | Hardcoded, XML-LINK | XML-LINK various |
| Structured graphics | FIGURE | CGM, VRML |
| Images (i.e., pixel maps) | FIGURE | NOTATION (e.g. GIF) |
| Typed fields | XVAR | XML-TYPE? |
| Typed arrays | ARRAY | |
| Generic container | XLIST | |
| Graphs | XLIST+ARRAY children | CGM? |
| Tables | XLIST+children | HTML? CALS? |
| Bibliography | BIB | |
| Terminology/Glossary | TERMENTRY | MARTIF/ISO1260? |
| Person | PERSON | |
| Parsable Mathematics | | MathML |
| Molecules | CML, MOL, etc | |
| Relationships | Hardcoded, XML-LINK | XML-LINK EXTENDED |

Note

The X- prefixes were used to avoid clashes with elements within HTML and other DTDs that might be used. When XML solves the syntax and semantics of namespaces these will change to ElementTypes such as TecML::VAR, TecML::LIST, and so on. When and where the qualifying namespace is required is yet to be decided by the XML-WG.


At an early stage I decided to limit the number of ElementTypes in TecML to about 15, roughly mirroring those in Table 1 along with a few extra subelements. A few categories such as mathematics were deferred--I am delighted that MathML[3] has been developed in XML and can support mathematics semantics (i.e., equations and expressions can be parsed and manipulated by symbolic packages such as Mathematica). The small number of ElementTypes avoids namespace clashes and allows authors to introduce new concepts without having to rewrite the DTD. Flexibility is achieved by widespread use of attributes. For example, to describe a Melting Point we could write:

 <MELTING_POINT>23.4</MELTING_POINT> 

If every quantity had its own ElementType like this, the number of ElementTypes could soon reach millions. TecML instead describes all scalar quantities like the one above with a single ElementType and qualifying attributes, such as the following:

 
        <XVAR CONVENTION="http://www.learned.soc/physics" 
        DICTNAME="mpt" UNITS="Celsius" TYPE="Float">23.4</XVAR> 
      

TYPE is hardcoded in TecML and can take values of Float, Integer, String, Date and Pointer. It may yet be made obsolete by the XML-TYPE proposal from Tim Bray and others. UNITS is hardlinked to a glossary of scientific units distributed with TecML. CONVENTION and DICTNAME locate the glossary and entry within it, ideally provided by an institution of repute and stability such as a learned society.
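A processing class could use the TYPE attribute to turn the character content of an XVAR into a typed Java value. The following is a minimal sketch and not TecML's own code: the class name `XvarValue` is hypothetical, the ISO-8601 date format is an assumption (the text does not say how dates are written), and a Pointer is simply kept as an unresolved string.

```java
import java.time.LocalDate;

/** Hypothetical helper that interprets XVAR content according to its TYPE attribute. */
final class XvarValue {
    /** Converts the element content to a Java object for the five TYPE values named in the text. */
    static Object parse(String type, String content) {
        String text = content.trim();
        switch (type) {
            case "Float":
                return Double.valueOf(text);
            case "Integer":
                return Integer.valueOf(text);
            case "Date":
                // Assumption: dates are written as yyyy-MM-dd.
                return LocalDate.parse(text);
            case "String":
            case "Pointer":
                // A pointer is returned as text; resolving it is outside this sketch.
                return text;
            default:
                throw new IllegalArgumentException("Unknown TYPE: " + type);
        }
    }
}
```

For the melting-point example, `XvarValue.parse("Float", "23.4")` returns a `Double`; the UNITS, CONVENTION, and DICTNAME attributes would still have to be looked up in their glossaries to give that number its meaning.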

The preceding example makes it clear that hyperlinking is a powerful means of resolving semantics. It is also the simplest way of avoiding namespace collisions in TecML documents. Thus, equations can be constructed in MathML, stored in separate files, and linked through XML-LINK's HREF rather than being included in the document directly or indirectly (by entities).

Basics of JUMBO

JUMBO has been developed as a prototype XML engine primarily aimed at:

  • Providing a prototyping tool for XML developers.
  • Exploring non-textual uses of XML.
  • Specifically, but not exclusively, supporting Molecular Science.
  • Resolving semantics through hyperlinking to documents or Java methods.

At the time of this writing, JUMBO has tracked most of the draft specifications of XML-LANG and XML-LINK.

JUMBO is built from components; the applications it can be configured for are not limited. At present it consists of the following parts:

An XML parser

The built-in parser is simplistic. JUMBO will also interoperate with Lark,[4] NXP or ESIS input. When the Xapi-J interface (from John Tigue) [3] is stable, it will be implemented so that JUMBO is layered on top of the parsing machinery; this will enable different parsers to be switched under user control.

A TableOfContents/Tree tool

JUMBO's main emphasis is on Structured Documents, and most instances are presented as TOCs. The TOC allows:
  • Control of presentation through PIs (automatic or user-activated)
  • Flexible display of ELEMENT tree (toggling visibility)
  • Editing of tree (move, delete, add Elements, including partial DTD-based validation)
  • Attribute display and editing
  • Element-based Help based on Java inheritance
  • Flexible URL-based navigation to next hyperlinked instance (implementing XML: EMBED/REPLACE/NEW and AUTO/USER)
  • Lookup of DOCTYPE and automatic downloading of ELEMENT-specific Java classes
  • ELEMENT-specific icons leading to display() when clicked
  • Resolution of semantics by links to VirtualHyperGlossary entries
  • TEI searches based on XML-LINK
  • Save contents as XML, HTML, GIF, or customized format (through Java)

Generic Java class Downloader

Applications as Java classes, including TechnicalMarkupLanguage and ChemicalMarkupLanguage

Inside JUMBO

JUMBO has over 300 classes, the most important of which are SGMLTree, SGMLNode, and SGMLAttribute.[5] Objects are created from the result of parsing either via a stream or from a parsed object in memory. At present JUMBO is limited to objects that will fit in the space available in the Java Virtual Machine. Node is normally subclassed for each element type--an example is MOLNode (see Figure 2). When the document is parsed, a DTD-specific class (e.g., CMLDTD.class) is required to decide what subclass type is required for each ElementType (GI). If none is found, the methods default to those of Node.
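The lookup from ElementType to node subclass, with a fallback to the generic node, might look roughly like the sketch below. All names here (`SketchNode`, `SketchMolNode`, `SketchNodeFactory`) are hypothetical stand-ins chosen for illustration; they are not JUMBO's SGMLNode, MOLNode, or CMLDTD classes, whose real interfaces are not shown in this article.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

/** Stand-in for the generic tree node that supplies default behavior. */
class SketchNode {
    String display() {
        return "generic node";
    }
}

/** Stand-in for an element-specific subclass, such as a node for molecules. */
class SketchMolNode extends SketchNode {
    @Override
    String display() {
        return "molecule viewer";
    }
}

/** Hypothetical per-DTD table mapping an ElementType name (GI) to a node subclass. */
final class SketchNodeFactory {
    private final Map<String, Supplier<SketchNode>> byElementType = new HashMap<>();

    /** Registers the subclass to use for one ElementType. */
    void register(String elementType, Supplier<SketchNode> maker) {
        byElementType.put(elementType, maker);
    }

    /** Returns a specialized node if one is registered, otherwise the generic node. */
    SketchNode create(String elementType) {
        Supplier<SketchNode> maker = byElementType.get(elementType);
        return maker != null ? maker.get() : new SketchNode();
    }
}
```

After `factory.register("MOL", SketchMolNode::new)`, a MOL element gets the specialized `display()`, while an unregistered ElementType silently inherits the default one.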

If the following example (which has no DTD) is processed by JUMBO, the display can be expanded to the one shown in Figure 1.

 
        <?XML VERSION="1.0"?>
        <FOO>
          <BAR TITLE="I am a bar" ID="bar1">
            <PLUGH>
              This is an ASCII string contained as a child of PLUGH
            </PLUGH>
            <BAR TITLE="younger sibling of PLUGH">
              A BAR can contain other BAR elements.
            </BAR>
          </BAR>
        </FOO>

JUMBO "guesses" a reasonable title from the TITLE attribute, the content or the ElementType. The small circular icons are the default; when clicked they display() a textual debug() of the Node.
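One way to picture the title guess is the sketch below. It uses the standard Java DOM parser (`javax.xml.parsers`) rather than JUMBO's own parser, and it assumes that the three sources are tried in the order listed (TITLE attribute, then text content, then ElementType); the class name `TitleGuess` is hypothetical. Note that current XML parsers reject the draft-era `<?XML VERSION="1.0"?>` declaration, so input given to this sketch should omit it or use the lowercase form.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper that picks a display title for an element. */
final class TitleGuess {
    /** Prefers the TITLE attribute, then the element's own text, then its type name. */
    static String guess(Element element) {
        String title = element.getAttribute("TITLE"); // empty string when absent
        if (!title.isEmpty()) {
            return title;
        }
        String text = directText(element);
        if (!text.isEmpty()) {
            return text;
        }
        return element.getTagName();
    }

    /** Collects only the text that is an immediate child of the element. */
    private static String directText(Element element) {
        StringBuilder sb = new StringBuilder();
        for (Node n = element.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() == Node.TEXT_NODE) {
                sb.append(n.getNodeValue());
            }
        }
        return sb.toString().trim();
    }

    /** Parses a DTD-less document from a string and returns its root element. */
    static Element parseRoot(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

Applied to the example document, the outer BAR would be labeled "I am a bar", PLUGH would be labeled with its text content, and FOO, which has neither a TITLE nor text of its own, would fall back to its ElementType name.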

Each subclassed Node may have a drawIcon() and a display() method. When the class-specific icon is clicked, the appropriate display() is automatically used. Figure 2 shows a datafile for the three-dimensional structure of a protein molecule, which contains a mixture of textual and nontextual records. Despite the input being published as a "flat file," the JUMBO conversion program can create a highly structured TOC (see the left of the diagram). Different ElementTypes can have different icons. Thus, clicking on "D-T-G" (a protein sequence Element) displays the top window, while clicking on an inverted V-shaped ball-and-stick icon displays one of the bottom two windows. The textual records (as in the "Annotation" Node) can also be displayed. Note that Nodes labeled "HELIX" etc. use the default SGMLNode display() method.

TecML supports tables, which can contain objects or pointers to objects. In Figure 3 the table contains links to MOLNodes in the OBJECT column which, when clicked, display() their contents.


Figure 1

XML-LINK

In early versions of CML and its related classes a lot of semantics were hardcoded. Some of these can now be seen as generic and potentially manageable by the XML-LINK tools. One common use of XML-LINK is to assemble objects into a common display such as EMBEDding them in text. Early versions of JUMBO supported some experimental rendering but since this is a useful generic operation for browsers I have delayed further implementation.


Figure 2

Many technical documents and data have relationships (often implicit) between components. Single Element-based classes cannot support these, but linking through XML-LINK may provide generic support. The mapping ("Assignments") of atoms in a molecule to peaks in a spectrum is shown in Figure 4. This is particularly simple since there is a 1:1 correspondence--for each bar in the spectrum there is an atom. An assignment is thus a link between the two and could be represented as:

 
        <RELATION XML-LINK="EXTENDED" TITLE="Peak1"> 
        <XVAR XML-LINK="LOCATOR" HREF="ATOM(3)" BEHAVIOR="highlight"/>
        <XVAR XML-LINK="LOCATOR" HREF="LINE(17)" BEHAVIOR="highlight"/>
        </RELATION> 

Clicking on Peak1 sends signals to the children of the RELATION to display themselves and to highlight the particular feature. As different peaks are clicked, the highlights are updated in both windows. If it is possible to catalogue a variety of such behaviors, XML-LINK can provide very powerful support.
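The signalling described above can be sketched as a relation object that forwards each locator to the window able to show it. This is a simplified illustration, not JUMBO's implementation: the names `RelationSketch`, `Locator`, and `View` are hypothetical, and the `KIND(number)` pattern is inferred only from the two HREF values in the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical model of an EXTENDED link whose children are LOCATORs. */
final class RelationSketch {
    /** A parsed HREF such as ATOM(3): the kind of target and its number. */
    static final class Locator {
        final String kind;
        final int index;

        Locator(String kind, int index) {
            this.kind = kind;
            this.index = index;
        }
    }

    /** A display window that can highlight one kind of feature. */
    interface View {
        String kind();

        void highlight(int index);
    }

    // Assumption: locators look like an upper-case name followed by a number in parentheses.
    private static final Pattern HREF = Pattern.compile("([A-Z]+)\\((\\d+)\\)");

    private final List<Locator> locators = new ArrayList<>();
    private final List<View> views = new ArrayList<>();

    /** Parses an HREF value into its kind and number. */
    static Locator parseHref(String href) {
        Matcher m = HREF.matcher(href.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unrecognized HREF: " + href);
        }
        return new Locator(m.group(1), Integer.parseInt(m.group(2)));
    }

    void addLocator(String href) {
        locators.add(parseHref(href));
    }

    void addView(View view) {
        views.add(view);
    }

    /** Called when the relation is clicked: every locator is sent to the matching views. */
    void activate() {
        for (Locator locator : locators) {
            for (View view : views) {
                if (view.kind().equals(locator.kind)) {
                    view.highlight(locator.index);
                }
            }
        }
    }
}
```

With one view registered for "ATOM" (the molecule window) and one for "LINE" (the spectrum window), activating the Peak1 relation highlights atom 3 in the first and line 17 in the second.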


Figure 3

How to Use JUMBO

JUMBO can be used in several modes:

  • As a standalone Java application (see Figure 5). This simply requires the user to install a Java interpreter. (Note that XML-LINK is used to transmit the effect of clicking the ??? icon (near Pyrrole) to display the groups of atoms in the molecule). In this mode JUMBO can read and write local files and also connect to servers. Here's an example:
        java jumbo.sgml.SGMLTree myfile.xml
      
  • Applets downloaded from a server to a traditional Java-enabled browser. The XML document is referenced inside an APPLET element in a *.html document:
 
        <APPLET CODE="jumbo.sgml.SGMLTree.class"><PARAM NAME="commandline" 
        VALUE="myfile.xml"></APPLET> 
  • Locally, within a Java-enabled browser, with the classes under the document tree.

The last two modes are particularly convenient, since many modern browsers support Java.

Extending JUMBO

JUMBO is distributed as a set of classes. Since Java is designed for extensibility, developers can modify its function without needing source code. The most common way to extend JUMBO will be to create a set of classes for a new DTD. In specialized cases (e.g., molecules) this requires one class per element. Where many Elements share common features, however, they can inherit methods. It should be straightforward to extend JUMBO to support stylesheets.

Recently JavaSoft published a vastly improved set of classes (Swing [4]) for creating GUIs. Some of these support generic tree functionality and its display, and adopting them is an obvious way to make JUMBO more robust and efficient.

Using JUMBO, TecML/CML, and VHG

JUMBO was developed as reusable code and is available for collaboration. Use of JUMBO for molecular purposes is likely to be in conjunction with the Open Molecule Foundation. CML and TecML will soon be published on CD-ROM (including Java-based demos) for the chemical community and others.


Figure 4


Figure 5

JUMBO and CML rely heavily on adding semantics through hyperlinks to glossaries (as in the "melting point" example above). To systematize the format and creation, we have developed the Virtual HyperGlossary project [5]. The project is communicating with providers of high-quality terminological content to create stable, curated XML-based glossaries to which XML documents can be linked. The glossaries have a simple syntax based either on TecML or ISO12200 (MARTIF). In either case they use attribute values from ISO12620 data categories. Hierarchy (superordinate concepts) and other entailment (e.g., "related term") are provided through XML-LINKs. There is support for ADMINistrative details and for VirtualHyperMarkup (the linking of documents to glossaries). XML's addressing and naming schemes allow for distributed glossary servers.

Acknowledgments

The OMF has supported the creation of this demo, and many of its members (especially Henry Rzepa, Richard Kinder, Andrew Payne and Adam Precious) have given me encouragement. I am particularly grateful to Jon Bosak for his virtual encouragement and presentation of a JUMBO demo at WWW6. Lesley West has partnered me in the creation of the Virtual HyperGlossary.

  1. http://www.jclark.com
  2. http://www.art.com/~joe/cost/index.html
  3. http://www.datachannel.com/
  4. http://www.javasoft.com
  5. http://www.venus.co.uk/vhg
  6. TecML and CML: http://www.venus.co.uk/omf/cml/
  7. JUMBO code/DTDs/examples/tutorials are at http://www.vsms.nottingham.ac.uk/vsms/java/jumbo
  8. The VHG is at http://www.venus.co.uk/vhg/
  9. The OMF is at http://www.ch.ic.ac.uk/omf/

About the Author

Peter Murray-Rust
Virtual School of Molecular Sciences
Nottingham University, UK
pazpmr@unix.ccc.nottingham.ac.uk

Peter Murray-Rust is the Director of the Virtual School of Molecular Sciences at the University of Nottingham, where he is participating in a new venture in virtual education and communities. Peter is also a visiting professor at the Crystallography Department at Birkbeck College, where he set up the first multimedia virtual course on the WWW (Principles of Protein Structure).

Peter's research interests in molecular informatics include participation in the Open Molecule Foundation--a virtual community sharing molecular resources; developing the use of Chemical MIME for the electronic transmission of molecular information; creating the first publicly available XML browser, JUMBO; and developing the Virtual HyperGlossary--an exploration of how the world community can create a virtual resource in terminology.