Chemical Markup Language

October 2, 1997

Peter Murray-Rust

A Simple Introduction to Structured Documents

Abstract

Structured documents in XML are capable of managing complex documents with many separate information components. In this article, we describe the role of the XML-LANG specification in supporting this. Examples are supplied explaining how components can be managed and how documents can be processed, with an emphasis on scientific and technical publishing. We conclude that structured documents are sufficiently powerful to allow complex searches simply through the use of their markup.

Historical Overview

Originally published as an HTML file, this paper was part of the CDROM e-publication ECHET96 ("Electronic Conference on Heterocyclic Chemistry"), run by Henry Rzepa, Chris Leach, and others at Imperial College, London, U.K. The CDROM was sponsored by the Royal Society of Chemistry, who (along with Cambridge, Leeds, and IC) are participants in the CLIC project. This is one of the projects under E-Lib, a U.K.-based program to promote electronic publishing. CLIC makes substantial use of SGML and Chemical Markup Language (CML). As part of this project I have been developing CML, one of the first applications of XML. CML, and its associated software JUMBO, probably represented one of the first complete XML applications (authoring tools, documents, and browser) in any discipline. Although the CML component was essentially a proof-of-concept, it was robust enough to be distributed as a standalone Java-based XML application. A wide variety of examples could therefore be viewed using JUMBO running under a Java-enabled browser.[1]

The audience for this paper need not be acquainted with SGML or XML; it serves as an introduction to the concept of document structure. As such, we assume no knowledge about markup languages, other than a familiarity with HTML. Though some parts may be trivially obvious to some readers, even they may find it useful as a tutorial aid for their colleagues. It is primarily aimed at those who are interested in authoring or browsing documents with the next generation of markup languages, especially those created with XML. CML [1] is part of the portfolio of the Open Molecule Foundation [2], which is a newly constituted open body to promote interoperability in molecular sciences. The latest versions of JUMBO can be found under the Virtual School of Molecular Sciences [3], which has also recently run a virtual course on Scientific Information Components using Java and XML [4].

The paper alludes to various software tools, but does not cover their operation or implementation. However, with the exception of stylesheets, most of the operations described here for CML have already been implemented as a prototype using the JUMBO browser and processor. The paper does not require any knowledge of chemistry or specific understanding of CML.

Finally, I should emphasize that SGML can be used in many ways; my approach does not necessarily do justice to the most common use, which is the management and publication of complex (mainly textual) documents. Projects in this area often involve many megabytes of data and industrial strength engines. I hope, however, that the principles described here will generally be of use.

Introduction

Two years ago I had never heard of structured documents, and have since come to see them as one of the most effective and cheapest ways to manage information. Though the basic idea is simple, when I first came across it I failed to see its importance. This paper is written as a guide to what is now possible. In particular, it explains XML--the simple new language being developed by a working group (WG) of the W3 Consortium. I have used this language as the basis for a markup language in technical subjects (Technical Markup Language, TecML) and particularly molecular sciences (Chemical Markup Language, CML).

The paper was originally written as a simple structured document, using HTML, although it could have been written in CML. I shall slant it towards those who wish to carry precise, possibly nontextual, information arranged in (potentially quite complex) data structures. While I use the term document, this could represent a piece of information without conventional text, such as a molecule. Moreover, documents can have a very close relation to objects; if you are comfortable with object-oriented languages you may like to substitute "object" for "document." In practice, XML documents can be directly and automatically transformed into objects, although the reverse may not always be quite so easy.

The markup I describe essentially uses the same syntax as HTML; it is the concepts, rather than the syntax that may be new. Although this paper is written in the context of document delivery over networks, markup is also ideally suited to the management of "traditional" documents. Markup languages are often seen as key tools in making them "future-proof" and interchangeable between applications (interoperability).

The important point about the XML approach is that it has been designed to separate different parts of the problem and to solve them independently. I'll explain these ideas in more detail below, but one example is the distinction between syntax (the basic rules for carrying the information components) and semantics (what meaning you put on the components and what behavior a machine is expected to perform). This is a much more challenging area than people realize, since human readers don't have problems with it.

One of the great polymaths of this century, J. D. Bernal, inspired the development of information systems in molecular science. In 1962 he urged that the problems of scientific information in crystallography (his own field) and solid state physics should be treated as one in communication engineering. Thirty years later we have most of the tools that are required to get the best information in the minimum quantity in the shortest time, from the people who are producing the information to the people who want it, whether they know they want it or not.[2]

Many scientists are unaware of the research during the last thirty years into the management of information.[3] In this review, Schatz shows that techniques from earlier research on the analysis of complex documents, including hyperlinking, concept analysis, and vocabulary switching between disciplines, can now be applied on a production scale. Much of his emphasis is on analysis of conventional documents produced by authors who have no knowledge of markup and who do not use controlled vocabularies (domain ontologies). For that reason, complex systems such as natural language processing (NLP) are required to extract implicit information from the documents, and they rely on having appropriate text to analyze. Automatic extraction of numerical and other nontextual information will be much more difficult.

Structure and Markup

We often take for granted the power of the human brain in extracting implicit information from documents. We have been trained over centuries to realize that documents have structure (Tables of Contents [TOCs], Indexes, Chapters with included Sections, and so on). It probably seems "obvious" to you that you are reading the fourth section ("Structure and Markup") in the paper ("A Simple Introduction to Structured Documents"). The HTML language and rendering tools that you are using to read [the online version] provide a simple but extremely effective set of visual clues; for instance, "Chapter" is set in larger type. However, the logical structure of the document is simply:

        HTML
          HEAD
            TITLE
          BODY
            H1 (Chapter)
            H2 (Section)
            H3 (Subsection)
            H3
            H2     
            P  (Paragraph)
            P
            P
            P
            P
            H2
            P
            P
            P
            H2
            P
            P
            ... and so on ...
            ADDRESS
      

where I have used the convention of indentation to show that one component includes another. This is a common approach in many TOCs, and human readers will implicitly deduce a hierarchy from the above diagram. But a machine could not unless it had sophisticated heuristics, and it would also make mistakes.

The formal structure in this document is quite limited, and that is one of the reasons HTML has been so successful, but also why it is increasingly insufficient. Humans can author documents easily and human readers can supply the implicit structure. But if you look again at the TOC diagram you will see that Chapters do not include Sections in a formal manner, nor do Sections include Paragraphs. The first occurrence of H2 and H3 is used for the author and affiliation, which is not a "Section."

An information component (an Element) contains another if the start-tag and end-tag of the container completely enclose the contained. Thus the HEAD element contains a TITLE element, and the TITLE element contains a string of characters (the SGML/XML term is #PCDATA). There is a formal set of rules in HTML for which elements can contain which other Elements and where they can occur. Thus, it's not formally allowed to have TITLE in the BODY of your document. These rules, which are primarily for machines and SGML gurus to read, are combined in a Document Type Definition (DTD).


Note

If you have already come across SGML and been put off for some reason, please don't switch off here. XML has been carefully designed to make it much easier to understand the concepts and there are far fewer terms. For example, you don't even have to have a DTD if you don't want one.


This document has an inherent structure in the order of its Elements. Most people would reasonably assume that an H2 element "belongs to" the preceding H1, and that P elements belong to the preceding H2. It would be quite natural to use phrases like "the second sentence of the second paragraph in the section called Introduction." Although humans can do this easily, it's common to get lost in large documents. The important news is that XML now makes it possible for machines to do the same sort of thing with simple rules and complete precision. The Text Encoding Initiative (a large international project to mark up the world's literature) has developed tools for doing this, and they will be available to the XML community.
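To make this concrete, here is a minimal sketch in Python (a modern convenience for illustration; the element names and mini-document are invented) showing that "the second paragraph in the section called Introduction" becomes a precise, mechanical query once the structure is explicit:

```python
import xml.etree.ElementTree as ET

DOC = """<PAPER>
  <SECTION><TITLE>Abstract</TITLE><P>one</P></SECTION>
  <SECTION><TITLE>Introduction</TITLE><P>first</P><P>second</P></SECTION>
</PAPER>"""

paper = ET.fromstring(DOC)
# Select, with complete precision, the second P of the second SECTION.
target = paper.find("SECTION[2]/P[2]")
```

The path syntax counts elements exactly as a human would count paragraphs, but with no risk of getting lost in a large document.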

In HTML there are no formal conventions for what constitutes a Chapter or Section, and no restriction as to what elements can follow others. Therefore, you can't rely on analyzing an arbitrary HTML document in the way I've outlined. This highlights the need for more formal rules, agreements, and guidelines. In XML we are likely to see communities such as users of CML develop their own rules, which they enforce or encourage as they see fit. For example, there is no restriction on what order Elements can occur in a CML document, but there is a requirement that ATOMS can only occur within a MOL (molecule Element). (In CML I use the term "ChemicalElement" to avoid confusion!)
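A containment rule of this kind can be checked mechanically. The following sketch uses the element names from the text, but the checking code is my own illustration, not part of any CML toolkit:

```python
import xml.etree.ElementTree as ET

def atoms_inside_mol(root):
    # Every ATOM must sit somewhere beneath a MOL element.
    atoms_in_mols = {id(a) for mol in root.iter("MOL") for a in mol.iter("ATOM")}
    return all(id(a) in atoms_in_mols for a in root.iter("ATOM"))

ok = atoms_inside_mol(ET.fromstring("<DOC><MOL><ATOM/><ATOM/></MOL></DOC>"))
bad = atoms_inside_mol(ET.fromstring("<DOC><ATOM/></DOC>"))
```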

In the Schatz reference that is footnoted earlier, you will probably "know automatically" what the components are. The thing in brackets must be the year, "pp." is short for "pages," the bold type must be the volume, and the italics are the journal title. But this is not obvious to a machine; trying to write a parser for this is difficult and error-prone. Many different publishing houses have their own conventions. The Royal Society of Chemistry might format this as:

        B.R. Schatz, Science, 1997, 275, 327.
      

Any error in punctuation such as missing periods causes serious problems for a machine, and conversions between different formats will probably involve much manual crafting.

The precise components of the reference, which are well understood and largely agreed within the bibliographic community, are a good example of something that can be enhanced by markup. Markup is the process of adding information to a document that is not part of the content but adds information about the structure or elements. Using the Schatz citation as an example, we can write:

        <BIB>
          <TITLE>
            Information Retrieval in Digital Libraries: Bringing Search to the Net
          </TITLE>
          <JOURNAL>Science</JOURNAL>
          <AUTHOR>
            <FIRSTNAME>Bruce</FIRSTNAME>
            <INITIAL>R</INITIAL>
            <LASTNAME>Schatz</LASTNAME>
          </AUTHOR>
          <VOLUME>275</VOLUME>
          <YEAR>1997</YEAR>
          <PAGES>327-334</PAGES>
        </BIB>
      

A scientist who had never seen markup before would implicitly understand this information. The advantage is that it's also straightforward to parse it by machine. If the tags (<...>) are ignored, then the remainder (the content) is exactly the same as it was earlier (except for punctuation and rendering). It's often useful to think of markup as invisible annotations on your document. Many modern systems do not mark up the document itself, but provide a separate document with the markup. For example, you may not be allowed to edit a document but can still point to, and comment on, a phrase, section, chapter, etc. This is a feature of hypermedia systems, and one of the goals of XML is to formalize this through the development of linking syntax and semantics in XML-LINK (XLL), but this is outside the scope of this paper.
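To show how straightforward the machine parsing is, here is a minimal sketch using Python's standard XML parser (any XML parser would serve); it reads the citation above and addresses each component by name:

```python
import xml.etree.ElementTree as ET

# The marked-up citation from the text, as a string.
BIB = """<BIB>
  <TITLE>Information Retrieval in Digital Libraries: Bringing Search to the Net</TITLE>
  <JOURNAL>Science</JOURNAL>
  <AUTHOR>
    <FIRSTNAME>Bruce</FIRSTNAME>
    <INITIAL>R</INITIAL>
    <LASTNAME>Schatz</LASTNAME>
  </AUTHOR>
  <VOLUME>275</VOLUME>
  <YEAR>1997</YEAR>
  <PAGES>327-334</PAGES>
</BIB>"""

bib = ET.fromstring(BIB)
# Each component is addressed by its element name; no guessing about
# punctuation or typography is needed.
journal = bib.findtext("JOURNAL")
year = bib.findtext("YEAR")
lastname = bib.findtext("AUTHOR/LASTNAME")
```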

What is so remarkable about this? In essence we have made it possible for a machine to capture some of those things that a human takes for granted.

  • Punctuation and other syntax are no longer a problem, as there are extremely carefully defined rules in XML. If your markup characters are <...>, how do you actually send < and > characters without them being mistaken for markup? One way is to encode them as &lt; and &gt;.
  • Character encoding and other character entities have received a huge amount of attention and many entity sets have been developed, some by ISO. For example, the copyright symbol (©) is number 169 in ISO Latin-1 and can be written as &#169;. It also has a symbolic representation (&copy;). XML itself has only a very few built-in character entities, but will support Unicode and other approaches to encoding characters. Most browsers do not yet support a wide range of glyphs for entities, but this is likely to change very rapidly, especially since languages like Java have addressed the problem.
  • The role of information elements is defined. In the previous example, you can see what the precise components are and what their extent is. Note how the AUTHOR element is divided into three components. What you do with this information is the remit of semantics, and XML separates syntax precisely from semantics in a way that very few other non-SGML systems can do.
  • Documents can be reliably restructured or filtered by machine. An author might enter the LASTNAME, FIRSTNAME, and INITIAL sequentially, but the machine could be asked to sort them into a different order. This may not appear very important, but to those implementing programs it is an enormous help. If the house style was initials-only, the program could easily turn Bruce into B.
  • Documents can be transformed, merged, and edited automatically. This is a great advance in information management. For example, it would be straightforward to write a citation analyzer that found all BIB elements in a document and abstracted parts of them by JOURNAL or YEAR.
  • It's easy to convert from one structured document to another. The bibliographic example above is not in strict CML, but it's very easy to convert it to CML, without losing any information.
  • All information in a document can be precisely identified. The above example is marked down to the granularity of a single character (the INITIAL). It is conceptually easy to extend this to markup of numbers, formulae, and parts of things such as regions in diagrams or atoms in molecules.
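Several of these points can be sketched together in a few lines of Python; the citation analyzer and the initials-only restructuring are my own toy illustrations over an invented mini-document:

```python
import xml.etree.ElementTree as ET

DOC = """<DOC>
  <P>Some text citing two papers.</P>
  <BIB><AUTHOR><FIRSTNAME>Bruce</FIRSTNAME><LASTNAME>Schatz</LASTNAME></AUTHOR>
    <JOURNAL>Science</JOURNAL><YEAR>1997</YEAR></BIB>
  <BIB><AUTHOR><FIRSTNAME>John</FIRSTNAME><LASTNAME>Bernal</LASTNAME></AUTHOR>
    <JOURNAL>Nature</JOURNAL><YEAR>1962</YEAR></BIB>
</DOC>"""

doc = ET.fromstring(DOC)

# A toy "citation analyzer": find every BIB element, wherever it
# occurs in the document, and group journal titles by year.
by_year = {}
for bib in doc.iter("BIB"):
    by_year.setdefault(bib.findtext("YEAR"), []).append(bib.findtext("JOURNAL"))

# House style is initials-only, so the program turns "Bruce" into "B."
def initial_of(bib):
    first = bib.findtext("AUTHOR/FIRSTNAME")
    return first[0] + "."

initials = [initial_of(b) for b in doc.iter("BIB")]
```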

Rules, Meta-Languages, and Validity

I started writing Chemical Markup Language because I wanted to transfer molecules precisely using ATOMS, BONDS, and related information. It was always clear that "chemistry" involved more than this and that we needed the tools to encapsulate numeric and other data such as spectra. I looked at a wide variety of journals in the scientific area to see what sort of information was general to all of them, and whether a markup language could be devised which could manage this wide range. It required a meta-language, and this section is an explanation of what that involves.

I'll explain the "meta-" concept using XML and then show how it extends to applications such as TecML. XML, despite its name, is not a language but a meta-language (a tool for writing languages). XML is a set of rules that enable markup languages to be written; TecML and CML are two such languages. For example, one rule in XML is: every non-empty element must have a start-tag and an end-tag; so that the <AUTHOR> tag must be balanced by a </AUTHOR> tag. This is not a strict requirement of HTML, which uses a more flexible set of rules (but is also harder to parse or read by machine). Another rule is: all attribute values must occur within quotes (" or '). Writing a markup language is somewhat analogous to writing a program, and the relation of XML to CML is much the same as C to hello.c. We say that CML "is an application of XML," or "is written in XML," just as "hello.c is written in C." XML is a little stricter than HTML in the syntax it allows, but the benefit is that it's much easier to write browsers and other applications.

XML allows for two sorts of documents: valid and well-formed. Validity requires an explicit set of rules in a DTD. This is usually a separate file, but part or all can be included in the document itself. An example of a validity criterion in HTML is that LI (a ListItem) must occur within a UL or OL container. Well-formedness is a less strict criterion and requires primarily that the document can be automatically parsed without the DTD. The result can be represented as a tree structure. The bibliographic example above is well-formed, but without a DTD we cannot say whether it is valid. Suppose the DTD contained an explicit rule such as "the author must include an element describing the language that the article was written in, such as <LANGUAGE>EN</LANGUAGE>"; in that case, the document fragment would be invalid.
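The well-formedness criterion itself can be demonstrated mechanically; in this sketch (using Python's standard parser for illustration) a balanced fragment parses, while an HTML-style fragment with omitted end-tags is rejected:

```python
import xml.etree.ElementTree as ET

def well_formed(text):
    # A well-formed document can be parsed without any DTD; a
    # mismatched or missing end-tag is an outright syntax error.
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

ok = well_formed("<P>A balanced paragraph.</P>")
bad = well_formed("<UL><LI>one<LI>two</UL>")  # HTML-style omitted end-tags
```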

The importance of validity will depend on the philosophy of the community using XML. In molecular science all *.cml documents will be expected to be valid and this is ensured by running them through a validating parser such as NXP.[4] If a browser or other processing application such as a search engine could assume that a certified document was valid (perhaps from a validation stamp), there would be no need to write a validating parser. Being valid doesn't mean the contents are necessarily sensible; further processing may be needed for that purpose.

Where, and how, you enforce validity depends on what you are trying to do. If you are providing a form for authors to submit abstracts, you will enforce fairly strict rules. ("It must have one or more AUTHORs, exactly one ADDRESS for correspondence, and the AUTHOR must contain either a FIRSTNAME or INITIALS but not both.") This can be enforced in a DTD. But this would be too restricting for a general scientific document, which need not always have an AUTHOR. The two forces of precision and flexibility often conflict, but can be reconciled to a large extent by providing different ways of processing documents.
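Under the assumptions of that abstract-submission scenario, the rules might be expressed in a DTD fragment along the following lines (the element names and content models are illustrative, not taken from any published DTD):

```dtd
<!-- One or more AUTHORs, then exactly one correspondence ADDRESS. -->
<!ELEMENT ABSTRACT  ((AUTHOR)+, ADDRESS, PARA*)>
<!-- An AUTHOR contains either a FIRSTNAME or INITIALS, but not both. -->
<!ELEMENT AUTHOR    ((FIRSTNAME | INITIALS), LASTNAME)>
<!ELEMENT FIRSTNAME (#PCDATA)>
<!ELEMENT INITIALS  (#PCDATA)>
<!ELEMENT LASTNAME  (#PCDATA)>
<!ELEMENT ADDRESS   (#PCDATA)>
<!ELEMENT PARA      (#PCDATA)>
```

A validating parser would reject any abstract that broke these content models, with no application code required.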

Processing Documents

At this stage it's useful to think about how an XML document might be created and processed. At its simplest level a document can be created with any text editor; this was how the BIB example was written. It can then be processed with the human brain. This isn't a trivial point; there is no fundamental requirement for software at any stage of managing XML documents. In practice, however, software adds enormously to the value. CML documents such as those including atomic coordinates only make sense when rendered by computers.

XML documents can be created, processed, and displayed in many ways. The schematic diagram in Figure 1 (which emphasizes the tree structure) shows some of the possible operations.

The lefthand module shows parts of the editing process. Legacy documents can be imported and converted on the fly, and the tree can be edited. There will normally also be a module for editing text. The editor may have access to a DTD and can therefore validate the document as it is created. An important aspect of XML-LINK is that editors should be able to create hyperlinks, either internally or to external files.

The complete document will then be mounted on a server. This will associate it with stylesheets, Java classes, the DTD, entities, and other linked components. The packaged documents are then delivered to the client where the application requires an XML parser. If the client wishes to validate the document the DTD is required.

Many XML applications will then hold the parsed document in memory as a tree (or grove) which can then be further processed. A frequent method will be the delivery of DSSSL stylesheets with the document (or provided client-side), or other transformation tools (perhaps written in Perl). Alternatively, the components of the document may be associated with Java classes either for display or transformation (as in the JUMBO browser). All of these methods may involve semantic validation (such as "does the document contain sensible information?").

Some of the operations required in processing XML are now explained in more detail:

Authoring

One of the hardest problems is to write the authoring tools for an SGML/XML system. A good tool has to provide a natural interface for authors, most of whom won't know the principles of markup languages. It may also have to enforce strict and complex rules, possibly after every keystroke. Many current authoring tools are therefore tailored to a limited number of specific applications, one of the most versatile of which is an SGML add-on to Emacs. Sometimes a customer will approach an SGML house and, after agreeing on a DTD, a specific tool will be built. For some common document types--such as military contracts--there is enough commonality that commercial tools are available.
Conversion

In some cases authoring involves conversion of legacy documents; if these are well understood, conventional programs can be written in Perl or similar languages. Where the XML documents represent database entries or the output from programs, the authoring process is particularly simple--many CML applications will fall in that category. XML makes it particularly easy to reuse material either by "cut-and-paste" of sections, or preferably through entities. Classes written for JUMBO can already convert 15 different types of legacy files into CML.

Figure 1

Editing and merging
Editing and merging affect the structure of the document and therefore may require validation. To write programs that do this on the fly is again difficult, and it may be useful, where possible, to divide documents into "chunks" or entities. SGML has a very powerful concept of entities and can describe documents whose components are distributed over a network. For example, if I have an address, it is extremely useful to refer to that chunk by a symbolic name, such as &pmraddress;. With appropriate software I can include this at appropriate places and the software will include the full content of the entity. (If the entity contains references to other entities, they are also expanded, and so on.)
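The expansion of entities can be sketched briefly; in this illustration (the entity name follows the text, but the address string is invented) an internal entity declared in the document's own DTD subset is substituted at every reference:

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!DOCTYPE PAPER [
  <!ENTITY pmraddress "Virtual School of Molecular Sciences, Nottingham">
]>
<PAPER>
  <ADDRESS>&pmraddress;</ADDRESS>
  <FOOTER>&pmraddress;</FOOTER>
</PAPER>"""

paper = ET.fromstring(DOC)
# The parser substitutes the full entity text at each reference point.
addresses = [el.text for el in paper]
```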
The server: assembly and queries
The server has a vital role to play in many XML applications. It is possible to mount sophisticated SGML systems that retrieve document components and assemble them on the fly into XML documents. Alternatively, the components could be retrieved from databases, as with chemical and biological molecules or data, and converted into XML files. Since XML maps onto object storage, it is particularly attractive for those developing object-based systems such as CORBA. Whether the complete document is assembled at the server or the addresses of the fragments are sent to the client will depend on bandwidth, the preference of the community, the availability of software, and many other considerations.
Parsing
Parsing is the process of syntactic analysis and validation. It normally produces a standardized output either on file or in memory. Whether you need to validate documents when you receive them will depend on your community's requirements. For example, if I receive a database entry from a major molecular data center I can rely on its validity, but a publisher getting a hand-edited XML manuscript will probably want to validate it. A validating parser requires that the document be valid against a specified DTD. Finding this DTD normally requires interpretation of the DOCTYPE statement at the head of an XML document. Some authors/servers are prepared to distribute the DTDs when documents are downloaded. While this adds precision in that the correct DTD is used, it can add to the burden of server maintenance and can increase bandwidth. If a community agrees on a DTD, they may find it useful to distribute it with the browsing software. The result of parsing is usually a parse-tree. If this is an unfamiliar concept, think of it as a table of contents with every Element corresponding to a chapter or (sub...sub)section. Trees are easy to manipulate and display; JUMBO displays the tree as a TOC. There are already two freely available XML parsers written in Java (NXP and Lark)[5] and I have used both. Lark creates a parse tree in memory that can be subclassed, while NXP produces it on the output stream.
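The parse-tree-as-TOC idea can be sketched in a few lines: walk the tree and record each element name indented by its depth (the toy document is invented, and Python stands in for the Java of NXP or Lark):

```python
import xml.etree.ElementTree as ET

DOC = ("<HTML><HEAD><TITLE>t</TITLE></HEAD>"
       "<BODY><H1>c</H1><H2>s</H2><P>p</P></BODY></HTML>")

def toc(element, depth=0, out=None):
    # Record each element name, indented by its depth in the tree,
    # exactly as a table of contents indents chapters and sections.
    if out is None:
        out = []
    out.append("  " * depth + element.tag)
    for child in element:
        toc(child, depth + 1, out)
    return out

lines = toc(ET.fromstring(DOC))
```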
Postprocessing, rendering, and validation
Most documents require at least some postprocessing, and many need a lot. Most users of XML applications will think of "browsers" or "plug-ins" as the obvious tools to use on a document. This will probably be true, but because it is machine-processable, XML is so powerful that many completely new applications will be developed. An XML document might consist of an airline reservation and the postprocessor could decide to order a taxi to the airport. A chemical reaction in a CML document could trigger the supply of chemicals and interrogate the safety databases.
Semantics and the postprocessor
An XML document carries no semantics with it, and there has to be an explicit or implicit agreement between the author and reader. Most authors understand roughly the same thing by the TITLE in HTML documents, although they might try and use them in different ways. TITLE is valuable for indexers such as AltaVista, which abstract their content separately from the body of the document. This emphasizes the value of structural markup. However, some widely used element names are ambiguous (A is variously used in different DTDs for author, anchor, etc.), and for some, such as LINK, it's unclear what their role is. Clarifying this for each DTD requires semantics. Traditionally, semantics have been carried in documentation: if this is not done clearly then implementers may provide different actions for the same Element. The XML project is actively investigating formal automatic ways of delivering semantics, such as stylesheets and Java classes.
Validation at the postprocessor
The DTD/validating-parser cannot deal with some aspects of validation, which must be tackled by a conventional program/application. Common examples of validation are content ("is this number in the allowed range?"), and occurrence counts ("no more than five sections per chapter"). This is likely to need special coding for each application, and will be most important where high precision and low flexibility is the intention.
Stylesheets
Stylesheets are sets of rules that accompany a document.[6] They can be used to filter or restructure the document (as in "extract all footnotes and put them at the end of a section"). Their most common use is in formatting or providing typesetting instructions ("all subsections must be indented by x mm and typeset in this font"). ISO has produced a standard for the creation of stylesheets (DSSSL), which allows their description in Scheme (a derivative of LISP). Stylesheets are generally written to produce a transformed document, rather than to create an object in memory; Java classes are more suitable for this. I expect to see the technologies converge--which is used will depend on the application and the community using it. There are at least four ways that stylesheets might be used; the technology exists for each one. Which overrides which is a matter of politics, not technology.
  • By the author. If an author wishes to impart a particular style to a document, he can attach or include a stylesheet. This can be invoked at the postprocessor level, unless it has been overridden.
  • By the server. If an organization such as a publishing house is running the server, it may impose a particular style, such as for bibliographic references. XML would give the author the freedom to prepare them in a standard way (e.g., using CML), while the journals could transform this by sending their stylesheets to the reader.
  • By the client software (browser). The software manufacturer has an interest in providing a common look-and-feel to the display. It reduces training and documentation costs and might provide a competitive market edge.
  • By the reader. She may have personal preferences concerning the presentation of material, perhaps because of her education. Alternatively, her employer may require a common house style to facilitate training and internal communication.
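A stylesheet rule of the kind described, mimicked here in Python rather than DSSSL, might reduce the marked-up Schatz citation to the Royal Society of Chemistry house style shown earlier (the formatting function is my own sketch):

```python
import xml.etree.ElementTree as ET

BIB = """<BIB>
  <AUTHOR><FIRSTNAME>Bruce</FIRSTNAME><INITIAL>R</INITIAL>
    <LASTNAME>Schatz</LASTNAME></AUTHOR>
  <JOURNAL>Science</JOURNAL><YEAR>1997</YEAR>
  <VOLUME>275</VOLUME><PAGES>327-334</PAGES>
</BIB>"""

def rsc_style(bib):
    # House style: initials only, then journal, year, volume, first page.
    initials = (bib.findtext("AUTHOR/FIRSTNAME")[0] + "." +
                bib.findtext("AUTHOR/INITIAL") + ".")
    first_page = bib.findtext("PAGES").split("-")[0]
    return "%s %s, %s, %s, %s, %s." % (
        initials, bib.findtext("AUTHOR/LASTNAME"), bib.findtext("JOURNAL"),
        bib.findtext("YEAR"), bib.findtext("VOLUME"), first_page)

citation = rsc_style(ET.fromstring(BIB))
```

Because the components are marked up, the same document could be reformatted for any other journal simply by swapping the rule.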
Java classes
Every Element can be thought of as an object and have methods (or behavior) associated with it. Thus, a LIST object might count and number the items it contains. Most elements will have a display() method, which could be implemented differently from object to object. Thus, in JUMBO, MOLNode.display() brings up a rotatable screen display of the molecule, while BIB.display() displays each citation in a mixture of fonts. As with stylesheets, Java classes can be specified at any of the four places listed above, and the appropriate one downloaded from a Web site if required. One of the problems the XML-WG is tackling and solving is how to locate Java classes. Because Java is a very powerful programming language with full WWW support, it offers almost unlimited scope for XML applications. A document need not be passive, but could prompt the client to take a whole series of actions--mailing people, downloading other data, and updating the local database are examples.
Manifests and addressing on the WWW
Most XML "documents" will consist of several physical files or streams, and these may be distributed over more than one server. An important attraction of XML is that common document components such as citations, addresses, boilerplate, etc. can be reused by many authors. Packaging these components is a challenge that the W3C and others are tackling. It involves:
  • Methods of locating components. XML uses URLs or their future evolution (such as URNs).
  • Labeling a file with its type. XML has provision for NOTATION, which may be linked to a reference URL or a MIME type.
  • Creating a manifest of all the components required in a package (perhaps through a Java archive file [*.jar]).

Attributes

So far I have used only Element names (sometimes called GIs) to carry the markup. XML also provides attributes as another way of modulating the element. Attributes occur within start-tags, and well-known examples are HREF (in A) and SRC (in IMG):

        <A HREF="http://www.venus.co.uk/omf/cml/">
        
        <IMG SRC="mypicture.gif" WIDTH="500" HEIGHT="100">
        
      

Attributes are semantically free in the same way as Elements, and can be used with stylesheets or Java classes to vary their meaning.

Whether Elements or attributes are used to convey markup is a matter of preference and style, but in general, the more flexible the document needs to be, the more I would recommend attributes. As a point of style, many people suggest that document content should not occur in attributes, but this advice is not universal. Here are some simple examples of the use of attributes:

  • Describing the type of information (e.g., what language the Element is written in)
  • Adding information about the document or parts of it (who wrote it, what its origins are)
  • Suggestions for rendering, such as recommended sizes for pictures
  • Help for the postprocessor (e.g., the wordcount in a paragraph)
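These uses might look like the following in markup (the element and attribute names here are invented for illustration):

```xml
<P LANG="en-GB" AUTHOR="pmr" SOURCE="ECHET96" WORDCOUNT="23">
  The attributes on this paragraph record its language, its
  authorship and origin, and a word count for the postprocessor.
</P>
<FIGURE SRC="spectrum.gif" WIDTH="400" HEIGHT="300">
```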

In XML-LINK attributes are extensively used to provide the target, type, and behavior of links.

Flexibility and Meta-DTDs

As discussed earlier, when developing an XML application, the author has to decide whether precision and standardization are required, or whether it is more important to be flexible. If precision is required, then the DTD will be the primary means of enforcing it and, as a consequence, may become large and complex. Such a DTD implies that the "standard" is unlikely to change. When new versions are produced, the complete pipeline from authoring to rendering will need to be revised. Because this is a major effort and cost, careful planning of the DTD is necessary.

If flexibility is more important, either because the field is evolving or because it is very broad, a rigid DTD may restrict development. In that case a more general DTD is useful, with flexibility being added through attributes and their values.

In TecML I created an Element type, XVAR, for a scalar variable. Attributes are used to tune the use and properties of XVAR, and it's possible to make it do "almost anything"! For example, it can be given a TYPE (such as STRING, FLOAT, or DATE) and a TITLE. In this way, any number of objects can be precisely described. Here are three examples:

 
        <XVAR TYPE="STRING" TITLE="Greeting">Hello world!</XVAR>
        <XVAR TYPE="DATE">2000-01-01</XVAR> 
        <XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Fahrenheit">451</XVAR> 

The last is particularly important because it uses the concept of linking to add semantics. This is an important feature of XML; the precise syntax is being developed in XML-LINK. CML uses DICTNAME to refer to an entry in a specified glossary that defines what "Melting Point" is. This entry could have further links to other resources, such as world collections of physical data. Similarly, UNITS is used to specify precisely what scale of temperature is used. Again, this is provided by a glossary in which SI[7] units are the default.

By using this approach it is possible to describe any scalar variable simply by varying the attributes and their values. Note that the attribute types must be defined in the DTD but their values may either be unlimited or can be restricted to a set of possible values.


Note

In the preceding example the links are implicit; later versions of CML will probably use the explicit links provided by XML-LINK.
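In a DTD, the choice between restricted and unlimited attribute values could be declared like this (a sketch modeled on the XVAR examples above, not the actual TecML declarations):

```xml
<!-- TYPE is restricted to an enumerated set of values, with STRING as
     the default; DICTNAME and UNITS accept any string (CDATA) -->
<!ATTLIST XVAR
    TYPE     (STRING|FLOAT|DATE) "STRING"
    TITLE    CDATA #IMPLIED
    DICTNAME CDATA #IMPLIED
    UNITS    CDATA #IMPLIED>
```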


The TecML DTD uses very few Element types, and these have been carefully chosen to cover most of the general concepts that arise in technical subjects. They include ARRAY, XLIST (a general tool for data structures such as tables and trees), FIGURE (a diagram), PERSON, BIB, and XNOTATION. (NOTATION is an XML concept which allows non-XML data to be carried in a document, and is therefore a way of including "foreign" file types.) With these simple tools and a wide range of attributes it is possible to mark up most technical scientific publications. There has to be general agreement about the semantics of the markup, of course, but this is a great advance compared with having no markup at all.
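A fragment of a technical paper marked up with these elements might look like the following sketch (the attribute names on ARRAY are assumed for illustration, and the content models are simplified):

```xml
<PERSON>Peter Murray-Rust</PERSON>
<XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Celsius">161</XVAR>
<ARRAY TITLE="Observed shifts" TYPE="FLOAT">7.2 7.4 8.1</ARRAY>
<BIB>Schatz, Science, 275, 327-334 (1997)</BIB>
```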

Entities and Information Objects

When documents have identifiable components it is often useful to put them into ENTITYs in separate files or resources. For example, although a citation might be used by many documents, only one copy is needed as long as all documents can address it. Chapters in an anthology might all be held as separate entities, allowing each to be edited independently. If the entity is updated (it might be an address, for example), all references to the entity will automatically point to the correct information. Entities in XML can be referenced through URLs, allowing truly global hyperdocuments.
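As a sketch, a shared citation could be declared as an external entity and then referenced wherever it is needed (the file name here is invented):

```xml
<!DOCTYPE PAPER [
  <!-- one copy of the citation, held in its own file -->
  <!ENTITY rscCite SYSTEM "citations/echet96.xml">
]>
<PAPER>
  <P>The method follows the published procedure &rscCite; without change.</P>
</PAPER>
```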

Many documents involve more than one basic discipline. For example, a scientific paper may include text, images, vector graphics, mathematics, molecules, bibliography, and glossaries. All of these are complex objects and most have established SGML conventions. Authors of these documents would like to reuse these existing conventions without having to write their own (very complicated) DTDs. The XML community is actively creating the mechanisms for doing this. If components are mixed within the same document, their namespaces must be identified (e.g., "this component obeys the MathML DTD and that one obeys CML"). For example, all the mathematical equations could be held in separate entities, and so could the molecular formulae. This would also support another method of combining components through XML-LINK, where the components are accessed through the HREF syntax.

Searching

Realizing the power of structured documents (SDs) for carrying information was a revelation for me. In many disciplines, data map far more naturally into a tree structure than into a relational database (RDB). An SD has a concept of sequential information while an RDB does not. The exciting thing is that the new object databases (including the hybrid Object-Relational Databases [ORDBs]) have the exact architecture needed to hold XML-like documents, and suppliers now offer SGML interfaces. (For any particular application, of course, there may be a choice between RDBs and ORDBs.) The attraction of objects over RDBs is that it is much easier to design the data architecture with objects.

In many cases simply creating well marked-up documents may be all that is required for their use in the databases of the future. The reason for this confident statement is that SDs provide a very rich context for individual Elements. Thus we can ask questions like:

  • "Find all MOLECULEs which contain MOLECULEs." (e.g., ligands in proteins)
  • "Which DATASET contains one MOLECULE and one SPECTRUM whose attribute TYPE has a value of nmr?"
  • "Find all references to journals not published by the Royal Society of Chemistry."

Despite their apparent complexity, these can all be managed with standard techniques for searching structured documents. Because of this power, a special language (Structured Document Query Language--SDQL) has been developed and will interoperate with XML. If simple application-specific tools are developed then queries like the following are possible:

  • "Find all XVARs whose DICTNAME value is Melting Point; retrieve the value of the UNITS attribute and use it to convert the content to a floating point number representing a temperature on the Celsius scale. Then include all data with values in the range 150-170."
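A minimal sketch of such a tool, assuming the XVAR elements have already been parsed into their attribute values (the class name and flattened element model are hypothetical):

```java
// Sketch only: each String[] stands for one parsed <XVAR> element as
// {DICTNAME, UNITS, content}. A real tool would walk the document tree.
public class MeltingPointQuery {

    // Convert a value to the Celsius scale according to its UNITS attribute.
    static double toCelsius(double value, String units) {
        if (units.equals("Fahrenheit")) return (value - 32.0) * 5.0 / 9.0;
        if (units.equals("Kelvin"))     return value - 273.15;
        return value;  // assume Celsius when no conversion is needed
    }

    public static void main(String[] args) {
        String[][] xvars = {
            {"Melting Point", "Fahrenheit", "451"},
            {"Melting Point", "Celsius",    "160"},
            {"Boiling Point", "Celsius",    "155"},
        };
        for (String[] x : xvars) {
            if (!x[0].equals("Melting Point")) continue;   // DICTNAME filter
            double celsius = toCelsius(Double.parseDouble(x[2]), x[1]);
            if (celsius >= 150.0 && celsius <= 170.0) {    // range filter
                System.out.println(x[0] + ": " + celsius + " C");
            }
        }
    }
}
```

Here 451 Fahrenheit converts to about 233 Celsius and is excluded, so only the 160 Celsius entry passes the filter; the point is that the UNITS attribute supplies exactly the semantics the query needs.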

The XML-LINK specification has borrowed the syntax of extended pointers (XPointers) from the Text Encoding Initiative (TEI). Although primarily intended to access specific components within an XML document, the syntax is quite a powerful query language. The first two queries might be represented as:

 
        ROOT,DESCENDANT(1,MOLECULE) DESCENDANT(1,MOLECULE)
        ROOT,DESCENDANT(DATASET)CHILD(1,MOLECULE)ANCESTOR(1,DATASET)CHILD(1,SPECTRUM,TYPE,"nmr") 

The first finds the first MOLECULE, which is a descendant of the root of the document, and then the first MOLECULE, which is somewhere in the subtree from that. The second is more complex, and requires the MOLECULE and SPECTRUM to be directly contained within the DATASET element. (The details of TEI XPointers in XML may still undergo slight revision and are not further explained here.)

Summary, and the Next Phase

This document has described only part of what XML can offer to a scientific or publishing community. XML has three phases; only the first has been covered here in any depth. XML-LINK defines a hyperlinking system and XML-STYLE defines how stylesheets will be used. Hyperlinking can range from the simple, unverified link (as in HTML's HREF attribute for Anchors) to a complete database of typed and validated links over thousands of documents. XML-LINK is addressing all of these and has the power to support complex systems.

How will XML develop in practice? A natural impetus will come from those people who already use SGML and see how it could be used over the WWW. It is certainly something that publishers should look at very closely, as it has all the required components--including the likelihood that solutions will interoperate with Java.

XML is the ideal language for the creation and transmission of database entries. The use of entities means it can manage distributed components, it maps well onto objects, and it can manage complex relationships through its linking scheme. Most of the software components are already written.

How would it be used with a browser? Assuming that the bulk of tools are written in Java, we can foresee helper applications or plug-ins, and perhaps more autonomous tools capable of independent action. This is an excellent way to manage legacy documents, since it avoids writing a specific helper for each type.

I hope enough tools will be available for XML to provide the same creative and expressive opportunities as HTML provided in the past. However, it's important to realize that freely available software is required--any tools for structured document management, especially in Java, will be extremely welcome. The accompanying paper describes my own contribution through the JUMBO browser.

  1. http://www.venus.co.uk/omf/cml
  2. http://www.chic.ac.uk/omf
  3. http://www.vsms.nottingham.ac.uk/vsms
  4. http://www.vsms.nottingham.ac.uk/vsms/java
  5. Robin Cover's SGML Home page: http://www.sil.org/sgml
  6. FAQ for XML run by Peter Flynn: http://www.ucc.ie/xml

About the Author

Peter Murray-Rust
Virtual School of Molecular Sciences
Nottingham University, UK
pazpmr@unix.ccc.nottingham.ac.uk

Peter Murray-Rust is the Director of the Virtual School of Molecular Sciences at the University of Nottingham, where he is participating in a new venture in virtual education and communities. Peter is also a visiting professor at the Crystallography Department at Birkbeck College, where he set up the first multimedia virtual course on the WWW (Principles of Protein Structure).

Peter's research interests in molecular informatics include participation in the Open Molecule Foundation--a virtual community sharing molecular resources; developing the use of Chemical MIME for the electronic transmission of molecular information; creating the first publicly available XML browser, JUMBO; and developing the Virtual HyperGlossary--an exploration of how the world community can create a virtual resource in terminology.


[1] An accompanying article by Peter Murray-Rust, "JUMBO: An Object-based XML Browser," is included in this issue as well. The JUMBO paper is more technical, and describes novel work in relating XML document structure to Java classes.
[2] Bernal's words, quoted in Sage, Maurice Goldsmith, p. 219.
[3] A recent and valuable review is, "Information Retrieval in Digital Libraries: Bringing Search to the Net," Bruce R. Schatz, Science, 275, pp. 327-334 (1997). (I shall comment on the format of the last sentence shortly.)
[4] Norbert Mikula's validating XML parser at http://www.edu.uni-klu.ac.at/~rmikula/NXP.
[5] See the article entitled "An Introduction to XML Processing with Lark," by Tim Bray.
[6] For more information on stylesheets, and particularly on W3C's cascading stylesheets, see the article entitled "XML and CSS" (Culshaw, Leventhal, and Maloney) in this issue. Also see the Winter 1997 issue of the W3J for the CSS1 specification as well as an implementation guide to the spec by Norman Walsh.
[7] Système International d'Unités: the international standard for scientific units.