Chemical Markup Language
The audience for this paper need not be acquainted with SGML or XML; it serves as an introduction to the concept of document structure. As such, we assume no knowledge about markup languages, other than a familiarity with HTML. Though some parts may be trivially obvious to some readers, they may still find it useful as a tutorial aid for their colleagues. It is primarily aimed at those who are interested in authoring or browsing documents with the next generation of markup languages, especially those created with XML. CML [1] is part of the portfolio of the Open Molecule Foundation [2], which is a newly constituted open body to promote interoperability in molecular sciences. The latest versions of JUMBO can be found under the Virtual School of Molecular Sciences [3], which has also recently run a virtual course on Scientific Information Components using Java and XML [4].
The paper alludes to various software tools, but does not cover their operation or implementation. However, with the exception of stylesheets, most of the operations described here for CML have already been implemented as a prototype using the JUMBO browser and processor. The paper does not require any knowledge of chemistry or specific understanding of CML.
Finally, I should emphasize that SGML can be used in many ways; my approach does not necessarily do justice to the most common use, which is the management and publication of complex (mainly textual) documents. Projects in this area often involve many megabytes of data and industrial strength engines. I hope, however, that the principles described here will generally be of use.
The paper was originally written as a simple structured document, using HTML, although it could have been written in CML. I shall slant it towards those who wish to carry precise, possibly nontextual, information arranged in (potentially quite complex) data structures. While I use the term document, this could represent a piece of information without conventional text, such as a molecule. Moreover, documents can have a very close relation to objects; if you are comfortable with object-oriented languages you may like to substitute "object" for "document." In practice, XML documents can be directly and automatically transformed into objects, although the reverse may not always be quite so easy.
The markup I describe essentially uses the same syntax as HTML; it is the concepts, rather than the syntax that may be new. Although this paper is written in the context of document delivery over networks, markup is also ideally suited to the management of "traditional" documents. Markup languages are often seen as key tools in making them "future-proof" and interchangeable between applications (interoperability).
The important point about the XML approach is that it has been designed to separate different parts of the problem and to solve them independently. I'll explain these ideas in more detail below, but one example is the distinction between syntax (the basic rules for carrying the information components) and semantics (what meaning you put on the components and what behavior a machine is expected to perform). This is a much more challenging area than people realize, since human readers don't have problems with it.
One of the great polymaths of this century, J.D. Bernal, inspired the development of information systems in molecular science. In 1962 he urged that the problems of scientific information in crystallography (his own field) and solid state physics should be treated as one in communication engineering. Thirty years later we have most of the tools that are required to get the best information in the minimum quantity in the shortest time, from the people who are producing the information to the people who want it, whether they know they want it or not.[2]
Many scientists are unaware of the research during the last thirty years into the management of information.[3] In this review, Schatz shows that previous research in the analysis of complex documents, including hyperlinking, concept analysis, and vocabulary switching between disciplines, is now possible on a production scale. Much of his emphasis is on analysis of conventional documents produced by authors who have no knowledge of markup and who do not use vocabularies (domain ontologies). For that reason, complex systems such as natural language processing (NLP) are required to extract implicit information from the documents, and they rely on having appropriate text to analyze. Automatic extraction of numerical and other nontextual information will be much more difficult.
HTML
    HEAD
        TITLE
    BODY
        H1 (Chapter)
            H2 (Section)
                H3 (Subsection)
                H3
            H2
                P (Paragraph)
                P
                P
                P
                P
            H2
                P
                P
                P
            H2
                P
                P
            ... and so on ...
        ADDRESS
where I have used the convention of indentation to show that one component includes another. This is a common approach in many TOCs, and human readers will implicitly deduce a hierarchy from the above diagram. But a machine could not unless it had sophisticated heuristics, and it would also make mistakes.
The formal structure in this document is quite limited, and that is one of the reasons that HTML has been so successful but also increasingly insufficient. Humans can author documents easily and human readers can supply the implicit structure. But if you look again at the TOC diagram you will see that Chapters do not include Sections in a formal manner, nor do Sections include Paragraphs. The first occurrence of H2 and H3 is used for the author and affiliation, which is not a "Section."
An information component (an Element) contains another if the start-tag and end-tag of the container completely enclose the contained. Thus the HEAD element contains a TITLE element, and the TITLE element contains a string of characters (the SGML/XML term is #PCDATA). There is a formal set of rules in HTML for which elements can contain which other Elements and where they can occur. Thus, it's not formally allowed to have TITLE in the BODY of your document. These rules, which are primarily for machines and SGML gurus to read, are combined in a Document Type Definition (DTD).
This document has an inherent structure in the order of its Elements. Most people would reasonably assume that an H2 element "belongs to" the preceding H1, and that P elements belong to the preceding H2. It would be quite natural to use phrases like "the second sentence of the second paragraph in the section called Introduction." Although humans can do this easily, it's common to get lost in large documents. The important news is that XML now makes it possible for machines to do the same sort of thing with simple rules and complete precision. The Text Encoding Initiative (a large international project to mark up the world's literature) has developed tools for doing this, and they will be available to the XML community.
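The "belongs to the preceding heading" rule can be sketched in a few lines of Python. This is only an illustration of the kind of heuristic a machine needs for HTML, assuming that headings nest strictly by level and that every other element attaches to the most recent heading. The function name nest and the (name, children) tuples are hypothetical and not part of any HTML or XML tool.

```python
# Heading levels used by the heuristic; any other element name is treated as content.
LEVEL = {"H1": 1, "H2": 2, "H3": 3}

def nest(tags):
    """Group a flat sequence of element names into an implied hierarchy.

    Each node is a (name, children) tuple. A heading becomes a child of the
    nearest earlier heading with a smaller level; anything else becomes a
    child of the most recent heading (or of BODY if there is none yet).
    """
    root = ("BODY", [])
    stack = [(0, root)]  # (heading level, node), innermost open heading last
    for tag in tags:
        node = (tag, [])
        level = LEVEL.get(tag)
        if level is None:
            stack[-1][1][1].append(node)
            continue
        # Close every open heading at the same or a deeper level.
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1][1].append(node)
        stack.append((level, node))
    return root

tree = nest(["H1", "H2", "H3", "H3", "H2", "P", "P"])
print(tree)
```

The heuristic silently gives the wrong answer for documents like this one, where the first H2 and H3 carry the author and affiliation rather than a Section, which is exactly why explicit containment is preferable.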
Note
If you have already come across SGML and been put off for some reason, please don't switch off here. XML has been carefully designed to make it much easier to understand the concepts and there are many fewer terms. For example, you don't even have to have a DTD if you don't want one.
In HTML there are no formal conventions for what constitutes a Chapter or Section, and no restriction as to what elements can follow others. Therefore, you can't rely on analyzing an arbitrary HTML document in the way I've outlined. This highlights the need for more formal rules, agreements, and guidelines. In XML we are likely to see communities such as users of CML develop their own rules, which they enforce or encourage as they see fit. For example, there is no restriction on what order Elements can occur in a CML document, but there is a requirement that ATOMS can only occur within a MOL (molecule Element). (In CML I use the term "ChemicalElement" to avoid confusion!)
In the Schatz reference that is footnoted earlier, you will probably "know automatically" what the components are. The thing in brackets must be the year, "pp." is short for "pages," the bold type must be the volume, and the italics are the journal title. But this is not obvious to a machine; trying to write a parser for this is difficult and error-prone. Many different publishing houses have their own conventions. The Royal Society of Chemistry might format this as:
B.R. Schatz, Science, 1997, 275, 327.
Any error in punctuation such as missing periods causes serious problems for a machine, and conversions between different formats will probably involve much manual crafting.
The precise components of the reference, which are well understood and largely agreed within the bibliographic community, are a good example of something that can be enhanced by markup. Markup is the process of adding information to a document that is not part of the content but adds information about the structure or elements. Using the Schatz citation as an example, we can write:
<BIB>
  <TITLE>
    Information Retrieval in Digital Libraries: Bringing Search to the Net
  </TITLE>
  <JOURNAL>Science</JOURNAL>
  <AUTHOR>
    <FIRSTNAME>Bruce</FIRSTNAME>
    <INITIAL>R</INITIAL>
    <LASTNAME>Schatz</LASTNAME>
  </AUTHOR>
  <VOLUME>275</VOLUME>
  <YEAR>1997</YEAR>
  <PAGES>327-334</PAGES>
</BIB>
A scientist never having seen markup before would implicitly understand this information. The advantage is that it's also straightforward to parse it by machine. If the tags (<...>) and their content are ignored, then the remainder (content) is exactly the same as it was earlier (except for punctuation and rendering). It's often useful to think of markup as invisible annotations on your document. Many modern systems do not mark up the document itself, but provide a separate document with the markup. For example, you may not be allowed to edit a document but can still point to, and comment on, a phrase, section, chapter, etc. This is a feature of hypermedia systems, and one of the goals of XML is to formalize this through the development of linking syntax and semantics in XML-LINK (XLL), but this is outside the scope of this paper.
What is so remarkable about this? In essence we have made it possible for a machine to capture some of those things that a human takes for granted.
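To make "straightforward to parse by machine" concrete, here is a minimal sketch in Python using the standard-library XML parser. It reads the marked-up Schatz citation and regenerates the short journal-style form quoted earlier. The helper names text_of and format_short are hypothetical, and the formatting rules (initials from FIRSTNAME and INITIAL, first page only) are my assumptions about that style rather than a publisher's specification.

```python
import xml.etree.ElementTree as ET

BIB_XML = """<BIB>
<TITLE>
Information Retrieval in Digital Libraries: Bringing Search to the Net
</TITLE>
<JOURNAL>Science</JOURNAL>
<AUTHOR>
<FIRSTNAME>Bruce</FIRSTNAME>
<INITIAL>R</INITIAL>
<LASTNAME>Schatz</LASTNAME>
</AUTHOR>
<VOLUME>275</VOLUME>
<YEAR>1997</YEAR>
<PAGES>327-334</PAGES>
</BIB>"""

def text_of(parent, tag):
    """Return the stripped character content of the first child named tag, or ''."""
    child = parent.find(tag)
    return (child.text or "").strip() if child is not None else ""

def format_short(bib):
    """Render a BIB element as 'initials surname, journal, year, volume, first page.'"""
    author = bib.find("AUTHOR")
    initials = text_of(author, "FIRSTNAME")[:1] + "." + text_of(author, "INITIAL") + "."
    first_page = text_of(bib, "PAGES").split("-")[0]
    return "{} {}, {}, {}, {}, {}.".format(
        initials,
        text_of(author, "LASTNAME"),
        text_of(bib, "JOURNAL"),
        text_of(bib, "YEAR"),
        text_of(bib, "VOLUME"),
        first_page,
    )

bib = ET.fromstring(BIB_XML)
citation = format_short(bib)
print(citation)
```

No punctuation has to be guessed at: each component is found by name, so a different house style only needs a different formatting function.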
&#169;. It also has a symbolic representation (&copy;). XML itself has only a very few built-in character entities, but will support Unicode and other approaches to encoding characters. Most browsers do not yet support a wide range of glyphs for entities, but this is likely to change very rapidly, especially since languages like Java have addressed the problem.
I'll explain the "meta-" concept using XML and then show how it extends to applications such as TecML. XML, despite its name, is not a language but a meta-language (a tool for writing languages). XML is a set of rules that enable markup languages to be written; TecML and CML are two such languages. For example, one rule in XML is: every non-empty element must have a start-tag and an end-tag; so that the <AUTHOR> tag must be balanced by a </AUTHOR> tag. This is not a strict requirement of HTML, which uses a more flexible set of rules (but is also harder to parse or read by machine). Another rule is: all attribute values must occur within quotes (" or '). Writing a markup language is somewhat analogous to writing a program, and the relation of XML to CML is much the same as C to hello.c. We say that CML "is an application of XML," or "is written in XML," just as "hello.c is written in C." XML is a little stricter than HTML in the syntax it allows, but the benefit is that it's much easier to write browsers and other applications.
XML allows for two sorts of documents: valid and well-formed. Validity requires an explicit set of rules in a DTD. This is usually a separate file, but part or all can be included in the document itself. An example of a validity criterion in HTML is that LI (a ListItem) must occur within a UL or OL container. Well-formedness is a less strict criterion and requires primarily that the document can be automatically parsed without the DTD. The result can be represented as a tree structure. The bibliographic example above is well-formed, but without a DTD we cannot say whether it is valid. The DTD might, for instance, have contained an explicit rule such as "the author must include an element describing the language that the article was written in, such as <LANGUAGE>EN</LANGUAGE>"; in that case, the document fragment would be invalid.
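The difference between the two criteria can be illustrated with a short Python sketch. The standard-library parser used here checks well-formedness only and does not validate against a DTD, so the LANGUAGE rule is written by hand as a stand-in for what a DTD would express. All function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def parse_if_well_formed(text):
    """Return the root element, or None if the text is not well-formed XML."""
    try:
        return ET.fromstring(text)
    except ET.ParseError:
        return None

def outline(elem, depth=0):
    """Return the parsed tree as indented lines, one element name per line."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

def obeys_language_rule(bib):
    """Hand-written stand-in for a DTD rule: a BIB must contain a LANGUAGE child."""
    return bib.find("LANGUAGE") is not None

fragment = "<BIB><JOURNAL>Science</JOURNAL><AUTHOR><LASTNAME>Schatz</LASTNAME></AUTHOR></BIB>"
broken = "<BIB><JOURNAL>Science</BIB>"  # JOURNAL has no end-tag

root = parse_if_well_formed(fragment)
print("\n".join(outline(root)))
print("obeys LANGUAGE rule:", obeys_language_rule(root))
print("broken is well-formed:", parse_if_well_formed(broken) is not None)
```

The fragment parses into a tree (so it is well-formed) yet fails the extra rule, while the broken text never gets as far as a tree at all.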
The importance of validity will depend on the philosophy of the community using XML. In molecular science all *.cml documents will be expected to be valid and this is ensured by running them through a validating parser such as NXP.[4] If a browser or other processing application such as a search engine can assume that a certified document is valid (perhaps from a validation stamp), it does not need a validating parser of its own. Being valid doesn't mean the contents are necessarily sensible; further processing may be needed for that purpose.
Where, and how, you enforce validity depends on what you are trying to do. If you are providing a form for authors to submit abstracts, you will enforce fairly strict rules. ("It must have one or more AUTHORs, exactly one ADDRESS for correspondence, and the AUTHOR must contain either a FIRSTNAME or INITIALS but not both.") This can be enforced in a DTD. But this would be too restricting for a general scientific document, which need not always have an AUTHOR. The two forces of precision and flexibility often conflict, but can be reconciled to a large extent by providing different ways of processing documents.
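The abstract-submission rules quoted above would normally live in a DTD. As a rough sketch of what they amount to, the Python below checks them by hand on a parsed tree; the enclosing ABSTRACT element and the function name check_abstract are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

def check_abstract(root):
    """Return a list of rule violations (empty when the abstract passes)."""
    problems = []
    authors = root.findall("AUTHOR")
    if len(authors) < 1:
        problems.append("need one or more AUTHOR elements")
    if len(root.findall("ADDRESS")) != 1:
        problems.append("need exactly one ADDRESS")
    for number, author in enumerate(authors, start=1):
        has_first = author.find("FIRSTNAME") is not None
        has_initials = author.find("INITIALS") is not None
        # Exactly one of the two must be present.
        if has_first == has_initials:
            problems.append("AUTHOR %d needs FIRSTNAME or INITIALS, but not both" % number)
    return problems

good = ET.fromstring(
    "<ABSTRACT><AUTHOR><FIRSTNAME>Bruce</FIRSTNAME><LASTNAME>Schatz</LASTNAME></AUTHOR>"
    "<ADDRESS>hypothetical address</ADDRESS></ABSTRACT>"
)
bad = ET.fromstring(
    "<ABSTRACT><AUTHOR><FIRSTNAME>Bruce</FIRSTNAME><INITIALS>B.R.</INITIALS></AUTHOR></ABSTRACT>"
)

good_problems = check_abstract(good)
bad_problems = check_abstract(bad)
print(good_problems)
print(bad_problems)
```

A general scientific document would simply skip this check, which is one way of providing "different ways of processing documents" without loosening the rules for submissions.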
XML documents can be created, processed, and displayed in many ways. The schematic diagram in Figure 1 (which emphasizes the tree structure) shows some of the possible operations.
The lefthand module shows parts of the editing process. Legacy documents can be imported and converted on the fly, and the tree can be edited. There will normally also be a module for editing text. The editor may have access to a DTD and can therefore validate the document as it is created. An important aspect of XML-LINK is that editors should be able to create hyperlinks, either internally or to external files.
The complete document will then be mounted on a server. This will associate it with stylesheets, Java classes, the DTD, entities, and other linked components. The packaged documents are then delivered to the client where the application requires an XML parser. If the client wishes to validate the document the DTD is required.
Many XML applications will then hold the parsed document in memory as a tree (or grove) which can then be further processed. A frequent method will be the delivery of DSSSL stylesheets with the document (or provided client-side), or other transformation tools (perhaps written in Perl). Alternatively, the components of the document may be associated with Java classes either for display or transformation (as in the JUMBO browser). All of these methods may involve semantic validation (such as "does the document contain sensible information?").
Some of the operations required in processing XML are now explained in more detail:
Authoring
Figure 1
&pmraddress;. With appropriate software I can include this at appropriate places and the software will include the full content of the entity. (If the entity contains references to other entities, they are also expanded, and so on.)
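Entity expansion can be seen with any XML parser. In this minimal Python sketch the entity pmraddress is declared in the document's internal DTD subset and the standard-library parser expands it, including the nested reference. The NOTE element, the place entity, and the replacement text are all invented placeholders, not the real content of the entity.

```python
import xml.etree.ElementTree as ET

# The internal subset declares two entities; pmraddress refers to place.
DOCUMENT = """<!DOCTYPE NOTE [
  <!ENTITY place "Example Town">
  <!ENTITY pmraddress "Example Department, &place;">
]>
<NOTE>Contact: &pmraddress;</NOTE>"""

note = ET.fromstring(DOCUMENT)
expanded = note.text  # the parser has already replaced both references
print(expanded)
```

The author types the short reference once per use; changing the declaration changes every place where it is expanded.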
display() method, which could be implemented differently from object to object. Thus, in JUMBO, MOLNode.display() brings up a rotatable screen display of the molecule, while BIB.display() displays each citation in a mixture of fonts. As with stylesheets, Java classes can be specified at any of the four places listed above, and the appropriate one downloaded from a Web site if required. One of the problems the XML-WG is tackling and solving is how to locate Java classes. Because Java is a very powerful programming language with full WWW support, it offers almost unlimited scope for XML applications. A document need not be passive, but could awake the client to take a whole series of actions--mailing people, downloading other data, and updating the local database are examples.
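The idea of giving each element type a class with its own display() method can be sketched as follows. JUMBO does this with Java classes; the Python below is only an analogy, and the class names, the TITLE attribute on MOL, and the REGISTRY lookup are assumptions made for illustration rather than JUMBO's actual design.

```python
import xml.etree.ElementTree as ET

class Node:
    """Fallback behaviour for element types that have no class of their own."""
    def __init__(self, element):
        self.element = element

    def display(self):
        return "<%s>" % self.element.tag

class MOLNode(Node):
    def display(self):
        # A real implementation would open a rotatable molecular view here.
        return "[molecule display: %s]" % self.element.get("TITLE", "untitled")

class BIBNode(Node):
    def display(self):
        # A real implementation would render the citation in a mixture of fonts.
        journal = self.element.findtext("JOURNAL", default="")
        year = self.element.findtext("YEAR", default="")
        return "%s (%s)" % (journal, year)

# Maps element type names to the class that knows how to display them.
REGISTRY = {"MOL": MOLNode, "BIB": BIBNode}

def node_for(element):
    """Wrap an element in the class registered for its type, or in Node."""
    return REGISTRY.get(element.tag, Node)(element)

doc = ET.fromstring(
    '<DOC><MOL TITLE="water"/>'
    "<BIB><JOURNAL>Science</JOURNAL><YEAR>1997</YEAR></BIB>"
    "<P>text</P></DOC>"
)
shown = [node_for(child).display() for child in doc]
print(shown)
```

The calling code never needs to know which element types exist; adding a new one means registering one more class.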
Attributes are semantically free in the same way as Elements, and can be used with stylesheets or Java classes to vary their meaning.
Whether Elements or attributes are used to convey markup is a matter of preference and style, but in general the more flexible the document the more I would recommend attributes. As a point of style, many people suggest that document content should not occur in attributes, but this is not universal. Here are some simple examples of the use of attributes:
<A HREF="http://www.venus.co.uk/omf/cml/">
<IMG SRC="mypicture.gif" WIDTH="500" HEIGHT="100">
If flexibility is more important, either because the field is evolving or because it is very broad, a rigid DTD may restrict development. In that case a more general DTD is useful, with flexibility being added through attributes and their values.
In TecML I created an Element type, XVAR, for a scalar variable. Attributes are used to tune the use and properties of XVAR, and it's possible to make it do "almost anything"! For example, it can be given a TYPE such as STRING, FLOAT, DATE, and TITLE. In this way, any number of objects can be precisely described. Here are three examples:
<XVAR TYPE="STRING" TITLE="Greeting">Hello world!</XVAR>
<XVAR TYPE="DATE">2000-01-01</XVAR>
<XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Fahrenheit">451</XVAR>
The last is particularly important because it uses the concept of linking to add semantics. This is an important feature of XML; the precise syntax is being developed in XML-LINK. CML uses DICTNAME to refer to an entry in a specified glossary that defines what "Melting Point" is. This entry could have further links to other resources, such as world collections of physical data. Similarly, UNITS is used to specify precisely what scale of temperature is used. Again, this is provided by a glossary in which SI[7] units are the default.
By using this approach it is possible to describe any scalar variable simply by varying the attributes and their values. Note that the attribute types must be defined in the DTD but their values may either be unlimited or can be restricted to a set of possible values.
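A processor can use the TYPE attribute to decide how to interpret the content of each XVAR. The Python below is a minimal sketch of that idea for the three examples given, assuming an enclosing DATA element (which I have invented so that the fragment has a single root) and covering only the STRING, DATE, and FLOAT types. The names CONVERTERS and read_xvar are hypothetical, and no glossary lookup for DICTNAME or UNITS is attempted.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

def parse_date(text):
    """Interpret content of the form YYYY-MM-DD as a calendar date."""
    return datetime.strptime(text, "%Y-%m-%d").date()

# One converter per TYPE value handled by this sketch.
CONVERTERS = {"STRING": str, "FLOAT": float, "DATE": parse_date}

def read_xvar(xvar):
    """Turn an XVAR element into a typed value plus its descriptive attributes."""
    convert = CONVERTERS[xvar.get("TYPE")]  # KeyError for a TYPE not listed above
    return {
        "value": convert(xvar.text),
        "name": xvar.get("DICTNAME") or xvar.get("TITLE"),
        "units": xvar.get("UNITS"),
    }

DATA = """<DATA>
<XVAR TYPE="STRING" TITLE="Greeting">Hello world!</XVAR>
<XVAR TYPE="DATE">2000-01-01</XVAR>
<XVAR TYPE="FLOAT" DICTNAME="Melting Point" UNITS="Fahrenheit">451</XVAR>
</DATA>"""

variables = [read_xvar(x) for x in ET.fromstring(DATA).findall("XVAR")]
for variable in variables:
    print(variable)
```

The Element type never changes; supporting a new kind of scalar means adding one entry to the table of converters.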
The TecML DTD uses very few Element types, and these have been carefully chosen to cover most of the general concepts that arise in technical subjects. They include ARRAY, XLIST (a general tool for data structures such as tables and trees), FIGURE (a diagram), PERSON, BIB, and XNOTATION. (NOTATION is an XML concept which allows non-XML data to be carried in a document, and is therefore a way of including "foreign" file types.) With these simple tools and a wide range of attributes it is possible to mark up most technical scientific publications. There has to be general agreement about the semantics of the markup, of course, but this is a great advance compared with having no markup at all.
Note
In the preceding example the links are implicit; later versions of CML will probably use the explicit links provided by XML-LINK.
Many documents involve more than one basic discipline. For example, a scientific paper may include text, images, vector graphics, mathematics, molecules, bibliography, and glossaries. All of these are complex objects and most have established SGML conventions. Authors of these documents would like to reuse these existing conventions without having to write their own (very complicated) DTDs. The XML community is actively creating the mechanisms for doing this. If components are mixed within the same document, their namespaces must be identified (e.g., "this component obeys the MathML DTD and that one obeys CML"). For example, all the mathematical equations could be held in separate entities, and so could the molecular formulae. This would also support another method of combining components through XML-LINK, where the components are accessed through the HREF syntax.
In many cases simply creating well marked-up documents may be all that is required for their use in the databases of the future. The reason for this confident statement is that structured documents (SDs) provide a very rich context for individual Elements. Thus we can ask questions like:
ROOT,DESCENDANT(1,MOLECULE) DESCENDANT(1,MOLECULE)
ROOT,DESCENDANT(DATASET)CHILD(1,MOLECULE)ANCESTOR(1,DATASET)CHILD(1,SPECTRUM,TYPE,"nmr")
The first finds the first MOLECULE, which is a descendant of the root of the document, and then the first MOLECULE, which is somewhere in the subtree from that. The second is more complex, and requires the MOLECULE and SPECTRUM to be directly contained within the DATASET element. (The details of TEI Xpointers in XML may still undergo slight revision and are not further explained here.)
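The first locator can be approximated on a parsed tree with a few lines of Python. This is a rough sketch of the idea only: it assumes that DESCENDANT(n, NAME) means the n-th element of that name, in document order, strictly below the current element, and the exact TEI semantics may differ. The function name descendant, the DOC wrapper, and the ID attributes are invented for the example.

```python
import xml.etree.ElementTree as ET

def descendant(elem, n, tag):
    """Return the n-th (1-based) descendant of elem named tag, in document order, or None."""
    count = 0
    for node in elem.iter(tag):
        if node is elem:
            continue  # iter() also yields elem itself; a descendant must lie below it
        count += 1
        if count == n:
            return node
    return None

DOC = """<DOC>
  <MOLECULE ID="outer">
    <MOLECULE ID="inner"/>
  </MOLECULE>
  <MOLECULE ID="other"/>
</DOC>"""

root = ET.fromstring(DOC)
first = descendant(root, 1, "MOLECULE")    # ROOT,DESCENDANT(1,MOLECULE)
nested = descendant(first, 1, "MOLECULE")  # ... DESCENDANT(1,MOLECULE)
print(first.get("ID"), nested.get("ID"))
```

Each step starts from the element found by the previous one, which is what lets a locator address a component by its context rather than by a character offset.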
How will XML develop in practice? A natural impetus will come from those people who already use SGML and see how it could be used over the WWW. It is certainly something that publishers should look at very closely, as it has all the required components--including the likelihood that solutions will interoperate with Java.
XML is the ideal language for the creation and transmission of database entries. The use of entities means it can manage distributed components, it maps well onto objects, and it can manage complex relationships through its linking scheme. Most of the software components are already written.
How would it be used with a browser? Assuming that the bulk of tools are written in Java, we can foresee helper applications or plug-ins, and perhaps there will be more autonomous tools that are capable of independent action. It's an excellent approach to managing legacy documents rather than writing a specific helper for each type.
I hope enough tools will be available for XML to provide the same creative and expressive opportunities as HTML provided in the past. However, it's important to realize that freely available software is required--any tools for structured document management, especially in Java, will be extremely welcome. The accompanying paper describes my own contribution through the JUMBO browser.
Peter's research interests in molecular informatics include participation in the Open Molecule Foundation--a virtual community sharing molecular resources; developing the use of Chemical MIME for the electronic transmission of molecular information; creating the first publicly available XML browser, JUMBO; and developing the Virtual HyperGlossary--an exploration of how the world community can create a virtual resource in terminology.
XML.com Copyright © 1998-2006 O'Reilly Media, Inc.