Cataloging XML Vocabularies
I've been involved recently in many discussions and projects oriented around a simple and common question: "how do I create an XML vocabulary?" The formulation was often different -- "how do I create a namespace?" or "how do I publish an XML schema?" -- but the central issue was always about what infrastructure to create and which methods should be used to advertise the newly created vocabulary.
Analyzing the various organizational, technical, and marketing facets of this question, I realized that the development and publication of "XML vocabularies" (or namespaces or schemas) is just a variation on the better known issue of web publishing; that is, web publishing tools and techniques should be used and adapted to the publication of XML vocabularies.
Among these technologies, web crawlers and search engines are probably those which are missing the most from the XML community. The purpose of this article is to show a proof of concept of how such tools might be used.
Choosing an XML Vocabulary
Choosing an XML vocabulary today is a very challenging task, quite similar to finding a web page before the development of the big search engines. The main difference is that it can be much more harmful to choose a "wrong" vocabulary than a "wrong" page.
When I need to choose an XML vocabulary, I first want a comprehensive list of vocabularies which could meet my needs. Ideally, I would use a search engine like Google or AltaVista, but unfortunately, there is no specialized search engine for XML vocabularies.
To choose between those candidates, I need as much information as possible, and a directory such as DMoz or Yahoo would be of great value. Unfortunately, there are lots of "schema repositories" covering vocabularies developed by a number of disjoint communities, and this really doesn't help in comparing those vocabularies. Furthermore, these repositories often publish the descriptions provided by the authors, which usually lack the critical touch brought by the DMoz or Yahoo editors.
Finally, I find it very difficult to judge the dynamics of a vocabulary: to distinguish between a two-year-old specification, abandoned by its authors, whose usage is slowly declining, and a brand new one with sharply rising market adoption. Statistics such as those provided by the Netcraft surveys would be invaluable for this purpose.
If the bad news is that none of the tools I have mentioned are available, the good news is that most, if not all, of the information I need is available somewhere on the Web. In what follows I describe a solution to retrieve and present this information.
The Data Model
In order to capture information about XML vocabularies, I have opted for a very simple data model with only two classes (Document and Namespace), linked by a variety of relations:
- quotes: indicates that a namespace is "quoted" in a document but not declared or used in a way which conforms to the XML 1.0 and Namespaces in XML Recommendations.
- declares: indicates that a namespace is declared as specified in Namespaces in XML, without being used to qualify any element or attribute.
- uses: indicates a namespace is declared and used to qualify elements or attributes, but not used to qualify the root element.
- usesAsRoot: indicates that the namespace is declared and used to qualify the root element (and possibly other elements and attributes).
Other relations such as isSchemaFor or isTransformationFor will be defined in future releases.
The number of properties has been minimized, too; the Document class has only two properties: wellFormedXml and lastVisit. The Namespace class has no properties.
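As an illustration only, the data model described above can be sketched in a few lines of Python. The class and field names below are hypothetical renderings of the Document class, its two properties, and the four relations; since the Namespace class has no properties, a namespace is represented here simply by its URI string.

```python
from dataclasses import dataclass, field
from enum import Enum


class Relation(Enum):
    """The four document-to-namespace relations, from weakest to strongest."""
    QUOTES = "quotes"
    DECLARES = "declares"
    USES = "uses"
    USES_AS_ROOT = "usesAsRoot"


@dataclass
class Document:
    uri: str
    well_formed_xml: bool
    last_visit: str  # ISO 8601 timestamp, e.g. "2002-03-26T10:28:35Z"
    # Maps a namespace URI to the relation this document has with it.
    namespaces: dict = field(default_factory=dict)


# Part of the RDDL example from the article, expressed with this sketch.
doc = Document(
    uri="http://www.openhealth.org/RDDL/20020218/rddl-20020218.html",
    well_formed_xml=True,
    last_visit="2002-03-26T10:28:35Z",
)
doc.namespaces["http://www.w3.org/1999/xhtml"] = Relation.USES_AS_ROOT
doc.namespaces["http://www.rddl.org/"] = Relation.USES
```

Keeping a single relation per (document, namespace) pair is a simplification made for this sketch; the RDF representation itself does not enforce it.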
The RDF Schema for the data model reflects this simplicity: the relations are translated as sets of RDF properties included directly within their domain classes, without using any containers (which would have added triples and made the RDF querying more complicated than it should be).
For example, RDF describing the latest RDDL specification is as simple as:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <document xmlns="http://xmlns.info/descriptions/"
            rdf:about="http://www.openhealth.org/RDDL/20020218/rddl-20020218.html">
    <lastVisit>2002-03-26T10:28:35Z</lastVisit>
    <wellFormedXml>yes</wellFormedXml>
    <usesAsRoot rdf:resource="http://www.w3.org/1999/xhtml"/>
    <uses rdf:resource="http://www.w3.org/XML/1998/namespace"/>
    <uses rdf:resource="http://www.rddl.org/"/>
    <uses rdf:resource="http://www.w3.org/1999/xlink"/>
  </document>
</rdf:RDF>
The engine I used to gather XML vocabulary information is the multi-threaded C++ open source crawler Larbin. Larbin makes sure we behave as a good web citizen and takes care of the crawling itself, including link detection and duplicate management, and calls back a user-defined routine when a page has been retrieved.
The central piece of this process is the actual namespace discovery, which is done in two steps. The documents are first processed through a regular expression that detects constructs which look like namespace declarations, even in documents that are not well formed. When such occurrences are found, an attempt is made to parse the document using libxml, and if this attempt succeeds, an XSLT transformation is run using libxslt to perform a finer analysis of the document.
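The two-step discovery described above can be approximated with Python's standard library. This is a rough sketch rather than the crawler's actual code: it swaps libxml and the libxslt transformation for xml.etree.ElementTree, and the regular expression, the function names, and the exact classification rules are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

# Step 1: a loose pattern for anything that looks like xmlns="..." or xmlns:p='...'.
XMLNS_RE = re.compile(r"""xmlns(?::[\w.-]+)?\s*=\s*(["'])(.*?)\1""", re.DOTALL)


def namespace_of(name):
    """Return the URI of an ElementTree '{uri}local' name, or None if unqualified."""
    return name[1:].split("}", 1)[0] if name.startswith("{") else None


def discover(text):
    """Map each namespace URI found in text to quotes/declares/uses/usesAsRoot."""
    candidates = {m.group(2) for m in XMLNS_RE.finditer(text)} - {""}
    if not candidates:
        return {}  # nothing that looks like a declaration: skip the parse
    declared, used, root_ns, seen_root = set(), set(), None, False
    parser = ET.XMLPullParser(events=("start-ns", "start"))
    try:
        # Step 2: a real parse; 'start-ns' reports each namespace declaration.
        parser.feed(text)
        parser.close()
        for event, item in parser.read_events():
            if event == "start-ns":
                declared.add(item[1])  # item is a (prefix, uri) pair
            else:
                if not seen_root:
                    root_ns, seen_root = namespace_of(item.tag), True
                for name in (item.tag, *item.attrib):
                    used.add(namespace_of(name))
    except ET.ParseError:
        # Not well formed: all we can say is that the namespaces are quoted.
        return {uri: "quotes" for uri in candidates}
    result = {}
    for uri in (candidates | declared | used) - {"", None}:
        if uri == root_ns:
            result[uri] = "usesAsRoot"
        elif uri in used:
            result[uri] = "uses"
        elif uri in declared:
            result[uri] = "declares"
        else:
            result[uri] = "quotes"  # matched by the pattern, e.g. inside text
    return result
```

Running the cheap pattern first means the parser is only invoked on the small fraction of pages that mention a namespace at all, which matters when most crawled pages are plain HTML.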
In both cases, when namespaces are found, an RDF document similar to the RDDL example shown earlier is generated and stored. These documents can then be loaded into an RDF database or repository such as 4Suite for later use.
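Generating such an RDF description is straightforward. The sketch below, again in Python rather than the C++ of the crawler, builds a document shaped like the RDDL example; the function name, the "info" prefix chosen for serialization, and the "no" value for documents that are not well formed are assumptions, since the article only shows a "yes" case.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
INFO = "http://xmlns.info/descriptions/"

# Prefixes only affect how the output looks; "info" is an arbitrary choice here.
ET.register_namespace("rdf", RDF)
ET.register_namespace("info", INFO)


def describe(uri, last_visit, well_formed, relations):
    """Serialize one crawled document and its namespace relations as RDF/XML.

    relations maps a namespace URI to one of: quotes, declares, uses, usesAsRoot.
    """
    root = ET.Element(f"{{{RDF}}}RDF")
    doc = ET.SubElement(root, f"{{{INFO}}}document", {f"{{{RDF}}}about": uri})
    ET.SubElement(doc, f"{{{INFO}}}lastVisit").text = last_visit
    # "no" is an assumed value for the negative case.
    ET.SubElement(doc, f"{{{INFO}}}wellFormedXml").text = "yes" if well_formed else "no"
    for ns_uri, relation in relations.items():
        # Each relation becomes a property element pointing at the namespace URI.
        ET.SubElement(doc, f"{{{INFO}}}{relation}", {f"{{{RDF}}}resource": ns_uri})
    return ET.tostring(root, encoding="unicode")


print(describe(
    "http://www.openhealth.org/RDDL/20020218/rddl-20020218.html",
    "2002-03-26T10:28:35Z",
    True,
    {"http://www.w3.org/1999/xhtml": "usesAsRoot", "http://www.rddl.org/": "uses"},
))
```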
The RDF Database
We have already mentioned the RDF schema used for this proof of concept. The information gathered by the crawler is almost ready for use, but I chose not to describe the namespaces discovered in the documents, to avoid creating redundant descriptions in every document mentioning the same namespace. As a consequence, the discovered namespaces are not yet typed as namespaces once the documents have been loaded into an RDF database. A batch program needs to be written to add a type to untyped namespaces.
For this proof of concept, I have been using 4Suite and its query language Versa. Some features of Versa, such as its ability to define aggregates, were missing from the RDF query languages I had used in the past, and they are needed for this project.
Versa supports RDF Schema and can be used to provide a list of untyped namespaces:
filter(traverse(type(info:document), info:mentions, vtrav:any, vtrav:forward, vtrav:transitive), "not(. - rdf:type -> *)")
In this Versa query, the first argument of the filter function (traverse(type(info:document), info:mentions, vtrav:any, vtrav:forward, vtrav:transitive)) relies on our RDF schema to give a list of object resources linked to subjects with type info:document by any predicate which is a subproperty of info:mentions. The second argument restricts the results to those which have no type.
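For readers unfamiliar with Versa, the logic of this query as explained above can be restated over a plain list of triples. The Python sketch below is not Versa and does not use 4Suite; the triple representation, the prefixed names used as strings, and both function names are assumptions made purely to illustrate the reasoning.

```python
RDF_TYPE = "rdf:type"
SUBPROPERTY_OF = "rdfs:subPropertyOf"


def subproperties(triples, prop):
    """Return prop plus every property declared, directly or not, as its subproperty."""
    found, frontier = {prop}, [prop]
    while frontier:
        parent = frontier.pop()
        for s, p, o in triples:
            if p == SUBPROPERTY_OF and o == parent and s not in found:
                found.add(s)
                frontier.append(s)
    return found


def untyped_namespaces(triples):
    """Resources mentioned by a typed document that have no rdf:type themselves."""
    documents = {s for s, p, o in triples if p == RDF_TYPE and o == "info:document"}
    typed = {s for s, p, o in triples if p == RDF_TYPE}
    mentions = subproperties(triples, "info:mentions")
    return {
        o for s, p, o in triples
        if s in documents and p in mentions and o not in typed
    }


triples = [
    ("info:uses", SUBPROPERTY_OF, "info:mentions"),
    ("info:usesAsRoot", SUBPROPERTY_OF, "info:uses"),
    ("http://example.org/a.xml", RDF_TYPE, "info:document"),
    ("http://example.org/a.xml", "info:usesAsRoot", "http://www.w3.org/1999/xhtml"),
    ("http://example.org/a.xml", "info:uses", "http://www.w3.org/1999/xlink"),
    ("http://www.w3.org/1999/xlink", RDF_TYPE, "info:namespace"),
]
print(untyped_namespaces(triples))
```

In this made-up data set only the XHTML namespace is returned, because the XLink namespace already carries a type. The subproperty hierarchy and the info:namespace type shown here are illustrative and are not taken from the actual schema.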
This is already quite useful; however, the main application of an RDF query language in this project is to retrieve the data and present it to users.
Statistics about XML vocabularies are the first results that can be obtained from this proof of concept.
The first trial run retrieved 7693 documents, a number too low to be significant (especially since the starting point given to the crawler was http://xmlfr.org, a specialized site which should lead to an overestimated proportion of "XML namespaces aware" pages). Of these documents, 241 contained a mention of an XML namespace, and 85 different namespaces were found.
These statistics should be thought of as an example of the kind of conclusions that could be drawn, rather than representative.
Overall statistics and top 10 namespaces
                   Documents   Proportion   Proportion
                               (of total)   (of namespace aware)
Total documents         7693       100.0%
Namespace aware          241         3.1%       100.0%
Well formed               74         1.0%        30.7%
Not well formed          167         2.2%        69.3%
XHTML 1.0                161         2.1%        66.8%
MS Office                 23         0.3%         9.5%
HTML 4.0                  20         0.3%         8.3%
VML                       13         0.2%         5.4%
RDF                       12         0.2%         5.0%
Xlink                     12         0.2%         5.0%
MS Word                   11         0.1%         4.6%
XSLT                      11         0.1%         4.6%
Saxon                     10         0.1%         4.1%
Uuid (*)                   9         0.1%         3.7%
A "namespace aware" document is any document where the text xmlns[:xxx]='anything' or one of its variations has been found. Those documents are passed through an XML parser and can be well formed or not.
(*) The "Uuid" namespace is the namespace "uuid:C2F41010-65B3-11d1-A29F-00AA00C14882", mainly found in association with the MS Office namespace on the http://www.omg.org website.
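The two proportion columns in the overall statistics are simple ratios: the first is relative to all 7693 retrieved documents, the second relative to the 241 namespace aware ones. A short Python sketch (the function and constant names are illustrative) makes the computation explicit:

```python
TOTAL_DOCUMENTS = 7693
NAMESPACE_AWARE = 241


def proportions(count):
    """Return (share of all documents, share of namespace aware documents)."""
    return f"{count / TOTAL_DOCUMENTS:.1%}", f"{count / NAMESPACE_AWARE:.1%}"


# Document counts taken from the overall statistics.
for label, count in [("Well formed", 74), ("Not well formed", 167), ("XHTML 1.0", 161)]:
    print(label, *proportions(count))
```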
Detailed Statistics for the Top 10 Namespaces
XHTML 1.0   161   100.0%
Uses         72    44.7%
Quotes       89    55.3%
As expected, XHTML 1.0 (http://www.w3.org/1999/xhtml) is the top namespace found during our crawl. The documents that are just quoting the namespace are not well-formed, and the huge proportion (55%) clearly shows that publishing well-formed XML is far from trivial with the tools available today.
Interestingly enough, even specialized and professional sites such as the Dutch web site from the W3C, IBM services, Dublin Core, Infoteria, O'Reilly, xmlhack and my own XMLfr (to name a few) have been seen serving XHTML pages that are not well-formed.
MS Office   23   100.0%
Uses         0     0.0%
Quotes      23   100.0%
MS Office (urn:schemas-microsoft-com:office:office) is known to be, as Bob DuCharme has put it, "an ugly mix of XML, ill-formed HTML, scripts, and if statements inside of square braces", and it's no surprise that none of those documents are well-formed.
Sites exposing this namespace include institutional sites such as the District Court of Maine, the European Commission, the United States Department of Agriculture, or France's Ministere de l'Education.
HTML 4.0   20   100.0%
Uses        0     0.0%
Quotes     20   100.0%
Using the location of the HTML 4.0 Recommendation (http://www.w3.org/TR/REC-html40) as a namespace to identify "well formed HTML" used to be a common practice before the publication of XHTML 1.0. Some sites such as Zvon, different US District Courts, or crossref.org expose this namespace. This is probably a "leak" during XSLT transformations generating HTML documents which are, by definition, not well-formed XML. Generally speaking, these namespace leaks may prove quite useful for guessing the namespaces used internally to construct web pages.
VML      13   100.0%
Uses      0     0.0%
Quotes   13   100.0%
VML (urn:schemas-microsoft-com:vml) is another namespace used by Microsoft. Sites exposing this namespace include the National Defense Industrial Association and the United Nations Environment Program.
RDF      12   100.0%
Uses      0     0.0%
Quotes   12   100.0%
The documents exposing RDF (http://www.w3.org/1999/02/22-rdf-syntax-ns#) found during this crawl are either HTML documents exposing it as a leak, such as pages from the well known Zvon XSLT Tutorial or CNN Arabic, or RDF islands in HTML documents such as pages from the U.S. Equal Employment Opportunity Commission. These documents are either HTML (and thus not well formed) or just quoting the RDF namespace.
Xlink    12   100.0%
Uses      6    50.0%
Quotes    6    50.0%
All the documents simply quoting the XLink namespace (http://www.w3.org/1999/xlink) found during this crawl happened to be documentation mentioning XLink, such as Tim Bray's XNRL, Sean Palmer's XNGloss, or the University of Bath's guidelines for implementing Dublin Core. The six well-formed documents using XLink are RDDL documents including RDDL.org itself, my own Examplotron, and XSLTUnit.
MS Word   11   100.0%
Uses       0     0.0%
Quotes    11   100.0%
The MS Word namespace (urn:schemas-microsoft-com:office:word) is found associated with the MS Office namespace mentioned above, and the sites exposing it are pretty much the same.
XSLT     11   100.0%
Uses      2    18.2%
Quotes    9    81.8%
The two well formed documents exposing the XSLT namespace (http://www.w3.org/1999/XSL/Transform) are two XSLT transformations, while the other documents are mostly documentation mentioning XSLT, such as the netcrucible FAQ or James Clark's XT page.
Saxon    10   100.0%
Uses      0     0.0%
Quotes   10   100.0%
The Saxon namespace (http://icl.com/saxon) is most of the time a namespace leak in HTML documents, such as on the Systinet web site or David Carlisle's TEX manual. The remaining references are from documentation mentioning the namespace.
Uuid     9   100.0%
Uses     0     0.0%
Quotes   9   100.0%
Last of our top 10, uuid:C2F41010-65B3-11d1-A29F-00AA00C14882, is used in FrontPage web pages such as those of the OMG web site.