XML.com

Topic Maps Now

June 23, 2017

Michel Biezunski

Topic Maps is a way of organizing information that is seldom in the news these days. One of its foremost practitioners describes topic maps, their relationship to some other technologies, and his assessment of their current status.

This article is my assessment of where Topic Maps stand today. There is a striking contradiction between the fact that many web sites are organized as a set of interrelated topics — Wikipedia, for example — and the fact that the name "Topic Maps" is hardly ever mentioned. In this paper, I will show why this is happening and argue that the notions of topic mapping are still useful, even if they need to be adapted to new methods and systems. Furthermore, this flexibility is in itself a guarantee that they will remain relevant in the long term.

I have spent many years working with topic maps. I took part in the design of the initial Topic Maps model, and I started the process of transforming the conceptual model into an international standard. We published the first edition of Topic Maps, ISO/IEC 13250, in 2000, and an updated edition, expressed in XML, a couple of years later. Several other additions to the standard have been published since then, the most recent one in 2015. During the last 15 years, I have helped clients create and manage topic map applications, and I still do.

What are Topic Maps?

A topic map is a computer-readable graph, or network, of interconnected subjects. Each subject is represented by a computer proxy called a topic. This arrangement provides independence between the various information sources and the knowledge management layer. The Topic Maps standard has two main purposes: to support interoperability between tools used to create and manage topic maps, and to enable the interchange of topic maps across organizations.

The Topic Maps data model is a structured way to describe a topic as a composite object. Topics are computer objects, each uniquely representing a subject of conversation. Altogether, they constitute a topic-mapped space, which aims at capturing the semantic organization of an information repository. Topics have properties such as names, types, and occurrences in sources. Topics can be related to one another through a graph of relationships whose semantics can be entirely defined by the users. For example, an index appears as a list of topic names, alphabetically sorted, each accompanied by pointers to the occurrences in the source documents (e.g., page numbers or URLs). A glossary can be described as a list of topic names together with the occurrences playing the role of "definitions", where the content of each occurrence is shown. A cross-reference can be described as a link between two occurrences of the same topic. The data model thus amounts to a method for treating the traditional navigational aids as pre-resolved queries against a topic database.
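
To make this concrete, here is a minimal sketch in Python of the ideas just described: topics with names, types, and occurrences, associations with user-defined semantics, and an index rendered as a pre-resolved query. The class and field names are invented for illustration; this is not the standard's data model or any particular tool's API.

```python
from dataclasses import dataclass, field

@dataclass
class Topic:
    # A topic is a proxy for one subject; its properties belong to the topic,
    # even when they were originally drawn from the sources.
    name: str
    topic_type: str
    occurrences: list = field(default_factory=list)  # pointers into sources (URLs, page numbers)

@dataclass
class Association:
    # A relationship between two topics; its semantics are user-defined.
    relation: str
    source: Topic
    target: Topic

topics = [
    Topic("Golden Gate Bridge", "bridge", ["https://example.org/guide#p12"]),
    Topic("San Francisco", "city", ["https://example.org/guide#p3",
                                    "https://example.org/guide#p12"]),
]
associations = [Association("located in", topics[0], topics[1])]

# An index is then a pre-resolved query: topic names, alphabetically sorted,
# each followed by its occurrences in the sources.
for t in sorted(topics, key=lambda t: t.name):
    print(f"{t.name} ({t.topic_type}): {', '.join(t.occurrences)}")

# A "related topics" list or cross-reference is another pre-resolved query.
for a in associations:
    print(f"{a.source.name} -- {a.relation} -- {a.target.name}")
```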

An essential feature of topic mapping is the independence between the semantic assignment of topics and the sources to which they refer. The topics are not part of the sources; instead, they point to the sources, and their properties, although some may be inherited from the sources, become their own. Therefore, if the sources are modified, the topics retain most of their characteristics.

Several tools have been created to help build and edit topic maps. Some of them were free, some open source, some proprietary. The committee that developed the standard initially included both information owners and developers, but over time its membership shifted toward a higher proportion of developers. Their focus has been on optimizing access to the Topic Maps interchange format, known as XTM (XML Topic Maps), and on creating stricter interoperability with a modified version called "Canonical XTM" and a lighter version called "Compact XTM". XTM version 2 replaced XTM version 1. The reasoning was that once you create a topic map compliant with the standard, you can interchange your topic map with others.

The Topic Maps standard enjoyed some early success in adoption. It has been experimented with and used in various countries, including in Europe, particularly in Norway, which hosted a Topic Maps conference for several years, as well as in America and Asia. Topic mapping became a research domain in academia: a Topic Maps lab was created in Leipzig, Germany. Many presentations on Topic Maps were made at various XML conferences. The standard has been used in governmental and intergovernmental organizations throughout the world. But overall, the success was somewhat limited, and the use of Topic Maps seems to be declining, which is surprising given that many web sites are organized around topics. Most of the tools are no longer maintained, and the activity in the ISO standard working group has decreased accordingly.

Many things have indeed changed. First, when we started to brainstorm in 1992 about what became the Topic Maps model, the Web didn't even exist. HTML was a draft then, and the Internet was used only by a handful of scientists and technologists. Most publications were still issued on paper or CD-ROMs. We started designing topic maps in an informal working group called Davenport, which also turned out to be at the origin of the SGML/XML-based "Docbook" document architecture. At the time, we were focusing on a generic model to provide publishers with interoperable navigational aids, such as indexes, glossaries, and thesauri. We used HyTime, a Hypermedia/Time-based Structuring Language that was designed to become a successor of SGML but failed to deliver on that promise, and instead became an inspiration for several models and languages, including the Document Object Model and XPath. We focused on the concept of "independent linking", which enabled a link to be described independently of its sources, and that became the founding principle of the "topic" model.

Important things were happening at that time. The Web emerged, and changed the world. The need to improve communication and to rationalize information sharing gave rise to the Semantic Web, built on top of the Resource Description Framework (RDF), a graph-based architecture for describing the semantics of information. RDF and Topic Maps partially overlapped, but the general philosophies differed: the RDF community favored automatic semantic data processing using techniques inherited from artificial intelligence, whereas the Topic Maps community focused on empowering subject matter experts. However, this distinction does not completely reflect what happened, because the RDF community reached out to librarians with the "Dublin Core" metadata set, while some Topic Maps applications were built using automatic acquisition processes.

The Linked Open Data specification, based on RDF, assigns addresses to topics in the Web space by means of unique Uniform Resource Identifiers (URIs). It aims at creating a wide range of topics that can be shared across many different web sites and applications. Taxonomy and ontology software implementing RDF and the Web Ontology Language (OWL), for example Protégé, has been widely adopted.

Other approaches have been created to manage information according to topics, such as the Darwin Information Typing Architecture (DITA), an XML architecture used in industry for technical documentation (https://www.oasis-open.org/standards#ditav1.3). DITA is preferred over Docbook when documents are broken into "micro-documents", each describing one specific topic. Docbook is used for book-like documents, with chapters, where the information is presented linearly.

Although XML has become a lingua franca for publishing and data interchange, its usage has decreased among information technology professionals, who now tend to prefer JSON for data interchange, especially in situations where the data structure is straightforward. On the publishing side, the decreased need to print, the increased quality of web rendition with successive iterations of the CSS specification, and the possibility of adding custom attributes in HTML5 enable the creation of web-based publishing applications that offer structuring features similar to XML's. Web-based interfaces are merging with the applications created for mobile platforms, and together they have become the dominant publishing architecture. The trend towards smaller content pages, as opposed to whole books, accentuates topic-centric publishing units. In a way, this is a fulfillment of the Topic Maps promise, which consists of grouping all content and relevant links on a given subject in one location. In the domain of technical documentation, the DITA approach is popular because it offers ready-to-use structures and tools for organizing topic pages. The openness of Topic Maps enables more flexibility and variation than DITA, but it imposes more modular work upfront, with tools that are less focused on the specificities of technical documentation.

The relation between XML and Topic Maps is not as direct as it is for these other approaches. Docbook and DITA rely on XML because they are XML schemas. Since Topic Maps use XML only as one possible interchange syntax, the dependency is not as tight, especially if the interchange syntax is merely an output format.

The burgeoning of non-standard topic maps

The abundance of unstructured information available online has created a need for searching information across web pages. Search technologies were developed, ultimately dominated by Google. These technologies use powerful automated algorithms, which started with full-text indexing and developed into much more sophisticated products involving complex natural language processing. Search engines provide results that are organized in a topic map-like fashion. For example, let's look for a subject, "San Francisco", using three major search engines: Google, Bing, and DuckDuckGo. The search engines deliver more than a list of links to relevant web pages.

The Google search page displays, in a top menu, filtered lists of hits, organized by source type ("Maps", "News", "Images", "Videos"), by category ("More" -- including "Shopping", "Books", "Flights" and "Personal"), or by date or language (under "Search tools"). It also displays a box giving basic information about San Francisco, including hyperlinked information on related topics, for example neighborhoods. This box is, as far as we can tell, part of the "Google Knowledge Graph", which was added to Google Search after Google acquired a company called Metaweb in 2010. Metaweb developed "Freebase", an "open, shared database of the world's knowledge", which was explicitly constructed using the concepts of the Topic Maps standard. Freebase was absorbed a couple of years ago into Wikidata, an open, collaborative knowledge repository used to populate Wikipedia, among other sites.

Bing displays a page similar to Google's. The box containing information about San Francisco is extracted from Wikipedia and Freebase.

DuckDuckGo is somewhat different. The filtered lists contain, as with Google and Bing, "Web", "Images", "Video" and "News", but also "Meanings", which opens a submenu containing: "Top", "Places Within San Francisco, California", "Places", "Popular Culture", "Games", "Music", "Transportation", "Other uses" and "See Also". Each item is followed by the number of corresponding occurrences, which are displayed below as links within boxes in a carousel.

Graph databases

Wikidata defines itself as a "document-oriented database", or "document store" powered by the "Wikibase" software. It also has the characteristics of a "graph database", because it uses nodes and edges to represent and store data.

The Wikidata page for San Francisco displays a rich amount of information. A first box shows the names used for San Francisco in multiple languages, as well as alternative names, including for example "SFO" or "The Golden City". Then a number of related topics appear under the label "statements", and the semantics of each relation are displayed together with the related topic. The list is too long to be cited here, but it contains "instance of", "coordinate location", "sister city", "shares border with", "population", and "head of government". Some of the related topics are further qualified. For example, the population of the city has multiple values depending on the point in time and the determination method (estimation, census).
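
The shape of such a qualified statement can be pictured with a small sketch. The following Python dictionary is only an illustration; it is not Wikidata's actual JSON format, and the figures and property labels are placeholders.

```python
# A sketch of a Wikidata-style item with qualified statements.
# Property names and values are illustrative, not real Wikidata data.
san_francisco = {
    "labels": {"en": "San Francisco", "es": "San Francisco", "zh": "旧金山"},
    "aliases": ["SF", "The Golden City"],
    "statements": {
        "instance of": ["city"],
        "shares border with": ["Daly City", "Brisbane"],
        "population": [
            {"value": 805_235, "qualifiers": {"point in time": "2010",
                                              "determination method": "census"}},
            {"value": 870_000, "qualifiers": {"point in time": "2016",
                                              "determination method": "estimation"}},
        ],
    },
}

# The same relation ("population") carries several values, each qualified
# by a point in time and a determination method.
for claim in san_francisco["statements"]["population"]:
    q = claim["qualifiers"]
    print(f'population {claim["value"]} ({q["determination method"]}, {q["point in time"]})')
```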

This example implements the concept of topic mapping at a very large scale, and it is widely used. However, again, it doesn't refer to the Topic Maps standard per se. On the other end of the spectrum, an "open source topic map-based graph database" is being developed in Norway as a personal research project. The author, Brett Alistair Kromkamp, indicates in his blog that TopicDB is based on the topic maps paradigm but is not an implementation of the ISO/IEC 13250 Topic Maps standard (http://www.storytechnologies.com/blog/).

Our experience reinforces this trend. When we started to develop Taxmap, a topic map project for the US Internal Revenue Service, we provided, as output, an XML representation of the topic maps that was compliant with the standardized format. But we explicitly designed it as output only, and we used various techniques to create and manage a graph-based application that captured our customers' data, independently of the standard data model representation. Two things happened. First, we realized after a couple of years that the interchange format was a non-requirement. Second, we found that, despite its genericity, the Topic Maps model failed to capture some subtleties in the information we had to describe.

We started to experience a greater distance between the standard data model and the semantic structure of the information we were dealing with. The data model proved in some cases too rigid to capture the essence of the information being described. The distinction between name, topic type, occurrence, and association, which seemed pretty clear when we designed the standard, became blurry in specific situations that we encountered. For example, a person considered as a topic can have a first name, a middle name, a last name, a suffix, a given name, a surname. Which of those should be used as the name of the topic representing the person? In the Topic Maps model, we could interchangeably use variant names, the "scope" property, or specialized occurrences. There was no real guidance toward a "canonical" standard representation optimized for interchange, because of the openness of the standard. Although this gives more freedom for designing each specific application, it further endangers the feasibility of interchange between different topic maps, should the need exist.
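
As an illustration of that ambiguity, here are two of the representations just mentioned, sketched as plain Python dictionaries. Both shapes are hypothetical; neither is the standard's interchange syntax, and the standard does not designate either one as canonical.

```python
# Option 1: one base name, plus variant names distinguished by a "scope".
person_v1 = {
    "base_name": "John Smith Jr.",
    "variants": [
        {"value": "Smith, John, Jr.", "scope": "sort"},
        {"value": "Johnny Smith",     "scope": "informal"},
    ],
}

# Option 2: name parts carried as specialized occurrences of the topic.
person_v2 = {
    "base_name": "John Smith Jr.",
    "occurrences": {"given name": "John", "surname": "Smith", "suffix": "Jr."},
}

# Both are legitimate readings of the model; a topic map using one shape
# cannot be merged mechanically with one using the other without an
# agreed-upon mapping between them.
print(person_v1["base_name"] == person_v2["base_name"])  # True, yet the structures differ
```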

In the topic map application we are working on for the Internal Revenue Service, some documents are forms. In common topic maps terms, a document appears as an occurrence of a topic. For example, "Form 1040" is an occurrence of the topic "Individual Income Tax Return". But we decided to treat the forms themselves as topics. Form 4868, which is used to apply for an automatic extension of time to file an individual tax return, is related to "Form 1040" as a topic. In practical terms, if, as a taxpayer, you are in the process of filing a 1040, you should see immediately that another form, 4868, is related to it. In the end, four topics are involved in this example: the subjects themselves (e.g., "automatic extension to file..." and "individual income tax return"), and the forms ("1040" and "4868"). Since the forms are the "user interface" by which taxpayers communicate with the IRS, the ability to handle forms as topics and to create a semantic network of related forms is one of the most interesting, and unique, aspects of this topic map. And that was something that was not anticipated as a feature in the standard.
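
A small sketch of the four topics and the relations between them, in Python. The topic identifiers, types, and relation names are invented for illustration; this is not the actual data model of the IRS application described above.

```python
# The four topics from the example: two subjects and two forms.
topics = {
    "individual-income-tax-return": {"type": "subject",
                                     "name": "Individual Income Tax Return"},
    "automatic-extension-to-file":  {"type": "subject",
                                     "name": "Automatic Extension of Time to File"},
    "form-1040": {"type": "form", "name": "Form 1040"},
    "form-4868": {"type": "form", "name": "Form 4868"},
}

associations = [
    # Each form is related to the subject it implements...
    ("form-1040", "is the form for", "individual-income-tax-return"),
    ("form-4868", "is the form for", "automatic-extension-to-file"),
    # ...and the forms are also related to each other, so that a taxpayer
    # filing a 1040 immediately sees that Form 4868 is relevant.
    ("form-4868", "extends the deadline of", "form-1040"),
]

def related_forms(form_id):
    """Forms directly related to a given form (in either direction)."""
    return [b if a == form_id else a
            for a, _, b in associations
            if form_id in (a, b) and topics[a]["type"] == topics[b]["type"] == "form"]

print(related_forms("form-1040"))  # ['form-4868']
```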

The interchangeability of topic maps across companies and organizations was one of the main assets promised by the standard. Most of the tools were built to provide this feature. The possibility of merging topic maps was a side effect of this ability. So far, we have not seen much use of this feature. That doesn't mean it has never been used, but in our extensive experience with topic maps we have not seen it happen. One of the reasons may be that companies that do business with one another don't necessarily want to interchange the core of their knowledge, but instead limit their information exchange to what is strictly necessary. If they have invested a lot in their knowledge assets to be competitive, they may simply not want to share them. Government agencies could be good candidates for more openness in information interchange. Yet in some situations, for the sake of preserving individual rights, information exchange is strictly prohibited. For example, the US Census Bureau cannot divulge any information about individuals, and the IRS cannot share tax returns with anyone, except in very special circumstances. Moreover, even when information is sharable, there is a big gap between general declarations of intent about openness and transparency and the reality. Most of the time, information is very complex. Even if every government agency were using topic maps, it is very unlikely that they could share them with other institutions. And even if they wanted to, the way the information is organized would often be too specific to be easily exchanged.

In other words, the case for topic maps interchange still needs to be made, and it does not look as compelling as previously thought. The focus on interchangeability has probably been a factor in the standard's low adoption. The main value of the topic maps paradigm therefore seems to lie not in the interchangeability of topic maps, but rather in the independence between the sources and the knowledge layer.

Independence from the sources

The most important promise of the topic maps design is to guarantee that the knowledge representation of information as topics is kept independent from the information sources. In other words, users should be able to point to any information sources, including and especially when they change, and to create and manage the semantics from the outside. This guarantees that when sources change, only the occurrences of the topics are redefined; the other properties of the topics remain, including the relations to other topics, the various names used to designate the topics (including their equivalents in other languages), and the types to which the topics belong. The flexibility of the topic map, its ability to survive modifications in the information source repository, is accompanied by a wide-open choice of tools for creating topic maps. It is therefore possible to manage topic maps using XML systems, but also with spreadsheets, databases --relational, object-oriented, XML-based, NoSQL, graph-based--, and content management systems with tagging capabilities, taxonomy management features, and so on. Web-based frameworks are commonly equipped with features that emulate what can be done with topic maps, and should be usable as well.

Small Data vs. Big Data

In the information environment in which we live, with the scary amount of information available, why would anyone spend time and effort to proactively organize information around topics? The answer depends on the environment. In many cases, search technologies yield transient topic maps (i.e., search hits) which are considered "good enough". They are not perfect, especially when the number of hits is astronomical. Besides the world of "Big Data", another world exists, less visible but still very present, which we will call "Small Data". This is a world where the creators or publishers of information know what their content is, and their business is based on guaranteeing that their content is of high quality and high value, so that it can be relied upon. This is the traditional role of the publishing industry, but other industries and activities also depend on the reliability of the information they make public: media, government, international organizations, healthcare, intelligence, finance, manufacturing, and research and development are examples of such sectors. And they represent a non-negligible part of the economy. In these sectors, guaranteeing access to the relevant pieces of information is of paramount importance. Sometimes the purpose of proper information management is to hide information rather than to show it, but then it is even more important to use a solid methodology to describe and qualify the information items.

There are situations where automated search capabilities do not return the information a user is looking for. To take a simplistic example, if a search engine is based on strict full-text matching, looking for "George Washington" would not return content that contains only "General Washington". The problem with this situation is that the user may think that the information is not there, and the publisher may lose traction, and even the trust of its customers, if they are not fully confident that they can find the information that matters to them. In some cases, it may even be a life-or-death issue. It would be unthinkable for an airplane cockpit equipment manufacturer to leave its users --pilots-- to rely on unvetted search algorithms to find critical information in case of an emergency.
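
A minimal sketch of the difference, assuming strict full-text matching (the document text and identifiers are invented): the query misses the variant name entirely, while a topic that carries both names resolves either form of the query to the same occurrences.

```python
documents = {
    "doc-17": "General Washington crossed the Delaware in December 1776.",
}

query = "George Washington"

# Strict full-text matching misses the variant name entirely.
print([doc for doc, text in documents.items() if query in text])  # []

# A topic carrying both names resolves either query to the same occurrences.
topic = {
    "names": {"George Washington", "General Washington", "Washington, George"},
    "occurrences": ["doc-17"],
}
if query in topic["names"]:
    print(topic["occurrences"])  # ['doc-17']
```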

Other domains have similar requirements: information collected by intelligence agencies must be organized according to complex, not always repeatable, combinations of algorithms and hand-made editing, in order to "connect the dots" and not risk missing an important piece of information merely because it is not tagged exactly like another, similar one. When information is published in multiple languages, the ability to keep the versions synchronized is important. These sophisticated requirements also play an important role in finance, healthcare, science, and academic work. In other words, subject matter experts still have a role to play, whether it is to index books or to perform similar activities on digital information.

Taxonomies

Librarians use taxonomies to organize knowledge according to a hierarchy of categories and subcategories. All relevant materials are related to a branch or a leaf of the tree using a common terminology. The "authority terms" comprising a taxonomy are expected to be used outside the library catalog as well, as metadata in the sources, enabling links back to the taxonomy. For example, every book traditionally gets assigned Library of Congress authority headings as part of its metadata. Taxonomies have further evolved into ontologies, which contain rules that facilitate automatic processing for retrieving subjects based on computed properties. The Semantic Web community has developed an ontology language, OWL, that is used on top of RDF to apply artificial intelligence techniques to retrieve data in response to user queries.

The main challenge with taxonomies, and with any knowledge organization scheme, is the cost of creating them and of maintaining them over time. Experts need to meet and agree upon a common way to describe knowledge, down to a very detailed level. This seems like a reasonable endeavor, but in reality it turns out to be a very complicated task. Beyond the common ground, the devil is in the details. Experts may explicitly integrate different world views and find ways to account for multiple ways of modeling and qualifying terms. When no prevalent worldview is asserted, disagreements may result in misunderstandings and imperfect compromises, jeopardizing the integrity of the knowledge description. Any ambiguity or lack of clear definition will be deepened further in the future by newcomers who may not have a full understanding of the background context. Subsequent taxonomy editors may eventually mischaracterize some topics, and the overall quality of the taxonomical organization will decrease. The mere passing of time will also take its toll. New information sources may not be describable using the existing categories. Modifying the taxonomy may not be easy, especially if it has served as the foundation for customized tools whose user interfaces rely on the existing content. The procedures for submitting a request to change, add, or delete taxonomy terms may involve many steps that users will be reluctant to take. They may prefer to slightly tweak descriptions to fit an existing term rather than go to the trouble of adjusting the taxonomy. By doing so, they may underestimate the long-term effects of semantic drift, and slowly the taxonomy will go out of sync with the content of the information, becoming progressively irrelevant, until these effects become impossible to ignore. The company may then decide that it makes more sense to start from scratch and build a brand-new taxonomy, which will eventually go through the same degradation process all over again.

Crowd-sourced tagging is sometimes considered an alternative. It leaves each contributor full freedom to create their own terminology, but with the risk of creating semantic inconsistencies. Recently we have been working for the NYU Library on a project to integrate about one hundred book indexes made by different authors, at different times, published by different publishers. Although each index is internally consistent, mixing them together reveals how delicate semantic integration can be. Fixing variant spellings or presentations of similar terms is the easy part, and can be handled using a variety of policies that enforce consistency after the fact (for example, person names could be harmonized as "last name" followed by "first name", as in the sketch below). The most difficult part is the level of semantic granularity that is needed. For example, the term "heart surgery" is a valid index entry in a book describing a range of medical techniques, but it is irrelevant in a book that is entirely devoted to the subject of heart surgery.
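
Here is a sketch of one such after-the-fact harmonization policy, normalizing person-name entries to "last name, first name". It is deliberately naive (it ignores particles, multi-part surnames, and entries that are not person names) and is not the actual code used in the project.

```python
def harmonize_person(entry: str) -> str:
    """Normalize a person-name index entry to 'Last name, First name'."""
    entry = entry.strip()
    if "," in entry:                    # already in "Last, First" form
        return entry
    parts = entry.split()
    if len(parts) >= 2:                 # "First [Middle] Last" -> "Last, First [Middle]"
        return f"{parts[-1]}, {' '.join(parts[:-1])}"
    return entry

# Entries coming from different indexes collapse onto a single form.
entries = ["Austen, Jane", "Jane Austen", "Herman Melville"]
print(sorted({harmonize_person(e) for e in entries}))
# ['Austen, Jane', 'Melville, Herman']
```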

Automation vs. Curation

So far, we have shown the limitations of various semantic approaches to finding information. Using automated algorithms may result in missing crucial information, because the way an information item appears falls outside the range of what the algorithm can grasp. Using a strict organization of knowledge may result in a process which is so hard to maintain that over time it becomes progressively irrelevant. Leaving full freedom for tagging information results in the creation of inconsistencies. The difficulties involved seem out of reach for many of those confronted with these challenges, and, either looking to lower costs or out of despair, some companies are outsourcing many aspects of their information technology assets, including the knowledge management itself. But when companies or organizations defer the management of their core knowledge assets to third parties, they risk losing their raison d'être.

There should be a better way. There are two directions to look in: first, applying the principle of independence between the sources and the knowledge management layer, and second, fine-tuning the balance between automatic processing and manual curation.

The independence between the sources and the knowledge management layer is at the core of the Topic Maps paradigm. But it has been somewhat relegated to the back burner by approaches that privilege the merging of topic maps and require users to author topic maps using the syntactic constructs of the standard. Instead, our experience has been to be as pragmatic as possible about how topics are organized. It should not matter whether they appear in a database, in a content management system, in XML elements, in HTML metadata, in RDF-Dublin Core metadata, in MARC records, in spreadsheets, in full text, in index entries, etc. There are ways to extract those topics after the fact and to organize them with powerful tools providing a comfortable user interface. Once the topics are extracted, they live independently of the sources from which they come. Therefore, changes in the way sources are handled do not affect the knowledge layer as a whole. For example, if a company decides to replace its content management system with a new one, the knowledge layer just needs to disconnect from the old system and connect to the new one. Everything else is preserved. In that sense, the topic maps paradigm offers a way to preserve the longevity of the work done on the knowledge layer. For example, the fact that Manhattan is a borough of New York City has nothing to do with the source formats in which the topic "Manhattan" is found. The problem arises when the knowledge management layer is handled inside a particular product. If this information is only present in a content management system, and the content management system is replaced, the information gets lost and has to be recreated, potentially at high cost.
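
The Manhattan example can be sketched as follows. The data shapes, field names, and URLs are invented; the point is only that replacing the source system touches the occurrence pointers and nothing else in the knowledge layer.

```python
# The knowledge layer, kept independent of any particular source system.
knowledge = {
    "topics": {
        "manhattan":     {"names": ["Manhattan"]},
        "new-york-city": {"names": ["New York City", "NYC"]},
    },
    "associations": [("manhattan", "is a borough of", "new-york-city")],
    # Occurrences are the only part that points into a source system.
    "occurrences": {"manhattan": ["https://old-cms.example.com/pages/482"]},
}

def reconnect(knowledge, url_mapping):
    """Re-point occurrences to a new source system; names, associations,
    and types are left untouched."""
    for topic_id, urls in knowledge["occurrences"].items():
        knowledge["occurrences"][topic_id] = [url_mapping.get(u, u) for u in urls]
    return knowledge

# Replacing the content management system only requires a URL mapping.
reconnect(knowledge, {"https://old-cms.example.com/pages/482":
                      "https://new-cms.example.com/content/manhattan"})

print(knowledge["associations"])              # unchanged: Manhattan is a borough of New York City
print(knowledge["occurrences"]["manhattan"])  # now points into the new system
```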

Given the amount of information available at our fingertips, it is unrealistic to rely exclusively on manual qualification to make information findable. Back-of-the-book indexes are extremely useful tools because they have been crafted by hand, as an intellectual work that adds value to the book. But this activity is not scalable, except in specific contexts. Nor is it advisable to rely exclusively on automated processes, because numerous exceptions would be missed. There is no magical answer, but our approach is to empirically find the fragile equilibrium point between these two poles, knowing that this equilibrium point may change over time. Some automatic processes can be added, others need to be removed, and manual tweaking should be possible at various levels. Sometimes it's more convenient to edit the results of automatic processing than to do everything manually. Sometimes it's easier, and often more accurate, to do everything manually; there is a limit to the accuracy automatic processes can achieve in decisions about semantic meaning. There is no absolute rule for deciding where the tipping point lies.

The combination of the two is what has proven to work best. Extracting knowledge into an independent layer, processing it at that layer, and feeding the results back to the sources may not seem like the most direct and efficient way to proceed. The process is comparable to the publishing workflow where authors insist on using Word but publishers want XML: round-tripping the conversion between the two formats is not efficient, but it's sometimes necessary. Furthermore, this level of indirection is precisely what provides us with the power and freedom to handle knowledge in a way that can be preserved over time, regardless of what happens to the source information and, more specifically, to the systems used to handle it. All the work done to describe the information, type the topics, create relationships, and manage multilingual equivalences still works. Because it has been managed independently, upgrading a system simply means disconnecting from the old system and reconnecting to the new one.

The lessons learned from working with Topic Maps for more than two decades are mixed. Because of the rapid pace of technological advances, we have been overwhelmed by the success of information technologies. Looking for the immediate next big thing has obscured our capacity to think about the fundamental nature of what we are doing. The notions of trust, reliability, and high-quality content are still central to the long-term success of our enterprises. We need to adjust to the changing ways in which the information we deal with presents itself. It's just the beginning. When we created the Topic Maps standard, we created something that turned out to be a solution without a problem: the possibility to merge knowledge networks across organizations. Despite numerous expectations and many efforts in that direction, this did not prove to meet enough demand from users. But we also developed the concept of independence between information sources and the knowledge management layer. This may turn out to be what remains in the long term, even if the fact that this idea once went by the name of Topic Maps falls into oblivion.