Standard Data Vocabularies Unquestionably Harmful

May 29, 2002

Walter Perry

At the onset of XML four long years ago, I commenced a jeremiad against Standard Data Vocabularies (SDVs), to little effect. Almost immediately after the light bulb moment -- you mean, I can get all the cool benefits of the Web in HTML and create my own tags? I can call the price of my crullers <PricePerCruller>, right beside <PricePerDonutHole> in my menu? -- new users realized the problem: a browser knows how to display a heading marked as <h1> bigger and more prominently than a lowlier <h3>. Yet there are no standard display expectations or semantics for the XML tags which users themselves create.

That there is no specific display for <Cruller> and, especially, not as distinct from <DonutHole> has been readily understood to demonstrate the separation of data structure expressed in XML from its display, which requires the application of styling to accommodate the fixed expectations of the browser. What has not been so readily accepted is that there should not be a standard expectation for how a data element, as identified by its markup, should be processed by programs doing something other than simple display. To a new user of XML bidding for a contract, the newfound advantage of calling a <DonutHole> a <DonutHole> disappears if the pastry procurement protocol expects the bids to be

<PricePerPastry DonutType="hole" SpecializedType="brown sugar rolled"
    PricingQuantityStandardRange="50K/day - 150K/day" />
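
To make the point concrete, here is a minimal sketch in Python of a consuming process that hard-codes its expectation of that one form. The function name accept_bid, the constants, and the attributes on the <DonutHole> sample are assumptions made up for illustration; they do not describe any real procurement protocol.

```python
import xml.etree.ElementTree as ET

# Hypothetical: the single element name and the attributes the buyer's
# protocol expects. These are taken from the example bid above.
EXPECTED_TAG = "PricePerPastry"
REQUIRED_ATTRS = ("DonutType", "SpecializedType", "PricingQuantityStandardRange")


def accept_bid(xml_text):
    """Return True only if the bid matches the shape the protocol hard-codes."""
    element = ET.fromstring(xml_text)
    if element.tag != EXPECTED_TAG:
        return False
    return all(name in element.attrib for name in REQUIRED_ATTRS)


# The seller's own vocabulary (attribute names invented for this sketch).
own_vocabulary = '<DonutHole coating="brown sugar rolled" price="0.05" />'

# The same offer restated in the form the protocol expects.
protocol_vocabulary = (
    '<PricePerPastry DonutType="hole" SpecializedType="brown sugar rolled" '
    'PricingQuantityStandardRange="50K/day - 150K/day" />'
)

print(accept_bid(own_vocabulary))
print(accept_bid(protocol_vocabulary))
```

The check looks only at names and shape. Nothing in it tests whether the values asserted are true.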

Now, clearly, a mom-and-pop shop wanting to leverage the Web into a supplier contract to General Motors is perfectly happy to label its wares however GM expects. Over the past three years GM's expectations, and the expectations of the dominant players in more than a thousand vertical markets, have been codified as Standard Data Vocabularies. Where the dynamics of a vertical industry are like the automobile sector, this makes some sense and has resulted in marketplaces narrowly specific to vertical industries growing up around SDVs, as with Covisint (for automobile parts) or NewView (for steel).

There have also emerged a few "horizontal" data vocabularies, intended for expressing business communication in more general terms. One of these is the eXtensible Business Reporting Language (XBRL), about which more below. Most recently, governments and governmental organizations have begun to suggest and eventually mandate particular SDVs for required filings, a development which expands what troubles me about these vocabularies by an order of magnitude.

At the just-concluded XML Europe 2002 conference in Barcelona I delivered a presentation which explained how an Enron-like organization could disseminate its particular spin and its own, perhaps not generally accepted, interpretations in financial reporting.

Adhering to standard vocabularies has recently meant all too often that an item properly labeled and conforming to an expected form is naively accepted as being actually what it purports to be. My talk was scheduled in the Legal and Government track of the conference, which makes sense given the topic, but it was not what an audience which came for news on the latest governmental initiatives in standard vocabularies might have wanted, and I found myself with a room of fourteen people. That number includes the session chair, me, and the two previous speakers in this track: an official of the European Patent Office and one from the Japan Patent Office now at the World Intellectual Property Organization, where they are working toward promulgating an SDV of 500 elements intended to express patent filings to 180+ patent offices worldwide. These patent office officials immediately understood the import of my argument to their work, and by question time the session had become a discussion of how firmly rooted in the nature of SDVs themselves is the problem of misstatement, of misdirection of naive interpretation, and the potential for fraud.

I have argued for years that, on the basis of their mechanism for elaborating semantics, SDVs are inherently unreliable for the transmission or repository of information. They become geometrically less reliable when the types or roles of either the sources or consumers of that information increase, ending at a nightmarish worst case of a third-order diminution of the reliability of information. And what is the means by which SDVs convey meaning? By simple assertion against the expected semantic interpretations hard-coded into a process consuming the data in question.

One recurring theme at the Barcelona conference was the need to break down "silos of information". Clearly new uses for data and the realization of synergies between previously unrelated functions require that information be released from a single vertical path of use within narrowly-defined areas of expertise. The uncritically accepted assumption is that this laudable goal should or can be reached through bisecting the silos of expertise with a horizontal common denominator which offers access to different narrow areas of expertise through a single shared vocabulary. Conceptually, that solution misunderstands what expertise is based on and how it operates.

Expert analysis or other processing depends at least as much on knowing what to process, where and how to find it, what form to expect those inputs to exhibit if they are valid, and what form of output most precisely conveys the effects of the expert process, as it does on the detail of how those inputs are manipulated into those outputs. In short, the bulk of expertise is in understanding the detail of connections between data and the processes which produced it or must consume it. It is precisely these expert connections which standard data vocabularies are intended to sever.

Patent filing

In the case of the SDV for worldwide patent filings, the presenters at the Barcelona conference lamented that, once what had seemed the hard work of designing the vocabulary was finished, they were surprised and frustrated by how much salesmanship and evangelism was required to encourage patent filers to use the vocabulary and governments to mandate its use.

In my opinion that will change quickly as filers realize that power to shape the outcome of a patent process has been shifted to them by the SDV. By design, the patenting process will begin with the filer's own assertions conveyed in the SDV. Filers can learn to effect particular outcomes by these assertions (or perhaps by unexpected combinations of assertions), which they submit to trigger hard-coded semantics from the patenting process. In effect, the SDV hands the general public a patenting process API, capable of significant remote imperative invocation of particular outcomes, precisely because the semantic outcome of the process is, by design, conditioned on the submission of specific items from the standard data vocabulary.
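
The mechanism can be sketched in a few lines of Python. Everything named here is hypothetical: the element names SmallEntity and PriorityClaim, the handlers, and the fee and queue values are invented for illustration and are not drawn from the actual patent filing vocabulary. The sketch only shows the shape of the problem, a fixed table mapping vocabulary items to process outcomes.

```python
import xml.etree.ElementTree as ET


# Hypothetical process steps triggered by items of the vocabulary.
def waive_fee(state):
    state["fee"] = 0


def route_to_fast_track(state):
    state["queue"] = "fast-track"


# The hard-coded semantics: each vocabulary item maps to one outcome.
HARD_CODED_SEMANTICS = {
    "SmallEntity": waive_fee,
    "PriorityClaim": route_to_fast_track,
}


def process_filing(xml_text):
    """Apply whatever outcomes the filer's own assertions select."""
    state = {"queue": "standard", "fee": 100}
    for element in ET.fromstring(xml_text):
        handler = HARD_CODED_SEMANTICS.get(element.tag)
        if handler is not None:
            handler(state)  # triggered by assertion alone, never verified
    return state


print(process_filing("<Filing><SmallEntity/><PriorityClaim/></Filing>"))
```

In this sketch the filer, not the office, chooses which handlers run, simply by choosing which elements to submit. That is the sense in which the vocabulary works as a publicly callable interface.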

Security measures which generally protect remote invocation interfaces cannot be used to screen out submitters where the interface is intended, even mandated, for use by the general public. General Motors might simply refuse bids from a particular submitter, but governmental organizations face steeper barriers to discriminating against individuals using a mandated vocabulary for an official communication.

In fact, precisely identifying the submitter, which would be the basis of discouragement in many security systems, is in this case a chief goal of a submitter seeking to be granted an individual right of entitlement by governmental authority. Submitters who want to game such a system have a better perspective on how it works than do the designers of its standard data vocabulary. Particular combinations of components from the SDV which might seem illogical to designers of that vocabulary may be found to result in process outcomes which benefit the submitters in ways never anticipated by designers.

Remember that what is at stake is control of intellectual property and the lucrative fruits of its use, obtainable by asserting effective incantations from the standard vocabulary. The gamesters have every incentive, while the guardians of the system can at best run to patch their process code whenever they discover it has yielded an unanticipated result. The vulnerability itself can never be removed so long as the principal design premise of the system is open access to the process code for anyone who uses the SDV to convey established semantics.

Worrisome as this is, it gets worse. The patent filing SDV only standardizes what is already the case: patent application in current practice begins from a submitter's formulation of its own claims. At present there is a human examiner in the patent office to restate those claims (if they seem to have some initial merit) into the terms on which they will be evaluated in the patent application process. Rather than the simple mechanical mapping of the semantics of the SDV to the execution of various processes, there is a complex expertise embodied in a human being which transforms a variety of incoming vocabularies into a functional one internal to the expert domain of the patent office. The proposed patent filing SDV will replace that expertise at the door of the patent office with a single fixed mapping of vocabulary items to the specific semantics of process outcomes.


The eXtensible Business Reporting Language (XBRL) carries the consequences of such mechanically mapped semantics to another order of magnitude and effectively dumbs down the professional expertise of accountancy to the generalities of an SDV. Rather than the starting point, as in the patent filing process, the SDV of XBRL is the midpoint and interface between complex expert processes which acquire and prepare data and other processes which report and otherwise render that data through professional expertise.

XBRL bisects the closed silo of accountancy with a general-purpose common denominator SDV, which by design lacks the specificity required to proceed from input to output with the precision and nuance which both sides require. The rationale of the design is to open the silo so that other expert data collection processes may submit their product to a generalized repository, out of which reports and renderings in many areas of expertise might be generated. Unfortunately, this design ignores an inevitable outcome: generalizing data between the specific demands of domain expertise in collection and corresponding domain expertise in reporting will introduce vagueness, ambiguity, doubt, and error, wasting the expertise of the collection effort and reducing the reporting to meaninglessness or worse.

Again, the stakes are considerable. The United Kingdom has mandated XBRL for corporate tax reporting beginning in 2006, and the XBRL consortium is actively lobbying for other such government support. Again, however, the users of XBRL, and their purposes in using it, may not be what the designers of the SDV expect. Within the silo of accounting, the reasonable assumption is that the data is prepared by the same or equivalent experts to those who report it. The very rationale of the SDV is to break down that assumption. By design, data expressed in the SDV will be reported, rendered, and otherwise manipulated by those specifically inexpert in, and quite possibly unaware of, the nature of its collection. Data integrity in such circumstances is simply unachievable.

This methodology itself strikes at the heart of domain expertise, which demands intimate knowledge of the details of the data which defines the field. Instead we have an open invitation -- indeed a government mandate -- to gamers of the system to concatenate those specific items of the SDV which will produce desired outcomes in reports ranging across taxation, securities regulation, investment analysis, and other high profit opportunities for fraud. What is not at all clear is that this gaming of the outcomes is in fact fraud, for the SDV itself severs the connection between input and output which would allow a reasonable inference of intent from the result.

It didn't and doesn't have to be this way. Instead of the static mapping of process semantics to particular items of the SDV, we can have processes which demonstrate specific expertise in their instantiation of data for their own unique purposes. They exhibit, that is, the crucial expertise of understanding their own data needs.

That expertise permits a process to operate upon data from a variety of sources, in each case available in a form particular to the expertise that created it and without regard to the nature or needs of the process -- or multiple very different processes -- which might consume or manipulate it. Each process produces only one expert rendition or other process outcome. Yet taken together with the variety of similarly expert processes which supply their input data, the group of such processes more than meets the ostensible goal of SDVs in opening the silo to the sharing of data on a many-to-many basis among different expert domains. That goal is not achieved without the effort of a strict discipline in designing process intercommunication and interaction, which I shall describe in a subsequent article, "The Natural Process Model of XML".
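
As a rough illustration only, and not the model promised for the subsequent article, a consuming process of this kind might be sketched in Python as follows. The function unit_price, the attribute names it looks for, and the plausibility rule are all assumptions invented for this example. The point is that the process decides for itself what counts as a usable input, whatever vocabulary the source happened to use.

```python
import xml.etree.ElementTree as ET


def unit_price(xml_text):
    """Instantiate the one value this process needs, or None if it cannot."""
    element = ET.fromstring(xml_text)

    # Gather candidate values from wherever a source might have put them.
    candidates = []
    if element.text and element.text.strip():
        candidates.append(element.text.strip())
    for name in ("price", "Price"):  # hypothetical attribute names
        if name in element.attrib:
            candidates.append(element.attrib[name])

    # Apply this process's own test of validity, not the label's claim.
    for text in candidates:
        try:
            value = float(text)
        except ValueError:
            continue
        if value > 0:
            return value
    return None


print(unit_price("<PricePerCruller>0.85</PricePerCruller>"))
print(unit_price('<DonutHole price="0.05" />'))
print(unit_price("<Cruller>fresh</Cruller>"))
```

Here the element name plays no part in the decision. A source is free to label its data in its own terms, and the consumer accepts a value only when it passes the consumer's own checks.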