Standard Data Vocabularies Unquestionably Harmful
At the onset of XML four long years ago, I commenced a jeremiad against Standard Data Vocabularies (SDVs), to little effect. Almost immediately after the light bulb moment -- you mean, I can get all the cool benefits of the Web in HTML and create my own tags? I can call the price of my crullers <PricePerCruller>, right beside <PricePerDonutHole> in my menu? -- new users realized the problem: a browser knows how to display a heading marked as <h1> bigger and more prominently than a lowlier <h3>. Yet there are no standard display expectations or semantics for the XML tags which users themselves create.
That there is no specific display for <Cruller> and, especially, not as distinct from <DonutHole> has been readily understood to demonstrate the separation of data structure expressed in XML from its display, which requires the application of styling to accommodate the fixed expectations of the browser. What has not been so readily accepted is that there should not be a standard expectation for how a data element, as identified by its markup, should be processed by programs doing something other than simple display. To a new user of XML bidding for a contract, the newfound advantage of calling a <DonutHole> a <DonutHole> disappears if the pastry procurement protocol expects the bids to be
<PricePerPastry DonutType="hole" SpecializedType="brown sugar rolled"
PricingQuantityStandardRange="50K/day - 150K/day" />
Now, clearly, a mom-and-pop shop wanting to leverage the Web into a supplier contract to General Motors is perfectly happy to label its wares however GM expects. Over the past three years GM's expectations, and the expectations of the dominant players in more than a thousand vertical markets, have been codified as Standard Data Vocabularies. Where the dynamics of a vertical industry are like the automobile sector, this makes some sense and has resulted in marketplaces narrowly specific to vertical industries growing up around SDVs, as with Covisint (for automobile parts) or NewView (for steel).
There have also emerged a few "horizontal" data vocabularies, intended for expressing business communication in more general terms. One of these is the eXtensible Business Reporting Language (XBRL), about which more below. Most recently, governments and governmental organizations have begun to suggest and eventually mandate particular SDVs for required filings, a development which expands what troubles me about these vocabularies by an order of magnitude.
At the just-concluded XML Europe 2002 conference in Barcelona I delivered a presentation which explained how an Enron-like organization could disseminate its particular spin and its own, perhaps not generally accepted, interpretations in financial reportings.
Adhering to standard vocabularies has recently meant all too often that an item properly labeled and conforming to an expected form is naively accepted as being actually what it purports to be. My talk was scheduled in the Legal and Government track of the conference, which makes sense given the topic, but it was not what an audience which came for news on the latest governmental initiatives in standard vocabularies might have wanted, and I found myself with a room of fourteen people. That number includes the session chair, me, and the two previous speakers in this track: an official of the European Patent Office and one from the Japan Patent Office now at the World Intellectual Property Organization, where they are working toward promulgating an SDV of 500 elements intended to express patent filings to 180+ patent offices worldwide. These patent office officials immediately understood the import of my argument to their work, and by question time the session had become a discussion of how firmly rooted in the nature of SDVs themselves is the problem of misstatement, of misdirection of naive interpretation, and the potential for fraud.
I have argued for years that, on the basis of their mechanism for elaborating semantics, SDVs are inherently unreliable for the transmission or repository of information. They become geometrically less reliable when the types or roles of either the sources or consumers of that information increase, ending at a nightmarish worst case of a third-order diminution of the reliability of information. And what is the means by which SDVs convey meaning? By simple assertion against the expected semantic interpretations hard-coded into a process consuming the data in question.
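A minimal sketch in Python may make that mechanism concrete. Everything in it is hypothetical: the element names, the `HANDLERS` table, and the `consume` function are invented for illustration and are not taken from any real vocabulary or product. The consuming process finds meaning only by looking up a tag name in its own hard-coded table, and nothing verifies that the content is what its label asserts.

```python
import xml.etree.ElementTree as ET

# Hypothetical hard-coded semantics: tag name -> how the consumer treats it.
HANDLERS = {
    "Revenue": lambda el: ("income", float(el.text)),
    "Debt": lambda el: ("liability", float(el.text)),
}

def consume(xml_text):
    """Interpret each child element purely by its asserted tag name."""
    results = []
    for el in ET.fromstring(xml_text):
        handler = HANDLERS.get(el.tag)
        if handler is None:
            continue  # items outside the vocabulary are silently ignored
        results.append(handler(el))
    return results

# A figure that is really loan proceeds, but labeled <Revenue>,
# is accepted as income because the label says so.
doc = "<Filing><Revenue>1000000</Revenue></Filing>"
print(consume(doc))
```

The consumer has no way to ask where the number came from; the tag is the whole of the evidence.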
One recurring theme at the Barcelona conference was the need to break down "silos of information". Clearly new uses for data and the realization of synergies between previously unrelated functions require that information be released from a single vertical path of use within narrowly-defined areas of expertise. The uncritically accepted assumption is that this laudable goal should or can be reached through bisecting the silos of expertise with a horizontal common denominator which offers access to different narrow areas of expertise through a single shared vocabulary. Conceptually, that solution misunderstands what expertise is based on and how it operates.
Expert analysis or other processing depends at least as much on knowing what to process, where and how to find it, what form to expect those inputs to exhibit if they are valid, and what form of output most precisely conveys the effects of the expert process, as it does on the detail of how those inputs are manipulated into those outputs. In short, the bulk of expertise is in understanding the detail of connections between data and the processes which produced it or must consume it. It is precisely these expert connections which standard data vocabularies are intended to sever.
Patent filing
In the case of the SDV for worldwide patent filings, the presenters at the Barcelona conference lamented that, once what had seemed the hard work of designing the vocabulary was finished, they were surprised and frustrated by how much salesmanship and evangelism was required to encourage patent filers to use the vocabulary and governments to mandate its use.
In my opinion that will change quickly as filers realize that power to shape the outcome of a patent process has been shifted to them by the SDV. By design, the patenting process will begin with the filer's own assertions conveyed in the SDV. Filers can learn to effect particular outcomes by these assertions (or perhaps by unexpected combinations of assertions), which they submit to trigger hard-coded semantics from the patenting process. In effect, the SDV hands the general public a patenting process API, capable of significant remote imperative invocation of particular outcomes, precisely because the semantic outcome of the process is, by design, conditioned on the submission of specific items from the standard data vocabulary.
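The idea of an SDV acting as a public API can be sketched as follows, under loud assumptions: the vocabulary items, the rules inside `process`, and the outcome names are all invented, and a real filer would observe outcomes from actual submissions rather than call the process directly. The point is only that when outcomes are conditioned on which items are asserted, a search over combinations can turn up one the designers never intended.

```python
from itertools import combinations

# Hypothetical vocabulary items a filer may assert.
VOCABULARY = ["NovelMethod", "PriorArtCited", "SmallEntity",
              "ExpeditedReview", "ContinuationOf"]

def process(assertions):
    """Stand-in for a receiving process with fixed, hard-coded semantics."""
    s = set(assertions)
    if "ExpeditedReview" in s and "SmallEntity" in s:
        # Rules written for separate purposes interact unexpectedly here.
        if "ContinuationOf" in s and "PriorArtCited" not in s:
            return "granted-without-examination"
        return "fast-track"
    return "standard-queue"

def find_incantations(target):
    """Enumerate every combination of items that yields the target outcome."""
    hits = []
    for size in range(1, len(VOCABULARY) + 1):
        for combo in combinations(VOCABULARY, size):
            if process(combo) == target:
                hits.append(combo)
    return hits

for combo in find_incantations("granted-without-examination"):
    print(combo)
```

Patching the one rule that produced the surprise leaves the search itself available to anyone permitted to submit.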
Security measures which generally protect remote invocation interfaces cannot be used to screen out submitters where the interface is intended, even mandated for use by the general public. General Motors might simply refuse bids from a particular submitter, but governmental organizations face steeper barriers to discriminating against individuals using a mandated vocabulary for an official communication.
In fact, precisely identifying the submitter, which would be the basis of discouragement in many security systems, is in this case a chief goal of a submitter seeking to be granted an individual right of entitlement by governmental authority. Submitters who want to game such a system have a better perspective on how it works than do the designers of its standard data vocabulary. Particular combinations of components from the SDV which might seem illogical to designers of that vocabulary may be found to result in process outcomes which benefit the submitters in ways never anticipated by designers.
Remember that what is at stake is control of intellectual property and the lucrative fruits of its use, obtainable by asserting effective incantations from the standard vocabulary. The gamesters have every incentive, while the guardians of the system can at best run to patch their process code whenever they discover it has yielded an unanticipated result. The vulnerability itself can never be removed so long as the principal design premise of the system is open access to the process code for anyone who uses the SDV to convey established semantics.
Worrisome as this is, it gets worse. The patent filing SDV only standardizes what is already the case: patent application in current practice begins from a submitter's formulation of its own claims. At present there is a human examiner in the patent office to restate those claims (if they seem to have some initial merit) into the terms on which they will be evaluated in the patent application process. Rather than the simple mechanical mapping of the semantics of the SDV to the execution of various processes, there is a complex expertise embodied in a human being which transforms a variety of incoming vocabularies into a functional one internal to the expert domain of the patent office. The proposed patent filing SDV will replace that expertise at the door of the patent office with a single fixed mapping of vocabulary items to the specific semantics of process outcomes.
XBRL
The eXtensible Business Reporting Language (XBRL) carries the consequences of such mechanically mapped semantics to another order of magnitude and effectively dumbs down the professional expertise of accountancy to the generalities of an SDV. Rather than the starting point, as in the patent filing process, the SDV of XBRL is the midpoint and interface between complex expert processes which acquire and prepare data and other processes which report and otherwise render that data through professional expertise.
XBRL bisects the closed silo of accountancy with a general-purpose common denominator SDV, which by design lacks the specificity required to proceed from input to output with the precision and nuance which both sides require. The rationale of the design is to open the silo so that other expert data collection processes may submit their product to a generalized repository, out of which reports and renderings in many areas of expertise might be generated. Unfortunately, this design ignores an inevitable outcome; generalizing data between the specific demands of domain expertise in collection and corresponding domain expertise in reporting will introduce vagueness, ambiguity, doubt, and error, wasting the expertise of the collection effort and reducing the reporting to meaninglessness or worse.
Again, the stakes are considerable. The United Kingdom has mandated XBRL for corporate tax reporting beginning in 2006, and the XBRL consortium is actively lobbying for other such government support. Again, however, the users of XBRL, and their purposes in using it, may not be what the designers of the SDV expect. Within the silo of accounting, the reasonable assumption is that the data is prepared by the same or equivalent experts to those who report it. The very rationale of the SDV is to break down that assumption. By design, data expressed in the SDV will be reported, rendered, and otherwise manipulated by those specifically inexpert in, and quite possibly unaware of the nature of its collection. Data integrity in such circumstances is simply unachievable.
This methodology itself strikes at the heart of domain expertise, which demands intimate knowledge of the details of the data which defines the field. Instead we have an open invitation -- indeed a government mandate -- to gamers of the system to concatenate those specific items of the SDV which will produce desired outcomes in reports ranging across taxation, securities regulation, investment analysis, and other high profit opportunities for fraud. What is not at all clear is that this gaming of the outcomes is in fact fraud, for the SDV itself severs the connection between input and output which would allow a reasonable inference of intent from the result.
It didn't and doesn't have to be this way. Instead of the static mapping of process semantics to particular items of the SDV, we can have processes which demonstrate specific expertise in their instantiation of data for their own unique purposes. They exhibit, that is, the crucial expertise of understanding their own data needs.
That expertise permits a process to operate upon data from a variety of sources, in each case available in a form particular to the expertise that created it and without regard to the nature or needs of the process -- or multiple very different processes -- which might consume or manipulate it. Each process produces only one expert rendition or other process outcome. Yet taken together with the variety of similarly expert processes which supply their input data, the group of such processes more than meet the ostensible goal of SDVs in opening the silo to the sharing of data on a many-to-many basis among different expert domains. That goal is not achieved without the effort of a strict discipline in designing process intercommunication and interaction, which I shall describe in a subsequent article, "The Natural Process Model of XML".
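As a rough illustration only, and not the model promised for the subsequent article, here is one way a consuming process might instantiate the data it needs from sources that each publish in their own native form. The two source formats, the extractor functions, and `instantiate_unit_price` are all assumptions made up for this sketch.

```python
import xml.etree.ElementTree as ET

def unit_price_from_bakery(root):
    # Assumed native form: <Cruller><PricePerCruller>0.85</PricePerCruller></Cruller>
    node = root.find(".//PricePerCruller")
    return None if node is None else float(node.text)

def unit_price_from_wholesaler(root):
    # Assumed native form: <Order><Lot count="12" total="9.60"/></Order>
    node = root.find(".//Lot")
    if node is None:
        return None
    return float(node.get("total")) / int(node.get("count"))

# The consuming process owns this list; the sources know nothing about it.
EXTRACTORS = [unit_price_from_bakery, unit_price_from_wholesaler]

def instantiate_unit_price(xml_text):
    """Build the one value this process needs from whatever form arrives."""
    root = ET.fromstring(xml_text)
    for extract in EXTRACTORS:
        price = extract(root)
        if price is not None:
            return round(price, 2)
    raise ValueError("no recognizable source form")

print(instantiate_unit_price(
    "<Cruller><PricePerCruller>0.85</PricePerCruller></Cruller>"))
print(instantiate_unit_price('<Order><Lot count="12" total="9.60"/></Order>'))
```

In this sketch the knowledge of each source's form lives with the consumer that needs the data, so neither source has to flatten its output into a shared vocabulary.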
Have the implications of standard vocabularies been properly considered? Share your opinions in our forum.
- Condemning all E-forms
2002-06-13 20:23:13 Alan Hoyland [Reply]
Mr Perry, please explain how SDVs differ from any other paper form that is merely filed or insufficiently checked.
Like Y2K we should warn people about the problem, but this is the same one that has always been with governments. I would suggest that this problem dates back as far as writing and forms.
- my2cents (in SDV101)
2002-06-10 16:42:16 thomas flood [Reply]
I agree, a set of simple conceptual examples
would be helpful for us novices in the audience.
It seems to me there are 2 basic dangers
that WP is rightly warning us of:
1) technological facilitation of fraud (or error)
due to the elimination of human-expert
filters on the inputs of consumer processes.
2) loss of information due to the constraints
of a mandated SDV applied to an inappropriately
broad cross-product of producers & consumers.
I agree with commentors that it seems extreme
to say that these are inherent insoluble problems
in the nature of SDVs. But maybe we don't yet get the whole point -- i await WP's next installment
where he seems to be promising an alternative.
Meanwhile, creative responses to these dangers:
1) assuming not case(2) loss of information,
then assure that a consumer include
in its automated processing the functions
of validity-checking previously done by
the human expert.
2) don't do it. well, the interesting challenge
is to come up with guidelines/ meta-processes
for designing a SDV, that can estimate the
"range of applicability", attempt to fully
specify for a given set of candidate
producers & consumers, or detect if this
would be impractical and insist on narrowing
the candidate set, to avoid information loss.
Finally,
Based on my meager experience of human nature
and technology, i would say that WP is
performing an extremely valuable service here:
The defects of human nature are rather constant,
but technological advances tend to remarkably
accelerate & amplify their consequences!
PS:
I wonder if it is helpful to think in terms
of categorization-along-multiple-dimensions
as the major sort of processing going on.
Knowledge-domain-A and Knowledge-domain-B
(the data producers) are sharing the terms
of a vocabulary, but of course, the meaning
of those terms depends specifically upon
their context within the domain.
Using a simple Hue-Saturation-Brightness
scheme, we can conjecture that in Domain-A,
"aquamarine" and "teal" might reasonably
be collapsed into the SDV category "green",
which the official consumer process equates
with "GOOD" (its output); while Domain-B
is mandated also to collapse both colors
into GREEN, it happens that actually "teal"
should be considered "blue"="BAD" in this domain.
An individual of type B would not be legally
fraudulent when he converts his teal to SDV green, he is just following the rules, right?
- SDV's are clearly not the issue
2002-06-01 04:48:17 Mike Willis [Reply]
Walter’s points here on the SDV are interesting and his technical semantic perspective is clear. What is not clear is the relationship of this perspective to the technical standard based concepts outlined for corporate reporting by the FASB, SEC or the IASC. It is also not clear why a company’s failure to comply with certain of these existing reporting standards is a reasonable basis to consider the SDV as a central issue. Finally, Walter does not address the opacity of the current reporting formats in which the existing reporting standards are represented in electronic stone tablets presented by public companies to the investors in the capital markets.
Why he has chosen XBRL as an example for SDV is of particular question due to the extensibility of the taxonomies and the clear intent that companies would leverage this for clarity in the supply chain providing information to investors in the capital markets. How the XBRL model enhances the current normalization processes of infomediaries (clearly an existing SDV consideration) is absent from his comments.
- SDV's are clearly not the issue
2002-06-01 06:02:13 W. E. Perry [Reply]
mwillis: Walter's points here on the SDV are interesting and his technical semantic perspective is clear.
That semantic perspective is one of the two premises on which the argument here is based. The other is the dependence of expertise in the reporting or rendition of data upon equivalent expertise in its collection and selection, which leads to the corollary that the expert process of rendition must determine how data is to be instantiated for its particular use.
mwillis: What is not clear is the relationship of this perspective to the technical standard based concepts outlined for corporate reporting by the FASB, SEC or the IASC.
Those standard based concepts are more related to the other premise--the nature of expertise. Domain expertise in the reporting, manipulation or other rendition of data in order to implement the standards of these authorities requires equally expert knowledge of the data inputs on which that rendition relies. The SDV severs, by design, the direct, specific and intimate connections between the data produced by one expert process and that consumed by another. 'Breaking down the silo' in this way--by interposing the SDV between one expert process and another--frustrates the expertise of both. The data produced by an expert process cannot be made available in a form native to that expertise and specifically designed to the particular perspective of that process, without regard to what other processes might consume some portion of the same data further downstream. Instead, the expert process must confine its output to the pre-agreed dialect of the SDV, just as every process which would consume that data might obtain it in no form more particular to the circumstances of its production than the SDV allows.
mwillis: It is also not clear why a company's failure to comply with certain of these existing reporting standards is a reasonable basis to consider the SDV as a central issue.
My point is rather that the SDV, as the only possibility for input to reporting processes, allows gaming the outcome of those processes by the choice of specific items of the SDV as input to that reporting. Whether the instance values of the SDV chosen to effect a particular outcome in fact reflect an otherwise-defensible 'truth' about a company's circumstances is a separate question. Yes, the instance facts submitted as input to a given reporting might be lies, but then again they might not, or they might be arguably defensible half truths. The issue that concerns me here is which items of the SDV--largely regardless of their instance values--are submitted as the basis for a given reporting because of the discovery that that particular combination of inputs is the magic incantation to trigger an intended output.
mwillis: Finally, Walter does not address the opacity of the current reporting formats in which the existing reporting standards are represented in electronic stone tablets presented by public companies to the investors in the capital markets.
I think that this is a question of how--which is to say, with what expertise and to accomplish which goals, of transparency or accessibility, for example--a particular data rendition process is implemented. Indeed there are inadequate, opaque and downright misleading reportings out there. But that is, I think, separate from the question of what can be done with expertise in reporting given only the inputs of an SDV versus given detailed access to the data produced by another expert process.
mwillis: Why he has chosen XBRL as an example for SDV is of particular question due to the extensibility of the taxonomies and the clear intent that companies would leverage this for clarity in the supply chain providing information to investors in the capital markets.
I have chosen XBRL because it is the example, par excellence, of a generalized, horizontal SDV interposed between expert processes producing data and other expert processes intended to consume it. The extensibility mechanisms do not apply to this particular case, where the (only!) relationship of expert process to expert process--on a many-to-many basis, and where most processes may not know of the existence of the others, let alone their specifics--is through a pre-ordained fixed vocabulary.
mwillis: How the XBRL model enhances the current normalization processes of infomediaries (clearly an existing SDV consideration) is absent from his comments.
To the extent that such normalization results in dumbing down the output of expert processes to the common denominator of the SDV, and thereby depriving other expert processes of the quality of data which they require, it is the very target of my essay.
--Walter Perry
- SDV's are clearly not the issue
2002-06-13 20:11:18 Alan Hoyland [Reply]
Can you two speak English or take your discussion off line.
Paragraphs like:
That semantic perspective is one of the two premises on which the argument here is based. The other is the dependence of expertise in the reporting or rendition of data upon equivalent expertise in its collection and selection, which leads to the corollary that the expert process of rendition must determine how data is to be instantiated for its particular use.
should be replaced with:
That was one of my points. The other is that to correctly fill out a form you must be an expert not just in the subject of the form but in the form itself.
Your use of the English language is an attempt to prove that you are an expert not by your command of the subject but by a command of its vocabulary. Something that backfires on most intelligent people.
- SDV's are clearly not the issue
2002-06-24 14:18:43 W. E. Perry [Reply]
No. Your simplification conflates (into a hopeless muddle, I'm afraid) the two points which my more exact language is intended to distinguish. Filling out a form, as an example of submitting data from one expert domain to another, introduces through the experience of most readers of that example, the two assumptions which I feel are unwarranted and which I feel underlie the dangers of SDVs. One is that the submitter and the receiver of the data share semantic understanding broad enough to subsume all of the salient points of expertise on both sides. That is, there is nothing in the preparer's expert understanding of the data, nor in the recipient's, not covered by their prior understanding of each other's domains. This 'white box' view of interoperability contradicts the very nature of the specialization which defines expertise, and in the real world is only to be found within an homogenous and monolithic organization. Forms are intended to provide the (for efficiency's sake, smallest possible) common denominator for transmitting data between expert domains where, by the very nature of expert specialization, it will be differently used with different semantics in each location. Precisely what we cannot assume is that whoever submits the form 'must be an expert not just in the subject of the form but in the form itself' when what is meant by 'the subject of the form' is the subject as understood by the recipient, who will process that form according to his own expertise within his own specialized domain.
The other thing we should not assume is the form itself: it is the very SDV which I am arguing against. In other words, data submitted by an expert who collected and formatted it to the full expression of his expertise will not be data in the specific form which an expert in a different domain would expect to process it for his own particular purposes. The form which you use in your example is precisely the compromised common denominator of an SDV which does not adequately serve the expertise of the domain from which the data comes nor that to which it goes.
Your simplification and example therefore utterly misstate my point, and illustrate it with the very thing I claim that a proper understanding of my point cannot countenance. Granted my language may be heavy going and demanding, but it makes the point as simply as possible, if no simpler.
--Walter Perry
- xml-politics discussion list
2002-05-31 08:39:18 Kendall Clark [Reply]
Hi, folks, I'm Kendall Clark, one of the XML.com editors; just want to let you know that Walter Perry has started a mailing list, xml-politics, for discussion of the kinds of issues he raises in this article.
You can find more details about the list at
- proof
2002-05-31 00:43:34 bryan rasmussen [Reply]
As others have pointed out, it's possible to defraud using data that is not marked up, such as the data formulated in a letter.
It seems to me that the defrauding capacity of SDVs has to do with how they are processed; I believe you touch on this. So yes, if the SDV is processed by a program written by an XML expert without input from a domain-specific expert, the SDV could enable defrauding. I suppose this means that processing of SDVs (outside of processing for display) will need specialized programs to handle the data, just as it will no doubt require specialized programs in most cases to generate the data, and the ones which lead to the best security and verification of data will be the ones that triumph in that area. As an aside, it seems to me that any program which helps you automate generation of XBRL will want to prevent its use as an instrument of fraud in order to get good recommendations from government organizations and thus get ahead of the competition in that area.
- Example(s) PLEASE!
2002-05-30 19:48:46 Susan Jolly [Reply]
What more is the expert now doing in addition to normalizing/canonicalizing the data?
What you really seem to be saying is that the data provider is stupid and didn't know previously how the data was processed to produce the output so, thus, didn't know how to "cook the books" to get a desired result.
- Humans are harmful not vocabularies
2002-05-30 12:16:46 Jeff Gruszynski [Reply]
Technology isn't the issue per se - accounting fraud happens even with paper-based books. Fraud is a quality of humans, not technology. Insofar as humans will always look for a low energy path to everything, there will always be fraud regardless of the technology.
Technology does have the ability to "oil the works". One solution is to simply stop, which Walter seems to advocate: 'let's take a "stasist" strategy of no adoption.' Being a "dynamist" by nature, I reject that strategy without argument or justification.
To address the real issue though we could change the question to: do those who rely on technology have checkpoints, validations and processes for detecting problems. What could be put in place? Is the "system" self-correcting? Correction need not be technological. This is the role of legal, sociological and economic systems.
While using XML for financial reporting could encourage fraud, XML would also enable automated "sanity checks" to be applied which can't be done now with standard SEC, et al. filings, even the so-called XML ones. Being able to cook the books and make things fishy actually becomes *more difficult* with XML filing!
Of course, someone could outright lie with a complete set of consistent books that have no connection to reality, but you can do that today especially if your auditor chooses to overlook it. Outside boundary data will trip that up eventually. Not a technology issue.
There is naturally an issue of semantic dissonance: I mean this tag is this while you mean it to be that. Well, yeah, but look at accounting in general - very loosey-goosey by Techie standards. But that's the story of all human communication regardless of technology.
Will semantics be a source of friction? Yes, until we can connect our brains directly, but most cognitive theories suggest that wouldn't help anyway. In other words: it comes with the territory; we have to just deal with it. Compared to what passes for filings today, it's all up from here!
J
- example(s)?
2002-05-30 12:04:57 John Wiersba [Reply]
The topic sounds interesting and I think I understand it, but one or more down-to-earth examples would be helpful.

