XML.com

Invisible XML

March 1, 2022

Norm Tovey-Walsh

Norm Tovey-Walsh introduces Invisible XML, a language for describing the implicit structure of data, and a set of technologies for making that structure explicit as XML markup.

Invisible XML is a language for describing the implicit structure of data, and a set of technologies for making that structure explicit as XML markup. It allows you to write a declarative description of the format of some text and then leverage that format to represent the text as structured information. That sounds a bit abstract, so let’s start with an example:

Suppose you have a document, contacts.txt, where you keep the names, email addresses, and other details of friends and colleagues. It might look something like this:

John Doe
john@example.com
555-1234

Mary Smith
m.smith@estaff.example.com
+1-222-555-2344

Jane Doe
(512) 555-9999

Nancy Jones
nancy@example.org

This is obviously a toy example, but it will work to illustrate a few points. First of all, those of us who are used to working with marked up data (such as XML or JSON) are likely to think of this as “unstructured” data. But that’s not really true. There is structure there, it’s just indicated with whitespace and other informal conventions, rather than with angle brackets or curly braces.

The real world is full of data marked up using different conventions. We use files like this every day: email headers, ical and vcard files, CSS, Java property files, Windows “ini” configuration files, BibTeX, Markdown, etc. At finer levels of granularity, we find even more examples: ISO 8601 dates and times, XPath expressions, CSS selectors, and so on. And then we have our own ad hoc formats in countless text files: diaries, todo lists, calendars, event planners, etc. There are probably a dozen or so files like this within easy reach wherever you’re reading this.

This is fine when the important consumers are human readers looking at the text. But what happens when we want to use that data in some other system? What happens when we want to extract the information out of these documents structured by conventions of spacing and punctuation?

A common approach would be to write a script to read the format:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
#!/usr/bin/env python3

contacts = []
with open('contacts.txt', 'r') as infile:
    contact_id = 0
    expect_name = True
    for line in infile.readlines():
        line = line.strip()
        print(contact_id, line)
        if line == "":
            contact_id += 1
            expect_name = True
        elif expect_name:
            contacts.append({ "name": line })
            expect_name = False
        elif "@" in line:
            contacts[contact_id]["email"] = line
        else:
            contacts[contact_id]["phone"] = line

That’s horrible, and not just because it’s a bit of sloppy code I banged out in a few short minutes. (Way more of the world’s infrastructure relies on code someone put together for a prototype or a demo than you’d care to think about.) The real problem here is that this tells you nothing about the file format beyond what you can glean from the source code. These kinds of procedural or imperative definitions of a file format, even if they’re backed up by some sort of prose description (that you hope is up-to-date with the latest version of the code, even though you know it isn’t), are difficult to understand, difficult to test, and difficult to reason about.

Can I put an address in this file? Is it ok if I have two blank lines? If I have a phone number but not an email address, can I just put in a blank line for the email address? For anything bigger than a toy example, these are not easy questions to answer.

You would be much better off if you had a declarative description of the format. Declarative descriptions are more accessible, easier to reuse, and generally require less coding.

So why, you might ask, don’t we use them all the time?

We do. Your XML processor, your JavaScript engine, your JSON tooling, the compilers that built your web browser and your operating system, are all driven by parsers that take declarative descriptions of a system plus some input, construct an abstract representation of that system, and operate on it. Under the hood, our favourite software is making constant use of declarative descriptions.

Okay, but why don’t we use them all the time?

There are a couple of reasons. Historically, it was expensive. I mean computationally expensive. But Moore’s Law has pretty much sorted that out. The other reason, the real reason, is because it’s hard work. Off-the-shelf tools are mostly designed to build incredibly efficient parsers that can analyze huge amounts of data, unambiguously and efficiently, with only a character or two of lookahead. Writing format descriptions for those tools is a specialized skill that few of us have.

But it doesn’t actually have to be hard. Or at least, nowhere near that hard. What makes it a specialized skill, and what makes it hard, is fitting your declarative description into those very tight requirements: no ambiguity and no (or only a small, fixed amount of) lookahead.

Luckily, that’s not the only way to approach the problem. Techniques for parsing that are tolerant of ambiguity and prepared to engage in (more-or-less) arbitrary lookahead have been around for decades. They are not as fast, or as memory efficient, as traditional parsers, but they can be competitive in many cases. And we have Moore’s Law on our side! Those problems are no longer significant in many environments.

Invisible XML provides a specific syntax for declarative description that’s easy to use and easy to understand. Combined with a parser that understands Invisible XML, you can apply those descriptions to your data and get structured data out. No (imperative) coding required.

Before we dig in deeper, let’s establish a little bit of vocabulary. An Invisible XML document is a grammar. As a term of art in computer programming, a grammar is essentially a set of rules. Each rule describes a symbol in terms of some other symbols.

If you’ve ever used regular expressions, you’re familiar with these ideas, even if you didn’t use this exact vocabulary. If you’re matching data in Python or XPath, or any language with regular expressions, and you say that an integer is a string that matches [-+]?[0-9]+, you’ve created a grammar rule: integer: [-+]?[0-9]+.

Invisible XML collects these rules together. Each rule has a “left hand side” and a “right hand side”. The left hand side is a single symbol, the one being defined, and the right hand side is a list of one or more symbols that define it. A symbol is either the name of something, in which case there must be a further rule that defines it, one that has it as the rule’s left hand side; or it’s something that literally matches characters in your input. With these rules, the processor will work out whether it’s possible to match the whole input string that you gave it by applying these rules in some order.

Let’s take a very small example. Suppose we want to match sentences of three letter words like “see cat sat”. We could write an Invisible XML grammar like this:

sentence  : word+" " .
word      : consonant, vowel, consonant;
            consonant, vowel, vowel.
vowel     : ["aeiouy"] .
consonant : ["bcdfghjklmnpqrstvwxyz"] .

We’ll come back to the specific details about syntax and how to write an Invisible XML grammar in the next article, for now we’ll just do a little hand waving.

That grammar has four rules and you can read them like this:

  1. A sentence is one or more occurrences of word separated by a single space.

  2. A word is a consonant, followed by a vowel, followed by a consonant or a consonant followed by two consecutive vowels.

  3. A vowel is literally “a”, “e”, “i”, “o”, “u”, or “y”.

  4. A consonant is literally any one of the other lowercase, English language letters, and also “y”.

Note that there’s nothing procedural here. There’s no attempt to say how you do anything with a word or any of its constituent parts. The grammar just declaratively describes the format. It’s easy to answer questions about this format. Are numbers allowed? Can words be more or less than three letters long? Is punctuation allowed? No, no, and no, respectively. And it’s easy to imagine writing tests to ensure that this grammar does match what you want.

Of course, there is software that’s going to process it. The first thing a processor, or parser, is going to do is determine whether the input string you gave it matches the grammar (you’ll sometimes see this written as “checking if the input is a sentence in the grammar”). In order to do that, it has to know where to start, so you need to nominate one of the symbols as the “start symbol”. In Invisible XML, that’s the symbol in your first rule, so “sentence” in this case.

We can imagine a parser, given this grammar and the one word sentence “cat”, doing something like this:

  1. Does the input match sentence? I don’t know, what’s a sentence?

  2. A sentence is one or more words separated by spaces. Does the input match word? I don’t know, what’s a word?

  3. A word is either:

    • A consonant, followed by vowel, followed by a consonant, or

    • a consonant, followed by a vowel, followed by another vowel.

    Does the input match that? I don’t know, what’s a consonant?

  4. A consonant is one of a set of letters. Ok. Does the first letter match one of those? Yes, “c” is one of those letters. Great.

  5. Does the rest of the input match the rest of the “right hand side”? The next thing is vowel. Does the rest of the input match that? I don’t know, what’s a vowel?

  6. A vowel is one of a set of letters. Ok. Does the next letter match one of those? Yes, “a” is one of those letters. Great.

  7. And on it goes, matching symbols against rules until it runs out of symbols or runs out of input. In this case, it will run out of input after it matches the “t”.

  8. There’s no input left. If we stop here, are we finished with the right hand side of the rule for the start symbol? Yes? Great. Yes, “cat” is a sentence.

As you can see, the parser uses the rules to replace symbols with what those symbols can be in a kind of recursive process that “bottoms out” when it reaches something that has to match the input. Sometimes you’ll see the class of symbols that can be replaced by other symbols identified as “nonterminals” as distinct from the symbols that literally match against the input, the “terminals”.

What happens if it doesn’t match? What happens, for example, if we give this grammar the input “frog”? In that case, the parser will tell you it couldn’t match your grammar (“the input is not a sentence in the grammar”). Something like this:

Parse failed. At position 2, found unexpected “r”. Would have permitted: ["aeiouy"].

So your grammar also functions as a validator for your input!

On a successful parse, the other thing an Invisible XML parser has to do is tell you how it matched the input. This is the part that gives you back the structured information. For our word parser, it’s pretty simple and obvious:

<sentence>
   <word>
      <consonant>c</consonant>
      <vowel>a</vowel>
      <consonant>t</consonant>
   </word>
</sentence>

Broadly speaking, Invisible XML makes each nonterminal in your grammar into an XML element. In fact, it offers you a number of ways to control how the results are constructed, including which nonterminals should be output, which things should be elements and which should be attributes, and even which things to elide altogether. We’ll come back to those topics when we talk about writing grammars in the next article.

There are two other things to bear in mind. First, the grammar only validates against its rules. According to this grammar “xeq bei” is a perfectly fine sentence. Second, an input may be ambiguous with respect to a grammar. Consider the sentence “hey bee”. If you parse that with our sentence grammar, you’ll get:

<sentence xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <word>
      <consonant>h</consonant>
      <vowel>e</vowel>
      <vowel>y</vowel>
   </word>
   <word>
      <consonant>b</consonant>
      <vowel>e</vowel>
      <vowel>e</vowel>
   </word>
</sentence>

Or maybe you’ll get:

<sentence xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">
   <word>
      <consonant>h</consonant>
      <vowel>e</vowel>
      <consonant>y</consonant>
   </word>
   <word>
      <consonant>b</consonant>
      <vowel>e</vowel>
      <vowel>e</vowel>
   </word>
</sentence>

There are two different ways to parse the input because our grammar says that a “y” can be either a consonant or a vowel. The word “hey” therefore matches both the pattern “consonant-vowel-vowel” and the pattern “consonant-vowel-consonant”. That makes the results ambiguous. The Invisible XML processor is required to tell us that, and it’s required to return one of the valid parses. (Implementations may give you more control than that, but that’s all that’s required for conformance.)

You might now be asking yourself, what about the word “gym”? Would that be ambiguous too? It is, in some sense “ambiguous” because it’s either “consonant-vowel-consonant” or “consonant-consonant-consonant”. But it’s not ambiguous in this grammar because “consonant-consonant-consonant” isn’t a possible match.

Ambiguity is ok, and sometimes it’s necessary. The ambiguity in this grammar is an inherent property of the English language rule that “y” sometimes represents the sound of a vowel and sometimes it represents the sound of a consonant.

That said, the more ambiguity there is, the more possibilities the parser may have to consider when it’s looking for matches. That can make the process slower. We’ll look more at ambiguity when we start writing grammars in the next part.

Here, finally, is a grammar for the contacts format:

contacts: (contact, NL*)+ .
contact: name, NL, (email, NL)?, (phone, NL)? .

name: letter, ~[#a; "@"]* .
email: username, "@", domainname .
phone: ["+0123456789()- "]+ .

-username: (letter; ["+-."])+ .
-domainname: (letter; ["+-."])+ .

-letter: [L] .
-NL: -#a ; -#d, -#a .

It can be read like this:

  1. A contacts file consists of one or more contact items followed by zero or more newlines.

  2. A contact is a name followed by a newline, optionally followed by an email followed by a newline, optionally followed by a phone followed by a newline.

  3. A name is a letter followed by any characters except newline or “@”.

  4. An email is a username followed by “@” followed by a domainname.

  5. A phone is one or more digits or the “+”, “(“, “)”, “-”, and space punctuation characters.

  6. A username is one or more occurrences of letter or any of the characters “+”, “-”, and “.”.

  7. A domainname is one or more occurrences of letter or any of the characters “+”, “-”, and “.”.

  8. A letter is any character in the Unicode character class “L” (letters).

  9. A NL is a newline or the sequence carriage return followed by newline.

If you give that grammar and the contacts file to an Invisible XML processor, it will return:

<contacts>
   <contact>
      <name>John Doe</name>
      <email>john@example.com</email>
      <phone>555-1234</phone>
   </contact>
   <contact>
      <name>Mary Smith</name>
      <email>m.smith@estaff.example.com</email>
      <phone>+1-222-555-2344</phone>
   </contact>
   <contact>
      <name>Jane Doe</name>
      <phone>(512) 555-9999</phone>
   </contact>
   <contact>
      <name>Nancy Jones</name>
      <email>nancy@example.org</email>
   </contact>
</contacts>

Or the author of the processor might allow you to ask for the output in JSON instead:

{
  "contacts": {
    "contact": [
      {
        "name": "John Doe",
        "email": "john@example.com",
        "phone": "555-1234"
      },
      {
        "name": "Mary Smith",
        "email": "m.smith@estaff.example.com",
        "phone": "+1-222-555-2344"
      },
      {
        "name": "Jane Doe",
        "phone": "(512) 555-9999"
      },
      {
        "name": "Nancy Jones",
        "email": "nancy@example.org"
      }
    ]
  }
}

Or maybe in CSV:

"name","email","phone"
"John Doe","john@example.com","555-1234"
"Mary Smith","m.smith@estaff.example.com","+1-222-555-2344"
"Jane Doe",,"(512) 555-9999"
"Nancy Jones","nancy@example.org",

Every Invisible XML processor begins by creating a basic XML representation of your input data. That’s why it’s called Invisible XML. But what’s really going on here is that we’re turning input with implicit structure into explicitly structured data that can be transformed into whatever output format we require. And we’re doing it by describing that format in a compact, understandable, reusable, testable way.

Join us next time for a detailed look at the syntax of Invisible XML.