XML.com

Writing Invisible XML grammars

March 28, 2022

Norm Tovey-Walsh

Norm Tovey-Walsh gives us a tour of the syntax of Invisible XML documents and how to write (and debug) grammars.

This article is a follow-up to the introduction to Invisible XML published at XML.com on 1 March 2022. In this article, we’re going to focus on the syntax of Invisible XML documents and how to write (and debug) grammars. If you’re not already familiar with the principles behind Invisible XML, you may want to read the introductory article first.

1 Introduction

If you’re familiar with writing grammars in any BNF (or EBNF) form, you’ll find Invisible XML easy to use. If you’ve never written a grammar like this before, you’re in for a treat, this is fun!

You’ll soon discover that writing an Invisible XML grammar shares a lot with writing an XML schema (defining a hierarchy of symbols) or a regular expression (figuring out how to match an item and what to replace it with). As we said last time, a grammar is a collection of rules. Each rule has a “left hand side” and a “right hand side”. The left hand side is a single symbol, the one being defined, and the right hand side is a list of symbols that define it. A symbol is either the name of a symbol, in which case there must be a further rule that defines it, or it’s something that literally matches characters in your input.

This article describes the syntax of Invisible XML as published in the draft specification of 22 February 2022. In the event that the community group introduces grammar changes, we’ll try to make sure this article stays up-to-date.

Invisible XML is defined by an Invisible XML grammar. It is an instance of itself. The complete grammar is only about 45 rules. Printed, it would fit on one side of a sheet of standard office paper.

Let’s dig in. We’re going to bounce around a little bit at first, but eventually we’ll settle in to writing some longer, less contrived grammars.

1.1 Play along at home!

As you read this article and look at the examples it contains, we encourage you to play along at home. You’ll learn a lot from trying out the examples! As you learn new concepts, extend the examples and try matching your own inputs.

For an even more interactive introduction to Invisible XML, check out Steven Pemberton’s Invisible XML (ixml) Tutorial presented at Declarative Amsterdam in 2021.

Steven’s tutorial includes an online form where you can upload a grammar and an input file to be processed. Alternatively, you can download CoffeePot, a command line Invisible XML processor that’s part of the NineML family of Invisible XML tools. (It’s a Java application and should run easily on Linux, MacOS, and Windows computers.)

1.2 Input or output?

Writing grammars is about creating rules that match against things in the input. Those rules are, in turn, used to create XML elements and attributes that will appear in the output. These aren’t completely independent. Annotations that you put in the grammar will influence how the XML is created.

In this article, we’ve organized the two sections separately. First, we’ll look at the input, and then we’ll talk about the output. If you’re working through examples and starting to experiment with grammars on your own, you may want to know something about the output sooner rather than later. Feel free to skip back and forth. If you’ve never written grammars of this type before, do read the first few sections of “The input” about rules and matching first.

Bear in mind as you go that the goal of Invisible XML is to produce “visible XML”. That’s not necessarily the final output form that you want. It’s “text to XML”, not “text to DocBook 5.2 with embedded MathML and SVG.” There are other, better tools to transform one flavor of XML into another, Invisible XML doesn’t have to do all the lifting.

2 The input

Before we begin looking at rules, we need to think about what is being processed by them. If you have any experience with parsing technologies, you may be used to the idea that processing an input has two phases: a lexical analysis phase (“the lexer”) and a parsing phase (“the parser”).

The lexer decides what parts of the input are the actual tokens. It might remove insignificant whitespace, strip out comments, normalize strings, etc. The parser then operates on those tokens. The lexer does a lot of important work and the lexer for one kind of parsing (Java, for example) simply won’t work for another kind (Python, for example).

If Invisible XML was going to work this way, it would have to provide two sets of rules, one for lexing and one for parsing. That would add all kinds of complexity to the language. Instead, Invisible XML has the simplest possible lexer: your input string becomes a sequence of characters.

What you’re matching in rules are the individual characters of the input: every single one of them. This is one of the reasons that writing an Invisible XML grammar feels a little bit like writing a regular expression. As we’ll see, this makes some things a little more complicated, but overall it greatly simplifies Invisible XML.

2.1 Rules

A rule has the form of a name (the left hand side) followed by a colon, followed by zero or more symbols (the right hand side) followed by a full stop to end the rule. If more than one symbol appears on the right hand side, they must be separated by commas:

symbol-name: defining, symbols, here .

Invisible XML allows only a single rule for any given name. If you want to express that a symbol can have two or more definitions, separate the alternatives with semicolons. This rule says that a “thing” is a “this”, followed by a “that”:

thing: this, that .

This rule says that a “thing” is a “thisor a “that”:

thing: this; that .

Whitespace around the punctuation is insignificant, “this;that” is the same as “this; that” is the same as “this ; that”. That’s true of all the rule punctuation except that there must be at least one whitespace character between consecutive rules.

Most grammars contain more than one rule. Rules are separated by whitespace. It’s good practice, simply for readability, to start each rule on its own line, but it’s not strictly required.

2.2 Organizing the “right hand side”

The Invisible XML specification introduces a few nonterminals (remember, Invisible XML is defined in Invisible XML!) to organize how we think about the symbols that appear on the right hand side of a rule. It will be convenient later to talk about how these are combined, so let’s take a moment to lay out some vocabulary.

At the highest level, what appears on the right hands side of a rule is a series of alternatives. A series of alternatives is composed of one or more individual alternatives separated from each other by “;” (or “|”). Each alternative is composed of a series of terms separated by “,”. A term is composed of factors. A factor is either a terminal; a nonterminal; or another set of alternatives surrounded by parentheses. In other words, like content models or regular expressions, you can put parentheses around a set of alternatives and then use that group as a factor, a part of another grouping.

For example, in the following rule:

memo: recipient, (date, sender ; sender, date), content .

The left hand side “memo” is defined as a single alternative composed of three terms: “recipient”, a grouped alternative in parentheses, and “content”. The first and third terms contain a single factor each (and each of those factors is a nonterminal not defined in this example). The second term is a set of two alternatives enclosed in parenthesis; each of those alternatives is composed of two terms (each a single nonterminal factor) separated by commas.

2.3 Matching literal characters

A grammar consists of nonterminals, the symbols you define with rules, and terminals, symbols that match characters explicitly in your input. There are a few options for matching characters in your input.

2.3.1 Matching strings

You can match a literal string of text by using its quoted value. The following rule says that the nonterminal symbol jan matches the literal string “January”.

jan: "January".

This rule says that “month” matches any of the Gregorian calendar month names in English:

month: "January"; "February"; "March"; "April";
       "May"; "June"; "July"; "August";
       "September"; "October"; "November"; "December" .

This might be a good time to try out a grammar. Save those three lines in a plain text file named month.ixml and see if you can make your first XML document with Invisible XML.

You can do this with CoffeePot at the command line like this:

coffeepot -g:month.ixml March

(Depending on how you’ve installed CoffeePot, you may need to use java -jar or some other variation to run CoffeePot. See the documentation for details. You may also get some additional log messages before the output shown below, depending on how you have things configured).

Which will print:

<month>March</month>

Or you can do this using Steven’s form-based interface at https://homepages.cwi.nl/~steven/ixml/tutorial/run.html:

Form to run ixml

Which returns:

Result

Strings can be delimited by either single quotes (') or double quotes ("). There’s no difference between the two forms except that it’s easy to put double quotes inside single ones and single quotes inside double ones. The string '"hello"' matches the string "hello" with the quote marks; the string "don't" matches the string don't with the apostrophe.

If the string you want to match contains both double and single quotes, you can escape one inside the other by doubling them. The string "I said ""there""." matches I said "there".. Similarly, 'don''t' matches don't. Note that in each case, these are the ordinary, non-typographic quotes, " (U+0022) and ' (U+0027). If your string contains typographic quotes, those are just ordinary characters as far as Invisible XML is concerned.

String literals in Invisible XML are not allowed to break across lines. You cannot put literal line breaks in a string. To match those characters, you must use encoded characters.

You might be wondering what happens if your input doesn’t match. You can try that out by, for example, trying to match the month “Marsh” in the previous examples.

coffeepot -g:month.ixml Marsh

CoffeePot will print:

<fail xmlns:ixml="http://invisiblexml.org/NS" ixml:state="failed">
   <line>1</line>
   <column>4</column>
   <pos>3</pos>
   <unexpected>r</unexpected>
   <permitted>'c'</permitted>
   <also-predicted>'A', 'D', 'F', 'J', 'M', 'N', 'O', 'S'</also-predicted>
</fail>

An Invisible XML processor always produces an XML result. In this case, the result has been annotated with a “failed” state and some attempt has been made to describe what went wrong. Like most error messages from computers, it needs to be taken with a grain of salt. The processor doesn’t know what’s wrong in any meaningful sense, it just reports what state it was in when it gave up.

Steven’s form offers a similar report:

Bad Result

2.3.2 Matching individual characters

You can match any literal character, including end of line characters and other control characters, with an encoded character. An encoded character is a number sign (“#”) followed by hex digits. An encoded character can represent any Unicode character.

For example, the line feed character used to separate lines in text files on MacOS and Linux is “#a” (or in capitals, “#A”, or with leading zeros, “#000a”, if you prefer).

In the discussion of strings, we noted that you cannot put a line break in a string. So how would you match “one” followed by “two” separated by a single line feed?

One way is like this:

onetwo: "one", #a, "two" .

The commas that you put in to separate tokens don’t imply any delimiters in the input.

Line endings in XML are normalized, but Invisible XML just operates on the input that you give it. The preceding example would be more portable written this way:

onetwo: "one", #d?, #a, "two" .

That will work on filesystems, like Windows, where a new line is often a carriage return (“#d”) followed by a line feed. This might be easier to read if the encoded characters were identified with their own nonterminals:

onetwo: "one", cr?, lf, "two" .
-cr: -#d .
-lf: -#a .

But it will match the same input either way.

2.3.3 Character sets

Strings and encoded characters match against the input exactly as they are written. Another way to match is with character sets which can be inclusive (all these characters) or exclusive (any characters except these).

Character sets are delimited by square brackets and contain literal strings, encoded characters, character ranges, or Unicode character classes. If the opening square bracket is preceded by a “~” (U+007E), the set is exclusive, otherwise it’s inclusive.

The following rule:

digit: ["0123456789"] .

Doesn’t match the string zero through nine, as it would if it wasn’t surrounded in square brackets, instead it matches exactly one character, any one of “0”, “1”, “2”, …, “9”.

Another way to write that would be with a range:

digit: ["0"-"9"] .

A range consists of a character string containing a single character (or an encoded character) followed by a hyphen-minus (-, U+002D), followed by another single character string (or encoded character). Unlike ranges in most regular expressions, the members of a character set are always quoted or encoded, so you don’t need extra escaping. A character set that includes left square bracket, both kinds of quote marks, and all of the characters from “-” to “\” could be written:

["[", "'", '"', "-"-"\"]

though it might be a little easier to read as

["[", "'", '"', #2d-"\"]

or even

[#5b, "'", '"', #2d - "\"]

Finally, a set can identify a Unicode character class. Unicode groups similar characters together into classes identified by one- or two-letter codes. The code “P”, for example, identifies all punctuation characters while “Pd” identifies those punctuation characters that are dashes. You can use these classes in character sets. To define the nonterminal “punct” that matches any single Unicode punctuation character, you can write this:

punct: [P] .

You can combine these various forms in a single set:

hexplus: ["0123456789"; "A"-#46; #61-"f"; Nd].

That defines “hexplus” as matching any single character zero through nine, or any upper-case latter “A” through “F”, or any lowercase letter “a” through “f”, or any character in the Unicode character class “Number, Decimal Digit” (which contains more than 600 characters as many languages have their own decimal digits). The Unicode class contains the Arabic decimal digits, so there’s a bit of redundancy in this set, but that’s not an error.

As noted earlier, you can negate a class (make it an exclusion) by placing a tilde in front of it:

notnumeric: ~[N].

The “notnumeric” nonterminal matches any single Unicode character that is not in the class of number characters. The set ~[] matches any character that is not in the empty set, in other words, any character.

(Invisible XML doesn’t let you build up more complicated character sets by composing them: you can’t mix inclusions and exclusions together in the same set.)

2.4 Matching sequences

All the rules we’ve looked at so far match exactly one thing: one word, one symbol, or one character. What if you want to match more or less than one? Invisible XML gives you several options here as well starting with “*”, “+”, and “?” which behave just as you’d expect if you’re familiar with regular expression languages.

factor*

A factor followed by a “*” will match zero or more occurrences of that factor. Given this rule,

seq: 'a', '.'*, 'b'.

A “seq” will match “ab”, “a.b”, “a..b”, etc. with as many full stops as you like between the two letters.

factor+

A factor followed by a “+” will match one or more occurrences of that factor. Given this rule,

seq: 'a', '.'+, 'b'.

A “seq” will match “a.b”, “a..b”, “a...b”, etc. with as many full stops as you like between the two letters, but it will not match “ab”.

factor?

A factor followed by a “?” is optional. It will match zero or one occurrence of that factor. Given this rule:

seq: 'a'? .

A “seq” will match either nothing or a single “a”. Matching empty strings (accidentally or on purpose) as an important topic that we’ll come back to later.

factor1 * factor2

Invisible XML has two special forms of repeat that are convenient for grammar authors. If two factors are separated by a “*”, they will match zero or more occurrences of the first factor separated by the second factor. Given this rule:

seq: 'a'*',' .

A “seq” will match “”, “a”, “a,a”, “a,a,a”, etc. Matching as many “a”s as appear (including none) provided that they are separated by precisely one comma.

Note that because the separator is a factor as well, you can have more nuanced separators. You could match a sequence of “a” characters separated by commas followed by optional spaces like this: 'a'*(',', ' '*).

factor1 + factor2

If two factors are separated by a “+”, they will match one or more occurrences of the first factor separated by the second factor. Given this rule:

seq: 'a'+',' .

A “seq” will match “a”, “a,a”, “a,a,a”, etc. Matching as many “a”s as appear provided that they are separated by precisely one comma.

2.5 Matching names and whitespace

At this point, you have all the building blocks necessary to write your own grammars, but two common features deserve special attention: matching names (or other contiguous sequences of characters) and matching whitespace.

2.5.1 Matching names

Invisible XML grammars match characters in your input. There isn’t a lexer in front of it to group characters into tokens and there isn’t a “regular expression” matcher for grouping. If you want to construct input “tokens”, you have to do it yourself.

You’ll find a pattern for this in the way that Invisible XML defines a “name”, for example the name of a rule:

        name: namestart, namefollower*.
   namestart: ["_"; L].
namefollower: namestart; ["-.·‿⁀"; Nd; Mn].

Those rules define names to begin with “_” or any letter, followed by zero or more characters that are themselves “_” or letters, or characters from a slightly broader repertoire: hyphen-minus (-, U+002D), full stop (., U+002E), middle dot (·, U+00B7), undertie (‿, U+203F), character tie (⁀, U+2040, characters in the class Nd (decimal digit numbers), and characters in the class Mn (nonspacing marks).

That’s the technique you have to employ if you want to make sequences of characters. This example is a tiny bit complicated by the fact that “name start” characters are different from the rest of the name characters. If the whole sequence came from the same repertoire, you could simply say something like: “name: namecharacter+”.

2.5.2 Matching whitespace

Whitespace in Invisible XML is defined as:

           s: (whitespace; comment)*.
  whitespace: [Zs]; tab; lf; cr.
         tab: #9.
          lf: #a.
          cr: #d.
     comment: -"{", (cchar; comment)*, -"}".
      -cchar: ~["{}"].

The crucial observation here is that whitespace is a “zero or more” symbol. That means everywhere that the Invisible XML grammar allows an “s” to appear in a grammar, it can be a space, a tab, a line feed, etc., or two of them or three of them, or none of them. As we’ll see in Section 2.6.1, Matching nothing, this can lead to trouble (though it doesn’t in the Invisible XML specification grammar).

Comments in Invisible XML are considered whitespace. They’re delimited by curly brackets and they’re defined so that they nest properly. (You can comment out an arbitrary section of an Invisible XML grammar without worrying about whether or not there are comments in the section you’re commenting out!)

2.6 Ambiguity

As we saw in the previous article, grammars can be ambiguous and that’s not an error. Nevertheless, ambiguous grammars can slow down the parser and different implementations may give different answers, so it’s best to avoid them if you can. Consider this somewhat contrived example:

seq: (A; B), '.'*.
A: 'a', '.'+ .
B: 'b', '.'+ .

A “seq” is an “A” or a “B” followed optionally by full stops, an “A” is an “a” followed by one or more full stops, a “B” is an “b” followed by one or more full stops.

Asked to parse “a..”, the Invisible XML processor will report ambiguity:

<seq xmlns:ixml="http://invisiblexml.org/NS"
     ixml:state="ambiguous">
   <A>a..</A>
</seq>

CoffeePot will tell you that there were two possible parses, but it only returns one of them by default because that’s conformant behavior for a processor. Internally, it enumerates the parses and returns the first. You can ask for the other parse by passing --parse:2 on the command line (alternatively, you can pass --parse-count:all for all of them.) Here’s the second parse:

<seq xmlns:ixml="http://invisiblexml.org/NS"
     ixml:state="ambiguous">
   <A>a.</A>.</seq>

Ambiguity arises because there’s more than one way to parse an input. In this case, it’s pretty easy to see that the ambiguity is whether the second “.” is part of the A or part of the “seq”. Removing that ambiguity involves choosing one of the options and then adjusting the grammar to make that the only possible parse. In this case, one option is to make sure all the full stops are associated with either the A or the “B”:

seq: A; B.
A: 'a', '.'+ .
B: 'b', '.'+ .

The other option is to make sure they’re all associated with the seq:

seq: (A; B), '.'+.
A: 'a'.
B: 'b'.

Neither answer is more correct than the other. Which one is “right” depends on the output you want to get.

Ambiguity is a broader, and more nuanced topic, than we have space for here. It’s possible to make distinctions between grammars that are ambiguous and parses that are ambiguous. Try parsing a. with the first grammar above and you’ll find that there’s only one parse. Invisible XML reports ambiguous parses, not ambiguous grammars. It’s also possible to write a grammar that has loops in it: seq: 's', seq; seq; . A “seq” is an s or a “seq” or nothing. That grammar is infinitely ambiguous because it can have an unbounded number of empty “seq” matches.

Asked to parse “sss” with that grammar, CoffeePot will report:

There is 1 parse, but the grammar is infinitely ambiguous
<seq xmlns:ixml="http://invisiblexml.org/NS" ixml:state="ambiguous">s
   <seq>s
      <seq>s
         <seq/>
      </seq>
   </seq>
</seq>

(In the interests of space and time, CoffeePot won’t let you enumerate infinitely many parses; it ignores loops.)

2.6.1 Matching nothing

A very common place for ambiguity to arise is when a symbol can match nothing. Anytime you use a “*” or “?” modifier on a symbol, you’re indicating that the symbol can be absent. Another way to think about “can be absent” is “can match the empty string”. Consider

seq: 'x', 'o'*, 'x'.

That rule will allow “seq” to match “xx” because 'o'* can match zero occurrences of “o”. That’s not necessarily ambiguous (and it’s not ambiguous in this simple case), but consider a grammar for North American phone numbers.

Ignoring the country code, a phone number in North America consists of an area code, a prefix (technically a central office code), and a number (technically a station number), canonically: 512‑555‑0100 Unfortunately, you’ll also find (512) 555‑0100, common before the prefix was largely mandatory, 512 555 0100, or even 5125550100. Sometimes the area code will be missing altogether since it isn’t always required. You might write a grammar for phone numbers like this (ignoring additional constraints on the digits in some parts of the number):

1phone-number: (areacode, sep)?, prefix, sep, number .
-sep: dash; space .
-dash: -'-'? .
-space: -' '? .
5number: digits .
prefix: digits .
areacode: digits; -'(', digits, -')' .
-digits: ['0'-'9']+ .

(The extra hyphen-minus signs in front of some of the rules and symbols are explained in Section 3, The output.)

Presented with 512-555-0100, it quietly does what you expect, returning:

<phone-number>
   <areacode>512</areacode>
   <prefix>555</prefix>
   <number>0100</number>
</phone-number>

Presented with 5125550100, things go a little sideways. CoffeePot will tell you that there are 48 possible parses, including this one:

<phone-number xmlns:ixml="http://invisiblexml.org/NS"
              ixml:state="ambiguous">
   <areacode>51255501</areacode>
   <prefix>0</prefix>
   <number>0</number>
</phone-number>

Can you see why there are 48 parses? That grammar allows optional punctuation between every digit and it allows one or more digits in each part. Of course, that’s not actually how North American phone numbers work. This is a better grammar that describes the area code, prefix, and number with explicit lengths:

1phone-number: (areacode, sep)?, prefix, sep, number .
-sep: dash; space .
-dash: -'-'? .
-space: -' '? .
5number: digit, digit, digit, digit .
prefix: digit, digit, digit .
areacode: digit, digit, digit;
          -'(', digit, digit, digit, -')' .
-digit: ['0'-'9'] .

This will correctly parse the number, but curiously it will still report ambiguity, asserting four possible parses. Can you see why?

You can ask CoffeePot to tell you why with the --describe-ambiguity option:

There are 4 possible parses.
Ambiguity:
sep, 3, 3
        dashⁿ, 3, 3
        spaceⁿ, 3, 3
sep, 6, 6
        dashⁿ, 6, 6
        spaceⁿ, 6, 6

What this says, somewhat cryptically, is that sep matches the empty strings at positions 3 and 6 in two different ways: either by matching an omitted dash or an omitted space. You can fix this changing how “sep” is defined:

1phone-number: (areacode, sep)?, prefix, sep, number .
-sep: (dash; space)? .
-dash: -'-' .
-space: -' ' .
5number: digit, digit, digit, digit .
prefix: digit, digit, digit .
areacode: digit, digit, digit; -'(', digit, digit, digit, -')' .
-digit: ['0'-'9'] .

Now a “sep” is an optional dash or space, but those aren’t independently optional. And the grammar is no longer ambiguous. (It also no longer accepts optional separators between every digit. Reader challenge: rewrite the grammar so that’s allowed without reintroducing ambiguity.)

Another common pattern that can introduce ambiguity is when you want to ignore whitespace. Consider this small grammar for a language a little bit like Invisible XML:

1         rule: name, s?, ':', s?, symbol+s .
       symbol: name .
 
           -s: whitespace+.
5  -whitespace: -[Zs]; tab; lf; cr.
         -tab: -#9.
          -lf: -#a.
          -cr: -#d.
 
10        @name: namestart, namefollower*.
   -namestart: ["_"; L].
-namefollower: namestart; ["-.·‿⁀"; Nd; Mn].

It only accepts a single rule with a name before the colon and a sequence of whitespace separated symbols after the colon. This grammar works perfectly well and will parse “a: b c” just the way you’d expect. Now let’s add optional marks:

1         rule: name, s?, ':', s?, symbol+s .
       symbol: mark?, s?, name .
 
         mark: '^'; '@'; '-'.
5 
           -s: whitespace+.
  -whitespace: -[Zs]; tab; lf; cr.
         -tab: -#9.
          -lf: -#a.
10          -cr: -#d.
 
        @name: namestart, namefollower*.
   -namestart: ["_"; L].
-namefollower: namestart; ["-.·‿⁀"; Nd; Mn].

A symbol is now an optional mark, followed by optional whitespace. This introduces ambiguity in the parse of a: b c. If you look at the output of --describe-ambiguity, you’ll find that the problem is in the expansion of “rule”. Imagine how you might expand “rule”, substituting the value of symbol where it occurs in “rule”:

rule ⇒ name, s?, ':', s?, mark?, s?, name …

Can you see the source of the ambiguity? If there is no mark in the input, but there is a space after the colon, there will be two ways to resolve that space. After the colon, we could match the empty string with the first s?, the empty string for mark?, and the space for the second s?. Alternatively, we could do it the other way around: match the space before the mark and the empty string after it.

One way to resolve this ambiguity is to change the definition of symbol:

       symbol: (mark, s?, name); name .

This definition still allows an optional mark, but there’s no second space unless there is a mark, in which case there is no ambiguity about how to consume the space after the colon in the rule.

Your first thought might have been to simply remove the space after the colon:

         rule: name, s?, ':', symbol+s .
       symbol: mark?, s?, name .

This will parse a: b c unambiguously. It will also parse a: b ^c and a:^b ^c. But it will not parse a: ^b ^c. Can you work out why?

2.7 Non-XML characters

Even with entities and numeric character references, the repertoire of XML characters is limited. You can’t, for example, have &#x0; (U+0000) in a well-formed XML document. Consequently, if you attempt to output those characters from your Invisible XML grammar, you’ll get errors. Similarly, if you create a nonterminal with a name that includes characters not allowed in XML names, you’ll get errors if it’s used for an element or attribute name.

Invisible XML is less restrictive. The following grammar accepts a sequence of letters separated by null characters:

letters: letter+#0.
letter: ['A'-'Z'; 'a'-'z'].

As long as you don’t attempt to output any invalid characters, it’s not an error to use them.

3 The output

Whatever else your processor may be capable of, if it claims to be a conformant Invisible XML processor, it must be able to produce XML documents.

As a general rule, the Invisible XML processor makes an XML element out of every nonterminal that it matches. Consider this simple grammar for dates:

1date: day, ' '+, month, ' '+, year .
day: digit, digit? .
digit: ["0"-"9"] .
month: "January"; "February"; "March"; "April";
5       "May"; "June"; "July"; "August";
       "September"; "October"; "November"; "December".
year: digit, digit, digit, digit .

If you parse a date, “7 March 2022”, with that grammar, you’ll get:

<date>
   <day>
      <digit>7</digit>
   </day> 
   <month>March</month> 
   <year>
      <digit>2</digit>
      <digit>0</digit>
      <digit>2</digit>
      <digit>2</digit>
   </year>
</date>

Pretty soon after the observation, “cool, it works!”, you’ll probably want to see about getting rid of all those extra “digit” elements. The Invisible XML grammar allows you to annotate rules, nonterminals, and terminals with extra symbols, called “marks” in the specification, to control how they’re used to construct XML.

The mark on a nonterminal controls how it’s serialized. If a particular nonterminal on the right-hand-side of a rule doesn’t have a mark, the mark on the rule that defines that nonterminal is the default. The marks are:

^ (the default mark)

A “^” is the default. On a nonterminal, it indicates that an element should be created using the name of the nonterminal and what matches the nonterminal should be inside it. On a terminal, it indicates that the terminal should appear in the output.

@

An “@” only applies to nonterminals. It indicates that an attribute should be created using the name of the nonterminal and what matches the nonterminal should appear in the attribute value.

-

A “-” suppresses output. On a nonterminal, it suppresses the name of the nonterminal, not its content. On a terminal, it suppresses the text of the terminal.

Now we can go back to the grammar and suppress all those extra digit elements. Replacing the rule for “digit” with this one:

-digit: ["0"-"9"] .

Changes the output to:

<date>
   <day>7</day> 
   <month>March</month> 
   <year>2022</year>
</date>

Using an “@” mark will make an element into an attribute. Consider this alternative rule for year:

@year: digit, digit, digit, digit .

With that rule, you’ll get:

<date year="2022">
   <day>7</day> 
   <month>March</month>
</date>

We’ve been “pretty printing” the XML output in this article because it’s easier to read that way. But it has actually obscured a detail. The spaces between day and month and month and year are being output. That may be fine, but sometime’s it’s not. Let’s consider a grammar for ISO 8601 dates.

date: year, '-', month, '-', day .
year: digit+.
month: digit, digit.
day: digit, digit.
-digit: ["0"-"9"].

If you parse a date, “2022-03-07”, with that grammar, you’ll get:

<date>
   <year>2022</year>-
   <month>03</month>-
   <day>07</day>
</date>

Now those extra characters really stand out. Suppress them with “-”:

date: year, -'-', month, -'-', day .

That yields:

<date>
   <year>2022</year>
   <month>03</month>
   <day>07</day>
</date>

Note that suppressing nonterminals and terminals are independent. Try to predict what happens if you use this grammar:

date: year, dash, month, dash, day .
-dash: '-'.
year: digit+.
month: digit, digit.
day: digit, digit.
-digit: ["0"-"9"].

Try it out. Were you right? Can you figure out where to put another “-” to fix the problem?

3.1 What you can’t do

This is a good place to observe that there are some things you can’t do with Invisible XML.

  • You can’t reorder the input and you can’t generate output. You can’t, for example, parse “7 March 2022” and output “March 07, 2022”. Not only is there no way to change the order, there’s no way to generate either the leading “0” or the “,” needed.

  • You can’t make element names that are based on content in the input. There’s no rule that will match a month name and produce a <March> element or a March attribute.

  • You can’t replace the matched value with another value. There’s no way to match “03” and produce “March”, or vice-versa.

  • You can’t output namespace declarations or namespaced elements or attributes.

Some of these are potential features for a future version of Invisible XML. But the goal of Invisible XML isn’t to transform your non-XML input into a final output format. The goal of Invisible XML is to transform the structure that’s only indicated with whitespace and other informal conventions, into explicit XML structure that you can transform with XSLT or XProc or your tool of choice.

That’s its power.

Related links