XML.com

# Observations on equational content in XML workflows

January 15, 2018

Peter Krautzberger discusses strategies for representing mathematical equations in XML and on the web.

Disclaimer: I managed the MathJax Consortium until recently.

## Equations ≠ Math

A pet peeve of mine is the language people use for what I've come to describe as "equational content". You might blame my background in mathematical research but then you should really criticize that I should be talking about formulas (or perhaps formulae). In my experience, most people describe anything looking like a formula as "equation" and while that's still strictly speaking wrong, it is what it is. More importantly, "equation" is far better than the more common alternative: math (or maths or mathematics). So I've started to call this kind of content "equational" with the hope that a different term helps clarify discussions a bit.

## Authoring Formats

If you need to include equational content in your XML workflow, the very first question is necessarily: what format do I use? As you would expect from any content with millennia of tradition, equational content comes in many different formats, even in the digital age, even in XML workflows. The dominant formats you will encounter in XML workflows are MathML and TeX/LaTeX.

### MathML

You would assume that MathML is the natural choice when it comes to XML workflows and by and large that's the case. MathML was one of the earliest applications of XML, with MathML 1 becoming a W3C Recommendation (read: published standard) all the way back in 1998; its current incarnation, MathML 3, was released in 2010 and minimally revised in 2014. MathML's strength for XML workflows is clearly its nature as an XML language: it fits most naturally into any XML workflow. As a W3C and ISO standard, it represents the most vetted open standard available.

MathML's weaknesses have to be discussed in two parts as it contains two independent sub-specifications: "Presentation MathML" (about visual rendering) and "Content MathML" (about semantic representation). Content MathML's weakness is rather simple and absolute: it practically does not exist in the wild, which makes a discussion moot. "Presentation MathML" is what most people mean when they speak of MathML. Its main weakness is that it does not actually specify layout but primarily an abstraction of equation layout. As a consequence, quality control can be difficult when multiple rendering systems are used (as they usually vary significantly); accordingly, editing tools and third party conversion services can vary significantly as well. Finally, Presentation MathML lacks semantic information, which limits the accessibility of the content. Although it has little impact on XML workflows, MathML's greatest weakness as a W3C standard is the lack of browser support and more specifically the lack of support from browser vendors: the partial implementations in Firefox and Safari were written by volunteers but activity has effectively ceased.
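
To make the difference between the two sub-specifications concrete, here is a minimal Python sketch (standard library only) that encodes the same expression, x squared, both ways. The choice of expression and the names `PRESENTATION`, `CONTENT` and `local_names` are mine, purely for illustration.

```python
import xml.etree.ElementTree as ET

NS = "http://www.w3.org/1998/Math/MathML"

# Presentation MathML: "an x with a raised 2" (layout only).
PRESENTATION = f'<math xmlns="{NS}"><msup><mi>x</mi><mn>2</mn></msup></math>'

# Content MathML: "the power function applied to x and 2" (meaning only).
CONTENT = f'<math xmlns="{NS}"><apply><power/><ci>x</ci><cn>2</cn></apply></math>'


def local_names(xml_text):
    """List element names in document order, without the namespace part."""
    root = ET.fromstring(xml_text)
    return [el.tag.split("}", 1)[1] for el in root.iter()]


if __name__ == "__main__":
    print(local_names(PRESENTATION))  # layout vocabulary: msup, mi, mn
    print(local_names(CONTENT))       # operator vocabulary: apply, power, ci, cn
```

Nothing in the Presentation version says that the raised 2 is an exponent rather than, say, an index or a footnote marker; that is the missing semantic information mentioned above.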

With care, these problems can be mitigated and MathML remains stable and the dominant format for equational content in XML workflows. Before adopting it for a new workflow, a minor concern should be that MathML is not actively maintained as a specification since the W3C Math Working Group's charter was not renewed in 2016. In principle, the W3C can update the specification but there is currently not enough interest in moving the specification forward.

### TeX/LaTeX

Despite MathML's dominance, TeX/LaTeX remains a strong competitor and (in my personal/consultancy bubble) is recently seeing a resurgence in usage. Now when people speak of TeX/LaTeX for XML workflows, it might not be quite clear what they mean. After all, TeX is a proper typesetting system (and in fact programming language) dating back to 1978 (and the LaTeX macro packages dating back to the early 80s). In the context of XML workflows, people really mean "math mode TeX/LaTeX" (and variants thereof) yet you will inevitably run into the problem of math mode and text mode being mixed.

TeX/LaTeX's greatest strength lies in the canonical visual rendering (by the TeX typesetting engine). In addition, various communities have developed stable subsets of TeX/LaTeX that work well in XML workflows (MathJax, texvc, jats4Reuse). TeX/LaTeX was designed for human authoring and in some cases (e.g., in research level documents) might be acceptably accessible while all the while being easily converted (e.g., to MathML).

TeX/LaTeX's weakness lies in the fact that it is a format for print layout and is thus purely visual, lacks semantic notation, and is not generally accessible (beyond expert readers). Its baroque syntax and its nature as a not-context-free programming language make it difficult to process in general. Simply put, macro resolution poses a problem when converting into other formats and the number of custom TeX packages is a problem for any kind of processing. Still, most publishers accepting TeX will severely limit what TeX packages can be used (ideally down to specific versions), which mitigates these problems in real life. While TeX's document model is quite reasonable in general, some popular TeX packages (e.g., graphical packages like tikz) hack it by writing directly to the output (DVI, PDF), making seemingly impossible (and in particular XML-incompatible) constructions possible.

For the web, the stable (La)TeX subsets are accompanied by another group of incompatible half-breeds, which can cause compatibility problems. Finally, especially older XML workflows will often contain full TeX/LaTeX documents instead of math-mode fragments which can complicate workflows considerably. Nevertheless, the stability and expressiveness of TeX outweighs the need for carefully limiting the allowed expressions in XML workflows.

### Other formats

In addition, there are numerous formats that are encountered less often in XML workflows. The most notable XML format is found in Microsoft Office products: "ECMA Math", the part of Microsoft's Office Open XML format that describes equational content. It offers a canonical rendering (via MS Office), an isomorphic XML and Unicode format, and converts relatively easily to MathML. It lacks semantics, is limited to Microsoft products, is poorly documented and its tools have unclear licensing. While it's obviously a very important format, for XML workflows it is usually converted to MathML (and discarded).

The remaining formats fall into two groups. The first consists of simplified ASCII/Unicode notations, most notably AsciiMath. These are human-readable (and relatively accessible), designed for students and easily converted into formats such as TeX or MathML. They have intentionally limited expressivity, are not as well established and are not standardized in any way. The other group consists of various mathematics-oriented programming languages, e.g., for Computer Algebra Systems (Maple, Mathematica), scientific computing (SciPy, R, Julia), and general purpose languages (Python, Java). They shine with their computational abilities, are highly semantic, are made for (human) programmers and convert easily (if lossily) to other formats. They lack expressivity in terms of conceptual knowledge and their syntax is not friendly to most authors. Finally, the strengths of these groups are not actually leveraged in XML workflows (e.g., for accessibility) as they are too often discarded after conversion.

### Recap

MathML is never the wrong choice for an XML workflow but it's important to keep in mind that it may not be the best choice for any particular workflow and that many XML workflows will keep at least one other format on the side. As there is no commonly agreed-upon data model for equational content, the most important consideration for a format is conversion to other formats, especially for your target document presentations (e.g., HTML, PDF). And of course there is plenty of content that cannot be efficiently captured anyway, so you have to be ready to handle images when you run into edge cases (and be sure to use vector graphics, please), leading down its own rabbit hole (e.g., matching fonts for text and the text in such graphics).

## Authoring Tools

Equational content for XML workflows has to come from somewhere, so it's important to consider authoring tools, both for your content's original authors as well as conversion service providers. You should spend some time evaluating your options here as, invariably, somebody in your team will have to edit something. The ever popular option of plain text editors should not be discounted. A key strength of XML is its nature as a raw text format with structure. Even equational content (e.g., in MathML) benefits from this as non-specialists are able to handle advanced editing problems with the guidance of an XML structure.

Beyond plain text editors, there are effectively two (related) groups of tools to keep in mind: specialist editors (equations only) and more general document authoring tools with built-in equation editing ability. The former are usually more focused on the user experience for equational content specifically. They also often provide more options for generating output so as to enable users to inject the output into many document authoring tools. The built-in editors are often more streamlined and focused on the specific environment they are built into. They might be more limited but that's often a benefit (e.g., matching the limitations of the surrounding document platform). You can find a decent list (mixing both types of editors) on Wikipedia.

## Equations in XML

Even if you've settled on a default format for equational content in your XML workflow, there's one important consideration to keep in mind: never lose any format given to you. All formats have some benefits and no format has all benefits, so your best bet to be future proof is to keep as many of them around as you can. In particular, MathML is rarely authored directly but usually converted from some other format or an editor's internal format (offering multiple outputs). It's best to add one or two additional formats to your workflow to simplify your life further down the road. Conversely, it's best to build your process to ensure you can generate MathML from your content so that you can hook into the wide variety of tools built around MathML.

Something else that is still often overlooked but usually mission critical: consider adding an XML-compatible output format for the web in your workflow (unless you're in a very strange place and don't need to convert to web-based formats). Since MathML is not a practical option for the web (and likely never will be), you will need to convert your equational content when producing web output. It's best to consider that path early on and ideally build it into your tool chain right away. The most viable options are HTML (with CSS) and SVG content. This is not always easy to integrate as inline SVG is not very common in document-level XML and inline HTML (with CSS) can be quite difficult.

And a final word: don't be tempted to convert equational content that can be handled in plain text or your surrounding XML (e.g., <sup>, <sub>) without also marking it as equational content and providing alternatives. This is an incredible loss of information (just think of trying to identify a variable named a from an equation in plain text; you won't stand a chance).
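
The variable example can be made tangible with a short Python sketch. The sample sentence, the JATS-style `<inline-formula>` wrapper and both helper functions are hypothetical, chosen only to illustrate the point; the sketch uses nothing beyond the standard library.

```python
import re
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# The same sentence, once flattened to plain text and once with markup.
PLAIN = "<p>Let a be a root of the polynomial.</p>"
MARKED = (
    "<p>Let <inline-formula>"
    f'<math xmlns="{MATHML_NS}"><mi>a</mi></math>'
    "</inline-formula> be a root of the polynomial.</p>"
)


def plain_text_hits(xml_text, name):
    """Count whole-word matches in the flattened text of a paragraph."""
    text = "".join(ET.fromstring(xml_text).itertext())
    return len(re.findall(rf"\b{re.escape(name)}\b", text))


def identifier_hits(xml_text, name):
    """Count MathML identifiers (<mi>) whose content equals the name."""
    root = ET.fromstring(xml_text)
    return sum(
        1
        for mi in root.iter(f"{{{MATHML_NS}}}mi")
        if (mi.text or "").strip() == name
    )


if __name__ == "__main__":
    print(plain_text_hits(PLAIN, "a"))   # variable and article are indistinguishable
    print(identifier_hits(MARKED, "a"))  # only the variable is found
```

A text search cannot tell the variable from the indefinite article, whereas the marked-up version identifies the variable exactly.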

## Common Challenges in XML

As mentioned earlier, MathML is a useful default choice when it comes to equational content in XML workflows even if it's not without its problems, both formal and practical. A few practical observations follow.

Quality assurance remains a significant challenge for most MathML workflows because MathML does not specify layout sufficiently and does not have a canonical rendering to compare against. For most (and all older) XML workflows, QA starts (and too often ends) with print production, i.e., content is only QAed in print output. More generally, QA usually focuses on a single MathML rendering engine and effectively compares the results against an author manuscript. In fact, one gets the impression that (intentionally or not) most publishers just leave it up to the author to check a final PDF for correctness; very few invest in specialized production teams or tools, or decent third party quality assurance. This naturally leads to issues when multiple MathML rendering engines start to handle your content, e.g., one for print, one for web, and possibly various ones rendering your content in ebook reading systems.

Legacy MathML content faces an additional challenge. It was usually QA'ed against early rendering engines with bugs and sometimes non-conformant implementations. The problem of fencing characters (such as parentheses) stretching to the height of the enclosed content is often a particular pain point, most likely due to MathML and TeX disagreeing on when to stretch automatically. Similarly, display style and rules for display style inheritance seem to have been buggy in old rendering engines. When moving such content from print to web-based rendering, such errors often pop up and can cause significant cost when not diagnosed correctly.

Another typical QA issue lies in Unicode Math Alphabets and MathML's mathvariant attribute. While the specification is fairly clear, many conversion vendors seem to be confused about when to choose which approach. It's rather simple, really: a) don't use Unicode math alphabets in MathML (use the mathvariant attribute with regular BMP codepoints); b) if you have such content, then don't mix one with the other. In essence, Unicode Math Alphabets should be considered harmful outside of print; there's no semantic value in something like "MATHEMATICAL SANS-SERIF BOLD ITALIC THETA SYMBOL" and it only limits how the author's intentions can be interpreted in non-print media. Authoring for the web is already undergoing a shift in stylistic expression and will continue to evolve, meaning that legacy content will have to be upconverted eventually.
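
As a rough illustration of rule a), the following Python sketch rewrites token elements that contain characters from the Unicode Mathematical Alphanumeric Symbols block into their plain base characters plus a mathvariant attribute. The function names and the name-prefix table are my own; the sketch assumes the style can be read off the Unicode character name, and it deliberately ignores the handful of styled letters that live in the BMP's Letterlike Symbols block (such as the double-struck capital R).

```python
import unicodedata
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("", MATHML_NS)  # serialize without a generated prefix
TOKEN_TAGS = {f"{{{MATHML_NS}}}{t}" for t in ("mi", "mn", "mo", "mtext")}

# Style part of the Unicode name -> mathvariant value (longest prefixes first).
STYLE_TO_VARIANT = {
    "SANS-SERIF BOLD ITALIC": "sans-serif-bold-italic",
    "SANS-SERIF BOLD": "bold-sans-serif",
    "SANS-SERIF ITALIC": "sans-serif-italic",
    "SANS-SERIF": "sans-serif",
    "BOLD ITALIC": "bold-italic",
    "BOLD SCRIPT": "bold-script",
    "BOLD FRAKTUR": "bold-fraktur",
    "BOLD": "bold",
    "ITALIC": "italic",
    "SCRIPT": "script",
    "FRAKTUR": "fraktur",
    "DOUBLE-STRUCK": "double-struck",
    "MONOSPACE": "monospace",
}


def split_styled(ch):
    """Return (base_char, mathvariant), or None if ch is not a styled letter."""
    name = unicodedata.name(ch, "")
    if not name.startswith("MATHEMATICAL "):
        return None
    rest = name[len("MATHEMATICAL "):]
    for style, variant in STYLE_TO_VARIANT.items():
        if rest.startswith(style + " "):
            # NFKC maps the styled character to its plain counterpart.
            return unicodedata.normalize("NFKC", ch), variant
    return None


def normalize_math_alphabets(math_xml):
    """Replace uniformly styled tokens by base characters plus mathvariant."""
    root = ET.fromstring(math_xml)
    for el in root.iter():
        if el.tag not in TOKEN_TAGS or el.get("mathvariant") or not el.text:
            continue  # never mix with an existing mathvariant (rule b)
        parts = [split_styled(ch) for ch in el.text.strip()]
        if not parts or None in parts:
            continue  # plain or mixed tokens are left untouched
        variants = {variant for _, variant in parts}
        if len(variants) != 1:
            continue
        el.text = "".join(base for base, _ in parts)
        el.set("mathvariant", variants.pop())
    return ET.tostring(root, encoding="unicode")
```

Tokens that already carry a mathvariant attribute are skipped on purpose; deciding what a mixed token was supposed to mean is a job for a human, not for a script.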

The more general problem is deciding when content should be rendered via a font engine or via MathML markup. In particular, combining characters such as accents can easily lead to problems in rendering engines that do not expect them, especially when mixed incorrectly with the corresponding MathML structures (e.g., <mover>).

Equation labels are another major issue in XML workflows. Despite MathML being XML, many workflows do not seem to want to take advantage of the fact. Thus, equation labels are often ripped out of the MathML content and stored in the XML. While this can be necessary in rare cases (e.g., when rendering to binary images is necessary), it is generally inadvisable. Too often, the labels are not rendered correctly when combining a MathML fragment with its label (especially when multiple labels are in one expression).
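
One way to keep labels inside the MathML is the `<mlabeledtr>` element of MathML 3, whose first child holds the label. The Python sketch below builds such a table from label/expression pairs; the helper names (`q`, `token`, `labeled_table`) and the parenthesized label style are assumptions made for the sake of the example, and renderer support for `<mlabeledtr>` should be checked for your own tool chain.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("", MATHML_NS)  # serialize without a generated prefix


def q(name):
    """Qualified element name in the MathML namespace."""
    return f"{{{MATHML_NS}}}{name}"


def token(name, text):
    """Create a token element such as <mi>x</mi>."""
    el = ET.Element(q(name))
    el.text = text
    return el


def labeled_table(rows):
    """Build a display <math> with one <mlabeledtr> per (label, expression)."""
    math = ET.Element(q("math"), {"display": "block"})
    table = ET.SubElement(math, q("mtable"))
    for label, expression in rows:
        row = ET.SubElement(table, q("mlabeledtr"))
        # In MathML 3 the first child of <mlabeledtr> is the label cell.
        label_cell = ET.SubElement(row, q("mtd"))
        ET.SubElement(label_cell, q("mtext")).text = f"({label})"
        ET.SubElement(row, q("mtd")).append(expression)
    return math


if __name__ == "__main__":
    math = labeled_table([("1", token("mi", "x")), ("2", token("mi", "y"))])
    print(ET.tostring(math, encoding="unicode"))
```

Because each label travels with its row, an expression with several labels stays intact through every later processing step.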

Similarly, alternative formats (e.g., TeX, alt text, alt images) are often stored separately (e.g., in JATS's <alternatives> element) and then not passed along with MathML (using its annotation model). This means your content loses information unnecessarily during processing, preventing end users from benefiting from these alternatives. Finally, systematic limitations are easy to encounter. MathML is limited to single expressions and thus cannot provide cross-document information (such as equation numbering or alignment information); it also cannot encode certain types of equational content (e.g., most commutative diagrams). And of course Unicode remains limited for STEM content, leading to PUA-heavy fonts (e.g., in chemistry).
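
Passing an alternative along with the MathML can be as simple as wrapping the expression in `<semantics>` and adding an `<annotation>`. The following Python sketch does this for a TeX source; the function name is hypothetical and the encoding value `application/x-tex` is the conventional label for TeX annotations rather than something this article prescribes.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"
ET.register_namespace("", MATHML_NS)  # serialize without a generated prefix


def q(name):
    """Qualified element name in the MathML namespace."""
    return f"{{{MATHML_NS}}}{name}"


def with_tex_annotation(math_xml, tex):
    """Attach the TeX source to a <math> element via <semantics>."""
    root = ET.fromstring(math_xml)
    children = list(root)
    if len(children) == 1 and children[0].tag == q("semantics"):
        # Already wrapped: just add another annotation.
        semantics = children[0]
    else:
        for child in children:
            root.remove(child)
        semantics = ET.SubElement(root, q("semantics"))
        if len(children) == 1:
            semantics.append(children[0])
        else:
            # <semantics> expects a single first child, so group with <mrow>.
            semantics.append(ET.Element(q("mrow")))
            semantics[0].extend(children)
    annotation = ET.SubElement(
        semantics, q("annotation"), {"encoding": "application/x-tex"}
    )
    annotation.text = tex
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    src = f'<math xmlns="{MATHML_NS}"><mi>x</mi><mo>+</mo><mn>1</mn></math>'
    print(with_tex_annotation(src, "x+1"))
```

Downstream tools that do not understand the annotation can ignore it, while those that do (e.g., a copy-as-TeX feature) get the original source for free.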

## MathML and today's web

On today's web, MathML is effectively dead: native browser support is limited to partial implementations in Firefox and Safari but the contributor-driven development has effectively stalled. Accordingly, very few professional publishers will use MathML without at least a client-side JavaScript library to convert it into supported web formats on the fly. The development of MathML has equally ceased, primarily due to the failure of MathML to gain interest from browser vendors.

This is also bad news for XML workflows. The perpetual, empty promise of MathML becoming natively implemented in browsers has prevented the specification from resolving and improving its situation in XML workflows, the source of MathML's success in the first place. As a format, MathML is more reminiscent of HTML3 than HTML5 in the way it mixes pseudo-semantic elements with pure layout information. It lacks useful semantics (see above) and favors print-oriented layout features such as table layout (including its own table specification, incompatible with HTML tables).

Nevertheless, MathML is a good basis for creating content for the web if only because its XML nature can easily be (ab)used to provide richer output for the web, e.g., in terms of ARIA markup or application data.

## Rendering equations on the web

Today there are many tools that convert equational content to HTML (with CSS), SVG, and even Canvas with excellent quality. MathJax is certainly the gold standard, trusted by tens of thousands of sites, serving over 200 million end-users each month. Thanks to a history of 14 years, MathJax is able to render MathML at higher quality both visually (using Knuth's TeX layout algorithms) and more accessibly (using Volker Sorge's semantic enrichment process) than MathML itself could. In addition, it can process TeX/LaTeX and AsciiMath input alongside MathML. More importantly, MathJax has been leading the way for more equation rendering tools to be developed for the web, from jqmath and fmath (MathML only) to MathQuill, MathLive and KaTeX (TeX/LaTeX only) in the open-source space. Given the rich open-source tools, proprietary solutions are rarer but exist, e.g., WIRIS and ShareMath.

When rendering, the choices are primarily HTML (with CSS) and SVG. The choice will likely depend on your own requirements but keep in mind that unless you have interactive needs (such as live equation editing), your content should be converted to HTML or SVG on the server. Finally, web rendering is beginning to invade the space of PDF/print solutions, with Vivliostyle leading the charge. But of course this space is already filled with XML to print solutions such as Antenna House and Prince XML.

## Accessible math on the web

If you consider publishing to the web, accessibility considerations are important given the rich facilities of today's web ecosystem. Without reasonable native browser support for visual rendering, MathML is not a good choice here. That loss is minor as Presentation MathML only describes layout, and so it is fundamentally incapable of producing accessible content the same way as is expected of other web content today. Nevertheless, it's important to be aware that some assistive technologies invested in support for MathML over the years, leading to broken expectations. JAWS and VoiceOver have quite partial support that provides little beyond layout information. NVDA can leverage MathPlayer, a former IE plugin which has not been maintained for some time. ChromeVox's MathML support was extracted by Volker Sorge, one of its creators, into a stand-alone JavaScript solution, Speech Rule Engine. Finally, you can find a few proprietary solutions, primarily focused on the limited scope of their content (e.g., Desmos, Khan Academy) or their editing environment (e.g., WIRIS).

Speech Rule Engine is by far the most actively supported accessibility solution today. It employs heuristics to generate a semantically rich format from which it can create alternative renditions for accessibility purposes. Speech Rule Engine is also integrated into MathJax where it is leveraged for a full-fledged, client-side accessibility solution providing voicing, exploration and navigation in HTML and SVG output. Speech Rule Engine is the best accessibility solution not only because it is the only one actively developed but also because of its quality and the planned (read: funded) new features in the pipeline.

The fundamental problem with all existing accessibility solutions for MathML is the complete disregard of established web technologies for accessibility, in particular ARIA. This makes it difficult for content production to enrich its MathML content for improved accessibility unless it is carefully coordinated with an HTML or SVG rendering engine.