Beyond XML: Making Books with HTML
February 20, 2017
For seven years, Hachette Book Group, the fourth-largest trade publisher in the U.S., has been creating both print and digital books with HTML and CSS. We’ve published a thousand different titles. We've sold more than fifty million print books, and untold numbers of ebooks. Doing this work ourselves in Manhattan saves large amounts of money over offshore conversion. It's easy and natural to create digital-only titles, or even to add a print-on-demand edition to a digital original. By leaving the page layout metaphor behind, and treating print and digital as aspects of the same content, we work faster, better, and cheaper.
Trade books are what you find in general-interest bookstores—from novels to children’s books to self-help—but not textbooks, STEM, or professional books.
Before I joined Hachette, I was working for a small typesetting company in Vermont, living the usual life of people who work with XML: dealing with urgent requests for changes to our DocBook-based schema, and despairing at the “creative” markup of my offshore colleagues. Who among us hasn't seen <para role="title">? We did typesetting for most of the big U.S. book publishers, including Hachette. They had what was then considered a state-of-the-art "XML-First" workflow, used for novels and narrative non-fiction. Manuscripts in Microsoft Word were converted to DocBook, some presentation-oriented markup was added in XSLT, and the result was imported into InDesign. When typesetting was finished, a PDF from InDesign was sent to the printer, and the XML was exported, reprocessed, and used as the starting point for ebook production.
Word’s track changes feature is so fundamental to the editorial process that we don’t dare change it. And, unlike some areas of publishing, we are in no position to dictate to our authors how they work. So everything begins in Word, but we abandon it as soon as copy-editing is done.
It worked well enough. Print was never a problem. Hachette’s ebooks were pretty good, because XML was a far better starting point than page-layout files or (the horror!) PDF. Every once in a while a stray control code from InDesign would end up in an ebook. But we weren't reaping the promised benefits of XML: the separation of content and style, and true single-source publishing. In trade publishing, a book might have a half-dozen different print editions as well as digital editions: hardcover, large print, trade paperback, mass market paperback, international mass market, and so on. This is where the whole page-layout model breaks down. In InDesign, the content and the presentation are inseparable—two different presentations of the same content means having two independent files. Even embedding XML in InDesign doesn’t change anything; you still have to maintain that content in multiple files. We needed something better.
What we found came from a tiny company called Infogrid Pacific, founded by an expat New Zealander who’d lived in India for twenty years. Richard Pipe had worked in the trenches doing document conversion, typesetting, and XML, and had seen everything that could possibly go wrong. “Digital Publisher” was his response: a web application for multi-format publishing. We call it Dante, being fond of both the typeface and the author.
Dante brought three transformative things to Hachette. First, we used HTML as our XML vocabulary. Second, we started creating print PDFs directly from HTML and CSS. Third, we insourced book production, moving it from offshore vendors to our own employees in New York.
Dante is a web application written in Django. Although most of the program logic is in Python, most content manipulation is done with XSL. Word manuscripts are converted to HTML by using LibreOffice to export as XHTML, and then a sequence of XSL stylesheets converts the raw output to our HTML vocabulary (see appendix). The HTML book content lives in a database, split into chapters. There are design, editing, and content processing tools. Critically, we can create any number of editions for a book, which essentially means a new stylesheet and new metadata. Of course we can output to PDF, various flavors of EPUB, and Word; last time I checked, there were twenty-three output formats.
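To give a feel for the shape of that staged cleanup, here is a minimal sketch. It is not Dante's code: Dante uses XSL stylesheets, and this stands in plain Python functions over the standard library's ElementTree instead. The raw class names (`Heading_1`, `P1`) and the three stages are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical raw export; the class names are assumptions, not real LibreOffice output.
RAW = (
    '<body>'
    '<p class="Heading_1">Loomings</p>'
    '<p class="P1">Call me Ishmael.</p>'
    '<p class="P1"></p>'
    '</body>'
)

def drop_empty_paragraphs(root):
    """Stage 1: remove paragraphs that have no text and no child elements."""
    for parent in list(root.iter()):
        for child in list(parent):
            is_empty = not (child.text or "").strip() and len(child) == 0
            if child.tag == "p" and is_empty:
                parent.remove(child)
    return root

def promote_headings(root):
    """Stage 2: turn export-style heading paragraphs into h1 elements."""
    for el in root.iter("p"):
        if el.get("class") == "Heading_1":
            el.tag = "h1"
            del el.attrib["class"]
    return root

def strip_export_classes(root):
    """Stage 3: drop auto-generated classes from the remaining paragraphs."""
    for el in root.iter("p"):
        el.attrib.pop("class", None)
    return root

# Each stage takes a tree and returns a tree, so stages can be reordered or added.
PIPELINE = [drop_empty_paragraphs, promote_headings, strip_export_classes]

def convert(raw_xhtml):
    root = ET.fromstring(raw_xhtml)
    for stage in PIPELINE:
        root = stage(root)
    return ET.tostring(root, encoding="unicode")

print(convert(RAW))
# <body><h1>Loomings</h1><p>Call me Ishmael.</p></body>
```

Keeping each stage small and single-purpose is the point of the sketch: a new kind of mess in the raw export gets its own stage rather than a change to an existing one.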
As an XML practitioner, the biggest shock was using HTML. At first I thought it was crazy. But it fits our content much more naturally than the DocBook customizations. It works beautifully with the CSS cascade. The ability to customize on the fly, without going through tooling changes or DTDs, is essential for books. My theory is that nearly every book has some unique element that didn't occur in the previous thousand books.
The entire HTML ecosystem is a tremendous benefit. Experimenting and learning require only a text editor and a browser. Years ago, I had to troubleshoot a CALS table full of row and column spans. The only way to see what was going on was to run it through the DocBook stylesheets to get a visual rendering. Compare that to the convenience and power of browser dev tools.
We use Prince to create PDFs for print. Prince is a complete implementation of the web stack (HTML, CSS, SVG, MathML) but outputs to PDF rather than screens. Interestingly, it’s written in a functional programming language called Mercury, invented at the University of Melbourne. The output is high quality: there's very good support for OpenType font features, and it uses the TeX justification algorithm. But it took us a while to get there. We had to find better hyphenation dictionaries and cope with hyphenation exceptions, and it took months to find an ellipsis character that satisfied our editors.
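The idea that an edition is essentially a new stylesheet over the same HTML can be sketched as a small driver script. This is an illustration only, not how Dante invokes Prince: the edition names and file names are hypothetical, and the only Prince options assumed are the command-line flags `-s` (add a stylesheet) and `-o` (output file).

```python
import subprocess

# Hypothetical editions: one shared HTML source, one stylesheet per edition.
EDITIONS = {
    "hardcover": "hardcover.css",
    "large-print": "large-print.css",
}

def prince_command(html_path, css_path, pdf_path):
    """Build the argument list for one Prince run."""
    return ["prince", html_path, "-s", css_path, "-o", pdf_path]

def build_all(html_path, run=False):
    """Return one command per edition; execute them only when run=True."""
    commands = [
        prince_command(html_path, css, f"{name}.pdf")
        for name, css in EDITIONS.items()
    ]
    if run:
        for cmd in commands:
            # Requires Prince to be installed and on the PATH.
            subprocess.run(cmd, check=True)
    return commands

for cmd in build_all("book.html"):
    print(" ".join(cmd))
```

The content file never changes between runs; only the stylesheet and the output name do.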
The bad news is that we do a lot of manual tweaking to fix widows, orphans and bad breaks. This is done by applying word- and letter-spacing to individual paragraphs, using a separate CSS file. Some fixes require content changes, such as inserting discretionary hyphens or zero-width spaces. Since these content changes are design-specific, they are stored separately in the database, so they do not influence other editions. We are working on automating more of this page composition process by analyzing the resulting PDF—we don’t yet have access to Prince’s internal layout model.
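A rough sketch of how such edition-specific fixes might be kept apart from the shared content follows. Everything here is an assumption made for illustration: the paragraph ids, the spacing values, and the idea of addressing paragraphs by id are hypothetical, and Dante's real storage and selectors may look quite different.

```python
SOFT_HYPHEN = "\u00ad"  # discretionary hyphen

# Hypothetical spacing tweaks for one edition, keyed by paragraph id.
TWEAKS = {
    "ch01-p014": {"word-spacing": "-0.02em"},
    "ch01-p027": {"letter-spacing": "0.005em", "word-spacing": "0.03em"},
}

def tweaks_to_css(tweaks):
    """Render the tweaks as a separate CSS file body, one rule per paragraph."""
    rules = []
    for para_id in sorted(tweaks):
        decls = "; ".join(
            f"{prop}: {val}" for prop, val in sorted(tweaks[para_id].items())
        )
        rules.append(f"#{para_id} {{ {decls}; }}")
    return "\n".join(rules)

def apply_content_fixes(text, fixes):
    """Return an edition-specific copy of text; the shared text is not modified."""
    for word, replacement in fixes.items():
        text = text.replace(word, replacement)
    return text

print(tweaks_to_css(TWEAKS))
# #ch01-p014 { word-spacing: -0.02em; }
# #ch01-p027 { letter-spacing: 0.005em; word-spacing: 0.03em; }

shared = "see the watery part of the world"
edition_text = apply_content_fixes(shared, {"watery": "wa" + SOFT_HYPHEN + "tery"})
```

Because both kinds of fix live outside the shared content, dropping them for another edition is just a matter of not applying them.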
Dante is still a work in progress. The design tools are awkward. The user interface can be less than intuitive. But, aside from the business metrics of money saved and books published, I’m struck by the loyalty and affection that our users have for this eccentric little tool. Even our most die-hard InDesign user now much prefers Dante. We believe that single-source content in HTML, with styling expressed with CSS, is the best way to make our books.
Appendix: Sample HTML Markup
Here’s an example of the HTML vocabulary we use (Moby-Dick is the “Hello, World” of the ebook community). Everything is based on class attributes with (often) multiple values, with the -rw suffix reserved for the default defined vocabulary. H1 is always reserved for the title of a chapter or other large division of a publication. A highly opinionated post about XML and this markup can be found here.
<div class="galley-rw">
  <div class="body-rw Chapter-rw">
    <div class="title-block-rw">
      <p class="title-num-rw">Chapter 1</p>
      <h1>Loomings</h1>
    </div>
    <p>Call me Ishmael. Some years ago—never mind how long precisely—having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world.…</p>
    <div class="block-rw extract-rw headline-cus">
      <p>Whaling Voyage by one Ishmael.</p>
    </div>
  </div>
</div>
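Because the vocabulary lives entirely in class attributes, it is easy to inspect with ordinary tools. The sketch below uses Python's standard `html.parser` to separate class names carrying the -rw suffix from everything else. The `ClassCollector` name is invented, and treating every non-rw class (such as `headline-cus`) as a book-specific addition is an assumption, not something the vocabulary defines here.

```python
from html.parser import HTMLParser

SAMPLE = (
    '<div class="block-rw extract-rw headline-cus">'
    '<p>Whaling Voyage by one Ishmael.</p>'
    '</div>'
)

class ClassCollector(HTMLParser):
    """Collect class names, split by whether they end in the -rw suffix."""

    def __init__(self):
        super().__init__()
        self.default_vocab = set()
        self.other = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                for token in value.split():
                    if token.endswith("-rw"):
                        self.default_vocab.add(token)
                    else:
                        self.other.add(token)

collector = ClassCollector()
collector.feed(SAMPLE)
print(sorted(collector.default_vocab))  # ['block-rw', 'extract-rw']
print(sorted(collector.other))          # ['headline-cus']
```

A report like this, run across a whole book, would show at a glance which classes are outside the default vocabulary and may need styling of their own.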