The XML Scoop on Office 9

July 5, 1998

Liora Alschuler and Mark Walter

The Seybold Report on Internet Publishing
Special for

by Mark Walter and Liora Alschuler

The audience at SGML/XML Europe was treated to the first public preview of the new file formats for Office for Windows (code-named Office 9), presented by Jean Paoli of Microsoft in his keynote address. Though he spoke of Microsoft's adoption of XML, those looking for Microsoft to address structured editing can't help but be disappointed. Instead of structured documents, the emphasis for Office 9 will be on making it easier to publish Word, PowerPoint and Excel documents on the Web. Microsoft found, in its earlier foray into SGML editing, that a generic tool that works with arbitrary DTDs is hard to sell as a shrink-wrapped application to a wide market. Office is a mass-market product, and right now the Office team has its hands full just trying to fulfill Chairman Gates's vision of Microsoft Office as the number one Web authoring suite.

Word, HTML and XML.

Microsoft is being very clear about this: The XML implementation of Word in Office 9 is not, not, a structured editor. What we saw demonstrated from the podium in Paris, and what Microsoft will release in beta sometime this summer, is a version of Word that will offer writers the option of saving files in HTML as a default format. That's right: the source can be saved in HTML instead of in Microsoft's binary format. Users will be able to edit these HTML files, in Word, without having to jump through the hoops of exporting and importing as they do now.

"We're focused on making great HTML for today's browser technology," said Marc Olson, Group Program Manager for Office Web authoring features. Olson clarified the statement, explaining that "today's browser technology" meant Navigator 3 and IE3 and later browsers, not just IE4 and Mozilla 4.

One slick aspect of the implementation is that you can ask Word to keep you from using features that older browsers cannot display. Word then grays out those features, or tightens their parameters, so that you cannot use them.

Of course, there will still be the option of the binary format (*.doc files). Even better news, though, is the fact that Office 9 will be binary compatible with Office 97. RTF will also still be supported, and Microsoft says that older versions of Word will be able to read the RTF output of Office 9, even though it will add a few new wrinkles.

HTML does not carry all of the information that Word keeps in its binary format: file properties, revision marks or preference settings, for example. In order to use HTML as a full-fledged data format for word processing documents, Microsoft will embed CSS styles and extensions and XML fragments inside the HTML document. These style sheets and fragments will carry the information necessary for Word to process the HTML document as if it were an ordinary Word document.

According to Olson, the set of tags that Word uses will be fixed in stone by Microsoft in order to ensure smooth processing of files. No stray random tags, no random DTDs; no ambiguities for the software program. Word will not even allow the author to name the linked graphics files within a document; the only control over naming conventions will be whether referenced files are stored in the same directory as the base document or in a separate subdirectory.

What will be user-definable are styles. Word will write style information into HTML files both as embedded HTML markup and as embedded CSS style sheets. According to Olson, Microsoft is embedding the formatting in HTML in order to ensure that Word files can be read by third-generation browsers (Navigator 3 and Internet Explorer 3). The use of CSS, where paragraph-level styles in Word are given correspondingly named CSS styles, enables Microsoft to add its own CSS extensions for storing information that Word uses but that HTML may not handle, such as a scotch-ruled border or images in borders.

The CSS styles are embedded, rather than referenced as external documents (as is typically done in professional Web authoring), according to Olson, in order to minimize the number of physical files associated with each document. Even though we agree that the average Word user will probably prefer to deal with self-contained documents, we would still like to see Microsoft offer users the option of saving referenced style sheets. (Word 9 will import, but not export, referenced CSS style sheets created with other products.) For those who do use Word for professional applications, such support would raise Word's stature as an HTML editor.

As for the HTML markup, Microsoft is focusing on simplicity at the expense of control. All HTML element tags are generated automatically, with no control over the mapping of style tags to element tags. For example, if you create a heading called "part number," it will be mapped to some HTML element (most likely <p>), with a CSS style called "part number." Changing its name will affect the CSS style, but not the HTML element.
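The mapping described above can be sketched in a few lines of Python. This is a minimal illustration, not Microsoft's implementation: the function names `css_class_for` and `style_to_html`, the rule for turning a style name into a class name, and the sample CSS declaration are all assumptions made for the example. It shows only the idea that the style name drives the CSS rule while the HTML element stays fixed.

```python
import html
import re


def css_class_for(style_name: str) -> str:
    """Turn a Word-like style name into a CSS class name (hypothetical rule)."""
    # Assumption: runs of non-alphanumeric characters collapse to one hyphen.
    return re.sub(r"[^A-Za-z0-9]+", "-", style_name.strip()).strip("-").lower()


def style_to_html(style_name: str, text: str, declarations: str) -> tuple:
    """Return (css_rule, html_element) for one styled paragraph.

    The element is always <p>; only the class name follows the style name.
    """
    cls = css_class_for(style_name)
    css_rule = f"p.{cls} {{ {declarations} }}"
    element = f'<p class="{cls}">{html.escape(text)}</p>'
    return css_rule, element


rule, element = style_to_html("part number", "PN-1138 & spares", "font-weight: bold")
print(rule)     # p.part-number { font-weight: bold }
print(element)  # <p class="part-number">PN-1138 &amp; spares</p>
```

Renaming the style in this sketch changes the class name in both the rule and the element, but the element itself remains `<p>`, which mirrors the lack of control over element mapping noted above.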

Users will be able to create a limited number of XML metadata tags through the document properties dialog. It would be useful if these were to support RDF in some direct manner; Microsoft is making no promises in this regard.

The Office 9 document file format is HTML with XML fragments inserted between <XML> and </XML> tags. Inside the fragments, Word will create well-formed XML. Outside the XML fragments, the HTML will follow the current W3C DTD for HTML (with some Microsoft extensions of course!). Ironically, because HTML does not use XML conventions for empty elements (<IMG> not <IMG/>) and does not follow XML requirements for well-formedness (<P> does not require </P>), the hybrid XML-in-HTML Microsoft Office 9 file format is closer to being legal SGML than it is to being valid XML. In fact, it would be more accurate to say Word 9 supports editing SGML according to a fixed DTD than to suggest that Word 9 offers XML editing.
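The distinction can be checked with any XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`; the sample document, including the `DocumentProperties`, `Author` and `Revision` element names inside the fragment, is invented for illustration and is not taken from an actual Office 9 file. It shows that a hybrid file of this general shape fails as a whole to parse as XML, while the fragment between the `<XML>` tags parses cleanly on its own.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical hybrid document: HTML with an XML fragment between <XML> tags.
SAMPLE = """<HTML>
<HEAD>
<XML>
<DocumentProperties>
<Author>A. Writer</Author>
<Revision>3</Revision>
</DocumentProperties>
</XML>
</HEAD>
<BODY>
<P>First paragraph, with no closing tag.
<P>An image: <IMG SRC="chart.gif">
</BODY>
</HTML>"""


def is_well_formed(text: str) -> bool:
    """Return True if text parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def extract_fragments(text: str) -> list:
    """Return the contents of each <XML>...</XML> section in the document."""
    return re.findall(r"<XML>(.*?)</XML>", text, flags=re.DOTALL | re.IGNORECASE)


fragments = extract_fragments(SAMPLE)
print(is_well_formed(SAMPLE))        # False: unclosed <P> and <IMG>
print(is_well_formed(fragments[0]))  # True: the fragment stands on its own
props = ET.fromstring(fragments[0])
print(props.findtext("Author"))      # A. Writer
```

A conversion tool could take the same approach: pull out the fragments with a simple scan and hand only those to an XML parser, leaving the surrounding HTML to an HTML or SGML parser.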

PowerPoint and Excel.

The PowerPoint support for HTML will be a marked improvement over PowerPoint 7 and 8. It will write out the elements of slides as HTML objects, so that text on bulleted slides, for example, can be edited. According to Olson, the display of these HTML files will look best in a browser that supports CSS 2. It also will work well with Internet Explorer 5, since the Office team "worked closely with the Internet Explorer team," according to Olson. As an example: The Internet Explorer team has had months to mull over how it will support VML, Microsoft's proposed Web format for vector graphics, which just happens to be used by PowerPoint 9. Netscape was not in on the development. See our additional coverage of VML.

Excel spreadsheets are fairly straightforward to render as HTML tables. In Office 9, Excel, like Word, will make use of CSS extensions and XML to store formulas, chart data, the structure of workbooks and other information that underlies the spreadsheet cells. In addition to rendering spreadsheets as HTML tables, Excel 9 will use framesets to link the multiple worksheets within a single workbook.

What's the impact?

For the professional publishing market, where Word is often used as the input for stories poured into page makeup programs or Web pages, or even as a front end for loading documents into SGML databases, the use of XML/HTML in the data stream will not be a dramatic change, but it is a significant improvement. It will not be true structured editing, so conversion will still be required in most instances, but the XML fragments will be more straightforward to convert than the convoluted RTF. If Microsoft keeps to its pledge of binary compatibility with Office 97, developers will still be able to use their existing import filters to read Word files. Publishers that have RTF conversion routines, or Visual Basic add-ons that capture metadata, should examine the new output to see how these need to be adapted, should they choose to upgrade.

We doubt professional users will like it as an HTML editor: it offers too little control, and it inserts too much proprietary markup. In contrast, the ability to save and edit HTML directly will be of great benefit to the mass market (which is, after all, the bulk of Word users), many of whom are increasingly creating documents for publishing via the Web.