The Road to XML: Adapting SGML to the Web

October 2, 1997

D.C. Denison


Many computer scientists have talked about simplifying SGML. The W3C's XML Editorial Review Board has been working at it since July '96. So far, their efforts have received almost universal acclaim. Recently D.C. Denison canvassed a group of Editorial Review Board (ERB) members, and asked them to look back on how the XML project got off the ground, and where they think it's going from here.

XML wasn't the only acronym in the running when W3C's Working Group began to consider a name for what they hoped to create: specifications for a subset of SGML that was optimized for the Web.

"There were several acronyms that we considered," Tim Bray remembers. "I believe there was MGML, for Minimal Generalized Markup Language, and something called SIMPL for Simple Internet Markup Protocol, or something like that. Eventually we voted, and XML--for Extensible Markup Language--won out. It was short and sweet, and people liked it."

"Marketing XML to the HTML user was one of our prime goals," Jean Paoli adds. "We thought that putting the spin on the 'Extensibility' part of the language would attract the HTML user."

Choosing a name for the project was trivial, of course, compared to some of the other challenges that faced the group when they first started working together in July 1996. Many other efforts to simplify SGML had run out of steam long before reaching the proposal stage. Somehow, however, this group managed to pull it off, publishing a working draft that's received wide acceptance in both the SGML and Web communities. How did they do it?

Let's Go Back a Few Years

A slimmed-down SGML is not a new concept. Many members of the XML team have been discussing the idea for years.

"Most computer scientists who have worked with SGML have proposed simplifications; specifically, keeping all the structural flexibility but losing many syntax options," ERB member Steve DeRose says. "I've heard of about a dozen proposals over the years."

Some of the XML authors, in fact, were already using a sort of proto-XML.

"I think it's important to understand that I and some other people had actually been doing XML for years," Tim Bray says. "A lot of people who are in the business were actually using SGML data in the case of open text searching and displaying. In the case of electronic book technology, there was a similar kind of story: we had long observed the fact that if they sent you some nicely-tagged text you could do any number, any amount of useful things with it without worrying about the minutiae of the standard and without having to have a DTD. So what XML in effect is has been around for a long time."

Dave Hollander was another XML author who had already jumped the gun, so to speak.

"I developed a simplified SGML language while working on HP's LaserROM program in the early '90s," he recalls. "That evolved into the language used in our HP-UX help systems."

The rise of the Web, and HTML, pressed other members of the ERB to approach XML from the other direction.

"Everyone who uses HTML for very long discovers that they want 'just one more tag,'" according to Steve DeRose. "If you're doing catalogs you need a <PRICE> tag; for repair manuals you need <PARTNUM>; for ancient manuscripts you need <LACUNA> and <SIC>. Having been through this enough times, I want to be able to create new information structures any time my data justifies them, and do it easily. This is why C++ lets you make your own classes (imagine a development environment that didn't!), and it's why XML is absolutely necessary. To do generalized processing, retrieval, etc., I have to be able to say what things in documents are. I can do that with XML, but I can't do it with any one particular fixed tag set."
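DeRose's catalog case can be sketched in a few lines of present-day Python using the standard library's xml.etree.ElementTree module. This is only an illustration under stated assumptions: the CATALOG and ITEM elements, the part numbers, and the prices are invented here, and only the PRICE and PARTNUM tag names come from his remarks. Because each tag says what its content is, a program can pull out a price list without guessing from layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical catalog fragment. Only PRICE and PARTNUM are named in the
# article; CATALOG, ITEM, and all the values are assumptions for illustration.
CATALOG_XML = """
<CATALOG>
  <ITEM>
    <PARTNUM>A-100</PARTNUM>
    <PRICE>19.95</PRICE>
  </ITEM>
  <ITEM>
    <PARTNUM>B-200</PARTNUM>
    <PRICE>4.50</PRICE>
  </ITEM>
</CATALOG>
"""

def price_list(xml_text):
    """Return {part number: price}, reading the tags that name each field."""
    root = ET.fromstring(xml_text.strip())
    return {
        item.findtext("PARTNUM"): float(item.findtext("PRICE"))
        for item in root.iter("ITEM")
    }

prices = price_list(CATALOG_XML)
print(prices)
```

The same data marked up with a fixed, presentation-oriented tag set would give the program nothing to key on except position and formatting.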

Jean Paoli was also well aware of HTML's shortcomings. "I discovered that a lot of Web content providers were using what they called 'structured comments' to hide information in their HTML," he says. "I was convinced that they needed a simple way to extend HTML, and I always thought that it could be a kind of simplified SGML that my SGML customers were all already using."

Jon Bosak was similarly inspired.

"XML arose from the realization that HTML is insufficient for certain kinds of Web applications," he says. "I was one of the people who came to this realization early because I was working in a field--online technical documentation--where the requirements are well understood and it's clear that HTML can't meet them. I was putting this complex material in online browsers used by millions of people before anyone had heard of HTML or the Web, and I knew from experience that HTML wasn't going to work for that kind of publishing. I knew that it wouldn't work well for any kind of large-scale content production. So I could see a time coming when large content providers would have to turn from HTML to something more powerful. The question was, what would they provide?

"I could see only two possibilities: either the big software companies would offer proprietary and probably binary-coded formats or we could get them to adopt a single, standard, human-readable format. The only standard solution that I knew could do the job was SGML."

Bosak's solution: "I started a working group in the W3C to provide specifications that could put SGML on the Web. What came out of that activity was XML--a subset of SGML designed for Web use."

Working and Evangelizing

The official W3C group, originally called the SGML ERB, began working together in July 1996; the larger mailing list discussions, the SGML Working Group, started the following September. Work proceeded quietly through most of 1996 and early 1997, via teleconferences, email, and the occasional conference. (In July '97, the SGML ERB became the XML WG and the SGML WG became the XML SIG.)

Meanwhile interest was growing, as the XML authors discussed the project with their colleagues. Perhaps it was an early indication of XML's flexibility that some authors, like Tim Bray, found that they could tailor their descriptions of XML to their audience.

"If I was talking to people to whom search and retrieval is very important, I would point out that when you invent your own tags, you can use them to drive searches and that's a lot better," he recalls.

"When I was talking to people to whom Java and that whole type of thing is important, I would point out that HTML is fine but it doesn't give Java much to chew on. And XML does. And if I was talking to people who are in the publishing business and are irritated at HTML's fairly primitive page make-up facilities, I would point out that one solution to that is to de-couple the markup syntax and the formatting semantics, and XML does that."
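The de-coupling Bray describes can be illustrated with a rough Python sketch. It is not how any XML style sheet language works; the two dictionaries below are hypothetical stand-ins for style sheets, and the ARTICLE, TITLE, and AUTHOR tag names are invented for the example. The point is only that the document says what each piece is, while the formatting rules live somewhere else and can be swapped.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the tag names are assumptions for illustration.
DOC = "<ARTICLE><TITLE>The Road to XML</TITLE><AUTHOR>D.C. Denison</AUTHOR></ARTICLE>"

# Two stand-in "style sheets": each maps a tag name to a formatting rule.
PLAIN_STYLE = {
    "TITLE": lambda text: text.upper(),
    "AUTHOR": lambda text: "by " + text,
}
HTML_STYLE = {
    "TITLE": lambda text: "<h1>" + text + "</h1>",
    "AUTHOR": lambda text: "<p><i>" + text + "</i></p>",
}

def render(xml_text, style):
    """Format each child element using the rule the style gives for its tag."""
    root = ET.fromstring(xml_text)
    return [style[child.tag](child.text) for child in root]

plain = render(DOC, PLAIN_STYLE)
html = render(DOC, HTML_STYLE)
print(plain)
print(html)
```

One source document yields two presentations, and neither required touching the markup.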

When ERB member C.M. Sperberg-McQueen spoke to colleagues about XML, he promoted "the ability to use your own tags, rather than the rather eccentric and constricted vocabulary of HTML," he recalls. "That's easily the most important aspect of XML from the point of view of academic research. The ability to write an XML parser that fits in 30 Kb of memory also captured the attention of a lot of programmers and tool developers."

Eve Maler found that the XML applications that generated the most excitement were "the ones that blur the distinction between information delivery and transacting business, such as ordering a new part by clicking on a part number in an online service bulletin. And the idea of using XML as an exchange protocol for purely transaction-oriented applications is also pretty popular, as we've seen by the quick promotion of XML-based EDI initiatives.

"Of course, for many people who have had exposure only to HTML, they're most impressed simply by the notion that tags can have meaning," Maler continues. "Many of the business and technical requirements they've conceived to date could be addressed with this one innovation!"

Soon, a certain software company began to show an interest in XML. Jean Paoli, of Microsoft, a member of the original SGML Editorial Review Board, had been aggressively evangelizing XML to the company's Explorer product teams.

"When I talked about XML to the people here at Microsoft," Paoli remembers, "I always stressed its ability to encode data, not documents. Nobody at Microsoft understood why you would want to use XML for things that HTML is good for. But data? Yes. And describing customers and orders? Yes. Financial information? Yes. So I always sold XML to the database people, the people who understood the value of structuring data."

"Adam Bosworth (who designed Microsoft Access) and Thomas Reardon helped me a lot selling this idea."

"But, even more important, it was the Channel Definition Format (CDF) that helped sell the whole XML story to Microsoft," Paoli continues. "At that moment (February '97), the push battle was terrible between Netscape and Microsoft, and the Internet Explorer team was searching for a good data file format to represent Webcasting information. It was evident that XML was a good choice. I presented XML to the managers of the Microsoft Internet Push team and we modeled their Webcasting data in ten minutes! It took only a few days to decide to use XML. The first XML application (CDF) by Microsoft gave Microsoft a big win. This was the beginning of a lot of PR around XML. Starting XML with a winning application was a great thing for XML!"

In March '97, Microsoft officially announced that they were going to base their new Channel Definition Format on XML. This generated a fair amount of interest in XML among programmers and Internet professionals.


As late '96 turned into early '97, two events brought a new level of attention to the XML project. The first was the SGML '96 conference, held in November 1996.

"The SGML '96 conference was a watershed," Steve DeRose remembers, "because it was not clear whether the SGML community would see XML as SGML writ large, or as some kind of competitor. Since SGML software already supports tag extensibility, variant delimiters, etc., and the SGML market has huge amounts of high-value data, this community is important. The SGML community saw the benefits of simplicity and ease of adoption and jumped on board. The Web community has done the same, though for different reasons: extensibility and validation. The beauty of XML is that it gets you the best of both worlds; but any technology like that overlaps partly with both of the things it draws on; the reception in both communities is therefore crucial. As soon as I saw the major SGML vendors and the major Web vendors all diving in, I knew we were in good shape."

The WWW6 Conference, held early in 1997 in Santa Clara, California, was another milestone.

"We put on a major PR blitz at that conference, and I think it went over pretty well," Tim Bray recalls. "I think XML was one of the hot stories of that conference. By May 1 of '97 it was pretty obvious we were onto something that was going to be significant. And it's grown since then."

"Microsoft announced CDF based on XML a few weeks before the WWW6 conference, on purpose, in order to boost the interest in XML," Paoli remembers. "I took a bunch of Microsoft people who were involved in XML to the conference, and we made as much noise as possible in all the XML sessions."

"The SGML people got it as soon as they saw XML," Jon Bosak recalls, "because they all come from industries that had to solve this problem a long time ago. The HTML people only got it this year; that's when they started hitting the wall in large numbers, in terms of having to deal with significant levels of content. At the WWW5 Conference in Paris a year earlier, not many people knew what I was talking about. But when we presented the XML draft at the WWW6 Conference in April '97, about half the faces in the audience lit up. Those were content providers and Web site administrators who'd finally hit that wall. They knew that they had a problem, they just didn't know what to do about it. As soon as they saw XML, they knew."

Microsoft versus Netscape

Soon Netscape joined Microsoft in agreeing to support the new standard. Tim Bray began working with Netscape as a consultant. Articles on XML began showing up in a variety of print magazines and online publications. Predictably, many media stories played up the Microsoft-versus-Netscape angle.

Many ERB members tend to downplay the importance of the competition between Microsoft and Netscape, but they all agree it will have an impact.

"Looking at this purely from the industry point of view," Jon Bosak says, "the competition can only do us good by accelerating the acceptance of a truly open, human-readable data format."

"The participation of both Microsoft and Netscape has been very beneficial," C.M. Sperberg-McQueen adds. "They bring a particular technical perspective to the discussions: the view of the world from a large programming shop with enormous numbers of current users is rather different from the view of the world from an academic institution or from a smaller commercial organization. In that sense, the Microsoft and Netscape viewpoints have been more similar than different, in my view."

Steve DeRose believes that competitive issues will not intrude on the creation of the XML specification.

"The competition between Microsoft and Netscape would be almost a non-issue if not for a few over-excited articles," he says. "All the representatives on the XML Working Group are deeply committed to doing the right thing, and to a consensus process. Neither Netscape nor Microsoft has tried to dominate the process or to foist any self-serving proposals on the [XML working] group. Also, I think both companies realize they have better places to compete than over syntax. Let them and everyone else compete on user interface quality, reliability, performance, and functionality--not on who can dream up new tag names or punctuation marks faster!"

Details, Details

Although XML has met with an enthusiastic reception, the ERB members are well aware of the work that remains. First and foremost, they have to finish the specification.

"It would be nice if we could finish XML 1.0 and get it snapshotted," Tim Bray says. "We should get it blessed by W3C as a recommendation, and maybe even get it blessed by another standards organization as well, just so that we have a line in the sand and can say, 'Okay, this phase is done.' I think we need to do that simply because there are so many implementations happening so fast that just to be fair to the people who believed in what we've done we have to stop changing it. We have to stop and say, 'Okay, here's what it is. Maybe it's not perfect yet, it could be improved still further, but here's 1.0 and that's what 1.0 is.' I think clearly by the end of the year we must have 1.0 finished, blessed, and canonized. There will still be lots of other things to work on. The 1.0 version won't have a solution to the style sheet problem, it won't have a solution for lots of other things, but the base language has to be frozen."

Jon Bosak, for one, is hopeful that the big issues are behind them.

"I may be whistling in the dark," he says. "But aside from the political issues we're going to have to deal with as a result of competition, I don't think that XML really faces any major problems once we get the specification for 1.0 finished. It's been designed to be easy to implement, and outside of all the last-minute internationalization details, it hasn't really changed much for a while. The basics have been in place since last November '96, and most of the finer points were settled by April '97."

Still, there are details on top of details.

"In addition to the greater complexity of XML itself," Bosak says, "we're dealing with all kinds of issues that were never confronted directly in HTML--how to handle whitespace, for example, or whether to make stuff like tag names case-sensitive or not, or whether the Japanese character for an ideographic space is really a space or not. Lots of nitty but mind-bogglingly complex problems that finally can't be sidestepped any more. And there was a big policy question, which was what to do about error-handling--but we're past that now."
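Two of the questions Bosak lists, case sensitivity and error handling, can be observed in a present-day parser. The sketch below assumes the behavior of XML 1.0 as it was eventually published, where tag names are case-sensitive and a well-formedness error stops the parse, and it uses Python's standard xml.etree.ElementTree module; the helper name is_well_formed is invented for the example.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False if it reports an error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Start and end tags match exactly.
matched = is_well_formed("<PRICE>19.95</PRICE>")
# End tag differs only in case, so it does not close the start tag.
case_mismatch = is_well_formed("<PRICE>19.95</price>")
# No end tag at all.
unclosed = is_well_formed("<PRICE>19.95")
print(matched, case_mismatch, unclosed)
```

An HTML browser would typically shrug off both of the bad inputs; an XML parser reports them as errors rather than guessing.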

"The real action," Bosak continues, "shifts now to the other two pieces of the puzzle, the linking piece and the style sheet piece. We call them XLL, for extensible linking language, and XSL, for extensible style sheet language. XML itself is just about syntax. With XLL and XSL we get into semantics, and that's where the real competition is going to be: how you actually do stuff."

"The hardest thing, in general," Steve DeRose says, "is to look far enough ahead to make sure that the language will scale up smoothly and accommodate later extensions without getting kludgy. The broad-stroke picture is very clear, but if you don't pin all the details down well enough, systems won't interoperate and you lose a central benefit of standardization. It's nice to see descriptive markup move into the mainstream and be adopted so quickly. I hope that it will let us really move data into forms that will outlast the next rev of somebody's word processor, and help make bit-rot a non-issue for the future of literature."

Fortunately, XML will be easier to develop than HTML, according to Tim Bray.

"HTML is painfully difficult to evolve," he says, "because it is a mixture of formatting semantics and hypertext semantics and GUI semantics with forms and so on. And trying to evolve all of those capabilities at once without breaking them is very difficult. Now XML, the basic language, has a syntax and there's going to be a style sheet facility and there's going to be various behavior facilities. That doesn't mean that evolving any of this stuff is easy, it just means that you can partition the problems and solve them without having to solve them all at once, which is the problem that HTML faces. So a lot of the advanced capabilities that users of the Web are asking for, I think, are going to be easier to solve in an XML context."

Yet still ahead, after the big technical problems are largely solved, there's another challenge: inspiring people to exploit the new possibilities that come with XML.

"Now that it is reasonable to expect next generation tools to have better control over encoding information," Dave Hollander says, "we need to get ready to use these features. My next key initiative is how to get authors, collaborators, and consumers of information to make the best use of the new capabilities."

"Now, we have to encourage the market to create specific horizontal and vertical DTDs, to build common vocabularies," Paoli says. "We need to let content providers generate useful XML data while we, the software and tool builders, build tools which access and use this data."

There's plenty to do, to be sure. Yet, at this point it appears likely that the early work of the XML ERB has created enough momentum to carry the project to completion.

"What's important, from here on in, is to keep all these activities moving toward the goal we started with in July 1996," Jon Bosak says. "It's more like a snowball gathering speed down a slope now. It doesn't need pushing, it just needs to be kept pointed in the right direction."