Building XML Parsers for Microsoft's IE4
Jean Paoli, David Schach, Chris Lovett,
Andrew Layman, Istvan Cseri
Abstract
Microsoft cofounded the XML working group at the W3C in July 1996 and actively participated in the definition of the standard. This article describes why Microsoft implemented its first XML application and how it led to the development of two XML parsers shipping in Internet Explorer 4.0, one written in C++ and the other in Java. We describe the importance of designing an object model API and our vision of XML as a universal, open data format for the Internet.
Motivation
Our First Application: Active Channels for Internet Explorer 4.0
Conventional Web use waits for a user to request a page before sending it. That is known as the "pull" mode. A powerful alternative exists, however, called "push" or "webcasting," in which pages are sent to a user in advance, based on automatic matching of pages to the user's interests. Webcasting provides each user with automatic delivery and offline access to the information and Web sites that he uses most often.
To bring this idea to reality, in February 1997 the Internet Explorer team needed a standard way of describing sites and pages. The first broadly popular form of Web "metadata" (so called because it describes data about other data) is the Channel Definition Format, or CDF [1]. This allows a Web site to post a description of itself in a standard form. Having done so, it is no longer just a site; it is also an "Active Channel."
A channel is a set of related Web pages. Channel Definition Format files include the following characteristics:
- A minimal CDF file contains a list of URLs pointing to the pages that make up the content of the channel.
- A more advanced CDF file can include title and abstract information describing individual items, a schedule for updates, and a hierarchical organization of the channel's offerings.
- A CDF file must be easy to create and not require changes to existing HTML pages.
CDF is an application of XML that deals with the particulars of Web metadata. CDF consists of a vocabulary of terms that are related to Web sites and their Active Channel content. Technically, the terms are used as "Elements" and "attributes," and CDF defines how they can be used together to expand a Web site into a webcasting channel (see Example 1).
Example 1
<?XML version="1.0" RMD="NONE" ?>
<!DOCTYPE Channel SYSTEM
"http://www.microsoft.com/standards/channels.dtd">
<CHANNEL>
<SCHEDULE>
<INTERVALTIME HOUR="2"/>
<LATESTTIME MIN="30"/>
</SCHEDULE>
<TITLE>Internet Explorer News</TITLE>
<ITEM HREF="http://www.microsoft.com/ie/new/666784.htm">
<ABSTRACT> The latest news on Internet Explorer. </ABSTRACT>
<TITLE> Latest support for CDF </TITLE>
</ITEM>
</CHANNEL>
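For contrast with Example 1, the minimal case described earlier (a channel that is little more than a list of page URLs) can be generated with a few lines of script. The following is only a sketch: buildMinimalCdf and escapeXml are hypothetical helper names, the element names are taken from Example 1, and including a TITLE in a "minimal" file is an assumption.

```javascript
// Hypothetical helpers (not part of any Microsoft tool): build the text of a
// minimal CDF file from a channel title and a list of page URLs.

// Escape the characters that may not appear raw in XML text or attribute values.
function escapeXml(s) {
    return String(s)
        .replace(/&/g, "&amp;")
        .replace(/</g, "&lt;")
        .replace(/>/g, "&gt;")
        .replace(/"/g, "&quot;");
}

function buildMinimalCdf(title, urls) {
    var lines = [];
    lines.push("<CHANNEL>");
    lines.push("<TITLE>" + escapeXml(title) + "</TITLE>");
    for (var i = 0; i < urls.length; i++) {
        // One empty ITEM element per page, using the HREF attribute from Example 1.
        lines.push('<ITEM HREF="' + escapeXml(urls[i]) + '"/>');
    }
    lines.push("</CHANNEL>");
    return lines.join("\n");
}
```

Because the output is plain text, a script like this needs no knowledge of the existing HTML pages, which is the property the CDF design asks for.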
A Universal, Open Data Format for the Internet
At the same time as the metadata CDF work was proceeding, members of the Internet Explorer team and others in Microsoft started to understand the broad need for a universal, open data format for the Internet. The opportunities are very exciting. The Web has created an opportunity to communicate with anyone, anywhere. Fully realizing this potential depends on widespread use of standards--as with the telephone, this communication depends on numerous layers of interoperating technology. One such important layer is visual display and user interface, exemplified by standards such as HTML, GIF, and ECMAScript (previously JavaScript). These standards allow a page to be created once, yet displayed at different times by many receivers.
Although visual and user interface standards are a necessary layer, they are not sufficient: they make the Internet an access medium for text and pictures, and little more. There are no standards for intelligent search, data exchange, adaptive presentation, and personalization. The Internet must go beyond setting an information access and display standard; it must set an information understanding standard--a standard way of representing data so that software can better search, move, display, and otherwise manipulate information currently hidden in contextual obscurity. HTML cannot fulfill these needs because it is a format that describes how a Web page should look, rather than one that represents data. For example:
- HTML does not provide a standard way for a doctor to send a prescription to a pharmacist.
- HTML does not enable a medical laboratory to publish statistical information in a format that any receiver can analyze.
- HTML does not describe an electronic payment in a form that any recipient can decode and process.
- HTML does not provide a standard way to search legal libraries to find, for example, all litigation documents about a certain topic.
- HTML does not specify how information in a company catalog can be transmitted, such that a salesman can work offline, show the catalog to clients, take orders, then upload those orders in a standard format.
A standard for data representation will expand the Internet in much the same way that the HTML standard did for display a few years ago. The data standard will be the vehicle for business transactions, publication of personal preference profiles, automated collaboration, and database sharing. Payments, medical histories, pharmaceutical research data, semi-conductor part sheets, and purchase orders will all be written in this format. It will open up a wide variety of new uses, all based on a standard representation for moving structured data around the Web as easily as we move HTML pages today. That data standard is XML.
XML: A Standard Format for Data
XML provides a data standard that can encode the content, semantics, and schemata for a range of cases, from simple to complex. XML can encode the representation for the following:
- An ordinary document
- A structured record, such as an appointment record or purchase order
- An object with data and methods (for example, the persistent form of a Java object or ActiveX control)
- A data record, such as the result set of a query
- Meta-content about a Web site (such as CDF)
- Graphical presentation (such as an application's user interface)
- Standard schema entities and types
- All the links between information and people on the Web
Benefits of XML
As a universal standard for the expression of data, XML offers many advantages to organizations, software developers, Web sites, and ultimately to end-users.
For software developers building Web applications and line-of-business Intranet software, XML provides a powerful, flexible format for expressing data--whether as a wire format for sending data between client and server, a transfer format for sharing data between applications, or a persistent storage format on disk. Because structured data in XML can include a self-describing schema, XML promises interoperability between applications that manipulate structured data independent of the underlying semantics.
For example, because XML enables publishers to supplement their Web sites with metadata such as CDF, users can receive "pushed" content as structured channels. XML can also provide a means for embedding arbitrary data and annotations within HTML, extending the possibilities for Web-based applications based on HTML and scripts.
For end-users, XML promises to provide a much richer set of Web applications for browsing, communication, and collaboration. The growing use of XML will improve Web-browsing applications for viewing, filtering, and manipulating information on the Internet.
As collaboration on the Web spreads to more businesses, customer services will eventually migrate from phone lines and storefronts to Web sites. The majority of these Intranet and Internet business applications will involve manipulation or transfer of data and database records, such as purchase orders, invoices, customer information, appointments, maps, and so forth. XML promises a revolution in the richness of end-user possibilities on the Web because it enables such a wide array of business applications to be implemented on the Internet.
Microsoft XML Parsers
Our long-term goal for XML is that it function as a data format that anyone can use to build a range of Web applications. To achieve this goal, we decided to write an XML parser and make it freely available. The result of these efforts was two XML parsers--one in C++ and the other in Java--both of which are included as part of Microsoft Internet Explorer 4.0. The parsers were written in parallel, but with somewhat different design goals.
The Microsoft XML parser in C++ (MSXML in C++) was written to perform as an integral part of Internet Explorer 4.0. Consequently, its design was oriented toward the following:
- Fast parsing speed
- Low memory usage
- Asynchronous parsing during download
- Strong international support
In contrast to the XML parser in C++, the goals of the Microsoft XML parser in Java (MSXML in Java) included the following:
- To be a reference implementation
- To be a full validating parser
- To be cross-platform
- To promote widespread acceptance of the XML standard
- To experiment with leading edge XML standards efforts, like DOM and namespaces
With some minor exceptions (such as no current support for conditional sections), Microsoft's XML parsers completely implement the W3C Working Draft of the XML specification dated June 30, 1997.[1]
MSXML in Java shipped in the spring of 1997 and is available from http://www.microsoft.com/standards/xml/xmlparse.htm. Both MSXML in C++ and MSXML in Java are shipping with IE 4.0.
Object Model
Once parsed, an XML document is manipulated through an object model (or API). To really help make XML the standard format for data over the Web, we felt that a standard object model was crucial; one that was simple, scriptable, minimal, and consistent with the Document Object Model (DOM) Working Group.[2] We are currently working with the W3C to standardize the XML object model. The object model is language neutral, which means it is equally accessible from all programming languages. To keep the object model independent of the parsers, it was designed prior to implementing them. The idea was to completely separate the parser implementation from the XML data structures. Having the parser use the object model ensured that problems with the object model would be flushed out during development.
Document object
The object model is very simple. It models the XML document as a tree structure using only three classes of objects:
- A Document
- An Element
- A Collection
Element object
All XML data is stored in a tree of Element objects. Container Elements are non-leaf nodes. Empty Elements, text, comments, and processing instructions are stored as leaf nodes in the tree. An Element's type is revealed by the type property. Currently, the following types are returned:
- ELEMENT for container and empty XML Elements
- TEXT for PCDATA and CDATA
- COMMENT for comments
- A separate type for processing instructions
The other important properties of the Element object are:
- tagName: The name (or GI) for objects of type ELEMENT (otherwise an empty string)
- parent: The parent Element of this object in the tree
- text: The text for objects of type TEXT or COMMENT (otherwise an empty string)
- children: A collection of the objects contained by this object. This collection is empty for all types other than ELEMENT
Element collections
Element collections are used to walk the XML tree. An Element collection has one property, length, which is the number of Elements in the collection. Child Elements are fetched via the item method, which returns an Element either by index or by name. When more than one Element has the same name, the item method returns a new collection containing all of the child Elements with that name.
The object model for the C++ parser is written using Microsoft's Component Object Model (COM) architecture. As a result, it is language neutral and equally accessible from JavaScript and VBScript as well as C++ and Java. For example, once a Document object is created, loading a document involves setting the document's URL. The following JavaScript code fragment shows how to load an XML document from an HTML page using the C++ parser:
var myXMLDoc = new ActiveXObject("msxml");
myXMLDoc.URL = "http://www.somecompany.com/somedata.xml";
Using the Java parser and the XML DSO applet that is shipping with IE 4, you can load an XML document as follows:
<APPLET code=com.ms.xml.dso.XMLDSO.class width=0 height=0 id=xmldso>
<PARAM NAME=URL VALUE="http://www.somecompany.com/somedata.xml">
</APPLET>
Then you can access the Document object via script as follows:
var doc = xmldso.getDocument();
While the object model is minimal, it is functionally complete. We expect that it will evolve over time.
For more information about Microsoft's XML object model see [2] and [3].
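The shape of this object model can be imitated in a few lines of plain JavaScript. The sketch below is not the msxml implementation: the constructor names XmlElement and XmlCollection, the addChild helper, the backing array, and the choice to return null when no child matches are all assumptions made for illustration, as are the field names used for the parent and text properties. Only the behaviour of type, tagName, children, length, and item follows the description above.

```javascript
// Illustrative sketch only; these constructors imitate the described behaviour.

function XmlCollection(elements) {
    this.elements = elements;        // backing array (hypothetical field)
    this.length = elements.length;   // the one property named in the text
}

// item(index) returns the Element at that position.
// item(name) returns the single child with that tag name, or a new collection
// when several children share it. Returning null for "no match" is an assumption.
XmlCollection.prototype.item = function (key) {
    if (typeof key === "number") {
        return key >= 0 && key < this.length ? this.elements[key] : null;
    }
    var matches = [];
    for (var i = 0; i < this.elements.length; i++) {
        if (this.elements[i].tagName === key) {
            matches.push(this.elements[i]);
        }
    }
    if (matches.length === 0) { return null; }
    if (matches.length === 1) { return matches[0]; }
    return new XmlCollection(matches);
};

function XmlElement(type, tagName, text) {
    this.type = type;              // "ELEMENT", "TEXT" or "COMMENT" in this sketch
    this.tagName = tagName || "";  // empty string unless type is ELEMENT
    this.text = text || "";        // empty string unless type is TEXT or COMMENT
    this.parent = null;
    this.children = new XmlCollection([]);
}

// Hypothetical helper for building a tree by hand.
XmlElement.prototype.addChild = function (child) {
    child.parent = this;
    this.children = new XmlCollection(this.children.elements.concat([child]));
    return child;
};

// Usage: one TITLE child and two ITEM children.
var channel = new XmlElement("ELEMENT", "CHANNEL");
channel.addChild(new XmlElement("ELEMENT", "TITLE"));
channel.addChild(new XmlElement("ELEMENT", "ITEM"));
channel.addChild(new XmlElement("ELEMENT", "ITEM"));
var title = channel.children.item("TITLE");   // a single Element
var items = channel.children.item("ITEM");    // a collection of two Elements
```

A caller that looks up a child by name therefore has to be prepared for either an Element or a collection, which is worth keeping in mind when a document may repeat a tag.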
Technical Details
Simplicity of design
The Microsoft XML parsers are simple. This is by design. They are implemented as hand-coded, recursive-descent parsers. This has a couple of benefits:
- First, the minimal syntax of XML makes a parser generator unnecessary: a hand-coded parser works just fine.
- Second, recursive-descent parsers are both easy to write and easy to understand.
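To give a feel for why a hand-coded recursive-descent parser is enough, here is a minimal sketch for a small subset of XML: start tags with double-quoted attributes, empty-element tags, end tags, and character data. It is not the Microsoft parser. The function name parseXmlSubset and the plain-object node shape are assumptions, and comments, processing instructions, DTDs, and entities are deliberately left out.

```javascript
// Each grammar rule becomes one function; nesting is handled by recursion.
function parseXmlSubset(input) {
    var pos = 0;

    function fail(message) {
        throw new Error(message + " at offset " + pos);
    }

    function skipSpace() {
        while (pos < input.length && /\s/.test(input.charAt(pos))) { pos++; }
    }

    function expect(ch) {
        if (input.charAt(pos) !== ch) { fail("Expected '" + ch + "'"); }
        pos++;
    }

    // Name ::= a run of letters, digits, '.', '-', '_' or ':' (simplified).
    function parseName() {
        var start = pos;
        while (pos < input.length && /[A-Za-z0-9._:\-]/.test(input.charAt(pos))) { pos++; }
        if (pos === start) { fail("Expected a name"); }
        return input.substring(start, pos);
    }

    // Attribute ::= Name '=' '"' value '"'
    function parseAttributes() {
        var attrs = {};
        skipSpace();
        while (pos < input.length && input.charAt(pos) !== ">" && input.charAt(pos) !== "/") {
            var name = parseName();
            skipSpace(); expect("="); skipSpace(); expect('"');
            var end = input.indexOf('"', pos);
            if (end === -1) { fail("Unterminated attribute value"); }
            attrs[name] = input.substring(pos, end);
            pos = end + 1;
            skipSpace();
        }
        return attrs;
    }

    // Element ::= '<' Name Attribute* ( '/>' | '>' Content '</' Name '>' )
    function parseElement() {
        expect("<");
        var node = { tagName: parseName(), attributes: parseAttributes(), children: [] };
        if (input.charAt(pos) === "/") {            // empty-element tag
            pos++; expect(">");
            return node;
        }
        expect(">");
        parseContent(node);
        expect("<"); expect("/");
        var closing = parseName();
        if (closing !== node.tagName) { fail("Mismatched end tag </" + closing + ">"); }
        skipSpace(); expect(">");
        return node;
    }

    // Content ::= ( text | Element )* ; stops in front of an end tag.
    function parseContent(parent) {
        while (pos < input.length) {
            if (input.charAt(pos) === "<") {
                if (input.charAt(pos + 1) === "/") { return; }
                parent.children.push(parseElement());   // the recursive step
            } else {
                var next = input.indexOf("<", pos);
                if (next === -1) { next = input.length; }
                var text = input.substring(pos, next);
                if (/\S/.test(text)) { parent.children.push({ text: text }); }
                pos = next;
            }
        }
    }

    skipSpace();
    var root = parseElement();
    skipSpace();
    if (pos !== input.length) { fail("Unexpected content after the root element"); }
    return root;
}
```

The functions mirror the grammar one-to-one, which is what makes this style easy to read and to extend rule by rule.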
Character encodings
Although XML parsers are required only to read the UTF-8 and UCS-2 encodings, the Microsoft XML parsers handle many more, such as shift-jis, euc-jp, and big5. In fact, the C++ parser supports the same set of character encodings as IE 4.0, and the Java parser supports all the encodings supported by the Java VM. The recursive-descent parsers are isolated from these different encodings by input readers that convert everything to Unicode. While this increases memory usage for European languages, it simplifies string processing overall.
Storage of Element and Attribute names
Because Element and Attribute names tend to repeat, they are stored as atoms so that only one copy of each string is stored. This also speeds up string comparisons because atom objects can be compared for equality very quickly, without comparing the characters in the strings. This technique amortizes some of the cost of checking for NameChar characters and converting Unicode characters to uppercase.
Object model implementation
The Java parser builds the Element tree using the object model. When it creates new Elements it uses an Element class factory that is passed in by the creator of the parser. The parsers come with a default object model implementation that is fully functional; however, clients with special needs can write their own class factory that creates custom objects. This helps programs that want to use XML but still need to process legacy data structures.
Although the Java parser does not parse asynchronously, it can be run on a separate thread. The C++ parser parses asynchronously by running on a fiber. The object model was designed so that asynchronous parsing can be implemented transparently to the programmer. Because all properties and methods are function calls, the object model can block the caller when attempting to access a node in the tree that isn't completely downloaded.
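The class-factory idea can be sketched independently of any particular parser. In the following illustration the tree builder never constructs nodes itself; it asks a factory object supplied by its caller. Every name here (buildTree, createElement, CatalogItem, and the open/close event format that stands in for real parser input) is hypothetical and is not taken from the Microsoft parsers.

```javascript
// Default factory: plain generic nodes.
var defaultFactory = {
    createElement: function (tagName) {
        return { tagName: tagName, children: [] };
    }
};

// An application's own record type, standing in for a legacy data structure.
function CatalogItem() {
    this.tagName = "ITEM";
    this.children = [];
    this.ordered = false;      // application-specific state
}

// Custom factory: maps ITEM elements onto CatalogItem, everything else onto
// the default node shape.
var catalogFactory = {
    createElement: function (tagName) {
        return tagName === "ITEM" ? new CatalogItem() : defaultFactory.createElement(tagName);
    }
};

// Builds a tree from a flat list of "open"/"close" events. A real parser
// would produce these while reading the document.
function buildTree(events, factory) {
    var root = null;
    var stack = [];
    for (var i = 0; i < events.length; i++) {
        if (events[i].kind === "open") {
            var node = factory.createElement(events[i].name);
            if (stack.length > 0) {
                stack[stack.length - 1].children.push(node);
            } else {
                root = node;
            }
            stack.push(node);
        } else {
            stack.pop();
        }
    }
    return root;
}
```

The builder's code is identical for both factories; only the objects that end up in the tree differ, which is the separation between parser and data structures that the design aims for.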
Entities and other language features
The Java parser also implements DTD validation, full Entity handling, and the namespace proposal. We found that DTD validation was relatively easy. The XML spec was clear and pointers to algorithms for implementing validation were helpful, but we found that supporting validation does seem to impact the overall performance of the parser.
Correct entity handling was actually quite subtle--especially when we were trying to figure out how to expose entity references in the Object Model. The problem is that some clients of the Object Model (like JavaScript programs) prefer the entities to be fully expanded and thereby essentially invisible to their scripts. Other clients of the Object Model (like an authoring tool), on the other hand, want to actually know where the entities are, how to edit them, and so on. We decided that entity references should be simple leaf nodes in the tree of type ENTITYREF that point to the full entity definition in the DTD, and also decided to provide helper functions like getText() for those clients who just want the fully expanded text. Parameter entities in the DTD are more difficult. Currently parameter entities are expanded by the parser and not represented in the Object Model. It is not clear whether we can ever represent parameter entities in the Object Model, or whether we would even want to.
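The trade-off between the two kinds of client can be shown with a small sketch. An entity reference is kept as a leaf node that points at its definition, so an authoring tool can still find it, while a helper walks the tree and returns the expanded text for clients that do not care. The node shapes, the field names definition and replacementText, and the function name getExpandedText are assumptions for illustration, not the Java parser's API.

```javascript
// Returns the text of a subtree with every entity reference expanded.
function getExpandedText(node) {
    if (node.type === "TEXT") {
        return node.text;
    }
    if (node.type === "ENTITYREF") {
        // Follow the pointer to the entity definition held with the DTD.
        return node.definition.replacementText;
    }
    var kids = node.children || [];   // comments and other leaves contribute nothing
    var result = "";
    for (var i = 0; i < kids.length; i++) {
        result += getExpandedText(kids[i]);
    }
    return result;
}

// One entity definition, referenced from a FOOTER element.
var copyrightEntity = { name: "copy", replacementText: "Copyright 1997" };
var footer = {
    type: "ELEMENT",
    tagName: "FOOTER",
    children: [
        { type: "ENTITYREF", name: "copy", definition: copyrightEntity },
        { type: "TEXT", text: " Example Corp." }
    ]
};
var expanded = getExpandedText(footer);   // entity reference replaced by its text
```

A script calls the helper and never sees the reference; an editor inspects footer.children[0] and finds a node of type ENTITYREF that it can edit in place.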
Namespaces were relatively simple since we already had an atomized Name object in the Java parser to represent all tag and attribute names in the document. We simply added a namespace field to these Name objects, added support for parsing the namespace declarations, and we were done.
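A name table of this kind is easy to sketch. In the version below each distinct pair of namespace and name is stored once, so two names are equal exactly when they are the same object, and adding the namespace only changes the lookup key. NameTable and intern are hypothetical names, and the uppercase folding simply illustrates the conversion mentioned earlier; none of this is the parsers' actual code.

```javascript
function NameTable() {
    this.atoms = {};   // key -> atom object
    this.count = 0;    // number of distinct atoms stored
}

// Returns the unique atom for (namespace, name), creating it on first use.
NameTable.prototype.intern = function (name, namespace) {
    var ns = namespace || "";
    var folded = name.toUpperCase();      // the folding cost is paid once per lookup
    var key = ns + "\u0000" + folded;     // NUL cannot occur in a name, so keys stay unambiguous
    if (!Object.prototype.hasOwnProperty.call(this.atoms, key)) {
        this.atoms[key] = { name: folded, namespace: ns };
        this.count++;
    }
    return this.atoms[key];
};

// Usage: repeated names share one object; a namespace makes a different atom.
var names = new NameTable();
var a = names.intern("Item");
var b = names.intern("ITEM");
var c = names.intern("ITEM", "http://example.com/ns");
var sameAtom = (a === b);        // identity comparison, no character-by-character work
var differentAtom = (a !== c);
```

Once names are atoms, comparing a start tag with its end tag, or an element name with a name in the DTD, is a single reference comparison.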
The parsers are small and fast. MSXML in C++ with full international character support is less than 100K and the MSXML Java Parser is 127K.
Using the Object Model to Process XML Data
To illustrate how the Object Model can be used to do interesting things we will show you a small example based on the CDF data we saw earlier in Example 1. Example 2 shows how to walk the XML Object Model to find out the INTERVALTIME of the scheduled event.
Example 2
<script>
function GetInterval()
{
    // Fetch the CDF file and extract the INTERVALTIME element
    var doc = new ActiveXObject("msxml");
    doc.URL = Resolve("cdf.xml");
    // First extract the SCHEDULE node
    var s = doc.root.children.item("SCHEDULE");
    // Then the INTERVALTIME
    var t = s.children.item("INTERVALTIME");
    // Then the HOUR attribute
    var h = t.getAttribute("HOUR");
    return h;
}
function DisplayTime(hour)
{
    // Display this with an appropriate message in a popup window
    var w = window.open("", "NextShow",
        "resizable,width=400,height=100");
    w.document.open();
    w.document.write("<body bgcolor=yellow>");
    w.document.write("<h2>The next show is in " +
        hour + " hours !</h2>");
    w.document.write("</body>");
    w.document.close();
}
function Resolve(relurl)
{
    // Resolve a URL relative to the current document location
    var url = document.location.toString();
    var base = url.substring(0, url.lastIndexOf("/"));
    var href = base + "/" + relurl;
    return href;
}
</script>
<!-- A button that invokes the above scripts -->
<input type=button value="When ?"
    onclick="DisplayTime(GetInterval());">
Notice that the GetInterval() method uses a small fixed set of objects and methods to manipulate the XML data that is independent of display-oriented things like HTML. As long as the CDF DTD (or schema) stays relatively fixed, this script code will work on any CDF file. In other words, this is robust enough to build Web-based business applications.
Conclusion
When we chose XML to encode CDF files, we were a little anxious. XML had only just been created--even though Microsoft cofounded the W3C XML Working Group in July 1996, it was as new to us as to anyone else. In addition, launching "channels"--the first broad, public application of metadata--on an untried standard was risky. A few months later (as of this writing in August 1997), we know that we made the right choice.
The flexibility and ease of use of a text format for representing and exchanging structured information has been demonstrated. CDF is now widely used by the industry's leading content providers, Web and Java authoring tool vendors, and "push" developers (such as PointCast, AirMedia, and BackWeb). Multiple tools have been developed to produce CDF files. Because it is a simple text-based format, tools are easily developed to generate and process it. XML helped make CDF successful.
Now a set of XML enabling technologies, including C++ and Java parsers with their Object Models, is shipping in Internet Explorer 4.0. Because IE 4.0 will be integrated into Windows 98, there will be an XML parser on each desktop--another step toward the vision of making structured data an integral part of the Web.
At Microsoft, we strongly believe that XML is the standard, extensible, universal data format for the Internet. It is simple and easily authored. It is based on international standards that have been tested for many years. It is enormously extensible. It is flexible enough to allow representation of an incredibly wide range of information, and it also allows this information to be self-describing, so that structured data expressed in XML may be manipulated by software that doesn't have previous knowledge of the underlying meaning behind the data. XML provides a file format for representing data and can be extended to contain a description of its own structure. It is a means of formatting data and also a mechanism for extending and annotating standard HTML.
With its powerful expressiveness and flexibility, XML promises to add structure to data on the Internet, bringing the Web one step closer to realizing the potential for universal communication with anyone, anywhere.
References
[1] http://www.w3.org/Submission/1997/2/
[2] http://www.microsoft.com/msdn/sdk/inetsdk/help/inet5017.htm
[3] http://www.microsoft.com/standards/xml/default.htm
About the Authors
- Jean Paoli
- 1 Microsoft Way
- Redmond, WA 98052-6399
- jeanpa@microsoft.com
- Andrew Layman
- 1 Microsoft Way
- Redmond, WA 98052-6399
- AndrewL@microsoft.com
- Istvan Cseri
- 1 Microsoft Way
- Redmond, WA 98052-6399
- istvanc@microsoft.com
- Chris Lovett
- 1 Microsoft Way
- Redmond, WA 98052-6399
- clovett@microsoft.com
- David Schach
- 1 Microsoft Way
- Redmond, WA 98052-6399
- davidsch@microsoft.com
[1] The latest version of this draft was in fact August 7, 1997, and is published as the "Extensible Markup Language (XML)" specification in the "W3C Reports" section of this issue.
[2] The "Document Object Model (DOM)" specification is in this issue's "W3C Reports" section.