
Creating Efficient MSXML Applications

March 6, 2002

Ben Berck

I have read dozens of books and articles directed at programmers who need to get something done with XML. While these resources can be helpful, many are for programmers new to XML development and contain examples that use tiny files (e.g., always less than 10 KB and often smaller than a kilobyte).

The techniques used in these examples don't always translate well into the real world, particularly when dealing with extremely large file sizes. What happens, for instance, when you need to parse XML files on the order of 122 MB, with each file originating from a different source application?

That was the dilemma my team faced when presented with the challenge of developing our company's new server-based XML rendering engine, along with a proof-of-concept Web site that would allow anyone to upload files and convert them to XML, SVG, HTML, etc. In short, we needed to be able to accommodate everything from Microsoft Word documents to Quark files, none of them small.

Obviously, the first inclination is to increase memory or processing power. Unfortunately, that alone wasn't going to solve it, because we were looking at a Big O problem. Remember Big O notation from college? An algorithm can be measured in terms of processor cycles (time) and memory consumption (space), with respect to the number of elements in the problem. Big O notation represents the inherent boundaries of the scalability of your solution. For example, an algorithm using O(n²) time or O(n²) space is considered inefficient because it severely limits the size of problem a computer can actually complete.

We first encountered the Big O issue when trying to parse a large XML file using a parser that implements the XML DOM. Such an approach loads the entire XML file into a memory-resident tree structure. That takes O(n) time to read the file and O(n) space to store it, which -- at first glance -- seemed acceptable. However, there was more involved than just loading the document.
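
For reference, the basic pattern looks something like the following sketch, written against MSXML's C++ COM interfaces. The file name is a placeholder and error handling is trimmed; this is an illustration of the approach, not code from our application.

    // A minimal sketch of the DOM approach: MSXML builds the whole tree in
    // memory before any analysis can begin. "huge-input.xml" is a placeholder.
    #include <windows.h>
    #include <msxml2.h>

    int wmain()
    {
        CoInitialize(NULL);
        {
            IXMLDOMDocument* pDoc = NULL;
            CoCreateInstance(__uuidof(DOMDocument30), NULL, CLSCTX_INPROC_SERVER,
                             __uuidof(IXMLDOMDocument), (void**)&pDoc);

            pDoc->put_async(VARIANT_FALSE);            // load synchronously

            VARIANT varSrc;
            VariantInit(&varSrc);
            varSrc.vt = VT_BSTR;
            varSrc.bstrVal = SysAllocString(L"huge-input.xml");

            VARIANT_BOOL loaded = VARIANT_FALSE;
            pDoc->load(varSrc, &loaded);               // O(n) time, O(n) space
            VariantClear(&varSrc);

            if (loaded == VARIANT_TRUE)
            {
                // ... walk the tree, extract subdocuments, and so on ...
            }
            pDoc->Release();
        }
        CoUninitialize();
        return 0;
    }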

We still needed to write code to read the DOM, analyze it, and carry out tasks. We assumed this analysis, such as scanning for a particular element, would take O(n) time. Surprisingly, we found it to be O(n²) in practice, which is not acceptable.

A profiler utility helped us zero in on the problem: getItem. This simple method is a favorite of many DOM programmers. It is simple to use, reminds them of indexing into an array, and it works -- at least for small XML files. However, the DOM tree is actually more like a linked list: there is no random access, so the DOM implementation is forced to traverse the list on every call. When getItem is called in a loop over a list of n elements, that adds up to 0.5(n² - n) node references, or O(n²). A quick look at the documentation showed us that nextNode is the function the DOM provides for such traversals. We changed the line and noticed the results immediately (see the table below). In this case, a one-line change will improve performance dramatically on large XML files.

                    10,000 nodes     100,000 nodes     1 million nodes
    getItem( i )    15 seconds       ~23 minutes       > 1 hour (didn't finish)
    nextNode( )     1 second         12 seconds        ~200 seconds
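
To make the one-line change concrete, here is a rough sketch of the two traversal styles against an MSXML node list. This is not our benchmark code; the element name is a placeholder, and the comments reflect the behavior measured above.

    // Two ways to visit every node in an MSXML node list.
    #include <windows.h>
    #include <comdef.h>
    #include <msxml2.h>

    void ScanNodes(IXMLDOMDocument* pDoc)
    {
        IXMLDOMNodeList* pList = NULL;
        pDoc->getElementsByTagName(_bstr_t(L"item"), &pList);  // placeholder name

        long length = 0;
        pList->get_length(&length);

        // Slow: each get_item(i) call traverses the list from the beginning,
        // so the loop performs roughly 0.5(n^2 - n) node visits -- O(n^2).
        for (long i = 0; i < length; i++)
        {
            IXMLDOMNode* pNode = NULL;
            pList->get_item(i, &pNode);
            // ... examine pNode ...
            pNode->Release();
        }

        // Fast: nextNode simply advances an internal cursor -- O(n) overall.
        pList->reset();
        IXMLDOMNode* pNode = NULL;
        while (pList->nextNode(&pNode) == S_OK && pNode != NULL)
        {
            // ... examine pNode ...
            pNode->Release();
        }

        pList->Release();
    }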

However, developers of server products know that memory is a particularly precious resource when trying to accommodate several simultaneous users and processes. A simple change from getItem to nextNode is not going to stop the DOM tree from quickly absorbing your server's memory.

It turned out that reading the entire XML file into memory is unnecessary for most manipulations. XML developers generally agree that an event-based parser like SAX is a much better alternative for large documents because no more than a single element needs to be held in memory at a time. At the start of each element, a function is called, some logic is performed, and the element is discarded. It's up to the application developer to store (in memory or otherwise) those portions of the XML file that need to be analyzed after the element has been discarded by the SAX parser.
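
As an illustration of the event model, the following is a minimal sketch of an MSXML SAX content handler in C++. The class name and the element counting are invented for the example; a real handler would do its per-element work in startElement and remember only what it needs.

    // Minimal ISAXContentHandler: it sees each element as an event and keeps
    // nothing in memory except what it chooses to remember (here, a count).
    #include <windows.h>
    #include <msxml2.h>

    class ElementCounter : public ISAXContentHandler
    {
        LONG m_refs;
    public:
        long m_elements;
        ElementCounter() : m_refs(1), m_elements(0) {}

        // IUnknown
        STDMETHODIMP QueryInterface(REFIID riid, void** ppv)
        {
            if (riid == IID_IUnknown || riid == __uuidof(ISAXContentHandler))
            { *ppv = static_cast<ISAXContentHandler*>(this); AddRef(); return S_OK; }
            *ppv = NULL; return E_NOINTERFACE;
        }
        STDMETHODIMP_(ULONG) AddRef()  { return InterlockedIncrement(&m_refs); }
        STDMETHODIMP_(ULONG) Release()
        { ULONG r = InterlockedDecrement(&m_refs); if (r == 0) delete this; return r; }

        // Called once per start tag; the element data is gone once we return.
        STDMETHODIMP startElement(const wchar_t*, int, const wchar_t*, int,
                                  const wchar_t*, int, ISAXAttributes*)
        {
            m_elements++;             // per-element logic goes here
            return S_OK;
        }

        // The remaining events are not needed for this sketch.
        STDMETHODIMP putDocumentLocator(ISAXLocator*)                              { return S_OK; }
        STDMETHODIMP startDocument()                                               { return S_OK; }
        STDMETHODIMP endDocument()                                                 { return S_OK; }
        STDMETHODIMP startPrefixMapping(const wchar_t*, int, const wchar_t*, int)  { return S_OK; }
        STDMETHODIMP endPrefixMapping(const wchar_t*, int)                         { return S_OK; }
        STDMETHODIMP endElement(const wchar_t*, int, const wchar_t*, int,
                                const wchar_t*, int)                               { return S_OK; }
        STDMETHODIMP characters(const wchar_t*, int)                               { return S_OK; }
        STDMETHODIMP ignorableWhitespace(const wchar_t*, int)                      { return S_OK; }
        STDMETHODIMP processingInstruction(const wchar_t*, int, const wchar_t*, int) { return S_OK; }
        STDMETHODIMP skippedEntity(const wchar_t*, int)                            { return S_OK; }
    };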

It was an easy decision for us to migrate from DOM to SAX. After all, one of our core components simply extracted large XML subdocuments and saved them to separate files. There was no need to hold any state information. Therefore, we were looking at O(1) space, which is what any scalable server application should strive for.

We used MSXML 3.0's SAX implementation to parse the documents. Microsoft's MXXMLWriter implements a ContentHandler that accepts the same parameters as the SAX event entry points. In Microsoft's samples, the entire process is chained together, which means you effectively copy the XML file to the MXXMLWriter. We simply added some code to restrict copying to a subset of the elements encountered during the SAX parse. Unfortunately, the extracted output still wasn't on disk, and there was the little problem of UTF-8 encoding before we could write it out.
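
Roughly, the chained setup looks like the sketch below. The file name is a placeholder, and the filtering handler we inserted between the reader and the writer is omitted, so this version copies everything, just as Microsoft's samples do.

    // SAXXMLReader parses the file and feeds its events to MXXMLWriter, which
    // implements ISAXContentHandler and so can sit on the receiving end.
    #include <windows.h>
    #include <msxml2.h>

    int wmain()
    {
        CoInitialize(NULL);
        {
            ISAXXMLReader* pReader = NULL;
            IMXWriter*     pWriter = NULL;
            CoCreateInstance(__uuidof(SAXXMLReader30), NULL, CLSCTX_INPROC_SERVER,
                             __uuidof(ISAXXMLReader), (void**)&pReader);
            CoCreateInstance(__uuidof(MXXMLWriter30), NULL, CLSCTX_INPROC_SERVER,
                             __uuidof(IMXWriter), (void**)&pWriter);

            ISAXContentHandler* pHandler = NULL;
            pWriter->QueryInterface(__uuidof(ISAXContentHandler), (void**)&pHandler);
            pReader->putContentHandler(pHandler);

            pReader->parseURL(L"huge-input.xml");   // placeholder file name

            // By default everything the writer produces accumulates in its
            // "output" string property -- the memory problem discussed below.
            VARIANT varOutput;
            VariantInit(&varOutput);
            pWriter->get_output(&varOutput);
            // ... save varOutput to disk ...
            VariantClear(&varOutput);

            pHandler->Release();
            pWriter->Release();
            pReader->Release();
        }
        CoUninitialize();
        return 0;
    }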

There was another issue as well. The MXXMLWriter stores all of its output in a string property called output. This is where we encountered the next major Big O issue in developing our server application. Every subdocument is stored in memory as MXXMLWriter collects it. When we were ready to save it to disk, we had to allocate a chunk of memory of about the same size to convert from Unicode to UTF-8. Because the size of the subdocuments increases proportionally with the size of the original XML document, our O(1) space algorithm got demoted to O(n). Not good.

Microsoft's documentation on IMXWriter, specifically the output property, offered us hope: "You can also set this property to any implementation of the IStream interface, and the resulting document will be written into the provided IStream."

It sounded promising, but a lack of sample code and a sea of COM documentation on the IStream interface left us frustrated. In the end, however, we found that once you hook up MXXMLWriter with a proper implementation of IStream, the rest of the program stays the same. Every time MXXMLWriter accumulates 4,096 Unicode characters (which occupy 8 KB of memory), it invokes the Write method of IStream, which converts the buffer to UTF-8 (which takes another 3.3 KB) and writes it to the file. Regardless of the size of the file or the subdocuments, this implementation will consume less than a dozen kilobytes of memory at a time. That's what you call O(1) space, and it means you'll now have to look elsewhere for the bottleneck on your busy server.
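
For the curious, here is a stripped-down sketch of such a stream. It is not the actual xml_article.cpp code: the class name FileStream is invented, error handling is minimal, and the remaining IStream methods are stubbed on the assumption that the writer only ever writes sequentially. Following the description above, Write treats each incoming buffer as Unicode characters, converts it to UTF-8, and appends it to the file.

    // Write-only IStream: converts each wide-character buffer handed to it
    // into UTF-8 and appends it to a file, so memory use stays constant.
    #include <windows.h>
    #include <objidl.h>

    class FileStream : public IStream
    {
        LONG   m_refs;
        HANDLE m_hFile;
    public:
        explicit FileStream(const wchar_t* path) : m_refs(1)
        {
            m_hFile = CreateFileW(path, GENERIC_WRITE, 0, NULL,
                                  CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
        }
        ~FileStream() { if (m_hFile != INVALID_HANDLE_VALUE) CloseHandle(m_hFile); }

        // IUnknown
        STDMETHODIMP QueryInterface(REFIID riid, void** ppv)
        {
            if (riid == IID_IUnknown || riid == IID_ISequentialStream || riid == IID_IStream)
            { *ppv = static_cast<IStream*>(this); AddRef(); return S_OK; }
            *ppv = NULL; return E_NOINTERFACE;
        }
        STDMETHODIMP_(ULONG) AddRef()  { return InterlockedIncrement(&m_refs); }
        STDMETHODIMP_(ULONG) Release()
        { ULONG r = InterlockedDecrement(&m_refs); if (r == 0) delete this; return r; }

        // Convert one buffer of Unicode characters to UTF-8 and write it out.
        STDMETHODIMP Write(const void* pv, ULONG cb, ULONG* pcbWritten)
        {
            const wchar_t* pwch = static_cast<const wchar_t*>(pv);
            int cch = cb / sizeof(wchar_t);
            if (cch == 0) { if (pcbWritten) *pcbWritten = 0; return S_OK; }
            int   cbMax  = cch * 3 + 1;                 // worst-case UTF-8 size
            char* utf8   = new char[cbMax];
            int   cbUtf8 = WideCharToMultiByte(CP_UTF8, 0, pwch, cch, utf8, cbMax, NULL, NULL);
            DWORD written = 0;
            BOOL  ok = WriteFile(m_hFile, utf8, cbUtf8, &written, NULL);
            delete[] utf8;
            if (pcbWritten) *pcbWritten = cb;
            return ok ? S_OK : STG_E_WRITEFAULT;
        }
        STDMETHODIMP Read(void*, ULONG, ULONG*)                          { return E_NOTIMPL; }

        // Not needed for sequential output.
        STDMETHODIMP Seek(LARGE_INTEGER, DWORD, ULARGE_INTEGER*)         { return E_NOTIMPL; }
        STDMETHODIMP SetSize(ULARGE_INTEGER)                             { return E_NOTIMPL; }
        STDMETHODIMP CopyTo(IStream*, ULARGE_INTEGER, ULARGE_INTEGER*, ULARGE_INTEGER*) { return E_NOTIMPL; }
        STDMETHODIMP Commit(DWORD)                                       { return S_OK; }
        STDMETHODIMP Revert()                                            { return E_NOTIMPL; }
        STDMETHODIMP LockRegion(ULARGE_INTEGER, ULARGE_INTEGER, DWORD)   { return E_NOTIMPL; }
        STDMETHODIMP UnlockRegion(ULARGE_INTEGER, ULARGE_INTEGER, DWORD) { return E_NOTIMPL; }
        STDMETHODIMP Stat(STATSTG*, DWORD)                               { return E_NOTIMPL; }
        STDMETHODIMP Clone(IStream**)                                    { return E_NOTIMPL; }
    };

Hooking the stream up is then a matter of wrapping it in a VARIANT and assigning it to the writer's output property before the parse starts, along these lines (pWriter as in the earlier sketch, path a placeholder):

    FileStream* pStream = new FileStream(L"subdoc.xml");
    VARIANT varDest;
    VariantInit(&varDest);
    varDest.vt = VT_UNKNOWN;
    varDest.punkVal = pStream;
    pWriter->put_output(varDest);   // release pStream once parsing is finished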

I have included sample code for the IFileStream class, which implements IStream, as well as the code that converts a buffer of Unicode characters into UTF-8, available as xml_article.cpp and xml_article.h.

If you want to see our XML rendering server in action, visit http://www.createxml.com.