July 19, 2000
The Apache XML developer lists have seem some turmoil recently over a proposed plan to develop their next generation XML parser. However a phoenix may have been borne out of the ashes: this week we report on "Spinnaker", a.k.a. the Xerces Refactoring Initiative.
Sparring over Spinnaker
In a bold announcement on the general Apache developer list, James Davidson outlined his intention to start a project, code-named "Spinnaker", with some ambitious goals:
Spinnaker is an attempt to create a next generation Apache XML Parser based on all the lessons learned from the current versions of Xerces and Crimson.
The announcement listed goals for the Spinnaker effort itself. Amongst these was the desire to produce a parser suitable for inclusion in the Java Development Kit, a goal that apparently had internal backing from Sun.
To say that the announcement was welcomed with less than open arms by some of the IBM developers would be an understatement:
Looks like a "coup d'etat" to me.
This was the reaction of Arnaud Le Hors. Andy Clark, another IBM developer, questioned the wisdom of beginning such a project at the start of the weekend:
Is it possible that, in the future, we hear about submissions to the tree *before* everyone goes home on Friday? I want us all to work together on the future of the Xerces parser instead of being surprised by a new source tree over a weekend.
These opening remarks set the tone for a heated debate, which threatened to turn into an IBM versus Sun squabble over who was playing fair by open source rules. Abrasive remarks were exchanged, with both sides quickly disavowing any corporate-lead motives, whilst questioning those of the other side. James Davidson believed the timing of his announcement was irrelevant:
This is open source, Apache style. People work whenever they work and that's the way this all works. Most Apache developers don't work on the main sources during the typical M-F 8-4(local time) window. They work when they get time, or the muse is with them, or whatever. There are no limits, it's a 24/7 shop and to be blunt, conformance with a corporate schedule isn't part of the mandate.
Arnaud Le Hors singled out the method of initiating the project for particular criticism:
The problem is not about Sun vs IBM. It is not about corporate vs individuals. Not about creating a new project. Not about working on week-ends. Not about questioning the current goals or implementation. The problem is about making decision on your own. Not communicating. Not consulting others in the project.
Ironically, it seems that a lack of communication, along with a good deal of miscommunication, was the root cause of the argument. This is symptomatic of an internal divide within the Apache parser development community. Stefano Mazzocchi, in valiant attempts to defuse the situation, highlighted the cause of the divide:
...Tomcat, Xerces and Xalan didn't start as Apache projects and their development community was _imposed_ and did not emerge from the community of volunteers.
Both Xerces and Crimson are "donations" to the Apache effort. However, these donations haven't been made into the arms of a waiting team of Apache volunteers. Both projects are still largely driven by development teams at IBM and Sun. Mazzocchi's key point is that simply providing a public CVS server does not make for a good open source project. If the development teams continue as they have done prior to opening the source (e.g., by having off-line design meetings), then there's little chance for a developer community to form around the project.
Mazzocchi suggested that the Tomcat developers have managed to make the transition from internally driven development to a full open source community, whereas the Xerces, Xalan, and Crimson teams have yet to achieve it. Rajiv Mordani further illustrated this point:
... since most of you'll work in the same office there is a lot of things that happen in there that should be actually done on the mailing lists. I don't remember seeing a mail going out proposing the implementation of schemas for [example]. The only info that was sent out is that the repository will be a little unstable for the next few days as schema support is being added to Xerces so please use the tagged version of the workspace.
It would be wrong to single out just the IBM developers in this regard. It appears that the Sun team is as much at fault--although, as Brett McLaughin commented, it's the external perspective that matters:
What I meant to focus on, and what we may have miscommunicated on, is the /perception/ of what is happening... The _perception_ ... is what we have to fight in an open source project.
As tempers cooled, it quickly became apparent that the perception that neither team is interested in the other's software was incorrect. Once the rhetoric was laid aside, a lot of common ground was discovered. At that point the real work started.
To help push matters forward, Sam Ruby suggested that the project should be known as the Xerces Refactoring Initiative (XRI)--a name he described as "intentionally boring". Stefano Mazzocchi produced a "virtual press release" to explain the common purpose of the initiative:
The Apache XML community started a "Refactoring Initiative" to create [a] next generation XML parser for Java. This will be known as "codename Spinnaker" (or simply spinnaker) and will be the collective design whiteboard where the worldwide Apache community of individuals will openly develop such [a] new XML parser. The final result of this RI will be determined when "final" status is reached and, as always, decided by the community.
Taking the Initiative
Turf wars aside, what are the driving forces for XRI, and what are its goals?
In his opening announcement, James Davidson listed several perceived problems with Xerces. These included performance problems under HotSpot, and code complexity. Both of these appear to be the result of efforts put in by the Xerces team to optimize the parser for the Java 1.1 platform. While the efforts have clearly paid off, the optimizations do make the code harder to understand, and limit the ability of HotSpot to automatically optimize performance on Java 1.2 and 1.3 virtual machines.
Edwin Goei observed that code complexity is one reason why additional contributions haven't been made by the XML community:
The main objection I have to the current Xerces code is not about which VM it runs on, but on the ability for a developer to understand the code and make changes to it. I think one major reason there has not been more developer participation is because the code is difficult to understand ... I looked at two other parsers that I believe are easier to understand: Aelfred2 (not Apache) and Crimson/ProjectX ... With Xerces it takes much more effort to understand the code and I prefer not to make changes to code I don't understand.
This observation was echoed by Brett McLaughlin who suggested that developers are turning their efforts elsewhere:
I have lots of users on the JDOM lists who think Xerces is great, but are totally scared by the code--unfortunately, I can't blame them. They would rather spend their time working on JDOM, or something similar.
The complexity of Xerces was one issue that wasn't disputed. Arnaud Le Hors explained that the IBM Xerces team had been considering a redesign all along:
The fact is that we're also interested in a new version of Xerces which is more modular. As a matter of fact, we've given it quite some thoughts already, and [have] even written down a first draft of a design document on it that we'll be happy to send out as input.
We don't think the current model is perfect. We refer to it as "the star model". Basically every piece knows about the others and the flow of information is far from being linear. Instead, we're thinking of a pipeline model. Where every piece works as a box with an input and output stream working independently of the others. It will be hard to make this as efficient as the current parser. But this, among other things, would let one to easily plugin the validator or not.
Once the common ground had been found, the requirements began to emerge thick and fast. Scott Boag described his vision of what the initiative would involve:
I think the point is to build/refactor a next generation, commercially viable, parser based on the current state of the art (including and especially Schema support), and the collected requirements. It is indeed pointless, in my opinion, to talk about "adapting" other code bases -- what is going to occur is a mining operation from Xerces and Crimson. Anything that's open source with the right license is open to be mined for ideas...
Arved Sandstrom agreed that the search for ideas should be expanded to all parsers with available source and suitable licensing:
There's stuff that microparsers like nanoxml can contribute to the discussion. Python XML parsers (and there is more than one) are quite good. If we are talking James Clark, let's not forget expat; this is a very good parser and represents the core of the Perl XML processing family.
...I think there is potential here for making this project best-of-breed when it comes to showing that open-source can do process.
The project has now taken on a life of its own: differences seem to have been set aside (at least temporarily) to focus instead on the technology. Ed Staub has taken on the task of co-ordinating the collection and publication of the XRI requirements. The discussion has already thrown out some interesting ideas such as Grammar caching (avoiding reparsing of the same DTD multiple times), and compiling an XML Schema into a custom parser.
There's obviously a long road ahead for XRI/Spinnaker but it should yield some interesting results. Don't go looking for a Xerces 2 just yet, and don't despair, as Xerces 1 is not about to be abandoned: the IBM team is currently hard at work completing XML Schema support.
Perhaps the key benefit of this project, despite its rocky beginnings, is that the internal fragility of the Apache XML may be removed. With effort from both the Xerces and Crimson teams, as well as the wider developer community, we'll not only benefit from a next-generation parser, but also a more stable and collaborative process underpinning the development of a vital component in the XML infrastructure.