Capturing the State of Distributed Systems with XML
Throughout a PNR's lifecycle, from the first call to a travel agent to the final posting of frequent-flier miles, these complex data structures face three challenges: to be distributed across time, to both past and future readers; to be distributed across space, to other machines; and to be distributed across communities, to other organizations and applications. The first challenge calls for a stable data format, since itineraries have to be updated consistently by reservationists, ticket agents, gate agents, flight crews, database engineers, accountants, and others. In the second case, there needs to be a stable grain of exchange, to share records and commit transactions between a bevy of information systems. Finally, to communicate across organizations, there have to be common definitions: agreements between airlines, hotels, rental car agencies, travel agents, and passengers about the interpretation of dates, locations, flights, prices, and so on.
In each of these situations, system designers can leverage several strategies to manage distributed data cost-effectively. File formats, for example, must be machine-readable, but can be more future-proof if they are also human-readable and use self-describing schemas. When packaging related objects together to exchange with other machines, finer-grained marshaling strategies are more flexible than integrating systems through a handful of fixed report formats. Finally, industry-wide coordination has been notoriously difficult to design by committee. Instead of fixing protocols and data dictionaries in advance, the best strategy may be to collaborate through conventional "documents"--for example, purchase orders instead of Electronic Data Interchange (EDI) records.
All too commonly, the actual decisions of system designers fall short against these measures. Proprietary, underdocumented, binary file formats are not merely quick hacks; they are strategic decisions to lock in users. Concurrent systems almost immediately retreat to a unified system image, so instead of marshaling only relevant data, the entire database needs to be shared. The result is horrifying: black-box legacy systems that are rarely shared within a community, much less among suppliers, vendors, and other outside users.
In this paper, we argue that the Extensible Markup Language (XML) [3][1] and its companion Extensible Linking Language (XLL) [4] can together provide an effective solution for capturing the state of distributed systems, particularly on the World Wide Web. XML was designed to provide a subset of the Standard Generalized Markup Language (SGML) that is easy to write, interpret, and implement.[2] Since XML allows extensible markup while preserving rigorous validation, we advocate storing information in XML, sharing it according to XLL's link model, and weaving XML-enhanced data structures into Web documents.
The point of reducing some complex multidimensional data structure to a bitstream is ultimately to allow some future user to reconstitute that same data structure and manipulate it accurately. The key is enforcing a schema for these transformations. In this section, we will explore the tensions that lead to brittle data formats (Section 2.1), three strategies for future-proofing data formats (Section 2.2), and how XML-based data formats execute those strategies (Section 2.3).
Type-equivalence problems in a language can spread to the archives, too, like the impedance mismatch between Java's (and its Serialization's) int type and Integer class [14]. Each system establishes its own set of canonical primitives such as character, string, integer, and float, and its own encodings, leading to yet more conversion challenges--both on the wire level (for example, COM [6]) and on the interface level [19]. Abstract Syntax Notation One (ASN.1) [13] encoding rules, for example, specify the type, length, and value of each datum in the stream--as well as the type, length, and value of the type and length.
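The contrast is visible even for a single datum. Under ASN.1's Distinguished Encoding Rules, the integer 42 reduces to three opaque octets, whereas a self-describing markup rendering (a sketch with hypothetical element and attribute names) spells out its own role:

    <!-- DER encodes the INTEGER 42 as the octets 02 01 2A:
         02 = the universal tag for INTEGER
         01 = the length of the value, in octets
         2A = the value itself (42 decimal).
         The same datum as self-describing markup: -->
    <fareAmount currency="USD">42</fareAmount>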
Human-readable formats have their own traps. Many UNIX system databases, for example, embrace extensibility, manual editing, and embedded comments [17]. Yet each of the many colon-separated flat-file databases for users, groups, email aliases, and so on is still cryptic, not automatically validatable, and not self-documenting. As the system grows, some databases need to be replaced wholesale by incompatible binary forms updated by distributed directory protocols.
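Consider a single record from the colon-separated user database, followed by a hypothetical XML rendering of the same fields; the element names are illustrative, not drawn from any standard DTD:

    alice:x:1000:100:Alice Example:/home/alice:/bin/sh

    <user>
      <login>alice</login>
      <uid>1000</uid>
      <gid>100</gid>
      <name>Alice Example</name>
      <home>/home/alice</home>
      <shell>/bin/sh</shell>
    </user>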
In short, data storage formats are difficult to "future-proof." It takes care and effort to design extensible, editable, scalable, and correct formats, as well as the parsers, generators, and Application Programming Interfaces (APIs) that implement them. Instead, designers face immediate concerns about:
Inertia
Machine-readable formats
Successfully machine-readable formats are measured by the logic required to extract and manipulate them. Rigorous enforcement of syntax rules simplifies parsing logic at the expense of robust error handling. Direct projection of the in-memory data representation simplifies parsing and generation at the expense of human-readability and cross-platform support. For example, capturing numeric data in binary form is simple and potentially compact, but unreadable and dependent on the endianness of the CPU architecture. Mission-specific grammars can be more compact than adaptations of general-purpose encodings (e.g., ASN.1). Turing-complete formats, which represent state as executable program text, inflate parser and generator size while reducing the fidelity of data manipulation. For example, an airline ticket rendered as PostScript requires executing a large program, and even then yields strokes and arcs instead of cities and flights.
Successfully human-readable formats, by contrast, are measured by the cognitive effort to extract and manipulate information [18]. In this case, flexible enforcement of syntax rules makes it easier to edit and read. Data representations need to be translated to accessible forms, potentially at the expense of fidelity. For example, integers can be represented accurately in decimal, but inaccuracies can crop up for floating-point numbers. Data presentations also need to be accessible: a Portable Network Graphics (PNG) picture is "human-readable" when presented as an image. A spreadsheet presented as a table, though, loses the equations and symbolic logic behind the numbers in the process. The benefit of all of these tradeoffs is increased reusability, which in turn increases the viability of, and investment in, maintaining that format. Conversely, when human-readability is reduced to an afterthought as a companion "import/export" format, the canonical binary format may still not become future-proof.
Successfully self-describing formats are measured by how much can be discovered dynamically about their mechanical structure and semantics. The first test is simple identification. The file should contain some type signature, perhaps even a revision number, or at least a filename extension--enough to characterize the format. Leveraging that identity to define the provenance of the data and its definitions is the next step. A typical UNIX system configuration file, for example, at least refers to the section of the manual that defines its entries. The third test is whether that definition is sufficient to dynamically extract and manipulate the information, with both structural and presentational guides. These kinds of metadata can future-proof a format, preserving machine-readability and human-readability.
First, each specific XML-based file format is based on a separate, explicit Document Type Definition (DTD). Each DTD defines the names of new tags, their structure, and their content model. More to the point, valid XML files must disclose their DTDs, either by reference in their headers or by including the entire DTD within the XML file itself, neatly enforcing self-description. The DTD functions analogously to an Interface Definition Language (IDL) specification or a relational database schema.
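As a minimal sketch, a DTD for the itinerary records discussed above might read as follows; the element names and content models are our own illustration, not a published format:

    <!ELEMENT Itinerary   (Passenger, Flight+)>
    <!ELEMENT Passenger   (#PCDATA)>
    <!ELEMENT Flight      (Origin, Destination)>
    <!ATTLIST Flight      number CDATA #REQUIRED
                          date   CDATA #REQUIRED>
    <!ELEMENT Origin      (#PCDATA)>
    <!ELEMENT Destination (#PCDATA)>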
Second, XML parsers can process files with or without the DTD: with it, they can validate a document's structure; without it, they can still check well-formedness, since the implicit grammar rules define a hybrid machine- and human-readable text format that can represent numbers, strings, and even escaped binary content. The tools themselves can be built small and run quickly, as described elsewhere in this issue. The resulting files are probably larger than alternative formats, but XML markup should compress effectively for tightly constrained environments. XML-formatted metadata can also be stored alongside legacy files as appropriate.
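A document instance can then declare its DTD by reference. A validating parser would check this sketch against the itinerary DTD above, while a non-validating parser could still confirm its well-formedness (the URL is hypothetical):

    <?xml version="1.0"?>
    <!DOCTYPE Itinerary SYSTEM "http://example.com/dtds/itinerary.dtd">
    <Itinerary>
      <Passenger>Pat Traveler</Passenger>
      <Flight number="42" date="19980302">
        <Origin>SFO</Origin>
        <Destination>BOS</Destination>
      </Flight>
    </Itinerary>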
Furthermore, using XML unlocks other opportunities. DTDs can be cascaded to represent compound data types [5]. A TravelAuthorization record, for example, could combine an Itinerary record and an EmployeeAccount. DTDs can also be hosted on the Web, allowing users to dynamically learn about new formats. Style sheets can be applied to the tagged XML data, garnering all of the formatting abilities that application entails.
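One way to compose such a record is with external parameter entities, which splice existing DTDs into a new one. A minimal sketch, assuming hypothetical DTD locations:

    <!ENTITY % itinerary SYSTEM "http://example.com/dtds/itinerary.dtd">
    <!ENTITY % account   SYSTEM "http://example.com/dtds/employee-account.dtd">
    %itinerary;
    %account;
    <!ELEMENT TravelAuthorization (Itinerary, EmployeeAccount)>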
While every system needs to ensure that its input and output formats can withstand the test of time, distributed systems must also share knowledge between different physical locations at the same time. Protocols must be established for excerpting relevant parts of the workload and shipping data between subsystems, either across a network of separate computers or using interprocess communication within the same computer. In this section, we explain some of the tradeoffs of marshaling data (Section 3.1), a strategy that uses the network to defer marshaling decisions (Section 3.2), and how the Extensible Linking Language improves upon the Web's hypertext semantics to match (Section 3.3).
Consider the challenge of exchanging state between the flight dispatcher and another critical resource, the crew dispatcher. Once a flight has been scheduled onto a plane, it still needs pilots and attendants certified to operate that plane in that departure city. It is an especially complex space-time chess game because people, unlike planes, need to return home soon. Optimization algorithms manipulate all of these records simultaneously, producing a complex, connected graph of Employees, Flights, and Planes in memory. The results need to be shared with yet other subsystems in operations and human resources: reports summarizing the activity of each Plane and Employee.
To "pickle" the state of a Plane, we can write down its particulars, but then there are pointers to the several Flights it will take that day which in turn point to several crews. Extracting that report requires marking all the records that plane depends on, then cutting that subgraph out of the larger database by replacing pointers with internal references in the archive. Of course, it is not just a simple spreading tree: pickling an Employee requires enumerating the Flights and Planes it is linked to, and the recursive, tangled mess could easily expand to encompass the entire database.
The system designer has to break this cycle, literally. Decisions must be made either to include a linked record in the archive, or else to replace the pointer with a symbolic name. For example, the daily roster for a Plane can terminate by recording only the Employee name and ID, eliding other details that can be reconstructed by dereferencing that ID. The Employee schedule need only list Flight numbers, rather than include the full details of the flight's passengers, meals, and revenue.
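In markup terms, the daily roster might terminate the graph as in the following sketch, where each crew member carries only a symbolic ID to be dereferenced later (all names are illustrative):

    <Plane tail="N1701">
      <Flight number="42">
        <CrewMember id="E1037">J. Doe</CrewMember>
        <CrewMember id="E2240">R. Roe</CrewMember>
      </Flight>
    </Plane>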
In the geography of distributed systems, distance is (the inverse of) bandwidth, which constrains both the size and frequency of messages. Designers also have to enforce policies about how often to update the system. At one extreme, all data can be stored in atomically small records in one high-performance database. Even if that database is hosted on several computers running in parallel, it is essentially a centralized philosophy at work. It can be made to scale: today's airline reservation systems pool dozens of mainframes in massive hardened data centers into some of the highest-throughput transaction networks in the world. At the other extreme, all related data can be isolated within one system that emits a batch-processed summary of the entire set every so often.
The Web solves this problem rather differently. A page can include many subsidiary resources, some of which load other subparts in turn. Different pages can also share common resources. Web servers do not transmit a single neat package, though: each resource is transferred in a separate HTTP request-response pair [10].
The key observation is that the links between resources already have names. Instead of pointers that can only be interpreted in the sender's context (like memory addresses), relative and absolute Uniform Resource Locators [1] can be interpreted by any recipient. Instead of expensive marshaling burdens on the server (writer), the client (reader) can incrementally fetch the desired resources as needed.
Separating each transaction does not necessarily compromise consistency. At first it might seem that since each resource is exchanged at a different point in time, the entire set could change in the middle. That race condition can be prevented by incorporating state into the URL (for example, a version indicator a la Web Distributed Authoring and Versioning (WebDAV) [20]) or into the protocol (for example, an HTTP Cookie [16]).
Separating each transaction can hamper performance, though. HTTP's strictly synchronous model implies a round-trip delay for each resource, even if the sender already knows what dependent resources should be marshaled together. HTTP caching or the future evolution of HTTP to allow "push" responses can both address this limitation.
Neither of these engineering concerns dilutes the lesson of linking with names, since URLs are designed to assimilate new naming schemes and access protocols. The strategy of linking resources together with names defers both of the costs associated with marshaling: the prerogative to drill down shifts to the recipient, and the sender does not have to map out an entire report.
XLL can indicate whether each linked resource should be interpreted within the same context or a new one (the SHOW axis) and suggest whether to access it in parallel or in series (the ACTUATE axis).
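Under the XLL working drafts, both hints appear as attributes on the linking element itself. A minimal sketch, assuming a hypothetical roster element and URL:

    <roster xml:link="simple"
            href="http://airline.example/planes/N1701/roster"
            show="embed"
            actuate="auto"/>

Here show="embed" asks the application to splice the target into the current context, and actuate="auto" asks it to traverse the link immediately rather than waiting for the user.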
Since the actual link address format is just a URL, it can point at any named span in the target document. Fragment identifiers like document#label behave just as they do for HTML: the client loads and parses the entire document, then searches for the anchor element the original author so labeled. Unlike HTML, though, URLs referring to XML documents can use an extended pointer (XPTR) syntax developed by the Text Encoding Initiative (TEI). An XPTR identifier such as document|ID(label),CHILD(2,*) points to the second element below the labeled anchor; there are many other operators for navigating the parse tree, counting characters, matching strings, and indicating spans. XLL deliberately leaves it unspecified who dereferences an XPTR identifier, so a smart server, such as a dictionary service, can indeed return only the matching definitions.
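A few illustrative forms, with operator names drawn from the TEI-derived draft syntax (the document name and labels are hypothetical):

    document.xml#label
        (the anchor element whose ID is "label", found after
        parsing the whole document)
    document.xml|ID(label),CHILD(2,*)
        (the second child element below that anchor)
    document.xml|ID(label),DESCENDANT(1,Flight)
        (the first Flight element anywhere beneath that anchor)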
The latter development is perhaps the most significant for XML's future as an archiving format. Portions of the state within a structure can be named, linked to, and even excerpted without modifying the source. Even state-of-the-art object-oriented serialization services for Objective-C and Java can only archive an entire stream all at once [9] [14]. XML's well-formedness requirements produce structured documents that can be correctly manipulated, even without the entire contents of the document at hand.
The information to fill out an expense report certainly exists within the airline's databases. That information was even collated into a self-contained document. When that ticket changed hands from agent to passenger, though, it was ripped out of context. The point of preparing a report should be to come to enough ontological agreement to allow an outsider to reconstitute its context, and hence its meaning. In this section, we explore the challenges of interorganizational collaboration (Section 4.1), a document-centered integration strategy (Section 4.2), and how XML-enhanced documents can provide a usable face to structured data (Section 4.3).
This is not a technology problem. It is not a matter of wiring up all the players with email and Web sites. It is an ontological problem where no two vocabulary sets quite line up. For example, if a meeting slips from the afternoon to the next morning, it adds one extra hotel "night" (calculated by solar day), zero extra car rental "days" (calculated in 24-hour blocks), and possibly even a more expensive airline ticket (if the fare had a "maximum-stay" limit, which would be measured in the originating time zone).
Understanding these varying bits of jargon for marking time confers membership in each industry. Organizations can be defined by their language: ontology recapitulates community. Coordinating tasks across organizations ineluctably requires adapting to local conventions. It also requires prying information out of the several distributed systems involved: each of the travel plans hides behind an opaque reservation code, to say nothing of the chaos in calendaring standards.
Consider a bank check. Legally, a demand deposit account can be used with a signed napkin, but the U.S. Federal Reserve's clearance policies set out the physical dimensions, layout, and magnetic-ink encoding of a check. As the check moves from bank to bank, there is no confusion as to the exact interpretation of accounts, amounts, and dates, because the check incorporates its own legal conventions. At the other end of the spectrum, a forty-thousand-page New Drug Application to the U.S. Food and Drug Administration plays the same roles. The application is the one artifact that represents years of negotiations, carefully logged. The application sets out its own drug-specific scientific terms and tests, negotiated by both sides' analysts.
Documents in cyberspace assume the same roles: embodying the user interface to a task and defining its terms. The document metaphor has a long pedigree in user interface research, far predating the Web. Taligent, arguably the most sophisticated multiuser collaborative document toolkit to date, strongly endorsed the convergence of application-as-document and "collaborative places" [8]. Concurrent document views were assembled from active components consulting a shared structured storage model, while interaction could pass from user to user. While such peer-to-peer collaboration may be several generations ahead of current Web client technology, server-based coordination of Web pages with forms and active content is a sufficient simulacrum. The broader lesson is that an intelligent "purchase order" document can be a more usable representation of the collaborative process than a traditional application. Web technology accelerates the development cycle by dramatically lowering the threshold for creating document interfaces. A form and a CGI interface to the Shipping Department can put a business online faster than an army of EDI consultants, because the Web's markup format is so accessible.
The logic embedded in a collaborative document also defines the ontology for that task. Within a community, understanding the semantics of a document is a matter of identifying its format (Section 2.2). An outsider has to understand the ontology behind the format, well enough perhaps to translate it into locally-meaningful terms. A calendar developer has to build a lot of shared context with an airline reservation structure to extract facts like "the user will be on a plane and inaccessible during each flight; during a flight, the calendar's time zone should be reset; and the user will not be available for meetings at the office." On the other hand, instead of waiting for an industrywide or international standards process to deliberate over the canonical meaning of "place" and "time," developers can at least knit one-to-one mappings. Popular ontologies can emerge organically, like well-trodden paths in a field.
Electronic commerce on the Web is already big business [15], but its HTML form-based infrastructure is not enough to "become the concrete face of the task." For example, two sites selling books and flowers will both inevitably ask for a shipping address. But without a structured container for addresses, there is no way to automatically fill in the order page at either site, much less use a shared address format. As XML-savvy tools become more popular, Web developers will be able to publish and receive XML street addresses within HTML Web pages [7]. Forms extensions could specify the DTD of input data. Style sheets will format the appearance of embedded data structures.
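Such a container might look like the following sketch; the tag set is purely illustrative, since no standard address DTD is implied:

    <Address>
      <Street>123 Main Street</Street>
      <City>Irvine</City>
      <State>CA</State>
      <PostalCode>92697</PostalCode>
    </Address>

A browser or shopping agent that recognized the format could fill in either merchant's order form from the same stored record.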
The technology to manipulate the ontology of XML documents is a little further off. The key is XML's hooks for identifying DTDs. The Formal Public Identifier for a document type can now be associated with a URL. XML processing tools could expect to dereference that address and discover not only a DTD file, but also metadata about the meaning of each tag, default style sheets, and possibly even mobile code resources for manipulating such data. With this kind of documentation, automated translation tools might be able to associate an airline's <location> tag, which refers to airport codes, with an atlas's latitude and longitude entries.
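Concretely, a document's header could pair a Formal Public Identifier with an address to dereference (both identifiers here are hypothetical):

    <!DOCTYPE Itinerary
      PUBLIC "-//ExampleAir//DTD Itinerary 1.0//EN"
             "http://example.com/dtds/itinerary.dtd">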
In the interim, several exciting tools are already focusing on this vision. webMethods' Web Interface Definition Language (WIDL) [3] can extract structured data from HTML and XML web pages, invoke processing on Web servers using forms, and collate reports harvested from multiple sites in a single format. Many developers have rallied to the motto "XML gives Java something to chew on" [4], referring to the synergy of XML and mobile code embedded together in Web pages. All of these trends are narrowing the gap between human-readable and machine-readable documents.
We have tried to set forth the challenges facing distributed system designers in this context. We argue that XML can effectively future-proof data formats, exchange data structures, and enhance Web documents into robust platforms for system integration. It is not the first, last, or universal solution, but it does accelerate the continuing evolution of the Web. As the Web assimilates "the universe of all network-accessible information" [2], and as XML adds the metadata to define that universe, at some point information transubstantiates into knowledge.
A modern airline can no more take flight without its information systems than without jet fuel. At some point, the distributed system no longer models reality; it becomes reality. As David Gelernter predicted in his 1991 book, when the image in the machine corresponds to the real world, in both directions, we have built a Mirror World [12]. Today, these only exist in limited domains at vast expense: transportation systems, telecommunications systems, military operations. Soon, to the degree that the Web continues to evolve toward richer data representation, and proprietary systems gain Web interfaces, XML will mediate the recreation of reality in cyberspace.