XML & Object Persistence: Serialization Problems

September 8, 1999

Contents

• Part 1: Using XML for Object Persistence
• Part 2: Serialization Problems
• Part 3: Roll Your Own Generic XML Data Format

So far, serialization is quite straightforward: add a method to your object's class that takes every property worth persisting and appends it to a string or a byte array, which serves as the container for your serialized data. But there are a few problems you should be aware of. The next few sections address them in turn.

How do you separate the data items from one another?

Data items have to be separated in the output "string" (or byte stream) returned from the object serialization process. Otherwise we won't be able to identify them correctly later, when we want to read the data back into a new object during deserialization.

When storing simple two-dimensional arrays in a text file (yes, that is also a form of serialization), we usually separate one array dimension from the other by putting each row on its own line. Within a row, the array cells are separated by a semicolon (;) or a tab character.

100;200;300;400
150;230;299;415
99;201;319;399
111;222;333;444

Two different delimiters have to be used, one for each dimension (carriage return/line feed and ";" in the above example). That's okay, as long as you know exactly how many dimensions your data has. But be careful: the delimiter characters must not occur within the data items themselves. (Don't worry about this too much right now; it stops being a problem as you read on.)
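The two-delimiter scheme can be sketched in a few lines. This is a minimal illustration in Python rather than the Visual Basic used elsewhere in this article; the function names are made up for the example, and a plain newline stands in for carriage return/line feed. The serializer refuses any data item that contains one of the delimiters, which is exactly the restriction mentioned above.

```python
ROW_SEP = "\n"   # separates rows (first dimension); stands in for CR/LF
CELL_SEP = ";"   # separates cells within a row (second dimension)


def serialize_grid(grid):
    """Turn a list of rows into delimited text (hypothetical helper)."""
    for row in grid:
        for cell in row:
            text = str(cell)
            # A delimiter inside a data item would corrupt the format.
            if ROW_SEP in text or CELL_SEP in text:
                raise ValueError(f"delimiter found inside data item: {text!r}")
    return ROW_SEP.join(CELL_SEP.join(str(cell) for cell in row) for row in grid)


def deserialize_grid(text):
    """Split the text back into rows of integers."""
    return [[int(cell) for cell in line.split(CELL_SEP)]
            for line in text.split(ROW_SEP)]


grid = [
    [100, 200, 300, 400],
    [150, 230, 299, 415],
    [99, 201, 319, 399],
    [111, 222, 333, 444],
]
text = serialize_grid(grid)
restored = deserialize_grid(text)
```

Note that the reader has to know in advance that there are exactly two dimensions and that every cell is an integer; none of that is recorded in the data itself.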

How do you preserve hierarchical structures?

If you don't want to reconsider the serialization format for each of your different objects, you need to come up with a general data format that suits all your objects' needs. But then you cannot safely assume a fixed number of data dimensions, so you cannot define a fixed number of delimiters. The general problem you face here is how to delimit arbitrarily nested data, that is, truly hierarchical data structures. (Multi-dimensional arrays, which you might not immediately think of as hierarchical data structures, can be mapped into a hierarchy of nested one-dimensional arrays.)

You delimit nested data items as you would write mathematical expressions: you use parentheses. The above code sample does exactly that. It uses angle brackets (< and >) to delimit a data item. A data item is either a simple type value, such as a number or a string, or a data structure made up of simple type values or other data structures.

By using two different delimiters to mark the start and the end of a data item, it is easy to nest data items. You just have to be careful not to "close" a data item before all the data it contains has been serialized. Bracketing data correctly creates a container that can hold other containers.
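Here is a small sketch of that bracketing idea, written in Python for illustration. The exact notation is an assumption made for this example: every data item is wrapped in < and >, and a list becomes a container whose content is a sequence of bracketed items. The function names are hypothetical.

```python
def serialize(item):
    """Wrap every data item in < and >; lists become nested containers."""
    if isinstance(item, list):
        return "<" + "".join(serialize(child) for child in item) + ">"
    return "<" + str(item) + ">"


def deserialize(text):
    """Parse the bracket notation back into nested lists of strings.

    Assumes well-formed input whose values contain neither < nor >.
    An empty list and an empty string both serialize to "<>" and
    therefore cannot be told apart.
    """
    def parse(pos):
        pos += 1                      # step over the opening "<"
        if text[pos] == "<":          # a container: read items until ">"
            children = []
            while text[pos] == "<":
                child, pos = parse(pos)
                children.append(child)
            return children, pos + 1  # step over the closing ">"
        end = text.index(">", pos)    # a simple value: read up to ">"
        return text[pos:end], end + 1

    item, _ = parse(0)
    return item


data = [100, 3.141592, "Hello, World", [24232, 9823.23, 12.782]]
text = serialize(data)
restored = deserialize(text)
```

The nesting survives the round trip, but everything comes back as a string: the brackets say where an item starts and ends, not what it means or what type it has.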

This is also one of the rules of well-formed XML. The above sample is just much simpler in that it uses the same delimiters for all data items and all nesting levels. XML, though, adds information by allowing an arbitrary number of differently named delimiters (tags). Here's the above data serialized into an XML string:

<SerializedData>
  <a>100</a>
  <pi>3.141592</pi>
  <msg>Hello, World</msg>
  <myarray>
    <item>24232</item>
    <item>9823.23</item>
    <item>12.782</item>
  </myarray>
</SerializedData>
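To show how such a string might be produced, here is a minimal sketch using Python's standard xml.etree.ElementTree module (Python is used only for illustration; the function name and parameters are assumptions). Building the document through an XML library instead of concatenating strings has a useful side effect: characters such as < inside a data item are escaped automatically, so the delimiter problem from the first section goes away.

```python
import xml.etree.ElementTree as ET


def serialize_to_xml(a, pi, msg, myarray):
    """Build the <SerializedData> document and return it as a string."""
    root = ET.Element("SerializedData")
    ET.SubElement(root, "a").text = str(a)
    ET.SubElement(root, "pi").text = str(pi)
    ET.SubElement(root, "msg").text = msg
    array_el = ET.SubElement(root, "myarray")
    for value in myarray:
        # Each array cell becomes one <item> element.
        ET.SubElement(array_el, "item").text = str(value)
    return ET.tostring(root, encoding="unicode")


xml_text = serialize_to_xml(100, 3.141592, "Hello, World",
                            [24232, 9823.23, 12.782])
```

The result is the same document as above, only without the line breaks and indentation.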

How do you deserialize the data?

In a nutshell, we have so far followed part of the reasoning behind XML: how to mark up arbitrarily nested data. So why not use XML as the target data format for serialized objects? XML data is a string, and it allows us to store hierarchical information. XML also comes in very handy when you think about deserializing a dehydrated object. Before XML, you had to write your own low-level import routine to extract the data items from, say, a tab-delimited file or your own data format. With XML, you can rely on the XML parser to do that for you. You lean back and pick the data items from the XML DOM tree produced by the parser. Look at this example, which works on the above XML data passed into the routine as a string. (It uses the Microsoft MSXML component of IE5; reading in the myarray data is left out for brevity.)

Sub Deserialize(ByVal data As String)
    ' Let the XML parser build a DOM tree from the serialized string
    Dim xml As New MSXML.DOMDocument
    xml.loadXML data

    ' Walk the children of the root element and pick out each data item
    Dim dataItem As MSXML.IXMLDOMNode
    For Each dataItem In xml.documentElement.childNodes
        Select Case dataItem.nodeName
            Case "a"
                a = dataItem.text
            Case "pi"
                pi = dataItem.text
            ...
        End Select
    Next
End Sub

XML lets you concentrate on the semantics of your data, such as which data item to store where (for example, the <a> element's text goes into variable a). The syntax (where does a data item start in the stream of bytes of serialized data) is taken care of by the XML parser.
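For comparison, the same division of labor can be sketched in Python with the standard xml.etree.ElementTree parser. This is an illustration, not a translation of the MSXML routine: the function name, the dictionary used as the target, and the conversions to int and float are assumptions made for the example. Unlike the routine above, it also reads the myarray data.

```python
import xml.etree.ElementTree as ET


def deserialize(data):
    """Pick the data items out of the tree built by the XML parser."""
    root = ET.fromstring(data)       # syntax: handled by the parser
    result = {}
    for item in root:                # semantics: which item goes where
        if item.tag == "a":
            result["a"] = int(item.text)
        elif item.tag == "pi":
            result["pi"] = float(item.text)
        elif item.tag == "msg":
            result["msg"] = item.text
        elif item.tag == "myarray":
            result["myarray"] = [float(child.text)
                                 for child in item.findall("item")]
    return result


xml_text = """
<SerializedData>
  <a>100</a>
  <pi>3.141592</pi>
  <msg>Hello, World</msg>
  <myarray>
    <item>24232</item>
    <item>9823.23</item>
    <item>12.782</item>
  </myarray>
</SerializedData>
"""
values = deserialize(xml_text.strip())
```

Nowhere does this code search for delimiters or count characters; it only decides what each named element means.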

How do you preserve object relationships?

So far we've seen how to serialize simple data types and common data structures like arrays. Now, if you look back at Figure 1, you see not just one object but several, linked together in a hierarchy. When you want to persist or serialize the root object of this object tree, you certainly mean to serialize the objects linked to it as well. But how do you store object references (pointers to objects in memory) or the linked objects themselves? One way would be to treat them like nested data structures:

<Application>
  <name>My 3D Application</name>
  ...
  <Scenes>
    <Scene>
      <title>Dark Side of the Moon</title>
      <LightSources>
        <LightSource> ... </LightSource>
      </LightSources>
      <Objects>
        <Object type="sphere"> ... </Object>
        <Object type="cylinder"> ... </Object>
      </Objects>
    </Scene>
    <Scene> ... </Scene>
    ...
  </Scenes>
  <Templates>
    <Template> ... </Template>
  </Templates>
</Application>

Each object's data is nested within its parent object's data. Looks nice, works fine -- unless you must deal with multiple references to one object or with circular references (an object that references itself, either directly or indirectly through other objects it points to). If two objects reference the same third one, that third object would have to be included within the serialized representation of both referencing objects. That would destroy the object's identity, which in memory is based not on its content (the data) but on its memory address.
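The loss of identity is easy to reproduce. The following Python sketch is only an illustration; the Node class, the element and attribute names, and the "texture" object shared by two shapes are all invented for the example. It serializes a small object graph by nesting and reads it back.

```python
import xml.etree.ElementTree as ET


class Node:
    """A hypothetical object that holds references to other objects."""
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []


def to_xml(node):
    """Serialize by nesting: each referenced object goes inside its parent."""
    el = ET.Element("Object", name=node.name)
    for child in node.children:
        el.append(to_xml(child))   # a circular reference would never terminate here
    return el


def from_xml(el):
    """Rebuild the objects from the nested elements."""
    return Node(el.get("name"), [from_xml(child) for child in el])


shared = Node("texture")
root = Node("scene", [Node("sphere", [shared]), Node("cylinder", [shared])])

copy = from_xml(ET.fromstring(ET.tostring(to_xml(root))))

same_before = root.children[0].children[0] is root.children[1].children[0]
same_after = copy.children[0].children[0] is copy.children[1].children[0]
```

Before serialization both shapes point to one and the same object; afterwards each has its own copy, so a change made through one reference is no longer visible through the other.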

A much better way is to not nest objects at all. Compare the following XML data to the above:

All objects reside on the same level (right below <Objects>). Every object has been assigned a unique ID, so object references can be serialized as references by ID. This XML data format effectively defines its own address space, in which each ID is an address and an object is the smallest addressable memory unit.
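A flat, ID-based format along these lines could be sketched as follows. This is a Python illustration under stated assumptions: the Node class, the <Ref> element, and the id, name, and to attributes are made up for the example and are not necessarily the markup used in this article. Serialization assigns each object an ID the first time it is seen; deserialization first creates all objects and then resolves the IDs back into references.

```python
import xml.etree.ElementTree as ET


class Node:
    """A hypothetical object that may reference any other object."""
    def __init__(self, name):
        self.name = name
        self.refs = []


def serialize(root):
    ids = {}        # id(obj) -> ID string assigned during the walk
    order = []      # objects in the order they were first seen
    stack = [root]
    while stack:
        obj = stack.pop()
        if id(obj) in ids:
            continue                       # already has an ID: shared or circular
        ids[id(obj)] = "o" + str(len(ids) + 1)
        order.append(obj)
        stack.extend(reversed(obj.refs))
    container = ET.Element("Objects")
    for obj in order:                      # every object sits directly below <Objects>
        el = ET.SubElement(container, "Object", id=ids[id(obj)], name=obj.name)
        for target in obj.refs:
            ET.SubElement(el, "Ref", to=ids[id(target)])
    return ET.tostring(container, encoding="unicode")


def deserialize(text):
    container = ET.fromstring(text)
    # First pass: create every object so that all IDs can be resolved.
    by_id = {el.get("id"): Node(el.get("name")) for el in container}
    # Second pass: turn the ID references back into object references.
    for el in container:
        by_id[el.get("id")].refs = [by_id[ref.get("to")]
                                    for ref in el.findall("Ref")]
    return by_id[container[0].get("id")]   # the root was written first


scene, sphere, cylinder, shared = (Node(n) for n in
                                   ("scene", "sphere", "cylinder", "texture"))
scene.refs = [sphere, cylinder]
sphere.refs = [shared]
cylinder.refs = [shared]
shared.refs = [scene]                      # circular reference back to the root

copy = deserialize(serialize(scene))
```

In the restored graph the two shapes share one object again, and the circular reference leads back to the restored root instead of sending the serializer into an endless loop.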

The object relationships might not be as obvious as when object data was nested, but it is a more general way of serializing arbitrary networks of objects.