XML & Object Persistence: Roll Your Own

September 8, 1999

XML Object Persistence: Roll Your Own Generic XML Data Format for Object Serialization

Contents

• Part 1: Using XML for Object Persistence
• Part 2: Serialization Problems
• Part 3: Roll Your Own Generic XML Data Format

Now, after the somewhat theoretical discussions above, let's see how the concepts can be implemented to benefit one of your next projects. The following code demonstrates how object serialization can be done. It is not meant to solve all problems associated with object persistence, nor to replace any of the several supported technologies and efforts. But I hope you'll see how rewarding using XML can be.

What Do We Want to Accomplish?

We want to have an easy means of serializing and deserializing objects. The serialization format should be an XML string, which can easily be stored in a file, in an RDBMS table field, sent over a network or passed as a procedure parameter (also called marshaling the object). As application programmers we don't want to be bothered with lots of additional code to accomplish this. So here's our vision of how easy serializing a network of objects should be for a user:

    Dim myObject As New MyClass, serializedObject As String
    ...
    serializedObject = Serialize(myObject, False)
    ...
    Dim myObject2 As MyClass
    Set myObject2 = Deserialize(serializedObject)

As users of the object we don't want to care about how complex its internals are or whether it is linked to other objects. We just want to serialize it and be sure all information is preserved. That means:

• all object properties must be serialized,
• all object properties must retain their types, and
• any persistent objects referenced by the object must also be serialized. (Non-persistent objects, such as transient objects, need not be serialized.)

These are reasonable demands, and as we will see, they can be fulfilled by the ObjectStore component we are going to build. ("ObjectStore" seemed a good name for the service, but please don't confuse it with the like-named ODBMS product from Object Design, Inc.) We've already seen the only two methods of our ObjectStore object: Serialize and Deserialize.

A note for VB developers: By declaring the ObjectStore VB-class in our sample project as GlobalMultiUse, an instance of it always gets created automatically when running an application referencing it. So we don't need to create an instance ourselves and also do not need an object variable as a qualifier for the methods.

How Does a Serialized Object Look?

The goal is set, we know how we want to use the serialization mechanism, and we know what it is supposed to accomplish. But how should the serialized data look? Here's an example of what the ObjectStore implementation's XML serialization format looks like:

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <ObjectStore version="1.0" xmlns="x-schema:xop-schema.xml">
        <Object id="id1234" classname="Project1.Class1">
            <string name="stringVar" xml:space="preserve">John Doe</string>
            <int name="intVar">32</int>
            <float name="floatVar">3.141592</float>
            <boolean name="booleanVar">1</boolean>
            <dateTime name="dateTimeVar">1999-08-16T12:45:17</dateTime>
            <objectRef name="objectRefVar">id9876</objectRef>
            <struct name="udtVar">
                <string name="stringVar">Henry Doe</string>
                <int name="intVar">65</int>
            </struct>
            <array name="arrayVar" lowerBound="0" upperBound="2">
                <int name="0">10</int>
                <int name="1">9</int>
                <int name="2">8</int>
            </array>
        </Object>
        <Object id="id9876" classname="Project1.Class1">
            ...
        </Object>
    </ObjectStore>

Let's have a closer look at the structure of a serialized object.

<ObjectStore> as a container of objects

The document element <ObjectStore> serves as a container for all the objects that get serialized during a call to the Serialize method. As we found out above, nesting objects into one another is not really the way to go, if you want to be able to serialize arbitrary networks of objects. But that's exactly what the ObjectStore is supposed to do: be independent of any particular object hierarchy. It should be able to serialize any ever so intricately intertwined mesh of objects. Therefore it sets up the <ObjectStore> element as a little "universe" of objects which therein exist in parallel, on the same level.

The version attribute gives us a way to recognize possibly older versions of serialized data and treat them differently from more current versions. It's an attribute to be interpreted by the ObjectStore component.

The xmlns attribute attaches an XML schema definition to the data so it can be validated. We'll discuss that further below.

<Object> as a container of an object's data

Each serialized object gets stored in its own element. To allow an object to be referenced by another one, the <Object> element has an id attribute which identifies it uniquely within the set of all objects in the ObjectStore's object universe. It is a kind of address for the object, though not in memory.

The classname attribute is necessary to enable the ObjectStore to create a new, empty instance for an object's data during deserialization. Imagine the serialized data being sent over a network to another computer. How should the ObjectStore on that machine determine what kind of object the serialized data belonged to? How could it create a new object of the same kind to be filled with the data? Without some hint as to how to create a new instance of the original object this would be impossible. The classname is that hint. It can be a Java class name or a COM progID. The only requirement is that it must be understood by the ObjectStore trying to deserialize an object's data. More about this below.

Simple type properties

Each object property of a simple type (integer, floating point number, string) is serialized into an element according to its type. Data type names are used as tag names, not the names of the object's properties. But why would we do that? There are two reasons:

• the ability to validate ObjectStore XML documents, and
• the preservation of property data types.

Again, more about that later.

Instead of

<myproperty type="integer">1234</myproperty>

we write

<int name="myproperty">1234</int>

It's equally easy to understand and follows other initiatives (such as SODL, WDDX). By attaching a name to the element, though, our little XML language is more compact than, for example, WDDX. In WDDX we'd have to write:

    <var name="myproperty">
        <number>1234</number>
    </var>

See the complete schema definition (as well as the above XML sample) for a list of all simple data types supported by our ObjectStore.
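The idea of using the data type as the tag name can be sketched in a few lines. The following is an illustration in Python rather than the article's VB, using only the standard library; the helper names xop_tag and simple_element are hypothetical, and writing False as "0" is an assumption (the sample above only shows a true value stored as 1).

```python
import xml.etree.ElementTree as ET


def xop_tag(value):
    """Map a Python value to an XOP tag name (assumed mapping for this sketch)."""
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "int"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"unsupported type: {type(value).__name__}")


def simple_element(name, value):
    """Build <type name="...">text</type> for one simple-type property."""
    elem = ET.Element(xop_tag(value), {"name": name})
    if isinstance(value, bool):
        elem.text = "1" if value else "0"  # assumption: False is written as 0
    else:
        elem.text = str(value)
    return elem


print(ET.tostring(simple_element("myproperty", 1234), encoding="unicode"))
# prints: <int name="myproperty">1234</int>
```

Because the set of tag names is fixed by the type mapping, a schema can list them all in advance, whatever the property names turn out to be.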

Data structures

For the purpose of this article only two data structures can be serialized with the ObjectStore component in the sample project: user defined types (UDT, like a C struct) and arrays.

A <struct> element is simply a nested container of properties (like the <Object> element is). It has a name of its own and each of its members is also named. That way addressing properties within a <struct> can be as straightforward as accessing members of a UDT:

    Type myStruct
        name As String
        age As Integer
    End Type

    Dim person As myStruct
    person.name = "John"
    person.age = 46

The <struct> element for the UDT would be:

    <struct name="myStruct">
        <string name="name" xml:space="preserve">John</string>
        <int name="age">46</int>
    </struct>

And serializing the person UDT-variable would look like this:

    Store "person.name", person.name
    Store "person.age", person.age
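How a dotted property name could be turned into nested <struct> containers is sketched below in Python (not the article's VB). The store function and its parameters are hypothetical, and the sketch assumes each name segment before the last period names a <struct>, as with udtVar in the sample XML above.

```python
import xml.etree.ElementTree as ET


def store(parent, dotted_name, tag, text):
    """Store one value under parent, creating a <struct> container for
    every name segment that precedes the last period."""
    *struct_names, member = dotted_name.split(".")
    node = parent
    for struct_name in struct_names:
        # Reuse an existing <struct> with this name, otherwise create it.
        found = None
        for child in node.findall("struct"):
            if child.get("name") == struct_name:
                found = child
                break
        if found is None:
            found = ET.SubElement(node, "struct", {"name": struct_name})
        node = found
    leaf = ET.SubElement(node, tag, {"name": member})
    leaf.text = text
    return leaf


obj = ET.Element("Object")
store(obj, "person.name", "string", "John")
store(obj, "person.age", "int", "46")
print(ET.tostring(obj, encoding="unicode"))
```

Both calls end up in the same <struct> element because the second call finds the container created by the first.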

Whereas <struct> elements contain named members (mostly of different types), <array> elements contain unnamed, but indexed members (mostly of the same type). The array

    Dim myarray(5 To 6) As Integer
    myarray(5) = 1234
    myarray(6) = 9876

would be serialized into

    <array name="myarray" lowerBound="5" upperBound="6">
        <int name="5">1234</int>
        <int name="6">9876</int>
    </array>

Arrays in our sample implementation can only be one-dimensional. Like <struct> elements, however, they can be nested:

    <struct...>
        ...
        <struct...>
            ...
            <struct...>
            </struct>
            ...
        </struct>
        ...
    </struct>

    <array...>
        ...
        <array...>
            ...
            <array...>
            </array>
            ...
        </array>
        ...
    </array>

The ObjectStore component recognizes the type of each array element and nests the elements if necessary.
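As a rough illustration of that recursion, here is a Python sketch (not the article's VB) that writes a list as an <array> element and nests a further <array> whenever a member is itself a list. The function array_element is hypothetical, it handles only integers and nested lists, and it assumes nested arrays start at lower bound 0.

```python
import xml.etree.ElementTree as ET


def array_element(name, values, lower_bound=0):
    """Build an <array> element; members are named by their index."""
    upper_bound = lower_bound + len(values) - 1
    arr = ET.Element("array", {
        "name": name,
        "lowerBound": str(lower_bound),
        "upperBound": str(upper_bound),
    })
    for index, value in enumerate(values, start=lower_bound):
        if isinstance(value, list):
            # A member that is itself an array becomes a nested <array>.
            arr.append(array_element(str(index), value))
        elif isinstance(value, int) and not isinstance(value, bool):
            item = ET.SubElement(arr, "int", {"name": str(index)})
            item.text = str(value)
        else:
            raise TypeError(f"unsupported member type: {type(value).__name__}")
    return arr


arr = array_element("myarray", [1234, 9876], lower_bound=5)
print(ET.tostring(arr, encoding="unicode"))

nested = array_element("grid", [[1, 2], [3]])
print(ET.tostring(nested, encoding="unicode"))
```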

In addition to these data structures, other, more complex ones could also be implemented so that they are serialized automatically. For example, ADO 2.5 Recordset objects could be streamed into an XML representation and then inserted into the ObjectStore XML data tree. WDDX, for one, already sports a special <Recordset> element.

Object references

Serializing object references is as easy as serializing simple type properties.

    <Object id="id123"...>
        ...
        <objectRef name="child">id987</objectRef>
    </Object>

    <Object id="id987"...>
        ...
    </Object>

The value of an <objectRef> element is the id of an <Object> element in the <ObjectStore>. This way unidirectional, bidirectional (circular) and self-referential object relationships can be serialized. Each object gets serialized only once. Thus object identity is preserved.

Note: No special XML linking technology is used, because it would not provide any advantage. All object references point to locations within the same XML document and need only be interpreted by a dedicated processor (the ObjectStore component). See below for how it's done.
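Since references are plain ids within one document, a processor can resolve them with a simple lookup. The Python sketch below (an illustration, not part of the sample project) checks that every <objectRef> points at an existing <Object>. The function dangling_refs and the "parent" back reference are made up for the example, and the xmlns attribute is left out to keep the element names short.

```python
import xml.etree.ElementTree as ET

# Two objects referencing each other (a circular relationship).
XOP = """
<ObjectStore version="1.0">
  <Object id="id123" classname="Project1.Class1">
    <objectRef name="child">id987</objectRef>
  </Object>
  <Object id="id987" classname="Project1.Class1">
    <objectRef name="parent">id123</objectRef>
  </Object>
</ObjectStore>
"""


def dangling_refs(root):
    """Return (object id, property name, target id) for every reference
    whose target id is not present in the document."""
    known_ids = {obj.get("id") for obj in root.findall("Object")}
    missing = []
    for obj in root.findall("Object"):
        # iter() also finds references nested inside <struct> or <array>.
        for ref in obj.iter("objectRef"):
            if ref.text not in known_ids:
                missing.append((obj.get("id"), ref.get("name"), ref.text))
    return missing


root = ET.fromstring(XOP)
print(dangling_refs(root))  # prints: []
```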

How Do We Ensure the Formal Correctness of a Serialized Object?

The task of the ObjectStore component is to interpret ("semantically") the XML DOM that the XML processor generates from the serialization data. To let it focus on this task, we should offload as much of the lower-level work from it as possible. By using a validating XML parser and a schema definition we can assure the ObjectStore that all XML data it receives is well-formed and valid XML representing serialized objects.

Both a DTD and an XML schema (see http://msdn.microsoft.com/xml/reference/schema/start.asp for details on the XML schema preview implementation in the MSXML component) define which elements and attributes constitute a particular XML language. XML itself and its DTDs, though, have no concept of data types. The schema languages under development do (see http://www.w3.org/XML/Activity.html for an overview). The XML schema preview implemented by Microsoft's MSXML component likewise lets you declare elements and attributes as being of a particular type. In addition, XML schemas are written in XML, whereas DTDs have a completely different syntax.

The ObjectStore XML language for object serialization - let's call it the XML Object Serialization (XOP) data format - is defined using XML Schema because of its advantages over DTDs. Let's look at an excerpt:

In the XML Schema preview you define an XML element for your XML language using <ElementType>. Then you state where the element is allowed to appear, that is, with which other elements, by using <element>. The same goes for attributes.

So by defining our elements and putting a couple of constraints on them, such as

content="eltOnly" model="closed" order="many"

we have clear rules on how an XML string "pretending" to contain serialized objects should look. By including a reference to the XML schema in every XOP string, the MSXML parser can check the data before the ObjectStore actually uses it. The reference to the schema has to be included as an attribute in the document element:

<ObjectStore version="1.0" xmlns="x-schema:xop-schema.xml">

By working only with validated XML data the ObjectStore component can make certain assumptions and needs less error handling code. It's more focused on the task at hand: deserializing objects from an XOP string.

In addition to that, the XML schema preview allows the definition of a type for element content and attribute values. For example:

<ElementType name="int" content="textOnly" dt:type="int">

The data type dt:type="int" limits the text contained in an <int> element to strings representing a 32-bit integer number. The element

<int name="a">123</int>

would be valid according to the above definition, however the element

<int name="b">hello123</int>

would not. Even though it is well-formed XML and conforms to the syntactical rules of XOP (structure of elements, hierarchy of elements) it violates the semantics of the type definition. The element's text cannot be converted to an integer value.

By including type information in the schema we assure the ObjectStore component that the data contained in the elements representing the object properties is correct according to the properties' data types.

Since element data types have to be declared in the schema, we have to know in advance which elements to expect in an XOP string. That's the reason why we are using a limited number of tag names (the data type names) instead of an unlimited number (property names).
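To make the type check concrete, here is a small Python sketch of the kind of test a validating parser performs for <int> elements. It does not use MSXML or any schema processor; is_valid_int_element is a hypothetical helper, and the accepted lexical form (optional sign followed by decimal digits) is an assumption.

```python
import re
import xml.etree.ElementTree as ET

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def is_valid_int_element(elem):
    """Return True if elem is an <int> whose text is a 32-bit integer."""
    if elem.tag != "int":
        return False
    text = (elem.text or "").strip()
    # Assumed lexical form: optional sign, then ASCII decimal digits only.
    if not re.fullmatch(r"[+-]?[0-9]+", text):
        return False
    return INT32_MIN <= int(text) <= INT32_MAX


good = ET.fromstring('<int name="a">123</int>')
bad = ET.fromstring('<int name="b">hello123</int>')
print(is_valid_int_element(good), is_valid_int_element(bad))  # prints: True False
```

Both elements are well-formed XML, which is why the parser's type check, not its syntax check, is what rejects the second one.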

The Plumbing: Serializing an Object

Now that we know what we want and how the serialized object data is supposed to look, let's see how we can implement the ObjectStore component.

The sample project consists of a test project (project1.vbp) defining a persistent class (Class1) and the persistence manager (persistencemanager.vbp, a COM in-proc server) with the ObjectStore class.

Setting the stage

In order to use the ObjectStore services a VB-project has to do two things:

• include a reference to the persistence manager COM-component, and
• implement the IXOP interface (class ixop.cls) in every class to be made serializable.

The IXOP interface is small:

    Public Sub LoadProperties(ByVal op As ObjectProperties)
    Public Sub StoreProperties(ByVal op As ObjectProperties)
    Public Property Get Classname() As String

Implementing Serialization in a persistent object

To set a class up for serialization, two members of IXOP have to be implemented: StoreProperties and Classname. Here's an excerpt from Class1, which has properties of every type supported by the XOP XML schema:

    Implements IXOP

    Public Type UDT
        stringVar As String
        intVar As Long
    End Type

    Public stringVar As String
    ...
    Public objectrefVar As Class1
    Public arrayVar As Variant
    Private udtVar As UDT

    Private Property Get IXOP_Classname() As String
        IXOP_Classname = "Project1.Class1"
    End Property

    Private Sub IXOP_StoreProperties(ByVal op As PersistenceManager.ObjectProperties)
        op.Store "stringVar", stringVar
        ...
        op.Store "objectRefVar", objectrefVar
        op.Store "udtVar.stringVar", udtVar.stringVar
        op.Store "udtVar.intVar", udtVar.intVar
        op.Store "arrayVar", arrayVar
    End Sub

IXOP_Classname just returns the progID of the class to be included in the <Object> element. Later we'll see how it is used to create new objects during deserialization.

The op.Store calls above show how different types of properties are serialized. Since the object itself knows best which information to make persistent, we give it the responsibility to pass whatever data items it deems worthy to the persistence manager. For that purpose the ObjectStore hands the persistent class object an ObjectProperties object. (Those of you familiar with VB6 may recognize the similarity of this approach to using class persistence and the PropertyBag object. However, the XML approach we are following here is more flexible in several ways: It's platform independent, it's language independent, and it does not rely on CLSIDs for deserialization.)

For the purpose of serialization, the ObjectProperties object offers the Store method. It accepts as parameters the name of the property to be serialized and its value. Simple type values, one-dimensional arrays, and object references (to objects also implementing IXOP) can be passed to it. UDTs, however, have to be passed member by member, since the ObjectProperties object has no way of "looking into" a UDT variable and "seeing" which members it has.

How serialization works

This is how we wanted to serialize an object:

    Dim c As New Class1
    c.stringVar = "Hello, World!"
    Debug.Print Serialize(c, False)

This calls the Serialize method of the ObjectStore component. But how does it call the IXOP_StoreProperties method of our object and serialize everything into an XOP string? First, look at a simplified procedure call stack:

    ObjectStore.Serialize(myObject)
        ObjectStore.SerializeObject(myObject)
            myObject.IXOP_StoreProperties(objectProperties)
                objectProperties.Store(propertyName, propertyValue)
                    objectProperties.StoreProperty(propertyName, propertyValue)
                        [objectProperties.Store??? | ObjectStore.SerializeObject(propertyValue)]

Serialize calls the IXOP object and passes it an ObjectProperties object, which then gets called for every property from within IXOP_StoreProperties. This leads to further method calls within the ObjectProperties object and possibly to a recursive call to the ObjectStore. But let's have a closer look at how everything starts. Here's an excerpt from SerializeObject, a method that gets called from Serialize and does most of the work.

First the object to be serialized is added to the object cache. The cache contains all objects so far serialized and of course we check it first, before we serialize an object. If it's in there already, we don't do anything. This is to preserve object identity: every object gets serialized only once.

    m_objectCache.Add rootObject, id

Then we create the <Object> element for the object. It's later appended to the list of objects in <ObjectStore> represented by m_xop:

    Dim objNode As MSXML.IXMLDOMNode, a As MSXML.IXMLDOMAttribute
    Set objNode = m_xop.createNode(NODE_ELEMENT, _
                                   "Object", m_xop.documentElement.namespaceURI)

Note how the XML schema namespace is used during creation of the new element. Even though the XML schema namespace was set up as the default namespace of the <ObjectStore> element in m_xop (see Serialize in the sample code for details), later validation of XOP strings was only successful when the namespace was used whenever creating a new element (this is a glitch in the Microsoft MSXML component).

Then we add the id of the object (generated from the object's address) to the <Object> element as an attribute, as well as the classname:

    Set a = m_xop.createAttribute("id")
    a.nodeValue = "id" & ObjPtr(rootObject)
    objNode.Attributes.setNamedItem a

    Set a = m_xop.createAttribute("classname")
    a.nodeValue = rootObject.Classname
    objNode.Attributes.setNamedItem a

    m_xop.documentElement.appendChild objNode

After that we are ready to serialize the properties into the newly created <Object> element by calling IXOP_StoreProperties of the persistent object:

    Dim op As ObjectProperties
    Set op = New ObjectProperties
    op.Constructor Me, objNode
    rootObject.StoreProperties op
    Set op = Nothing

The workhorse method in the ObjectProperties class during serialization is StoreProperty. It gets indirectly called by the object to be serialized from within the IXOP interface. Here's an excerpt from the procedure:

    Private Sub StoreProperty(ByVal parentNode As MSXML.IXMLDOMNode, _
                              ByVal propertyName As String, ByVal value As Variant)
        If InStr(propertyName, ".") = 0 Then
            Dim p As MSXML.IXMLDOMNode, nodeName As String, _
                a As MSXML.IXMLDOMAttribute

            Select Case TypeName(value)
                Case "String"
                    nodeName = "string": GoSub StoreSimpleTypeValue
                    Set a = parentNode.ownerDocument.createAttribute("xml:space")
                    a.nodeValue = "preserve"
                    p.Attributes.setNamedItem a
                Case "Byte", "Integer", "Long"
                    nodeName = "int": GoSub StoreSimpleTypeValue
                ...
                Case Else
                    If IsObject(value) Then
                        value = m_os.SerializeObject(value)
                        nodeName = "objectRef": GoSub StoreSimpleTypeValue
                    ...
        Exit Sub

    StoreSimpleTypeValue:
        Set p = parentNode.ownerDocument.createNode(NODE_ELEMENT, nodeName, _
                                                    parentNode.namespaceURI)
        Set a = parentNode.ownerDocument.createAttribute("name")
        a.nodeValue = propertyName
        p.Attributes.setNamedItem a
        p.appendChild parentNode.ownerDocument.createTextNode(value)
        parentNode.appendChild p
        Return
    End Sub

The value passed to the procedure is handled according to its type (TypeName(value)). Simple types are simply stored in an element named according to their type (StoreSimpleTypeValue), for example, <int>. More complex types (arrays, UDTs) are passed to special functions and then stored in <array> or <struct> elements.

Please note how SerializeObject is called recursively to serialize an object passed as a property value. It returns the id generated for the object, which then gets serialized like any string data item. The object itself is stored somewhere else in the XOP XML DOM -- or maybe it had already been serialized, so just a reference to it was retrieved.
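The interplay of object cache and recursion can be condensed into a short Python sketch. It mirrors the roles of ObjectStore, ObjectProperties and IXOP_StoreProperties but is not a translation of the sample project: the Node class, the method names, and the decision to skip unset references are assumptions, and Python's id() stands in for ObjPtr.

```python
import xml.etree.ElementTree as ET


class Node:
    """Hypothetical persistent class: one string, one object reference."""
    def __init__(self, label):
        self.label = label
        self.partner = None

    def store_properties(self, props):
        props.store("label", self.label)
        props.store("partner", self.partner)


class Properties:
    """Plays the role of ObjectProperties for one <Object> element."""
    def __init__(self, object_store, obj_elem):
        self._store = object_store
        self._elem = obj_elem

    def store(self, name, value):
        if value is None:
            return  # sketch choice: unset references are not written
        if isinstance(value, str):
            tag, text = "string", value
        else:
            # Recursive call: returns the id of the referenced object.
            tag, text = "objectRef", self._store.serialize_object(value)
        ET.SubElement(self._elem, tag, {"name": name}).text = text


class ObjectStore:
    def serialize(self, root_obj):
        self._root = ET.Element("ObjectStore", {"version": "1.0"})
        self._cache = {}  # maps id(obj) to the generated XOP id
        self.serialize_object(root_obj)
        return ET.tostring(self._root, encoding="unicode")

    def serialize_object(self, obj):
        key = id(obj)
        if key in self._cache:
            return self._cache[key]  # already serialized: reference only
        xop_id = f"id{key}"
        self._cache[key] = xop_id  # register first so cycles terminate
        elem = ET.SubElement(self._root, "Object",
                             {"id": xop_id, "classname": type(obj).__name__})
        obj.store_properties(Properties(self, elem))
        return xop_id


a, b = Node("a"), Node("b")
a.partner, b.partner = b, a  # circular relationship
xml_text = ObjectStore().serialize(a)
print(xml_text)
```

Registering the object in the cache before its properties are stored is what keeps the circular reference from recursing forever; each object ends up in the document exactly once.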

The Plumbing: Deserializing an Object

Once you understand serialization, deserialization is usually just doing the same thing, but the other way around. Let's follow the process step by step starting with the call of Deserialize.

First we create an XML parser object and load the XOP string, thereby validating it against the XOP XML schema. The XOP data is either passed directly in the xopSource parameter or it is contained in a file, whose filename is in xopSource. The method checks how to load the data by looking at the first character in the xopSource string. If it's a "<" it assumes xopSource contains the XOP data, otherwise it assumes xopSource is a filename.

    Public Function Deserialize(ByVal xopSource As String) As Object
        Set m_xop = New MSXML.DOMDocument
        m_xop.async = False
        If Left$(xopSource, 1) = "<" Then
            m_xop.loadXML xopSource
        Else
            m_xop.Load xopSource
        End If

If the XOP data is valid and the XML DOM has been built by the MSXML XML parser, we initialize the object cache. The cache will hold references to all deserialized objects. Whenever an object is referenced by an <objectRef> element, it is first looked up in the cache. Only if it is not yet present will the ObjectStore locate it in the XOP XML DOM and deserialize it.

The deserialization starts with the first <Object> element in the XOP data. The XML schema guarantees there is one by defining a minOccurs="1" for it within <ObjectStore>.

        If m_xop.parseError.errorCode = 0 Then
            Set m_objectCache = New Collection
            Set Deserialize = DeserializeObject( _
                m_xop.documentElement.selectSingleNode("Object").Attributes.getNamedItem("id").nodeValue)
            Set m_objectCache = Nothing
        Else
            ...

Please note how the MSXML method selectSingleNode (this method is not XML DOM spec compliant) is used to get at the first <Object> element. The same could of course have been accomplished using only methods that conform to the XML DOM specification - it would just have taken a few more lines of code.

Creating empty object instances

Deserializing an <Object> element is straightforward:

    Friend Function DeserializeObject(ByVal objId As String) As Object
        Dim obj As IXOP, objNode As MSXML.IXMLDOMNode

        On Error Resume Next
        Set obj = m_objectCache(objId)
        If Err <> 0 Then
            On Error GoTo 0
            Set objNode = m_xop.documentElement.selectSingleNode("Object[@id='" & objId & "']")
            Set obj = CreateObjectFromClassname(objNode.Attributes.getNamedItem("classname").nodeValue)
            m_objectCache.Add obj, objId

            Dim op As ObjectProperties
            Set op = New ObjectProperties
            op.Constructor Me, objNode
            obj.LoadProperties op
            Set op = Nothing
        End If

        Set DeserializeObject = obj
    End Function

First check the cache if the object has been deserialized already and if so, return the reference to it. If not, locate its <Object> element, create an empty instance and fill the instance by calling its IXOP LoadProperties method with an ObjectProperties object full of data.

The exciting part here is how an empty object instance is created from the classname attribute.

    Private Function CreateObjectFromClassname(ByVal Classname As String) As IXOP
        Dim obj As IXOP

        RaiseEvent RequestObject(Classname, obj)
        If obj Is Nothing Then
            On Error Resume Next
            Set obj = CreateObject(Classname)
        End If

        Set CreateObjectFromClassname = obj
    End Function

The ObjectStore component uses two approaches. First it tries to ask the application for an instance. It does so by firing the event RequestObject. If the application has registered itself as an event sink for this event, it creates an object matching the class name however it likes:

    Private Sub myObjectStore_RequestObject(ByVal classname As String, _
                                            newObject As PersistenceManager.IXOP)
        Select Case classname
            Case "Project1.Class1"
                Set newObject = New Class1
        End Select
    End Sub

If, on the other hand, the event raised returns with no object assigned to obj, the ObjectStore tries to create an instance using a standard COM technique: calling CreateObject. For this the classname must be a COM progID of a class registered on the machine where the ObjectStore is running.

Interpretation of the classname attribute thus rests on two shoulders: the application and the ObjectStore. It is not dependent on any programming language or platform.
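The same two-stage lookup carries over to other environments. In the Python sketch below (an illustration, not part of the sample project), a callback takes the place of the RequestObject event and a dictionary of constructors stands in for CreateObject; both parameters and the handler app_handler are hypothetical.

```python
class Class1:
    """Hypothetical persistent class."""


def create_object_from_classname(classname, request_object=None, registry=None):
    """Ask the application first, then fall back to a registry lookup.

    Returns None if neither stage can supply an instance.
    """
    obj = request_object(classname) if request_object else None
    if obj is None and registry and classname in registry:
        # Stand-in for CreateObject: look up a constructor by class name.
        obj = registry[classname]()
    return obj


def app_handler(classname):
    """Application-side handler: creates only the classes it knows."""
    if classname == "Project1.Class1":
        return Class1()
    return None


print(type(create_object_from_classname("Project1.Class1", app_handler)).__name__)
# prints: Class1
```

Because the application is asked first, it can map a class name from another platform onto one of its own classes without anything being registered on the machine.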

Deserializing the data

The workhorse during deserialization in the ObjectProperties class is LoadProperty. Here's an excerpt:

    Private Sub LoadProperty(ByVal parentNode As MSXML.IXMLDOMNode, _
                             ByVal propertyName As String, _
                             ByRef propertyValue As Variant)
        If InStr(propertyName, ".") = 0 Then
            Dim pNode As MSXML.IXMLDOMNode
            Set pNode = parentNode.selectSingleNode("*[@name='" & propertyName & "']")

            Select Case pNode.nodeName
                ...
                Case "objectRef"
                    Set propertyValue = m_os.DeserializeObject(pNode.Text)
                Case Else
                    propertyValue = pNode.nodeTypedValue
                ...
            End Select

If a property name contains a period (.), the part before the period is interpreted as a <struct> name and LoadStructMember is called (not shown here). Otherwise it's a "regular" property whose element is located in the childNodes collection of the parentNode. selectSingleNode is used to find the element.

The Select Case statement determines how the element is to be deserialized. For simple type data items the value of the element is assigned to the ByRef variable used as a return value. Since we are using XML schema data types, nodeTypedValue contains the already properly converted data value, not a string (that could be found in the nodeValue or text properties). No conversion on our side is necessary; the parser has done that already.

Deserializing object references is almost like deserializing simple type values: We just call DeserializeObject recursively and it takes care of everything. The value of the property element is the id of the object we are linking to and it is passed to the method.
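Put together, the deserialization side can be sketched in Python as follows. As before this is an illustration rather than the sample project's code: the Node class, the factories dictionary (standing in for CreateObjectFromClassname) and the method names are assumptions, only string and objectRef properties are handled, and the xmlns attribute and schema validation are left out.

```python
import xml.etree.ElementTree as ET

XOP = """
<ObjectStore version="1.0">
  <Object id="id1" classname="Node">
    <string name="label">a</string>
    <objectRef name="partner">id2</objectRef>
  </Object>
  <Object id="id2" classname="Node">
    <string name="label">b</string>
    <objectRef name="partner">id1</objectRef>
  </Object>
</ObjectStore>
"""


class Node:
    """Hypothetical persistent class."""
    def __init__(self):
        self.label = None
        self.partner = None

    def load_properties(self, props):
        self.label = props.load("label")
        self.partner = props.load("partner")


class Properties:
    def __init__(self, object_store, obj_elem):
        self._store = object_store
        self._elem = obj_elem

    def load(self, name):
        for child in self._elem:
            if child.get("name") == name:
                if child.tag == "objectRef":
                    # The element text is the id of the referenced object.
                    return self._store.deserialize_object(child.text)
                return child.text
        return None


class ObjectStore:
    def __init__(self, factories):
        self._factories = factories  # class name -> constructor

    def deserialize(self, xop_text):
        self._root = ET.fromstring(xop_text)
        self._cache = {}  # maps XOP id to the object created for it
        first = self._root.find("Object")
        return self.deserialize_object(first.get("id"))

    def deserialize_object(self, obj_id):
        if obj_id in self._cache:
            return self._cache[obj_id]
        elem = next(o for o in self._root.findall("Object")
                    if o.get("id") == obj_id)
        obj = self._factories[elem.get("classname")]()
        self._cache[obj_id] = obj  # cache before loading so cycles resolve
        obj.load_properties(Properties(self, elem))
        return obj


a = ObjectStore({"Node": Node}).deserialize(XOP)
print(a.label, a.partner.label, a.partner.partner is a)  # prints: a b True
```

The last value printed shows that object identity survives the round trip: following the reference from a to its partner and back leads to the very same object, not a copy.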

Conclusion

As you can see, serializing objects is no black art. It's fairly simple to implement if you are careful -- even with a language like Visual Basic. And using XML as the basis for serialized data is an ideal choice for several reasons:

• It's easy to "read" and understand.
• It's platform and language independent.
• It can store intricate networks of objects.
• It can be validated.
• It has a text format that can be passed over a wire or stored in a file.
• Import tools (XML parsers) are readily available and take a lot of the burden off the deserialization process.

So regardless of whether you are working on a document-centric application, a database program or some other application, don't think twice when you need to persist objects. Use XML and some kind of ObjectStore component. Please don't get me wrong: Of course, the concepts presented herein are not supposed to replace RDBMS or ODBMS. Only real database products can handle large amounts of data and provide sophisticated functionality like transaction processing.

But there are so many cases where you just need to store objects in a file or pass them as parameters to some method running on another machine -- maybe on the other side of the Internet. This is where XML based object persistence shines.

If you have any comments on this article, liked it, disliked it, want to share some thoughts: let me know.