Groves Explained

April 19, 2000

For those trees will no longer be horrid...
-- H. P. Lovecraft

An Introduction to Groves

Despite valiant efforts from certain quarters of the XML world to promote the idea of groves, they are still perceived in the mainstream as being a complicated, obscure concept. This article tries to clear up some of the mystery surrounding groves by providing a clear exposition of their underlying concepts. The main aim of this article is to show what groves are, though it includes a brief argument as to why we should care about them.

Groves are sets of nodes and properties that represent some logical resource -- say an MP3 song or a plain text file. The possible groves that may result from the parsing of a resource are determined by a set of rules called a property set.

Overview

The process of creating a grove starts with a notation processor. A notation processor is a piece of software that can read (and make sense of) resources in a given source notation (say, MP3). A notation processor reads data in the source notation and creates a grove to represent it.

The type of output a notation processor can offer is determined by a property set. A property set defines the classes and properties that will be used to represent the original data (as a grove).

A notation processor reads the source notation, identifies the classes and properties specified in the property set, and produces a graph of nodes, called a grove. This process is called the grove construction process.

Additionally, we may not want all possible nodes returned. In order to say which nodes we want to include in the output (and which to ignore) we define a grove plan. You can think of a grove plan as a filter for the grove output of a given processor.
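The whole pipeline can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the "name=value; name=value" source notation, the Node class, and the build_grove and apply_plan functions are all hypothetical and belong to no grove standard or library. The grove plan is modelled as a simple filter, as described above.

```python
from dataclasses import dataclass, field

# Hypothetical property set: class name -> property names it defines.
PROPERTY_SET = {"cdinfo": ["artist", "title"]}


@dataclass
class Node:
    class_name: str
    properties: dict = field(default_factory=dict)


def build_grove(source, property_set, class_name="cdinfo"):
    """Toy notation processor: one node per non-empty 'name=value; ...' line."""
    allowed = property_set[class_name]
    grove = []
    for line in source.splitlines():
        if not line.strip():
            continue
        node = Node(class_name)
        for pair in line.split(";"):
            name, _, value = pair.partition("=")
            # Only properties named in the property set reach the grove.
            if name.strip() in allowed:
                node.properties[name.strip()] = value.strip()
        grove.append(node)
    return grove


def apply_plan(grove, plan):
    """Toy grove plan: keep only the classes and properties the plan lists."""
    filtered = []
    for node in grove:
        if node.class_name in plan:
            wanted = plan[node.class_name]
            kept = {k: v for k, v in node.properties.items() if k in wanted}
            filtered.append(Node(node.class_name, kept))
    return filtered


SOURCE = "artist=Harold Budd; title=Example Album; year=2000\n"
grove = build_grove(SOURCE, PROPERTY_SET)
filtered = apply_plan(grove, {"cdinfo": {"artist"}})
print(grove[0].properties)
print(filtered[0].properties)
```

In this sketch the "year" field never reaches the grove because the property set does not define it, and the plan then narrows the output further to the "artist" property alone.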

Figure 1: The Grove Construction Process

Property Sets

A property set--the set of rules that dictates what can be returned by a notation processor--is defined in a document that conforms to the property set architecture.

The property set architecture is a DTD (well, actually it is an architectural form, but that is not relevant here), which means that property sets are defined using valid SGML documents.

A property set is formed from property set components: classes, properties, enumerated values, and normalization rules (which may be applied to string values). Components that are common to all property sets are called intrinsic components.

The following code shows a simple, yet valid, property set for information about music CDs. It is deliberately minimal: its only purpose is to show what a property set looks like.


<!DOCTYPE propset "ISO/IEC 10744:1997//DTD Property Set//EN"
         "propset.dtd">
<propset>
  <!-- A sample Property Set for Hypothetical Music Information -->
  <classdef rcsnm="cdinfo" fullnm="Music CD info" >
    <!-- A class named cdinfo, its description and a string property -->
    <desc>A property set for basic music CD information.</desc>
    <propdef rcsnm="artist" datatype="string">
    </propdef>
    <!-- add other properties here... -->
  </classdef>
</propset>

Classes and Properties

A class identifies a type of information (for example, someone's personal data). A class is a named collection of properties (for example, a social security number, a last name, and so on).

Properties have a name (in the example above there is a string property named "artist") and a property number (this is an automatically assigned, incremental number). Intrinsic properties, that is, properties common to all property sets, are always implicitly located at the beginning of each class.
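To make the numbering concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the PropDef and ClassDef types, the define_class helper, the placeholder names "intrinsic-1" and "intrinsic-2" (the real intrinsic properties are defined by the property set architecture), their "string" type, and the choice to start numbering at 1.

```python
from dataclasses import dataclass

# Placeholder names standing in for the real intrinsic properties.
INTRINSIC_NAMES = ["intrinsic-1", "intrinsic-2"]


@dataclass(frozen=True)
class PropDef:
    number: int      # automatically assigned, incremental
    name: str
    datatype: str


@dataclass(frozen=True)
class ClassDef:
    name: str
    properties: tuple


def define_class(name, declared):
    """Number the properties of a class, placing intrinsic ones first."""
    # The "string" type given to the placeholders is arbitrary.
    intrinsic = [(n, "string") for n in INTRINSIC_NAMES]
    numbered = [
        PropDef(number, prop_name, datatype)
        for number, (prop_name, datatype) in enumerate(intrinsic + list(declared), start=1)
    ]
    return ClassDef(name, tuple(numbered))


cdinfo = define_class("cdinfo", [("artist", "string")])
for prop in cdinfo.properties:
    print(prop.number, prop.name, prop.datatype)
```

Under these assumptions the declared "artist" property receives number 3, because the two placeholder intrinsic properties occupy the first two slots.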

As we hinted before, each property has a type. The possible types are:

char
string
boolean
integer
enum
node (a property whose value may be another node)
nodelist (a list of nodes)
nmdndlist (a list of named nodes)
strlist
intlist
compname
cnmlist

Note that property types can be classified in two ways: nodal vs. non-nodal types, and primitive vs. list types. The types boolean, enum, char, string, integer, compname and node are primitive types (the rest are list types). The types node, nodelist and nmdndlist are nodal (the rest are non-nodal).
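The two classifications are independent of each other, which a small lookup makes easy to see. The sketch below is illustrative only: the classify function is hypothetical, and the type names are taken exactly as spelled in the list above.

```python
# Type names as spelled in the list above.
PRIMITIVE_TYPES = {"boolean", "enum", "char", "string", "integer", "compname", "node"}
LIST_TYPES = {"nodelist", "nmdndlist", "strlist", "intlist", "cnmlist"}
NODAL_TYPES = {"node", "nodelist", "nmdndlist"}


def classify(datatype):
    """Return (primitive-or-list, nodal-or-non-nodal) for a property type."""
    if datatype not in PRIMITIVE_TYPES | LIST_TYPES:
        raise ValueError(f"unknown property type: {datatype!r}")
    shape = "primitive" if datatype in PRIMITIVE_TYPES else "list"
    nodality = "nodal" if datatype in NODAL_TYPES else "non-nodal"
    return shape, nodality


for name in ("node", "nodelist", "string", "strlist"):
    print(name, classify(name))
```

All four combinations occur: node is primitive and nodal, nodelist is a nodal list, string is primitive and non-nodal, and strlist is a non-nodal list.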

Nodes and Nodal Properties

As we said above, the result of parsing the source data is a set of nodes.

You can think of nodes pretty much as instances of the classes defined in the property set. A node is a collection of property assignments, each pairing a property name with a value. For instance, if your property set included a string property named "artist" for the class "CD", a CD node may have an assignment like artist="Harold Budd".

In groves jargon, for each property assignment, we say that the node exhibits a value for the given property.

Now suppose we were interested in a property set to represent data about someone's friends. There should be a person class with properties like name (a string), phone number (maybe an integer), and so on. There would also be properties like spouse, whose values are themselves persons, that is, nodes. This is what we call a nodal property (see property types above).

When creating the grove, if a node exhibits a value for a nodal property (if the person is married, in our example), there will be an arc from the exhibiting node to the exhibited node. The name of the arc is the name of the nodal property.
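The friends example can be sketched in Python as follows. The Node class and the arcs function are hypothetical names introduced for illustration, as are the two sample people; the phone number is kept as a string here for simplicity.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)  # nodes are compared by identity, not by content
class Node:
    class_name: str
    properties: dict = field(default_factory=dict)


def arcs(node):
    """Yield (arc name, exhibited node) for every nodal value the node exhibits."""
    for name, value in node.properties.items():
        targets = value if isinstance(value, list) else [value]
        for target in targets:
            if isinstance(target, Node):
                # The arc is named after the nodal property.
                yield name, target


bob = Node("person", {"name": "Bob", "phone": "5550100"})
alice = Node("person", {"name": "Alice", "phone": "5550101", "spouse": bob})

for arc_name, target in arcs(alice):
    print(arc_name, "->", target.properties["name"])
```

Only the spouse property produces an arc. The name and phone properties are non-nodal, so they stay inside the node as plain values.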

Relationships Between Nodes

The relationship between the exhibiting and exhibited nodes can be one of three kinds: Subnode, IRefnode (internal reference), or URefnode (unrestricted reference).

Subnode arcs go from parent to child nodes, and bind nodes into subnode trees (see graphic below). A subnode can have only one origin (i.e., each subnode can have at most one parent).

IRefnode arcs further connect the nodes within a subnode tree. IRefnode arcs make possible the existence of cycles and convergences.

URefnode arcs are unrestricted connections--unlike IRefnode arcs they don't have to connect nodes in the same tree.

Figure 2: The Anatomy of a Grove

A Definition of Groves

After all this, we are finally ready to define a grove: A grove is a set of nodes connected as a subnode tree, and further connected by IRefnodes.

URefnode arcs can connect any two nodes, whether or not they belong to the same subnode tree. Multiple groves with nodes connected by URefnode arcs form a Hypergrove.
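The definition can be turned into a small structural check. The Python sketch below is an assumption-laden illustration, not an implementation of any standard: nodes are plain strings, arcs are (source, target) pairs, and tree_nodes and is_grove are hypothetical helper names.

```python
from collections import defaultdict


def tree_nodes(root, subnode_arcs):
    """Return the nodes of the subnode tree, or raise if the arcs are not a tree."""
    parent = {}
    children = defaultdict(list)
    for source, target in subnode_arcs:
        if target == root:
            raise ValueError("the root cannot be a subnode")
        if target in parent:
            raise ValueError(f"{target!r} would have more than one origin")
        parent[target] = source
        children[source].append(target)
    # Walk down from the root; every subnode must be reachable from it.
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        seen.add(node)
        stack.extend(children[node])
    if seen != set(parent) | {root}:
        raise ValueError("subnode arcs do not form a single tree")
    return seen


def is_grove(root, subnode_arcs, iref_arcs):
    """True if every IRefnode arc stays inside the subnode tree."""
    nodes = tree_nodes(root, subnode_arcs)
    return all(s in nodes and t in nodes for s, t in iref_arcs)


cd_arcs = [("cd", "track1"), ("cd", "track2")]
print(is_grove("cd", cd_arcs, [("track2", "track1")]))   # internal reference
print(is_grove("cd", cd_arcs, [("track1", "artist")]))   # leaves the tree
```

The second call returns False because "artist" is not part of the cd tree. An arc like that is only legal as a URefnode arc, and the two groves it connects would together form a hypergrove.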

Conclusion: Why Groves?

Hopefully by now you know what groves are. The next question is: why should we care about groves in the XML world?

I would like to enumerate four reasons:

Implementation Independence: The grove paradigm relies on no specific language, tool or platform. It is truly an implementation-independent way to formally define an interoperable data model.
Formal Data Model for XML: Major problems so far with linking and addressing in the XML world come from the fact that they are defined in terms of the XML syntax and not in that of the underlying structures. Property sets offer a very good chance to create a formal definition of the XML data model.
Standard Addressing and Query Language: This is connected with the previous point. Property sets can be defined for virtually any kind of data, thus we can have groves full of ready-to-be-addressed nodes for everything from 3D models to text files. Accessing many types of information through a single mechanism can become more than a long-held wish. (There are excellent articles dealing with this particular problem; please check the Resources section.)
Scalability: The more XML applications appear (whether vertical or horizontal), the more views on XML resources materialize. Using only one view, like that of the DOM, will not be enough to satisfy everyone's needs. A decision is before us: either we choose a scalable representation for an evolving data model, or we gamble in the risky business of never-ending addenda to the DOM.

Resources
	• Addressing the Enterprise: Why the Web needs Groves - Paul Prescod
	• HyTime Annex A.4 annotations/presentation - Fabio Arciniegas A.
	• ISO/IEC 10744:1997 Annex A.4, "Property Set Definition Requirements (PSDR)"
	• Robin Cover's SGML/XML Special Topics