DIDL: Packaging Digital Content
May 30, 2001
Mark Walker, Todd Schwartz, and Vaughn Iverson
In this article we detail the reasons for undertaking the development of a digital packaging standard and describe in depth a package manifest scheme that potentially addresses the enumerated needs. In doing so, we show how such a scheme effectively dissociates the notion of a content item from individual files. We conclude by describing an XML vocabulary, the Digital Item Declaration Language (DIDL), a recently released first working draft from ISO/MPEG that will, when completed, provide a standard means for packaging digital content.
The Need for a Raw Content Description Standard
Today's popular Internet applications generally fall short in their ability to transfer raw resource content. The content of a web page, for example, may be defined as the collection of discrete resources -- bitmaps, JPEG images, text blocks, and so on -- that are aggregated within some predetermined format. The components of the web page may possess attributes and relationships that, while not explicitly part of the final, viewable form, may be critical in generating the displayed result. Information accompanying a JPEG image, for example, could be used to create a photo caption. Information about the relationships among a group of images could be used to position the images on the page. If the web page is generated from a script, information on the sizes of the various images could be used to decide which images to begin downloading first.
Describing raw content as a structured collection of resources in a standard manner requires (1) a standard and flexible metadata format; (2) a standard way to aggregate multiple resources of various types; and (3) a standard way to express structural relationships within the resource collection.
Associating standard-form metadata with a given file allows semantic descriptions and application-specific behavior to be directly associated with the content contained in the file. Currently, ad hoc metadata schemes are employed in several Internet applications. In peer networks, for example, long file names are often used as crude substitutes for semantic descriptions of file contents. File headers are also used, but header formats are largely designed to document only the technical, rather than the semantic, contents of a particular file.
In spite of the widespread use of headers, digital content in the form of a standalone file currently cannot be delivered to any client or rendering platform without a significant amount of user intervention. Intervention typically takes the form of directing a browser to some web site, selecting some resource URI for download or streaming, and then, if it's a file, directing the downloaded material to a directory. Rendering or viewing the content in many cases includes being informed by the client system that a required plug-in or player is either not installed or out of date, requiring the user to search the Web for the right rendering engine or viewer.
The greatest limitation of multimedia header and file formatting schemes is that they are inherently incapable of describing multicomponent collections. XHTML, for example, while serving well as an output format for multicomponent content, is not adequate for describing the raw digital components and their relationships. Standard ways of aggregating multiple digital components in an output-agnostic way are required simply because things like web pages and other display types are composed of many items.
Finally, the ability to describe relationships (this goes with that component, this component contains that component, etc.) in a formal way is required to associate things like images with their corresponding descriptive text. It also could be used to describe component structures that would otherwise be difficult to describe with textual metadata.
Case In Point: The Family Album in Cyberspace
Consider the digital family scrapbook. The scrapbook may be composed of digital photos, video, and text documents. The scrapbook designer needs a straightforward way to represent the individual digital components as a single entity, to annotate the components, and to specify the relationships among the components ("this video and these pictures were taken on Bob and Emily's last trip to Florida"). Having a formal annotation scheme would allow other family members to add new annotations without disturbing the original content ("caption this picture"). It would also permit the setting of intermedia anchor points. This would be especially useful for long videos containing sequences of special interest ("here's the part where Bob fell off the boat"). All of the technical information required by the viewing client, like the media format of each component, sizes of the binary elements, and so on, would need to be included as transparently as possible. Since the collection is likely to be viewed by friends and family on all kinds of computing platforms, a user-transparent way to package together multiple format versions of the same content is also critical for minimizing user intervention in obtaining the album ("I need the QuickTime version of this video").
Another scrapbook need that exposes additional packaging requirements is the case of content that requires encryption, identification, or formal rights declarations to be associated with some specific source component. In the scrapbook example, one might want to associate a specific picture or some other component with a formal copyright statement. If one of the pictures was a derivative of some other photo, identifying it as a copy and also identifying the original source would be valuable. Noting what specifically constituted the original content would be critical in order to maintain the original material as inviolate and reconstructable under long-term usage and storage.
Perhaps the strongest motivation for the use of digital packages emerges from the distinction between the scrapbook package manifest and the resources. While it would be occasionally necessary to actually encapsulate small resources (like thumbnail images) in the manifest itself, most resources would be included in the package by reference. In the digital scrapbook, each component would ideally be accompanied not only by a detailed description of its media type but also the URI for obtaining the platform-specific browser/player plug-in capable of rendering the media type. This would be an especially critical feature in the design of a scrapbook for an extended family in which the various digital components of the collection were located in different, fixed archives in geographically far-flung locations. The highly compact nature of the manifest would allow it to be rapidly transmitted and edited without dragging around the whole collection. The content of the scrapbook would thus be defined by the scrapbook package manifest description rather than the collection components themselves.
Metadata associated with each component and component relationship would also allow the viewer to execute searches on the package manifest (perhaps employing regular expressions) for specific components and, thus, to download or view only a subset of the materials referenced by the package ("Retrieve only the pictures of Bob and Emily when they lived in Ohio").
Finally, since a given package manifest would describe only the structural and semantic relationships of the components in the scrapbook collection in a completely output-agnostic way, formatting for renderable output would be relegated to the application software, or to a transformation or stylesheet. This would allow a multitude of differently-formatted scrapbooks to be generated from the same package manifest.
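As a rough illustration of this separation, the Python sketch below renders one manifest in two different output forms. It borrows the ITEM, DESCRIPTOR, STATEMENT, COMPONENT, and RESOURCE element names from the DIDL examples later in this article; the function names (caption, as_text, as_html) and the choice of output formats are assumptions made for the example, not part of any standard.

```python
import html
import xml.etree.ElementTree as ET

# A tiny manifest using element names from the DIDL examples in this article.
MANIFEST = """\
<DIDL>
  <ITEM>
    <DESCRIPTOR><STATEMENT TYPE="text/plain">Album #2: Bob &amp; Emily</STATEMENT></DESCRIPTOR>
    <ITEM>
      <DESCRIPTOR><STATEMENT TYPE="text/plain">Bob catches a big one at Blue Lake.</STATEMENT></DESCRIPTOR>
      <COMPONENT><RESOURCE REF="Bp1.jpg" TYPE="image/jpg"/></COMPONENT>
    </ITEM>
  </ITEM>
</DIDL>
"""


def caption(item):
    """Return the first textual statement attached directly to an item."""
    statement = item.find("DESCRIPTOR/STATEMENT")
    return (statement.text or "").strip() if statement is not None else ""


def as_text(item, depth=0):
    """Render an item and its sub-items as an indented plain-text outline."""
    lines = ["  " * depth + caption(item)]
    for res in item.findall("COMPONENT/RESOURCE"):
        lines.append("  " * (depth + 1) + "[" + res.get("REF", "") + "]")
    for sub in item.findall("ITEM"):
        lines.extend(as_text(sub, depth + 1))
    return lines


def as_html(item):
    """Render the same item as a nested HTML list (hypothetical layout)."""
    parts = ["<li>" + html.escape(caption(item))]
    for res in item.findall("COMPONENT/RESOURCE"):
        parts.append('<img src="' + html.escape(res.get("REF", ""), quote=True) + '">')
    subs = item.findall("ITEM")
    if subs:
        parts.append("<ul>" + "".join(as_html(s) for s in subs) + "</ul>")
    parts.append("</li>")
    return "".join(parts)


album = ET.fromstring(MANIFEST).find("ITEM")
print("\n".join(as_text(album)))
print(as_html(album))
```

Both renderers read the same structure and descriptions; only the formatting logic differs, which is the point of keeping the manifest output-agnostic.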
The MPEG-21 Digital Item Declaration Language
ISO MPEG has sought, in its development of the emerging MPEG-21 standard, to develop a multimedia framework that is capable of supporting the delivery and use of all content types by different categories of users in multiple application domains. Earlier this year, in response to the needs articulated above, the Multimedia Description Schemes (MDS) Group within MPEG released the first working draft of an XML vocabulary, the MPEG-21 Digital Item Declaration Language (DIDL). The overall goal for DIDL was to establish a uniform and flexible multimedia data abstraction and interoperability schema for declaring digital items. Within the MPEG-21 framework, a Digital Item is defined as a structured digital object with a standard representation, identification, and description. This Digital Item entity is also the fundamental unit of distribution and transaction within this framework. DIDL is based on an abstract model called the Digital Item Declaration Model. The primary concepts within the model appear below. Many of the model elements have directly corresponding DIDL XML elements.
- A resource is an individually identifiable asset such as a video or audio clip, an image, or a textual asset. A resource may also potentially be a physical object. All resources must be locatable via an unambiguous address.
- A component is the binding of a resource to all of its relevant descriptors (see below). These descriptors are bits of information related to all or part of the specific resource instance. Component descriptors will typically contain control or structural information about the resource (such as bit rate, character set, start points or encryption information) but generally not information describing the "content" within.
- A descriptor associates information with the enclosing element. This information may be a component (such as a thumbnail of an image) or a portion of text.
- A statement is a literal textual value that contains information but not an asset. Examples of likely statements include descriptive, control, revision tracking, or identifying information.
- A fragment unambiguously designates a specific point or range within a resource (like a specific frame sequence in a long video).
- An anchor binds descriptors to a fragment, which corresponds to a specific location or range within a resource.
- A predicate is an unambiguously identifiable declaration that can be true, false, or undecided.
- A selection describes a specific decision that will affect one or more conditions somewhere within an item. If the selection is chosen, its predicate becomes true, if it is not chosen its predicate becomes false, and if it is left unresolved, its predicate is undecided.
- A condition describes the enclosing element as being optional and links it to the selection(s) that affect its inclusion.
- A choice describes a set of related selections that can affect the configuration of an item. The selections within a choice are either exclusive (choose exactly one) or inclusive (choose any number, including all or none).
- An item is a grouping of sub-items or components that are bound to relevant descriptors. Items may contain choices, which allow them to be customized or configured. Items themselves may be conditional. If an item contains no sub-items, then it can be called an entity. If it contains sub-items, then it can be called a compilation.
- A container is a potentially hierarchical structure that allows items to be grouped. These groupings of items can be used to form logical packages (for transport or exchange) or logical shelves (for organization).
- An assertion defines a full or partially configured state of a choice by asserting true, false or undecided values for some number of predicates associated with the selections for that choice.
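The interplay of predicates, selections, choices, conditions, and assertions can be sketched in a few lines of Python. This is only an illustrative reading of the definitions above, not the normative model: the class and function names are invented for the example, and treating a condition as true only when all of its required predicates are true is an assumption about how multiple requirements combine.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Iterable, List


class Truth(Enum):
    """The three values a predicate can take in the model."""
    TRUE = "true"
    FALSE = "false"
    UNDECIDED = "undecided"


@dataclass
class Choice:
    """A set of related selections; each selection id names one predicate."""
    selection_ids: List[str]
    min_selections: int = 0
    max_selections: int = 1
    state: Dict[str, Truth] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every predicate is undecided until an assertion says otherwise.
        for sid in self.selection_ids:
            self.state.setdefault(sid, Truth.UNDECIDED)

    def chosen_count(self) -> int:
        return sum(1 for v in self.state.values() if v is Truth.TRUE)

    def assert_values(self, assertion: Dict[str, Truth]) -> None:
        """Apply a full or partial assertion; reject it if it exceeds the maximum."""
        candidate = dict(self.state)
        for sid, value in assertion.items():
            if sid not in candidate:
                raise KeyError(sid)
            candidate[sid] = value
        if sum(1 for v in candidate.values() if v is Truth.TRUE) > self.max_selections:
            raise ValueError("assertion selects more than max_selections")
        self.state = candidate

    def is_resolved(self) -> bool:
        """True when nothing is undecided and the selection count is in range."""
        undecided = any(v is Truth.UNDECIDED for v in self.state.values())
        return (not undecided
                and self.min_selections <= self.chosen_count() <= self.max_selections)


def condition_state(required: Iterable[str], predicates: Dict[str, Truth]) -> Truth:
    """Evaluate a condition, assuming it needs every listed predicate to be true."""
    values = [predicates.get(name, Truth.UNDECIDED) for name in required]
    if any(v is Truth.FALSE for v in values):
        return Truth.FALSE
    if all(v is Truth.TRUE for v in values):
        return Truth.TRUE
    return Truth.UNDECIDED
```

An exclusive choice corresponds to min_selections=1 and max_selections=1; an inclusive one would relax those bounds. An element guarded by a condition stays in limbo while condition_state returns UNDECIDED, which mirrors the partially configured state that an assertion can describe.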
Example: the DIDL packaged family scrapbook
DIDL documents are XML 1.0 documents. A specific goal in the design of the element set was to be as flexible and general as possible, providing a basis for constructing higher-level functionality. This was done to allow it to serve as a key foundation in the building of higher-level elements potentially residing in other XML namespaces. The actual XML elements composing DIDL are
<DIDL> <DECLARATIONS> <CONTAINER> <ITEM> <COMPONENT> <RESOURCE> <DESCRIPTOR> <STATEMENT> <ANCHOR> <CHOICE> <SELECTION> <CONDITION> <OVERRIDE> <REFERENCE> <ANNOTATION> <ASSERTION>
Note that many of the XML elements directly correspond with model elements.
Returning to the scrapbook example, the following DIDL XML fragment illustrates the case of a small photo album referencing pictures of two different media TYPEs. A descriptor containing a statement of TYPE "text/plain" is associated with each photo. Note how the ITEM containment structure is used to denote two separate photo albums.
<DIDL>
  <CONTAINER>
    <DESCRIPTOR>
      <STATEMENT TYPE="text/plain">Jones family on-line photo albums</STATEMENT>
    </DESCRIPTOR>
    <ITEM>
      <DESCRIPTOR>
        <STATEMENT TYPE="text/plain">Album #1: The Kids</STATEMENT>
      </DESCRIPTOR>
      <ITEM>
        <DESCRIPTOR>
          <STATEMENT TYPE="text/plain">
            Johnny Williams' first day at Westside High school.
            His friends Bruce and Walter are also pictured.
          </STATEMENT>
        </DESCRIPTOR>
        <COMPONENT>
          <RESOURCE REF="Pjn1.jpg" TYPE="image/jpg" />
        </COMPONENT>
      </ITEM>
      <ITEM>
        <DESCRIPTOR>
          <STATEMENT TYPE="text/plain">
            Jane's first day at Jefferson elementary school,
            accompanied by her Dad, Robert Williams
          </STATEMENT>
        </DESCRIPTOR>
        <COMPONENT>
          <RESOURCE REF="Pja1.bmp" TYPE="image/bmp" />
        </COMPONENT>
      </ITEM>
    </ITEM>
    <ITEM>
      <DESCRIPTOR>
        <STATEMENT TYPE="text/plain">Album #2: Bob &amp; Emily</STATEMENT>
      </DESCRIPTOR>
      <ITEM>
        <DESCRIPTOR>
          <STATEMENT TYPE="text/plain">
            Bob catches a big one at Blue Lake.
          </STATEMENT>
        </DESCRIPTOR>
        <COMPONENT>
          <RESOURCE REF="Bp1.jpg" TYPE="image/jpg" />
        </COMPONENT>
      </ITEM>
      <ITEM>
        <DESCRIPTOR>
          <STATEMENT TYPE="text/plain">
            Emily around the campfire at Blue Lake.
          </STATEMENT>
        </DESCRIPTOR>
        <COMPONENT>
          <RESOURCE REF="Ep1.bmp" TYPE="image/bmp" />
        </COMPONENT>
      </ITEM>
    </ITEM>
  </CONTAINER>
</DIDL>
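Because the descriptions live in the manifest rather than in the image files, a client can search the manifest and fetch only the matching resources. The Python sketch below shows one way this might look, using a regular expression over STATEMENT text. The find_resources function and its matching rule (an item matches when a statement attached directly to it matches) are assumptions for illustration; the embedded manifest is a shortened copy of the second album.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the second album from the scrapbook example.
MANIFEST = """\
<DIDL>
  <CONTAINER>
    <ITEM>
      <DESCRIPTOR><STATEMENT TYPE="text/plain">Album #2: Bob &amp; Emily</STATEMENT></DESCRIPTOR>
      <ITEM>
        <DESCRIPTOR><STATEMENT TYPE="text/plain">Bob catches a big one at Blue Lake.</STATEMENT></DESCRIPTOR>
        <COMPONENT><RESOURCE REF="Bp1.jpg" TYPE="image/jpg"/></COMPONENT>
      </ITEM>
      <ITEM>
        <DESCRIPTOR><STATEMENT TYPE="text/plain">Emily around the campfire at Blue Lake.</STATEMENT></DESCRIPTOR>
        <COMPONENT><RESOURCE REF="Ep1.bmp" TYPE="image/bmp"/></COMPONENT>
      </ITEM>
    </ITEM>
  </CONTAINER>
</DIDL>
"""


def find_resources(manifest_xml, pattern):
    """Return REF values of resources in items whose own description matches."""
    regex = re.compile(pattern, re.IGNORECASE)
    root = ET.fromstring(manifest_xml)
    refs = []
    for item in root.iter("ITEM"):
        # Look only at statements attached directly to this item,
        # not at those of nested sub-items.
        texts = [s.text or "" for s in item.findall("DESCRIPTOR/STATEMENT")]
        if any(regex.search(t) for t in texts):
            # Collect only the components that are direct children of the item.
            refs.extend(r.get("REF") for r in item.findall("COMPONENT/RESOURCE"))
    return refs


print(find_resources(MANIFEST, r"\bBob\b"))
print(find_resources(MANIFEST, r"Blue Lake"))
```

The album-level item also matches "Bob", but it has no components of its own, so only the photo item contributes a reference. A client would then download just the returned REF values instead of the whole collection.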
The second example fragment illustrates how the DIDL CHOICE and SELECTION elements can be used to package a deliverable resource (in this case, a music file) that is available in more than one media file format.
<DIDL>
  <ITEM>
    <CHOICE MIN_SELECTIONS="1" MAX_SELECTIONS="1">
      <DESCRIPTOR>
        <STATEMENT TYPE="text/plain">What format would you prefer?</STATEMENT>
      </DESCRIPTOR>
      <SELECTION SELECT_ID="MP3_FORMAT">
        <DESCRIPTOR>
          <STATEMENT TYPE="text/plain">I want MP3</STATEMENT>
        </DESCRIPTOR>
      </SELECTION>
      <SELECTION SELECT_ID="WMA_FORMAT">
        <DESCRIPTOR>
          <STATEMENT TYPE="text/plain">I want WMA</STATEMENT>
        </DESCRIPTOR>
      </SELECTION>
    </CHOICE>
    <COMPONENT>
      <CONDITION REQUIRE="MP3_FORMAT"/>
      <RESOURCE REF="clip.mp3" TYPE="audio/mp3"/>
    </COMPONENT>
    <COMPONENT>
      <CONDITION REQUIRE="WMA_FORMAT"/>
      <RESOURCE REF="clip.wma" TYPE="audio/wma"/>
    </COMPONENT>
  </ITEM>
</DIDL>
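A client that receives this fragment has to turn the user's answer into a concrete list of resources. The Python sketch below is one possible reading of that process, not a reference implementation: select_resources is a hypothetical helper, the defaults applied when MIN_SELECTIONS or MAX_SELECTIONS is absent are assumptions, and a component is kept only when every one of its CONDITION requirements was chosen.

```python
import xml.etree.ElementTree as ET

# The music-file fragment from this article.
ITEM_XML = """\
<DIDL>
  <ITEM>
    <CHOICE MIN_SELECTIONS="1" MAX_SELECTIONS="1">
      <SELECTION SELECT_ID="MP3_FORMAT"/>
      <SELECTION SELECT_ID="WMA_FORMAT"/>
    </CHOICE>
    <COMPONENT>
      <CONDITION REQUIRE="MP3_FORMAT"/>
      <RESOURCE REF="clip.mp3" TYPE="audio/mp3"/>
    </COMPONENT>
    <COMPONENT>
      <CONDITION REQUIRE="WMA_FORMAT"/>
      <RESOURCE REF="clip.wma" TYPE="audio/wma"/>
    </COMPONENT>
  </ITEM>
</DIDL>
"""


def select_resources(didl_xml, chosen):
    """Return the REFs of components whose conditions the chosen selections satisfy."""
    root = ET.fromstring(didl_xml)
    chosen = set(chosen)

    # Check each CHOICE against its selection bounds before resolving anything.
    for choice in root.iter("CHOICE"):
        ids = {s.get("SELECT_ID") for s in choice.findall("SELECTION")}
        count = len(ids & chosen)
        # Assumed defaults: no minimum, and no maximum beyond the selection count.
        lowest = int(choice.get("MIN_SELECTIONS", "0"))
        highest = int(choice.get("MAX_SELECTIONS", str(len(ids))))
        if not lowest <= count <= highest:
            raise ValueError(
                "expected between %d and %d selections, got %d" % (lowest, highest, count)
            )

    refs = []
    for component in root.iter("COMPONENT"):
        required = [c.get("REQUIRE") for c in component.findall("CONDITION")]
        # A component with no CONDITION is unconditional and always included.
        if all(name in chosen for name in required):
            refs.extend(r.get("REF") for r in component.findall("RESOURCE"))
    return refs


print(select_resources(ITEM_XML, ["MP3_FORMAT"]))
```

Choosing neither format, or both, raises a ValueError here because the CHOICE is declared with MIN_SELECTIONS="1" and MAX_SELECTIONS="1".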
The third example fragment demonstrates how a DIDL "wrapper" can be constructed around a proprietary, XML-based descriptor type. In this case, an external schema is employed in creating a photo captioning scheme. The elements from the separate, external schema carry the "xzs" prefix. The MIME designation of the foreign descriptor STATEMENT types is "text/xml".
<DIDL>
  <CONTAINER>
    <ITEM ID="NIAGRA_PHOTO1">
      <COMPONENT>
        <DESCRIPTOR>
          <STATEMENT TYPE="text/xml">
            <xzs:SpecialCaption>
              Bob and Mary Jones standing at Niagara Falls
            </xzs:SpecialCaption>
          </STATEMENT>
        </DESCRIPTOR>
        <DESCRIPTOR>
          <STATEMENT TYPE="text/xml">
            <xzs:CreatorName>Bill Smith</xzs:CreatorName>
          </STATEMENT>
        </DESCRIPTOR>
        <DESCRIPTOR>
          <STATEMENT TYPE="text/xml">
            <xzs:Copyright>
              1998 Bill Smith Photo Enterprises
            </xzs:Copyright>
          </STATEMENT>
        </DESCRIPTOR>
        <RESOURCE REF="pict1.jpg" TYPE="image/jpg"/>
      </COMPONENT>
    </ITEM>
  </CONTAINER>
</DIDL>
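An application that understands the foreign vocabulary can pull its descriptors out of the wrapper while ignoring everything else. The Python sketch below assumes the "xzs" prefix is bound to a namespace; the URI urn:example:xzs is purely hypothetical, since the fragment above does not declare one, and foreign_descriptors is an invented helper name. The embedded document is a simplified copy of the wrapper example.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the "xzs" prefix; the article does not give one.
XZS_NS = "urn:example:xzs"

WRAPPED = """\
<DIDL xmlns:xzs="urn:example:xzs">
  <CONTAINER>
    <ITEM ID="NIAGRA_PHOTO1">
      <COMPONENT>
        <DESCRIPTOR>
          <STATEMENT TYPE="text/xml">
            <xzs:SpecialCaption>Bob and Mary Jones standing at Niagara Falls</xzs:SpecialCaption>
          </STATEMENT>
        </DESCRIPTOR>
        <DESCRIPTOR>
          <STATEMENT TYPE="text/xml">
            <xzs:Copyright>1998 Bill Smith Photo Enterprises</xzs:Copyright>
          </STATEMENT>
        </DESCRIPTOR>
        <RESOURCE REF="pict1.jpg" TYPE="image/jpg"/>
      </COMPONENT>
    </ITEM>
  </CONTAINER>
</DIDL>
"""


def foreign_descriptors(didl_xml, namespace):
    """Map local element names in the given namespace to their text content."""
    root = ET.fromstring(didl_xml)
    # ElementTree expands a prefixed tag to the form "{namespace-uri}LocalName".
    qualifier = "{" + namespace + "}"
    found = {}
    for statement in root.iter("STATEMENT"):
        if statement.get("TYPE") != "text/xml":
            continue  # plain-text statements carry no embedded elements
        for child in statement:
            if child.tag.startswith(qualifier):
                found[child.tag[len(qualifier):]] = (child.text or "").strip()
    return found


print(foreign_descriptors(WRAPPED, XZS_NS))
```

A DIDL-only processor can skip the text/xml statements entirely and still locate the RESOURCE, which is what lets the wrapper carry proprietary descriptions without breaking generic tools.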
Internet-transacted digital content is a reality, but the lack of standards makes it very difficult for non-technical users to obtain and transmit content. Content that is transacted is generally not interoperable across platforms and is still tightly bound to the directory/file paradigm, which greatly limits its flexibility. The MPEG-21 Digital Item Declaration Language addresses these and related problems by providing a relatively simple, standard method for describing complex, multicomponent content source collections.