A Realist's SMIL Manifesto

May 29, 2002

Realist: One who is inclined to physical evidence or pragmatism. -- From the Realist Manifesto (1920), written by constructivist authors and brothers Antoine Pevsner & Naum Gabo

The Synchronized Multimedia Integration Language, SMIL, has a less-than-stellar past but a very interesting future. SMIL 2.0 recaptures the simplicity and practicality of declarative synchronization of media introduced by version 1.0, while adding modularization and content-related features much missed in the early version.

The goal of this two-part series is to illustrate best practices and creative uses of SMIL 2.0; in particular the creation of guided-reading documents which push the boundaries of Web narrative technology by combining classic layout and design practices with television-like effects.

The present article deals with the problem of enhancing video inexpensively and dynamically with SMIL 1.0 and assumes no prior knowledge of the language. It covers the current state of SMIL; the structure and syntax of the language, with examples; and SMIL 1.0's strengths and flaws. It is meant to get you up to speed with the last three years of SMIL, while the next article will show you what is ahead in the coming years, and how SMIL can be a player in improving narrative technology on the Web. (You can download the example files I use in this article, but be warned: they are about 4 MB.)

The State of SMIL

The SMIL project started in 1998 and then, after initial enthusiasm in multimedia circles developing kiosks and similar applications, virtually disappeared from people's attention, in favor of other technologies. With the August, 2001 release of SMIL 2.0, the buzz is starting to return, but SMIL suffers from two main problems: confusion about terminology and the lack of business or artistic orientation in current literature.

Confusion about terminology and versioning

Keeping up with version numbers in commercial multimedia packages is simple: the relevant entities are the "editor" and the "player", whose versions are usually the same, and each is either "beta" or "release". For technical and bureaucratic reasons, things are not so simple with SMIL. First, SMIL 2.0 is technically not just a language but a collection of reusable modules (animation, layout, synchronization) which can be independently implemented and used in other languages. Second, as a W3C Recommendation, the status of SMIL at any point includes less well-known markers like "Candidate Recommendation" and "Note", which generally do not make the situation any clearer to SMIL's intended public.

In SMIL 2.0, elements and attributes are grouped into independent bundles called modules; for example, the layout and region elements are in the Layout module, and the animateColor and animateMotion elements are in the Animation module. SMIL modules can be grouped into a language, called a profile. There are two SMIL profiles, "SMIL 2.0 language profile" and a simplified version, "SMIL 2.0 basic profile", designed for small devices. Both are supersets of the original SMIL 1.0 language.

Modules are designed to be reusable as parts of other XML vocabularies, so vendors or other standards initiatives may decide to implement only parts of SMIL. Examples of this practice include the marriage of XHTML and the SMIL timing module and declarative animation in SVG, implemented by IE6 and Adobe SVG Viewer 2+ respectively. As far as direct SMIL support is concerned, there are a number of SMIL 2.0 players in the making (see side box) but most of the available players still use SMIL 1.0. The examples of the SMIL 2.0 language profile discussed in this article work on SMIL 1.0 players, except where noted.

The other big impediment to popularizing SMIL is the nature of the current literature, which for the most part contains a descriptive overview of each module, its elements and attributes, with occasional examples of a zooming square or a photo slideshow. This documentation pattern doesn't address the communication potential of SMIL or its contribution to the media. It's certainly not going to convince any manager to invest in SMIL development or a creative developer to learn SMIL. The key to popularizing SMIL is to emphasize its potential to expand the possibilities of a media-rich Web, rather than its strictly technical superiority.

The Process

Whether using SMIL 1.0 or 2.0, the steps involved in creating a SMIL presentation are invariably the following:

Create an XML document and include the appropriate namespace. The root element is smil, and its children are head and body.
In the head element, code the layout of the areas where content can be inserted.
In the body element, code the references to the content to be inserted; specify where, when, and for how long each element is shown.
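
The three steps above can be sketched as a minimal document. This is only an illustration: the region id, the dimensions, and the file name are placeholders, not part of the example files. The namespace shown in the comment is the one used for SMIL 2.0 documents.

```xml
<!-- Minimal sketch. A SMIL 2.0 document would declare
     xmlns="http://www.w3.org/2001/SMIL20/Language" on the root element. -->
<smil>
  <head>
    <!-- Step 2: lay out the areas where content can appear -->
    <layout>
      <root-layout width="160" height="120"/>
      <region id="main" left="0" top="0" width="160" height="120"/>
    </layout>
  </head>
  <body>
    <!-- Step 3: reference the content; say where (region) and for how long (dur) -->
    <img src="placeholder.gif" region="main" dur="5s"/>
  </body>
</smil>
```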

The Problem: Late and Localized Annotations

When you watch even the simplest television show you're watching images composed of several layers of content: the actual video filmed with a camera, the logo of the channel in a corner, annotations (in the case of Figure 1, the name of the band and the song), and so on. Some networks add even more layers of content, providing extra data about, say, the drivers of a NASCAR race or trivia about the band in a music video.

Figure 1. Images are composed by superimposing layers

The problem with traditional TV is that all the layers get merged before they are shipped to everyone's television, where people get one flat image. Media like Digital HDTV and the Web using SMIL can keep track of the different components of a presentation. Thus, they can avoid merging layers early by deciding at presentation time to hide or to show content, depending on user preferences or other factors and constraints. For example, a DVD can show or hide captions with script notes synchronized with the movie, at the user's will.

Showing and hiding extra layers of content is just the tip of the iceberg; using SMIL you can position and synchronize any media on top of your video without ever having to decide on a final merge of your pieces. Furthermore, you can combine SMIL with dynamic content and customize and localize your layers, opening new opportunities for information, entertainment, and publicity.

The Project: Annotating Boxing Footage

What we want from this project is a solution for visually annotating videos, adding layers of content with data dependent on the locale and preferences of each viewer, without having to alter the video itself. This is a very desirable feature for many media sites, which want to inexpensively add dynamic content to their video for publicity and business purposes.

The steps to create an annotated video include

deciding what video and what kind of annotation data we want;
creating a layout for the annotated video: figuring out which region serves which purpose, the size and position of the regions;
deciding the sequence and duration of events; and
modifying the source of the annotations so that they can be localized and customized.

Each step involves not only technical knowledge about the SMIL language, but effective design ideas, which make the difference between a nice experiment and an effective tool.

The Video

The video we will annotate is a portion of a boxing match between Jake La Motta and Sugar Ray Robinson in 1951. I picked this clip because it is small, and because sports feeds are a realistic example of video that can be served by dynamic annotations.

We want to add three kinds of annotation: opening titles, boxers' statistics, and associated trivia. Figure 2 shows a snapshot of the final result versus the original video.

Figure 2. The naked video vs. the Final Result

Layout

Layout is the process of arranging elements in a space. Effective layout directs the attention of the user, guiding her through the hierarchy of elements. Layout is accomplished in a variety of ways, like providing a sense of depth, creating contrast between elements, or intuitively sequencing elements.

Directing the viewer's attention to different elements involved in a video is a lot easier than in static graphics because elements can pop up and disappear from the screen. However, important style notions are relevant for our example, especially the notions of regularity, recognition, and depth. Table 1 shows the layout regions for our content, the code necessary to implement them, and their rationale.

Layout Areas	Code
	<smil>
	  <head>
	    <layout>
	      <root-layout id="video" width="159" height="120"/>
	      <region id="comment" left="10" top="9" width="34" height="29" z-index="1"/>
	      <region id="stats" left="105" top="14" width="43" height="75" z-index="1"/>
	      <region id="title" left="12" top="99" width="113" height="15" z-index="1"/>
	      <region id="caption" left="29" top="90" width="102" height="20" z-index="2"/>
	    </layout>
	  </head>
	  <body>
	    <!-- Not shown -->
	  </body>
	</smil>
Rationale
Using a total area not bigger than the video itself promotes the reusability of the annotated video because we don't have to make compromises or assumptions about the background color of the area not covered by the footage.
Rhythm is important in a layout because it helps the user recognize and classify information. In the case of SMIL annotations nothing is easier than achieving regularity by consistently showing related information in the same places. We use totally different areas for Tips and Statistics.
Banking on well-known practices is often convenient. Titles and people's names at the bottom are instantly recognized by users, and so are white-on-black captions centered at the bottom, on top of all else.

Table 1. Video Annotation Layout

I've kept the code compatible with SMIL 1.0 because there are very few players for SMIL 2.0 and the ideas introduced here are the same in SMIL 1.0 and 2.0.

Adding and Grouping Elements

The first elements we want to add to our presentation are the opening credits, which are two simple GIF files. What we want, as specified in the timeline of Figure 3, is for each GIF to appear for 3 seconds, one after the other. To achieve this we reference the media using img elements, and we group them in a seq (for sequence) element, as shown in Listing 1.

	<smil>
	  <head>
	    <!-- Layout exactly as in Table 1 -->
	  </head>
	  <body>
	    <seq>
	      <img src="Intro-Names.gif" region="video" dur="3s"/>
	      <img src="Intro-Date.gif" region="video" dur="3s"/>
	    </seq>
	  </body>
	</smil>
Figure 3. Timeline for credits	Listing 1. Showing the credits in a sequence

As you can see, specifying a sequence in SMIL is very intuitive. Before getting into more sophisticated ways of specifying synchronization, the prior question is what media types you can synchronize. The media element tags are:

img: JPEG or GIF images work on all current players. See the documentation of your player for details. GIF89 transparency is not supported in any current player; non-interlaced GIF is preferred in RealPlayer.
video: MPEG, AVI, RealVideo and other formats for motion clips must be included using this element. The support for different video formats is especially dependent on the player.
text: Static text. HTML is not supported in any SMIL 1.0 players.
audio: Audio clips including WAV and AU. Also covers streaming audio such as RealAudio.
animation: Animation clips. The types supported are especially player-dependent and limited (don't really expect Flash and Mojo support in standalone players).
ref: Any clip not covered by other elements but supported by the player

It is important to realize that the existence of an explicit tag does not mean that every SMIL player supports that media type. The incomplete support for some media types in many players is one of the reasons for the slow adoption of SMIL. For example, you cannot see through the transparent areas of a GIF file or include HTML as a media element in any of the current SMIL 1.0 players and support is only partial in SMIL 2.0 players presently.

RealPlayer and QuickTime include extra elements for vendor-specific "smart text", used for effects like tickers and basic formatting. Unless you have to produce SMIL 1.0 specifically for either platform, you should avoid such extensions for the sake of portability.

Establishing event duration

In our boxing example we used the dur attribute to specify the total duration of each clip. You can also specify the beginning and end of the clip using the begin and end attributes. With elements inside a sequence, the begin attribute specifies a time after the end of the previous element and the end attribute specifies a time after the clip started to play. Figure 4 and Listing 2 summarize the point.

	<smil>
	  <head>
	    <!-- Layout exactly as in Table 1 -->
	  </head>
	  <body>
	    <seq>
	      <img src="Intro-Names.gif" region="video" dur="2s"/>
	      <img src="Intro-Date.gif" region="video" begin="1.0s" end="3.0s"/>
	    </seq>
	  </body>
	</smil>
Figure 4. Alternative Timeline	Listing 2. Implementing the alternative with begin and end.

Transitions

SMIL 2.0 supports transitions like horizontal and vertical wipes. To preserve SMIL 1.0 compatibility, wipes must be incorporated in the video. Lack of transitions is an example of SMIL 1.0's over-simplification, a problem corrected technically in version 2.0, but with serious image consequences for SMIL in the mind of many creative professionals.
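
For comparison, here is a sketch of how a wipe could be declared in a SMIL 2.0 document rather than baked into the video. It is illustrative only: the region id, the dimensions, and the transition id are made up for this example, and it will not play in a SMIL 1.0 player.

```xml
<smil xmlns="http://www.w3.org/2001/SMIL20/Language">
  <head>
    <layout>
      <root-layout width="160" height="120"/>
      <region id="credits" left="0" top="0" width="160" height="120"/>
    </layout>
    <!-- A left-to-right bar wipe lasting one second -->
    <transition id="wipeIn" type="barWipe" subtype="leftToRight" dur="1s"/>
  </head>
  <body>
    <seq>
      <img src="Intro-Names.gif" region="credits" dur="3s"/>
      <!-- transIn applies the wipe as this image appears -->
      <img src="Intro-Date.gif" region="credits" dur="3s" transIn="wipeIn"/>
    </seq>
  </body>
</smil>
```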

Concurrent Media

Apart from sequences, you can organize your media elements in two other blocks: par and switch. The following section deals with par blocks; switch will be explained when we localize and customize our presentation.

The par element is used to group together media elements that are played concurrently. For media elements inside a par, the begin attribute is relative to the beginning of the whole group. In the code below we synchronize the names of the fighters (appearing on the "title" region) with a voice-over explaining who is who. As you can see in Listing 3, seq and par blocks can be arbitrarily nested.

	<seq>
	  <par>
	    <audio src="jake.wav"/>
	    <img src="title-jake.gif" region="title" dur="4s"/>
	  </par>
	  <par>
	    <audio src="sugar.wav"/>
	    <img src="title-sugar.gif" region="title" dur="4s"/>
	  </par>
	</seq>
Figure 5. Timeline for fighter names audio	Listing 3. Implementing the names in graphics and audio
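
Because begin inside a par is measured from the start of the group, an annotation can be delayed relative to the video it accompanies. The fragment below is a sketch under assumptions: it uses the "video" and "comment" regions from Table 1, the trivia graphic is a hypothetical file, and the offsets are arbitrary.

```xml
<par>
  <!-- The video starts together with the group -->
  <video src="LaMottaRobinson51-Low.avi" region="video"/>
  <!-- Hypothetical trivia graphic: appears 5 seconds after the group
       starts and stays on screen for 4 seconds -->
  <img src="trivia-1.gif" region="comment" begin="5s" dur="4s"/>
</par>
```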

More About Regions

In the previous examples we've been working with conveniently designed graphics that fit nicely into the regions of the layout. Clean fits are not always the case, so it is important to understand the different clipping options we can use on a region. When the size and shape of a visual element don't match those of a region, some kind of cropping or resizing mechanism intervenes. The particular mechanism used depends on the fit attribute of the region. The options are better explained with some examples; Figure 6 shows how an image will look if we change the fit attribute in the target region.


Fit Value	Result
fill	Resize the clip so that it fills the region exactly (even if distorted)
hidden	Keep original size. Cut content outside the boundaries of the region
meet	Keep the proportions of the clip and resize until the clip fits completely in the region
slice	Keep the proportions of the clip and resize until the clip completely fills the region. Cut content outside the boundaries

Figure 6. Different Fits in the Same Region
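
The fit attribute goes on the region, not on the media element. As a sketch, the "stats" region from Table 1 could be declared as follows; choosing slice here is only an illustration, not a recommendation for the example files.

```xml
<layout>
  <root-layout width="159" height="120"/>
  <!-- fit="slice": scale the clip, keeping its proportions, until it
       fills the region, then cut whatever falls outside -->
  <region id="stats" left="105" top="14" width="43" height="75"
          z-index="1" fit="slice"/>
</layout>
```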

Localizing and Customizing

SMIL provides a declarative syntax for limited localization based on the switch element. A switch element contains media or group elements, each with a test attribute (e.g. system-language). When the interpreter finds the switch element it begins trying, one by one, the test attributes of the enclosed elements; the first element to have a test attribute that matches the environment is chosen. Listing 4 illustrates the point, allowing Russian, French, and English audio in our annotated video.


<par>
  <switch>
    <audio src="jake-ru.wav" system-language="ru"/>
    <audio src="jake-fr.wav" system-language="fr"/>
    <audio src="jake-en.wav"/>
  </switch>
  <img src="title-jake.gif" region="title" dur="4s"/>
</par>

Listing 4. Localization of audio

Since the options are evaluated in the order they appear, it's a good practice to leave a sensible default, in this case the English clip, as the last entry, with no test attribute.
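
Since a switch can hold group elements as well as single media elements, a whole par can be localized at once. The sketch below assumes a hypothetical French title graphic (title-jake-fr.gif) next to the audio files of Listing 4; the last par has no test attribute and acts as the default.

```xml
<switch>
  <par system-language="fr">
    <audio src="jake-fr.wav"/>
    <!-- Hypothetical localized graphic, not part of the example files -->
    <img src="title-jake-fr.gif" region="title" dur="4s"/>
  </par>
  <!-- Default: no test attribute, so it is chosen when nothing above matches -->
  <par>
    <audio src="jake-en.wav"/>
    <img src="title-jake.gif" region="title" dur="4s"/>
  </par>
</switch>
```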

Another useful test attribute is system-bitrate, used to choose different content based on the available bandwidth. Listing 5 shows the use of system-bitrate to customize the quality of our boxing video.


<switch>
  <video src="LaMottaRobinson51-High.avi" region="video"
         system-bitrate="150000"/>
  <video src="LaMottaRobinson51-Low.avi" region="video"/>
</switch>

Listing 5. Choosing media for different bandwidths

The other test attributes are system-captions, system-overdub-or-caption, system-required, system-screen-size, and system-screen-depth (see the spec for details). However, only system-language and system-bitrate are reliably supported by the existing 1.0 players.
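
As an illustration of one of the less reliable attributes, the fragment below shows how system-captions might drive the "caption" region from Table 1. Treat it as a sketch: the caption file and the timing are hypothetical, and whether a given 1.0 player honors the test has to be checked player by player.

```xml
<par>
  <video src="LaMottaRobinson51-Low.avi" region="video"/>
  <switch>
    <!-- Chosen only when the viewer has turned captions on in the player;
         caption-round.txt is a hypothetical plain-text file -->
    <text src="caption-round.txt" region="caption"
          system-captions="on" begin="2s" dur="4s"/>
  </switch>
</par>
```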

Since only a few test-attributes are provided and no client-side scripting can be used to change the DOM of the SMIL document in 1.0 players, all customizations based on other criteria must be implemented on the server side, forcing the need for server-side programming:

<video src="cgi-bin/getVideoBasedOnUserAgeCookie.pl" region="video"/>

This is exactly what we can do in our example to bring random statistic content into the video every time you watch it. /cgi-bin/getRandomStatistic.pl returns either the fight record, the history of fights between these two boxers, or the temperature conditions in the arena. Since there is no test attribute to make random choices, we don't use switch; rather, the URL of the CGI is called:

<img src="cgi-bin/getRandomStatistic.pl" region="stats" />

Note that I've used an image to display stats, although text would be easier to produce. The reason is that text formatting is not supported in SMIL 1.0, and all current players will either display text in a dull system font or force you into using a proprietary text markup format, like RealText. HTML is not supported in RealPlayer, QuickTime, or any other SMIL 1.0 player. (GRiNS is a limited exception but there are no trial licenses available, and it probably isn't realistic to expect your users to buy GRiNS before downloading your media.) The lack of HTML support is one of the key reasons behind SMIL 1.0's failure as a mainstream technology.

Where Does This Work?

The annotation, localization, and customization possibilities of SMIL are interesting and make a great pitch for the technology in the abstract. But the fact is that SMIL 1.0 has had problems being accepted. The result must be stated clearly: the subset of SMIL which works on current players (that is, SMIL 1.0 plus some proprietary extensions) is applicable mostly to static pieces of data in controlled environments, where the chosen player is known in advance and the code can be tuned to accommodate player idiosyncrasies.

But you cannot rely on the behavior of just one player if you plan to put SMIL content on the Web. And be prepared for significant differences in the way your content is presented from player to player, even if you don't use vendor extensions. In short, if you can't guarantee a particular SMIL 1.0 player is going to be used by your audience, the realistic advice is to not use SMIL. If you want to use SMIL, try to embed the player rather than merely referencing the file, which is about all you can do to make the file playable under a controlled environment.

Another main problem of SMIL 1.0 is the lack of support for HTML and the unexpected behavior when displaying transparent GIF89 images. The fact that text formatting is restricted to proprietary extensions limits the technology, especially for common uses like replacing PowerPoint presentations.

That being said, there are many environments where the player conditions are closed and the shortcomings are acceptable, making SMIL a reasonable alternative:

Sequencing of advertisement and content inside a particular player. RealPlayer developers use SMIL for this purpose often.
Simple prototyping and storyboarding of video content, by elongating the duration of still images. This is an inexpensive and often nice use of SMIL.
Closed environments where the elements of content don't change much, but need to be reorganized in many ways, easily and inexpensively. Think for example of a kiosk in a large museum with pictures of each room, providing directions to users. The pictures don't change at all but depending on where you are and where you want to go, the system must show you a different sequence of pictures. It is a lot cheaper to create and maintain simple text files with SMIL sequences than to edit each sequence as a long video in Premiere (or some other video tool).
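
To make the kiosk case concrete, one route could be a SMIL file as small as the sketch below. Every name in it (the region, the image files, the dimensions, the durations) is hypothetical; the point is only that a new route is a new short text file reusing the same pictures.

```xml
<smil>
  <head>
    <layout>
      <root-layout width="320" height="240"/>
      <region id="photo" left="0" top="0" width="320" height="240"/>
    </layout>
  </head>
  <body>
    <!-- Hypothetical route from the lobby to the sculpture hall -->
    <seq>
      <img src="lobby.jpg" region="photo" dur="4s"/>
      <img src="east-corridor.jpg" region="photo" dur="4s"/>
      <img src="sculpture-hall.jpg" region="photo" dur="4s"/>
    </seq>
  </body>
</smil>
```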

Summary

This article caught up with three years of SMIL, studying the elements in version 1.0 of the language by solving a real and practical problem: inexpensive and flexible annotation of videos. It also examined the real state of SMIL and its mainstream players, and recommended ways to deal with some problems, given the current support and the language's shortcomings.

In the next article we will look ahead and see how the new modularization and content-related additions to the SMIL language make it an interesting new tool to improve narrative technology on the Web.