The Silent Soundtrack

February 2, 2005

John E. Simpson

Like many hearing-impaired movie fans (and more than a few with normal hearing), I've gotten used to seeking out movies whose packaging displays the little National Captioning Institute logo (a registered trademark):

[Image: National Captioning Institute logo]

It's taken a while for movie producers and distributors to catch up to the closed-captioning capabilities of the hardware, but they're almost there. Yet in one important area, content is still all too often obscured from my earnest attention: computerized multimedia. From games to Flash and Shockwave animations to QuickTime and Windows Media clips, what's happening on my PC is frequently just flat-out lost on me.

Computers... text... hmmm. You'd think XML might come to the rescue here. And so it does.

Types of Captioning

To begin, closed captioning isn't the only form of sound-to-text assistance available. Some definitions:

  • Subtitles are translations of a film/TV show's dialogue into a language which is understandable by a local audience. They're useful not just to the hearing-impaired--it's not a matter of hearing what's being said, but of understanding it. They also (in my experience) never provide any information about sounds other than speech. If a door creaks open off-camera, the subtitles don't note it.
  • Open captioning transcribes both speech and other sounds into text. This text is actually "embedded" into the image projected onto the big or small screen; you can't show the film (or whatever) without showing the captioning. Unlike subtitles, which always display the text in the same part of the screen (usually centered at the bottom), open captions can be shifted around to indicate who's speaking. (An open-captioned version of Saving Private Ryan was released, roughly concurrently with the non-captioned one, presumably to accommodate hearing-impaired World War II veterans.)
  • Closed captioning is identical to the open variety, with one difference: the sound-to-text information is displayed only if the viewer asks for it. For you to see this text, your projection device (movie or TV screen, computer monitor or web browser) must have special circuitry or software for decoding and displaying the closed-captioning portion of the signal.
  • "Rear-window" captioning, a more recent development, is meant for use in movie theaters. This rather complicated, Rube-Goldberg-style system displays LED captions, backwards, on a wall at the back of the theater; people who want to see the captions use little rear-view-mirror devices at their seats to mirror the captions--which appear to the users to be on-screen. (More information on rear-window captioning is available from the National Center for Accessible Media, or NCAM.)

Of these varieties, closed captioning is most suitable for use with computerized multimedia. Note one problem with closed captions, though: the text isn't actually part of the default visual signal, so there's no "built-in" way to associate a voice or sound with a particular on-screen text display. The creak of the door and the actor's urgently whispered "What's that?!" must be somehow synchronized to the respective captions. It's this synchronization, not necessarily the captions themselves, which these XML-based solutions address.

Note: A great source of information on adding captions to web multimedia is the Web Accessibility in Mind (WebAIM) site.


SMIL

Synchronized Multimedia Integration Language (SMIL) is a wholly XML-based standard for multimedia presentations on the web. Its success is due largely to the early support of RealNetworks Inc., home of the RealOne Player multimedia software; while not the only SMIL player available, RealOne Player is easily the most widespread. (Even if you've never used SMIL, let alone its captioning feature, you've probably got the RealOne Player client on your computer.) Importantly, SMIL is not in itself a "multimedia format" like QuickTime or Flash. Instead, it provides a standard for bundling references to files in such formats--for packaging multimedia, as it were.

Examples of synching up the captions with the visuals in a SMIL presentation largely follow the guidelines laid down by RealNetworks for use in its own player. Predictably, RealNetworks refers to this as "RealText." The text itself is actually kept in an external file with a filename extension of .rt, identified using the src attribute of the SMIL textstream element. (RealOne Player can also "play" simple .txt files, by the way.) The RealText file includes not only the text itself, but the location, size, and other attributes of the "window" in which the text will be displayed. This window usually occupies a portion of the one used by RealOne Player for the presentation as a whole. Following is a simple example, taken from the RealNetworks SMIL production guide:

<window height="250" width="300"
  duration="15" bgcolor="yellow">
Mary had a little lamb,
<br/><time begin="3"/>little lamb,
<br/><time begin="6"/>little lamb,
<br/><time begin="9"/>Mary had a little lamb
<br/><time begin="12"/>whose fleece was white as snow.
</window>

Here, the lines in the nursery rhyme--the captions, say--occupy a window 250 pixels high by 300 pixels wide, with a background color of yellow. The first line appears immediately; each of the remaining four is timed (via the four time elements) to appear three seconds after the one before it. (Again, at this point there's no actual SMIL code; this is just the RealText content.)
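
To make the timing concrete, here's a minimal Python sketch that reads the nursery-rhyme sample and lists when each line appears. It assumes the RealText content happens to be well-formed XML, as this sample is (real .rt files may not be), and the function name caption_schedule is my own invention, not anything from RealNetworks.

```python
import xml.etree.ElementTree as ET

# The RealText sample from the article, with its closing tag.
REALTEXT = """<window height="250" width="300" duration="15" bgcolor="yellow">
Mary had a little lamb,
<br/><time begin="3"/>little lamb,
<br/><time begin="6"/>little lamb,
<br/><time begin="9"/>Mary had a little lamb
<br/><time begin="12"/>whose fleece was white as snow.
</window>"""


def caption_schedule(realtext):
    """Return (seconds, text) pairs, one per run of caption text.

    Hypothetical helper: each <time begin="N"/> element moves the clock
    to N seconds, and text that follows it is stamped with that value.
    """
    root = ET.fromstring(realtext)
    current = 0.0  # text before any <time> element shows at time zero
    schedule = []
    # Mixed content: the window's own leading text, then each child's tail.
    pieces = [(None, root.text)] + [(child, child.tail) for child in root]
    for element, text in pieces:
        if element is not None and element.tag == "time":
            current = float(element.get("begin"))
        text = (text or "").strip()
        if text:
            schedule.append((current, text))
    return schedule


for seconds, line in caption_schedule(REALTEXT):
    print(f"{seconds:>4.0f}s  {line}")
```

Run against the sample, this prints five lines stamped 0, 3, 6, 9, and 12 seconds.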

The text and visuals are synched up in a controlling SMIL file, which points to both the audio/video content and the captioning text. These pointers are wrapped together in a par (for "parallel [content streams]") element. For instance:

<par id="lesson">
  <audio src="soundtrack.rm" syncBehavior="locked" .../>
  <ref src="training.swf" syncBehavior="canSlip" .../>
  <textstream src="translation.rt" syncBehavior="locked" .../>
</par>

Here, the audio is provided by a RealMedia file (soundtrack.rm); the video, by a Flash presentation (training.swf); and the text, by a RealText file (translation.rt). The syncBehavior attributes specify whether the content is allowed to "slip" with respect to the timing of the parent container (the par element). This example says, in effect, that it's all right if the Flash animation takes longer to load than the soundtrack and accompanying text, but that the audio and text are bound up in the presentation's overall time stream. If the sound and/or text gets hung up, the presentation as a whole will pause.
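
As a rough illustration of that rule, the following Python sketch reads a par element and reports which streams would hold up the whole presentation if they stalled. It's a simplification under stated assumptions: the elided attributes from the example are simply left out, there's no SMIL namespace, and the helper names (locked_sources, pauses_presentation) are hypothetical. A real player's scheduling is certainly more involved.

```python
import xml.etree.ElementTree as ET

# The par example from the article, minus the elided attributes.
PAR = """<par id="lesson">
  <audio src="soundtrack.rm" syncBehavior="locked"/>
  <ref src="training.swf" syncBehavior="canSlip"/>
  <textstream src="translation.rt" syncBehavior="locked"/>
</par>"""


def locked_sources(par_xml):
    """List the src of every child stream marked syncBehavior="locked"."""
    par = ET.fromstring(par_xml)
    return [child.get("src") for child in par
            if child.get("syncBehavior") == "locked"]


def pauses_presentation(par_xml, delayed_src):
    """True if a delay in delayed_src would pause the whole par (locked),
    False if that stream is allowed to slip or is not in the par."""
    return delayed_src in locked_sources(par_xml)


print(locked_sources(PAR))
print(pauses_presentation(PAR, "training.swf"))
```

Here the audio and the text stream are the locked ones, so a slow-loading training.swf doesn't pause anything.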

Captioning Via Other Standards

SMIL has behind it the force (such as it is) of the W3C. But as a multimedia-over-the-web format, it's nowhere near as prevalent as other, more proprietary "standards," particularly Macromedia's Flash, Apple's QuickTime, and Microsoft's Windows Media.

The catch with these formats is that none of them makes provision for external text streams, accessed in real time; that is, the captioning must somehow be embedded in (say) the Flash presentation itself. If you change the captions, you've got to rebuild the entire thing. Fortunately, third-party tool developers have come to the rescue.

Standard disclaimer: As always here in "XML Tourist," don't expect the following discussion to be exhaustive. It's merely representative.

Hi-Caption for Macromedia Flash

This extension to the Flash software is a product of HiSoftware Solutions. It directs Flash to process the contents of an external file which spells out the display of captions in a "captioning control panel." The XML in this external file must conform to the Hi-Caption DTD; the file will contain a header (the hmccheader element, which points to the source file for the visual component of the presentation) and one or more so-called caption sets (captionset elements, reasonably enough).

One interesting feature of the hmccheader wrapper element, defined by a ccStyle child, is a language identifier (similar to XML's built-in xml:lang attribute). This says in effect that a caption displayed in Style X will be in English, while one in Style Y will be in French, or Spanish, or whatever, allowing the controlling Flash viewer to select the locally appropriate language. For example:

<ccStyle ccStyleName="ENUSCC"
  ccStyleType="caption" ccLang="en-US"
  ccName="&apos;English Captions&apos; lang: en-US">
  . . .
</ccStyle>
<ccStyle ccStyleName="escc-blanca"
  ccStyleType="caption" ccLang="es">
  . . .
</ccStyle>

(There's no "auto-translate" feature here, by the way, any more than there is using xml:lang. If you want captions in alternative languages, you must provide the translations yourself.)

The ccStyle element also spells out features such as fonts, size, and placement of the caption(s) associated with that style.
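
How a viewer might pick the "locally appropriate" style is easy to sketch in Python. Everything beyond the ccStyle attributes shown above is an assumption on my part: the styles wrapper element is a stand-in for the real header, the fallback from a regional tag such as es-MX to plain es is my own guess at sensible behavior, and pick_style is a hypothetical name, not part of Hi-Caption.

```python
import xml.etree.ElementTree as ET

# <styles> is a stand-in wrapper for illustration only.
STYLES = """<styles>
  <ccStyle ccStyleName="ENUSCC" ccStyleType="caption" ccLang="en-US"/>
  <ccStyle ccStyleName="escc-blanca" ccStyleType="caption" ccLang="es"/>
</styles>"""


def pick_style(styles_xml, preferred_lang):
    """Return the ccStyleName whose ccLang best matches preferred_lang.

    Tries an exact (case-insensitive) match first, then falls back to
    comparing only the primary language subtag; returns None if no
    style matches at all.
    """
    root = ET.fromstring(styles_xml)
    wanted = preferred_lang.lower()
    by_lang = {style.get("ccLang").lower(): style.get("ccStyleName")
               for style in root.iter("ccStyle")}
    if wanted in by_lang:
        return by_lang[wanted]
    primary = wanted.split("-")[0]
    for lang, name in by_lang.items():
        if lang.split("-")[0] == primary:
            return name
    return None


print(pick_style(STYLES, "es-MX"))
```

With the two styles above, a viewer asking for es-MX gets escc-blanca, and one asking for fr gets nothing at all, which is the point of the "no auto-translate" caveat.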

In the captionset element(s) you'll find the caption text itself, plus information identifying the speaker and the timing (start point only) of the caption. You might find something like this:


  <cc start="2">
    <caption>Why don't you begin by telling us where you grew up.</caption>
  </cc>
  <cc start="5">
    <caption>It was a small town in southern New Jersey, on the Delaware . . .</caption>
  </cc>

The hmccheader element preceding the caption set establishes the units in which the start attributes' values are expressed. If the units in this example are seconds, the fragment says that this text is displayed two seconds into the presentation:

Interviewer: Why don't you begin by telling us...

This would be followed, three seconds later, by:

Candidate: It was a small town in southern New Jersey...
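
Since only start points are recorded, a viewer has to work out for itself how long each caption stays up. Here's a small Python sketch of one way to do that. It assumes (and this is my assumption, not something Hi-Caption documents here) that a caption remains on screen until the next one begins, and that the last one runs to the end of the presentation. The cue data is typed in by hand rather than parsed, and display_windows is a hypothetical name.

```python
# (start in seconds, speaker, caption text), as in the example above.
CUES = [
    (2, "Interviewer", "Why don't you begin by telling us..."),
    (5, "Candidate", "It was a small town in southern New Jersey..."),
]


def display_windows(cues, total_length):
    """Turn start-only cues into (start, end, label) display windows.

    Assumes each caption is replaced by the next one; the final caption
    is held until total_length, the presentation's length in seconds.
    """
    ordered = sorted(cues)
    windows = []
    for index, (start, speaker, text) in enumerate(ordered):
        if index + 1 < len(ordered):
            end = ordered[index + 1][0]
        else:
            end = total_length
        windows.append((start, end, f"{speaker}: {text}"))
    return windows


for start, end, label in display_windows(CUES, total_length=10):
    print(f"{start:>3}-{end:<3} {label}")
```

For a ten-second presentation, the interviewer's caption occupies seconds 2 through 5 and the candidate's runs from 5 to 10.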


MAGpie

The Media Access Generator (MAGpie) is a project of NCAM, mentioned above in the discussion of rear-window captioning.

Unlike Hi-Caption, MAGpie is meant for providing captions across a variety of multimedia formats: RealMedia, QuickTime, and Windows Media. (A Flash-compatible extension is currently in beta.) A Java-based package, MAGpie keeps the XML-encoded captioning information--as well as XSLT stylesheets for it--in "project files." (The XSLT is eventually meant to be included with the MAGpie installation; until then, NCAM is soliciting contributions.)

The MAGpie user interface looks like this:

[Figure: MAGpie user interface]

The controls in the interface allow you not just to specify the speaker and caption itself, but to set the characteristics of the caption's display--font, placement, timing, and so on. MAGpie also includes a media viewer enabling you to see how the captions will look in the final multimedia product; that's the purpose of the slider control at the right side of the upper toolbar.

Here's a portion of the project file (many Property elements omitted for simplicity) produced from the above:

<MagpieProject baseMedia="">
  <Property name="lastModified" value="Wed Jan 5 22:32:53 EST 2005"/>
  <Property name="media.toolkit" value="quicktime"/>
  <Property name="videoArea.height" value="240"/>
  <Property name="captionArea.width" value="320"/>
  <CaptionTrack trackName="Track1Captions" trackType="Caption"
    country-code="US" language-code="en">
    <Property name="country-name" value="United States"/>
    <Property name="language-name" value="English"/>
    <Caption speaker="John"
      startTime="00:00:00.3000" endTime="00:00:01.0000">
      <PlainTextCaption>Caption here!</PlainTextCaption>
      <HtmlStyledCaption><![CDATA[<html><body><font face="Arial">
<font size="2"><font color="#FFFFFF"><font bgcolor="#000000">Caption
here!</font></font></font></font></body></html>]]></HtmlStyledCaption>
      <HtmlStyledSpeaker><![CDATA[<html><body><font face="Arial">
<font size="2"><font color="#FFFFFF"><font bgcolor="#000000">John</font>
</font></font></font></body></html>]]></HtmlStyledSpeaker>
    </Caption>
  </CaptionTrack>
</MagpieProject>

As you can see, the heart of the document is the CaptionTrack element and its contents. As does Hi-Caption, MAGpie allows for localization (here, through country-code and language-code attributes). As for the caption's display, it depends on whether the user (and/or the viewing software) prefers plain or styled text; in the latter case, styles are achieved by way of embedded HTML.
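
For the curious, here's a minimal Python sketch of pulling the plain-text captions for one language out of such a project file. The project text is a trimmed copy of the example (Property and HtmlStyled elements left out), and I'm assuming the timestamps read as hours:minutes:seconds with a fractional seconds part. The helper names to_seconds and plain_captions are mine, not MAGpie's.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the project file above; only the parts used here.
PROJECT = """<MagpieProject baseMedia="">
  <CaptionTrack trackName="Track1Captions" trackType="Caption"
    country-code="US" language-code="en">
    <Caption speaker="John"
      startTime="00:00:00.3000" endTime="00:00:01.0000">
      <PlainTextCaption>Caption here!</PlainTextCaption>
    </Caption>
  </CaptionTrack>
</MagpieProject>"""


def to_seconds(timestamp):
    """Convert "HH:MM:SS.ffff" to seconds (assumed timestamp layout)."""
    hours, minutes, seconds = timestamp.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def plain_captions(project_xml, language):
    """Return (start, end, speaker, text) for every caption in tracks
    whose language-code attribute equals the requested language."""
    root = ET.fromstring(project_xml)
    rows = []
    for track in root.iter("CaptionTrack"):
        if track.get("language-code") != language:
            continue
        for caption in track.iter("Caption"):
            rows.append((
                to_seconds(caption.get("startTime")),
                to_seconds(caption.get("endTime")),
                caption.get("speaker"),
                caption.findtext("PlainTextCaption"),
            ))
    return rows


print(plain_captions(PROJECT, "en"))
```

A viewer preferring styled text would read the HtmlStyledCaption element instead; the lookup by language-code stays the same.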

Note: As I did, you may have winced when you saw the contents of the HtmlStyled elements: all that HTML (including--shudder--font elements) embedded within CDATA sections. I recommend that you, too, just close your eyes and roll with it for now and assume the program's authors must have their reasons.

The Bottom Line

Regardless of the multimedia format a website uses, of course, there's ultimately one aspect of closed captioning for which there are no absolute requirements: the human decision to provide captioning in the first place. Some developers may not know of W3C-blessed "standards" like SMIL and the Web Accessibility Initiative (WAI). Some may know of those standards but elect to look the other way (because, naturally, there's no out-and-out law requiring standards compliance). And in some cases, issues like "the artist's vision" may simply make captions an unattractive option. (When you're an artist, I gather, it's difficult to cede a portion of your expensive canvas to mere words.)

I don't know what the answer is. I do know that, since multimedia presentations which include spoken words and sound effects are always based on scripts, simply copying and pasting the script into a captions file, say, shouldn't be too awfully difficult. (On the other hand, true, synching text to audio does require some time and effort.)

And I also know it would be nice if I (and a gazillion other users of the web) could participate more fully in the online experience. How weird that such an inherently democratic medium remains, in this respect (and in others), so casually exclusive!