Adapting Content for VoiceXML

August 23, 2000

Didier Martin

This article is part two of Write Once, Publish Everywhere, a project to create an XML-based server for PC, WAP (Wireless Application Protocol), and voice browsers.

Author's note: Between the time of writing and publication of the second half of this article, the W3C's XForms working group changed its mind about the model specification. In their recent working draft they observed that the former draft "has now been obsoleted while the Working Group studies how to combine XForms with XML Schema." Our experiment in this article is based on the previous draft -- which is not based on XML Schemas, but on a model particular to XForms.

Not all devices are able to accept an XML document and transform it into something we can perceive with our senses. Most of the time, the target device supports only a rendering interpreter. Therefore we need to transform the abstract model, as held in our server-side XML document, into a format the rendering interpreter understands. Browsers with more resources at their disposal will probably perform the transformation on the client side in the near future. But devices with fewer resources require that any transformation from an XML model to a rendering format occur on the server side.

To transform the <xdoc> element and its content, we'll use XSLT (Extensible Stylesheet Language Transformations). Depending on the device profile, we perform one of three different transformations:

  • xdoc into HTML, for PC web browsers. XSLT source: xdoc-html.xsl

  • xdoc into WML (Wireless Markup Language) for WAP devices. XSLT source: xdoc-wml.xsl

  • xdoc into VoiceXML for voice browsers (plain old phones and mobile phones). XSLT source: xdoc-vxml.xsl

In this article, we'll introduce the VoiceXML rendering language. We'll demonstrate how the login.xml document can be heard by a user, and how a form can be filled out verbally through a voice browser or by using the number keys on the wireless device. The voice browser used for this lab experiment is the Tellme voice browser. XML developers are able to test their VoiceXML applications for free (until October 31, 2000) using the Tellme studio and the toll-free number. To create a VoiceXML developer account, go to the Tellme studio site. (The studio even incorporates a debugger.)

As you can see in the illustration below, the VoiceXML architecture bears some resemblance to the WAP architecture. Between the phone and the HTTP server sits a voice server. This voice server interprets the VoiceXML documents and acts as a middleware processor between the HTTP server and the phone. (The VoiceXML interpreter could instead be built into the rendering device itself, such as a car radio that is wirelessly connected to the Web. In that case, the voice browser is located on the client side and there is no need for an additional voice server.) Most of the time, however, the VoiceXML architecture will be structured as in the illustration below.

VoiceXML serving architecture

Inside the VoiceXML interpreter resides a voice recognition and synthesis engine used to automate a conversation between a machine and a human being. The caller's phone can reach the voice server over either a wireline or a wireless network.

Later we'll look at the anatomy of a VoiceXML document, but before doing that let's summarize the document processing sequence in this architecture. When a user connects to the XML server, it:

  • recognizes the user agent,
  • creates a device profile,
  • runs the XInclude processor to modify the document's infoset,
  • selects a style sheet associated with the device profile, and
  • transforms the login.xml document into the appropriate rendering format.

When the device in question is a voice browser, the xdoc-vxml.xsl style sheet transforms the original XML document into a VoiceXML document. Finally, the VoiceXML document is sent to and interpreted by the voice browser.

The conditional inclusion mechanism (explained in last week's article) creates an xdoc document including only one <xform> element. Only the <xpart> elements matching the "vxml" device profile are replaced with external contents. For voice browsers, only one form is included:

<xdoc xmlns:xinclude="">
  <xform action=""
         title="log in">
    <group name="LogIn">
      <string name="userID"/>
      <string name="password"/>
    </group>
  </xform>
</xdoc>

The style sheet used to transform this document into VoiceXML is posted online.

Anatomy of a VoiceXML Document

A VoiceXML document contains a single <vxml> element, which is the root element. The basic units of a VoiceXML document are dialogs, of which there are two kinds: forms, specified as <form> elements, and menus, specified as <menu> elements.

Basic structure of a VoiceXML document
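
As a rough sketch of the structure just described (not taken from the generated output of this experiment), a minimal document with one menu and one form might look like the following. The dialog ids "main" and "login", the menu wording, and the prompt text are assumptions made for illustration. Spoken text is wrapped in <audio>, following the convention used elsewhere in this article.

```xml
<?xml version="1.0"?>
<!-- Illustrative skeleton only; the ids and prompt wording are assumed. -->
<vxml version="1.0">

  <!-- A menu offers choices and branches to another dialog. -->
  <menu id="main">
    <prompt><audio>Say log in to continue.</audio></prompt>
    <choice next="#login">log in</choice>
  </menu>

  <!-- A form collects values from the caller through its fields. -->
  <form id="login">
    <field name="userID">
      <prompt><audio>Please dial or say your five-digit user ID</audio></prompt>
    </field>
  </form>

</vxml>
```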

In our example, we'll be using the form to obtain a user ID and a password from the user. Because we are using a phone, we'll ask the user for five digits for the user ID and four digits for the password. Digits are easily typed on phone keys, or simply spoken into a phone.

The XForm <string> elements from the above XForm document -- userID and password -- are transformed into VoiceXML <form> elements. Our document structure is now as shown below:

VoiceXML document structure with login form
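
The mapping from XForm <string> elements to VoiceXML <form> elements could be sketched in XSLT roughly as follows. This is not the xdoc-vxml.xsl style sheet used in the experiment, only a minimal illustration of the idea. It assumes the <group> and <string> elements are in no namespace, and the prompt wording (which simply reuses each element's name attribute) is a placeholder.

```xml
<?xml version="1.0"?>
<!-- Minimal sketch, not the article's xdoc-vxml.xsl.
     Assumes <group> and <string> are in no namespace. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <!-- The document root becomes the single <vxml> root element. -->
  <xsl:template match="/">
    <vxml version="1.0">
      <xsl:apply-templates select="//group/string"/>
    </vxml>
  </xsl:template>

  <!-- Each <string> becomes one <form> holding one <field> of the same name. -->
  <xsl:template match="string">
    <form id="{@name}">
      <field name="{@name}">
        <prompt>
          <!-- Placeholder wording built from the name attribute. -->
          <audio>Please dial or say your <xsl:value-of select="@name"/></audio>
        </prompt>
      </field>
    </form>
  </xsl:template>

</xsl:stylesheet>
```

Applied to the two <string> elements shown earlier, this sketch would emit a form with id "userID" and a form with id "password", which is what allows one form to branch to the other by fragment identifier.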

Each <field> element should contain a <grammar> element. A VoiceXML grammar is a set of rules that specify what will be recognized by the voice server. By including a <grammar> in a <field> element, we limit the scope of the grammar rules to the field's context. For instance, in our experiment, we use the Tellme default library grammar named "Five-digit" for the <field> element associated with the user ID, and the "Four-digit" library grammar for the password.

Login VoiceXML document with grammars
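
The Tellme library grammar references themselves are not reproduced here. As a stand-in, and assuming a browser that implements the built-in field types of VoiceXML 1.0, a field can select a built-in grammar through its type attribute instead of a <grammar> child. Note that this swaps in a different technique from the one used in the experiment: the built-in digits type accepts spoken or keyed digit strings, but does not by itself fix the length at five or four digits.

```xml
<form id="userID">
  <!-- type="digits" selects the built-in digit-string grammar,
       which applies only while this field is being collected. -->
  <field name="userID" type="digits">
    <prompt><audio>Please dial or say your five-digit user ID</audio></prompt>
  </field>
</form>
```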

Now that we have defined the acceptable grammars for the form elements, we need to set up the dialog between user and browser. This is achieved with the <prompt> element, which specifies what the voice browser will say. For instance, the first form asks for the user ID:


     <prompt>
       <audio>Please dial or say your five-digit user ID</audio>
     </prompt>


The <audio> element instructs the voice browser to use the text-to-speech engine to read the element's content aloud over the phone. After the prompt, the voice browser waits for an answer from the interlocutor at the other end of the phone line. When the person says or dials the user ID, the <filled> element causes the browser to provide feedback by repeating what the speech recognition engine understood. It then branches to the next form by following the <goto> element, this time to ask for the password.


   <filled>
     <audio>I heard you say {document.login.userID}</audio>
     <goto next="#password"/>
   </filled>


The <filled> element is also used in the second form, this time to post the user ID and password variables to the server for further processing. (Check out the XSLT style sheet that generates the VoiceXML document to see where this is used.)
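
One way to carry a value from the first form into the submission made by the second, using only standard VoiceXML 1.0 elements, is sketched below. It is an assumption-laden illustration, not the generated document: the document-level variable savedUserID and the URL http://example.com/login are hypothetical, prompts are omitted for brevity, and the built-in digits type stands in for the Tellme library grammars.

```xml
<?xml version="1.0"?>
<vxml version="1.0">

  <!-- Hypothetical document-scoped variable, visible to both forms. -->
  <var name="savedUserID"/>

  <form id="userID">
    <field name="userID" type="digits">
      <filled>
        <!-- Copy the field value to document scope, then move on. -->
        <assign name="savedUserID" expr="userID"/>
        <goto next="#password"/>
      </filled>
    </field>
  </form>

  <form id="password">
    <field name="password" type="digits">
      <filled>
        <!-- Post both values to the (hypothetical) server URL. -->
        <submit next="http://example.com/login"
                namelist="savedUserID password"
                method="post"/>
      </filled>
    </field>
  </form>

</vxml>
```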

In each form, we also added a very rudimentary error-handling mechanism to catch the case where the user does not dial or say the user ID or password. This is handled by the <noinput> element, which specifies what to do when a request remains unanswered after a certain time.
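
A handler of this kind might look like the following sketch. The apology wording is an assumption, and the built-in digits type again stands in for the Tellme library grammar. The <reprompt/> element asks the browser to play the field's prompt again before listening.

```xml
<form id="userID">
  <field name="userID" type="digits">
    <prompt><audio>Please dial or say your five-digit user ID</audio></prompt>

    <!-- Runs when the caller stays silent past the browser's timeout. -->
    <noinput>
      <audio>Sorry, I did not hear anything.</audio>
      <reprompt/>
    </noinput>
  </field>
</form>
```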

So, the whole dialog created by the VoiceXML document can be summarized as follows:

Tellme: Please dial or say your five-digit user ID
Interlocutor: 23465
Tellme: I heard you say 23465
Tellme: Please dial or say your four-digit password
Interlocutor: 3412

The login process is now ended, and the HTTP server receives the user ID and password variables to be processed.

Trying It Out

Use the following link to try the login process in action: This HTML page will allow you to browse the WAP application without the need for a mobile phone, or to see the HTML version. Instructions for trying out the VoiceXML version are also on that page.


The best way to achieve device independence is to create an abstract model, then to transform this abstract model into the proper rendering format. It is possible, however, for XSLT style sheets to become too complex when they include environment-specific logic. The technique of conditional inclusion helps reduce style sheet complexity by modifying the abstract model for a particular device profile.


A special thank you is due to Stephane Larocque, who helped build this experiment.