Adapting Content for VoiceXML
This article is part two of Write Once, Publish Everywhere, a project to create an XML-based server for PC, WAP (Wireless Application Protocol), and voice browsers.
Author's note: Between the time of writing and publication of the second half of this article, the W3C's XForms working group changed its mind about the model specification. In their recent working draft they observed that the former draft "has now been obsoleted while the Working Group studies how to combine XForms with XML Schema." Our experiment in this article is based on the previous draft -- which is not based on XML Schemas, but on a model particular to XForms.
Not all devices are able to accept and transform an XML document into something we can perceive with our senses. Most of the time, the target device supports only a rendering interpreter. Therefore we need to transform the abstract model, as held in our server-side XML document, into a format the rendering interpreter understands. Browsers with more resources at their disposal will probably perform the transformation on the client side in the near future, but devices with fewer resources require that any transformation from an XML model to a rendering format occur on the server side.
To transform the <xdoc> element and its content, we'll use XSLT (Extensible Stylesheet Language Transformations). Depending on the device profile, we perform one of three different transformations:
xdoc into HTML, for PC web browsers. XSLT source: xdoc-html.xsl
xdoc into WML (Wireless Markup Language), for WAP devices. XSLT source: xdoc-wml.xsl
xdoc into VoiceXML, for voice browsers (plain old phones and mobile phones). XSLT source: xdoc-vxml.xsl
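To give a flavor of these transformations (the real stylesheets are the ones named above), a much-simplified template from something like xdoc-html.xsl might render an <xform> as an HTML form along these lines. This is an illustration only: the layout details are assumptions, and the xdoc structure it matches is the one shown later in this article.

<?xml version="1.0"?>
<!-- Simplified sketch only; the real xdoc-html.xsl differs in detail. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="xform">
    <form action="{@action}" method="{@method}">
      <!-- one text input per <string> element in the XForms model -->
      <xsl:for-each select="model/group/string">
        <p><xsl:value-of select="@name"/>: <input type="text" name="{@name}"/></p>
      </xsl:for-each>
      <input type="submit" value="{@title}"/>
    </form>
  </xsl:template>
</xsl:stylesheet>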
In this article, we'll introduce the VoiceXML rendering
language. We'll demonstrate how the login.xml document can
be heard by a user, and how a form can be filled out verbally through a voice browser or by using the phone's number keys. The voice browser used for this lab experiment is the Tellme voice browser. XML developers can test their VoiceXML applications for free (until October 31, 2000) using the Tellme studio and its toll-free number. To create a VoiceXML
developer account, go to http://studio.tellme.com/. (The
studio even incorporates a debugger.)
As you can see in the illustration below, the VoiceXML architecture bears some resemblance to the WAP architecture. Between the phone and the HTTP server sits a voice server. This voice server interprets the VoiceXML documents and acts as a middleware processor between the HTTP server and the phone. (However, the VoiceXML interpreter could also be built into a rendering device, such as a car radio, that is wirelessly connected to the Web. In that case, the voice browser is located on the client side and there is no need for an additional voice server.) Most of the time, though, the VoiceXML architecture will be structured as in the illustration below.

[Figure: VoiceXML serving architecture]
Inside the VoiceXML interpreter resides a voice recognition and synthesis engine used to automate a conversation between a machine and a human being. The phone can be connected to the voice server over either a wireline or a wireless network.
Later we'll look at the anatomy of a VoiceXML document, but before doing that let's summarize the document processing sequence in this architecture. When a user connects to the XML server, the server transforms the login.xml document into the appropriate rendering format.
When the device in
question is a voice browser, the xdoc-vxml.xsl style sheet transforms
the original XML document into a VoiceXML document. Finally, the VoiceXML document
is sent to and interpreted by the voice browser.
The conditional inclusion mechanism (explained in
last week's article) creates
an xdoc document including only one <xform> element. Only the
<xpart> elements matching the "vxml" device profile are replaced with
external content. For voice browsers, only one form is
included:
<xdoc xmlns:xinclude="http://www.w3.org/1999/XML/xinclude"
      xmlns:xlink="http://www.w3.org/TR/xlink">
  <xform action="http://www.talva.dyndns.org/home/login.asp"
         method="post"
         title="log in"
         id="logIn_Form">
    <model>
      <group name="LogIn">
        <string name="userID"/>
        <string name="password"/>
      </group>
    </model>
  </xform>
</xdoc>
The style sheet used to transform this document into VoiceXML is posted online.
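That posted stylesheet is the definitive version. As a rough, simplified sketch of the approach it takes (the grammars, feedback, error handling, and final submit described below are omitted here), it might contain a template along these lines:

<?xml version="1.0"?>
<!-- Simplified sketch only; the real xdoc-vxml.xsl adds grammars, feedback,
     error handling, and the final submit described in this article. -->
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:template match="xform">
    <vxml version="1.0">
      <!-- one VoiceXML form per <string> element in the XForms model -->
      <xsl:for-each select="model/group/string">
        <form id="{@name}">
          <field name="{@name}">
            <prompt>
              <audio>Please dial or say your <xsl:value-of select="@name"/></audio>
            </prompt>
          </field>
        </form>
      </xsl:for-each>
    </vxml>
  </xsl:template>
</xsl:stylesheet>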
A VoiceXML document contains a single
<vxml> element, which is the root element.
The basic units of a VoiceXML document
are dialogs, specified as <form> elements,
and menus, identified by
<menu> elements.

[Figure: Basic structure of a VoiceXML document]
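In markup, that basic structure looks roughly like this (the id values here are placeholders):

<?xml version="1.0"?>
<vxml version="1.0">
  <!-- a dialog -->
  <form id="someDialog">
    <!-- fields, prompts, and grammars go here -->
  </form>
  <!-- a menu -->
  <menu id="someMenu">
    <!-- menu choices go here -->
  </menu>
</vxml>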
In our example, we'll be using the form to obtain a user ID and a password from the user. Because we are using a phone, we'll ask the user for five digits for the user ID and four digits for the password. Digits are easily typed on phone keys, or simply spoken into a phone.
The XForms <string> elements from the document above, userID and password, are transformed into VoiceXML <form> elements. Our document structure is now as shown below:

[Figure: VoiceXML document structure with login form]
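In other words, the generated login document has roughly this shape, with one form (and one field) per <string> element from the XForms model. This is a sketch of the structure only:

<vxml version="1.0">
  <form id="userID">
    <field name="userID">
      <!-- prompt and grammar are added below -->
    </field>
  </form>
  <form id="password">
    <field name="password">
      <!-- prompt and grammar are added below -->
    </field>
  </form>
</vxml>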
Each <field> element should contain a <grammar> element. A VoiceXML grammar is a set of rules that specifies what the voice server will recognize. By including a <grammar> in a field element, we limit the scope of the grammar rules to that field's context. For instance, in our experiment, we use the Tellme default library grammar named "Five-digit" for the <field> element associated with the user ID, and the "Four-digit" library grammar for the password.

[Figure: Login VoiceXML document with grammars]
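For example, the user ID field might reference its grammar like this. The grammar src value below is a placeholder; the exact way to reference Tellme's "Five-digit" library grammar is given in the Tellme studio documentation.

<form id="userID">
  <field name="userID">
    <!-- placeholder reference; substitute the actual Tellme "Five-digit" grammar -->
    <grammar src="five_digit_grammar_url"/>
  </field>
</form>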
Now that we have defined acceptable grammars for the form elements, we need to set up the dialog between the user and the browser.
This is achieved with the <prompt> element, which
specifies what the voice browser will say. For instance, the first
form asks for the user ID:
<prompt>
<audio>Please dial or say your five-digit user ID</audio>
</prompt>
The <audio> element instructs the voice browser to use the text-to-speech engine to read the element's text content and speak it over the phone. After the prompt, the voice browser waits for an answer from the interlocutor at the other end of the phone line. When the person says or dials the user ID, the <filled> element causes the browser to provide feedback to the interlocutor by saying what the speech recognition engine understood. Then it branches to the next form by following the <goto> element, this time to ask for the password.
<filled>
  <audio>I heard you say <value expr="userID"/></audio>
  <goto next="#password"/>
</filled>
The <filled> element is used in the second
form to post the user ID and password variables to the server for
further processing. (Check out the XSLT
stylesheet that generates the VoiceXML document to see where
this is used).
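As a sketch of what the generated second form might look like, consider the fragment below. The prompt wording, the grammar reference, and the document-level variable used to carry the user ID across forms are illustrative assumptions, not copied from the actual stylesheet output.

<!-- document-scoped variable assumed to be set by the first form before its <goto> -->
<var name="userID"/>

<form id="password">
  <field name="password">
    <!-- placeholder for the Tellme "Four-digit" library grammar -->
    <grammar src="four_digit_grammar_url"/>
    <prompt>
      <audio>Please dial or say your four-digit password</audio>
    </prompt>
    <filled>
      <audio>I heard you say <value expr="password"/></audio>
      <submit next="http://www.talva.dyndns.org/home/login.asp"
              method="post" namelist="userID password"/>
    </filled>
  </field>
</form>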
In each form, we also added a very rudimentary error-handling mechanism to catch when the user does not dial or say the user ID or
password. This is handled by the <noinput> element, which specifies what to do when, after a certain time, the request remains unanswered.
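A minimal handler of this kind might look as follows; the message wording is illustrative.

<noinput>
  <!-- played when the caller stays silent past the timeout -->
  <audio>Sorry, I did not hear you.</audio>
  <!-- replay the field's prompt and listen again -->
  <reprompt/>
</noinput>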
So the whole dialog created by the VoiceXML document runs as follows: the browser asks for the five-digit user ID, echoes back what it understood, moves on to ask for the four-digit password, echoes that back, and then submits both values. The login process is now ended, and the HTTP server receives the user ID and password variables to be processed.
Use the following link to try the login process in action: http://www.talva.com/demo/login.htm. This HTML page will allow you to browse the WAP application without the need for a mobile phone, or to see the HTML version. Instructions for trying out the VoiceXML version are also on that page.
The best way to achieve device independence is to create an abstract model, then to transform this abstract model into the proper rendering format. XSLT style sheets can, however, become too complex when they include environment-specific logic. The technique of conditional inclusion is useful for reducing stylesheet complexity by modifying the abstract model for a particular device profile.
A special thank you is due to Stephane Larocque, who helped build this experiment.