Hello, Voice World
September 6, 2000
In our last trip to Didier's Lab, we encountered the aural world of XML made possible by the VoiceXML language. This week I'll explain more about VoiceXML and create the classic "Hello World" application. But this time instead of seeing the result, you'll listen to it. People intrigued by the last article asked me if and how VoiceXML documents are used to build voice applications. Answering this question presents an opportunity to highlight VoiceXML's features, and the way its basic concepts make it very different from HTML or XHTML.
A VoiceXML application is a collection of dialogs. A dialog is the basic interaction unit between the VoiceXML interpreter and an interlocutor. A dialog unit can either be a form or a menu. A form consists of a collection of fields which are filled by the interlocutor. A menu is a choice made by an interlocutor. The figure below shows an example VoiceXML application with the links between the various dialogs shown.
Figure 1: VoiceXML dialog collection
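To make the two dialog types concrete, here is a minimal sketch of a document holding one menu and two forms. It uses VoiceXML 1.0 element names (`<menu>`, `<choice>`, `<form>`, `<field>`); the dialog ids and prompt wording are invented for illustration, and a given interpreter may need adjustments, so read it as an outline rather than a tested application.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- A menu: the interlocutor picks one choice, and the interpreter
       moves to the dialog named by that choice's "next" attribute. -->
  <menu id="main">
    <prompt>Say one of: <enumerate/></prompt>
    <choice next="#survey">Survey</choice>
    <choice next="#bye">Goodbye</choice>
  </menu>

  <!-- A form: a collection of fields filled in by the interlocutor.
       type="boolean" is a built-in yes/no field type in VoiceXML 1.0. -->
  <form id="survey">
    <field name="likes_voice" type="boolean">
      <prompt>Do you enjoy talking to machines?</prompt>
    </field>
  </form>

  <!-- A form with no fields can still say something. -->
  <form id="bye">
    <block>Goodbye.</block>
  </form>
</vxml>
```

The `#survey` and `#bye` references are the links between dialogs: each one names another dialog in the same document by its id.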
Here is the classic "Hello World" application in VoiceXML:
<?xml version="1.0"?>
<!DOCTYPE vxml PUBLIC "-//Tellme Networks//Voice Markup Language 1.0//EN"
  "http://resources.tellme.com/toolbox/vxml-tellme.dtd">
<vxml version="1.0" base="" lang="en" application="">
  <meta name="Author" content="Didier PH Martin"/>
  <meta name="Document" content="The classical Hello World Sample"/>
  <form>
    <block>
      <audio src="http://talva.dyndns.org/vxml/helloWorld.wav">
        Hello world
      </audio>
    </block>
  </form>
</vxml>
Since we are dealing with a talking machine, our "Hello World" application has nothing to show for itself: but it definitely has something to say.
The opening of the document should be familiar: an XML declaration followed by a DOCTYPE declaration indicating where the document type definition (DTD) file is located. Normally, if validation is unnecessary and external entities are not required, the DOCTYPE declaration can be omitted. But if you're testing this "Hello World" application within the Tellme environment, you'll need to include the Tellme DOCTYPE declaration, since Tellme's implementation differs slightly from the one recommended by the VoiceXML consortium. In other words, the declaration is mandatory for the Tellme environment but not necessarily for other VoiceXML interpreters.
The root element (or the document type element), <vxml>, has version, base, lang, and application attributes. The most important of these is the application attribute. It represents a major point of difference between XHTML and VoiceXML applications. In the XHTML world, the contents of the <html> element are rendered, in most current browsers, as an independent scrollable page. In the VoiceXML world, the contents of the <vxml> element are integrated into a larger whole: an application session. Session duration is simply the duration of the user's connection; that is, the time the interlocutor is connected to the VoiceXML interpreter. The session ends when the interlocutor hangs up, or when a VoiceXML document asks the interpreter to hang up.
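As a sketch of the second case, VoiceXML 1.0 defines a `<disconnect>` element that asks the interpreter to drop the connection. The fragment below assumes an interpreter that implements it as specified; the prompt wording is invented.

```xml
<form>
  <block>
    <prompt>Thank you for calling. Goodbye.</prompt>
    <!-- Ask the interpreter to hang up, which ends the session. -->
    <disconnect/>
  </block>
</form>
```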
A VoiceXML application is a set of documents sharing a common application document. The application attribute in a VoiceXML document tells the interpreter which application the document belongs to. Our sample document is part of the Tellme application, which defines such standard behaviors as what to do when the interlocutor says "Tellme menu", what to do when the * key is pressed twice, or what to do when the interlocutor says "Goodbye". The following diagram shows the relationship between the application and its documents.
Figure 2: Hierarchy of VoiceXML Documents
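A rough sketch of this relationship, modeled on the root-document pattern in the VoiceXML 1.0 specification and not on Tellme's actual application document: the file names `app-root.vxml`, `hello.vxml`, and `goodbye.vxml` are hypothetical, and the inline grammar is a placeholder, since grammar formats differ between interpreters.

```xml
<?xml version="1.0"?>
<!-- app-root.vxml (hypothetical): the shared application document -->
<vxml version="1.0">
  <!-- Active in every document that names this file as its application:
       saying "goodbye" transfers control to goodbye.vxml. -->
  <link next="goodbye.vxml">
    <grammar>goodbye</grammar>
  </link>
</vxml>
```

A document joins the application by pointing its application attribute at that file:

```xml
<?xml version="1.0"?>
<!-- hello.vxml (hypothetical): one document of the application -->
<vxml version="1.0" application="app-root.vxml">
  <form>
    <field name="again" type="boolean">
      <prompt>Hello world. Shall I say it again?</prompt>
    </field>
  </form>
</vxml>
```

While the interpreter waits on the field in `hello.vxml`, the "goodbye" link inherited from the application document is also listening.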
The <meta> elements in our VoiceXML document mean basically the same thing as in HTML: they provide information about the document for use by a classification engine. We could have included <rdf> elements for the same purpose, but the <meta> element is the one accepted as valid in VoiceXML.
Moving further into the document, note that even if we do not require fields to be filled by the user, we still use the <form> element to enclose the <block> element. Thus, the <form> element either allows the user to input into fields, or causes the interpreter to say something. My recent article, Adapting Content for VoiceXML, contains a sample VoiceXML form for user input.
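For comparison with the sample above, here is a minimal sketch of a form that does both: it speaks through a `<block>` and then collects input through a `<field>`. The element names come from VoiceXML 1.0; the ids and wording are invented, and whether a particular interpreter supports the built-in boolean field type is an assumption to check.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="greet">
    <!-- A block: the interpreter just speaks. -->
    <block>Welcome.</block>

    <!-- A field: the interpreter prompts, then waits for input.
         type="boolean" accepts a yes/no answer. -->
    <field name="repeat" type="boolean">
      <prompt>Would you like to hear the greeting again?</prompt>
      <!-- Runs once the field has been filled. -->
      <filled>
        <if cond="repeat">
          <prompt>Welcome, once more.</prompt>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```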
A <block> contains executable elements. Just think of it as a "block" of instructions to be processed by the VoiceXML interpreter. Within the <block>, the <audio> element is specific to the Tellme engine. A fully compliant VoiceXML document would use the <prompt> element.
So if you test the "Hello World" application in the Tellme environment, you must use the <audio> element. But if you are using the IBM VoiceXML environment (available as a free download), replace the <audio> element with the <prompt> element as recommended by the VoiceXML consortium.
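A sketch of what that replacement might look like as a complete document, assuming an interpreter that follows the VoiceXML 1.0 specification and does not require a DOCTYPE declaration. The text is synthesized, since no audio file is referenced.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <meta name="Author" content="Didier PH Martin"/>
  <meta name="Document" content="The classical Hello World Sample"/>
  <form>
    <block>
      <!-- The interpreter converts this text to speech. -->
      <prompt>Hello world</prompt>
    </block>
  </form>
</vxml>
```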
In fact, the <audio> element is a valid element in the VoiceXML v1.0 specification document, but it's used to refer to a pre-recorded audio stream. Thus, the rendering of a pre-recorded "Hello World" in the VoiceXML 1.0 specification would be:
<prompt> <audio src="http://talva.dyndns.org/vxml/helloWorld.wav"/> </prompt>
For the Tellme engine, the same expression would be
<audio src="http://talva.dyndns.org/vxml/helloWorld.wav"> Hello world </audio>
If the Tellme engine doesn't find the audio file, then the data contained in the audio element is converted into voice. If the Tellme engine does find the relevant WAV, it's downloaded, cached, and played.
A pre-recorded voice obviously offers better audio quality than a synthesized one. For any static audio content, then, it's better to refer to a pre-recorded audio file and to include the text as well. The text serves as a fail-safe rendering if something goes wrong with the audio file, and it also documents what the recording says.
Download the alphaWorks VoiceXML interpreter, or use the Tellme studio, and test your own version of the "Hello World" application.
IBM VoiceXML interpreter: This tool is freely available from the IBM alphaWorks site.
You can also register with the Tellme studio, which is freely available until October 31, 2000, at http://studio.tellme.com.