Hello, Voice World
In our last trip to Didier's Lab, we encountered the aural world of XML made possible by the VoiceXML language. This week I'll explain more about VoiceXML and create the classic "Hello World" application. But this time instead of seeing the result, you'll listen to it. People intrigued by the last article asked me if and how VoiceXML documents are used to build voice applications. Answering this question presents an opportunity to highlight VoiceXML's features, and the way its basic concepts make it very different from HTML or XHTML.
A VoiceXML application is a collection of dialogs. A dialog is the basic unit of interaction between the VoiceXML interpreter and an interlocutor, and it is either a form or a menu. A form consists of a collection of fields that the interlocutor fills in. A menu offers the interlocutor a set of choices and moves on to another dialog according to the one selected. The figure below shows an example VoiceXML application and the links between its dialogs.

Figure 1: VoiceXML dialog collection
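As a rough sketch of the two dialog types, the document below pairs a menu with two forms. It follows the VoiceXML 1.0 specification rather than Tellme's variant, and the dialog ids, the prompts, and the zip field are invented for illustration; they are not taken from the figure.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Menu dialog: the interlocutor picks one choice, and the
       interpreter moves to the dialog named by that choice. -->
  <menu id="main">
    <prompt>Say weather or goodbye.</prompt>
    <choice next="#weather">weather</choice>
    <choice next="#bye">goodbye</choice>
  </menu>

  <!-- Form dialog: a single field filled in by the interlocutor. -->
  <form id="weather">
    <field name="zip" type="digits">
      <prompt>Please say your zip code.</prompt>
      <filled>
        <prompt>You said <value expr="zip"/>.</prompt>
      </filled>
    </field>
  </form>

  <!-- Form dialog with no fields: it only speaks. -->
  <form id="bye">
    <block>Goodbye.</block>
  </form>
</vxml>
```

The menu comes first in the document, so it is the dialog the interpreter enters when the document is loaded.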
Hello World
Here is the classic "Hello World" application in VoiceXML:
<?xml version="1.0"?>
<!DOCTYPE vxml PUBLIC "-//Tellme Networks//Voice Markup Language 1.0//EN"
  "http://resources.tellme.com/toolbox/vxml-tellme.dtd">
<vxml version="1.0" base="" lang="en" application="">
  <meta name="Author" content="Didier PH Martin"/>
  <meta name="Document" content="The classical Hello World Sample"/>
  <form>
    <block>
      <audio src="http://talva.dyndns.org/vxml/helloWorld.wav">
        Hello world
      </audio>
    </block>
  </form>
</vxml>
Since we are dealing with a talking machine, our "Hello World" application has nothing to show for itself, but it definitely has something to say.
The first lines should be familiar. After the XML declaration comes a DOCTYPE declaration indicating where the document type definition (DTD) file is located. Normally, if validation is unnecessary and external entities are not required, the DOCTYPE declaration can be omitted. But if you're testing this "Hello World" application within the Tellme environment, you'll need to include the Tellme DOCTYPE declaration, since Tellme's implementation differs slightly from the one recommended by the VoiceXML consortium. In other words, the DOCTYPE declaration is mandatory for the Tellme environment but not necessarily for other VoiceXML interpreters.
The root element (or the document type element),
<vxml>, contains version, base, language, and
application attributes. The most important of these is the application
attribute. It represents a major point of difference between XHTML and
VoiceXML applications. In the XHTML world, the contents of the
<html> element are rendered, in most current
browsers, as an independent scrollable page. In the VoiceXML world,
the contents of the <vxml> element are integrated
into a larger whole: an application session. Session duration is
simply the duration of the user's connection; that is, the time the
interlocutor is connected to the VoiceXML interpreter. A session ends
when the interlocutor hangs up, or when a VoiceXML document asks the
interpreter to hang up.
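A minimal sketch of a document that asks the interpreter to hang up might look like the following. It uses the <disconnect> element from the VoiceXML 1.0 specification; the prompt text is invented, the Tellme engine may expect something different, and whether the farewell is fully played before the line drops may depend on the interpreter.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>Thank you for calling. Goodbye.</prompt>
      <!-- Ask the interpreter to hang up, which ends the session. -->
      <disconnect/>
    </block>
  </form>
</vxml>
```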
A VoiceXML application is a set of documents sharing a common
application document. The application attribute of a VoiceXML document
tells the interpreter which application the document belongs
to. Our sample document is part of the Tellme
application, which defines such standard behaviors as what to do when
the interlocutor says "Tellme menu", when the
* key is pressed twice, or when the interlocutor says
"Goodbye". The following diagram shows the relationship between the
application and dialog documents.

Figure 2: Hierarchy of VoiceXML Documents
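To make the hierarchy concrete, here is a sketch of an application document and one dialog document that belongs to it, written against the VoiceXML 1.0 specification. The file name app-root.vxml and the greeting variable are hypothetical, and this is not how the Tellme application itself is written. The application document declares a variable that every document of the application can read:

```xml
<?xml version="1.0"?>
<!-- Hypothetical application document, saved as app-root.vxml -->
<vxml version="1.0">
  <!-- Declared here, the variable lives in the application scope. -->
  <var name="greeting" expr="'Hello world'"/>
</vxml>
```

A dialog document names that file in its application attribute and can then use the shared variable:

```xml
<?xml version="1.0"?>
<vxml version="1.0" application="app-root.vxml">
  <form>
    <block>
      <!-- Reads the variable declared in the application document. -->
      <prompt><value expr="application.greeting"/></prompt>
    </block>
  </form>
</vxml>
```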
The <meta> elements in our VoiceXML document
mean basically the same thing as in HTML: they provide information
about this document for use by a classification engine. We could have
included <rdf> elements for the same purpose, but
only the <meta> element is accepted as a valid
element by the VoiceXML DTD.
Moving further into the document, note that even though we do not
require the user to fill in any fields, we still use the
<form> element to enclose the
<block> element. Thus, the
<form> element serves two purposes: it lets the user
fill in fields, and it can simply make the interpreter say something. My recent
article, Adapting
Content for VoiceXML, contains a sample VoiceXML form for user
input.
A <block> contains executable elements. Just
think of it as a "block" of instructions to be processed by the
VoiceXML interpreter. Within <block>, the
<audio> element, as used here, is specific to the Tellme engine. A
fully compliant VoiceXML document would use the
<prompt>Hello World</prompt>
construct instead.
So if you test the "Hello World" application in the Tellme
environment, you must use the <audio> element. But
if you are using the IBM VoiceXML environment (available as a free
download), replace the <audio> element with the
<prompt> element as recommended by the VoiceXML
consortium.
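Put together, a version of the sample for a standard VoiceXML 1.0 interpreter might look like the sketch below. It assumes the interpreter does not need the Tellme DOCTYPE declaration, and it drops the <meta> elements and the empty base and application attributes for brevity. It has not been checked against the IBM environment specifically.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <!-- Text inside prompt is rendered by the speech synthesizer. -->
      <prompt>Hello World</prompt>
    </block>
  </form>
</vxml>
```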
In fact, the <audio> element is a valid
element in the VoiceXML v1.0 specification document, but it's used to
refer to a pre-recorded audio stream. Thus, the rendering of a
pre-recorded "Hello World" in the VoiceXML 1.0 specification would
look like
<prompt> <audio src="http://talva.dyndns.org/vxml/helloWorld.wav"/> </prompt>
For the Tellme engine, the same expression would be
<audio src="http://talva.dyndns.org/vxml/helloWorld.wav"> Hello world </audio>
If the Tellme engine finds the WAV file, it downloads, caches, and plays it. If it doesn't find the file, the text contained in the audio element is converted into synthesized speech instead.
A pre-recorded voice obviously offers better audio quality than a synthesized one. For any static audio content, then, it's better to refer to a pre-recorded audio file and also supply the text, which serves both as a fail-safe rendering if something goes wrong with the audio file and as documentation.
Homework
Download the alphaWorks VoiceXML interpreter, or use the Tellme Studio, and test your own version of the "Hello World" application.
Resources
IBM VoiceXML interpreter: This tool is freely available from the IBM alphaWorks site.
You can also register with the Tellme Studio, which is freely available until October 31, 2000, at http://studio.tellme.com.
The VoiceXML version 1.0 specification is available from either the VoiceXML Consortium or the W3C.