Hello, Voice World

September 6, 2000

In our last trip to Didier's Lab, we encountered the aural world of XML made possible by the VoiceXML language. This week I'll explain more about VoiceXML and create the classic "Hello World" application. But this time instead of seeing the result, you'll listen to it. People intrigued by the last article asked me if and how VoiceXML documents are used to build voice applications. Answering this question presents an opportunity to highlight VoiceXML's features, and the way its basic concepts make it very different from HTML or XHTML.

A VoiceXML application is a collection of dialogs. A dialog is the basic unit of interaction between the VoiceXML interpreter and an interlocutor. A dialog can be either a form or a menu. A form consists of a collection of fields to be filled by the interlocutor. A menu presents a set of choices from which the interlocutor picks one. The figure below shows an example VoiceXML application and the links between its dialogs.

Figure 1: VoiceXML dialog collection
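
To make this concrete, here is a minimal sketch of such a dialog collection, written against the VoiceXML 1.0 specification rather than the Tellme dialect. The dialog ids (main, greeting, farewell) and the spoken text are invented for illustration, and the forms are kept to a single block so that the links between dialogs stand out.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- A menu dialog: the interlocutor picks one of the choices. -->
  <menu id="main">
    <prompt>Say one of: <enumerate/></prompt>
    <!-- Each choice links to another dialog in this document. -->
    <choice next="#greeting">greeting</choice>
    <choice next="#farewell">farewell</choice>
  </menu>

  <!-- A form dialog with no fields: it only speaks. -->
  <form id="greeting">
    <block>Hello world</block>
  </form>

  <!-- A second form dialog, reached from the other choice. -->
  <form id="farewell">
    <block>Goodbye world</block>
  </form>
</vxml>
```

The interpreter starts with the first dialog in the document, here the menu. Each choice then points to another dialog in the same document through a fragment identifier.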

Hello World

Here is the classic "Hello World" application in VoiceXML:


<?xml version="1.0"?>
<!DOCTYPE vxml PUBLIC "-//Tellme Networks//Voice Markup Language 1.0//EN"
  "http://resources.tellme.com/toolbox/vxml-tellme.dtd">
<vxml version="1.0" base="" lang="en" application="">
  <meta name="Author" content="Didier PH Martin"/>
  <meta name="Document" content="The classical Hello World Sample"/>
  <form>
    <block>
      <audio src="http://talva.dyndns.org/vxml/helloWorld.wav">
        Hello world
      </audio>
    </block>
  </form>
</vxml>

Since we are dealing with a talking machine, our "Hello World" application has nothing to show for itself: but it definitely has something to say.

The first two lines should be familiar: an XML declaration, followed by a DOCTYPE declaration indicating where the document type definition file is located. Normally, if validation is unnecessary and external entities are not required, the DOCTYPE declaration can be omitted. But if you're testing this "Hello World" application within the Tellme environment, you'll need to include the Tellme DOCTYPE declaration, since Tellme's implementation differs slightly from the one recommended by the VoiceXML consortium. In short, the DOCTYPE declaration is mandatory for the Tellme environment but not necessarily for other VoiceXML interpreters.

The root element (or the document type element), <vxml>, contains version, base, lang, and application attributes. The most important of these is the application attribute. It represents a major point of difference between XHTML and VoiceXML applications. In the XHTML world, the contents of the <html> element are rendered, in most current browsers, as an independent scrollable page. In the VoiceXML world, the contents of the <vxml> element are integrated into a larger whole: an application session. Session duration is simply the duration of the user's connection; that is, the time the interlocutor is connected to the VoiceXML interpreter. A session ends when the interlocutor hangs up, or when a VoiceXML document asks the interpreter to hang up.
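
As a rough sketch of the second case, a document can end the session itself with the <disconnect> element defined in the VoiceXML 1.0 specification. The form id and the wording are invented for illustration, and whether the Tellme dialect treats the element identically is an assumption to verify.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <!-- Hypothetical closing dialog: say goodbye, then hang up. -->
  <form id="farewell">
    <block>
      <prompt>Goodbye.</prompt>
      <!-- Asks the interpreter to drop the connection,
           which ends the session. -->
      <disconnect/>
    </block>
  </form>
</vxml>
```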

A VoiceXML application is a set of documents sharing a common application document. The application attribute of a VoiceXML document tells the interpreter which application the document belongs to. Our sample document is part of the Tellme application, which defines such standard behaviors as what to do when the interlocutor says "Tellme menu", presses the * key twice, or says "Goodbye". The following diagram shows the relationship between the application and dialog documents.

Figure 2: Hierarchy of VoiceXML Documents
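
The hierarchy can be sketched with two small documents written against the VoiceXML 1.0 specification. The file names (app-root.vxml, hello.vxml, goodbye.vxml), the variable name, and the wording are all hypothetical, and the format of the inline grammar depends on the platform, so treat this as an illustration rather than something to paste into the Tellme studio. First, the application document:

```xml
<?xml version="1.0"?>
<!-- app-root.vxml (hypothetical name): the application document. -->
<vxml version="1.0">
  <!-- A variable visible to every document of the application. -->
  <var name="farewell" expr="'Goodbye'"/>

  <!-- A link active in every document of the application:
       saying "goodbye" jumps to goodbye.vxml.
       The inline grammar format is platform-specific. -->
  <link next="goodbye.vxml">
    <grammar>goodbye</grammar>
  </link>
</vxml>
```

Then a dialog document that declares its ownership through the application attribute:

```xml
<?xml version="1.0"?>
<!-- hello.vxml (hypothetical name): a dialog document. -->
<vxml version="1.0" application="app-root.vxml">
  <form>
    <field name="again" type="boolean">
      <prompt>Hello World. Shall I say it again?</prompt>
      <filled>
        <if cond="again">
          <!-- Clearing the field makes the form prompt again. -->
          <clear namelist="again"/>
        <else/>
          <!-- Speak the variable defined in app-root.vxml. -->
          <prompt><value expr="application.farewell"/></prompt>
          <exit/>
        </if>
      </filled>
    </field>
  </form>
</vxml>
```

While the field waits for an answer, the link inherited from the application document is also active, which is how behaviors such as "Goodbye" can be defined once and shared by every dialog document.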

The <meta> elements in our VoiceXML document mean basically the same thing as in HTML: they provide information about this document for use by a classification engine. We could have included <rdf> elements for the same purpose, but only the <meta> element is accepted as a valid element by the VoiceXML DTD.

Moving further into the document, note that even though we do not require any fields to be filled by the user, we still use the <form> element to enclose the <block> element. The <form> element thus serves two purposes: it collects user input into fields, and it can make the interpreter say something. My recent article, Adapting Content for VoiceXML, contains a sample VoiceXML form for user input.
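
Both uses can sit side by side in one form. The sketch below follows the VoiceXML 1.0 specification rather than the Tellme dialect; the form id, the field name, and the question are invented for illustration.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form id="party">
    <!-- Spoken when the form is entered; no input is needed. -->
    <block>Welcome to the voice world.</block>

    <!-- A field the interlocutor fills by answering the prompt.
         "number" is one of the built-in field types. -->
    <field name="guests" type="number">
      <prompt>How many guests are coming?</prompt>
      <filled>
        <!-- Runs once the field has a value. -->
        <prompt>You said <value expr="guests"/>.</prompt>
      </filled>
    </field>
  </form>
</vxml>
```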

A <block> contains executable elements. Just think of it as a "block" of instructions to be processed by the VoiceXML interpreter. Within <block>, the <audio> element is specific to the Tellme engine. A fully compliant VoiceXML document would use the

<prompt>Hello World</prompt>

construct instead.

So if you test the "Hello World" application in the Tellme environment, you must use the <audio> element. But if you are using the IBM VoiceXML environment (available as a free download), replace the <audio> element with the <prompt> element as recommended by the VoiceXML consortium.
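
Put together, the whole document for an interpreter that follows the VoiceXML 1.0 specification might look like the sketch below. It drops the Tellme DOCTYPE and the empty base and application attributes, on the assumption that the interpreter needs neither; check your interpreter's documentation before relying on that.

```xml
<?xml version="1.0"?>
<vxml version="1.0" lang="en">
  <meta name="Author" content="Didier PH Martin"/>
  <meta name="Document" content="The classical Hello World Sample"/>
  <form>
    <block>
      <!-- Synthesized speech instead of the Tellme audio element. -->
      <prompt>Hello World</prompt>
    </block>
  </form>
</vxml>
```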

In fact, the <audio> element is a valid element in the VoiceXML v1.0 specification document, but it's used to refer to a pre-recorded audio stream. Thus, the rendering of a pre-recorded "Hello World" in the VoiceXML 1.0 specification would look like

<prompt>
  <audio src="http://talva.dyndns.org/vxml/helloWorld.wav"/>
</prompt>

For the Tellme engine, the same expression would be


<audio src="http://talva.dyndns.org/vxml/helloWorld.wav">
  Hello world
</audio>

If the Tellme engine doesn't find the audio file, then the data contained in the audio element is converted into voice. If the Tellme engine does find the relevant WAV, it's downloaded, cached, and played.

A pre-recorded voice obviously offers better audio quality than a synthesized one. For any static audio content, then, it's better to reference a pre-recorded audio file and also include the text, which serves as a fail-safe rendering if something goes wrong with the audio file and doubles as documentation.
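
As far as the VoiceXML 1.0 specification goes, an <audio> element nested in a <prompt> may also carry alternate content for the case where the audio sample is not available, so the same fail-safe can be sketched in the standard form. Whether a given interpreter actually speaks the alternate text is an assumption to verify on your platform.

```xml
<?xml version="1.0"?>
<vxml version="1.0">
  <form>
    <block>
      <prompt>
        <!-- Played if the WAV file can be fetched; otherwise the
             enclosed text is expected to be synthesized. -->
        <audio src="http://talva.dyndns.org/vxml/helloWorld.wav">
          Hello world
        </audio>
      </prompt>
    </block>
  </form>
</vxml>
```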

Homework

Download the alphaWorks VoiceXML interpreter, or use the Tellme studio, and test your own version of the "Hello World" application.

Resources

IBM VoiceXML interpreter: This tool is freely available from the IBM alphaWorks site.

You can also register with the Tellme studio, which is freely available until October 31, 2000, at http://studio.tellme.com.

The VoiceXML version 1.0 specification is available from either the VoiceXML Consortium or the W3C.