Speech Synthesis Markup Language: An Introduction
by Peter Mikhalenko

The Structure and Examples

The root element of an SSML document is <speak/>. The <meta/>, <metadata/>, and <lexicon/> elements must occur before all other elements and text contained within the root <speak/> element; the specification imposes no other ordering constraints. The root element must have a mandatory xml:lang attribute specifying the language of the document; the xml:lang attribute may also be used on the <voice/>, <speak/>, <p/>, and <s/> elements. The root element must also have a version attribute, whose value must be "1.0". The root element may contain only text to be rendered and the following elements: <audio/>, <break/>, <emphasis/>, <lexicon/>, <mark/>, <meta/>, <metadata/>, <p/>, <phoneme/>, <prosody/>, <say-as/>, <sub/>, <s/>, <voice/>. This is how xml:lang can be used:

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <p>I don't speak French.</p>
  <p xml:lang="fr">Bonjour monsieur!</p>
</speak>

An SSML document may reference one or more external pronunciation lexicon documents. A lexicon document is identified by a URI with an optional media type; no standard lexicon media type has yet been defined as the default for the SSML specification. A lexicon document contains pronunciation information for tokens that can appear in a text to be spoken, and that information is used for tokens appearing within the referencing document. Lexicons can be included as follows:

<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">

  <lexicon uri="http://www.xml.com/lexicon.file"/>
  <lexicon uri="http://www.xml.com/slang-words.file"
           type="media-type"/>
  ...
</speak>

You can include metadata for the document using a metadata schema. The recommended metadata format is the XML serialization of RDF.
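For instance, a <metadata/> block can carry an RDF/XML description of the document. The following sketch is illustrative only: the document URI and the Dublin Core property values are invented, and authors are free to use a different vocabulary:

<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- hypothetical description of this document using Dublin Core -->
      <rdf:Description rdf:about="http://www.example.com/greeting.ssml"
                       dc:title="Greeting"
                       dc:creator="Peter Mikhalenko"
                       dc:language="en-US"/>
    </rdf:RDF>
  </metadata>
  <p>Hello, world!</p>
</speak>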

The <p/> and <s/> elements exist for logical and physical division: the former represents a paragraph, the latter a sentence. Here is an example of their usage:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here's another sentence.</s>
  </p>
</speak>

The <say-as/> element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text. Defining a comprehensive set of text format types is difficult because of the variety of languages that have to be considered and because of the innate flexibility of written languages. SSML only specifies the say-as element, its attributes, and their purpose. It does not enumerate the possible values for the attributes. The <say-as/> element has three attributes: interpret-as, format, and detail.

The interpret-as attribute is always required; the other two attributes are optional. The legal values for the format attribute depend on the value of the interpret-as attribute. The interpret-as attribute indicates the content type of the contained text construct. Specifying the content type helps the synthesis processor to distinguish and interpret text constructs that may be rendered in different ways depending on what type of information is intended. In addition, the optional format attribute can give further hints on the precise formatting of the contained text for content types that may have ambiguous formats. The detail attribute is an optional attribute that indicates the level of detail to be read aloud or rendered. Every value of the detail attribute must render all of the informational content in the contained text; however, specific values for the detail attribute can be used to render content that is not usually informational in running text but may be important to render for specific purposes.
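Because the specification does not enumerate the attribute values, the values in the following sketch ("date", "telephone", and the format "mdy") are common conventions rather than requirements of SSML itself, and a given processor may or may not support them:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <!-- hint that the text is a date in month-day-year format -->
  Christmas falls on <say-as interpret-as="date" format="mdy">12/25/2005</say-as>.
  <!-- hint that the digits should be read as a telephone number -->
  Call <say-as interpret-as="telephone">212-555-1212</say-as> for details.
</speak>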

The <phoneme/> element provides a phonemic/phonetic pronunciation for the contained text. The phoneme element may be empty. However, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. The ph attribute is a required attribute that specifies the phoneme/phone string. This element is designed strictly for phonemic and phonetic notations and is intended to be used to provide pronunciations for words or very short phrases. The phonemic/phonetic string does not undergo text normalization and is not treated as a token for lookup in the lexicon. Briefly, phonemic strings consist of phonemes, language-dependent speech units that characterize linguistically significant differences in the language; loosely, phonemes represent all the sounds needed to distinguish one word from another in a given language. The alphabet attribute is an optional attribute that specifies the phonemic/phonetic alphabet. An alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages. The only valid values for this attribute are "ipa" and vendor-defined strings of the form "x-organization" or "x-organization-alphabet". Here is an example of element usage:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <phoneme alphabet="ipa" ph="t&#x259;mei&#x325;&#x27E;ou&#x325;"> tomato </phoneme>
</speak>

The <sub/> element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and written form. The required alias attribute specifies the string to be spoken instead of the enclosed string. The processor should apply text normalization to the alias value. An example of such substitution might look something like the following:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
  <sub alias="World Wide Web Consortium">W3C</sub>
  <!-- World Wide Web Consortium -->
</speak>

The <voice/> element is a production element that requests a change in speaking voice. The <emphasis/> element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The <break/> element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional; if the element is not present between words, the synthesis processor is expected to determine a break automatically based on the linguistic context. The <prosody/> element permits control of the pitch, speaking rate, and volume of the speech output. Its attributes are fairly intricate, so it is best to consult the specification itself for the details. Here's a typical SSML document:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://www.w3.org/2001/10/synthesis http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
         xml:lang="en-US">
	<voice gender="female" age="20">
		<p>
		You have an incoming message from 
		<emphasis>Peter Mikhalenko</emphasis> in your mailbox.
		Mail arrived at <say-as interpret-as="time">7am</say-as> today.
		</p>
	</voice>

	<voice gender="male" age="30">
		<p>
		Hi, Steve!
		<break/>
		Hope you're OK.
		</p>

		<p>
		Sincerely yours, Peter.
		</p>
	</voice>
</speak>
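
To round out these production elements, here is a minimal sketch combining <prosody/>, <break/>, and <emphasis/>; the attribute values shown ("low", "slow", "loud", "500ms", and "strong") are drawn from the value sets defined in the specification:

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
         xml:lang="en-US">
  <p>
    <!-- lower pitch, slower rate, and louder volume for this span -->
    <prosody pitch="low" rate="slow" volume="loud">
      This sentence is spoken low, slowly, and loudly.
    </prosody>
    <!-- an explicit half-second pause -->
    <break time="500ms"/>
    <emphasis level="strong">This sentence is strongly emphasized.</emphasis>
  </p>
</speak>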

Implementations

There are several implementations of SSML available; some are open source, while others are proprietary industry implementations. For an open source example, see FreeTTS. Speech technology and telecommunications industry leaders that have implemented SSML include France Telecom, Loquendo S.p.A., ScanSoft, and Voxpilot. All of them have provided implementation reports to the W3C; for more information, see the W3C's SSML 1.0 Implementation Report.