Speech Synthesis Markup Language: An Introduction
October 20, 2004
The Speech Synthesis Markup Language Specification (SSML 1.0), which became a W3C Recommendation in September 2004, is one of the standards enabling access to the Web using spoken interaction. It is designed to provide a rich, XML-based markup language for assisting the generation of synthetic speech in web and other applications. The essential role of SSML is to give authors of synthesizable content a standard way to control aspects of speech such as pronunciation, volume, pitch, and rate across different synthesis-capable platforms.
Background
The SSML specification is based upon the JSML and JSGF specifications, which are owned by Sun. Originally, JSML (JSpeech Markup Language) was developed as a very simple XML format used by applications to annotate text input to speech synthesizers. JSML had characteristics very similar to SSML: it defined elements that described the structure of a document, provided pronunciations of words and phrases, indicated phrasing, emphasis, pitch, and speaking rate, and controlled other important speech characteristics. The letter "J" in the markup language's name comes from the Java(TM) Speech API, introduced by Sun in collaboration with leading speech technology companies for incorporating speech technology into the user interfaces of Java applets and applications. The design of JSML elements and their semantics is quite simple. Here is a typical, self-explanatory example:
<jsml>
  <voice gender="female" age="20">
    <div type="paragraph">
      You have an incoming message from <emphasis>Peter Mikhalenko</emphasis>
      in your mailbox. Mail arrived at <sayas class="time">7am</sayas> today.
    </div>
  </voice>
  <voice gender="male" age="30">
    <div type="paragraph">
      Hi, Steve! <break/> Hope you're OK.
    </div>
    <div>
      Sincerely yours, Peter.
    </div>
  </voice>
</jsml>
The JSpeech Grammar Format (JSGF) is a representation of grammars for use in speech recognition. It defines a platform- and vendor-independent way to describe one type of grammar, a rule grammar (also known as a command-and-control grammar or regular grammar). Grammars are used by speech recognizers to determine what the recognizer should listen for, and so they describe the utterances a user may say. JSGF is not an XML format and is outside the scope of this article.
SSML's Place in the Global Scope
Voice browsers are a very important part of Multimodal Interaction and Device Independence, making web applications accessible through multiple modes of interaction. A voice browser is a device that interprets a markup language and is capable of generating voice output or interpreting voice input, and possibly other input/output modalities. There is a whole set of markup specifications for voice browsers developed at the W3C, and SSML is part of it. Speech synthesis is the automatic generation of speech output from data input, which may include plain text, marked-up text, or binary objects. It must be practical to generate speech synthesis output from a wide range of existing document representations; in particular, it must be possible to produce speech output from HTML, HTML with CSS, XHTML, XML with XSL, and the DOM. The intended use of SSML is to improve the quality of synthesized content.
Language Use
The key concepts of SSML are
- interoperability, or interacting with other markup languages (VoiceXML, SMIL etc.);
- consistency, or providing predictable control of voice output across platforms and across speech synthesis implementations; and
- internationalization, or enabling speech output in a large number of languages within or across documents.
A system for the automatic generation of speech output from text or annotated text input that supports SSML must render a document as spoken output, using the information contained in the markup to render the document as intended by the author. A speech synthesis process involves several steps.
- XML parse. The incoming text document is parsed, and the document tree and content are extracted.
- Structure analysis. The structure of a document influences the way in which it should be read. For example, there are common speaking patterns associated with paragraphs and sentences.
- Text normalization. All written languages have special constructs that require a conversion of the written form (orthographic form) into the spoken form. Text normalization is an automated process of the synthesis processor that performs this conversion. For example, for English, when "$1000" appears in a document it may be spoken as "one thousand dollars." The orthographic form "1/2" may potentially be spoken as "one half," "January second," "February first," "one of two," and so on. By the end of this step the text to be spoken has been converted completely into tokens. The exact details of what constitutes a token are language-specific. A special <say-as/> element can be used in the input document to explicitly indicate the presence and type of these constructs and to resolve ambiguities (see the sketch after this list).
- Text-to-phoneme conversion. After the processor has determined the set of words to be spoken, it must derive pronunciations for each word. Word pronunciations may be conveniently described as sequences of phonemes, which are units of sound in a language that serve to distinguish one word from another. Each language has a specific phoneme set. This step is hard and complex for several reasons. First of all, there are differences between the written and spoken forms of a language, and these differences can lead to indeterminacy or ambiguity in the pronunciation of written words. For example, in English, "read" may be spoken as "reed" (I will read the book) or "red" (I have read the book). Both human speakers and synthesis processors can pronounce these words correctly in context but may have difficulty without context. The <phoneme/> element of SSML allows a phonemic sequence to be provided for any word or word sequence.
- Prosody analysis. Prosody is the set of features of speech output that includes the pitch (also called intonation or melody), the timing (or rhythm), the pausing, the speaking rate, the emphasis on words, and many other features. Producing humanlike prosody is important for making speech sound natural and for correctly conveying the meaning of spoken language. In SSML there are special elements <break/>, <emphasis/>, and <prosody/> for prosody purposes, which I will describe below.
- Waveform production. This is the final step, in which audio waveform output is produced from the phonemes and prosodic information. There are many approaches to this processing step, so there may be considerable processor-specific variation. The <voice/> element in SSML allows the document creator to request a particular voice or specific voice qualities (e.g., a young male voice).
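As a concrete illustration of the text-normalization step, the "1/2" ambiguity mentioned above could be resolved with markup along the following lines. This is only a sketch: as discussed later, SSML itself does not enumerate the legal interpret-as or format values, so "date" and the month-day format string "md" used here are assumed, processor-defined conventions.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- "date" and "md" are assumed, processor-defined values;
       SSML 1.0 does not enumerate interpret-as or format values -->
  The meeting is on <say-as interpret-as="date" format="md">1/2</say-as>.
</speak>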
SSML provides a standard way to specify gross properties of synthetic speech production such as pronunciation, volume, pitch, rate, etc. Exact specification of synthetic speech output behavior across disparate processors, however, is beyond the scope of the SSML specification. It should be noted that markup values are merely indications rather than absolutes. For example, it is possible for an author to explicitly indicate the duration of a text segment and also indicate an explicit duration for a subset of that text segment. If the two durations result in a text segment that the synthesis processor cannot reasonably render, the processor is permitted to modify the durations as needed to render the text segment.
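The conflicting-duration scenario just described might look like the following sketch, which uses the prosody element's duration attribute; the processor is free to adjust either request if the two cannot both be honored.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <prosody duration="5s">
    This whole sentence is requested to take five seconds,
    <!-- the inner request may conflict with the outer one; the processor
         may modify either duration as needed to render the text -->
    <prosody duration="250ms">including this rather long inner clause.</prosody>
  </prosody>
</speak>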
The Structure and Examples
The root element of an SSML document is <speak/>. The <meta/>, <metadata/>, and <lexicon/> elements must occur before all other elements and text contained within the root <speak/> element. There are no other ordering constraints on the elements in the specification. The root element must have a mandatory xml:lang attribute specifying the language of the root document. The xml:lang attribute can be used on the <voice/>, <speak/>, <p/>, and <s/> elements. The root element must also have a version attribute, which must have the value "1.0". The root element can only contain text to be rendered and the following elements: <audio/>, <break/>, <emphasis/>, <lexicon/>, <mark/>, <meta/>, <metadata/>, <p/>, <phoneme/>, <prosody/>, <say-as/>, <sub/>, <s/>, <voice/>. This is how xml:lang can be used:
<?xml version="1.0"?> <speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <p>I don't speak French.</p> <p xml:lang="fr">Bonjour monsieur!</p> </speak>
An SSML document may reference one or more external pronunciation lexicon documents. A lexicon document is identified by a URI with an optional media type; no standard lexicon media type has yet been defined as the default for the SSML specification. A lexicon document contains pronunciation information for tokens that can appear in a text to be spoken. The pronunciation information contained within a lexicon is used for tokens appearing within the referencing document. Lexicons can be included as follows:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <lexicon uri="http://www.xml.com/lexicon.file"/> <lexicon uri="http://www.xml.com/slang-words.file" type="media-type"/> ... </speak>
You can include metadata for the document using a metadata schema. The recommended metadata format is the XML serialization of RDF.
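A minimal sketch of such a block is shown below, embedding an RDF description inside the <metadata/> element. The property choices (Dublin Core title and creator) and the document URI are illustrative assumptions rather than anything mandated by SSML.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <metadata>
    <!-- an illustrative RDF/XML description; SSML does not mandate
         a particular metadata vocabulary -->
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <rdf:Description rdf:about="http://www.example.com/greeting.ssml">
        <dc:title>Greeting message</dc:title>
        <dc:creator>Peter Mikhalenko</dc:creator>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  Hello, world.
</speak>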
For logical and physical division purposes, the <p/> and <s/> elements exist. The former represents a paragraph, the latter a sentence. This is an example of their usage:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <p> <s>This is the first sentence of the paragraph.</s> <s>Here's another sentence.</s> </p> </speak>
The <say-as/> element allows the author to indicate information on the type of text construct contained within the element and to help specify the level of detail for rendering the contained text. Defining a comprehensive set of text format types is difficult because of the variety of languages that have to be considered and because of the innate flexibility of written languages. SSML only specifies the say-as element, its attributes, and their purpose; it does not enumerate the possible values for the attributes.

The <say-as/> element has three attributes: interpret-as, format, and detail. The interpret-as attribute is always required; the other two attributes are optional. The legal values for the format attribute depend on the value of the interpret-as attribute. The interpret-as attribute indicates the content type of the contained text construct. Specifying the content type helps the synthesis processor to distinguish and interpret text constructs that may be rendered in different ways depending on what type of information is intended. In addition, the optional format attribute can give further hints on the precise formatting of the contained text for content types that may have ambiguous formats. The detail attribute is an optional attribute that indicates the level of detail to be read aloud or rendered. Every value of the detail attribute must render all of the informational content in the contained text; however, specific values for the detail attribute can be used to render content that is not usually informational in running text but may be important to render for specific purposes.
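As a rough sketch of how these attributes might combine: the interpret-as value "telephone" and the detail value "punctuation" below are assumed, processor-defined conventions, since SSML itself does not enumerate them.

<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <!-- "telephone" and "punctuation" are assumed, processor-defined values -->
  Call <say-as interpret-as="telephone" detail="punctuation">+1 (212) 555-1212</say-as> for details.
</speak>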
The <phoneme/> element provides a phonemic/phonetic pronunciation for the contained text. The phoneme element may be empty; however, it is recommended that the element contain human-readable text that can be used for non-spoken rendering of the document. The ph attribute is a required attribute that specifies the phoneme/phone string. This element is designed strictly for phonemic and phonetic notations and is intended to be used to provide pronunciations for words or very short phrases. The phonemic/phonetic string does not undergo text normalization and is not treated as a token for lookup in the lexicon. Briefly, phonemic strings consist of phonemes, language-dependent speech units that characterize linguistically significant differences in the language; loosely, phonemes represent all the sounds needed to distinguish one word from another in a given language. The alphabet attribute is an optional attribute that specifies the phonemic/phonetic alphabet. An alphabet in this context refers to a collection of symbols to represent the sounds of one or more human languages. The only valid values for this attribute are "ipa" and vendor-defined strings of the form "x-organization" or "x-organization-alphabet". Here is an example of the element's usage:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <phoneme alphabet="ipa" ph="təmei̥ɾou̥"> tomato </phoneme> </speak>
The <sub/> element is employed to indicate that the text in the alias attribute value replaces the contained text for pronunciation. This allows a document to contain both a spoken and a written form. The required alias attribute specifies the string to be spoken instead of the enclosed string. The processor should apply text normalization to the alias value. An example of such a substitution might look something like the following:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <sub alias="World Wide Web Consortium">W3C</sub> <!-- World Wide Web Consortium --> </speak>
The <voice/> element is a production element that requests a change in speaking voice. The <emphasis/> element requests that the contained text be spoken with emphasis (also referred to as prominence or stress). The <break/> element is an empty element that controls the pausing or other prosodic boundaries between words. The use of the break element between any pair of words is optional; if the element is not present between words, the synthesis processor is expected to determine a break automatically based on the linguistic context. The <prosody/> element permits control of the pitch, speaking rate, and volume of the speech output. It has quite complicated attributes, so it's better to read the original specification for the details; a rough sketch of its use follows the document below. Here's a typical SSML document:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.w3.org/TR/speech-synthesis/synthesis.xsd" xml:lang="en-US"> <voice gender="female" age="20"> <p> You have an incoming message from <emphasis>Peter Mikhalenko</emphasis> in your mailbox. Mail arrived at <sayas class="time">7am</sayas> today. </p> </voice> <voice gender="male" age="30"> <p> Hi, Steve! <break/> Hope you're OK. </p> <p> Sincerely yours, Peter. </p> </voice> </speak>
Implementations
There are several implementations of SSML available; some of them are open source, while others are proprietary industry implementations. For an open source example, see FreeTTS. Speech technology and telecommunications industry implementers include France Telecom, Loquendo S.p.A., ScanSoft, and Voxpilot. All of them have provided implementation reports to the W3C; for more information see W3C's SSML 1.0 Implementation Report.