Speech Synthesis Markup Language: An Introduction
by Peter Mikhalenko
The Structure and Examples
The root element of an SSML document
is <speak/>. The <meta/>, <metadata/>,
and <lexicon/> elements must occur before all other
elements and text contained within the
root <speak/> element. The specification places no other ordering
constraints on the elements. The root element
must have an xml:lang attribute specifying the
language of the root document. Besides <speak/>, the xml:lang
attribute can be used on the <voice/>, <p/>,
and <s/> elements. The root element must also
have a version attribute, whose value must be "1.0". The
root element can only contain text to be rendered and the following
elements: <audio/>, <break/>, <emphasis/>, <lexicon/>, <mark/>, <meta/>, <metadata/>, <p/>, <phoneme/>, <prosody/>, <say-as/>, <sub/>, <s/>, <voice/>. This
is how xml:lang can be used:
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <p>I don't speak French.</p>
  <!-- xml:lang on <p/> overrides the document language for this paragraph -->
  <p xml:lang="fr">Bonjour monsieur!</p>
</speak>
An SSML document may reference one or more external pronunciation lexicon documents. A lexicon document is identified by a URI with an optional media type. No standard lexicon media type has yet been defined as the default for the SSML specification. A lexicon document contains pronunciation information for tokens that can appear in a text to be spoken. The pronunciation information contained within a lexicon is used for tokens appearing within the referencing document. Lexicons are referenced with the <lexicon/> element, as follows:
<speak version="1.0"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <lexicon uri="http://www.xml.com/lexicon.file"/>
  <lexicon uri="http://www.xml.com/slang-words.file"
           type="media-type"/>
  ...
</speak>
You can include metadata for the document in the <metadata/> element, using a metadata schema of your choice. The recommended metadata format is the XML serialization of RDF.
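As a rough sketch of what RDF metadata might look like, the following document describes itself with a few Dublin Core properties. The choice of Dublin Core, the rdf:about URI, and the property values are illustrative assumptions, not something this article prescribes.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <!-- <metadata/> must come before any text or other content elements -->
  <metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://purl.org/dc/elements/1.1/">
      <!-- The about URI and the property values below are placeholders -->
      <rdf:Description rdf:about="http://www.example.com/greeting.ssml">
        <dc:title>Mailbox greeting</dc:title>
        <dc:creator>Example Author</dc:creator>
        <dc:language>en-US</dc:language>
      </rdf:Description>
    </rdf:RDF>
  </metadata>
  <p>You have no new messages.</p>
</speak>
```

The metadata describes the document and is not part of the text to be rendered.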
The <p/> and <s/> elements divide the text
into logical units. The former represents a
paragraph, the latter a sentence. Here is an example of their usage:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <p>
    <s>This is the first sentence of the paragraph.</s>
    <s>Here's another sentence.</s>
  </p>
</speak>
The <say-as/> element allows the author to
indicate information on the type of text construct contained within
the element and to help specify the level of detail for rendering the
contained text. Defining a comprehensive set of text format types is
difficult because of the variety of languages that have to be
considered and because of the innate flexibility of written
languages. SSML only specifies the say-as element, its attributes, and
their purpose. It does not enumerate the possible values for the
attributes. The <say-as/> element has three
attributes: interpret-as, format,
and detail.
The interpret-as attribute is always required; the
other two attributes are optional. The legal values for the format
attribute depend on the value of the interpret-as
attribute. The interpret-as attribute indicates the
content type of the contained text construct. Specifying the content
type helps the synthesis processor to distinguish and interpret text
constructs that may be rendered in different ways depending on what
type of information is intended. In addition, the optional format
attribute can give further hints on the precise formatting of the
contained text for content types that may have ambiguous
formats. The detail attribute is an optional attribute
that indicates the level of detail to be read aloud or rendered. Every
value of the detail attribute must render all of the informational
content in the contained text; however, specific values for the detail
attribute can be used to render content that is not usually
informational in running text but may be important to render for
specific purposes.
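Because SSML 1.0 leaves the attribute values open, the following is only a sketch. The values "date", "mdy", "time", "hms12", and "characters" are assumptions: they follow the W3C Working Group Note on say-as attribute values, and a given synthesis processor may support a different set.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    <!-- interpret-as is required; format is an optional hint.
         The values used here are assumptions, not defined by SSML 1.0. -->
    <s>Your appointment is on
       <say-as interpret-as="date" format="mdy">2/3/2005</say-as>
       at <say-as interpret-as="time" format="hms12">2:30pm</say-as>.</s>
    <!-- Ask for the code to be spelled out rather than read as a word -->
    <s>The confirmation code is
       <say-as interpret-as="characters">XK42</say-as>.</s>
  </p>
</speak>
```

Without the format hint, a string such as 2/3/2005 is ambiguous between February 3 and March 2, which is the kind of case the format attribute is meant to resolve.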
The <phoneme/> element provides a phonemic/phonetic
pronunciation for the contained text. The phoneme element may be
empty. However, it is recommended that the element contain
human-readable text that can be used for non-spoken rendering of the
document. The ph attribute is a required attribute that
specifies the phoneme/phone string. This element is designed strictly
for phonemic and phonetic notations and is intended to be used to
provide pronunciations for words or very short phrases. The
phonemic/phonetic string does not undergo text normalization and is
not treated as a token for lookup in the lexicon. Briefly, phonemic
strings consist of phonemes, language-dependent speech units that
characterize linguistically significant differences in the language;
loosely, phonemes represent all the sounds needed to distinguish one
word from another in a given language. The alphabet
attribute is an optional attribute that specifies the
phonemic/phonetic alphabet. An alphabet in this context refers to a
collection of symbols to represent the sounds of one or more human
languages. The only valid values for this attribute
are "ipa" and vendor-defined strings of the
form "x-organization"
or "x-organization-alphabet". Here is an example of
the element in use:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <phoneme alphabet="ipa" ph="təmei̥ɾou̥"> tomato </phoneme>
</speak>
The <sub/> element is employed to indicate that
the text in the alias attribute value replaces the contained text for
pronunciation. This allows a document to contain both a spoken and
written form. The required alias attribute specifies the string to be
spoken instead of the enclosed string. The processor should apply text
normalization to the alias value. Here is an example of such a
substitution:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <sub alias="World Wide Web Consortium">W3C</sub>
  <!-- spoken as "World Wide Web Consortium" -->
</speak>
The <voice/> element is a production element
that requests a change in speaking
voice. The <emphasis/> element requests that the
contained text be spoken with emphasis (also referred to as prominence
or stress). The <break/> element is an empty
element that controls the pausing or other prosodic boundaries between
words. The use of the break element between any pair of words is
optional. If the element is not present between words, the synthesis
processor is expected to automatically determine a break based on the
linguistic context. The <prosody/> element permits
control of the pitch, speaking rate, and volume of the speech
output. Its attributes are fairly involved, so consult
the original specification for the details. Here's a
typical SSML document:
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                           http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
       xml:lang="en-US">
  <voice gender="female" age="20">
    <p>
      You have an incoming message from
      <emphasis>Peter Mikhalenko</emphasis> in your mailbox.
      Mail arrived at <say-as interpret-as="time">7am</say-as> today.
    </p>
  </voice>
  <voice gender="male" age="30">
    <p>
      Hi, Steve!
      <break/>
      Hope you're OK.
    </p>
    <p>
      Sincerely yours, Peter.
    </p>
  </voice>
</speak>
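For a closer look at the prosodic elements, here is a small sketch that sets a few of the more common attributes of <prosody/>, <break/>, and <emphasis/>. The text and the particular values are made up for illustration, and how strongly each setting is honored depends on the synthesis processor.

```xml
<?xml version="1.0"?>
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis"
       xml:lang="en-US">
  <p>
    <!-- rate and volume accept labels such as "slow" and "loud" -->
    <prosody rate="slow" volume="loud">Attention, please.</prosody>
    <!-- an explicit half-second pause -->
    <break time="500ms"/>
    <!-- pitch raised relative to the current value -->
    <prosody pitch="+10%">The meeting</prosody>
    <emphasis level="strong">has moved</emphasis>
    to room twelve.
    <!-- a prosodic boundary given by strength rather than duration -->
    <break strength="strong"/>
    Thank you.
  </p>
</speak>
```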
Implementations
Several implementations of SSML are available. Some are open source, while others are proprietary industry implementations. For an open source example, see FreeTTS. Speech technology and telecommunications industry leaders with implementations include France Telecom, Loquendo S.p.A., ScanSoft, and Voxpilot. All of them have provided implementation reports to the W3C; for more information, see the W3C's SSML 1.0 Implementation Report.