SSML

SSML (Speech Synthesis Markup Language) is a markup language that lets you control certain features of the synthesized audio by adding tags to the text. It can be used to control many aspects of speech, such as voice timbre, pronunciation, pace, and prosody.

The following sections explain which SSML elements RevoiceIt currently supports and how to use them.

Our Text-to-Speech engine currently supports a subset of the most commonly used tags. Specifically, RevoiceIt supports four tags:

  • break: tag used to add pauses and hesitations into the speech between words.
  • phoneme: tag used to control the pronunciation of specific words.
  • say-as: tag used to control the way we want to pronounce certain elements in the text (e.g. dates).
  • emphasis: tag used to emphasize a specific part of the text.

NOTE: The system currently does not support multiple SSML tags on the same portion of text. If a tag is applied to one or more words, another tag cannot be applied to them.
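
The restriction in the note above can be checked before a text is sent for synthesis. The following is a minimal sketch, not part of RevoiceIt: the function name find_nested_tags is hypothetical, and it assumes that "multiple tags on the same portion of text" means one tag nested inside another. It also assumes the input is well-formed XML once wrapped in a dummy root element, so a bare & in the text would need to be escaped first.

```python
import xml.etree.ElementTree as ET


def find_nested_tags(text: str) -> list[str]:
    """Return one message per tag that is nested inside another tag.

    Hypothetical helper: assumes nesting is what the note forbids.
    """
    # Wrap the text in a dummy root so inline tags parse as XML children.
    root = ET.fromstring(f"<root>{text}</root>")
    problems = []
    for element in root:  # direct children are the top-level tags
        for child in element.iter():  # iter() yields the element itself first
            if child is not element:
                problems.append(f"<{child.tag}> inside <{element.tag}>")
    return problems


ok = 'This <break time="0.2s"/> is fine.'
bad = 'I went to <phoneme ph_lang="it">Roma <break time="1s"/></phoneme>.'
print(find_nested_tags(ok))
print(find_nested_tags(bad))
```

An empty list means no tag wraps another one. Whether the engine rejects such input or silently ignores the inner tag is not stated in this document.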

Style SSML tags

Style SSML tags do not change how a word is pronounced. Instead, they directly modify the prosody of a part of the sentence. They follow the format:

<ssml_tag attribute="value"/>

A tag can take one or more attribute="value" pairs. The number of attributes and their possible combinations depend on the specification of the SSML tag.

break

A break adds a pause or hesitation to the speech. It lets you insert a pause for dramatic effect or to emphasize a specific word. The <break> tag accepts two mutually exclusive attributes:

  • time: attribute used to control the length of the pause in seconds. Decimal values use a dot (.) as the separator, e.g. 0.2 (written as time="0.2s" in the example below).

  • strength: attribute used to control the length/intensity of the pause with discrete classes. The possible classes that can be used with this attribute are:

    • none: No pause is output. This can be used to remove a pause that would normally occur.
    • weak: Treat adjacent words as if separated by a single comma.
    • strong: Insert a sentence break.
    • x-strong: Insert a more emphasized pause, comparable to a paragraph break.

An example usage is:

This <break time="0.2s"/> is known as <break time="2s"/> the Shatner pause. We can even add a discrete pause <break strength="strong"/> for a dramatic effect.
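
When break tags are generated programmatically, the two rules above (time and strength are mutually exclusive, and strength takes one of four classes) can be enforced before the text is sent. This is a minimal sketch under stated assumptions: break_tag and BREAK_STRENGTHS are hypothetical names, not part of any RevoiceIt client, and the "s" suffix simply follows the example above.

```python
from typing import Optional

# Strength classes listed in this section.
BREAK_STRENGTHS = {"none", "weak", "strong", "x-strong"}


def break_tag(time_s: Optional[float] = None, strength: Optional[str] = None) -> str:
    """Build a <break> tag from exactly one of time_s or strength."""
    # The two attributes are mutually exclusive, so exactly one must be set.
    if (time_s is None) == (strength is None):
        raise ValueError("set exactly one of time_s or strength")
    if time_s is not None:
        if time_s < 0:
            raise ValueError("time_s must not be negative")
        # ":g" drops trailing zeros and keeps the dot as decimal separator
        # for typical pause lengths (0.2 -> "0.2", 2.0 -> "2").
        return f'<break time="{time_s:g}s"/>'
    if strength not in BREAK_STRENGTHS:
        raise ValueError(f"unknown strength: {strength!r}")
    return f'<break strength="{strength}"/>'


print(f"This {break_tag(time_s=0.2)} is known as {break_tag(time_s=2)} the Shatner pause.")
print(f"A discrete pause {break_tag(strength='strong')} for dramatic effect.")
```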

Word SSML tags

Word SSML tags affect a specific word of the sentence. Unlike Style SSML tags, their format contains a tag, one or more attributes, and the word that must be pronounced in the specified manner. They follow the format:

<ssml_tag attribute="value">word</ssml_tag>

As with Style SSML tags, there can be more than one attribute="value" pair.

phoneme

The <phoneme> tag is used to customize the pronunciation of words inline. Our Text-to-Speech engine accepts the IPA phonetic alphabet. The <phoneme> tag accepts the following attributes:

  • alphabet: this attribute refers to the phonetic alphabet in use. For now, only the value ipa is allowed. The alphabet attribute can be omitted.
  • ph: this attribute refers to the phonetic sequence that will be pronounced. NOTE: ph and ph_lang are mutually exclusive.
  • ph_lang: this attribute refers to the language that will be used to phonemize the inline word. Our phonemization system will phonemize in the specified language. An error is raised if the ph attribute is set as well.
  • pr_lang: this attribute refers to the language that will be used to read the phonemized sentence. For example, a word can be phonemized in English but pronounced in Italian, which gives it an Italian accent.
  • transliteration: this attribute lets you fix the pronunciation of a specific word by giving the text that should be read in its place.

The possible combinations of attributes for the <phoneme> tag are:

- ['alphabet', 'ph_lang', 'pr_lang']
- ['alphabet', 'ph', 'pr_lang']
- ['alphabet', 'ph_lang']
- ['ph_lang', 'pr_lang']
- ['alphabet', 'ph']
- ['ph', 'pr_lang']
- ['ph_lang']
- ['ph']

All of the above combinations are also valid when adding transliteration.
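
The list of valid combinations can be turned into a small lookup for checking tags before they are sent. This is a minimal sketch: is_valid_phoneme_attributes is a hypothetical helper, not part of RevoiceIt. It reads the sentence about transliteration as "transliteration may be added to any listed combination", so transliteration on its own is treated as invalid, which is an assumption.

```python
# The eight attribute combinations listed above, order ignored.
ALLOWED_COMBINATIONS = {
    frozenset(combo)
    for combo in [
        ("alphabet", "ph_lang", "pr_lang"),
        ("alphabet", "ph", "pr_lang"),
        ("alphabet", "ph_lang"),
        ("ph_lang", "pr_lang"),
        ("alphabet", "ph"),
        ("ph", "pr_lang"),
        ("ph_lang",),
        ("ph",),
    ]
}


def is_valid_phoneme_attributes(attribute_names) -> bool:
    """Check a set of <phoneme> attribute names against the listed combinations."""
    # transliteration is optional on top of any listed combination.
    names = frozenset(attribute_names) - {"transliteration"}
    return names in ALLOWED_COMBINATIONS


print(is_valid_phoneme_attributes({"alphabet", "ph_lang", "pr_lang"}))
print(is_valid_phoneme_attributes({"ph_lang", "transliteration"}))
print(is_valid_phoneme_attributes({"ph", "ph_lang"}))
```

The last call returns False because ph and ph_lang are mutually exclusive and therefore never appear together in the list.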

Each <phoneme> tag specifies the pronunciation of one word or expression. E.g.:

Last year I went to <phoneme alphabet="ipa" ph_lang="it" pr_lang="en-us">Roma</phoneme> to spend the summer holidays.

In this example, the word Roma is phonemized with the Italian phonemization roma instead of the American English version ɹoʊmə, but it is pronounced with an American English accent.

The office is the <phoneme ph_lang="en-us" transliteration="one o one">101</phoneme>, at the end of the corridor.

In the example above, the number 101 (one hundred one) is pronounced as one o one.

say-as

The <say-as> tag is used to control how the inline word is interpreted and read. This tag requires one mandatory attribute, interpret-as, which determines how the value is spoken. Other optional attributes can be added depending on the value assigned to interpret-as.

The current version supports the following values for the interpret-as attribute:

  • date: the inline word is read as a date. In this case, the format attribute is also mandatory. It takes a sequence of date field character codes. The supported codes are y, m, and d, which stand for year, month, and day (of the month), respectively. Accepted formats are:

    - mdy
    - dmy
    - ymd
    - md
    - dm
    - ym
    - my
    - d
    - m
    - y

    For example, the following sentence reads a date as day-month:

    When is the concert? <say-as interpret-as="date" format="dm">06-05</say-as>.

    By default, the symbol that separates the year, month, and day values is -. To use a different one, set the optional delimiter attribute. E.g.:

    When is the concert? <say-as interpret-as="date" format="dmy" delimiter="/">06/05/2022</say-as>.

    NOTE: If the target language is en-us, the date will be transformed into six may two thousand twenty-two.
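
The date rules above (a format drawn from the accepted list, a default delimiter of -, and an optional delimiter attribute) can be combined in a small builder. This is a minimal sketch: say_as_date and DATE_FORMATS are hypothetical names, not part of RevoiceIt. The check that the value has as many fields as the format has codes is an assumption inferred from the examples, not documented behavior.

```python
# Accepted values for the format attribute, as listed above.
DATE_FORMATS = {"mdy", "dmy", "ymd", "md", "dm", "ym", "my", "d", "m", "y"}


def say_as_date(value: str, fmt: str, delimiter: str = "-") -> str:
    """Build a <say-as interpret-as="date"> tag for the given value."""
    if fmt not in DATE_FORMATS:
        raise ValueError(f"unsupported date format: {fmt!r}")
    # Assumption: one delimited field per format code (e.g. "dm" -> 2 fields).
    if len(value.split(delimiter)) != len(fmt):
        raise ValueError(f"{value!r} does not match format {fmt!r}")
    attributes = f'interpret-as="date" format="{fmt}"'
    # The delimiter attribute is only needed when it differs from the default.
    if delimiter != "-":
        attributes += f' delimiter="{delimiter}"'
    return f"<say-as {attributes}>{value}</say-as>"


print("When is the concert? " + say_as_date("06-05", "dm") + ".")
print("When is the concert? " + say_as_date("06/05/2022", "dmy", delimiter="/") + ".")
```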