Useful SSML for Alexa Skills

When you’re looking to personalize your skills and create engaging experiences for the Alexa platform, you really need to use SSML. Plain text output gets old quickly, and your carefully crafted responses can be misread, resulting in speech other than what you intended.

There’s a whole world of possibilities outside the regular Alexa responses. In this tutorial, you will find a lot of examples of useful SSML elements for your Alexa Skills.

Audio

Reference: https://developer.amazon.com/en-US/docs/alexa/custom-skills/speech-synthesis-markup-language-ssml-reference.html#audio

Add custom audio to your skill by including an <audio> element in your speech output. Simply set the src attribute to your file’s URL and you should be able to hear your audio in the skill.

Ideally, you should upload your own MP3 files to S3 (and maybe distribute them via a CDN such as CloudFront). Wherever you host them, there are a few rules your files need to follow in order to be used with Alexa. See the reference link for the most up-to-date information on MP3 file requirements.

Here’s an example I used for one of my skills:

<!-- Output: plays the hosted welcome_01.mp3 file -->
<audio src="https://gtmenterprises.s3-us-west-1.amazonaws.com/projects/eight-values-political-quiz/audio/processed/messages/welcome_01.mp3" />

NOTE: audio files in your skill’s responses need to be properly encoded. See this link for more information: https://developer.amazon.com/en-US/docs/alexa/custom-skills/speech-synthesis-markup-language-ssml-reference.html#h3_converting_mp3

NOTE: as of this article’s publication, you can only include 5 audio elements in a single Alexa response. These audio files cannot exceed 240 seconds combined, so make sure to account for this before recording long audio sessions. These rules change as Amazon sees fit, so check the reference link to be sure.
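
If you assemble responses dynamically, it’s easy to trip over that element limit without noticing. Here’s a minimal Python sketch of a pre-flight check. It is not part of any Alexa SDK: the names count_audio_tags, within_audio_limit, and MAX_AUDIO_TAGS are my own, and the limit of 5 is simply the number quoted above, so adjust it if Amazon changes the rule. It does not check the combined duration.

```python
import re

# Limit quoted in this article; verify against the current Amazon docs.
MAX_AUDIO_TAGS = 5


def count_audio_tags(ssml: str) -> int:
    """Count <audio ...> elements in an SSML string with a simple regex scan."""
    return len(re.findall(r"<audio\b", ssml))


def within_audio_limit(ssml: str, limit: int = MAX_AUDIO_TAGS) -> bool:
    """Return True if the response uses no more than `limit` audio elements."""
    return count_audio_tags(ssml) <= limit
```

The regex scan is deliberately naive (it would also count an <audio> tag that appears inside an XML comment), which is good enough for a quick sanity check before sending a response.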

Alexa Skills Kit Sound Library

Reference: https://developer.amazon.com/en-US/docs/alexa/custom-skills/ask-soundlibrary.html

The ASK Sound Library has a large variety of stock sounds that you can use to enhance your skill experience for users. The list is long and vast, so check it out before you decide to handle uploading your own properly encoded audio files to S3.

Here are some examples:

<!-- Output: Sci-Fi Laser shooting sounds -->
<audio src="soundbank://soundlibrary/scifi/amzn_sfx_scifi_laser_gun_fires_04"/>
<!-- Output: long cat screech -->
<audio src="soundbank://soundlibrary/animals/amzn_sfx_cat_angry_screech_1x_01"/>
<!-- Output: small dog bark -->
<audio src="soundbank://soundlibrary/animals/amzn_sfx_dog_small_bark_2x_01"/>

Breaks and Pauses

Reference: https://developer.amazon.com/en-US/docs/alexa/custom-skills/speech-synthesis-markup-language-ssml-reference.html#break

Sometimes you want a pause in your responses to create a more conversational style. A pause can also be useful between audio files, especially where a natural gap belongs.

Here are some examples of how you pause:

<!-- a 3000 millisecond pause -->
<break time="3000ms" />
<!-- a 3 second pause -->
<break time="3s" />

NOTE: There is a strength attribute that you can add to the <break /> element. However, I find that it doesn’t work as expected. You’re better off with normal punctuation in your text responses, other SSML tags, and simple <break time=""/> pauses for more granular control.
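
If you build responses in code, a tiny helper keeps the time-based pauses consistent. This is a sketch under a couple of assumptions: the function names break_tag and join_with_pauses are hypothetical, and the 10-second ceiling reflects my reading of Amazon’s reference, so double-check it against the docs before relying on it.

```python
def break_tag(milliseconds: int, max_ms: int = 10_000) -> str:
    """Build a <break> element, clamping the pause to the range 0..max_ms.

    The default max_ms of 10 seconds is an assumption; confirm it in the docs.
    """
    clamped = max(0, min(milliseconds, max_ms))
    return f'<break time="{clamped}ms"/>'


def join_with_pauses(parts, milliseconds: int = 500) -> str:
    """Join speech or audio fragments with the same pause between each pair."""
    return break_tag(milliseconds).join(parts)
```

For example, join_with_pauses(["Welcome back.", "Ready to play?"], 750) places a 750 millisecond break between the two sentences.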

amazon:domain

Reference: https://developer.amazon.com/en-US/docs/alexa/custom-skills/speech-synthesis-markup-language-ssml-reference.html#amazon-domain

This SSML tag changes the speaking style Alexa uses for the wrapped text. Try the following example to hear the results: the same sentence is read in the news style, the music style, and the default style, and each response differs slightly.

<speak>
    <p>
        <amazon:domain name="news">
        On their fourth album, Marigold, the indie-folk band Pinegrove still possess their signature warmth, but the charm of their heartfelt confessionals has dimmed. 
        </amazon:domain>
    </p>
    <p>
        <amazon:domain name="music">
        On their fourth album, Marigold, the indie-folk band Pinegrove still possess their signature warmth, but the charm of their heartfelt confessionals has dimmed. 
        </amazon:domain>
    </p>
    <p>
        On their fourth album, Marigold, the indie-folk band Pinegrove still possess their signature warmth, but the charm of their heartfelt confessionals has dimmed.
    </p>
</speak>

say-as

Reference: https://developer.amazon.com/en-US/docs/alexa/custom-skills/speech-synthesis-markup-language-ssml-reference.html#say-as

The default Alexa speech response can be surprising when we start introducing things like numbers and dates to our Skill output. As developers, we need to instruct Alexa to interpret these data strings into something a human can recognize.

The <say-as> SSML element gives us a lot of flexibility to customize our voice experiences.
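
When the text comes from user data or an API, it’s worth generating the element in code so that special characters are escaped. The sketch below uses only Python’s standard library; the helper name say_as and its parameters are my own invention, not an Alexa SDK function.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def say_as(text: str, interpret_as: str, date_format: Optional[str] = None) -> str:
    """Wrap text in a <say-as> element, escaping XML special characters."""
    attrs = f"interpret-as={quoteattr(interpret_as)}"
    if date_format is not None:
        # The format attribute is only meaningful with interpret-as="date".
        attrs += f" format={quoteattr(date_format)}"
    return f"<say-as {attrs}>{escape(text)}</say-as>"
```

Calling say_as("AT&T", "spell-out") yields an element whose ampersand is escaped as &amp;, which keeps the response valid XML.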

Spelling out a word

Using the attribute interpret-as with the value spell-out (for example, <say-as interpret-as="spell-out">spellitout</say-as>), we can tell Alexa to spell out a word.

<speak>
    <s>supercalifragilisticexpialidocious</s>
    <say-as interpret-as="spell-out">supercalifragilisticexpialidocious</say-as>
</speak>

Telephone numbers

<!-- Simple US phone number without area code -->
<say-as interpret-as="telephone">8675309</say-as>
<!-- Simple US phone number with area code -->
<say-as interpret-as="telephone">2128675309</say-as>
<!-- Phone number with extension -->
<!-- Alexa will say the word "Extension" at the "x" character -->
<say-as interpret-as="telephone">2128675309x1234</say-as>

Dates

Dates are an important type of data to read back to our users. Using the <say-as interpret-as="date"> element, we can have Alexa utter a human-understandable date.

Note: The Amazon Alexa SSML docs left out an important detail when using the format attribute: you must use a separator between calendar units when specifying a date format other than YYYYMMDD. Use either the dash character (“-”) or the forward slash (“/”).

<!-- You can use "?" to ignore year and/or month -->
<!-- Leaving out the format attribute will default to YYYYMMDD -->
<!-- Output: "First" -->
<say-as interpret-as="date">??????01</say-as>

<!-- Output: "January First" -->
<say-as interpret-as="date">????0101</say-as>
<!-- Output: "January First, Twenty Twenty" -->
<say-as interpret-as="date">20200101</say-as>

<!-- Output: "January First, Twenty-Twenty" -->
<say-as interpret-as="date" format="mdy">01-01-20</say-as>
<say-as interpret-as="date" format="dmy">01-01-20</say-as>
<say-as interpret-as="date" format="ymd">20/01/01</say-as>

<!-- Output: "January First" -->
<say-as interpret-as="date" format="dm">01-01</say-as>
<say-as interpret-as="date" format="md">01/01</say-as>
<!-- Output: "March Twenty-Twenty" -->
<say-as interpret-as="date" format="ym">2020-03</say-as>
<say-as interpret-as="date" format="my">03-2020</say-as>
<!-- Output: "Thirty-First" -->
<say-as interpret-as="date" format="d">31</say-as>
<!-- Output: "March" -->
<say-as interpret-as="date" format="m">03</say-as>
<!-- Output: "Twenty Twenty" -->
<say-as interpret-as="date" format="y">2020</say-as>
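
If your skill already has dates as Python objects, you can render the default YYYYMMDD form directly. This is a minimal sketch; the function name date_ssml and the include_year flag are hypothetical, and it leans on the “?” placeholder behavior shown in the examples above.

```python
from datetime import date


def date_ssml(d: date, include_year: bool = True) -> str:
    """Render a date in the default YYYYMMDD form; '????' skips the year."""
    year = f"{d.year:04d}" if include_year else "????"
    return f'<say-as interpret-as="date">{year}{d.month:02d}{d.day:02d}</say-as>'
```

Passing include_year=False is handy for recurring events such as birthdays, where reading the year aloud would sound odd.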

First, Second, Third – Ordinal Numbers

If you are designing a scoreboard or leaderboard, you might find ordinal numbers to be useful. Ordinal numbers are formatted like so:
<say-as interpret-as="ordinal">#</say-as>, where # is a valid integer greater than 0.

<!-- Output: "James is in first place." -->
James is in <say-as interpret-as="ordinal">1</say-as> place.
<!-- Output: "Bill is in second place." -->
Bill is in <say-as interpret-as="ordinal">2</say-as> place.
<!-- Output: "Raj is in third place." -->
Raj is in <say-as interpret-as="ordinal">3</say-as> place.
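
For a leaderboard of any length, the same pattern can be generated in a loop. Treat this as a sketch: leaderboard_ssml is a name I made up, and it assumes the names arrive already sorted from first place down.

```python
from xml.sax.saxutils import escape


def leaderboard_ssml(names) -> str:
    """Build one sentence per player, using ordinals for the placements."""
    sentences = [
        f'<s>{escape(name)} is in '
        f'<say-as interpret-as="ordinal">{place}</say-as> place.</s>'
        for place, name in enumerate(names, start=1)  # places start at 1
    ]
    return "<speak>" + "".join(sentences) + "</speak>"
```

Each sentence is wrapped in its own <s> element so Alexa treats the placements as separate sentences.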

Expletives, Curse Words, and Bleeping

Make your skill friendlier to sensitive ears by adding a say-as element with interpret-as set to expletive. The content wrapped in this element will be bleeped out. The length of the bleep is controlled by the length of the word enclosed by the XML tag.

Taylor is a <say-as interpret-as="expletive">fudging</say-as> liar.
Your favorite word is <say-as interpret-as="expletive">supercalifragilisticexpialidocious</say-as>, right?

Units and fractions

This one is absolutely a game changer for skills that deal with recipes, scientific equations, construction projects, and anything else that involves fractions and units like feet, meters, yards, and inches.

<!-- Output: "Three sixteenths" -->
<say-as interpret-as="fraction">3/16</say-as>
<!-- Output: "Three sixteenths of an inch" -->
<say-as interpret-as="unit">3/16in</say-as>
<!-- Output: "Three hundred pounds" -->
<say-as interpret-as="unit">300lb</say-as>

Time

If you find that your skill uses formatted time data (for instance, from a race result), you might find the time value for the interpret-as attribute valuable:

<!-- Output: "One minute and twenty-one seconds"-->
<say-as interpret-as="time">1'21"</say-as>

Digits

If you want a number uttered digit by digit, use the digits value to get your numeric value read correctly:

<!-- Output: "three point one four one five nine two" etc. -->
<say-as interpret-as="digits">3.14159265358979323846 2643383279502884197169399375105820974944</say-as>

Addresses

Addresses are fairly common, but hard to get to sound right if you’re using standard text. Luckily, the Alexa developers have you covered!

<!-- Output: "Sixteen Hundred Pennsylvania Avenue Northwest, Washington, D, C, two zero five zero zero" -->
<say-as interpret-as="address">1600 Pennsylvania Ave NW, Washington, DC 20500</say-as>

Interjections

Alexa has a number of speechcons that it supports for use with your Skill’s locale. For example, see the English (US) speechcon list in the Alexa documentation.

<!-- You can hear the difference when played sequentially -->
<s>Wow</s>
<say-as interpret-as="interjection">Wow</say-as>

NOTE: You really should avoid using interjections with anything other than the listed speechcons unless you want the speech to sound extremely stilted and unnatural.

Voices

Amazon offers a wide variety of voices other than the standard Alexa voice for the en-US locale. Simply wrap your text with the <voice> tag with the name attribute specified.

<!-- en-US locale voices -->
<voice name="Ivy">Hi, my name is Ivy.</voice>
<voice name="Joanna">Hi, my name is Joanna. This is what my voice sounds like.</voice>
<voice name="Joey">Hi, my name is Joey. This is what my voice sounds like.</voice>
<voice name="Justin">Hi, my name is Justin. This is what my voice sounds like.</voice>
<voice name="Kendra">Hi, my name is Kendra. This is what my voice sounds like.</voice>
<voice name="Kimberly">Hi, my name is Kimberly. This is what my voice sounds like.</voice>
<voice name="Matthew">Hi, my name is Matthew. This is what my voice sounds like.</voice>
<voice name="Salli">Hi, my name is Salli. This is what my voice sounds like.</voice>

You can also use voices from other regions. Just wrap your content in an inner <lang> tag set to the appropriate locale. See the following examples for more information.

<!-- en-AU locale voices -->
<voice name="Nicole"><lang xml:lang="en-AU">Hi, my name is Nicole. This is what my voice sounds like.</lang></voice>
<voice name="Russell"><lang xml:lang="en-AU">Hi, my name is Russell. This is what my voice sounds like.</lang></voice>

<!-- de-DE locale voices -->
<voice name="Hans"><lang xml:lang="de-DE">Hi, my name is Hans. This is what my voice sounds like.</lang></voice>
<voice name="Marlene"><lang xml:lang="de-DE">Hi, my name is Marlene. This is what my voice sounds like.</lang></voice>
<voice name="Vicki"><lang xml:lang="de-DE">Hi, my name is Vicki. This is what my voice sounds like.</lang></voice>

<!-- en-GB locale voices -->
<voice name="Amy"><lang xml:lang="en-GB">Hi, my name is Amy. This is what my voice sounds like.</lang></voice>
<voice name="Brian"><lang xml:lang="en-GB">Hi, my name is Brian. This is what my voice sounds like.</lang></voice>
<voice name="Emma"><lang xml:lang="en-GB">Hi, my name is Emma. This is what my voice sounds like.</lang></voice>

<!-- en-IN locale voices -->
<voice name="Aditi"><lang xml:lang="en-IN">Hi, my name is Aditi. This is what my voice sounds like.</lang></voice>
<voice name="Raveena"><lang xml:lang="en-IN">Hi, my name is Raveena. This is what my voice sounds like.</lang></voice>

<!-- es-ES locale voices -->
<voice name="Conchita"><lang xml:lang="es-ES">Hi, my name is Conchita. This is what my voice sounds like.</lang></voice>
<voice name="Enrique"><lang xml:lang="es-ES">Hi, my name is Enrique. This is what my voice sounds like.</lang></voice>

<!-- it-IT locale voices -->
<voice name="Carla"><lang xml:lang="it-IT">Hi, my name is Carla. This is what my voice sounds like.</lang></voice>
<voice name="Giorgio"><lang xml:lang="it-IT">Hi, my name is Giorgio. This is what my voice sounds like.</lang></voice>

<!-- ja-JP locale voices -->
<voice name="Mizuki"><lang xml:lang="ja-JP">Hi, my name is Mizuki. This is what my voice sounds like.</lang></voice>
<voice name="Takumi"><lang xml:lang="ja-JP">Hi, my name is Takumi. This is what my voice sounds like.</lang></voice>

<!-- fr-FR locale voices -->
<voice name="Celine"><lang xml:lang="fr-FR">Hi, my name is Celine. This is what my voice sounds like.</lang></voice>
<voice name="Lea"><lang xml:lang="fr-FR">Hi, my name is Lea. This is what my voice sounds like.</lang></voice>
<voice name="Mathieu"><lang xml:lang="fr-FR">Hi, my name is Mathieu. This is what my voice sounds like.</lang></voice>

NOTE: The pronunciation for many of these voices will be hilarious.
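
If you switch voices often, a small wrapper saves some typing and keeps the nesting order right. Again, this is only a sketch: voice_ssml is a hypothetical helper, and it doesn’t validate that the voice name and locale are a pairing Alexa actually supports.

```python
from typing import Optional
from xml.sax.saxutils import escape, quoteattr


def voice_ssml(name: str, text: str, locale: Optional[str] = None) -> str:
    """Wrap text in <voice>, adding an inner <lang> tag when a locale is given."""
    body = escape(text)
    if locale is not None:
        # The <lang> element goes inside <voice>, as in the examples above.
        body = f"<lang xml:lang={quoteattr(locale)}>{body}</lang>"
    return f"<voice name={quoteattr(name)}>{body}</voice>"
```

For instance, voice_ssml("Amy", "Hi, my name is Amy.", "en-GB") reproduces the first en-GB example above with a shorter sentence.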

Conclusion

I hope you learned a lot about some of the most useful Alexa SSML elements and will integrate them into your Skills. If you find any other useful elements, feel free to leave them in the comments below.

Until next time, cheers!