Latest updates on Azure Neural TTS: new voices for casual conversations

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

This post is co-authored with Melinda Ma, Yueying Liu, Garfield He and Sheng Zhao

Neural Text to Speech (Neural TTS), a powerful speech synthesis capability of Cognitive Services on Azure, enables you to convert text to lifelike speech that is close to human parity.

Since its launch, Azure Neural TTS has been applied to a wide range of scenarios, from voice assistants to news reading and audiobook creation. We have also seen more customers ask for support for natural conversations that are casual and less formal. Today we are glad to announce several updates to Neural TTS, with a focus on new voices optimized for casual conversation scenarios.

Conversational voices: scenarios and challenges

TTS voices are increasingly used to support human-machine conversations and machine-facilitated interpersonal communication (e.g., human conversations supported by speech-to-speech translation). In these scenarios, a more relaxed and casual speaking style is usually expected. We outline three typical scenarios for conversational voices or conversational styles below.

Customer service bot

Many enterprises are using voice-enabled chatbots or IVR systems to provide more efficient customer service and transform their traditional customer care. For example, Vodafone created a natural-sounding customer service bot, TOBi, and used the AI and natural language processing capabilities in Azure to give TOBi a clear personality that makes conversations natural and fun, which drives better engagement. After a customer gives their name, instead of a dry request like, “Now tell me your address,” TOBi might say, “Hey, that’s a great name. Now I’d like to know where you live.” In such scenarios, the AI voice is usually expected to sound comforting, friendly, and warm while remaining professional. Besides answering customer inquiries, the AI voice is also frequently used to give cheerful greetings and show empathy to customers.

Personal assistant

With the emergence of virtual assistant and virtual reality technology, we’ve seen more customers using Neural TTS to support chit-chat and daily conversations. One challenge in making AI-human chat more natural is for the bot to understand chat language, which often contains special characters, modal particles like “hehe”, “haha”, and “ouch”, emojis, and repeated letters like “soooo good”, and to provide instant responses in natural tones. In addition, expressing different emotions for different messages is a frequently requested capability, so that the chatbot can better resonate with human feelings.

Simultaneous speech translation

Speech-to-speech translation is another typical scenario where a conversational AI voice can be used. With broad coverage of over 70 languages and variants, Azure Neural TTS has been used to provide speech output for various translations. It has been challenging, however, to preserve the original speaker’s style when their speech is translated into another language. Especially in casual speech scenarios, the simultaneous speaking tones often provide the subtle nuances of the speech and help the audience build an emotional connection with the speaker. In such cases, an AI voice that can support simultaneous speech and capture casual speaking styles can make speech-to-speech translation more vivid and engaging.

Next, we introduce the latest updates to Azure Neural TTS conversational voices in different languages.

Sara: a new chatbot voice in English (US)

Sara, a new conversational voice in English (US), represents a young adult female who talks more casually and is best suited to chatbot scenarios. At release, she comes with three emotional styles: cheerful, sad, and angry. In addition, she can read emojis, produce laughter, sighs, or special angry sounds, and express emphasis such as “soooo good”, just like a human being.

The examples below show what these sound effects sound like.

Text input	With emoji support
That's great. I'm not working right now.
Uhhh, let me ... let me think, I eat hamburger for dinner.

Below is an example of Sara used in a chatbot scenario, holding a natural conversation with a human user. (This sample comes from a chit-chat between the bot and the human user; the language is casual and may contain errors.)

With the new Sara voice, you can also adjust the speaking style using SSML and switch between neutral, cheerful, sad, and angry tones.

Style	Script	TTS output
Cheerful	I’m so happy to see you.
Sad	She felt disheartened when she was not chosen to be on the school team.
Angry	Jack’s father was fuming with anger when he could not find Jack in his room.
Chat	File this under missed connections cuz i'm lost
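
As a rough illustration of the SSML-based style switching mentioned above, the following Python sketch builds an SSML document that wraps the text in an `mstts:express-as` element. The voice name `en-US-SaraNeural` is an assumption inferred from the naming pattern of `ja-JP-NanamiNeural`, and the helper `build_styled_ssml` is a hypothetical name introduced here for illustration. Check the official documentation for the exact voice and style identifiers.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespaces used by SSML and by the Microsoft TTS extensions.
SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def build_styled_ssml(text, voice, style=None, lang="en-US"):
    """Return an SSML string that speaks `text` with `voice`.

    If `style` is given, the text is wrapped in <mstts:express-as>;
    otherwise the voice's default (neutral) tone is used.
    """
    body = escape(text)  # escape &, <, > so the SSML stays well-formed
    if style is not None:
        body = (
            f"<mstts:express-as style={quoteattr(style)}>"
            f"{body}</mstts:express-as>"
        )
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


if __name__ == "__main__":
    # "en-US-SaraNeural" is an assumed voice name, not taken from this post.
    print(build_styled_ssml("I'm so happy to see you.", "en-US-SaraNeural", "cheerful"))
```

Omitting the `style` argument leaves out the `mstts:express-as` wrapper, which corresponds to the neutral tone.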

Xiaochen and Xiaoyan: new voices in Chinese optimized for spontaneous speech and customer service scenarios

Two new conversational voices are released in Chinese (Mandarin, simplified): Xiaochen, best used for creating spontaneous speech, and Xiaoyan, best used for customer service scenarios.

These two voices stand out for the following characteristics:

A more relaxed and casual speaking style

Conversational voices are different from voices for reading, broadcasting, or storytelling. In conversations, voices are usually more relaxed and casual, and the prosody changes often. When people talk casually, the pronunciation of each word may not be complete, the sentences may not be accurate, and the control of the voice does not need to be perfect or professional. The new voices, Xiaochen and Xiaoyan, are designed to reflect this casual speaking style.

More natural oral expressions

In spontaneous speech, sentences are often short, and the structure can be simple or even incomplete. Repetitions, disconnections, supplements, interruptions, disfluency, and redundancy are often observed. With our advanced modeling technology, both Xiaochen and Xiaoyan handle these patterns well. The imperfections in human expression are carefully designed and modeled so that the AI voices can learn from these imperfect features and sound more realistic.

The following is a simulated conversation demo in a customer service scenario. In this sample, Xiaoyan acts as a customer service assistant, and Xiaochen acts as a customer. Hear how relaxed and natural Xiaochen and Xiaoyan are when talking to each other.

Xiaoyan	喂，你好。
Xiaochen	喂，你好，我刚才接到这个电话打来的，然后我想问一下是有什么包裹吗，还是什么东西。
Xiaoyan	哦，您是要查包裹对吗？
Xiaochen	呃对，刚接到这个电话他说我有个包裹，但是我不确定，因为我没有寄东西。
Xiaoyan	嗯，我这里是总机，刚刚可能是分机给您去的电话吧？
Xiaochen	对，然后他叫我打这个电话。
Xiaoyan	嗯，那这样吧，麻烦您提供一下姓名，我帮您查一下。
Xiaochen	晓辰。
Xiaoyan	哪个辰？
Xiaochen	星辰的辰，晓是那个破晓的晓。
Xiaoyan	嗯好的，您稍等一下好吗？我刚才帮您看了一下，确实有一份由晓辰姓名签收的包裹。号码是一二三四五六七八九八七，这是您本人吗？
Xiaochen	是我本人。
Xiaoyan	嗯，因为这个包裹当时是由于地址不详，没有办法准确投递。这样您把这个详细地址跟我讲一下，我马上安排工作人员给您送过去好吗?
Xiaochen	哦，我现在在出差。不过也没关系，我到时候找人帮我签收，然后写我名字就可以了，是吧？
Xiaoyan	嗯，对的。
Xiaochen	寄到鼓楼大街1号吧。那能查到是谁寄的吗？
Xiaoyan	上面没有写的。
Xiaochen	啊那好吧。
Xiaoyan	哦，不过这个包裹显示是从北京寄出的。
Xiaochen	呃您稍等一下哈。诶，是从中关村寄出的吗？
Xiaoyan	嗯，是的。
Xiaochen	啊，那我知道了。就是我可不可以报一个电话号码给你，然后叫派送的工作人员直接跟这个人联系，可以吗？
Xiaoyan	您说的这个人是也是在原来的地址是吧？
Xiaochen	对，你到时候跟她联系的话，就直接送过去，拿给她就行。
Xiaoyan	嗯，好的。
Xiaochen	好，谢谢你呀，那有什么问题我还是可以打这个电话对吗？
Xiaoyan	对的，没问题。
Xiaochen	行，谢谢哈，给您添麻烦了。
Xiaoyan	嗯，不客气。
Xiaochen	好，那再见。
Xiaoyan	麻烦您对我的服务进行评价，再见。

New styles for Nanami in Japanese

Nanami is a popular Japanese voice. Three new styles are now available with Nanami: chat, customer service, and cheerful. These styles can be used to make your voice experience more engaging and enriched in various scenarios.

Voice	Style	Description
ja-JP-NanamiNeural	style="customerservice"	Expresses a friendly and helpful tone for customer support
	style="chat"	Expresses a casual and relaxed tone
	style="cheerful"	Expresses a positive and happy tone

Try the samples below:

Style	Script	TTS output
style="customerservice"	注文番号もありますか？
style="chat"	家賃はとても安いと思います。
style="cheerful"	みなさんお楽しみに！
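
To show how the style values in the table above might be applied, here is a minimal Python sketch that builds SSML for `ja-JP-NanamiNeural` with `xml.etree.ElementTree` and rejects any style not listed in this post. The function name `nanami_ssml` and the `NANAMI_STYLES` dictionary are hypothetical names introduced for this example; only the voice name and the three style values come from the table.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Serialize SSML as the default namespace and the extensions as "mstts:".
ET.register_namespace("", SSML_NS)
ET.register_namespace("mstts", MSTTS_NS)

# Styles listed for Nanami in this post, with their descriptions.
NANAMI_STYLES = {
    "customerservice": "friendly and helpful tone for customer support",
    "chat": "casual and relaxed tone",
    "cheerful": "positive and happy tone",
}


def nanami_ssml(text, style):
    """Build SSML that speaks `text` with Nanami in one of the listed styles."""
    if style not in NANAMI_STYLES:
        raise ValueError(f"unsupported style for Nanami: {style!r}")
    speak = ET.Element(
        f"{{{SSML_NS}}}speak",
        {"version": "1.0", f"{{{XML_NS}}}lang": "ja-JP"},
    )
    voice = ET.SubElement(speak, f"{{{SSML_NS}}}voice", {"name": "ja-JP-NanamiNeural"})
    express = ET.SubElement(voice, f"{{{MSTTS_NS}}}express-as", {"style": style})
    express.text = text  # ElementTree escapes special characters on output
    return ET.tostring(speak, encoding="unicode")


if __name__ == "__main__":
    print(nanami_ssml("みなさんお楽しみに！", "cheerful"))
```

Validating the style name before building the document catches typos locally, before anything is sent to a synthesis service.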

Updates on other languages

As more customers adopt Azure Neural TTS, we have collected more feedback on the pronunciation accuracy of our voices in different cases. In our latest release, five voices have been updated with significant improvements in accuracy and naturalness. These updates bring better pronunciation and a more natural tone in four languages: id-ID, th-TH, da-DK, and vi-VN.

Compare the samples below to hear the improvements.

Locale	Voice	Improvement	Sample script	Before	After
id-ID	Ardi	Overall quality	Ia lahir pada dua April seribu sembilan ratus sembilan puluh di Surakarta, Indonesia.
th-TH	Premwadee	Overall quality	เริ่มจ่ายเงินผ่าน ธ.ก.ส.ถึงมือชาวนาได้ตั้งแต่วันที่ 6 ธ.ค. 62 – 30 ก.ย.63
da-DK	Christel	Overall quality	Sagde du noget til mig?
vi-VN	HoaiMy	Pronunciation with the Southern accent	Năm 1990, Liên Xô tan rã.
vi-VN	NamMinh	Pronunciation with the Southern accent	Năm 1990, Liên Xô tan rã.

Get started

With these updates, we’re excited to be powering natural and intuitive voice experiences for more customers. Text to Speech offers over 170 neural voices across more than 70 languages and variants. In addition, the Custom Neural Voice capability enables organizations to create a unique brand voice in multiple languages and styles.
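
As a starting point, the sketch below prepares an HTTP request that submits SSML to the Text to Speech REST endpoint, using only the Python standard library. It is an illustration under stated assumptions, not official sample code: the function name `build_tts_request`, the region `westus`, the key placeholder, and the `User-Agent` value are all hypothetical, and the endpoint and header names should be verified against the documentation. The request is only constructed here, not sent.

```python
import urllib.request


def build_tts_request(ssml, region, subscription_key,
                      output_format="riff-24khz-16bit-mono-pcm"):
    """Prepare (but do not send) a POST request that asks the service to synthesize `ssml`."""
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,  # your Speech resource key
        "Content-Type": "application/ssml+xml",
        "X-Microsoft-OutputFormat": output_format,      # audio format of the response
        "User-Agent": "neural-tts-sketch",              # placeholder application name
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


if __name__ == "__main__":
    ssml = (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="ja-JP">'
        '<voice name="ja-JP-NanamiNeural">'
        '<mstts:express-as style="chat">家賃はとても安いと思います。</mstts:express-as>'
        "</voice></speak>"
    )
    # "westus" and "YOUR_KEY" are placeholders; replace them with real values.
    request = build_tts_request(ssml, "westus", "YOUR_KEY")
    print(request.full_url)
    # To actually synthesize, send the request and save the audio bytes:
    # with urllib.request.urlopen(request) as response:
    #     open("output.wav", "wb").write(response.read())
```

Keeping request construction separate from sending makes it easy to inspect the URL and headers before spending any quota.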

For more information:

Try the demo
See our documentation
Check out our sample code