This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Tech Community.

Cognitive Services adds Brazilian Portuguese to Neural Text to Speech

This post was co-authored by Sheng Zhao, Anny Dow, Edward Un, Yueying Liu, Garfield He and Yang Zheng.


Neural Text to Speech (Neural TTS) converts text to lifelike speech for more natural interfaces. With natural-sounding speech that matches the stress patterns and intonation of human voices, neural TTS significantly reduces listening fatigue when users are interacting with AI systems, enabling scenarios from audiobooks to voice assistants.

Brazilian Portuguese neural voice now available

We’re excited to share that we are expanding our available neural TTS voices with Francisca, our new Brazilian Portuguese (pt-BR) voice. Francisca features the same human-like, natural prosody as the other neural TTS voices on Azure: Guy (American English Male), Jessa (American English Female), Katja (German Female), Elsa (Italian Female), and Xiaoxiao (Mandarin Chinese Female).

With a powerful base model created using a large volume of speech samples, we were able to build Francisca’s voice from much less training data than would otherwise be required. The neural TTS base model learns different speaking styles from multiple speakers and, through transfer learning, can easily adapt its style to a target speaker. Like other neural voices, Francisca can generate realistic speech waveforms for a given text input, seamlessly matching the stress patterns and intonation transitions of spoken language.

Beyond synthesizing speech, developers can tailor the voice to different scenarios by choosing a speaking style. For example, the new pt-BR voice can also speak in a “cheerful” style, which expresses a positive, happy emotion and is particularly useful in chatbot scenarios. You can switch speaking styles with the <mstts:express-as> element in SSML.
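As a rough illustration of the <mstts:express-as> element, the Python sketch below assembles an SSML document that wraps the text in a speaking style. The voice name pt-BR-FranciscaNeural, the style attribute, and the two namespace URIs are assumptions based on how Azure SSML samples are commonly written, not something this post specifies. Check the current SSML reference for the exact attribute name your service version expects. The build_ssml helper is hypothetical.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespace URIs as commonly seen in Azure SSML samples (assumed, not from this post).
SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def build_ssml(text, voice="pt-BR-FranciscaNeural", style=None, lang="pt-BR"):
    """Return an SSML string that speaks `text`, optionally in a speaking style.

    `voice` defaults to an assumed name for the Francisca voice.
    """
    body = escape(text)  # escape &, <, > so the text cannot break the XML
    if style is not None:
        # Wrap the text in the style element described in the post.
        body = f"<mstts:express-as style={quoteattr(style)}>{body}</mstts:express-as>"
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>{body}</voice>"
        "</speak>"
    )


# Cheerful greeting; omit `style` to get the neutral voice.
print(build_ssml("Olá! Que bom falar com você.", style="cheerful"))
```

Leaving style unset produces plain SSML without the express-as wrapper, which corresponds to the neutral style.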

We conducted MOS (Mean Opinion Score) studies to evaluate the naturalness of Francisca. In a crowdsourced test with more than 60 native speakers, we examined 30 audio samples produced by Francisca in the neutral style and another 30 in the cheerful style. Listeners rated their overall impression on a 1-5 Likert scale, considering naturalness in rhythm variations, pitch variations, stresses, pauses, and intelligibility. Human speech and a pt-BR voice from another cloud service provider (company X) were used as benchmarks. Francisca received high scores in both the neutral (4.44) and cheerful (4.38) styles.
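For readers unfamiliar with the metric, a MOS is simply the arithmetic mean of the listeners' ratings. The sketch below shows that calculation along with an approximate 95% confidence interval (normal approximation). The ratings in it are made up for illustration and are not data from the study above. The mean_opinion_score helper is hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_opinion_score(ratings):
    """Return (MOS, half-width of an approximate 95% confidence interval)."""
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must be on the 1-5 scale")
    mos = mean(ratings)
    # Normal-approximation interval; needs at least two ratings for a spread.
    if len(ratings) > 1:
        half_width = 1.96 * stdev(ratings) / sqrt(len(ratings))
    else:
        half_width = 0.0
    return mos, half_width


# Made-up ratings for one audio sample, NOT results from the study.
ratings = [5, 4, 4, 5, 3, 5, 4, 4]
mos, ci = mean_opinion_score(ratings)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```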

Figure 1. MOS comparison of Francisca with human speech and company X

Hear what Francisca sounds like.

Example 1: Francisca (neutral)

Example 2: Francisca (cheerful)

High fidelity and controllable output

Like other neural voices, Francisca is created using a 24 kHz sampling rate. You can maximize the fidelity of neural voice output by choosing one of the 24 kHz output formats:

raw-24khz-16bit-mono-pcm
riff-24khz-16bit-mono-pcm
audio-24khz-160kbitrate-mono-mp3
audio-24khz-96kbitrate-mono-mp3
audio-24khz-48kbitrate-mono-mp3

For scenarios that require a lower sampling rate, such as playback over phone calls, Francisca and other neural voices can easily be downsampled to a lower sampling rate. Learn more about the supported output formats.
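One way to select an output format is through a request header when calling the Text to Speech REST API. The sketch below only builds the request and does not send it. The endpoint pattern and header names follow the REST API as commonly documented and should be treated as assumptions here. The region, key, SSML placeholder, and the build_tts_request helper are hypothetical.

```python
import urllib.request


def build_tts_request(region, subscription_key, ssml,
                      output_format="riff-24khz-16bit-mono-pcm"):
    """Build (but do not send) a synthesis request for the given output format."""
    # Assumed regional endpoint pattern for the Text to Speech REST API.
    url = f"https://{region}.tts.speech.microsoft.com/cognitiveservices/v1"
    headers = {
        "Ocp-Apim-Subscription-Key": subscription_key,
        "Content-Type": "application/ssml+xml",
        # The output format string, e.g. one of the 24 kHz formats listed above.
        "X-Microsoft-OutputFormat": output_format,
        "User-Agent": "tts-format-sketch",
    }
    return urllib.request.Request(
        url, data=ssml.encode("utf-8"), headers=headers, method="POST"
    )


# Placeholder region, key, and SSML; replace with real values before sending.
request = build_tts_request(
    "westus", "YOUR_KEY", "<speak>...</speak>",
    output_format="audio-24khz-48kbitrate-mono-mp3",
)
print(request.full_url)
# To send it: with urllib.request.urlopen(request) as resp: audio = resp.read()
```

Switching to a lower-rate format should then only require changing the output_format string.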

Hear text aloud with Read Aloud in the new Edge browser

Neural TTS is powering Microsoft services at scale. The Francisca voice is now supported in the new Microsoft Edge, enabling you to hear text read aloud anytime, anywhere with natural voices.

Figure 2. Neural TTS in Edge Read-Aloud

Edge Read Aloud also makes it easy to follow along with the text: it supports word boundary output, so each word is highlighted in the UI as it is read out. This is an essential feature for immersive reading scenarios. To build your own Read Aloud apps, check out the SynthesisWordBoundaryEventAsync function in our sample code.
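To make the highlighting idea concrete, here is a minimal sketch of how an app might use word boundary information. It assumes each boundary event carries a character offset into the input text, a word length, and an audio timestamp. The WordBoundary class is a hypothetical stand-in for the SDK's event type, and the timings are made up.

```python
from dataclasses import dataclass


@dataclass
class WordBoundary:
    """Hypothetical stand-in for a word boundary event from the speech SDK."""
    audio_offset_ms: int  # when the word starts in the synthesized audio
    text_offset: int      # index of the word's first character in the input text
    word_length: int      # number of characters in the word


def highlight(text, boundary, open_mark="[", close_mark="]"):
    """Return `text` with the word described by `boundary` wrapped in markers."""
    start = boundary.text_offset
    end = start + boundary.word_length
    return text[:start] + open_mark + text[start:end] + close_mark + text[end:]


text = "Bom dia, Brasil"
# Made-up events; a real app would receive these from the synthesizer.
events = [WordBoundary(0, 0, 3), WordBoundary(300, 4, 3), WordBoundary(650, 9, 6)]
for event in events:
    print(f"{event.audio_offset_ms:>4} ms  {highlight(text, event)}")
```

In a real UI, the markers would be replaced by a style change on the corresponding text range, scheduled against the audio playback clock.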

Create a custom voice in Brazilian Portuguese

The same transfer learning technology now ships in the Custom Neural Voice capability, enabling organizations to create their own one-of-a-kind digital voices with 5x less data while still delivering high-fidelity audio output.

With Brazilian Portuguese (pt-BR) added to the family, seven locales are now supported in the custom neural voice online training portal: American English (en-US), British English (en-GB), Indian English (en-IN), German, French, Chinese (zh-CN), and Brazilian Portuguese (pt-BR). More locales are available through customer engagement. Submit a request to create your custom voice using the neural TTS technology.

Get started

With these updates, we’re excited to be powering natural and intuitive voice experiences. Text to Speech has more than 75 standard voices in over 45 languages and locales in addition to our growing list of neural voices. Learn more about how you can get started.