Introducing super realistic AI voices optimized for conversations

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

Thanks to Large Language Models (LLMs) such as Azure OpenAI GPT, AI can now produce more natural, fluent, and high-quality responses in human-bot conversations than ever before. As a result, spoken conversations demand more naturalness and expressiveness from Text-to-Speech (TTS) voices. We are introducing new voices designed specifically for conversational scenarios. Whether you are creating a speech-based chatbot, a voice assistant, or a conversational agent, these voices will make your interactions more realistic, lifelike, and engaging.


The new realistic voices suit any application that needs lifelike speech interaction, including chatbots, voice assistants, gaming, e-learning, and entertainment.


Meet the four new voices we are introducing today: en-US-AndrewNeural, en-US-BrianNeural, en-US-EmmaNeural, and zh-CN-YunjieNeural. All are optimized for conversational scenarios and available in public preview in three regions: East US, Southeast Asia, and West Europe.


Check out the voice samples


Demo of new vs. other Neural voices



New voice

Other Neural voice

 I can help you with a lot of things! I can answer questions, provide information on a wide range of topics, help you find things on the web, and more. If you have a specific question or task in mind, feel free to ask me and I'll do my best to assist you.



I'm not sure what you're asking. If you're asking for a paraphrase of the sentence "I learn about myself that I can lead a team", then it means that the speaker has discovered that they have the ability to lead a team. Is there anything else I can help you with?



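风筝有风,海豚有海,而您有我,感谢您的光临。么么哒! (English: "Kites have the wind, dolphins have the sea, and you have me. Thank you for visiting. Mwah!")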




Demo of new voices


New voice

I understand. It sounds like a place that is both impressive and terrifying. I wonder what kind of tea they serve there. Is it made from the sun's rays or from something else? And who are the people who live there? Are they loyal to the Empire or do they have their own agendas?


Yes, that is what I said. A maximin strategy is the one that maximizes the minimum payoff of a player, regardless of what the other players do. It is a way of ensuring that the player gets at least a certain amount of payoff, even in the worst case scenario.


If you can't find the information, you may want to consider contacting your state's insurance department. They may be able to help you locate any life insurance policies that were taken out on your husband. I hope this helps. Please let me know if you have any other questions.





Demo of full conversation


Conversation between Andrew and Emma


Conversation between Yunjie and Xiaochen



Integrate these new voices with Azure OpenAI


You can incorporate these new neural Text-to-Speech (TTS) voices into your applications using the Azure Speech SDK or REST API. You can also use the Azure Bot Framework to build intelligent bots that use the new neural TTS voices for speech synthesis.
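As a rough illustration of the SDK route, the sketch below selects one of the new voices through SSML and synthesizes audio with the Azure Speech SDK for Python (the `azure-cognitiveservices-speech` package). The helper names `build_ssml` and `synthesize` are hypothetical, and the key and region are placeholders you would supply from your own Speech resource; treat this as a minimal starting point under those assumptions, not a reference implementation.

```python
from xml.sax.saxutils import escape, quoteattr


def build_ssml(text: str, voice: str = "en-US-AndrewNeural") -> str:
    """Wrap plain text in minimal SSML that selects the given voice."""
    # Derive the locale ("en-US") from the voice name ("en-US-AndrewNeural").
    locale = "-".join(voice.split("-")[:2])
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f"xml:lang={quoteattr(locale)}>"
        f"<voice name={quoteattr(voice)}>{escape(text)}</voice>"
        "</speak>"
    )


def synthesize(text: str, voice: str, key: str, region: str) -> bytes:
    """Hypothetical helper: return synthesized audio bytes for `text`."""
    # Imported lazily so build_ssml works without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    config = speechsdk.SpeechConfig(subscription=key, region=region)
    # audio_config=None keeps the audio in memory instead of playing it.
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=config, audio_config=None
    )
    result = synthesizer.speak_ssml_async(build_ssml(text, voice)).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"Synthesis failed: {result.reason}")
    return result.audio_data
```

A call such as `synthesize("Hello!", "en-US-EmmaNeural", key, "eastus")` would need a Speech resource in one of the preview regions listed earlier.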

To minimize latency when combining Large Language Models (LLMs) with TTS, send text to the TTS service while the LLM is still generating its response. You can find a demo sample here that demonstrates generating TTS responses in a streaming manner.
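One way to picture the streaming approach: buffer the LLM's output tokens and hand each sentence to TTS as soon as it is complete, instead of waiting for the full response. The sketch below is not the demo sample mentioned above; `sentence_chunks` is a hypothetical helper, the punctuation set is an assumption, and the splitting is deliberately naive (it would also split on the period in "3.14", for example).

```python
from typing import Iterable, Iterator

# Characters treated as sentence boundaries (an assumption; tune per language).
SENTENCE_END = ".!?。！？"


def sentence_chunks(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit every complete sentence currently held in the buffer.
        while True:
            cut = next(
                (i for i, ch in enumerate(buffer) if ch in SENTENCE_END), -1
            )
            if cut == -1:
                break
            # Keep runs of punctuation such as "?!" with their sentence.
            while cut + 1 < len(buffer) and buffer[cut + 1] in SENTENCE_END:
                cut += 1
            sentence, buffer = buffer[: cut + 1].strip(), buffer[cut + 1 :]
            if sentence:
                yield sentence
    # Flush whatever is left when the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


if __name__ == "__main__":
    # Stand-in for tokens arriving from a streaming LLM response.
    fake_stream = ["Hello", " there.", " How are", " you?", " Fine"]
    for sentence in sentence_chunks(fake_stream):
        # In a real application, each sentence would be sent to TTS here.
        print(sentence)
```

Splitting at sentence boundaries is a trade-off: smaller chunks reach the TTS service sooner, while whole sentences give the voice enough context for natural prosody.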


Technology behind the voices


We began by crafting the persona of each voice as if it were a real person who is friendly and optimistic about life, always eager to assist others and share intriguing or practical knowledge. The speaking style of the voice resembles a conversation with an acquaintance over a cup of tea, maintaining a natural and unexaggerated tone.

Furthermore, we continuously enhance our Text-to-Speech (TTS) modeling techniques to improve the quality of our AI voices. Our most recent projects, such as DelightfulTTS 2 and MuLanTTS, have significantly narrowed the quality gap between AI voices and professional human recordings, producing more natural and realistic voices than ever before. These advances are the foundation on which the new AI voices are built.



Get started


Microsoft offers over 400 neural voices covering more than 140 languages and locales. With these Text-to-Speech voices, you can quickly add read-aloud functionality for a more accessible app design or give a voice to chatbots to provide a richer conversational experience to your users. In addition, with the Custom Neural Voice capability, you can easily create a brand voice for your business.

