Getting Started with OpenAI Whisper on Azure

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

In March of 2024, OpenAI Whisper for Azure became generally available; you can read the announcement here. From the documentation, “The Whisper model is a speech to text model from OpenAI that you can use to transcribe (and translate) audio files. The model is trained on a large dataset of English audio and text. The model is optimized for transcribing audio files that contain speech in English. The model can also be used to transcribe audio files that contain speech in other languages. The output of the model is English text.” At this time, translation from 57 languages is supported. In this post I want to cover a few topics to help you make sense of the two flavors available to you: Whisper through the Azure AI Speech Service and Whisper through the Azure OpenAI Service.


To get started, the documentation referenced above includes a table of use cases for Whisper for Azure versus the Azure AI Speech model. The matrix at the following link gives a recommended view of when to choose each service.


One thing I will call out from the documentation is that the Whisper model in Azure OpenAI has limitations, so it helps to compare the two options side by side:

Whisper Model via Azure OpenAI Service might be best for:

  • Quickly transcribing audio files one at a time
  • Translating audio from other languages into English
  • Providing a prompt to the model to guide the output
  • Supported file formats: mp3, mp4, mpeg, mpga, m4a, wav, and webm

Whisper Model via Azure AI Speech might be best for:

  • Transcribing files larger than 25MB (up to 1GB). The file size limit for the Azure OpenAI Whisper model is 25 MB.
  • Transcribing large batches of audio files
  • Diarization to distinguish between the different speakers participating in the conversation. The Speech service provides information about which speaker was speaking a particular part of transcribed speech. The Whisper model via Azure OpenAI doesn't support diarization.
  • Word-level timestamps
  • Supported file formats: mp3, wav, and ogg
  • Customization of the Whisper base model to improve accuracy for your scenario (coming soon)
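
The two lists above can be turned into a rough pre-check. The following is a minimal sketch, not an official decision tool: the function name `candidate_services` and its parameters are hypothetical, and it encodes only the hard limits stated above (file formats, size limits, and diarization support). The softer "might be best for" recommendations, such as batch size or prompting, are left to your judgment.

```python
# Hard limits taken from the comparison above.
AOAI_FORMATS = {"mp3", "mp4", "mpeg", "mpga", "m4a", "wav", "webm"}
SPEECH_FORMATS = {"mp3", "wav", "ogg"}
AOAI_MAX_BYTES = 25 * 1024 * 1024   # 25 MB limit for Whisper via Azure OpenAI
SPEECH_MAX_BYTES = 1024 ** 3        # up to 1 GB via Azure AI Speech


def candidate_services(extension, size_bytes, needs_diarization=False):
    """Return the services whose stated limits allow this audio file.

    This is a hypothetical helper for illustration only.
    """
    ext = extension.lower().lstrip(".")
    services = []
    # Whisper via Azure OpenAI does not support diarization.
    if (ext in AOAI_FORMATS
            and size_bytes <= AOAI_MAX_BYTES
            and not needs_diarization):
        services.append("Azure OpenAI")
    if ext in SPEECH_FORMATS and size_bytes <= SPEECH_MAX_BYTES:
        services.append("Azure AI Speech")
    return services
```

For example, a 100 MB mp3 file passes only the Azure AI Speech check, while a small webm file passes only the Azure OpenAI check.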


Getting started with a Python Sample for OpenAI Whisper on Azure


We do have a sample doc for this; however, as many data scientists know, packages move fast and change often. As of this writing, the following code sample works with version 1.0 or later of the OpenAI Python package. The api_version is also correct as of this writing, and I will keep this blog updated with any necessary changes for future versions. I will collect some samples, publish them to a GitHub repository, and link them here in the near future. In the meantime, read the prerequisites here and get started. You will need an Azure subscription, access to the Azure OpenAI Service in your subscription, and a Whisper model deployment. I will not comment on region availability because it is constantly expanding, but you can keep an eye on this page to keep up with it.


Once you have created your deployment, copy the endpoint URL and one of the two keys from the Resource Management section of your Azure OpenAI resource.



This code sample will read in an audio file from local disk. A variety of audio samples can be found here in the Azure Speech SDK GitHub repo.
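
Here is a minimal sketch of such a sample, assuming version 1.0 or later of the openai package. The helper names `ensure_within_limit` and `translate_audio`, the environment variable names, and the default `api_version` value are my assumptions for illustration, so check them against the current documentation. The `model` argument takes the name you gave your Whisper deployment.

```python
import os
from pathlib import Path

MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB limit for Whisper via Azure OpenAI


def ensure_within_limit(num_bytes):
    """Raise ValueError if a file of this size exceeds the upload limit."""
    if num_bytes > MAX_UPLOAD_BYTES:
        raise ValueError(
            f"File is {num_bytes} bytes; the limit is {MAX_UPLOAD_BYTES} bytes."
        )


def translate_audio(audio_path, deployment, api_version="2024-02-01"):
    """Send a local audio file to a Whisper deployment and return English text.

    The api_version default is an assumption; use the version that the
    documentation currently lists for Whisper.
    """
    path = Path(audio_path)
    ensure_within_limit(path.stat().st_size)

    # Imported here so the size check works without the package installed.
    from openai import AzureOpenAI  # requires openai >= 1.0

    client = AzureOpenAI(
        # Environment variable names are placeholders chosen for this sketch.
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
        api_version=api_version,
    )
    with path.open("rb") as audio_file:
        # translations.create returns a Translation object with a .text field.
        result = client.audio.translations.create(
            model=deployment,
            file=audio_file,
        )
    return result.text
```

With the two environment variables set, calling `translate_audio("sample.wav", deployment="my-whisper")` would return the English text, where both the file name and the deployment name are placeholders. To transcribe without translating, the same client also exposes `client.audio.transcriptions.create`.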




The result you get from one of the samples will look something like this:


Translation(text='As soon as the test is completed, which is displayed successfully by the status change, you will receive a VIN number for both models in your test. Click on the test name to show the page with the test details. This detail page lists all statements in your data set and shows the recognition results of the two models next to the transcription from the transmitted data set. For a simpler examination of the opposite, you can activate or deactivate various error types such as additions, deletions and substitutions.')


Using Azure AI Services Speech


The alternative option is to use the OpenAI Whisper model through the Azure AI Speech Service. The Azure AI Speech Service offers a lot of capability: captioning, audio content creation, and transcription, as well as real-time speech to text and text to speech.


If you have been using the Azure AI Speech Service, you likely already have much of the code written to take advantage of the Whisper model. There is a migration guide to move you from REST API version 3.1 to version 3.2, which supports the Whisper model. You should provide multiple files per request or point to an Azure Blob Storage container with the audio files to transcribe. The batch transcription service can handle a large number of submitted transcriptions. The service transcribes the files concurrently, which reduces the turnaround time.


If you are using this as part of a batch transcription process, you can find the documentation here. The most important step in making sure you are using the Whisper model is to use version 3.2 of the API, but keep in mind region availability, which is linked here.
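
As a rough illustration of a batch request, the sketch below builds (but does not send) a create-transcription call against version 3.2 of the Speech to text REST API using only the standard library. The function name `build_transcription_request` is hypothetical, and the `whisper_model_id` argument is a placeholder: you would need to look up the ID of a Whisper base model available in your region. Treat the endpoint shape and body fields as my understanding of the API and verify them against the batch transcription documentation.

```python
import json
import urllib.request


def build_transcription_request(region, speech_key, content_urls,
                                whisper_model_id, locale="en-US",
                                display_name="Whisper batch transcription"):
    """Build a POST request that creates a batch transcription job.

    whisper_model_id is a placeholder for a Whisper base model ID that
    you look up for your own Speech resource and region.
    """
    base = f"https://{region}.api.cognitive.microsoft.com/speechtotext/v3.2"
    body = {
        "displayName": display_name,
        "locale": locale,
        # Several audio files can be submitted in a single request.
        "contentUrls": list(content_urls),
        # Pointing at a Whisper base model selects Whisper for the job.
        "model": {"self": f"{base}/models/base/{whisper_model_id}"},
    }
    return urllib.request.Request(
        url=f"{base}/transcriptions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Ocp-Apim-Subscription-Key": speech_key,
            "Content-Type": "application/json",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would submit the job. To transcribe a whole Azure Blob Storage container, the documentation describes a `contentContainerUrl` field that is used in place of `contentUrls`.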


I hope this article has helped you determine which service is right for you. Keep in mind that all Azure AI services move fast, so keep an eye on the docs linked in this post as well as the constantly expanding Microsoft Learn site.

