Create a Simple Speech REST API with Azure AI Speech Services


Explore the world of speech recognition and speech synthesis with Azure AI Services. In this tutorial, you will learn how to create your own simple Speech REST API using Azure AI Speech together with either Azure OpenAI Service or the OpenAI API. Along the way you will experience the power of speech synthesis on Azure and see how Azure AI Services can be combined to build powerful products.

Overview of the Demo REST API

The Speech API we are going to build exposes three simple endpoints, each handling a GET request, that together chain the Azure AI Speech service and the OpenAI API. First, a request containing your recorded voice data is sent to the Azure AI Speech service, which returns the text of what you said. That text is then sent to the OpenAI Chat Completions API, which returns a reply. Finally, the reply is sent back to the Azure AI Speech service, which synthesizes it into a voice response using a neural voice for the accent and region you choose.
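
Concretely, that flow maps onto the three endpoints we will create later in the tutorial:

  1. GET /api/voicespeech transcribes a local .wav recording with Azure AI Speech (speech to text).
  2. GET /api/completetext sends that transcription to the OpenAI Chat Completions API and returns the reply.
  3. GET /api/talk synthesizes the reply into a .wav file with Azure AI Speech (text to speech).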

Prerequisites

Create an Azure AI Speech Resource

First things first, we will need to create an Azure Speech resource, from which we will be able to get our API keys. To do this, head over to your browser and open the Azure portal.

  1. In the search bar, type in Speech Service. Once the page loads, press the Create button. You will then be redirected to the Create Speech Services page.

[Image: the Create Speech Services page in the Azure portal]

  2. Choose a resource group or create a new one; you can choose another region or leave the default. Type in a name for your service, choose your pricing tier and click Review + create. Wait for the resource to deploy successfully, then click Go to resource.
  3. Now we are going to fetch our API keys. To do this, head over to the menu on the left and click Keys and Endpoint.

 

[Image: the Keys and Endpoint pane of the Speech resource]

  4. Copy either of the API keys and paste it in your notepad, or keep note of it.

[Image: the API keys of the Speech resource]

Get Your OpenAI API Keys

In case you don’t have your OpenAI API keys yet, you can head over to your account portal and generate them. Note that to use the service you must have enough credits in your account. Alternatively, if you have access to Azure OpenAI Service, you can follow this guide on how to get your API keys after creating your own instance of the service.

Installing necessary libraries

This tutorial focuses on JavaScript and Node.js, so the following libraries will be installed using the Node package manager (npm). If you would like to explore other languages, there is a set of links in the Learn More section at the end of the tutorial.

  1. Head over to your terminal in VS Code after creating a new project folder. You can choose to clone this repository https://github.com/tiprock-network/azureSpeechAPI or code along.
  2. Let’s start by creating our necessary folders. In your VS Code editor create the folders controllers, public, routes and speech_files. In the public folder add a new sub-folder and name it synthesized.
  3. Now in your terminal, run these commands line by line to initialize our new project and install the necessary libraries.

     npm init -y
     npm i express dotenv
     npm i request openai
     npm i microsoft-cognitiveservices-speech-sdk
     npm i nodemon -D

 

The above installs dotenv, a library we will need for our environment variables, the OpenAI API library and microsoft-cognitiveservices-speech-sdk; nodemon is installed so that we avoid restarting our Node Express server manually.

  4. If you cloned the repository, you only need to type the command npm install and all of these dependencies will be installed.
  5. Let’s continue coding along. Head over to the package.json file that was created and change the scripts section to the code below.

     "scripts": {
         "start": "node app",
         "dev": "nodemon app"
     },

  6. Now create the app.js and .env files in the root folder.

 

Create your Voice Data

Open your favorite voice recording app; in this case I am using Audacity. Record a few simple, short voice messages of between 5 and 20 seconds, saying something you want feedback on. Save and export your .wav files to the speech_files folder. Give the recordings short, relevant names according to what you recorded, e.g. greeting, food, etc.

In case you cloned the repository, there are a few pre-recorded .wav files that you can use in this project.
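
When exporting from Audacity, a 16 kHz, 16-bit, mono PCM WAV export is a safe choice; the Speech SDK accepts other WAV sample rates as well, but uncompressed mono WAV is the format most commonly recommended for speech recognition.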

Create REST API using Express.js

  1. Open app.js and add the following code to create an Express server.

     const express = require('express')
     const dotenv = require('dotenv')
     dotenv.config()

     const app = express()
     app.use(express.static('public'))
     app.use('/api', require('./routes/speechRoute'))

     const PORT = process.env.PORT || 5005
     app.listen(PORT, () => console.log(`Service running on port - ${PORT}...`))

  2. Open .env and add the following variables. Remember to add the region that you chose, as shown below.

     PORT=5001
     API_KEY_SPEECH=add speech service api key here
     API_SPEECH_REGION=eastus
     API_KEY_OPENAI=add openai api key here

  3. Create the routes by heading over to the routes folder and creating the file speechRoute.js in it. Add the following code to create the routes, each of which handles a GET request on a specific endpoint.

     const express = require('express')
     const router = express.Router()

     const speech2text = require('../controllers/speechtotextController')
     const textcompletion = require('../controllers/chatcompletionController')
     const text2speech = require('../controllers/texttospeechController')

     router.get('/voicespeech', speech2text)
     router.get('/completetext', textcompletion)
     router.get('/talk', text2speech)

     module.exports = router

  4. In the controllers folder create three files named chatcompletionController.js, speechtotextController.js and texttospeechController.js, then add the following code to each, respectively.

 

Code for chat completion API with OpenAI

 

const dotenv = require('dotenv')
dotenv.config()
const OpenAI = require('openai')
const request = require('request')

const openai = new OpenAI({ apiKey: process.env.API_KEY_OPENAI })

const chatcompletion = async (req, res) => {
    request('http://localhost:5001/api/voicespeech', async (error, response, body) => {
        const prompt = JSON.parse(body).text
        const chatCompletion = await openai.chat.completions.create({
            messages: [{ role: 'user', content: `${prompt}` }],
            model: 'gpt-3.5-turbo',
        })
        res.status(200).json({ response: chatCompletion.choices[0].message.content })
    })
}

module.exports = chatcompletion
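
Note how this controller gets its prompt by calling the /voicespeech endpoint of the same server over HTTP, and the text-to-speech controller further below does the same with /completetext; that is what chains the three endpoints together. For a quick demo this keeps each controller independent, though in a larger app you would typically call the controller functions directly instead of going back through localhost.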

 

 

Code for Speech to Text using Azure AI Speech Service Node.js Library

Note that you can choose from a wide variety of dialects and accents; for example, for US English you would use en-US. The example below uses English (Kenya), en-KE, to recognize a Kenyan accent. To see a full list of supported languages, check out Language and voice support for the Speech service.

 

 

const dotenv = require('dotenv')
dotenv.config()
const fs = require('fs')
const sdk = require('microsoft-cognitiveservices-speech-sdk')

const getSpeechText = async (req, res) => {
    // Uses the API_KEY_SPEECH and API_SPEECH_REGION variables defined in .env
    const speechConfig = sdk.SpeechConfig.fromSubscription(process.env.API_KEY_SPEECH, process.env.API_SPEECH_REGION)
    speechConfig.speechRecognitionLanguage = 'en-KE'

    let audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync('./speech_files/fastfoodNairobi.wav'))
    let speechRecognizer = new sdk.SpeechRecognizer(speechConfig, audioConfig)

    // recognizeOnceAsync takes success and error callbacks, so wrap it in a Promise
    const result = await new Promise((resolve, reject) => {
        speechRecognizer.recognizeOnceAsync(resolve, reject)
    })

    if (result.reason == sdk.ResultReason.RecognizedSpeech)
        res.status(200).json({ text: `${result.text}` })
    else
        res.status(400).json({ text: `Sorry but couldn't get what you said.` })

    speechRecognizer.close()
}

module.exports = getSpeechText
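
Keep in mind that recognizeOnceAsync performs single-shot recognition: it returns after the first utterance rather than transcribing an entire long file, which is why short 5 to 20 second clips were suggested above. For longer recordings you would switch to continuous recognition with startContinuousRecognitionAsync.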

 

 

Code for Text to Speech using Azure AI Speech Service Node.js Library

 

const dotenv = require('dotenv')
dotenv.config()
const sdk = require('microsoft-cognitiveservices-speech-sdk')
const request = require('request')

const audio_file = './public/synthesized/speechoutput.wav'

const getAudioResponse = async (req, res) => {
    const speechConfig = sdk.SpeechConfig.fromSubscription(process.env.API_KEY_SPEECH, process.env.API_SPEECH_REGION)
    const audioConfig = sdk.AudioConfig.fromAudioFileOutput(audio_file)

    // voice in preferred dialect
    speechConfig.speechSynthesisVoiceName = 'en-KE-AsiliaNeural'

    // declared with let so it can be released after the callback runs
    let synthesizer = new sdk.SpeechSynthesizer(speechConfig, audioConfig)

    request('http://localhost:5001/api/completetext', async (error, response, body) => {
        const prompt = JSON.parse(body).response

        // start the synthesizer and await results
        synthesizer.speakTextAsync(`${prompt}`, (result) => {
            if (result.reason == sdk.ResultReason.SynthesizingAudioCompleted)
                res.status(200).json({ message: 'Synthesis finished successfully' })
            else
                res.status(400).json({ message: { Error: result.errorDetails } })
            synthesizer.close()
            synthesizer = null
        })
        console.log('Now synthesizing to: ' + audio_file)
    })
}

module.exports = getAudioResponse
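
If you want finer control over how the reply is spoken, the same synthesizer can speak SSML instead of plain text. Below is a minimal sketch (not part of the tutorial code) that wraps the prompt in an SSML document so you can adjust the speaking rate and pitch; the prosody values here are just example numbers.

// Minimal SSML sketch, assuming `synthesizer`, `sdk` and `prompt` are the same
// variables used in texttospeechController.js above. The prosody values are examples.
const ssml = `
<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-KE">
  <voice name="en-KE-AsiliaNeural">
    <prosody rate="-10%" pitch="+5%">${prompt}</prosody>
  </voice>
</speak>`

synthesizer.speakSsmlAsync(
    ssml,
    (result) => {
        if (result.reason == sdk.ResultReason.SynthesizingAudioCompleted)
            console.log('SSML synthesis finished successfully')
        else
            console.log('SSML synthesis error: ' + result.errorDetails)
        synthesizer.close()
    },
    (error) => {
        console.log(error)
        synthesizer.close()
    }
)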

 

 

Test Your API

Let’s head over to the terminal and start our app in development mode by running the command below.

 

npm run dev

 

After the “Service running on port - 5001...” message is shown, head over to the Postman extension in VS Code. Click the Postman icon on the left pane as shown below and let’s test our different endpoints. To test the endpoints you will have to create a new workspace, then a new collection, and then add a new request. Note that you will be prompted to log in to your account whether you are using the desktop client or the Visual Studio Code extension.

[Image: testing the /api/voicespeech endpoint in the Postman extension in VS Code]

 

Notice in the image that my endpoint, http://localhost:5001/api/voicespeech, returns the text of my speech according to the .wav file whose path I set in the speechtotextController.js file.

 

let audioConfig = sdk.AudioConfig.fromWavFileInput(fs.readFileSync("./speech_files/fastfoodNairobi.wav"));

 

You can change this according to the different .wav files you recorded or, in case you cloned the repository, change the path to point to the other audio files.
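
As a small, purely hypothetical variation (it is not part of the tutorial or the repository), you could read the file name from a query parameter instead of hard-coding the path in speechtotextController.js, so different recordings can be tried without editing code:

// Hypothetical tweak: pick the recording from a ?file= query parameter,
// falling back to the original hard-coded recording. The name is sanitized
// so only files inside speech_files can be read.
const fileName = (req.query.file || 'fastfoodNairobi').replace(/[^\w-]/g, '')
let audioConfig = sdk.AudioConfig.fromWavFileInput(
    fs.readFileSync(`./speech_files/${fileName}.wav`)
)

With a change like that in place, a request such as http://localhost:5001/api/voicespeech?file=greeting would transcribe greeting.wav instead.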

 

Getting Chat Completion from OpenAI API

The next endpoint, http://localhost:5001/api/completetext, gets a chat completion, or chat response, from the OpenAI API.

[Image: the chat completion response returned by the /api/completetext endpoint in Postman]

 

To get completions and chat completions from Azure OpenAI Service, read the documentation on the Azure OpenAI library for JavaScript.
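
For reference, a minimal sketch of the same chat completion call made through the Azure OpenAI library for JavaScript (the @azure/openai package) might look like the following. The environment variable names and the deployment name used here are placeholders, not values from this tutorial; replace them with the ones from your own Azure OpenAI resource.

const { OpenAIClient, AzureKeyCredential } = require('@azure/openai')

// Placeholder endpoint, key and deployment name; replace with your own values.
const client = new OpenAIClient(
    process.env.AZURE_OPENAI_ENDPOINT,
    new AzureKeyCredential(process.env.AZURE_OPENAI_KEY)
)

const azureChatCompletion = async (prompt) => {
    const result = await client.getChatCompletions('gpt-35-turbo-deployment', [
        { role: 'user', content: prompt }
    ])
    return result.choices[0].message.content
}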

 

Get Voice Data Generated by Azure AI Speech Synthesis

Run a similar GET request against this endpoint, http://localhost:5001/api/talk. Upon completion of the request with a response code of 200 OK, the following response body should be returned.

 

{ "message": "Synthesis finished successfully" }

 

 

Your VS Code terminal should show output like the one below.

[Image: VS Code terminal output showing "Now synthesizing to: ./public/synthesized/speechoutput.wav"]

 

Open the new speechoutput.wav file in the synthesized folder inside public and listen to the speech generated from the chat completion.
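
Because app.js serves the public folder with express.static, you should also be able to listen to the output by opening http://localhost:5001/synthesized/speechoutput.wav in your browser while the server is running.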

 

Congratulations, you have just created your very own simple Speech REST API. You can now improve it further and carry out a whole range of tasks with this awesome service.


Learn More

Explore how to do the same in other programming languages.

Azure OpenAI Service REST API reference.

Learn how to work with the GPT-35-Turbo and GPT-4 models.

Get started using GPT-35-Turbo and GPT-4 with Azure OpenAI Service.
