Azure OpenAI Architecture Patterns and implementation steps

Posted by

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.



A comprehensive overview of the most frequently used and discussed architecture patterns among our customers in various domains.


1) AOAI with Azure Frontdoor for loadbalancing

  • Use Azure Front Door for cross region global load balancing of requests across multiple Azure OpenAI endpoints.
  • In this architecture below Azure Front Door routes requests to multiple instances of Azure OpenAI hosted on Private Endpoints locked to a VNET.
    AFD uses health check on the path /status-0123456789abcdef to determine the health and proximity of each Azure OpenAI endpoints.
  • The deployment name should be the same if you are load balancing, since that would be in the URL path.

Architecture diagram:


Key Highlights:

  • Global load balancing across multiple Azure OpenAI endpoints in multiple regions with intelligent health probe monitoring.
  • AFD provides scale out and improved performance to your AOAI endpoints using Microsoft’s global cloud CDN and WAN.
  • Unified static and dynamic delivery offered in a single tier of AFD to accelerate and scale through caching, SSL offload, and layer 3-4 DDoS protection.
  • Protection against OWASP top 10 attacks, Common Vulnerabilities and Exposures (CVEs) and malicious bot attack through AFD WAF. Refer more here:
  • Define your own custom domain with AFD and AFD provided autorotation managed SSL certificates.
  • Azure OpenAI service is not directly to client applications and public access disabled through Private Endpoints. AFD can connect to your AOAI origin using Private Endpoint embracing Zero Trust access model. 

If you set equal weights for all origins and a high latency sensitivity in Azure Front Door, it will consider all origins that have a latency within the specified range of the fastest origin as eligible for routing traffic. So, all the origins should receive approximately equal amounts of traffic, provided their latencies are within the specified range.


However, it’s important to note that this doesn’t guarantee a perfect round-robin distribution. The actual distribution can vary based on factors like network conditions and changes in latency. If you need strict round-robin load balancing, you might need to consider other services or features that specifically support this method.






Use Postman for testing:

Request 1:


Request 2:



For perfect round robin distribution, you can use Azure Application Gateway with the same health check endpoints.


2) AOAI with APIM


Architecture diagram:



Key highlights:

  • You can use APIM to manage the access, usage, and billing of your Azure OpenAI APIs, and apply policies such as authentication, caching, rate limiting, and transformation.
  • You can monitor and analyze the performance and health of your Azure OpenAI APIs, and troubleshoot any issues using APIM’s built-in tools and integrations with Azure Monitor and Application Insights.
  • You can publish your Azure OpenAI APIs to a developer portal, where you can provide documentation, samples, and interactive testing for your consumers.
  • You can use APIM to create composite APIs that can orchestrate multiple Azure OpenAI models or integrate with other Azure services and external APIs.

a) Round Robin load balancing with Retry logic


<policies> <inbound> <base /> <cache-lookup-value key="backend-counter" variable-name="backend-counter" /> <choose> <when condition="@(!context.Variables.ContainsKey("backend-counter"))"> <set-variable name="backend-counter" value="0" /> <cache-store-value key="backend-counter" value="0" duration="100" /> </when> </choose> <choose> <when condition="@(int.Parse((string)context.Variables["backend-counter"]) == 0)"> <set-backend-service base-url="" /> <set-variable name="backend-counter" value="1" /> <cache-store-value key="backend-counter" value="1" duration="100" /> </when> <when condition="@(int.Parse((string)context.Variables["backend-counter"]) == 1)"> <set-backend-service base-url="" /> <set-variable name="backend-counter" value="0" /> <cache-store-value key="backend-counter" value="0" duration="100" /> </when> </choose> </inbound> <backend> <retry condition="@(context.Response.StatusCode >= 500 || context.Response.StatusCode >= 400)" count="6" interval="10" first-fast-retry="true"> <choose> <when condition="@((context.Response.StatusCode >= 500 || context.Response.StatusCode >= 400) && (int.Parse((string)context.Variables["backend-counter"])) == 0)"> <set-backend-service base-url="" /> <set-variable name="backend-counter" value="1" /> <cache-store-value key="backend-counter" value="1" duration="100" /> </when> <when condition="@((context.Response.StatusCode >= 500 || context.Response.StatusCode >= 400) && (int.Parse((string)context.Variables["backend-counter"])) == 1)"> <set-backend-service base-url="" /> <set-variable name="backend-counter" value="0" /> <cache-store-value key="backend-counter" value="0" duration="100" /> </when> </choose> <forward-request buffer-request-body="true" /> </retry> </backend> <outbound> <base /> </outbound> <on-error> <base /> </on-error> </policies>



Testing on round robin load balancing using APIM :






b) AAD authentication from APIM to Azure OpenAI


Step 1 – Enable Managed Identity in APIM


Step 2 – Provide necessary RBAC:

In the IAM of Azure OpenAI service add the OpenAI user role for the APIM Managed Identity (Managed Identity will have the same name of APIM).






 Step 3 - Add the Managed Identity policy in APIM:


<policies> <inbound> <base /> <authentication-managed-identity resource="" /> </inbound> <backend> <base /> </backend> <outbound> <base /> </outbound> <on-error> <base /> </on-error> </policies>



Testing for Managed Identity Policy:



c) Policy to extract callerID (Subject from APIM)


For extracting other details from JWT, refer - 

Azure API Management policy expressions | Microsoft Learn



<validate-jwt header-name="Authorization" failed-validation-httpcode="401" failed-validation-error-message="Token is invalid" output-token-variable-name="jwt-token"> <issuers> <issuer>{{myIssuer}}</issuer> </issuers> </validate-jwt> <!-- Extract the subject and add it to a header --> <set-header name="caller-objectid" exists-action="override"> <value>@(((Jwt)context.Variables["jwt-token"]).Subject)</value> </set-header>



d) Logging and Monitoring using APIM:


Use Azure monitor and APIM to enable enhanced logging and monitoring of the published AOAI APIs. Learn more - Tutorial - Monitor published APIs in Azure API Management | Microsoft Learn 







Sample log queries for prompt completion:


ApiManagementGatewayLogs | extend model = tostring(parse_json(BackendResponseBody)['model']) | extend prompttokens = parse_json(parse_json(BackendResponseBody)['usage'])['prompt_tokens'] | extend completiontokens = parse_json(parse_json(BackendResponseBody)['usage'])['completion_tokens'] | extend responsetext = (parse_json(parse_json(BackendResponseBody)['choices'])[0]['message']) | extend prompttext = (parse_json(RequestBody)['messages'])


For more queries refer to documentation here: Implement logging and monitoring for Azure OpenAI large language models - Azure Architecture Center | Microsoft Learn


e) For advanced logging, more than 8192 bytes refer to the documentation here: openai-python-enterprise-logging/advanced-logging at main · Azure-Samples/openai-python-enterprise-logging · GitHub


f) For Budgets and cost management using APIM refer this blog - Azure Budgets and Azure OpenAI Cost Management - Microsoft Community Hub



3) AOAI with Frontdoor and APIM multi-region deployment for a full-fledged multi-region availability

Refer to the DR documentation - Deploy Azure API Management instance to multiple Azure regions - Azure API Management | Microsoft Learn



a. In Frontdoor give both APIM regional gateway URLs as backend Origins, example &

b. Configure the API Management regional status endpoints - e.g.

c. Sample policy to be used to make the regional gateways route to respective backends.


<policies> <inbound> <base /> <choose> <when condition="@("West Europe".Equals(context.Deployment.Region, StringComparison.OrdinalIgnoreCase))"> <set-backend-service base-url="" /> </when> <when condition="@("Japan East".Equals(context.Deployment.Region, StringComparison.OrdinalIgnoreCase))"> <set-backend-service base-url="" /> </when> <otherwise> <set-backend-service base-url="" /> </otherwise> </choose> </inbound> <backend> <base /> </backend> <outbound> <base /> </outbound> <on-error> <base /> </on-error> </policies>


In conclusion, this article will be a starting point to implement scalable architecture patterns using Azure OpenAI models with other Azure services. As we continue to explore the potential of AI, we’ll continue to update our patterns and documents, guiding us towards smarter and more efficient systems.

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.