Scaling strategies for large scale Azure Cognitive Services deployments

Posted by

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs - .

While utilizing Cognitive Services, it's important to understand that each resource comes with a pre-configured static call rate measured in transactions per second (TPS). This call rate acts as a limit on the number of concurrent requests that clients can initiate towards the backend service within a specified time frame. By limiting the number of concurrent calls for the customers, the service provider can allocate resources efficiently and maintain the overall stability, reliability service. This also plays a role in the security aspect of the service, so the service itself can protect customers from malicious behavior attacking their service Additionally, this call rate allows customers to estimate their usage and manage costs effectively by preventing unexpected spikes in traffic that can result in unexpected charges. 


For highly scalable systems it is important that the system scales independently within these constraints without affecting the customer experience. In this article, we'll delve deeper into what this means for the developers and also look at strategies to scale these services within these constraints to maximize the potential of Cognitive Services in large deployments.


Below strategy consisting of four steps can be employed to minimize issues related to throttling and ensure that application scales seamlessly




1. Understanding application peak load
To effectively design an app, it's essential to have a comprehensive understanding of the traffic patterns it will encounter along with the business teams, both in terms of regular usage and peak load. In some cases, it is important to break down the app into different modules that use different Cognitive Services and understand its usage patterns. This knowledge enables architects to make informed decisions about infrastructure and resource allocation, ensuring that the app can handle the expected traffic without performance degradation or downtime.


E.g., For a multilingual chatbot also supporting speech below table can be used to define the load –


~Estimated Traffic   in (%)
Avg. TPS
Peak Load TPS
Expected Latency (Sec)
User enters non-English text 20 Translator 60 300 3
Users uses speech interface 10 Speech 30 120
Frequently asked questions from customers 40 QnA Maker 30 100
LUIS 60 Language Understanding 25 200
Custom API Call 30 Custom Microservice 20 200


2. Understanding the limits for Cognitive Services

Below are the limits for every cognitive service: -

Note – This may vary based on your configuration and knowledge base.
For some services like Custom Question Answering, (QnA v2.0), it also requires provisioning Cognitive Search. So, the throughput depends on the SKU selection. E.g., An Azure cognitive search with configuration S3HD,6 replicas, 1 partition per instance can provide throughput upto 800-1000 Queries per Second.


3. Scaling strategies for Cognitive Services

Following section represents the scaling strategies for cognitive services.


1. New Support Request

Typically, different Cognitive Service resources have a default rate limit. If you need to increase the limit, submit a ticket by following the New Support Request link on your resource's page in the Azure portal. Remember to include a business justification in the support request. 

Note - Trade-off is between the scale and the latency, if the latency increase is fine for a high traffic situation, then the rate shouldn't be pushed to the limit.


2. Cognitive Services autoscale feature

The autoscale feature will automatically increase/decrease a customer's resource's rate limits based on near-real-time resource usage metrics and backend service capacity metrics. Autoscale will check the available capacity projection section to see whether the current capacity can accommodate a rate limit increase and respond within five minutes. If the available capacity is enough for an increase, autoscale will gradually increase the rate limit cap of your resource else autoscale feature will wait five minutes and check again. If you continue to call your resource at a high rate that results in more 429 throttling, your TPS rate will continue to increase over time.


This feature is disabled by default for every new resource using Azure Portal :-



 Using CLI :-


az resource update --namespace Microsoft.CognitiveServices --
resource-type accounts --set
properties.dynamicThrottlingEnabled=true --resource-group
{resource-group-name} --name {resource-name}


Please refer link to check for cognitive services that support the autoscale feature.


  • You could end up paying significantly higher bill if there is an error in client application repeatedly calling the service at a high frequency or any misuse. 
  • There is no guarantee that autoscale will always be invoked as the ability to trigger an autoscale is contingent on the availability of capacity. If there isn't enough available capacity to accommodate an increase, 429 error will be returned and autoscale feature will be re-evaluated after 5 minutes.
  • Autoscale will gradually increase the rate limit cap of your resource. If your application triggers a very high sudden spike, you may get 429 error for rate limit excess as TPS rate will continue to increase over time till it reaches the maximum rate (up to 1000 TPS) currently available at that time for that resource.

This strategy is advisable for medium scale workloads with gradual increase in traffic which can be predicted using thresholding approaches. We strongly advise testing your development and client updates against a resource that has a fixed rate limit before making use of the autoscale feature.


3. Using Azure Cognitive Services containers

Azure Cognitive Services provides several Docker containers that let you use the same APIs that are available in Azure in a containerized environment. Using these containers gives you the flexibility to bring Cognitive Services closer to your data for compliance, security, or other operational reasons. Container support is currently only available for a subset of Azure Cognitive Services. Please refer to Azure cognitive Services Container for more details. Azure Cognitive Services provide customers the ability to scale for high throughput and low latency requirements by enabling Cognitive Services to run physically close to their application logic and data. Containers don't cap transactions per second (TPS) and can be made to scale both up and out to handle demand if you provide the necessary hardware resources.


  • Running containers hardware and infrastructure resources that you will have to manage and maintain. This can add to your costs and operational overhead. 
  • The responsibility for maintaining and supporting the containers falls on you, and it can be challenging to keep up with version upgrades, updates, patches, and bug fixes. 
  • Ensuring security and compliance can be more challenging when running containers.
  • Customers are charged based on consumption so you could end up paying significantly higher bill if there is an error in client application repeatedly calling the service at a high frequency or any misuse. So, this needs to be taken care.

This strategy is advisable for large scale workloads and is more resilient to spikes, but not the ultimate solution to this type of traffic. Currently the container support is available for only a subset of Azure Cognitive Services, so it is important to validate the availability.


4. Use Load Balancer to scale Cognitive Service Instances based on load

In this strategy, different Instances are added based on load pattern & a load balancer (e.g., Azure Traffic Manager, Application Gateway) is used to load balance different instance of cognitive services.




In the above example, S1 SKU for translator service has limits of 40 million characters per hour but to get 120 TPS with 15 word each request and 8 characters per word 2 S1 instances are needed else S3 instance needs to be selected that can support 120 million characters may not be cost effective solution. One instance can be removed during off-peak hour.


This whole process to add and remove instances and configure the Traffic Manager can be automated. App will get the instance based on the Load balancing policy of Traffic Manager. Once the load patterns are understood an algorithm can be created to add/remove instances based on load patterns.


This strategy is much more reliable than the out of box auto scale which depends on regions capacity availability. This also help with predictable cost management as each deployed instance have their rate limits. Create an additional service resource in different regions and distribute the workload among them will also help in improving performance and high availability. Also sometime,
multiple service resources in the same region may not affect the performance, because all resources will be served by the same backend cluster.


You can also have your own traffic management (even at client side) that is based on rules or external to the service to distribute the load in a meaningful manner either per user, or per session or a round robin list that is distributed to the service to pick out it.
It is recommended to implement retry logic and backoff logic in your application.


  • This may need custom development for add/remove instance automation, load balancing entry and algorithm to invoke the same based on load patterns.
  • Additional resources like Traffic Manager etc. needs to be managed. Also, Traffic manager can be the bottleneck if not handled well.
  • This is only suitable for predictable load. This may not be suitable if there is a sudden spike in request.

Please find below table comparing above strategies under different type of loads.


Type of load Rate increase Autoscale Containers
Load Balancer (Traffic Manager)
Gradual increase Planned on the highest rate for a given latency
limit and will be a consistent rate through
A suitable solution that doesn’t entail a lot of
overhead of management. With the
right thresholding, it would be able to handle
scaling appropriately with the necessary lead and hysteresis 
Would allow for a bigger
limit compared to the rate
increase but would be with the extra cost of the
dedicated hardware. So
long as the load is within
the boundaries of Azure as well as issues of BCDR,
reliability and security
Will handle the distribution across
geographical locations and services, however
the limits would be mostly in the traffic
manager if it is beyond the boundaries of the
Spike increase, without
Would handle within the limits of the spike. Provisioning beyond the regular average service would be an inefficient to the service backend and
can be rejected.
Would not be very responsive and can cause 429s unexpectedly Would be more resilient to spike since the limits defined can be higher, but with the extra cost of hosting, BCDR, reliability etc. Would leverage the accumulated TPS in the resources, but the traffic manager would be the most critical aspect.
Pre-determined spike increase Hard to justify and keep changing Can be planned by provisioning auto scale policies that are pre-planned Will be within the limits of the container Can be provisioned for the deployment of more resources to accommodate the spike
Recurring increase Within the bounds of the
Efficient if the increase is
Will be within the limits of
the container
Can be provisioned with distribution of traffic



  • There is a trade-off between the limits of the service wrt rate and the latency of the service. Proper planning of traffic rates can help absorb short bursts of high traffic with minimal latency impact. However, if there are prolonged spikes or consistent increases in traffic, the system may not be able to handle the increased load, which can lead to performance issues and service disruptions.
  • Retries and backoff policies can be ways to handle spiky traffic, especially that is not prolonged and not planned. Embedding systematic retries would help in the management of cost and security while not affecting the service experience to a great extent. Below is the Java code snippet for implementing exponential backoff strategy where the initial backoff time is multiplied by a power of 2 for each retry.


import java.util.Random;

public class BackoffExample {
    private static final int MAX_RETRIES = 5;
    private static final int INITIAL_BACKOFF_TIME = 1000; // in milliseconds

    public static void main(String[] args) {
        int retries = 0;
        while (retries <= MAX_RETRIES) {
            try {
                // Make REST API call
                String response = makeApiCall();
                System.out.println("API call succeeded with response: " + response);
                break; // Exit loop if call is successful
            } catch (Exception e) {
                System.out.println("API call failed with error: " + e.getMessage());
                if (retries == MAX_RETRIES) {
                    System.out.println("Max retries exceeded. Giving up.");
                int backoffTime = INITIAL_BACKOFF_TIME * (int) Math.pow(2, retries);
                System.out.println("Retrying in " + backoffTime + " milliseconds.");
                try {
                    Thread.sleep(backoffTime + new Random().nextInt(1000)); // Add random jitter
                } catch (InterruptedException ie) {
                    System.out.println("Thread interrupted while sleeping.");

    private static String makeApiCall() throws Exception {
        // Make REST API call here
        // ...
        // Return response
        return "Sample Response";


  • Architectural considerations: The architecture can dictate the manner with which scaling should happen. Gating services that represent the entry point of all services (or aggregation of all services) would need to be considered for scaling, and with that limit other services that are less critical or less contributing to the load would then be adjusted accordingly.

Hybrid approaches of the above-mentioned elements can be the most effective. Provisioning a load balancer along with backoff and autoscaling for the expected traffic patterns can all be deployed to make sure the service is running smoothly in different conditions. Customers should also manage the load/latency trade-off and accept a larger latency with huge loads and not provision unnecessary services that would increase the development and maintainability of the service.


4. Performance Testing Apps before productionizing

To ensure that cognitive services and the entire workload are appropriately sized and scaled, it's critical to conduct load testing using a tool like JMeter. Load testing enables you to simulate a high volume of traffic and requests to assess the system's ability to handle such demands. Without load testing, you may not accurately determine the capacity requirements for your workload, which could lead to performance issues, increased costs, or other problems. Therefore, it's recommended to perform load testing regularly to optimize your use of cognitive services and ensure that your workload can handle anticipated traffic.


The parameters for performance testing may vary depending on the workload being tested. For example, in the case of a Virtual Assistant, the focus would be on metrics such as response time, the length of responses in terms of word count, and the number of steps involved in resolving a user's query etc. For a virtual assistant’s performance testing, its 2 main components will be considered -The bot service & Channel tests. The bot service can be tested on Directline. Please inform and work with the Microsoft support team before testing on Directline since load test on Directline service can be from limited set of IP’s and can be restricted. The important aspect that needs to be considered is the Peak-to-Average ratio, and the planning needs to mostly target the average and find ways that are suitable to handling peaks. Adding reasonable safeguards beyond these limits is also advisable.


By following the best practices outlined in this paper, you can effectively deploy and manage Cognitive Services, ensuring their scalability and efficient operation. I hope this paper has been informative and useful in helping you leverage the full potential of Cognitive Services in large-scale deployments.


Special Thanks to Nayer Wanas for reviewing and working with us on this blog!!!

If you have any feedback or questions about the information provided above, please leave a comment below or email us at or Thank you!

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.