Strategies for Optimizing High-Volume Token Usage with Azure OpenAI


Building AI solutions with high-volume token usage presents real challenges. This post explores strategic recommendations for overcoming token limits, optimizing model deployments, and applying practical techniques for maximizing token usage with Azure OpenAI.

 

Key Challenges

 

Reaching maximum token limits in Azure OpenAI

The models available in Azure OpenAI Service, including GPT-3.5 Turbo and GPT-4, have hard maximum token limits per request. These limits ensure the models operate efficiently and produce relevant, cohesive responses. Although the limits increase with newer models, ISVs and Digital Natives still need to explore alternative approaches to work around them for their project needs.

 

Taking advantage of appropriate LLM techniques for use cases

Different models have different capabilities and limitations. GPT-3.5 is the most cost-effective to deploy and significantly cheaper to run, but this comes at the expense of a smaller token limit. GPT-4 offers a larger context window and can solve more complex queries with greater accuracy. ISVs and Digital Natives must choose appropriate techniques for applying LLMs to their business needs in order to maximize their token usage.

 

Optimizing multiple service and model deployments in Azure OpenAI

Achieving scalability while avoiding underutilization or overloading of model deployments is a significant hurdle. Sharing a single Azure OpenAI Service instance among multiple tenants can lead to a Noisy Neighbour problem, where heavy usage by one tenant degrades the service for others. Single deployments also become a bottleneck as a user base grows, requiring ISVs and Digital Natives to consider efficient mechanisms for managing multiple deployments and allocating costs to customers.

 

Recommendations

As ISVs and Digital Natives creating reliable AI solutions with high-volume token usage, you should:

 

  • Take a step-by-step approach to discovering the potential use cases for specific models in Azure OpenAI. Identify where one or more can be deployed to achieve a cost-effective solution. Recognize that using multiple models in conjunction for different use cases can optimize token usage and overall performance.
  • Experiment with strategies for creating embeddings for providing related context to your AI prompts. This can, in most cases, reduce the overall number of tokens without compromising response quality. Combine with prompt engineering techniques to craft precise and targeted prompts, minimizing unnecessary token usage while achieving a desired output.
  • Maximize overall token availability by deploying multiple instances across multiple regions, employing load balancing techniques for even distribution of requests and global reach. Implement appropriate monitoring tools to observe token usage to support further improvement and enhancement to your AI solution.

 

Optimizing AI solutions with high-volume token usage

ISVs and Digital Natives are increasingly leveraging the power of the Azure OpenAI Service in new and existing multitenant, software-as-a-service (SaaS) architectures to push the boundaries of their solutions and meet their customers’ changing expectations. According to a 2023 report published by the Stanford Institute for Human-Centered Artificial Intelligence (HAI), the share of companies adopting AI solutions has increased to 50-60%. This highlights growing demand for AI from the customers of solutions provided by ISVs and Digital Natives.

 

However, engineering teams transitioning from well-established development processes to this fast-paced, innovative technology face new challenges. Not only are they tasked with integrating with the APIs, but they also need to consider how to adopt and manage services and models to provide a reliable AI service across their user base.

 

This leads ISVs and Digital Natives to ask, “How do we establish best practices in our AI solutions for handling high volumes of tokens?”

 

This article explores the key focus areas of high-volume token usage with Azure OpenAI. It highlights where ISVs and Digital Natives can make improvements to deliver reliable multitenant SaaS AI solutions.

 

Understanding tokens and limits in Azure OpenAI Service

Tokens are the units that models work with: text is broken into individual characters, words, or parts of words so that models, such as the OpenAI GPT family, can process them for text generation, translation, or summarization.

 

Showcasing how tokenization works for processing and generation in large language models (LLMs)

 

With the Byte-Pair Encoding (BPE) tokenization method, the most frequently occurring pairs of characters are merged into a single token. The models learn the statistical relationships between these tokens and excel at predicting the next token in a sequence.

 

The architecture of each model determines a maximum number of tokens that can be processed in a single request. For example, GPT-3.5 Turbo has a token limit of 4,096. This means that it can manage 4,096 tokens in one go, including both the prompt and the completion.
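
As a rough check before sending a request, you can count tokens locally. The sketch below uses the open-source tiktoken library; the reserved completion budget is an illustrative value, not a service requirement.

```python
# Minimal sketch: estimate how many tokens a prompt will use before calling a
# gpt-35-turbo deployment, leaving room for the completion within the 4,096 limit.
import tiktoken

MODEL_TOKEN_LIMIT = 4096       # gpt-35-turbo: prompt + completion share this budget
RESERVED_FOR_COMPLETION = 500  # illustrative room left for the model's response

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")

def fits_in_context(prompt: str) -> bool:
    """Check that the prompt leaves enough room for the completion."""
    return len(encoding.encode(prompt)) + RESERVED_FOR_COMPLETION <= MODEL_TOKEN_LIMIT

print(fits_in_context("Summarize the following contract in three bullet points: ..."))
```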

 

Comparing Azure OpenAI Service models by token and token rate limits

| Model | Token Limit | Tokens Per Minute |
| --- | --- | --- |
| gpt-35-turbo | 4,096 | 240-300K |
| gpt-35-turbo-16k | 16,384 | 240-300K |
| gpt-4 | 8,192 | 20-40K |
| gpt-4-32k | 32,768 | 60-80K |
| gpt-4-turbo | 132,096 (128K in, 4K out) | 80-150K |
| text-embedding-ada-002 | 8,191 | 240-350K |

 

Azure OpenAI Service applies additional rate limits on top of these model-specific limitations for each model deployment per region. Tokens-per-minute (TPM) is a configurable limit, set per model deployment per region, that represents your best estimate of expected token usage over time. The requests-per-minute (RPM) rate limit is set proportionally to the TPM, at 6 RPM per 1,000 TPM. These quota limits help manage the compute resources the models need to process customer requests: the more tokens a model must process, the more compute is required.
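
As an illustration, a deployment configured with 30,000 TPM allows roughly 180 RPM (30 × 6). The sketch below, which assumes the openai Python SDK v1 and placeholder endpoint, key, and deployment names, shows one way to back off and retry when a deployment's quota is exhausted and the service returns a rate-limit error.

```python
# Minimal sketch: retry with exponential backoff when a deployment's TPM/RPM
# quota is hit (HTTP 429, surfaced as RateLimitError by the SDK).
import os
import time

from openai import AzureOpenAI, RateLimitError

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],          # placeholder
    api_version="2024-02-01",
)

def chat_with_backoff(messages, deployment="gpt-35-turbo", max_retries=5):
    """Call the deployment, backing off exponentially when the quota is hit."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except RateLimitError:
            time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying
    raise RuntimeError("Rate limit still exceeded after retries")
```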

 

It is important to consider the specific model token limits as well as the additional Azure OpenAI quota limits when architecting AI solutions.

 

Choosing the right model for specific use cases

Before choosing a specific model, define your business objectives and use cases so you can understand how each model helps you achieve your goals.

 

The GPT family models are best used for natural language processing tasks such as chatbots, Q&A, language translation, text generation, and summarization. These models can generate high-quality content that is coherent and contextually relevant.

 

Text embedding models, on the other hand, perform better for tasks such as document search, sentiment analysis, content filtering, and classification. These models can represent text as a vector, a numerical representation which can be used to measure the similarity between different texts.
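
As a minimal illustration of this idea, the sketch below generates vectors with a hypothetical text-embedding-ada-002 deployment and compares two texts using cosine similarity; the endpoint, key, and deployment names are placeholders.

```python
# Minimal sketch: measure semantic similarity between two texts using embeddings.
import math
import os

from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # placeholder
    api_key=os.environ["AZURE_OPENAI_API_KEY"],          # placeholder
    api_version="2024-02-01",
)

def embed(text: str, deployment: str = "text-embedding-ada-002") -> list[float]:
    """Return the embedding vector for a piece of text."""
    response = client.embeddings.create(model=deployment, input=text)
    return response.data[0].embedding

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """How close two vectors are in direction (1.0 = identical)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

score = cosine_similarity(embed("refund policy"), embed("How do I return an item?"))
print(f"similarity: {score:.3f}")
```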

 

Conduct workshops with your engineering teams to collaboratively map out potential use cases to the various models. Identify specific uses where GPT models will support your requirements for natural language tasks, while exploring where you can optimize your cost-effectiveness using text embedding models for semantic analysis.

 

Avoid a one-size-fits-all approach when considering your models. Recognize that each model excels in distinct areas. Tailor your choices based on the specific requirements of your use cases to achieve significant cost-efficiency in your token usage. Consider that you may use multiple models in conjunction for your use cases to optimize your token usage further.

 

Taking advantage of embeddings to provide semantic context in prompts to GPT models

Embeddings are numerical representations of the text you provide to a model such as text-embedding-ada-002, capturing the contextual relationships and meaning behind it.

 

Embeddings serve as a powerful tool for enriching a prompt to GPT models with semantic understanding of your existing data. By locating related text with embeddings, you can provide GPT models with condensed semantic context, which results in fewer tokens being used. This is crucial in high-volume token scenarios, contributing to cost savings without compromising the quality of responses.
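
The sketch below illustrates one way this pattern might look in code: rank pre-embedded chunks of your own content against the user's question and include only the closest matches in the prompt. It reuses the embed and cosine_similarity helpers from the previous sketch, and chunk_store is a hypothetical in-memory list of (text, embedding) pairs.

```python
def build_grounded_messages(question: str, chunk_store, top_k: int = 3):
    """Build chat messages that ground the question in the closest matching chunks."""
    question_vector = embed(question)
    ranked = sorted(
        chunk_store,
        key=lambda item: cosine_similarity(question_vector, item[1]),
        reverse=True,
    )
    context = "\n\n".join(text for text, _ in ranked[:top_k])
    return [
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ]

# Example: client.chat.completions.create(model="gpt-35-turbo",
#                                          messages=build_grounded_messages(q, chunk_store))
```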

 

When generating embeddings, it is important to note that token limits apply to the amount of content processed in a single request. Unlike GPT models, these requests do not require a prompt. However, you still need an appropriate strategy for segmenting text into chunks so that the semantic relationships in the text are captured effectively.

 

Consider splitting text by the most appropriate method for your use cases, such as by paragraph, section, key phrases, or applying clustering algorithms to group text into similar segments. Experiment, iterate, and refine your strategies for embeddings to optimize performance.
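
As one example of such a strategy, the sketch below splits text on paragraph boundaries and merges paragraphs until a token budget is reached, using tiktoken to count tokens; the 500-token budget is an assumption you would tune for your own content.

```python
# Minimal sketch: paragraph-based chunking under a token budget, keeping each
# chunk well within the embedding model's 8,191-token request limit.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

def chunk_by_paragraph(text: str, max_tokens: int = 500) -> list[str]:
    """Merge paragraphs into chunks that stay under a token budget."""
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}".strip()
        if current and len(encoding.encode(candidate)) > max_tokens:
            chunks.append(current)   # budget exceeded; start a new chunk
            current = paragraph
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```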

 

Utilizing prompt engineering to minimize unnecessary token usage

Prompt engineering involves crafting input queries or instructions in a way that extracts the most relevant information from the model while minimizing the number of tokens used. It is a strategic approach to achieve precise and resource-efficient interactions with Azure OpenAI.

 

Applying prompt engineering appropriately is crucial to maximizing the efficiency and reliability of LLM solutions. Choose succinct, targeted prompts that convey the desired output for each use case. Use your understanding of how the GPT models tokenize text to trim and segment the instructions in prompts, reducing the overall prompt size without sacrificing response quality. Avoiding unnecessary verbosity keeps token usage to a minimum.
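
The sketch below illustrates the point with a deliberately verbose instruction and a tighter version of the same request; the wording and resulting token counts are purely illustrative.

```python
# Minimal sketch: compare token counts of a verbose prompt and a concise one.
import tiktoken

encoding = tiktoken.get_encoding("cl100k_base")

verbose = (
    "I would really like you to please read the following customer review very "
    "carefully and then, if at all possible, provide me with a summary of it, "
    "and also let me know whether the sentiment is positive or negative."
)
concise = "Summarize the review in one sentence and label the sentiment (positive/negative)."

print(len(encoding.encode(verbose)), len(encoding.encode(concise)))
```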

 

Test multiple prompts and context retrieval techniques for your scenarios to validate the accuracy and reliability of the generated content. Utilize tools such as Prompt Flow in Azure AI Studio to streamline the development of AI applications and evaluate the performance of your prompts.

 

Scaling out to increase Azure OpenAI service availability

With Azure OpenAI becoming a critical component of AI workloads, strategies for ensuring reliability and availability of this functionality are vital. With the limitations set by the service for model deployments per region, maximizing token usage can be achieved through multiple deployments across regions.

 

Applying load balancing techniques in front of each Azure OpenAI Service instance distributes requests evenly across regions, ensuring high availability for customers. Load balancing also adds resiliency, enabling a seamless failover to another region if the rate limits for one region are reached.

 

Example architecture using load balanced, multi-region deployments of Azure OpenAI Service

For scenarios where global reach is a requirement, deploying the same Azure OpenAI infrastructure across multiple regions can provide a better user experience for customers. Requests from client applications can be routed to the nearest appropriate region, while the load balancing and failover to other regions maximize token usage.
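
In production this routing typically sits in a gateway such as Azure API Management or a global load balancer, but the application-side sketch below illustrates the failover idea: try the first region and fall back to another when its quota is exhausted. The environment variable names, API version, and deployment name are placeholders.

```python
# Minimal sketch: fail over between Azure OpenAI resources in different regions.
import os

from openai import AzureOpenAI, RateLimitError

# One Azure OpenAI resource per region; each has its own endpoint and key (placeholders).
REGIONS = [
    (os.environ["AOAI_ENDPOINT_EASTUS"], os.environ["AOAI_KEY_EASTUS"]),
    (os.environ["AOAI_ENDPOINT_WESTEUROPE"], os.environ["AOAI_KEY_WESTEUROPE"]),
]

clients = [
    AzureOpenAI(azure_endpoint=endpoint, api_key=key, api_version="2024-02-01")
    for endpoint, key in REGIONS
]

def chat_with_failover(messages, deployment="gpt-35-turbo"):
    """Try each regional deployment in turn until one accepts the request."""
    last_error = None
    for client in clients:
        try:
            return client.chat.completions.create(model=deployment, messages=messages)
        except RateLimitError as err:
            last_error = err  # this region is throttled; try the next one
    raise last_error
```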

 

Although it is possible to create multiple deployments of the same model in a single instance, the TPM/RPM limits on model deployments per region restrict the usefulness of per-tenant model deployments.

 

As high-volume token usage increases, consider solutions for tracking token usage across customers in multitenant scenarios by introducing monitoring tools, such as Azure Managed Grafana, to simplify the process.
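
As a starting point, the usage field returned with every completion can be attributed to the calling tenant before being exported to a monitoring tool. The sketch below accumulates these counts in an in-memory dictionary purely for illustration; record_usage and the tenant IDs are hypothetical.

```python
# Minimal sketch: attribute reported token usage to tenants in a multitenant app.
from collections import defaultdict

# Hypothetical in-memory store; in practice you would push these metrics to your
# monitoring pipeline (for example, a dashboard in Azure Managed Grafana).
tenant_usage = defaultdict(lambda: {"prompt_tokens": 0, "completion_tokens": 0})

def record_usage(tenant_id: str, response) -> None:
    """Accumulate the token counts Azure OpenAI reports for a completion."""
    tenant_usage[tenant_id]["prompt_tokens"] += response.usage.prompt_tokens
    tenant_usage[tenant_id]["completion_tokens"] += response.usage.completion_tokens

# Example: record_usage("contoso", client.chat.completions.create(...))
```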

 

Adopt DevOps best practices, including infrastructure-as-code, when deploying and managing a complex, multi-region Azure AI infrastructure. This approach simplifies the deployment process, minimizes human error, and ensures consistency across all regions.

 

Consider all limitations when architecting a high-volume token usage scenario including TPM and RPM per model deployment in each region.

 

Conclusion

Creating reliable AI solutions with high-volume token usage with Azure OpenAI requires a strategic and multifaceted approach. ISVs and Digital Natives must navigate the constraints of model token limits, choose appropriate models for their use cases, explore combinations of prompts and context retrieval, and optimize their model deployments to maximize their token usage.

 

As the demand for AI solutions continues to grow, ISVs and Digital Natives are challenged to establish best practices for production. With a collaborative, systematic approach, they can push the boundaries of possibilities with Azure OpenAI to deliver reliable AI solutions that meet their evolving customer expectations.

 

