New generative AI app evaluation and monitoring capabilities in Azure AI Studio


Advancements in language models like GPT-4 and Llama 3 offer promising opportunities, such as improving customer service and enhancing personalized education. These technologies can also pose serious risks if not properly managed: they can perpetuate biases or misinformation, expose applications to security threats such as prompt injection attacks, or simply provide end users with low-quality experiences that degrade trust. Developers are tasked with addressing these risks while also ensuring applications run efficiently at scale.

 

In this blog, we will cover new automated evaluation and monitoring capabilities in Azure AI Studio to support more secure, trustworthy, and delightful AI applications.

 

  • Evaluators, now in preview, enable developers to design AI-assisted evaluation assets to evaluate any combination of defined metrics using an assigned Azure OpenAI model deployment. Evaluators can be customized, versioned, and shared across an organization for improved evaluation consistency across projects. Developers can run evaluators locally and log results in the cloud using the prompt flow SDK or run them as part of an automated evaluation within the Azure AI Studio UI.
  • Monitoring for generative AI applications, now in preview, enables organizations to monitor key metrics such as token usage, generation quality, request count, latency, and error rate in production from within Azure AI Studio. Users can visualize trends, receive timely alerts, and consume per-request tracing to inform continuous improvements to application design.

 

Customize your automated evaluations with evaluators


 

As organizations scale app development and deployment, they need a way to perform consistent, systematic evaluations to assess whether an application’s outputs align with their business criteria and goals before deployment. Azure AI Studio enables this through automation, industry-standard measurements, and high-quality test dataset generation when customers need it. Developers can run, save, update, and compare automated evaluation results using pre-built quality and safety metrics, and they can customize or create their own metrics tailored to their unique tasks and objectives.

 

With evaluators, customers can take evaluation customization and scale one step further. Evaluators are immutable prompt flow assets that contain evaluation metrics, parameters, and an Azure OpenAI model deployment to perform the assessment. Microsoft provides pre-built evaluators for specific quality and safety metrics and customers can also create their own evaluators that contain their preferred metrics and parameters.

 


 

For example, Contoso Camping may have multiple internal and external-facing copilots, each with a unique purpose. Despite this, they may want to evaluate every copilot they build for its ability to stay on brand. So, they create a “Contoso Brand Evaluator” consisting of some pre-built metrics (e.g., coherence) and custom metrics (e.g., friendliness) and use that as a consistent benchmark to assess whether each copilot meets their organizational standards.

 

Evaluators can be customized, versioned, and shared across an organization for improved consistency across projects. Developers can run evaluators locally and log results in the cloud using the prompt flow SDK or run them as part of an automated evaluation within the Azure AI Studio UI.

 

Create an evaluator

You can create code-based and prompt-based evaluators:

  • Code-based evaluators offer flexibility to define metrics based on functions or callable classes without requiring a large language model, as seen in examples like answer_length.py (a minimal sketch follows this list).
  • Prompt-based evaluators enable the creation of custom evaluators using prompty files, which contain YAML-formatted metadata defining model configuration and expected inputs, as illustrated by the example apology.prompty file.
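
As a rough illustration of the code-based flavor, the sketch below mirrors the idea behind answer_length.py: a plain callable class that returns a dictionary of metric values, with no LLM involved. The class name, file name, and metric key are illustrative, not part of the SDK.

# answer_length_evaluator.py -- illustrative code-based evaluator (no LLM required)
class AnswerLengthEvaluator:
    """Returns the character length of an answer as a custom metric."""

    def __call__(self, *, answer: str, **kwargs):
        # Evaluators return a dictionary mapping metric names to values for one row
        return {"answer_length": len(answer)}


# Usage: instantiate once, then score rows individually or pass it to a batch evaluation run
answer_length_eval = AnswerLengthEvaluator()
print(answer_length_eval(answer="The Alpine Explorer Tent is the most waterproof."))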


 

Generate a test dataset in Azure AI Studio

High-quality test data is essential for assessing your generative AI applications. You may already possess data in-house from sources such as customer care centers or red teamers. However, there could be instances where you lack adequate test data to evaluate the quality and safety of your AI systems, or where your existing datasets do not sufficiently cover the adversarial queries real-world users might send. This is where synthetic adversarial dataset generation can be valuable. To create this data, Azure AI will role-play with your app using targeted adversarial prompt templates developed by Microsoft Research to generate an adversarial dataset for risk and safety evaluations.
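
As a rough sketch of what this looks like with the prompt flow SDK, the example below assumes the adversarial simulator shipped in the promptflow-evals preview package (AdversarialSimulator with an AdversarialScenario) plus azure-identity for authentication; exact class and parameter names may differ in your SDK version, and the callback is just a placeholder for your own application.

import asyncio
from azure.identity import DefaultAzureCredential
from promptflow.evals.synthetic import AdversarialSimulator, AdversarialScenario

azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

# Stand-in for your application: receives the simulated adversarial query and
# appends your app's reply to the conversation before handing it back.
async def app_callback(messages, stream=False, session_state=None, context=None):
    query = messages["messages"][-1]["content"]
    reply = "..."  # call your application here with `query`
    messages["messages"].append({"role": "assistant", "content": reply})
    return {"messages": messages["messages"], "stream": stream,
            "session_state": session_state, "context": context}

async def main():
    simulator = AdversarialSimulator(
        azure_ai_project=azure_ai_project, credential=DefaultAzureCredential()
    )
    outputs = await simulator(
        scenario=AdversarialScenario.ADVERSARIAL_QA,  # single-turn adversarial question answering
        target=app_callback,
        max_simulation_results=10,  # number of adversarial rows to generate
    )
    # Each item is a simulated adversarial conversation you can persist as a
    # JSONL test dataset for the risk and safety evaluators
    for conversation in outputs:
        print(conversation)

asyncio.run(main())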

 

Run an evaluator

You have two options for running evaluators: a code-first approach and a low-code approach in the Azure AI Studio UI. If you want to evaluate your applications with a code-first approach, you’ll use the evaluation package of our prompt flow SDK.

 

When using AI-assisted quality metrics, you must specify an Azure OpenAI Service model to perform the assessment. Choose a GPT-3.5, GPT-4, or Davinci model deployment for your calculations and set it as your model_config. You can run the built-in evaluators by importing the desired evaluator class. Ensure that you set your environment variables.

 

 

import os

from promptflow.core import AzureOpenAIModelConfiguration
from promptflow.evals.evaluators import RelevanceEvaluator

# Initialize Azure OpenAI connection with your environment variables
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_key=os.environ.get("AZURE_OPENAI_API_KEY"),
    azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    api_version=os.environ.get("AZURE_OPENAI_API_VERSION"),
)

# Initializing Relevance Evaluator
relevance_eval = RelevanceEvaluator(model_config)

# Running Relevance Evaluator on a single input row
relevance_score = relevance_eval(
    response="The Alpine Explorer Tent is the most waterproof.",
    context="From our product list,"
    " the alpine explorer tent is the most waterproof."
    " The Adventure Dining Table has higher weight.",
    query="Which tent is the most waterproof?",
)
print(relevance_score)

 

 

When utilizing AI-assisted risk and safety metrics, you don't need to bring your own Azure OpenAI Service model deployment. Instead of model_config, input your azure_ai_project details. This connects to the Azure AI Studio safety evaluations backend service, which provides a GPT-4 model capable of producing content risk severity scores and explanations.

 

 

from promptflow.evals.evaluators import ViolenceEvaluator

azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

# Initializing Violence Evaluator with project information
violence_eval = ViolenceEvaluator(azure_ai_project)

# Running Violence Evaluator on a single input row
violence_score = violence_eval(
    question="What is the capital of France?",
    answer="Paris.",
)
print(violence_score)
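
Beyond scoring single rows, you can run one or more evaluators over an entire test dataset and log the results to your Azure AI project. The sketch below assumes the evaluate function from the promptflow-evals preview package and reuses the evaluators and azure_ai_project defined above; the data.jsonl path and the evaluator keys are illustrative.

from promptflow.evals.evaluate import evaluate

# Run several evaluators over a JSON Lines test dataset; each row must contain
# the fields the evaluators expect (e.g., query/response/context or question/answer).
result = evaluate(
    data="data.jsonl",  # illustrative path to your test dataset
    evaluators={
        "relevance": relevance_eval,
        "violence": violence_eval,
    },
    azure_ai_project=azure_ai_project,  # log the run so it appears in Azure AI Studio
    output_path="./evaluation_results.json",  # optionally also write results locally
)
print(result["metrics"])  # aggregate metrics across the dataset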

 

 

Alternatively, you can create an evaluation run from the evaluation and prompt flow pages in the Azure AI Studio UI and the evaluation wizard will guide you through the process of setting up an evaluation run.

 

Visualize evaluation results

After submitting your evaluation run, you can locate it within the run list in the Evaluations page in Azure AI Studio and click on any run to view results. You’ll see relevant details such as task type, prompt, temperature, and more. You can also view summary charts or dig into individual data samples to check whether the scoring aligns with expectations or needs adjustment.

 

To summarize, automated evaluation in Azure AI Studio can provide the metrics and systematic testing necessary to guide development towards quality and safety. These evaluations inform targeted mitigation steps, such as prompt engineering and content filtering, to help developers ensure that an app performs as desired. Once you feel comfortable with the safety and quality of your AI applications, it’s time to put them in production and enable monitoring!

 

Continuously monitor your generative AI applications


 

Once your application is successfully deployed in production, it’s imperative that you begin continuously monitoring its performance to inform ongoing improvements and interventions. The world of generative AI is dynamic; changes in consumer behavior, data, and environmental factors can influence your application’s performance over time. These variables can cause your deployed application to become outdated, resulting in exposure to compliance, economic, and reputational risks. As a result, prioritizing monitoring as part of your Large Language Model Operations (LLMOps) strategy is more important now than ever before.

 

Azure AI Studio monitoring for generative AI applications, now in preview, enables you to monitor your prompt flow applications in production for token usage, generation quality, and operational metrics. Azure OpenAI Service customers can also access risk and safety monitoring, coming soon to Azure AI Studio, to understand which model inputs, outputs, and end users are triggering content filters so they can inform safety mitigations.

 

With this information, your organization can make proactive improvements to deployed applications based on fresh signals. Capabilities of Azure AI Studio monitoring include:

  • Apply generation quality metrics such as groundedness, coherence, fluency, and relevance, which are interoperable with prompt flow evaluation metrics.
  • Monitor prompt, completion, and total token usage across each model deployment in your prompt flow.
  • Monitor operational metrics, such as request count, latency, and error rate.
  • Enable continuous alerting and metric computations for nonstop observability.
  • Consume data visualizations and per-request tracing data in Azure AI Studio.

 

To enable monitoring for your deployed prompt flow application, begin by navigating to the deployment within your Azure AI Studio project. Enable generation quality monitoring for the application and complete the required configurations.

 

Enable generation quality monitoring for your deployed prompt flow generative AI application.

 

After your application has been used in production (for example, via the REST API), navigate to the Monitoring tab to view the monitoring results. Begin by adjusting the time selector to your desired time window. Then you can view the monitoring results within the comprehensive dashboard.
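
If you haven’t sent production traffic yet, a quick way to generate some is to call the deployment’s scoring endpoint directly. The sketch below is illustrative only: the endpoint URL, key, and input field names depend on your deployment and your flow’s defined inputs, so copy the real values from the deployment’s Consume tab.

import requests

# Illustrative values: copy the actual scoring URL and key from your deployment's Consume tab
endpoint_url = "https://<your-endpoint>.<region>.inference.ml.azure.com/score"
api_key = "<your-endpoint-key>"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}

# The request body must match your prompt flow's input schema; "question" is a placeholder field name
payload = {"question": "Which tent is the most waterproof?"}

response = requests.post(endpoint_url, headers=headers, json=payload)
response.raise_for_status()
print(response.json())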

 

Monitor the token usage of your application.

 

The bar at the top showcases aggregate-level information such as the total number of requests as well as the number of prompt, completion, and total tokens. The card below provides visibility into the number of generation quality violations. Navigate between the Token usage, Generation quality, and Operational tabs to view the monitoring metrics.

 

Monitor the generation quality of your application.

 

If you select the Trace button for an individual query, you can view the detailed tracing information. This lets you drill down into potential issues that may have arisen within your production application. You can then use this information to proactively improve your generative AI application to ensure it is meeting your customers' needs.

 

Monitor per-request tracing.

Get started in Azure AI Studio

Evaluation and monitoring are critical elements of the generative AI application lifecycle. With greater observability into the performance of your applications pre- and post-production, you can deliver AI applications that are more efficient and trustworthy by design.

 

Learn more about how to evaluate and monitor your generative AI applications in Azure AI Studio in our documentation.
