Build an LLM-based application, benchmark models and evaluate output performance with Prompt Flow

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs.



In this article, I’ll be covering some of the capabilities of Prompt Flow, a development tool designed to streamline the entire development cycle of AI applications powered by Large Language Models (LLMs). Prompt Flow is available through Azure Machine Learning and Azure AI Studio (Preview).

Through Prompt Flow, I will:

  • Create a NER (Named-Entity Recognition) application;
  • Test different LLMs (GPT-3.5-Turbo vs. GPT-4-Turbo) through the variants capability;
  • Evaluate output performance using a built-in evaluation method (QnA F1 Score Evaluation).


Note: As Azure AI Studio is currently in preview (February 2024), I’ll use Prompt Flow through Azure Machine Learning Studio. After the preview period, everyone should use Azure AI Studio with Prompt Flow.

Create a NER application

On the Azure portal, go to the Azure Machine Learning service, create a workspace, and launch the studio.





In Azure Machine Learning Studio, you will find different features such as:

  • Prompt Flow, a feature that allows you to author flows. Flows are executable workflows that often consist of three parts:
    • Inputs: Represent data passed into the flow. They can be of different data types such as strings, integers, or booleans.
    • Nodes: Represent tools that perform data processing, task execution, or algorithmic operations. The available tools are the LLM tool (enables custom prompt creation utilizing LLMs), the Python tool (allows the execution of custom Python scripts), and the Prompt tool (prepares prompts as strings for complex scenarios or integration with other tools).
    • Outputs: Represent the data produced by the flow.

  • Model Catalog, the hub to deploy a wide variety of third-party open-source models (Mistral, Meta, Hugging Face, Deci, Nvidia, etc.) as well as Microsoft-developed foundation models, pre-trained for various language, speech, and vision use cases. You can consume some of these models directly through their inference API endpoints, called “Models as a Service” (e.g. Meta and Mistral), or deploy a real-time endpoint on dedicated infrastructure (e.g. GPU) that you manage;

  • Notebook, to allow data scientists to create, edit, and run Jupyter notebooks in a secure, cloud-based environment;

  • Compute, a managed cloud-based workstation for data scientists. Each compute instance has only one owner, although you can share files between multiple compute instances. Use a compute instance as your fully configured and managed development environment in the cloud for machine learning. Compute instances can also be used as a compute target for training and inferencing for development and testing purposes.


Now, click on “Prompt Flow” and create a flow by selecting “Standard flow”. You should now have a flow similar to this one:




This flow represents an application made of different blocks. Let me go through each block (called a node):

  1. Inputs
    1. Takes a topic as input (the prompt).
  2. Joke (LLM node)
    1. Conditions the LLM “to tell good jokes” through a system message and takes the initial prompt as input.
    2. The node needs a Connection in order to interact with an endpoint (e.g. LLM inference API, Vector Index such as Azure Search, Azure OpenAI deployed models, Qdrant, etc.). You create this connection in a dedicated pane within Prompt Flow by specifying the provider, the endpoint, and the credentials.


  3. Echo (Python node)
    1. A Python script that takes the output (completion) of the LLM as input and echoes it.


  4. Output
    1. Outputs the … output of the Python script.
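
To make the data path concrete, here is a minimal sketch of the same four blocks written as plain Python functions. It is not how Prompt Flow executes a flow internally: `fake_llm` is a hypothetical stand-in for the Connection to a real endpoint, and the system message and prompt wording are assumptions.

```python
from typing import Callable, Dict

# Hypothetical stand-in for a model call: (system_message, prompt) -> completion.
LLMCall = Callable[[str, str], str]


def joke_node(topic: str, llm: LLMCall) -> str:
    """LLM node: conditions the model with a system message, then sends the prompt."""
    system_message = "You are a bot that tells good jokes."  # assumed wording
    return llm(system_message, f"Tell me a joke about {topic}.")


def echo_node(completion: str) -> str:
    """Python node: returns the LLM completion unchanged."""
    return completion


def run_flow(inputs: Dict[str, str], llm: LLMCall) -> Dict[str, str]:
    """Executes the nodes in order, from the flow inputs to the flow outputs."""
    completion = joke_node(inputs["topic"], llm)
    return {"joke": echo_node(completion)}


def fake_llm(system_message: str, prompt: str) -> str:
    # Deterministic placeholder so the sketch runs without any endpoint.
    return f"(completion for: {prompt})"


outputs = run_flow({"topic": "atoms"}, fake_llm)
print(outputs["joke"])
```

Swapping `fake_llm` for a function that calls a deployed model is the part that the Connection handles in the real flow.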


To test your flow, provide an input and click on Run. On the Outputs tab, you can review the results:





Now that I have a flow, I want to edit it to become a NER (Named Entity Recognition) flow that leverages an LLM to find entities from a given text content. To do that, I’ll edit the LLM node (previously named “joke”) and the Python node (previously named “echo”).


LLM node

I’ll rename the LLM node to “NER_LLM” with the configuration below.



To perform the NER I’ll use this prompt:




Your task is to find entities of a certain type from the given text content.
If there're multiple entities, please return them all with comma separated, e.g. "entity1, entity2, entity3".
You should only return the entity list, nothing else.
If there's no such entity, please return "None".

Entity type: {{entity_type}}
Text content: {{text}}
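
The `{{entity_type}}` and `{{text}}` placeholders are filled from the node inputs before the prompt is sent to the model. The sketch below only illustrates that substitution with the standard library; `render_prompt` is a hypothetical helper and not the template engine Prompt Flow itself uses, and the template is shortened to its last two lines.

```python
import re

# Shortened version of the prompt: only the two lines that carry placeholders.
NER_PROMPT_TAIL = "Entity type: {{entity_type}}\nText content: {{text}}"


def render_prompt(template: str, **values: str) -> str:
    """Replaces each {{name}} placeholder with the matching keyword argument."""

    def substitute(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value provided for placeholder '{name}'")
        return str(values[name])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


rendered = render_prompt(
    NER_PROMPT_TAIL,
    entity_type="location",
    text="Mount Everest is the highest peak in the world.",
)
print(rendered)
```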




Python node

I’ll rename the Python node to “cleansing” with the configuration below.



It runs the following Python code:



from typing import List

from promptflow import tool


@tool
def cleansing(entities_str: str) -> List[str]:
    # Split on commas, then strip surrounding spaces, tabs, dots, and double quotes
    parts = entities_str.split(",")
    cleaned_parts = [part.strip(" \t.\"") for part in parts]
    # Drop fragments that are empty after stripping (e.g. from a trailing comma)
    entities = [part for part in cleaned_parts if len(part) > 0]
    return entities




Basically, this code snippet takes the LLM completion (a comma-separated string of one or more entities) as input, strips surrounding spaces, tabs, dots, and double quotes from each element, and returns a list of the non-empty, trimmed strings.
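
A quick way to see the behavior is to run the same logic on a few typical completions. The sketch below repeats the function body without the `promptflow` import so that it runs anywhere; the sample strings are made up for illustration.

```python
from typing import List


def cleansing(entities_str: str) -> List[str]:
    # Same logic as the node above, without the promptflow dependency.
    parts = entities_str.split(",")
    cleaned_parts = [part.strip(" \t.\"") for part in parts]
    return [part for part in cleaned_parts if len(part) > 0]


print(cleansing(' "Mount Everest", K2. '))  # ['Mount Everest', 'K2']
print(cleansing("Paris,, London,"))         # ['Paris', 'London']
print(cleansing("None"))                    # ['None']
```

Note that the literal answer "None" requested by the prompt is returned as a one-element list, so a downstream consumer has to treat `["None"]` as "no entity found".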


Test the flow

To test the flow, I asked an LLM to provide me with an example. I asked the model to return, in JSON format, an “entity_type” and a “text” that contains one or more entities to extract through the NER process. I used the GPT-4-Turbo model through the Azure OpenAI Playground interface. Here is the example:

{"entity_type": "location", "text": "Mount Everest is the highest peak in the world."}  

Then, I pass those values into the inputs node of my flow:



Finally, I can run my flow. It executes node after node, from the inputs to the outputs:




I can see the result in the Outputs section:



The Trace section gives more information, such as which node made each API call, how long each call took, and the number of tokens processed on the prompt and completion sides.





If you want to test different prompts, system messages, or even different models, you can create Variants. A variant is a specific version of a tool node with distinct settings, such as another model, a different temperature, a different top_p parameter, or another prompt. This way you’re able to perform basic A/B testing.

Let’s say you want to compare the results of two of OpenAI’s most used models, GPT-3.5-Turbo and GPT-4-Turbo.

To make that happen, go to the LLM node and click on “Generate variants”. In this example I’ll keep the same prompt, the same temperature, and the same top_p parameter, but I’ll change the LLM to interact with (from GPT-4-Turbo to GPT-3.5-Turbo).


To test multiple variants at the same time, I click on Run and select all my variants (variant_0 refers to GPT-4-Turbo and variant_1 refers to GPT-3.5-Turbo), so that the results are aggregated in the same Outputs tab:



We can see that we obtain the same results regardless of the LLM used. To be honest, this example isn’t very complex and can easily be handled by a smaller model than GPT-4-Turbo, but let’s keep it simple, as the complexity of the task is not the main purpose of this blog post.
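
Conceptually, a variant run sends the same inputs through each configuration and collects the outputs side by side. The sketch below illustrates that idea only: the `Variant` class, the model labels, the temperature and top_p values, and `fake_model` are all hypothetical and do not reflect Prompt Flow's internal representation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Variant:
    name: str
    model: str          # hypothetical deployment label
    temperature: float  # assumed value, identical across variants
    top_p: float        # assumed value, identical across variants


VARIANTS: List[Variant] = [
    Variant("variant_0", "gpt-4-turbo", temperature=0.0, top_p=1.0),
    Variant("variant_1", "gpt-35-turbo", temperature=0.0, top_p=1.0),
]


def run_variants(
    variants: List[Variant],
    entity_type: str,
    text: str,
    call_model: Callable[[Variant, str, str], str],
) -> Dict[str, str]:
    """Runs the same inputs through every variant and groups outputs by variant name."""
    return {v.name: call_model(v, entity_type, text) for v in variants}


def fake_model(variant: Variant, entity_type: str, text: str) -> str:
    # Placeholder: a real implementation would call the deployment in variant.model.
    return "Mount Everest"


results = run_variants(
    VARIANTS, "location", "Mount Everest is the highest peak in the world.", fake_model
)
print(results)
```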



Now that I have a NER flow and have tested the application with different LLMs, I’d like to evaluate the output performance. This is where the Evaluate capability of Prompt Flow comes in. The evaluation feature enables you to select built-in evaluation methods and build your own custom evaluation methods.

Here, I’ll use the built-in “QnA F1 Score Evaluation” method. I won’t go deep into the details, but this evaluation method computes the F1 score based on words in the predicted answers and the ground truth.
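
For intuition, here is a minimal sketch of a word-level F1 score between a predicted answer and a ground truth. It is not the source code of the built-in method: the lowercasing, the `\w+` tokenization, and the function names are assumptions.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Assumed normalization: lowercase, keep runs of word characters.
    return re.findall(r"\w+", text.lower())


def f1_score(prediction: str, ground_truth: str) -> float:
    """Harmonic mean of word-level precision and recall."""
    pred_tokens = tokenize(prediction)
    truth_tokens = tokenize(ground_truth)
    if not pred_tokens or not truth_tokens:
        # Both empty counts as a match, one empty counts as a miss.
        return float(pred_tokens == truth_tokens)
    # Multiset intersection counts each shared word at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_f1(predictions: List[str], ground_truths: List[str]) -> float:
    """Averages the per-sample F1 scores over a data set."""
    scores = [f1_score(p, g) for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)


print(f1_score("Elon Musk", "Elon Musk"))
print(f1_score("Chief Medical Officer", "Medical Officer"))
```

With this definition, an answer that contains the right entity plus extra words is penalized on precision, and an answer that misses part of the entity is penalized on recall.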

Then, I need data samples to run the flow and evaluate the outputs at a larger scale than a single example. Generating data samples is one of the use cases of Generative AI, so I’ll use GPT-4-Turbo to generate 50 examples that will be used to run the flows and evaluate the outputs.

Here is my system message:

Your task is to generate in .jsonl format a data set that will be used to evaluate an LLM-based application.


This application is performing NER (Named Entity Recognition) with 2 inputs: "entity_type" as a string (e.g. "job title") and "text" as a string (e.g. "The software engineer is working on a new update for the application."). The desired output is the entity or entities if they're multiple (e.g. "software engineer").

Here is my prompt:

Generate 50 samples:


Here are the first five lines of the completion:



{"entity_type": "person", "text": "Elon Musk has announced a new Tesla model.", "entity": "Elon Musk"}
{"entity_type": "organization", "text": "Google is planning to launch a new feature in its search engine.", "entity": "Google"}
{"entity_type": "job title", "text": "Dr. Susan will take over as the Chief Medical Officer next month.", "entity": "Chief Medical Officer"}
{"entity_type": "location", "text": "The Eiffel Tower is one of the most visited places in Paris.", "entity": "Eiffel Tower"}
{"entity_type": "date", "text": "The conference is scheduled for June 23rd, 2023.", "entity": "June 23rd, 2023"}
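
Because the data set is generated by a model, it can be worth checking it before uploading it. The sketch below parses JSONL text and verifies that every record carries the three expected keys; `load_samples` is a hypothetical helper, and only the first two generated lines are reproduced.

```python
import json
from typing import Dict, List

SAMPLE_JSONL = """\
{"entity_type": "person", "text": "Elon Musk has announced a new Tesla model.", "entity": "Elon Musk"}
{"entity_type": "organization", "text": "Google is planning to launch a new feature in its search engine.", "entity": "Google"}
"""

REQUIRED_KEYS = {"entity_type", "text", "entity"}


def load_samples(jsonl_text: str) -> List[Dict[str, str]]:
    """Parses one JSON object per line and checks that the expected keys are present."""
    samples = []
    for line_number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # tolerate blank lines in the completion
        record = json.loads(line)  # raises json.JSONDecodeError on a malformed line
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing keys {sorted(missing)}")
        samples.append(record)
    return samples


samples = load_samples(SAMPLE_JSONL)
print(len(samples), samples[0]["entity"])
```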




Once I’m happy with my sample, I select Evaluate in Prompt Flow, where I can edit the run display name, add a description and tags, and select, for each LLM node, the variants I want to evaluate. In my case, I select the two variants I created:




Now I need to select a runtime, upload my sample, and map the inputs:



Then I select the evaluation method I want to perform (QnA F1 Score Evaluation method here). Here I need to specify data sources for the ground_truth (from the sample) and for the answer (generated by the LLM within the flow):
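
The mapping amounts to telling the evaluation where each of its inputs comes from: a column of the uploaded sample or an output of the flow run. The sketch below mimics that lookup with `${data.<column>}` and `${run.outputs.<name>}` style references. `resolve_mapping` is a hypothetical helper, not a Prompt Flow function, and the flow output name `entities` is an assumption.

```python
import re
from typing import Any, Dict

# Assumed mapping: ground truth from the sample, answer from the flow output.
COLUMN_MAPPING = {
    "ground_truth": "${data.entity}",
    "answer": "${run.outputs.entities}",
}

REFERENCE = re.compile(r"\$\{(data|run\.outputs)\.(\w+)\}")


def resolve_mapping(
    mapping: Dict[str, str], data_row: Dict[str, Any], run_outputs: Dict[str, Any]
) -> Dict[str, Any]:
    """Replaces each reference with the value it points to; other strings stay literal."""
    sources = {"data": data_row, "run.outputs": run_outputs}
    resolved = {}
    for target, expression in mapping.items():
        match = REFERENCE.fullmatch(expression)
        if match is None:
            resolved[target] = expression
        else:
            source, key = match.groups()
            resolved[target] = sources[source][key]
    return resolved


row = {"entity_type": "organization", "text": "Google is planning...", "entity": "Google"}
evaluation_inputs = resolve_mapping(COLUMN_MAPPING, row, {"entities": ["Google"]})
print(evaluation_inputs)
```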



Finally, I can click on “Review + Submit”. Behind the scenes, Prompt Flow executes my flow in 2 separate runs, one with variant_0 and the other with variant_1. Once these runs are completed, it applies the QnA F1 Score Evaluation method to both runs.

We can see the results of the executions on the Runs tab:




The first observation is the duration of each execution:

  • The execution based on variant_0 (GPT-4-Turbo) took 1 min 14 s to complete;
  • The execution based on variant_1 (GPT-3.5-Turbo) took 14 s to complete.

One thing to keep in mind in the LLM world is that using a larger model will, most of the time, result in slower inference.


Now let’s have a look at the evaluations. By selecting both evaluation runs we can output results:



We can observe that the flow with the highest F1 score is the one leveraging the GPT-4-Turbo model (F1 score of 0.95), compared to the GPT-3.5-Turbo model (F1 score of 0.89).

Although the larger model achieves a better evaluation score, the GPT-3.5-Turbo run completed in about 80% less time (14 s vs. 1 min 14 s) and is more cost-effective as well.
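
The trade-off can be checked with a few lines of arithmetic on the figures reported above. The variable names are only for illustration, and the durations are batch run times for the whole sample, not per-request latencies.

```python
# Run durations and F1 scores reported above.
gpt4_seconds = 1 * 60 + 14   # variant_0, GPT-4-Turbo
gpt35_seconds = 14           # variant_1, GPT-3.5-Turbo
gpt4_f1, gpt35_f1 = 0.95, 0.89

time_reduction = 1 - gpt35_seconds / gpt4_seconds  # share of run time saved
speedup = gpt4_seconds / gpt35_seconds             # how many times faster
f1_gap = gpt4_f1 - gpt35_f1                        # score given up for the speed

print(f"time reduction: {time_reduction:.0%}")  # 81%
print(f"speedup: {speedup:.1f}x")               # 5.3x
print(f"F1 gap: {f1_gap:.2f}")                  # 0.06
```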

Inference speed and token pricing are some of the trade-offs you need to weigh in order to choose the right model for your needs.




In conclusion, this post covered Prompt Flow within Azure Machine Learning and Azure AI Studio to build and evaluate AI applications powered by Large Language Models (LLMs). It walked through the process of creating a Named-Entity Recognition (NER) application, testing it with different LLMs (GPT-3.5-Turbo and GPT-4-Turbo), and evaluating the output performance using the built-in QnA F1 Score Evaluation method.

It also demonstrated the use of variants to perform A/B testing between different models, and a performance evaluation that uses generated data samples to calculate the F1 score, highlighting the trade-offs between inference speed, model size, and cost-effectiveness.
