NL to SQL Architecture Alternatives

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.



The recent advances in LLMs have created opportunities for businesses to gain new insight into their unstructured text data. A classic example of this has been using the popular Retrieval Augmented Generation pattern (e.g. in which enterprise documents are used as context to answer user questions. These implementations can reduce time spent sifting through lengthy documentation and promote awareness throughout an organization.


But what about using LLMs to extract insights from structured relational data such as SQL databases? Many enterprises possess enormous amounts of SQL data that currently requires technical expertise to query or build business dashboards to make this data accessible to end users. However, it is costly to build these dashboards and they are limited in their scope.


With LLMs, there is a new potential to enable non-technical users to extract information from SQL databases by using an LLM to dynamically generate SQL queries, execute those SQL queries, and use the results as context to an LLM to provide a response to the end user. This use case has potential to enhance understanding of business data and reduce time and money spent on extracting data from SQL databases.


This guide will show multiple approaches to building a chat application that leverages SQL data as context. We will start with the simplest architecture and work progressively toward more sophisticated methods to address limitations.


Note: Truncated prompts will be provided in the examples below, but more complete and comprehensive prompts will be included in the final section.


Architecture 1: Few Shot SQL Generation

In this first implementation of an NL to SQL chat app, the LLM builds a SQL query based on the user’s question and a fixed SQL schema that is included in the system prompt.

 Architecture 1.png

Figure 1 – Few Shot SQL Generation


  1. User asks a question: through a front end UI, the user submits a question. For example, “What was the product line with the highest revenue in 2022?”
  2. App makes a request to the LLM: the LLM is invoked with a prompt that might look like the following (for more comprehensive prompts see Appendix A),

“You are an expert at providing facts from a SQL Database. Given the user question, produce a PostgresSQL query which, when executed provides the correct answer to the question. Only use the tables and schema given below to answer the question. SQL Schema: {Schema}

User Question: {Question}”


  1. LLM responds with a SQL query: the LLM will generate a SQL query string as a response that might look like, “SELECT p.product_line, SUM(s.revenue) AS total_revenue … “. This string gets stored as a variable, to be used in the next step.
  2. SQL Query string is executed, result is saved: the previous string that was generated is sent to the SQL query to be executed. The result of the SQL query will be saved as a text string (e.g. “Home Appliances, $12,500,000”)  Note: this poses a potential risk of SQL injection, ensure you are limiting permissions, and see final section regarding Security and Safety
  3. Final LLM call performed: A second request is sent to the LLM, this time with the user’s original question, and the result of SQL Query. The prompt might look like:

Given the original question, the corresponding PostgreSQL query, and the execution results, produce a truthful and accurate natural language response to the original question.

Question: {question}”

  1. Final Response: The LLM responds with the answer to the user’s question, based on the SQL query results.


Benefits / Limitations: This architecture is quite simple and benefits in terms of latency by having only two LLM calls. However, this setup is limited because the LLM does not have any examples of valid SQL queries to reference.


Architecture 2: Few Shot SQL Generation with RAG

This implementation is similar to the previous architecture, but rather than having a fixed set of few-shot examples, we create a data preparation step in which example queries are embedded into a vector database and then retrieved at run time.

Architecture 2.png

Figure 2 – Few Shot SQL Generation with RAG


Below, we discuss the new steps that differentiate Architecture 2 from Architecture 1:

  1. Offline Data Prep: In order to best leverage the few shot examples framework, a set of example user questions and corresponding SQL queries (50-100 samples) should be collected and stored in JSON or CSV format. Then, each of these pairs should be embedded in a vector database such as Azure AI Search (Search over JSON blobs - Azure AI Search | Microsoft Learn).{ "question": "Which product line had the highest sales in FY2022?" "sql_query": "SELECT value FROM table_1..." }
  2. (Runtime) User submits question - After the data preparation in step 1, the application starts in the same way as Architecture 1. The user submits a question, and the app sends the question first to the embeddings deployment of Azure OpenAI.
  3. Embed user’s question: Make a request to a text embeddings model, such as text-ada-002, to convert the user’s question to a vector.
  4. Retrieve few shot examples: Then retrieve the top 5 examples with the closest similarity (using Azure AI Search cosine similarity algorithm) and include them in the SQL Generation request to GPT-4.
  5. Steps 5-8: Proceed in the same order as Architecture 1. The SQL string is generated, executed, and the results are used as context to provide the final response to the end user.

Benefits / Limitations: This architecture is quite simple and benefits in terms of latency by having only two LLM calls. However, this setup is limited because the LLM does not have any examples of valid SQL queries to reference.


Architecture 3: Fine Tuning SQL Generation with GPT 3.5T

One of the central challenges in building an NL to SQL based chat application is handling latency. One method to reduce latency is by opting to fine tune a smaller model, such as GPT 3.5T for the SQL generation step.


Architecture 3.png

Figure 3 – Fine Tuning SQL Generation with GPT 3.5T


This architecture follows roughly the same pattern as architecture 1, but with a preparatory fine tuning step.


  1. Offline Fine Tuning – Collect 50-100 NL questions and corresponding SQL Queries. Use Azure OpenAI’s fine tuning capability to fine tune your GPT 3.5 deployment (Customize a model with Azure OpenAI Service - Azure OpenAI | Microsoft Learn)
  2. Steps 2-7 – Same as architecture 1, but directed toward the fine tuned instance of GPT 3.5T.

Note: If sufficient NL / SQL samples are available, fine tuning could be used with RAG to include few shot examples in addition to the fine tuning. Users should experiment with the exact number of samples to include to get best results and ensure to test with unseen samples.



In order to evaluate an NL to SQL project you will need to evaluate the accuracy of the generated SQL query, the final response, and latency for each step. Below, we outline some of the key metrics to capture during the evaluation stage.


Manual Metrics:

The simplest way of evaluating the application is to have a human evaluator look at the final response and determine if it answer’s the user’s original question correctly. An example of this is End-to-End Accuracy.

  1. End-to-End Accuracy – does the final response correctly answer the user’s question?
    • How to implement: Output all question and answers to a spreadsheet, and have a human validator enter 1 or 0 in each row to indicate whether the answer was correct or incorrect. Correct Answers / Total Answers = Accuracy.

One of the limitations with this method is that it requires a human to review each response, which can be labor-intensive. Additionally, if the answer is marked incorrect, it may not be clear which part of the process failed. For this reason, we also introduce a set of automated metrics to evaluate each step of the process.


Automated Metrics:

Below we introduce four automated metrics, which can be executed immediately without human reviewers in a scalable fashion.


  1. Validity of SQL Query – Is the generated SQL query valid?
    • How to Implement: This can be scored using a python library like sqlvalidator, or by implementing logic into the code to log each instance in which the SQL query fails.
  2. SQL Critic Score – Does the SQL Query and result correctly translate the original question?
    • How to implement: Use the LLM to evaluate itself by sending a prompt with instructions to validate the accuracy of the SQL Query and include the original question, SQL query, and result in the request (see Appendix B Example 5)
  3. End-to-End Relevance – Is the generated NL response relevant to the user’s initial question?
    • How to implement: Send a prompt to the LLM with instructions to compare the NL response to the user’s initial question and evaluate its relevance (see Appendix B Example 6)
  4. End-to-End Groundedness – Does the generated NL response logically follow from the context provided to it (i.e. the SQL Query)?
    • How to implement: Send a prompt to the LLM with instructions to compare the NL response to the SQL query and results and determine if the response logically follows (see Appendix B Example 7)


Figure 4 – NL to SQL Evaluation Metrics


Safety and Security

LLMs introduce new risks to a web application. In this case, the LLM is generating SQL code that will be executed. This introduces a attack surface, and the appropriate boundaries should be introduced prior to releasing to public.


NL to SQL systems require various safety precautions to ensure accurate, secure, and unbiased results. Here are some key aspects to consider.



  • SQL Injection
  • Handling ambiguity and vagueness
  • Mitigating bias
  • Unintentionally writing to database



  • Granular permissions and proper authorization – refine access based on roles and groups.
  • Ensure access to known users.
  • Implement parameterized queries that separate data from the query itself.
  • Ensure read-only and only execute permissions.
  • Implement strict input validation and sanitization procedures with whitelisting, regular expressions, data type checks, escaping.
  • Implement logging and monitoring.
  • Explainability and transparency


Additional resources

Appendix A: Prompt Examples

Example 1: NL to SQL System Prompt

"You are an expert in providing facts from an SQL database. Given the user question, produce a PostgreSQL query which, when executed, provides the correct answer to the question. Only use the tables and schema given below to answer the questions. Learn query patterns from the similar question-answer pairs provided as examples below. Directly start your response with the executable query. Do not start with '''sql. Do not make up new table and column names. Only use the ones available in the schema below.


SQL Schema:






Example 2: SQL to NL System Prompt

"Given the original question, the corresponding PostgreSQL query, and the execution results, produce a truthful and accurate natural language response to the original question, that is only based on the execution results. Your answer should be detailed and accurate, for numerical responses make sure you infer the appropriate units. If the question is complex, you can provide a multi-sentence response. Make sure that you provide a response that is easy to understand."


Example 3: NL to SQL User Prompt

"Your response should be an executable query, do not start with '''sql Provide detailed reasoning behind the PostgreSQL query, please provide it as a comment using -- at the beginning of the line. Formatting example:

SELECT some_name, val FROM some_table WHERE a = number AND b = another_number -- This query selects some_name and val from some_table where a = number and b = another_number. Queries should return a minimal amount of rows, and should not return more rows than necessary.


User question: {query}


PostgreSQL query:"


Example 4: SQL to NL User Prompt

"Question: {question} SQL Query: {sql_query} SQL Query Results: {sql_query_results}"


Appendix B: Evaluation Prompt Examples

Example 5: SQL Critic Score Prompt

"Your job is to evaluate the nl-to-sql engine.

The app goes through several steps:

  1. The user question is sent to a Natural Language to SQL converter along with information about the database schema.
  2. The generated SQL query is executed and the output is logged.

Your job is to determine if the generated SQL correctly translates the original question.

User question: {question}

Generated PostgreSQL: {generated_query}

Executed PostgreSQL result: {query_execution_result}

Given the information above, give a numeric score of 0 to the Generated SQL if it doesn't correctly handle the User question, and give a numeric score of 1 if the Generated SQL query correctly handles the User question.

If the SQL query yields an error, give a numeric score of 0.

If the SQL query doesn't error out, but doesn't correctly handle the User question, give a numeric score of 0.

If the SQL execution results are empty, most likely it didn't handle the User's question. Think hard if you want to give it a 1 or a 0.

Score: "


Example 6: End to End Relevance

"You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a government stats  question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.


You will be presented with a CONTEXT and an ANSWER about that CONTEXT. You need to decide whether the ANSWER is entailed by the CONTEXT by choosing one of the following rating:


    10: The ANSWER follows logically from the information contained in the CONTEXT.

    0: The ANSWER is logically false from the information contained in the CONTEXT.

    an integer score between 0 and 10 and if such integer score does not exists, use 0: It is not possible to determine whether the ANSWER is true or false without further information.


    Read the passage of information thoroughly and select the correct answer from the three answer labels. Read the CONTEXT thoroughly to ensure you know what the CONTEXT entails.


    Note the ANSWER is generated by a computer system, it can contain certain symbols, which should not be a negative factor in the evaluation.


    Independent Examples:

    Example Task #1 Input:


    Query: {Sample SQL Query}

    Query results: {Sample SQL Query Result}


    "ANSWER": "{Valid Answer}”


    Example Task #1 Output:



    Example Task #2 Input:


    Query: "{SQL Query}”

    Query results: {SQL Query Results}


    "ANSWER": "{NL Response}"


    Example Task #2 Output:




Example 7: End to End groundedness

“Is the answer relevant and related to the question?

If the answer is yes, output 1.

Respond 0 otherwise.


Question: {question}

Answer: {answer}


Leave a Reply

Your email address will not be published. Required fields are marked *


This site uses Akismet to reduce spam. Learn how your comment data is processed.