This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.
In recent months, the world of natural language processing (NLP) has witnessed a paradigm shift with the advent of large-scale language models like GPT-4. These models have achieved remarkable performance across a wide variety of NLP tasks, thanks to their ability to capture and understand the intricacies of human language. However, to fully unlock the potential of these pre-trained models, it is essential to streamline the deployment and management of these models for real world applications.
In this blog post, we will explore the process of operationalizing large language models, including prompt engineering and tuning, fine-tuning, and deployment, as well as the benefits and challenges associated with this new paradigm.
How do LLMs work?
Large language models, like GPT-4, use deep learning techniques to train on massive text datasets, learning grammar, semantics, and context. They employ the Transformer architecture, which excels at understanding relationships within text, to predict the next word in a sentence. Once trained, these models can generate human-like text and perform various tasks based on the input provided. This is very different from classical ML models, which we train with specific statistical algorithms that deliver pre-defined outcomes.
Large language models outperform traditional machine learning models in terms of generating human-like responses due to their ability to learn from human feedback and the flexibility provided by prompt engineering.
Figure: Differences between ML Models and LLMs
What are the risks of LLMs in real-world applications?
LLMs are designed to generate text that appears coherent and contextually appropriate, rather than adhering to factual accuracy. This leads to various risks as highlighted below:
Bias amplification: LLMs could produce biased or discriminatory outputs.
Hallucination: LLMs may inadvertently generate incorrect, misleading, or false information.
Prompt Injection: Bad actors could exploit LLMs to produce harmful content using prompt injection.
Ethical concerns: The use of LLMs raises ethical questions about accountability and responsibility for the output generated by these models.
How to address the risks of LLMs?
In my opinion, there are two main ways to address the risks of LLMs and make them safe to use in real-world applications.
- Responsible AI Framework: Microsoft has created very detailed technical recommendations and resources to help customers design, develop, deploy, and use AI systems that implement the Azure OpenAI models responsibly. I will not delve further into this topic in this blog, so please visit this link to learn more: Overview of Responsible AI practices for Azure OpenAI models
- Leverage MLOps for Large Language Models, i.e., LLMOps: Over the years, MLOps has demonstrated its ability to enhance the development, deployment, and maintenance of ML models, leading to more agile and efficient machine learning systems. The MLOps approach enables the automation of repetitive tasks, such as model building, testing, deployment, and monitoring, thereby improving efficiency. It also promotes continuous integration and deployment, allowing for faster model iterations and smoother deployments in production. Because LLMs come pre-trained, we do not have to do the expensive training ourselves, but MLOps can still be leveraged to tune the LLMs and to operationalize and monitor them effectively in production. MLOps for Large Language Models is called LLMOps.
MLOps vs LLMOps:
Let us quickly refresh how MLOps works for classical Machine Learning models. Taking ML models from development to deployment to operations involves multiple teams and roles and a wide range of tasks. Below is the flow of a standard ML lifecycle:
Figure: Classical ML Lifecycle workflow
Data Preparation: Gather necessary data, clean and transform into a format suitable for machine learning algorithms.
Model Build and Training: Select suitable algorithms and feed the model preprocessed data, allowing it to learn patterns and make predictions. Improve the model's accuracy through iterative hyperparameter tuning and repeatable pipelines.
Model Deployment: Package the model and deploy it as a scalable container for making predictions. Expose the model as APIs to integrate with applications.
Model Management and Monitoring: Monitoring performance metrics, detecting data and model drifts, retraining the model, and communicating the model's performance to stakeholders.
Interestingly enough, the life cycle for LLMs is very similar to that of classical ML models as outlined above, but we do not have to go through expensive model training because the LLMs are already pre-trained. However, we still have to consider tuning the prompts (i.e., prompt engineering or prompt tuning) and, if necessary, fine-tuning the models for domain-specific grounding. Below is the flow of an LLM lifecycle:
Figure: LLM Lifecycle workflow
Using Azure Machine Learning for LLMOps:
Data Preparation: The first step in the process is to access the data for LLMs, similar to ML models. Azure ML provides seamless access to Azure Blob Storage, SQL databases, etc., which can be registered as Datastores. The data inside those Datastores, i.e., files, tables, etc., can be easily accessed using URIs. For example: azureml://datastores/<data_store_name>/paths/<folder1>/<file>.parquet
For more documentation and examples please refer to the documentation here: Data concepts in Azure Machine Learning
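As a quick illustration, the datastore URI pattern above can be assembled with a small helper. This function is my own sketch for clarity, not part of the Azure ML SDK:

```python
# Illustrative helper (not an Azure ML SDK function): builds the
# "azureml://" URI used to reference a file inside a registered Datastore.
def datastore_uri(datastore: str, *path_parts: str) -> str:
    """Return an azureml:// URI for a path inside a Datastore."""
    return f"azureml://datastores/{datastore}/paths/" + "/".join(path_parts)

# Example: a parquet file under sales/2023 in the workspace blob store.
uri = datastore_uri("workspace_blob_store", "sales", "2023", "orders.parquet")
print(uri)
# azureml://datastores/workspace_blob_store/paths/sales/2023/orders.parquet
```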
Model Build & Train: One main advantage of LLMs is that we do not have to go through the expensive training process, because models like GPT, Llama, and Falcon are already available. However, we still have to consider tuning the prompts (i.e., prompt engineering or prompt tuning) and, if necessary, fine-tuning the models for domain-specific grounding.
Foundational Model Catalog: The Model Catalog is a hub for discovering Foundation Models such as the Azure OpenAI models, Llama 2, Falcon, and many models from Hugging Face. These models are curated and thoroughly tested by Azure Machine Learning so they can be easily deployed and integrated with applications.
Figure: LLM Foundational Model Catalog in Azure ML
The foundational models can be easily deployed using the UI, Notebooks or CI/CD pipelines from Azure DevOps or GitHub.
Figure: Deploying Llama model as real-time endpoint
Please refer to this link for more detailed documentation:
GitHub Repo with example notebooks:
As highlighted above, developing effective prompts is crucial to making LLMs safer and less risky. Azure Machine Learning Prompt Flow provides a comprehensive solution that simplifies prototyping, experimenting with, and tuning prompts. Below are some important features:
- Create executable flows that link LLMs, prompts, and Python tools.
- Debug, share, and iterate your flows with ease through team collaboration.
- Create prompt variants and evaluate their performance through large-scale testing.
- Deploy the prompt flow as a real-time endpoint to integrate into the workflow.
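The variant-evaluation idea behind these features can be sketched in plain Python. Here `call_llm` is a stub standing in for a deployed model endpoint, and the prompt templates and test case are invented for illustration:

```python
# Sketch of large-scale prompt-variant evaluation, the idea behind
# Prompt Flow's variant feature. `call_llm` is a stand-in for a real
# model endpoint; it is stubbed so the example is self-contained.
def call_llm(prompt: str) -> str:
    # Stub: a real implementation would call a deployed model endpoint.
    return "positive" if "sentiment" in prompt.lower() else "unknown"

variants = {
    "v1": "Classify the sentiment of this review: {review}",
    "v2": "Is the following review good or bad? {review}",
}

test_cases = [{"review": "Great product!", "expected": "positive"}]

def score_variant(template: str) -> float:
    # Fraction of test cases where the model's answer matches the label.
    hits = sum(
        call_llm(template.format(review=case["review"])) == case["expected"]
        for case in test_cases
    )
    return hits / len(test_cases)

scores = {name: score_variant(t) for name, t in variants.items()}
best = max(scores, key=scores.get)
print(best, scores)
```

In practice the test set would be much larger and the scoring function richer (e.g., groundedness or relevance metrics), but the loop structure is the same.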
Figure: The Prompt Flow designer UI with integrated notebook feature
Below is a visual flow of a Prompt Flow with its connected building blocks:
Figure: A visual flow with building blocks of a Prompt Flow
Once the Prompt Flow is developed, it can be easily deployed as an endpoint for integration into the workflow.
Figure: A Prompt Flow endpoint
Please refer to this link for more detailed documentation on Prompt Flows:
Retrieval Augmented Generation (RAG): Another way of reducing the risks of LLMs is grounding them in domain-specific data so the LLMs draw on that data when generating responses. This is called Retrieval Augmented Generation (RAG). The RAG process works by chunking large data into manageable pieces, then creating vector embeddings that make it easy to understand the relationships between those pieces.
Figure: Retrieval Augmented Generation (RAG) process flow
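A minimal sketch of the chunk-embed-retrieve loop described above, using a toy bag-of-words "embedding" in place of a real embedding model; in a real RAG pipeline the vectors would come from an embedding model and be stored in a vector database:

```python
import math
from collections import Counter

# Minimal RAG retrieval sketch: chunk a document, "embed" each chunk
# (a simple bag-of-words vector stands in for a real embedding model),
# and retrieve the chunk closest to the query by cosine similarity.
def chunk(text: str, size: int = 8) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

doc = ("Refunds are processed within five business days. "
       "Shipping is free for orders above fifty dollars.")
index = [(c, embed(c)) for c in chunk(doc)]  # the "vector store"

query = embed("how long do refunds take")
best_chunk = max(index, key=lambda item: cosine(query, item[1]))[0]
print(best_chunk)  # the retrieved chunk is then passed to the LLM as context
```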
Creating a RAG pipeline is easy with Prompt Flow: connect components such as extracting data from Datastores, creating vector embeddings, and storing vectors in a vector database.
Figure: Q&A Generation with the RAG pipeline
Please refer to the documentation below on RAG capabilities in Azure AML: Use Azure Machine Learning pipelines with no code to construct RAG pipelines (preview)
LLM Fine-Tuning: Fine-tuning for large language models is a process where a pre-trained model is adapted to generate answers specific to a particular domain. Fine-tuning allows the model to grasp the nuances and context relevant to that domain, thus improving its performance. The following are the steps involved in fine-tuning:
- Select a relevant dataset: Choose a dataset that represents the specific domain or task you want the model to excel in, ensuring it has adequate quality and size for effective fine-tuning.
- Adjust training parameters: Modify parameters like learning rate, batch size, and the number of training epochs to optimize the fine-tuning process and prevent overfitting.
- Evaluate and iterate: Regularly assess the fine-tuned model's performance using validation data and make necessary adjustments to improve its accuracy and effectiveness in the target domain.
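The dataset-selection step above can be sketched as preparing examples in the chat-style JSONL format used for fine-tuning OpenAI-family chat models (one JSON object per line, each holding a short conversation). The domain examples here are invented:

```python
import json

# Sketch: convert (question, answer) pairs into chat-style JSONL
# training records. The support-desk examples are made up.
examples = [
    ("What is the return window?", "Returns are accepted within 30 days."),
    ("Do you ship overseas?", "Yes, we ship to over 40 countries."),
]

def to_jsonl(pairs) -> str:
    lines = []
    for question, answer in pairs:
        record = {"messages": [
            {"role": "system", "content": "You are a support assistant."},
            {"role": "user", "content": question},
            {"role": "assistant", "content": answer},
        ]}
        lines.append(json.dumps(record))
    return "\n".join(lines)

jsonl = to_jsonl(examples)
print(jsonl.splitlines()[0])  # first training record
```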
Please refer to this GitHub Repo for more details on Fine tuning: advanced-gen-ai/Instructions/04-finetune-model.md at main (github.com)
Model Deployment: The next phase of LLMOps is the deployment of the model as an endpoint to integrate with applications for production use. Azure ML offers highly scalable compute, such as CPUs and GPUs, for deploying the models as containers and supporting inferencing at scale:
- Real-time Inference: It supports real-time inferencing through low-latency endpoints, enabling faster decision-making in applications.
- Batch Inference: Azure ML also supports batch inferencing for processing large datasets asynchronously, without the need for real-time responses.
Figure: Endpoints in Azure Machine Learning
Please refer to this GitHub repo that has very detailed information on MLOps using the latest Azure ML SDK V2 and quick tutorials for using MLOps with Azure DevOps or GitHub:
(Please note that the Generative AI and LLM capabilities in Azure ML are still fairly new, so there will be improvements made on a regular basis. The above repo is based on MLOps for ML, but the same process can be leveraged for LLMs with some refactoring.)
Note: The information and screenshots given below are specific to Model Monitoring for classical ML models, but the same process and tools can be leveraged for LLM models.
Model Management and Monitoring: Finally, once the LLMs are deployed as endpoints and integrated into applications, it is very important to monitor these models to make sure they are performing as intended and continue to generate value for users. Azure ML provides comprehensive model monitoring capabilities, including monitoring data for drift, model performance, and infrastructure performance.
Data Drift: Data drift occurs when the distribution of input data used for predictions changes over time. This can lead to a decrease in model performance as the model is trained on historical data but used to make predictions on new data. Azure Machine Learning's data drift detection feature allows you to monitor the input data for changes in distribution. This helps you identify when to update your model and ensure that it remains accurate as the data landscape changes.
Figure: A sample Data Drift from Azure ML
More detailed step by step instructions can be found here on monitoring Datastores for data drift: Detect data drift on datasets (preview) - Azure Machine Learning
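To make the idea concrete, here is a minimal sketch of drift detection using the Population Stability Index (PSI), one common drift statistic. The thresholds (0.1 / 0.25) are widely used rules of thumb, not Azure ML defaults, and this is not necessarily how Azure ML computes drift internally:

```python
import math

# Sketch of data drift detection with the Population Stability Index.
def psi(expected: list[float], actual: list[float]) -> float:
    """PSI between two distributions expressed as bin proportions."""
    eps = 1e-6  # avoid log(0) on empty bins
    return sum(
        (a - e) * math.log((a + eps) / (e + eps))
        for e, a in zip(expected, actual)
    )

baseline = [0.25, 0.25, 0.25, 0.25]    # training-time bin proportions
production = [0.10, 0.20, 0.30, 0.40]  # recent production proportions

score = psi(baseline, production)
if score > 0.25:
    print(f"Significant drift (PSI={score:.3f}) - consider retraining")
elif score > 0.1:
    print(f"Moderate drift (PSI={score:.3f}) - keep monitoring")
else:
    print(f"No meaningful drift (PSI={score:.3f})")
```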
Model Metrics: Model monitoring is a comprehensive feature that enables you to track the performance of your deployed models, including accuracy, latency, and other metrics. With Azure Machine Learning, you can set up alerts and notifications to inform you when there are changes in your model's performance or when certain thresholds are crossed. This helps you to maintain high-quality models and proactively address any issues that may arise.
Figure: Setting up Model monitoring signals in Azure ML
For more detailed documentation on how to collect the data with Model Monitoring, please refer to this link: Collect production data from models deployed for real-time inferencing (preview)
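The alerting idea can be sketched as a simple threshold check; the metric names and thresholds below are illustrative, not Azure ML settings:

```python
# Sketch: compare collected metrics against configured thresholds and
# report which signals need attention. Values are illustrative only.
thresholds = {"accuracy_min": 0.90, "p95_latency_ms_max": 500}

def check_metrics(metrics: dict) -> list[str]:
    alerts = []
    if metrics["accuracy"] < thresholds["accuracy_min"]:
        alerts.append(f"accuracy dropped to {metrics['accuracy']:.2f}")
    if metrics["p95_latency_ms"] > thresholds["p95_latency_ms_max"]:
        alerts.append(f"p95 latency rose to {metrics['p95_latency_ms']} ms")
    return alerts

alerts = check_metrics({"accuracy": 0.87, "p95_latency_ms": 620})
for a in alerts:
    print("ALERT:", a)  # in production this would trigger a notification
```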
Model and Infrastructure Monitoring: With monitoring of the model and infrastructure, we track model performance in production from both model and operational perspectives. Azure Machine Learning supports logging and tracking experiments using MLflow Tracking. We can log models, metrics, parameters, and other artifacts with MLflow. This log information is captured in Azure App Insights, which can then be accessed using Log Analytics inside Azure Monitor. Since LLMs come pre-trained, we may not get deep into the model inferencing logs, but we can effectively track LLM hyperparameters, execution times, prompts, and responses.
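As a minimal sketch of capturing the per-request signals mentioned above (prompt, response, execution time), the snippet below records them in a local list; in Azure ML the same values would typically be logged via MLflow Tracking instead of kept in memory:

```python
import statistics
import time

# Sketch: wrap each LLM call to capture prompt, response, and latency.
# `prompt.upper()` stands in for a real model call.
log: list[dict] = []

def tracked_call(prompt: str) -> str:
    start = time.perf_counter()
    response = prompt.upper()  # stand-in for a real LLM endpoint call
    elapsed_ms = (time.perf_counter() - start) * 1000
    log.append({"prompt": prompt, "response": response,
                "latency_ms": elapsed_ms})
    return response

tracked_call("summarize this document")
tracked_call("translate to French")

median_ms = statistics.median(r["latency_ms"] for r in log)
print(f"{len(log)} calls, median latency {median_ms:.2f} ms")
```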
Figure: Monitoring endpoint for traffic inside Azure ML Studio
Figure: Monitoring endpoint for traffic and metrics inside Azure Portal.
For more detailed information on logging metrics and monitoring the endpoints in Azure ML, please refer to the documentation below:
In conclusion, LLMOps plays a crucial role in streamlining the deployment and management of large language models for real-world applications. Azure Machine Learning offers a comprehensive platform for implementing LLMOps, addressing the risks and challenges associated with LLMs.
Generative AI is a rapidly growing domain, and new capabilities are being added to Azure on a regular basis. Consequently, it is vital to stay informed about the latest updates in Azure Machine Learning and LLMOps by monitoring Microsoft's current documentation, tutorials, and examples. This ensures that you utilize the most cutting-edge tools and strategies for effectively deploying, managing, and monitoring your large language models.