Is My LLM Chatbot Ready for Production?

This post has been republished via RSS; it originally appeared at: Microsoft Tech Community - Latest Blogs - .

A 10,000-foot overview of LLMOps and Generative Evaluation

 

The Generative AI space is often referred to as the ‘Wild Wild West’ – largely in part due to the rapid attainment of widespread public use, but also due to the paradigm shift from deterministic to probabilistic models and output compared to “classical” AI or ML.

 

This post seeks to demystify Generative AI development operations (LLMOps) by providing an overview of how to instill confidence putting Generative AI models into Production and tame the West! (No Wyatt Earp required)

 

Why is LLMOps important?

 

There are already countless examples of mishaps with LLM Chatbot implementations: Canada Airlines was ordered to pay a customer misled by a bot, a judge fined lawyers $5,000 for fake case law created by an LLM, or my personal favorite…

 

zsoenen_0-1712679698259.png

(source: x.com)

 

Wrangling and controlling these language models will become increasingly more important as Gen AI adoption continues to increase. The primary goal of a good LLM Operations system should be to provide confidence to product owners, stakeholders, and development team members that the project is ready for release to Production. For most, this means some confirmation that the project is safe and accurate.

 

Safety, while self-explanatory, covers a broad range of potential risks that may vary for each implementation. These risks range from avoiding harmful language, to preventing copyright infringement, to denying access to proprietary data, and plenty more.

 

Accuracy, on the other hand, is much harder to calculate in this new probabilistic world. Evaluation methods range from using a foundational model as an LLM evaluator to good, old-fashioned human grading. Ultimately, most projects end up using a combination of LLM, mathematical, and manual based approaches. Check back for additional blogs in this series that will discuss evaluation and testing practices in more detail!

 

How do I do LLMOps?

 

The implementation of LLMOps can be broken down into three categories:

  1. Evaluation (aka Testing) is done before release to end users
  2. Monitoring is done in (near) real-time with an app deployment
  3. Feedback is analyzed after user interaction

Evaluation framework design and implementation represents most of the up-front workload, but it is critical for rapid development cycles, scalability, and instilling confidence in a production release. A well-designed evaluation framework is repeatable. This allows for quick A/B testing of new prompting techniques, LLMs, or other features. Using a combination of LLM-based, math-based, and/or manual evaluation (including red-teaming) ensures a consistent scale. Ultimately, once the evaluation framework is proven, its inclusion in a Continuous Integration / Continuous Deployment (CI/CD) pipeline provides the ultimate solution.

 

A strong CI/CD pipeline remains critical to the application development and deployment process for Generative applications. And while a classical MLOps pattern of continuous re-training is not needed for foundational models or RAG implementations, it is relevant for cases of Fine-Tuned models where the input data distribution can change over time.

 

As with traditional applications or machine learning models, monitoring is important to capture performance. However, with Generative AI, monitoring is now important to ensure safety as well. Tools such as Azure AI Content Safety are helpful to screen both directions of the human computer interaction. Model outputs are screened for potentially harmful language along with automated detection of adversarial events (i.e. prompt injection attacks) from user inputs. LLMs must be moderated, but moderating users is important as well!

 

Feedback is often forgotten, but it’s the most important component of the LLMOps lifecycle. Because of the probabilistic nature of LLMs, not only do we care about the model’s knowledge but also how the model is contextualizing and delivering that knowledge to users. The best way to understand a model’s semantic knowledge is by leveraging user feedback to drive the development roadmap and prioritize enhancements or bug fixes. Direct feedback - such as survey responses or a thumbs-up/thumbs-down button - is usually the most impactful, however it can be the most difficult to obtain. Indirect feedback, like tracking user session time, daily active users, or the frequency of certain prompts is a great way to supplement direct feedback.

 

As more and more businesses leverage in-house developed LLM chatbots, the attention on LLMOps will rise exponentially. Beginning with the end in mind is critical for your development team to rise to the top of the pack!

 

 

Check back for additional articles in this blog series for detailed posts regarding state-of-the-art approaches for Evaluation, Monitoring, and Feedback of LLM Chatbots.

Leave a Reply

Your email address will not be published. Required fields are marked *

*

This site uses Akismet to reduce spam. Learn how your comment data is processed.