How do I Make my LLM Chatbot Better?

This post has been republished via RSS; it originally appeared at: New blog articles in Microsoft Community Hub.

Earlier posts in this series:

Part 1: Is my Chatbot Ready for Production? – A 10,000 foot overview to LLMOps

Part 2: How do I Evaluate my LLM Chatbot? - A guide to different LLM chatbot evaluation techniques and how to implement them

Part 3: How do I Monitor my LLM Chatbot? -  A guide to monitoring LLM chatbot security and performance


Feedback is a critical, often overlooked, component of the LLMOps lifecycle. User feedback is a great way to gauge application success and guide new features or enhancements on the roadmap. Although feedback can be captured in many ways, the ultimate goal is application improvement.


In the age of LLMs, feedback is more important than ever. The probabilistic nature of LLMs is a shift away from many traditional ML models, which return the same output for the same input. Although there are robust evaluation methods, there is still no substitute for human analysis. Once an application is live, you can “crowdsource” this human analysis by soliciting feedback, both positive and negative, from users.


Types of feedback can be classified into two categories:

  1. Direct Feedback: Soliciting feedback directly from users
  2. Indirect Feedback: Analysis of user behavior and interactions

Both are valuable and provide different flavors of insight into your application. Together, they provide a complete picture of application success.


Direct Feedback


Direct user feedback is the most straightforward method, but it has some limitations and biases. Ways to collect it include surveys, reviews, in-application prompts or pop-ups, and support tickets.



Sample Direct Feedback High Level Architecture 


With all these methods of direct feedback intake, we are left with a lot of messy and unstructured text data... a perfect problem for an LLM to solve. After aggregating all user feedback, you can use an LLM to summarize and categorize feedback so that the team quickly knows which application area(s) to focus on. 
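As a rough illustration of that categorization step, here is a minimal Python sketch. It is not tied to any particular LLM provider: `call_llm` is a hypothetical function you would supply (it takes a prompt string and returns the model's text reply), and the category names are placeholders, not a recommended taxonomy.

```python
import json
from collections import Counter
from typing import Callable

# Hypothetical category names; replace with areas that match your application.
CATEGORIES = ["accuracy", "latency", "ui", "other"]


def build_prompt(feedback: str) -> str:
    """Ask the model for a single JSON object so the reply is easy to parse."""
    return (
        "Classify the user feedback below into exactly one of these categories: "
        + ", ".join(CATEGORIES)
        + '. Reply with JSON like {"category": "...", "summary": "..."}.\n\n'
        + "Feedback: " + feedback
    )


def categorize(feedback_items: list[str], call_llm: Callable[[str], str]) -> Counter:
    """Count feedback per category; unparseable or unknown replies count as 'other'."""
    counts = Counter()
    for item in feedback_items:
        reply = call_llm(build_prompt(item))
        try:
            category = json.loads(reply).get("category", "other")
        except (json.JSONDecodeError, AttributeError):
            # The model did not return a JSON object; do not guess its intent.
            category = "other"
        counts[category if category in CATEGORIES else "other"] += 1
    return counts
```

Passing the LLM call in as a function keeps the sketch independent of any SDK and makes it easy to test with a stub. The resulting counts give the team a quick view of which area attracts the most feedback.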


Keep in mind that direct feedback comes with challenges such as selection bias and small sample sizes. The users most likely to answer a survey or write a review are often those who are not satisfied, which skews feedback in a negative direction. Satisfied users are less likely to respond at all, which limits sample size.


Indirect Feedback


Indirect feedback involves analyzing user behavior and application interaction to infer how users feel about the application. This flavor is less straightforward than direct feedback and takes more investment to implement, but because it covers all users rather than only those who choose to respond, it is less susceptible to the selection bias and skew described above.


Remember our data collection process from our monitoring system in part 3? We can use the collected production data and the same evaluation framework from part 2 to assess how well our app is actually performing in production. If scores in production differ from those seen during testing, the test dataset should be adjusted to better match actual user behavior.
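One simple way to spot such a gap is to compare mean scores per metric. The sketch below assumes you already have lists of numeric evaluation scores for the test set and for production traffic; the metric names and the `tolerance` value are illustrative assumptions, not values from the earlier posts.

```python
from statistics import mean


def score_gaps(test_scores, prod_scores, tolerance=0.1):
    """Return metrics whose mean production score differs from the
    mean test score by more than `tolerance` (production minus test)."""
    gaps = {}
    for metric, test_values in test_scores.items():
        if metric not in prod_scores:
            continue  # metric was not evaluated in production
        gap = mean(prod_scores[metric]) - mean(test_values)
        if abs(gap) > tolerance:
            gaps[metric] = gap
    return gaps


# Hypothetical scores on a 1-5 scale.
test_scores = {"groundedness": [4.0, 5.0], "relevance": [4.0, 4.0]}
prod_scores = {"groundedness": [3.0, 3.0], "relevance": [4.0, 4.0]}
print(score_gaps(test_scores, prod_scores, tolerance=0.5))
```

A flagged metric is a prompt to review which production inputs the test dataset is missing, not proof that the application has regressed.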


A key metric for inferring response effectiveness from user behavior is the edit distance between a user's consecutive prompts, which can indicate how much users are tweaking their inputs to get the desired responses. The smaller the edit distance between consecutive prompts, the more likely it is that the user was dissatisfied with the response and is repeating the same or a similar question.



Example of Edit Distance. Source: Wikipedia
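To make this concrete, here is a small Python sketch that computes the Levenshtein edit distance and flags prompts that closely resemble the one before them. Normalizing by the longer prompt's length and the `threshold` of 0.3 are assumptions made for illustration; you would tune both against your own data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def likely_retries(prompts: list[str], threshold: float = 0.3) -> list[int]:
    """Indexes of prompts that closely resemble the prompt immediately before them."""
    flagged = []
    for i in range(1, len(prompts)):
        # Normalize so long and short prompts are judged on the same scale.
        longest = max(len(prompts[i - 1]), len(prompts[i]), 1)
        if edit_distance(prompts[i - 1], prompts[i]) / longest <= threshold:
            flagged.append(i)
    return flagged


session = [
    "How do I reset my password?",
    "How do I reset my passwords?",
    "Thanks!",
]
print(likely_retries(session))  # [1]
```

Edit distance only captures surface similarity, so a rephrased question with different wording would not be flagged; treat the result as one signal among several.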


Other metrics to consider include tracking the average time between a user's prompt and the LLM's response, as well as the average time users take to write their prompts, which are valuable details on how the application is being used. User retention metrics, such as daily or weekly active usage, are also crucial indicators of long-term user engagement and overall application success.
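Two of these metrics can be sketched in a few lines of Python. The event format below (dictionaries with `user_id`, `prompt_at`, and `response_at` keys) is a hypothetical log schema used only for this example; adapt the field names to whatever your monitoring system records.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean


def average_latency_seconds(events):
    """Mean time between a prompt being sent and its response arriving."""
    return mean(
        (e["response_at"] - e["prompt_at"]).total_seconds() for e in events
    )


def daily_active_users(events):
    """Map each calendar date to the number of distinct users who sent a prompt."""
    users_by_day = defaultdict(set)
    for e in events:
        users_by_day[e["prompt_at"].date()].add(e["user_id"])
    return {day: len(users) for day, users in sorted(users_by_day.items())}


# Hypothetical log records.
events = [
    {"user_id": "a", "prompt_at": datetime(2024, 1, 1, 9, 0, 0),
     "response_at": datetime(2024, 1, 1, 9, 0, 2)},
    {"user_id": "b", "prompt_at": datetime(2024, 1, 1, 10, 0, 0),
     "response_at": datetime(2024, 1, 1, 10, 0, 4)},
    {"user_id": "a", "prompt_at": datetime(2024, 1, 2, 9, 0, 0),
     "response_at": datetime(2024, 1, 2, 9, 0, 3)},
]
print(average_latency_seconds(events))
print(daily_active_users(events))
```

Weekly active usage follows the same pattern, grouping by ISO week instead of calendar date.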


Put it All Together


Let's look at the whole picture: a complete monitoring system collects data on user interactions and performance, and a feedback system collects what users say directly along with signals inferred from their behavior. These datasets are analyzed and used to drive changes that improve the system. Finally, the changes are tested and validated before being deployed to production, and the cycle repeats.


This sample repository is a good reference to get started on implementation. 


Both direct and indirect feedback provide unique and complementary insights that are essential for comprehensive understanding and app improvement. Direct feedback offers immediate user opinions, while indirect feedback reveals user behavior patterns. Together, they enable developers to address issues, optimize performance, and ultimately drive application success.

