September 19, 2024 • 6 min read

How LLM Monitoring builds the future of GenAI?

Written by Rayann Bentounes

Discover how Langfuse offers secure, open-source monitoring for LLM and GenAI solutions.

Large Language Models (LLMs) are popping up everywhere and are more accessible than ever. We've all played around with ChatGPT and been (more or less) wowed by its skills. But behind this Generative AI icon (if you are not sure what GenAI is, here is a great article about Generative AI in the service of ecology to refresh your memory), a whole ecosystem is evolving. New, complex tools are being developed, making it easier and faster for everyone to test, experiment, and innovate.

But let's be real: keeping track of every token generated by these LLMs as they grow more sophisticated isn't easy. With each added layer, we lose a bit of clarity on what's really being generated. Errors become harder to understand, details of what is happening under the hood go missing, and you cannot solve problems effectively.

On top of that, traditional performance metrics fall short when it comes to LLMs. Their use has expanded way beyond simple "chatbots" into areas like analysis and automated applications where results and errors are harder to interpret.

In other AI and software engineering fields, tools have been developed to make experimentation easier and faster. But in the GenAI world, we're still catching up. That's why we're excited to introduce our take on the best practices for developing LLM tools.

Why Monitoring LLMs is Crucial (and a bit of a Headache)

GenAI applications are tricky to evaluate. Defining a "good response" isn't as straightforward as it is with a classification or regression algorithm. The usual metrics don't cut it when trying to measure the performance of a prompt or a Retrieval-Augmented Generation (RAG) system.

Take metrics like ROUGE and BLEU, for example. They are great NLP tools used for evaluating translations and summaries, but they just don't capture the full picture with LLMs. They miss out on things like contextual relevance, creativity, and fluency.

Without a clear view of performance, developing a GenAI solution can be tough. And let's face it: most of us rely on API calls to services like OpenAI, Claude, or Mistral. Breaking down these calls can get pretty tedious, especially when responses are generated more creatively than in a simple Q&A format.

For example, let's say you're using RAG to answer questions on a document base (like our #TechRadar, which is full of insightful information about state-of-the-art Data Science!).

You might choose the following architecture (a minimal code sketch follows the figure below), which will:
- set up a knowledge base by embedding all the desired context in a Vector Database
- use a conversation-based retrieval system that reformulates users' queries, making them more suitable for searching the knowledge base while still allowing you to rephrase and retry your questions.

Architecture of a Retrieval-Augmented Generation (RAG) agent: a powerful GenAI tool based on LLMs, with many layers to monitor
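To make this concrete, here is a minimal sketch of such a pipeline, assuming the langchain-openai and FAISS integrations are installed and an OpenAI API key is configured; the documents, model name, and prompts are purely illustrative:

```python
# A minimal RAG sketch: embed a knowledge base, reformulate the query,
# retrieve relevant chunks, then generate an answer from them.
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS

# 1. Knowledge base: embed the desired context into a vector database.
docs = ["Tech Radar entry on MLOps...", "Tech Radar entry on GenAI..."]  # placeholder corpus
vector_store = FAISS.from_texts(docs, OpenAIEmbeddings())

llm = ChatOpenAI(model="gpt-4o-mini")

def answer(question: str, chat_history: list[str]) -> str:
    # 2. Reformulate the user's query with the conversation history
    #    so it stands alone and retrieves better.
    standalone = llm.invoke(
        f"Given this conversation history: {chat_history}\n"
        f"Rewrite the question so it is self-contained: {question}"
    ).content

    # 3. Retrieve the most relevant chunks from the knowledge base.
    chunks = vector_store.similarity_search(standalone, k=3)
    context = "\n".join(c.page_content for c in chunks)

    # 4. Generate the final answer from the retrieved context.
    return llm.invoke(
        f"Answer using only this context:\n{context}\n\nQuestion: {standalone}"
    ).content
```

Each of these steps (reformulation, retrieval, generation) is a separate LLM or vector-store call, and each is something you will want to monitor individually.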

Once you have set it all up, you'll be able to ask all kinds of specific questions on the MLOps stack, GenAI, or data modeling. Some responses will fall short, provide irrelevant information, or overlook key technical details, and you won't fully grasp the extent of the errors in your solution. You can't aggregate user interactions or compare different experiences effectively.

So, how do we keep track of progress, compare results, and share findings with our teams? Debugging an LLM can be a real pain, especially when using advanced tools like RAGs, Langchain, or Instructor that add extra layers of abstraction.

This challenge often leads us to create dashboards to try to make sense of things. But let's be honest, we're still far from the polished standards seen in other Data Science areas.

Enter Langfuse

That's where monitoring tools like Langfuse come in!

Quick Disclaimer: There are plenty of great monitoring tools out there, like Langsmith from the Langchain team, or up-and-coming ones like PareaAI and PhosphoAI. But we're focusing on this specific one because it's open-source and can be used on any project, no matter how sensitive the data, since you retain control over everything.

Langfuse is designed to manage LLM-based applications by keeping an eye on their health.

Langfuse will enable you to monitor valuable key metrics such as response times, error rates, and usage patterns. It provides detailed logs and visualizations, helping you understand and improve your GenAI application's performance. With alerting capabilities, it notifies you of issues in real time, allowing for quick corrective actions.

Langfuse front page: insights to help monitor your LLM

You will be able to get precise information about what might be wrong within your GenAI pipelines. Which step is slowing it down: chunk retrieval or answer generation? Does a wrong answer come from poor context retrieval, hallucination, or a misleading prompt? Is my application financially viable, or are the generation costs too high?

These are precious pieces of information that usually take time to gather, and they shed light on the inner workings of your application.

Now, let’s dive deeper into what makes Langfuse stand out:

Tracing: The Foundation

Tracing is the basic building block of a monitoring tool. It logs detailed events or operations during a specific task, like an LLM call, and you can choose the level of detail. Nested traces become crucial when using abstraction tools that alter the prompt, like chains, agents, or RAGs.

Plus, you get useful info like inference time, token count, and generation cost. Even better, you can add project-specific metadata: with the right frameworks, you can easily store user feedback or set up custom alerts!

This foundational block is a powerful, versatile tool that can adapt to any situation. Thanks to its detailed SDK, if you want more specific insights, you can also build them yourself and tailor Langfuse directly to your use cases. For a software developer, it truly is a game changer.
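To give an idea, here is a minimal sketch of decorator-based tracing with the Langfuse Python SDK (v2); the helper functions and metadata are hypothetical, and the LANGFUSE_* environment variables are assumed to be set:

```python
# Nested tracing sketch: each decorated function becomes an observation
# inside the parent trace, with timings and (for generations) token usage.
from langfuse.decorators import observe, langfuse_context

@observe()  # logged as a span nested under the calling trace
def retrieve_chunks(question: str) -> list[str]:
    return ["chunk about MLOps", "chunk about GenAI"]  # placeholder retrieval

@observe(as_type="generation")  # logged as an LLM generation
def generate_answer(question: str, chunks: list[str]) -> str:
    return "..."  # placeholder LLM call

@observe()  # top-level trace for the whole RAG request
def rag_answer(question: str, user_id: str) -> str:
    # Attach project-specific metadata and the user to the current trace.
    langfuse_context.update_current_trace(
        user_id=user_id, metadata={"app": "tech-radar-rag"}
    )
    chunks = retrieve_chunks(question)
    return generate_answer(question, chunks)
```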

Langfuse trace example: many insightful details on a single LLM generation

Metrics

You'll find plenty of metrics to monitor your LLM, such as cost per model or user, token volume ingested or generated, latency, and custom metadata. These metrics help you understand how your solution is being used and spot anomalies that might otherwise go unnoticed!

Datasets

Created from traces in the UI, manually in the dedicated tab, or automatically via scripts from your project, datasets let you test and observe the LLM's behavior on a set of test prompts. You can check for recurring errors or response types.
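As an illustration, here is a sketch of building a small dataset from a script with the Langfuse Python SDK; the dataset name and items are made up for the example:

```python
# Create a dataset and add test items programmatically.
from langfuse import Langfuse

langfuse = Langfuse()  # reads keys and host from LANGFUSE_* environment variables

langfuse.create_dataset(name="tech-radar-questions")
langfuse.create_dataset_item(
    dataset_name="tech-radar-questions",
    input={"question": "What does the Tech Radar say about DVC?"},
    expected_output="A short summary of the DVC entry.",
)

# Later, run your application on each item and link the resulting trace
# to a named run so experiments can be compared in the UI.
dataset = langfuse.get_dataset("tech-radar-questions")
for item in dataset.items:
    pass  # call your pipeline here, then e.g. item.link(trace, run_name="baseline")
```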

It's a handy tool for behavioral testing, though it still lacks the ease of unit tests that would actively notify users of detected behaviors. That said, we recommend using DVC to track and visualize experiment results, as mentioned in the Tech Radar!

Self-Hosting: The Big Advantage

Langfuse shines with its open-source nature! Unlike many other monitoring tools that only offer managed solutions, which can be a security concern when handling sensitive data, Langfuse lets you deploy and host it yourself. Whether you want to work locally or on a secure cloud, it's easy and cost-effective. And if security isn't a big concern, you can even use their free managed service up to a certain volume, making it a breeze to iterate quickly.
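On the application side, pointing the SDK at a self-hosted instance is just a matter of configuration; a minimal sketch with a hypothetical internal URL:

```python
# Send traces to your own deployment instead of the managed cloud.
from langfuse import Langfuse

langfuse = Langfuse(
    host="https://langfuse.internal.example.com",  # your self-hosted instance (illustrative URL)
    public_key="pk-lf-...",
    secret_key="sk-lf-...",
)
```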

Areas for Improvement

The Playground

On other platforms like Langsmith or PareaAI, this is the feature that offers the most development convenience. It allows you to test an LLM application without touching code, opening up R&D to non-tech team members. They can iterate on prompts, analyze responses, and understand the nuances of "prompt engineering."

Currently, access to this feature requires using the managed service or subscribing to the enterprise plan. However, with some effort, it is possible to create your own playground using Streamlit, providing a flexible alternative for those willing to invest the time.
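As a rough idea, a bare-bones homemade playground fits in a few lines of Streamlit; this sketch assumes the OpenAI Python client, with an illustrative model name:

```python
# Minimal prompt playground: edit prompts in the browser, run, read the answer.
import streamlit as st
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

st.title("Mini prompt playground")
system_prompt = st.text_area("System prompt", "You are a helpful assistant.")
user_prompt = st.text_area("User prompt")

if st.button("Run") and user_prompt:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_prompt},
        ],
    )
    st.write(response.choices[0].message.content)
```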

Evaluation

Unit testing in GenAI is still in its early days. There are generally three ways to test an LLM's response quality:

  1. Human Verification: Someone manually checks the quality. It's precise but time-consuming, costly, and not scalable.
  2. Algorithmic Verification: Test functions check responses. It's quick and cheap but often lacks precision, especially for semantic evaluation.
  3. LLM as a Judge (LLMaJ): Another LLM checks the response quality and gives a score. It's quick, potentially costly, and automatable, but the quality might be hit or miss.

No platform offers highly satisfactory methods for evaluating LLMs, which is a big hurdle in developing reliable solutions. Without proper evaluation, it's tough to consistently monitor your solution's behavior.

Tools like Giskard or Promptfoo are emerging with potential solutions, but they're not quite mature enough for serious project use.

Once more, Langfuse lets you manually score and edit traces or monitor a dedicated LLMaJ pipeline, which already goes a long way toward getting started with LLM evaluation. Still, it remains far from the testing standards we are used to, but is it really the purpose of a monitoring tool to offer an evaluation feature?
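If you do go the LLMaJ route, wiring the judge's verdict back into Langfuse as a score can look like the sketch below, assuming the Langfuse Python SDK and the OpenAI client; the judge prompt and scoring scale are illustrative:

```python
# LLM-as-a-Judge sketch: rate an answer, then attach the score to its trace.
from langfuse import Langfuse
from openai import OpenAI

langfuse = Langfuse()
client = OpenAI()

def judge_and_score(trace_id: str, question: str, answer: str) -> None:
    # Ask a judge model to rate the answer from 0 to 1.
    verdict = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Rate from 0 to 1 how well this answer addresses the question.\n"
                f"Question: {question}\nAnswer: {answer}\nReply with the number only."
            ),
        }],
    )
    # A real pipeline would validate this output before parsing it.
    score = float(verdict.choices[0].message.content.strip())

    # Attach the score to the trace so it appears next to the generation in the UI.
    langfuse.score(trace_id=trace_id, name="llm-judge-relevance", value=score)
```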

Conclusion

In conclusion, as complex as LLMs are, they require sophisticated monitoring tools to remain transparent and efficient. Langfuse is a strong contender, offering both security and flexibility with its open-source, self-hosted model. We highly recommend giving it a try: it’s easy to test using their managed service, and it’s already in use for several production projects for all the reasons we've discussed!

Langfuse has made it simple for us to track user requests and improve our GenAI solutions. While it’s still evolving with regular updates, we are very optimistic about its future. It's also incredibly easy to implement and cost-effective, especially with a self-hosted setup on serverless databases and compute platforms like GCP. The advantages are clear—it gives us a rapid, comprehensive view of our developments, and we strongly encourage its use. Don’t forget to keep an eye on other options like Langsmith from the Langchain team, but Langfuse is definitely worth exploring.

Are you looking for GenAI experts? Feel free to contact us!

This article was written by

Rayann Bentounes