<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Siddharth Sambharia</title>
    <description>The latest articles on Forem by Siddharth Sambharia (@siddhxrth10).</description>
    <link>https://forem.com/siddhxrth10</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1938129%2Fcfc65f7d-599b-49f1-a5ff-4e490cb7dd51.jpg</url>
      <title>Forem: Siddharth Sambharia</title>
      <link>https://forem.com/siddhxrth10</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/siddhxrth10"/>
    <language>en</language>
    <item>
      <title>LLMs in Prod'25: Real-World Insights from 2Tn+ Tokens</title>
      <dc:creator>Siddharth Sambharia</dc:creator>
      <pubDate>Tue, 21 Jan 2025 13:13:51 +0000</pubDate>
      <link>https://forem.com/portkey/llms-in-prod25-real-world-insights-from-2tn-tokens-4k7a</link>
      <guid>https://forem.com/portkey/llms-in-prod25-real-world-insights-from-2tn-tokens-4k7a</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;2024 marked the year when AI moved from experiments to mission-critical systems. But as organizations scaled their implementations, they encountered challenges that few were prepared for.&lt;/p&gt;

&lt;p&gt;Through Portkey's AI gateway, we've had a unique vantage point into how enterprises are building, scaling, and optimizing their AI infrastructure. We’ve worked with &lt;strong&gt;650+ organizations&lt;/strong&gt;, processing over &lt;strong&gt;2 trillion tokens&lt;/strong&gt; across &lt;strong&gt;90+ regions&lt;/strong&gt;. Along the way, everyone kept asking us the same questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Which providers are leading the way?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;How can we ensure reliability in production?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;What patterns are emerging in enterprise AI infrastructure?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To help answer these questions, our team analyzed the 2 trillion+ tokens processed by our AI gateway in 2024. The result is &lt;strong&gt;LLMs in Prod'25&lt;/strong&gt;: a data-driven report on how companies are running LLMs in production. Today, we're excited to share it with you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Learnings from 2 Trillion+ Tokens
&lt;/h2&gt;

&lt;p&gt;As organizations scaled their AI efforts, several striking insights stood out. These takeaways highlight both the challenges and the opportunities in building reliable, scalable AI systems.&lt;/p&gt;

&lt;p&gt;With LLMs taking over the world, everyone’s asking the same question: &lt;em&gt;“Which LLM is the most utilized of them all?”&lt;/em&gt; Let’s unpack what we’ve seen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F3LLMsinProd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F3LLMsinProd.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Multi-Provider Strategies Are Becoming the Norm
&lt;/h3&gt;

&lt;p&gt;Our data shows a dramatic shift toward multi-provider setups, driven by the need for redundancy and better performance: adoption jumped from &lt;strong&gt;23% to 40%&lt;/strong&gt; over the past year.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F1LLMsinProd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F1LLMsinProd.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Reliability is the New Battleground
&lt;/h3&gt;

&lt;p&gt;As enterprises scale their AI systems, reliability has emerged as a key concern. Our analysis revealed that during peak times, some providers experience failure rates of over 20%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F2LLMsinProd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F2LLMsinProd.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Complexity is Scaling with Demand
&lt;/h3&gt;

&lt;p&gt;Enterprises rapidly moved from basic LLM usage (80% simple queries in early 2024) to more sophisticated implementations, with simple queries dropping to 20% by late 2024 as companies adopted complex workflows and multi-step chains that consume more tokens per request.&lt;br&gt;
In just one year we saw:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100-500 token requests grew from 10% to 37%.&lt;/li&gt;
&lt;li&gt;500+ token buckets saw consistent, sustained growth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F4LLMsinProd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F4LLMsinProd.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Road Ahead
&lt;/h2&gt;

&lt;p&gt;As we look to 2025, it's clear that the focus must shift from basic implementation to building reliable, efficient, and secure AI infrastructure at scale. But the path forward isn't obvious.&lt;/p&gt;

&lt;p&gt;Our complete "LLMs in Production 2025" report goes deeper, covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detailed reliability benchmarks across providers&lt;/li&gt;
&lt;li&gt;Architectural patterns for multi-provider deployments&lt;/li&gt;
&lt;li&gt;Cost optimization frameworks for complex workflows &lt;/li&gt;
&lt;li&gt;LLM adoption patterns, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://portkey.sh/devto-report" rel="noopener noreferrer"&gt;➡ Get the Full LLMs In Prod Report &lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>openai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Attention Isn’t All You Need</title>
      <dc:creator>Siddharth Sambharia</dc:creator>
      <pubDate>Thu, 05 Sep 2024 10:59:28 +0000</pubDate>
      <link>https://forem.com/portkey/attention-isnt-all-you-need-3edk</link>
      <guid>https://forem.com/portkey/attention-isnt-all-you-need-3edk</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Mamba: The AI that remembers like a Transformer but thinks like an RNN. Its improved long-term memory could revolutionize DNA processing, video analysis, and AI agents with persistent goals.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Mistral announced a 7B parameter &lt;a href="https://new.portkey.ai/announcements/codestral-mamba-on-portkey?ref=portkey.ai" rel="noopener noreferrer"&gt;Codestral Mamba&lt;/a&gt; model. While there are quite a few smaller models out there now, there's something special about this one — it's not just about size this time, it's about the architecture.&lt;/p&gt;

&lt;p&gt;Codestral Mamba (7B) has been benchmarked against similarly sized models as well as some larger ones:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8iwgrmg8lqg5ecp830e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8iwgrmg8lqg5ecp830e.png" alt="Transformers-training-inference" width="800" height="229"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://mistral.ai/news/codestral-mamba/" rel="noopener noreferrer"&gt;https://mistral.ai/news/codestral-mamba/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As you can see, Codestral Mamba squarely beats most models of its size and rivals the much bigger Codestral 22B &amp;amp; CodeLlama 34B models in performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Transformer Dilemma
&lt;/h2&gt;

&lt;p&gt;Transformers have been the cornerstone architecture for LLMs, powering everything from open-source LLMs to ChatGPT and Claude. They are great at remembering things, as each token can look back at every previous token when making predictions. This makes them undeniably effective, storing every detail from the past for theoretically perfect recall.&lt;/p&gt;

&lt;p&gt;However, the attention mechanism comes with a significant drawback: the &lt;a href="https://www.reddit.com/r/MachineLearning/comments/hxvts0/d_breaking_the_quadratic_attention_bottleneck_in/?ref=portkey.ai" rel="noopener noreferrer"&gt;quadratic bottleneck&lt;/a&gt; problem. Every new token must attend to every previous token in the sequence, so the total cost of generation grows quadratically as sequences get longer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4rn72z44a8hu98w9rwz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4rn72z44a8hu98w9rwz.png" alt="Transformers-training-inference" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional RNNs: Efficient but Limited
&lt;/h2&gt;

&lt;p&gt;On the other side of the spectrum, we have traditional Recurrent Neural Networks (RNNs). These models excel at one thing: efficiency.&lt;/p&gt;

&lt;p&gt;RNNs process sequences by maintaining a fixed-size hidden state, keeping only a compressed summary of what they have seen and discarding the rest. This makes them highly efficient but less capable: information dropped from the state can never be recovered.&lt;/p&gt;
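
&lt;p&gt;A one-line toy update shows why this lossiness is baked in. In this sketch (an illustration with made-up numbers, not a trained RNN), the fixed-size state blends in each new token, so earlier tokens fade geometrically and can never be read back exactly:&lt;/p&gt;

```python
# Toy fixed-size "hidden state": a single number blended with each new token.
def rnn_step(hidden, token, keep=0.5):
    # whatever was in the state decays by `keep` at every step
    return keep * hidden + (1 - keep) * token

hidden = 0.0
for token in [1.0, 0.0, 0.0, 0.0]:
    hidden = rnn_step(hidden, token)
# the first token's contribution has shrunk to 0.5 * 0.5**3 = 0.0625
```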

&lt;h2&gt;
  
  
  Mamba: The Selective State Space Model
&lt;/h2&gt;

&lt;p&gt;Mamba is a model that aims to combine the best of both worlds. It belongs to a class of models known as State Space Models (SSMs). SSMs excel at understanding and predicting how systems evolve based on measurable data. The state, very simply, is the minimal set of variables that fully describes a system at a given moment, and the state space is the set of values that state can take.&lt;/p&gt;

&lt;p&gt;Mamba takes the efficiency of RNNs and enhances it with a crucial feature: selectivity. This selective approach is what elevates a basic SSM to Mamba, the Selective State Space Model. Selectivity enables each token to be transformed according to its specific requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdivjnes4m69z04ug2sgz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdivjnes4m69z04ug2sgz.png" alt="selective-SSM-SSM-Tranformer-RNN" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inspired from: &lt;a href="https://youtu.be/vrF3MtGwD0Y" rel="noopener noreferrer"&gt;https://youtu.be/vrF3MtGwD0Y&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
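
&lt;p&gt;To contrast with the plain recurrence above, here is a toy version of the idea (illustrative only, not the actual Mamba equations): the update gate is computed from the current input, so a salient token can overwrite the state while filler tokens leave it untouched:&lt;/p&gt;

```python
# Toy "selective" update: the gate depends on the input itself.
def selective_step(hidden, token):
    gate = min(1.0, abs(token))  # big tokens write, tiny tokens are skipped
    return (1.0 - gate) * hidden + gate * token

hidden = 0.0
for token in [1.0, 0.0, 0.0, 0.0]:  # one salient token, then filler
    hidden = selective_step(hidden, token)
# hidden is still exactly 1.0: the salient token was retained, and the
# filler tokens were ignored instead of diluting the state
```

&lt;p&gt;Compare this with the fixed-decay update, where the same salient token would have faded to 0.0625 after three filler steps.&lt;/p&gt;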

&lt;h2&gt;
  
  
  Mamba Outperforms Transformers
&lt;/h2&gt;

&lt;p&gt;Mamba performs similarly to or better than Transformer-based models. And crucially, it eliminates the quadratic bottleneck of the attention mechanism.&lt;/p&gt;

&lt;p&gt;Gu and Dao, the Mamba authors, write:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Mamba enjoys fast inference and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modelling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Paper: &lt;a href="https://arxiv.org/abs/2312.00752" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2312.00752&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Mamba is Special
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency&lt;/strong&gt;: Mamba performs similarly to or better than Transformer-based models while maintaining the efficiency of RNNs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: It eliminates the quadratic bottleneck present in the attention mechanism, allowing for linear scaling with sequence length.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term memory&lt;/strong&gt;: Unlike traditional RNNs, Mamba can selectively retain important information over long sequences.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Mamba in Action
&lt;/h2&gt;

&lt;p&gt;LLMs are already good at summarizing text, even if some details may be lost. However, summarizing other forms of content, like a two-hour movie, is trickier!&lt;/p&gt;

&lt;h3&gt;
  
  
  Long-Term Memory
&lt;/h3&gt;

&lt;p&gt;This is where Mamba's long-term memory comes into play, enabling the model to retain important information. Mamba could be a game-changer for tasks requiring extensive context, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DNA processing&lt;/li&gt;
&lt;li&gt;Video analysis&lt;/li&gt;
&lt;li&gt;Agentic workflows with long-term memory and goals&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Using Mamba Today
&lt;/h2&gt;

&lt;p&gt;Portkey natively integrates with Mistral's APIs, making it effortless to try the Codestral Mamba model across a wide range of use cases!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmdip7ze2g7uwkq054z9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmdip7ze2g7uwkq054z9.png" alt="Portkey-mistral -mamba" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Integration guide: &lt;a href="https://docs.portkey.ai/docs/welcome/integration-guides/mistral-ai" rel="noopener noreferrer"&gt;https://docs.portkey.ai/docs/welcome/integration-guides/mistral-ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: A New Tool in the AI Toolkit
&lt;/h2&gt;

&lt;p&gt;As AI engineers, it's crucial to stay abreast of these architectural innovations. Mamba represents not just a new model, but a new way of thinking about sequence processing in AI. While Transformers will continue to play a vital role, Mamba opens up possibilities for tackling problems that were previously computationally infeasible.&lt;/p&gt;

&lt;p&gt;Whether you're working on next-generation language models, processing complex scientific data, or developing AI agents with long-term memory, Mamba is an architecture worth exploring. It's a powerful reminder that in the world of AI, attention isn't always all you need—sometimes, selective efficiency is the key to unlocking new potentials.&lt;/p&gt;

&lt;p&gt;Ready to explore Mamba in your projects? Join other engineers building AI apps on the &lt;a href="https://discord.com/invite/kXYKpPGasJ" rel="noopener noreferrer"&gt;Portkey community here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>mistral</category>
      <category>ai</category>
      <category>aiops</category>
    </item>
  </channel>
</rss>
