<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Siddharth Sambharia</title>
    <description>The latest articles on Forem by Siddharth Sambharia (@siddhxrth10).</description>
    <link>https://forem.com/siddhxrth10</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1938129%2Fcfc65f7d-599b-49f1-a5ff-4e490cb7dd51.jpg</url>
      <title>Forem: Siddharth Sambharia</title>
      <link>https://forem.com/siddhxrth10</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/siddhxrth10"/>
    <language>en</language>
    <item>
      <title>LLMs in Prod'25: Real-World Insights from 2Tn+ Tokens</title>
      <dc:creator>Siddharth Sambharia</dc:creator>
      <pubDate>Tue, 21 Jan 2025 13:13:51 +0000</pubDate>
      <link>https://forem.com/portkey/llms-in-prod25-real-world-insights-from-2tn-tokens-4k7a</link>
      <guid>https://forem.com/portkey/llms-in-prod25-real-world-insights-from-2tn-tokens-4k7a</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;2024 marked the year when AI moved from experiments to mission-critical systems. But as organizations scaled their implementations, they encountered challenges that few were prepared for.&lt;/p&gt;

&lt;p&gt;Through Portkey's AI gateway, we've had a unique vantage point into how enterprises are building, scaling, and optimizing their AI infrastructure. We’ve worked with &lt;strong&gt;650+ organizations&lt;/strong&gt;, processing over &lt;strong&gt;2 trillion tokens&lt;/strong&gt; across &lt;strong&gt;90+ regions&lt;/strong&gt;. Along the way, everyone kept asking us the same questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Which providers are leading the way?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;How can we ensure reliability in production?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;What patterns are emerging in enterprise AI infrastructure?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To help answer these questions, our team analyzed the 2 trillion+ tokens processed by our AI gateway in 2024. The result is &lt;strong&gt;LLMs in Prod'25&lt;/strong&gt;: a data-driven report on how companies are running LLMs in production. Today, we're excited to share it with you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Learnings from 2 Trillion+ Tokens
&lt;/h2&gt;

&lt;p&gt;As organizations scaled their AI efforts, several striking insights stood out. These takeaways highlight both the challenges and the opportunities in building reliable, scalable AI systems.&lt;/p&gt;

&lt;p&gt;With LLMs taking over the world, everyone’s asking the same question: &lt;em&gt;“Which LLM is the most utilized of them all?”&lt;/em&gt; Let’s unpack what we’ve seen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F3LLMsinProd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F3LLMsinProd.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Multi-Provider Strategies Are Becoming the Norm
&lt;/h3&gt;

&lt;p&gt;Our data shows a dramatic shift toward multi-provider setups, driven by the need for redundancy and better performance: adoption jumped from &lt;strong&gt;23% to 40%&lt;/strong&gt; over the past year.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F1LLMsinProd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F1LLMsinProd.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Reliability is the New Battleground
&lt;/h3&gt;

&lt;p&gt;As enterprises scale their AI systems, reliability has emerged as a key concern. Our analysis revealed that during peak times, some providers experience failure rates of over 20%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F2LLMsinProd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F2LLMsinProd.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Complexity is Scaling with Demand
&lt;/h3&gt;

&lt;p&gt;Enterprises rapidly moved from basic LLM usage (80% simple queries in early 2024) to more sophisticated implementations, with simple queries dropping to 20% by late 2024 as companies adopted complex workflows and multi-step chains that consume more tokens per request.&lt;br&gt;
In just one year we saw:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100-500 token requests grew from 10% to 37%.&lt;/li&gt;
&lt;li&gt;500+ token buckets saw consistent, sustained growth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F4LLMsinProd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fsiddharthsambharia-portkey%2FPortkey-Product-Images%2Frefs%2Fheads%2Fmain%2F4LLMsinProd.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Road Ahead
&lt;/h2&gt;

&lt;p&gt;As we look to 2025, it's clear that the focus must shift from basic implementation to building reliable, efficient, and secure AI infrastructure at scale. But the path forward isn't obvious.&lt;/p&gt;

&lt;p&gt;Our complete "LLMs in Production 2025" report goes deeper, covering:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detailed reliability benchmarks across providers&lt;/li&gt;
&lt;li&gt;Architectural patterns for multi-provider deployments&lt;/li&gt;
&lt;li&gt;Cost optimization frameworks for complex workflows &lt;/li&gt;
&lt;li&gt;LLM adoption patterns, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://portkey.sh/devto-report" rel="noopener noreferrer"&gt;➡ Get the Full LLMs In Prod Report &lt;/a&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>openai</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Attention Isn’t All You Need</title>
      <dc:creator>Siddharth Sambharia</dc:creator>
      <pubDate>Thu, 05 Sep 2024 10:59:28 +0000</pubDate>
      <link>https://forem.com/portkey/attention-isnt-all-you-need-3edk</link>
      <guid>https://forem.com/portkey/attention-isnt-all-you-need-3edk</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Mamba: The AI that remembers like a Transformer but thinks like an RNN. Its improved long-term memory could revolutionize DNA processing, video analysis, and AI agents with persistent goals.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Mistral announced a 7B parameter &lt;a href="https://new.portkey.ai/announcements/codestral-mamba-on-portkey?ref=portkey.ai" rel="noopener noreferrer"&gt;Codestral Mamba&lt;/a&gt; model. While there are quite a few smaller models out there now, there's something special about this one — it's not just about size this time, it's about the architecture.&lt;/p&gt;

&lt;p&gt;Codestral Mamba (7B) has been benchmarked against similarly sized models as well as some larger ones:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8iwgrmg8lqg5ecp830e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg8iwgrmg8lqg5ecp830e.png" alt="Transformers-training-inference" width="800" height="229"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://mistral.ai/news/codestral-mamba/" rel="noopener noreferrer"&gt;https://mistral.ai/news/codestral-mamba/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As you can see, Codestral Mamba squarely beats most models of its size and rivals the much bigger Codestral 22B &amp;amp; CodeLlama 34B models in performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Transformer Dilemma
&lt;/h2&gt;

&lt;p&gt;Transformers have been the cornerstone architecture for LLMs, powering everything from open-source LLMs to ChatGPT and Claude. They are great at remembering things, as each token can look back at every previous token when making predictions. This makes them undeniably effective, storing every detail from the past for theoretically perfect recall.&lt;/p&gt;

&lt;p&gt;However, the attention mechanism comes with a significant drawback: the &lt;a href="https://www.reddit.com/r/MachineLearning/comments/hxvts0/d_breaking_the_quadratic_attention_bottleneck_in/?ref=portkey.ai" rel="noopener noreferrer"&gt;quadratic bottleneck&lt;/a&gt; problem. Every new token must attend to every previous token in the sequence, so the total cost of generation grows quadratically as sequences get longer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4rn72z44a8hu98w9rwz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn4rn72z44a8hu98w9rwz.png" alt="Transformers-training-inference" width="800" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Traditional RNNs: Efficient but Limited
&lt;/h2&gt;

&lt;p&gt;On the other side of the spectrum, we have traditional Recurrent Neural Networks (RNNs). These models excel at one thing: efficiency.&lt;/p&gt;

&lt;p&gt;RNNs process sequences by maintaining a fixed-size hidden state, keeping only a compressed summary of what they have seen and discarding the rest. This makes them highly efficient but less capable: information dropped from the state can never be recovered.&lt;/p&gt;
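
&lt;p&gt;A one-line toy update shows why this lossiness is baked in. In this sketch (an illustration with made-up numbers, not a trained RNN), the fixed-size state blends in each new token, so earlier tokens fade geometrically and can never be read back exactly:&lt;/p&gt;

```python
# Toy fixed-size "hidden state": a single number blended with each new token.
def rnn_step(hidden, token, keep=0.5):
    # whatever was in the state decays by `keep` at every step
    return keep * hidden + (1 - keep) * token

hidden = 0.0
for token in [1.0, 0.0, 0.0, 0.0]:
    hidden = rnn_step(hidden, token)
# the first token's contribution has shrunk to 0.5 * 0.5**3 = 0.0625
```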

&lt;h2&gt;
  
  
  Mamba: The Selective State Space Model
&lt;/h2&gt;

&lt;p&gt;Mamba is a model that aims to combine the best of both worlds. It belongs to a class of models known as State Space Models (SSMs). SSMs excel at understanding and predicting how systems evolve based on measurable data. The state, very simply, is the minimal set of variables that fully describes a system at a given moment, and the state space is the set of values that state can take.&lt;/p&gt;

&lt;p&gt;Mamba takes the efficiency of RNNs and enhances it with a crucial feature: selectivity. This selective approach is what elevates a basic SSM to Mamba, the Selective State Space Model. Selectivity enables each token to be transformed according to its specific requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdivjnes4m69z04ug2sgz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdivjnes4m69z04ug2sgz.png" alt="selective-SSM-SSM-Tranformer-RNN" width="800" height="285"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Inspired from: &lt;a href="https://youtu.be/vrF3MtGwD0Y" rel="noopener noreferrer"&gt;https://youtu.be/vrF3MtGwD0Y&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
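
&lt;p&gt;To contrast with the plain recurrence above, here is a toy version of the idea (illustrative only, not the actual Mamba equations): the update gate is computed from the current input, so a salient token can overwrite the state while filler tokens leave it untouched:&lt;/p&gt;

```python
# Toy "selective" update: the gate depends on the input itself.
def selective_step(hidden, token):
    gate = min(1.0, abs(token))  # big tokens write, tiny tokens are skipped
    return (1.0 - gate) * hidden + gate * token

hidden = 0.0
for token in [1.0, 0.0, 0.0, 0.0]:  # one salient token, then filler
    hidden = selective_step(hidden, token)
# hidden is still exactly 1.0: the salient token was retained, and the
# filler tokens were ignored instead of diluting the state
```

&lt;p&gt;Compare this with the fixed-decay update, where the same salient token would have faded to 0.0625 after three filler steps.&lt;/p&gt;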

&lt;h2&gt;
  
  
  Mamba Outperforms Transformers
&lt;/h2&gt;

&lt;p&gt;Mamba performs similarly to or better than Transformer-based models. And crucially, it eliminates the quadratic bottleneck of the attention mechanism.&lt;/p&gt;

&lt;p&gt;Gu and Dao, the Mamba authors, write:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Mamba enjoys fast inference and linear scaling in sequence length, and its performance improves on real data up to million-length sequences. As a general sequence model backbone, Mamba achieves state-of-the-art performance across several modalities such as language, audio, and genomics. On language modelling, our Mamba-3B model outperforms Transformers of the same size and matches Transformers twice its size, both in pretraining and downstream evaluation."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;Paper: &lt;a href="https://arxiv.org/abs/2312.00752" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2312.00752&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Mamba is Special
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Efficiency&lt;/strong&gt;: Mamba performs similarly to or better than Transformer-based models while maintaining the efficiency of RNNs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: It eliminates the quadratic bottleneck present in the attention mechanism, allowing for linear scaling with sequence length.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term memory&lt;/strong&gt;: Unlike traditional RNNs, Mamba can selectively retain important information over long sequences.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Mamba in Action
&lt;/h2&gt;

&lt;p&gt;LLMs are already good at summarizing text, even if some details may be lost. However, summarizing other forms of content, like a two-hour movie, is trickier!&lt;/p&gt;

&lt;h3&gt;
  
  
  Long-Term Memory
&lt;/h3&gt;

&lt;p&gt;This is where Mamba's long-term memory comes into play, enabling the model to retain important information. Mamba could be a game-changer for tasks requiring extensive context, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DNA processing&lt;/li&gt;
&lt;li&gt;Video analysis&lt;/li&gt;
&lt;li&gt;Agentic workflows with long-term memory and goals&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Using Mamba Today
&lt;/h2&gt;

&lt;p&gt;Portkey natively integrates with Mistral's APIs, making it effortless to try the Codestral Mamba model across a wide range of use cases!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmdip7ze2g7uwkq054z9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmdip7ze2g7uwkq054z9.png" alt="Portkey-mistral -mamba" width="800" height="427"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Integration guide: &lt;a href="https://docs.portkey.ai/docs/welcome/integration-guides/mistral-ai" rel="noopener noreferrer"&gt;https://docs.portkey.ai/docs/welcome/integration-guides/mistral-ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: A New Tool in the AI Toolkit
&lt;/h2&gt;

&lt;p&gt;As AI engineers, it's crucial to stay abreast of these architectural innovations. Mamba represents not just a new model, but a new way of thinking about sequence processing in AI. While Transformers will continue to play a vital role, Mamba opens up possibilities for tackling problems that were previously computationally infeasible.&lt;/p&gt;

&lt;p&gt;Whether you're working on next-generation language models, processing complex scientific data, or developing AI agents with long-term memory, Mamba is an architecture worth exploring. It's a powerful reminder that in the world of AI, attention isn't always all you need—sometimes, selective efficiency is the key to unlocking new potentials.&lt;/p&gt;

&lt;p&gt;Ready to explore Mamba in your projects? Join other engineers building AI apps on the &lt;a href="https://discord.com/invite/kXYKpPGasJ" rel="noopener noreferrer"&gt;Portkey community here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>mistral</category>
      <category>ai</category>
      <category>aiops</category>
    </item>
  </channel>
</rss>
