<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Vrushank</title>
    <description>The latest articles on Forem by Vrushank (@vrv).</description>
    <link>https://forem.com/vrv</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1118432%2F721674d2-702a-4180-ad7a-4973f26f6c68.jpg</url>
      <title>Forem: Vrushank</title>
      <link>https://forem.com/vrv</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/vrv"/>
    <language>en</language>
    <item>
      <title>LLMs in Prod 2025: Insights from 2 Trillion+ Tokens</title>
      <dc:creator>Vrushank</dc:creator>
      <pubDate>Tue, 21 Jan 2025 10:47:15 +0000</pubDate>
      <link>https://forem.com/portkey/llms-in-prod-2025-insights-from-2-trillion-tokens-2i21</link>
      <guid>https://forem.com/portkey/llms-in-prod-2025-insights-from-2-trillion-tokens-2i21</guid>
      <description>&lt;p&gt;2024 marked the year when AI moved from experiments to mission-critical systems. But as organizations scaled their implementations, they encountered challenges that few were prepared for.&lt;/p&gt;

&lt;p&gt;Through Portkey's AI Gateway, we've had a unique vantage point into how enterprises are building, scaling, and optimizing their AI infrastructure. We’ve worked with &lt;strong&gt;650+ organizations&lt;/strong&gt;, processing over &lt;strong&gt;2 trillion tokens&lt;/strong&gt; across &lt;strong&gt;90+ regions&lt;/strong&gt;. Along the way, everyone kept asking us the same questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;Which providers are leading the way?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;How can we ensure reliability in production?&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;What patterns are emerging in enterprise AI infrastructure?&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To help answer these questions, our team has analyzed the 2 trillion+ tokens processed by our AI gateway in 2024. The result is the &lt;strong&gt;LLMs in Prod&lt;/strong&gt; report—a data-driven look at how companies are using LLMs in production. Today we are excited to share it with you.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Learnings from 2 Trillion+ Tokens
&lt;/h2&gt;

&lt;p&gt;As organizations scaled their AI efforts, several striking insights stood out. These takeaways highlight the challenges and opportunities in building reliable, scalable AI systems.&lt;/p&gt;

&lt;p&gt;With LLMs taking over the world, everyone’s asking the same question: &lt;em&gt;“Which LLM is the most utilized of them all?”&lt;/em&gt;  Let’s unpack what we’ve seen.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6brvmypynwasq3o560d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu6brvmypynwasq3o560d.png" alt="LLMs in Prod 2025: Insights from 2 Trillion+ Tokens" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Multi-Provider Strategies Are Becoming the Norm&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our data shows a dramatic shift toward multi-provider adoption, driven by the need for redundancy and improved performance. Multi-provider adoption jumped from 23% to 40% in the last year.&lt;/p&gt;
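<br>
&lt;p&gt;The pattern behind this shift is simple: route each request through an ordered list of providers and fall back on failure. Here's a minimal, illustrative Python sketch of the idea (the provider list and the &lt;code&gt;call_provider&lt;/code&gt; helper are placeholders for illustration, not Portkey's actual gateway API):&lt;/p&gt;
<br>
&lt;pre&gt;&lt;code&gt;import time

PROVIDERS = ["openai", "anthropic", "google"]  # ordered by preference (illustrative)

def call_provider(name, prompt):
    """Hypothetical helper: send the prompt to one provider's API."""
    raise NotImplementedError

def complete_with_fallback(prompt, retries_per_provider=2):
    # Try each provider in order; back off briefly between retries.
    for name in PROVIDERS:
        for attempt in range(retries_per_provider):
            try:
                return call_provider(name, prompt)
            except Exception:
                time.sleep(2 ** attempt)  # exponential backoff before retrying
    raise RuntimeError("all providers failed")
&lt;/code&gt;&lt;/pre&gt;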

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxu2sm2ou8zcd7yxz0a2v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxu2sm2ou8zcd7yxz0a2v.png" alt="LLMs in Prod 2025: Insights from 2 Trillion+ Tokens" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Reliability is the New Battleground&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;As enterprises scale their AI systems, reliability has emerged as a key concern. Our analysis revealed that during peak times, some providers experience failure rates of over 20%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajfqy00hsi4cndgmp5je.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fajfqy00hsi4cndgmp5je.png" alt="LLMs in Prod 2025: Insights from 2 Trillion+ Tokens" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Complexity is Scaling with Demand&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Enterprises rapidly moved from basic LLM usage (80% simple queries in early 2024) to more sophisticated implementations, with simple queries dropping to 20% by late 2024 as companies adopted complex workflows and multi-step chains that use more tokens per request.&lt;/p&gt;
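<br>
&lt;p&gt;By "multi-step chains" we mean feeding one model call's output into the next, which compounds tokens per request. A minimal sketch, with a hypothetical &lt;code&gt;llm&lt;/code&gt; callable standing in for any provider:&lt;/p&gt;
<br>
&lt;pre&gt;&lt;code&gt;def chain(document, llm):
    """Two-step chain: each step re-sends prior output, so token usage compounds."""
    summary = llm("Summarize the following document:\n" + document)
    # Step 2 consumes step 1's output, so tokens per request grow with chain depth.
    actions = llm("List the action items from this summary:\n" + summary)
    return actions
&lt;/code&gt;&lt;/pre&gt;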

&lt;p&gt;In just a year we saw:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;100-500 token requests grew from 10% to 37%.&lt;/li&gt;
&lt;li&gt;500+ token buckets saw consistent, sustained growth.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqwzbvkms0ghaojsfd3n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpqwzbvkms0ghaojsfd3n.png" alt="LLMs in Prod 2025: Insights from 2 Trillion+ Tokens" width="800" height="449"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;LLM Token Usage Pattern | LLMs in Prod 25&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;The Road Ahead&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As we look to 2025, it's clear that the focus must shift from basic implementation to building reliable, efficient, and secure AI infrastructure at scale. But the path forward isn't obvious.&lt;/p&gt;

&lt;p&gt;Our complete "LLMs in Production 2025" report goes deeper, with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Detailed reliability benchmarks across providers&lt;/li&gt;
&lt;li&gt;Architectural patterns for multi-provider deployments&lt;/li&gt;
&lt;li&gt;Cost optimization frameworks for complex workflows&lt;/li&gt;
&lt;li&gt;LLM adoption patterns, and more&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Get the Full Report ➡
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://portkey.sh/llmsinprod25?ref=dev.to"&gt;LLMs in Prod 25 Report&lt;/a&gt;&lt;/p&gt;

</description>
      <category>productionguides</category>
    </item>
    <item>
      <title>Model Context Protocol for building reliable, enterprise LLM applications</title>
      <dc:creator>Vrushank</dc:creator>
      <pubDate>Tue, 17 Dec 2024 07:48:32 +0000</pubDate>
      <link>https://forem.com/portkey/model-context-protocol-for-building-reliable-enterprise-llm-applications-1f65</link>
      <guid>https://forem.com/portkey/model-context-protocol-for-building-reliable-enterprise-llm-applications-1f65</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra1xhy3mre7ahs24o9cn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fra1xhy3mre7ahs24o9cn.png" alt="Model Context Protocol for building reliable, enterprise LLM applications" width="800" height="629"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Picture the modern enterprise LLM application scene - from customer service chatbots parsing thousands of queries to data analysis systems processing vast business insights. Large Language Models (LLMs) power these systems, but there's a critical challenge that many organizations overlook: context management.&lt;/p&gt;

&lt;p&gt;As enterprises scale their LLM deployments, they face a huge problem. These powerful models need more than just computational resources - they need smart, efficient ways to access and process context from diverse data sources. Traditional approaches to context handling often create bottlenecks, leading to inconsistent performance and rising costs.&lt;/p&gt;

&lt;p&gt;Enter the Model Context Protocol (MCP). By creating a standardized bridge between models and data sources, MCP tackles the context challenge head-on.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Model Context Protocol (MCP)?
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt; is a framework designed to improve how context is managed within machine learning models, particularly large language models (LLMs). It is a standardized protocol that governs how models handle and process contextual information, ensuring that this information is passed efficiently through the system during training, inference, and deployment.&lt;/p&gt;

&lt;p&gt;In traditional LLM deployments, managing context—such as user inputs, historical data, or domain-specific information—can become increasingly complex as models scale. This can lead to challenges such as model drift, inefficiencies, and poor response quality. MCP addresses these challenges by creating a clear and consistent way to manage and enrich context across various LLM use cases.&lt;/p&gt;

&lt;p&gt;MCP's architecture is built on three fundamental components that enable sophisticated context management in LLM deployments:&lt;/p&gt;

&lt;p&gt;The protocol implements a dedicated context management system that maintains the state across the entire model operation cycle. It tracks contextual dependencies, manages priority queues for context updates, and ensures critical information persists throughout model inference. This layer prevents context loss that can degrade output quality in high-throughput scenarios.&lt;/p&gt;

&lt;p&gt;MCP also provides a framework that allows context management to scale effectively, making it easier to handle larger datasets and more complex workflows. MCP ensures that the model’s outputs remain consistent and aligned with the provided context, regardless of the model size or complexity.&lt;/p&gt;

&lt;p&gt;By implementing MCP, LLMs can process information more efficiently, resulting in faster, more accurate, and resource-efficient applications.&lt;/p&gt;


&lt;h2&gt;
  
  
  What are the challenges in enterprise LLM Apps?
&lt;/h2&gt;

&lt;p&gt;Firstly, AI integrations &lt;strong&gt;lack standardization&lt;/strong&gt;, requiring teams to build custom APIs and connectors for each integration point. This creates inconsistent implementations across projects and increases technical complexity.&lt;/p&gt;

&lt;p&gt;Building workflows across multiple tools requires extensive custom logic. Teams spend valuable time orchestrating these connections instead of focusing on core application features.&lt;/p&gt;

&lt;p&gt;Also, LLM applications need consistent access to structured data, but &lt;strong&gt;retrieving and processing information across diverse platforms&lt;/strong&gt; is complicated. Enterprise environments, with their numerous data sources, face this challenge acutely.&lt;/p&gt;

&lt;p&gt;Interoperability issues arise when integrations are tied to specific platforms. This creates technical barriers when scaling AI systems across different tools and environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt; presents its own set of challenges when integrating external tools. Teams must balance data accessibility with protection against unauthorized access while maintaining system performance. Debugging AI workflows requires specialized approaches - without proper monitoring tools, identifying and fixing issues in production becomes time-consuming.&lt;/p&gt;

&lt;p&gt;As systems grow, &lt;strong&gt;maintaining custom integrations&lt;/strong&gt; demands increasing engineering effort. Development teams often find themselves managing integration logic rather than building new features. This technical overhead directly impacts development velocity and the ability to innovate. Custom-built integrations become harder to scale, particularly as organizations add more tools and data sources to their AI infrastructure.&lt;/p&gt;

&lt;p&gt;These challenges become more complex with each new integration, tool, or data source added to the system.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does MCP address these challenges?
&lt;/h2&gt;

&lt;p&gt;MCP implements a standardized protocol layer that sits between LLMs and external systems. This layer handles complex context management through intelligent routing and state management. Rather than building custom integrations for each new tool or data source, developers can leverage MCP's protocol to establish consistent, reliable connections.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXdQR2I6GhcYPjBYTLhZAcBaFKdygfszEn0j7Cump8XG-sEAOWxflVqY20kf4XZRykqxtqhrHoX4Z4uZAad4aqoGnc0Q0g1t_RsgLoZJNMLybE8Ac4eroNYc4xS0pPgXHyzxFUZ52w%3Fkey%3DIsGNiSRhrkHyNt69qnptW9Gk" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh7-rt.googleusercontent.com%2Fdocsz%2FAD_4nXdQR2I6GhcYPjBYTLhZAcBaFKdygfszEn0j7Cump8XG-sEAOWxflVqY20kf4XZRykqxtqhrHoX4Z4uZAad4aqoGnc0Q0g1t_RsgLoZJNMLybE8Ac4eroNYc4xS0pPgXHyzxFUZ52w%3Fkey%3DIsGNiSRhrkHyNt69qnptW9Gk" alt="Model Context Protocol for building reliable, enterprise LLM applications" width="800" height="629"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Model Context Protocol Architecture (Source)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The protocol's architecture introduces sophisticated context handling. It implements real-time state synchronization, allowing LLMs to maintain contextual awareness across multiple tools and data sources.&lt;/p&gt;

&lt;p&gt;From a security standpoint, MCP builds robust access controls at the protocol level. It implements granular permissions and audit trails, ensuring that sensitive data remains protected while maintaining system performance. The protocol's built-in monitoring capabilities provide deep visibility into context flow, making it easier to identify and resolve issues in production.&lt;/p&gt;

&lt;p&gt;For development teams, MCP significantly reduces integration complexity through its SDK-first approach. Rather than writing custom integration code, developers can use standardized interfaces to connect new tools and data sources.&lt;/p&gt;
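<br>
&lt;p&gt;As a rough sketch of what that looks like in practice (assuming the MCP Python SDK's FastMCP interface; the server name, resource URI, and data here are purely illustrative):&lt;/p&gt;
<br>
&lt;pre&gt;&lt;code&gt;from mcp.server.fastmcp import FastMCP

mcp = FastMCP("inventory-context")  # illustrative server name

@mcp.resource("inventory://{sku}")
def get_item(sku: str) -&gt; str:
    """Expose a data source: return context for one inventory item."""
    return f"Details for SKU {sku}"  # placeholder lookup

@mcp.tool()
def check_stock(sku: str) -&gt; int:
    """Expose a tool the model can invoke through the protocol."""
    return 42  # placeholder stock count

if __name__ == "__main__":
    mcp.run()
&lt;/code&gt;&lt;/pre&gt;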

&lt;p&gt;The protocol's stateful connection management ensures reliable performance at scale. MCP's distributed architecture efficiently handles increasing loads as systems grow without requiring architectural overhauls. This built-in scalability removes the traditional bottlenecks that plague custom integrations.&lt;/p&gt;

&lt;p&gt;By providing this comprehensive infrastructure layer, MCP transforms the development and deployment of LLM applications. Teams can focus on building core functionality rather than managing complex integration logic, leading to more robust and maintainable AI systems.&lt;/p&gt;

&lt;p&gt;By addressing key challenges such as standardization, interoperability, security, and scalability, the Model Context Protocol (MCP) provides a robust foundation for managing the complexities of enterprise AI.&lt;/p&gt;

&lt;p&gt;The path forward depends on industry-wide technical collaboration. As engineering teams implement MCP in production systems, their practical insights will drive protocol refinements and standardization efforts. This collective expertise will be crucial in evolving the protocol to handle increasingly complex AI workflows while maintaining system reliability and performance.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>LLMs in Prod: The Reality of AI Outages, No LLM is Immune</title>
      <dc:creator>Vrushank</dc:creator>
      <pubDate>Sat, 14 Dec 2024 09:58:00 +0000</pubDate>
      <link>https://forem.com/portkey/llms-in-prod-the-reality-of-ai-outages-no-llm-is-immune-5h32</link>
      <guid>https://forem.com/portkey/llms-in-prod-the-reality-of-ai-outages-no-llm-is-immune-5h32</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1541701494587-cb58502866ab%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wxMTc3M3wwfDF8c2VhcmNofDExfHxhYnN0cmFjdHxlbnwwfHx8fDE3MzQxNzAxNDJ8MA%26ixlib%3Drb-4.0.3%26q%3D80%26w%3D2000" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fimages.unsplash.com%2Fphoto-1541701494587-cb58502866ab%3Fcrop%3Dentropy%26cs%3Dtinysrgb%26fit%3Dmax%26fm%3Djpg%26ixid%3DM3wxMTc3M3wwfDF8c2VhcmNofDExfHxhYnN0cmFjdHxlbnwwfHx8fDE3MzQxNzAxNDJ8MA%26ixlib%3Drb-4.0.3%26q%3D80%26w%3D2000" alt="LLMs in Prod: The Reality of AI Outages, No LLM is Immune" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;Part 2&lt;/strong&gt; of our series analyzing Portkey's critical insights from production LLM deployments. Today, we're diving deep into provider reliability data from &lt;strong&gt;650+ organizations&lt;/strong&gt;, examining outages, error rates, and the real impact of downtime on AI applications. From the infamous OpenAI outage to the daily challenges of rate limits, we'll reveal why 'hope isn't a strategy' when it comes to LLM infrastructure.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🚨 LLMs in Production: Day 3   &lt;/p&gt;

&lt;p&gt;“Hope isn’t a strategy.”&lt;br&gt;&lt;br&gt;
When your LLM provider goes down—and trust us, it will—how ready are you?  &lt;/p&gt;

&lt;p&gt;Today, we’re sharing fresh data from 650+ orgs on LLM provider reliability, downtime strategies, and how to keep things running smoothly (while…&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867664054421999811?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 13, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;Before that, here’s a recap from Part 1 of LLMs in Prod:  &lt;/p&gt;

&lt;p&gt;• &lt;a href="https://twitter.com/OpenAI?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;@OpenAI&lt;/a&gt; dominance is eroding, with Anthropic slowly but steadily gaining ground&lt;br&gt;&lt;br&gt;
• &lt;a href="https://twitter.com/AnthropicAI?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;@AnthropicAI&lt;/a&gt; requests are growing at a staggering 61% MoM&lt;br&gt;&lt;br&gt;
• &lt;a href="https://twitter.com/Google?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;@Google&lt;/a&gt; Vertex AI is finally gaining momentum after a rocky start.  &lt;/p&gt;

&lt;p&gt;Now,… &lt;a href="https://t.co/4MjD63EWyJ?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/4MjD63EWyJ&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867664057529934004?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 13, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;Remember the OpenAI Outage?  &lt;/p&gt;

&lt;p&gt;In just one day, they reminded the world how critical they are—by taking everything offline for ~4 hours. 😛  &lt;/p&gt;

&lt;p&gt;But here’s the thing: this wasn’t an anomaly.&lt;br&gt;&lt;br&gt;
Outages like these are a recurring pattern across ALL providers.  &lt;/p&gt;

&lt;p&gt;Which begs the question: why… &lt;a href="https://t.co/HYNVeZlSpo?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/HYNVeZlSpo&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867664061095149964?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 13, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;📊 Over the past year, error spikes hit every provider—from 429s to 5xxs, no one was spared.  &lt;/p&gt;

&lt;p&gt;The truth?&lt;br&gt;&lt;br&gt;
There’s no pattern, no guarantees, and no immunity.  &lt;/p&gt;

&lt;p&gt;If you’re not prepared with multi-provider setups, you’re inviting downtime.&lt;br&gt;&lt;br&gt;
Reliability isn’t optional—it’s table… &lt;a href="https://t.co/MDpSfSrYft?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/MDpSfSrYft&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867664065507516471?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 13, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;Rate Limit Reality Check:  &lt;/p&gt;

&lt;p&gt;• &lt;a href="https://twitter.com/GroqInc?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;@GroqInc&lt;/a&gt; : 21.11%&lt;br&gt;&lt;br&gt;
• &lt;a href="https://twitter.com/PerpLEXIty?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;@Perplexity&lt;/a&gt;: 12.24%&lt;br&gt;&lt;br&gt;
• &lt;a href="https://twitter.com/AnthropicAI?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;@AnthropicAI&lt;/a&gt; : 5.60%&lt;br&gt;&lt;br&gt;
• &lt;a href="https://twitter.com/Azure?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;@Azure&lt;/a&gt; OpenAI: 1.74%  &lt;/p&gt;

&lt;p&gt;Translation: If you're not handling rate limits gracefully, you're gambling with user experience.  &lt;/p&gt;

&lt;p&gt;Your customers won’t wait for infra to catch up. Are you… &lt;a href="https://t.co/GiJwXdPMuQ?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/GiJwXdPMuQ&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867664069097836657?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 13, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;But rate limits are just the tip of the iceberg.  &lt;/p&gt;

&lt;p&gt;Server Error (5xx) rates this year:  &lt;/p&gt;

&lt;p&gt;• Groq: 0.67%&lt;br&gt;&lt;br&gt;
• Anthropic: 0.56%&lt;br&gt;&lt;br&gt;
• Perplexity: 0.39%&lt;br&gt;&lt;br&gt;
• Gemini: 0.32%&lt;br&gt;&lt;br&gt;
• Bedrock: 0.28%  &lt;/p&gt;

&lt;p&gt;Even "small" error rates = thousands of failed requests at scale.  &lt;/p&gt;

&lt;p&gt;These aren’t just numbers—they’re… &lt;a href="https://t.co/0CqdEGfYc0?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/0CqdEGfYc0&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867664075502633452?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 13, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;So, what’s the solution?  &lt;/p&gt;

&lt;p&gt;The hard truth? Your users don't care why your AI features failed.&lt;br&gt;&lt;br&gt;
They just know you failed.&lt;br&gt;&lt;br&gt;
The key isn’t choosing the “best” provider—it’s building a system that works when things go wrong:  &lt;/p&gt;

&lt;p&gt;💡 Diversify providers.&lt;br&gt;&lt;br&gt;
💡 Implement caching.&lt;br&gt;&lt;br&gt;
💡 Build smart…&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867664079873073572?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 13, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;6/ Why caching matters:  &lt;/p&gt;

&lt;p&gt;Performance optimization is critical, and here’s where caching delivers results:&lt;br&gt;&lt;br&gt;
• 36% average cache hit rate (peaks for Q&amp;amp;A use cases)&lt;br&gt;&lt;br&gt;
• 30x faster response times&lt;br&gt;&lt;br&gt;
• 38% cost reduction  &lt;/p&gt;

&lt;p&gt;Caching isn't optional at scale—it's your first line of defense. &lt;a href="https://t.co/YX7YvwkmMS?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/YX7YvwkmMS&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867664082805154072?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 13, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
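<br>
&lt;p&gt;For intuition, here is a minimal sketch of the exact-match caching pattern behind those numbers (an in-memory dict stands in for a real cache store, and &lt;code&gt;call_llm&lt;/code&gt; is a hypothetical provider call):&lt;/p&gt;
<br>
&lt;pre&gt;&lt;code&gt;import hashlib
import json

_cache = {}  # in production this would be Redis or a similar store

def cache_key(model, messages):
    # Exact-match key: identical model plus identical messages.
    blob = json.dumps({"model": model, "messages": messages}, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()

def cached_completion(model, messages, call_llm):
    key = cache_key(model, messages)
    if key in _cache:
        return _cache[key]  # cache hit: no provider call, near-instant response
    response = call_llm(model, messages)
    _cache[key] = response
    return response
&lt;/code&gt;&lt;/pre&gt;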


&lt;blockquote&gt;
&lt;p&gt;That’s it for today! Follow &lt;a href="https://twitter.com/PortkeyAI?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;@PortkeyAI&lt;/a&gt; for more on LLMs in Prod Series&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867664085879328897?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 13, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;&lt;a href="https://t.co/54QiUNDZx2?ref=portkey.ai" rel="noopener noreferrer"&gt;https://t.co/54QiUNDZx2&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867664144096539086?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 13, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


</description>
    </item>
    <item>
      <title>LLMs In Prod: Day 1</title>
      <dc:creator>Vrushank</dc:creator>
      <pubDate>Fri, 13 Dec 2024 10:08:40 +0000</pubDate>
      <link>https://forem.com/portkey/llms-in-prod-day-1-44oa</link>
      <guid>https://forem.com/portkey/llms-in-prod-day-1-44oa</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7dt7v8bsbmkcf9czfg3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc7dt7v8bsbmkcf9czfg3.png" alt="LLMs In Prod: Day 1" width="800" height="457"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is &lt;strong&gt;Part 1&lt;/strong&gt; of our analysis diving into Portkey's year-end LLM production data. Over the next few days, we'll be unpacking insights from over &lt;strong&gt;2 trillion tokens&lt;/strong&gt; worth of production data to reveal how the AI ecosystem evolved in 2024.&lt;/p&gt;

&lt;p&gt;Today, we're starting with the most crucial aspect: who's dominating the market, who's growing the fastest, and what's really changing in the LLM landscape.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;✨LLMs in Prod Day 1: Provider Trends Unveiled  &lt;/p&gt;

&lt;p&gt;Who’s dominating? Who’s growing the fastest? And what’s really changing in the AI ecosystem?  &lt;/p&gt;

&lt;p&gt;We analyzed 2 trillion tokens worth of our production data to uncover trends that might surprise you. Let’s get into it. 👇&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867282238896796141?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 12, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;OpenAI: Still Leading, But Watch Out  &lt;/p&gt;

&lt;p&gt;• 24% monthly growth in requests&lt;br&gt;&lt;br&gt;
• 6% monthly growth in organizations  &lt;/p&gt;

&lt;p&gt;OpenAI remains the leader, but here’s the twist: their adoption has dropped from 89% to 76%.  &lt;/p&gt;

&lt;p&gt;Steady dominance? Yes. But competitors are catching up faster than… &lt;a href="https://t.co/jkCsxaUcnn?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/jkCsxaUcnn&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867282241950486859?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 12, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;Anthropic: Fastest Growing Provider of 2024  &lt;/p&gt;

&lt;p&gt;• 61% monthly growth in requests&lt;br&gt;&lt;br&gt;
• 22% monthly growth in organizations  &lt;/p&gt;

&lt;p&gt;Anthropic has been on fire. Every time they launch a new Claude model (like Sonnet 3.5, Haiku-3.5), we see big adoption spikes: model releases matter.  &lt;/p&gt;

&lt;p&gt;Anthropic… &lt;a href="https://t.co/KMsLLOLTsy?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/KMsLLOLTsy&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867282245230248181?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 12, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;Gemini (Google): Late Start, Big Momentum  &lt;/p&gt;

&lt;p&gt;49% monthly growth in requests&lt;br&gt;&lt;br&gt;
9% growth in new organizations  &lt;/p&gt;

&lt;p&gt;Gemini may have been late to the game, but they’re showing deep usage from those who’ve adopted them.  &lt;/p&gt;

&lt;p&gt;It’s a strong comeback for Google—let’s see how Gemini shapes up in… &lt;a href="https://t.co/8DYZReRjar?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/8DYZReRjar&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867282248892047766?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 12, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;Cloud Providers: Partnerships Are Key  &lt;/p&gt;

&lt;p&gt;Azure is leading the charge, with 3x more penetration than AWS Bedrock.  &lt;/p&gt;

&lt;p&gt;But here’s the twist:&lt;br&gt;&lt;br&gt;
Bedrock started the year with 10x the adoption of Vertex AI, but by year-end, it’s only 2x ahead.  &lt;/p&gt;

&lt;p&gt;It’s clear now: your AI infrastructure is as… &lt;a href="https://t.co/XcKMK2Delt?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/XcKMK2Delt&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867282252763455771?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 12, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;Amazon Bedrock: Fewer Orgs, Deeper Usage  &lt;/p&gt;

&lt;p&gt;13% monthly growth in organizations&lt;br&gt;&lt;br&gt;
91% growth in requests  &lt;/p&gt;

&lt;p&gt;Bedrock isn’t catching up to OpenAI in reach, but when companies use it, they go all in.   &lt;/p&gt;

&lt;p&gt;High usage, fewer orgs—it’s all about depth here. &lt;a href="https://t.co/cH1TInSoJ7?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/cH1TInSoJ7&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867282256542257656?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 12, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;Vertex AI: Building Momentum Fast  &lt;/p&gt;

&lt;p&gt;43% monthly growth in requests  &lt;/p&gt;

&lt;p&gt;Vertex AI might’ve started slower, but it’s picking up steam. It’s catching up, just not there yet. &lt;a href="https://t.co/ogKPZ47yhl?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/ogKPZ47yhl&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867282260459978826?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 12, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;Multi-Provider Usage Is on the Rise  &lt;/p&gt;

&lt;p&gt;In just 10 months, the proportion of Portkey orgs using multiple providers jumped from 23% to 40%.  &lt;/p&gt;

&lt;p&gt;Why’s this happening?  &lt;/p&gt;

&lt;p&gt;Redundancy is cheap. Downtime isn’t.&lt;br&gt;&lt;br&gt;
After a year of outages and capacity issues, companies are finally realizing: you… &lt;a href="https://t.co/kGC8T3uVQV?ref=portkey.ai" rel="noopener noreferrer"&gt;pic.twitter.com/kGC8T3uVQV&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867282263823561048?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 12, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;blockquote&gt;
&lt;p&gt;Tomorrow: We’ll dive into how these providers actually perform in production. Spoiler alert: the numbers might surprise you. 😉  &lt;/p&gt;

&lt;p&gt;Stay tuned, and follow &lt;a href="https://twitter.com/PortkeyAI?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;@PortkeyAI&lt;/a&gt; for more insights from LLMs in Prod '24.&lt;/p&gt;

&lt;p&gt;— Portkey (@PortkeyAI) &lt;a href="https://twitter.com/PortkeyAI/status/1867282270303793304?ref_src=twsrc%5Etfw&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;December 12, 2024&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;


&lt;p&gt;Find the next post in the series on &lt;a href="https://x.com/PortkeyAI?ref=portkey.ai" rel="noopener noreferrer"&gt;Portkey's Twitter&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Portkey is Joining Hacktoberfest</title>
      <dc:creator>Vrushank</dc:creator>
      <pubDate>Fri, 11 Oct 2024 07:52:05 +0000</pubDate>
      <link>https://forem.com/portkey/portkey-is-joining-hacktoberfest-21nm</link>
      <guid>https://forem.com/portkey/portkey-is-joining-hacktoberfest-21nm</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vlcpbzywwn01mfzcmza.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vlcpbzywwn01mfzcmza.png" alt="Portkey is Joining Hacktoberfest" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Portkey today is one of the most popular and performant open source AI Gateways in the market, and the big reason for that is — &lt;strong&gt;&lt;a href="https://portkey.wiki/community?ref=portkey.ai" rel="noopener noreferrer"&gt;the Portkey Community&lt;/a&gt;&lt;/strong&gt;. We're thrilled to give back to the community with Hacktoberfest this year and reward top contributors with amazing prizes!&lt;/p&gt;

&lt;h3&gt;
  
  
  Quick Introduction to Portkey Gateway
&lt;/h3&gt;

&lt;p&gt;Our open source AI Gateway lets you connect to, load balance, set up fallbacks across, and seamlessly manage 250+ AI models through an OpenAI-compatible API. It was launched &lt;strong&gt;early this year&lt;/strong&gt;, and since then we've had 40+ contributors adding new providers and guardrails, and fixing bugs every day!&lt;/p&gt;
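<br>
&lt;p&gt;Because the Gateway speaks the OpenAI API, pointing an existing OpenAI client at it is usually all it takes. A sketch, assuming a locally running gateway on its default port (check the repo's README for the exact setup):&lt;/p&gt;
<br>
&lt;pre&gt;&lt;code&gt;from openai import OpenAI

# Assumes the gateway is running locally, e.g. via: npx @portkey-ai/gateway
client = OpenAI(
    api_key="YOUR_PROVIDER_API_KEY",
    base_url="http://localhost:8787/v1",  # the local gateway's endpoint
    default_headers={"x-portkey-provider": "openai"},  # which upstream to route to
)

reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Hello through the gateway!"}],
)
print(reply.choices[0].message.content)
&lt;/code&gt;&lt;/pre&gt;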

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F872tnsfq6p6ips8bkc7z.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F872tnsfq6p6ips8bkc7z.gif" alt="Portkey is Joining Hacktoberfest" width="800" height="489"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The project just crossed 6,000 stars on GitHub ⭐️&lt;/p&gt;

&lt;h3&gt;
  
  
  How You Can Contribute
&lt;/h3&gt;

&lt;p&gt;We are accepting Hacktoberfest contributions for the Gateway repo as well as for its docs!&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/Portkey-AI/gateway?ref=portkey.ai" rel="noopener noreferrer"&gt;Portkey Gateway&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Portkey-AI/docs-core?ref=portkey.ai" rel="noopener noreferrer"&gt;Portkey Docs&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Look for issues tagged with &lt;code&gt;hacktoberfest&lt;/code&gt; to find eligible contributions. We're particularly excited about contributions in these areas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;New guardrails and plugins for our guardrails library&lt;/li&gt;
&lt;li&gt;Integrations with additional LLM providers&lt;/li&gt;
&lt;li&gt;Documentation improvements and fixes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But don't let that limit you – we welcome contributions across all areas of our projects!&lt;/p&gt;

&lt;h3&gt;
  
  
  Prizes and Recognition
&lt;/h3&gt;

&lt;p&gt;Here's what's in store:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5q2tu8lx4uzpsy60ura.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi5q2tu8lx4uzpsy60ura.png" alt="Portkey is Joining Hacktoberfest" width="800" height="511"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;For top contributors:&lt;/strong&gt; A pair of AirPods Pro (2nd Generation)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;For all contributors with 2+ successful contributions:&lt;/strong&gt; Portkey AI Swag Box with T-shirts and more!&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Getting Started
&lt;/h3&gt;

&lt;p&gt;New to open source or Portkey? No worries! We've got you covered:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;a href="https://github.com/Portkey-AI/gateway/blob/main/plugins/Contributing.md?ref=portkey.ai" rel="noopener noreferrer"&gt;Guide for Contributing to Plugins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Portkey-AI/gateway/blob/b3ce2cc3d5e9e213c37fc19f25b38dddc54aaec9/.github/CONTRIBUTING.md?ref=portkey.ai#L4" rel="noopener noreferrer"&gt;General Contributing Guidelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.portkey.ai/?ref=portkey.ai" rel="noopener noreferrer"&gt;Portkey AI Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Support Options
&lt;/h3&gt;

&lt;p&gt;We're here to help you succeed in your contributions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://discord.gg/DD7vgKK299?ref=portkey.ai" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;&lt;/strong&gt;: Join our Portkey Discord server for real-time support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Portkey-AI/gateway/issues?ref=portkey.ai" rel="noopener noreferrer"&gt;GitHub Issues&lt;/a&gt;&lt;/strong&gt;: Post your questions in the Issues tab of our repo.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;a href="https://github.com/Portkey-AI/gateway/blob/main/.github/CONTRIBUTING.md?ref=portkey.ai" rel="noopener noreferrer"&gt;Documentation&lt;/a&gt;&lt;/strong&gt;: Check out our contributor guidelines for detailed information.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Remember, to have your PRs count towards Hacktoberfest:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;They must be merged, approved, or tagged with the &lt;code&gt;hacktoberfest-accepted&lt;/code&gt; label.&lt;/li&gt;
&lt;li&gt;Adhere to the &lt;a href="https://hacktoberfest.digitalocean.com/resources/qualitystandards?ref=portkey.ai" rel="noopener noreferrer"&gt;quality standards&lt;/a&gt; to avoid being marked as spam or invalid.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ready to get started? Check out our &lt;a href="https://github.com/Portkey-AI?ref=portkey.ai" rel="noopener noreferrer"&gt;GitHub repositories&lt;/a&gt; and join us in making AI development more accessible, reliable, and powerful for everyone!&lt;/p&gt;

&lt;p&gt;Happy Hacking!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How I Optimized My Prompts For E-commerce Search Using DSPy</title>
      <dc:creator>Vrushank</dc:creator>
      <pubDate>Tue, 03 Sep 2024 18:54:54 +0000</pubDate>
      <link>https://forem.com/portkey/dspy-in-production-gb3</link>
      <guid>https://forem.com/portkey/dspy-in-production-gb3</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;This is a guest post by &lt;a href="https://www.linkedin.com/in/ganarajpr/" rel="noopener noreferrer"&gt;Ganaraj&lt;/a&gt;, based on the &lt;a href="https://www.youtube.com/watch?v=_vGKSc1tekE" rel="noopener noreferrer"&gt;talk&lt;/a&gt; he gave at the LLMs in Prod Meetup in Bangalore.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;AI has fundamentally changed the way we build apps. I transitioned from being a coder to a creative prompter. This shift from coding to creative prompting might seem like a dream come true. However, the nerd inside me was unhappy.&lt;/p&gt;

&lt;h1&gt;
  
  
  Enter DSPy
&lt;/h1&gt;

&lt;p&gt;DSPy changes this dynamic entirely. Instead of obsessing over prompt crafting, it allows me to focus on what I do best: programming. I'm moving from writing prompts to actually programming AI pipelines.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft944aa11916tk4jq33ke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft944aa11916tk4jq33ke.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  What's DSPy All About?
&lt;/h1&gt;

&lt;p&gt;At its core, DSPy is a framework for optimizing how we work with LLMs. Instead of manually crafting the perfect prompt, DSPy lets us focus on building AI pipelines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You define the flow of your AI program.&lt;/li&gt;
&lt;li&gt;You set up metrics to measure what "good output" looks like for your task.&lt;/li&gt;
&lt;li&gt;DSPy handles the optimization of prompts and weights automatically.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It separates the flow of your program (modules) from the parameters of each step, which are tuned by optimizers (LLM-driven algorithms). An optimizer can tune the prompts and the weights of your LLM calls to maximize a metric you choose.&lt;/p&gt;
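<br>
&lt;p&gt;Here's a minimal sketch of that separation using DSPy's documented building blocks (the task, metric, and one-example trainset are illustrative):&lt;/p&gt;
<br>
&lt;pre&gt;&lt;code&gt;import dspy
from dspy.teleprompt import BootstrapFewShot

# Point DSPy at your LLM before compiling (needs OPENAI_API_KEY set).
dspy.settings.configure(lm=dspy.OpenAI(model="gpt-3.5-turbo"))

# 1. The flow: a module declares inputs and outputs, not prompt wording.
class AnswerQuestion(dspy.Signature):
    """Answer the question in one short sentence."""
    question = dspy.InputField()
    answer = dspy.OutputField()

program = dspy.Predict(AnswerQuestion)

# 2. The metric: what "good output" means for this task.
def exact_match(example, pred, trace=None):
    return example.answer.lower() == pred.answer.lower()

# 3. The optimizer tunes prompts and demos against the metric.
trainset = [dspy.Example(question="What is 2 plus 2?", answer="4").with_inputs("question")]
optimizer = BootstrapFewShot(metric=exact_match)
compiled_program = optimizer.compile(program, trainset=trainset)
&lt;/code&gt;&lt;/pre&gt;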

&lt;blockquote&gt;
&lt;p&gt;"Don't bother figuring out what special magic combination of words will give you the best performance for your task. Just develop a scoring metric then let the model optimize itself." - Battle at IEEE&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Why This Matters
&lt;/h1&gt;

&lt;p&gt;DSPy stands out in three crucial areas:&lt;/p&gt;

&lt;p&gt;a) &lt;strong&gt;LLM-agnostic&lt;/strong&gt;: You can switch between models without changing the entire system. Moving from GPT-4 to Mistral-7B doesn't mean rewriting my app from scratch.&lt;/p&gt;

&lt;p&gt;b) &lt;strong&gt;Complexity management&lt;/strong&gt;: It breaks down tasks into manageable chunks, each with its own optimization.&lt;/p&gt;

&lt;p&gt;c) &lt;strong&gt;Automated optimization&lt;/strong&gt;: No more manual prompt tweaking.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffiwyvq2ef3hdvmefq9wn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffiwyvq2ef3hdvmefq9wn.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"DSPy allowed us to beat GPT-4 performance on this task of extracting data from messy tables, at 10x lower cost per table and 10x lower manual effort" - Gradient AI&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  DSPy in Action: The Zoro UK Case Study
&lt;/h1&gt;

&lt;p&gt;I work as a Lead Architect at Zoro UK, where we've scaled DSPy to production. We use DSPy to solve complex challenges in e-commerce - specifically, normalizing product attributes across millions of items from hundreds of suppliers.&lt;/p&gt;

&lt;p&gt;One of our biggest problems is that we have more than 300 suppliers, and they all supply similar kinds of products. For example, a simple attribute like thread length can be represented in various ways by different suppliers - some might use '25.4 mm', others '25.4 M', and some might use inches.&lt;/p&gt;

&lt;p&gt;To address this challenge, we (the engineering team) orchestrated a system using DSPy:&lt;/p&gt;

&lt;h2&gt;
  
  
  LLM Sorter Service
&lt;/h2&gt;

&lt;p&gt;At the heart of the system, this service leveraged DSPy's capabilities. It implemented a tiered approach (a rough sketch follows the list below):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A smaller model (e.g., Mistral) for deciding if a set of attribute values needs to be sorted by an LLM. In simple cases, we can just use an alphanumeric sort, so the LLM doesn't need to be involved, thereby saving cost.&lt;/li&gt;
&lt;li&gt;A more powerful model (e.g., GPT-4) for complex, nuanced attribute value sorting.&lt;/li&gt;
&lt;/ul&gt;
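<br>
&lt;p&gt;As a rough illustration of that tiering (here a regex heuristic stands in for the smaller classifier model, and &lt;code&gt;llm_sort&lt;/code&gt; is a hypothetical stand-in for the GPT-4-backed sorter):&lt;/p&gt;
<br>
&lt;pre&gt;&lt;code&gt;import re

# Matches plain "number + optional unit" values like "25.4 mm".
SIMPLE = re.compile(r"^[0-9]+(\.[0-9]+)?\s*(mm|cm|m|in)?$", re.IGNORECASE)

def sort_attribute_values(values, llm_sort):
    # Tier 1: if every value is a plain number with a unit, a cheap
    # numeric sort is enough; no LLM call, no cost.
    if all(SIMPLE.match(v.strip()) for v in values):
        return sorted(values, key=lambda v: float(re.findall(r"[0-9.]+", v)[0]))
    # Tier 2: messy or mixed-unit values go to the LLM-backed sorter.
    return llm_sort(values)
&lt;/code&gt;&lt;/pre&gt;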

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj43y7bhxfjay9zvm0zsb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj43y7bhxfjay9zvm0zsb.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflow
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;The Attribute Sorting Job extracts raw attribute data from MongoDB.&lt;/li&gt;
&lt;li&gt;The LLM Sorter Service processes these attributes using the appropriate model based on task complexity.&lt;/li&gt;
&lt;li&gt;DSPy's optimization features continuously refine the processing based on defined metrics.&lt;/li&gt;
&lt;li&gt;Processed and standardized attributes are then fed back into MongoDB.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We first started with one model, but for some use cases that was not enough. So we switched those over to OpenAI GPT-4, and it started working well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=_vGKSc1tekE" rel="noopener noreferrer"&gt;Hear more from my recent talk at Portkey's LLMs in Prod event in Bengaluru&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Slides:  &lt;a href="https://ggl.link/dspy-in-prod" rel="noopener noreferrer"&gt;https://ggl.link/dspy-in-prod&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Smrthi project with translations produced using DSPy: &lt;a href="https://www.smrthi.com/" rel="noopener noreferrer"&gt;https://www.smrthi.com/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Awesome DSPy repo: &lt;a href="https://github.com/ganarajpr/awesome-dspy" rel="noopener noreferrer"&gt;https://github.com/ganarajpr/awesome-dspy&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Metrics in DSPy
&lt;/h1&gt;

&lt;p&gt;One of the most intriguing aspects of DSPy is its approach to metrics. Defining the right metric is about more than just accuracy; it's about encapsulating the nuanced goals of your AI system into a quantifiable measure. This process forces clarity of thought that often leads to better system design overall.&lt;/p&gt;

&lt;p&gt;Metrics can range from simple measures like accuracy, exact match, or F1 score to more sophisticated approaches such as cosine similarity or deep eval, depending on the complexity of the task.&lt;/p&gt;
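<br>
&lt;p&gt;In DSPy, a metric is just a Python function over an example and a prediction. For instance, a simple exact-match metric for the attribute task might look like this (the field names are illustrative):&lt;/p&gt;
<br>
&lt;pre&gt;&lt;code&gt;def attribute_match(example, pred, trace=None):
    # Score 1.0 when the normalized attribute equals the gold label, else 0.0.
    return float(example.normalized.strip().lower() == pred.normalized.strip().lower())
&lt;/code&gt;&lt;/pre&gt;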

&lt;h1&gt;
  
  
  FAQs with DSPy
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;1) How does DSPy enhance scalability in production?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;DSPy's optimizers automatically adapt your pipeline to handle increased data volume and complexity, reducing the need for manual prompt engineering as your system scales.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2) What's DSPy's approach to model portability?&lt;/strong&gt;&lt;br&gt;
DSPy allows seamless switching between models (e.g., GPT-4 to Llama 3.1) without rewriting your core logic, crucial for adapting to changing production requirements or cost structures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3) How does DSPy help manage costs in large-scale deployments?&lt;/strong&gt;&lt;br&gt;
By optimizing prompts and potentially using smaller models more effectively, DSPy can significantly reduce API calls and associated costs in high-volume production environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4) How does DSPy handle performance monitoring and observability in live systems?&lt;/strong&gt;&lt;br&gt;
You can use AI Gateways like Portkey to have an observability layer on top of DSPy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5) What's the typical compile-time overhead in a production setting?&lt;/strong&gt;&lt;br&gt;
Compile time varies based on complexity. For reference, a moderately complex program might take anywhere from 5mins-1hr depending on the number of examples and iterations used.&lt;/p&gt;

&lt;h1&gt;
  
  
  What’s next?
&lt;/h1&gt;

&lt;p&gt;For those working on AI systems, especially in production environments, it's time to look at DSPy. The challenges it addresses - scalability, consistency, and adaptability - are only going to become more pressing as AI continues to permeate every aspect of our digital infrastructure. The future of programming isn't about arguing with AI—it's about orchestrating it, and with tools like DSPy, we're well-equipped to lead this transformation.&lt;/p&gt;

</description>
      <category>dspy</category>
      <category>portkey</category>
      <category>llm</category>
      <category>prompts</category>
    </item>
    <item>
      <title>Three Prompt Libraries you should know as an AI Engineer</title>
      <dc:creator>Vrushank</dc:creator>
      <pubDate>Thu, 11 Jul 2024 09:54:15 +0000</pubDate>
      <link>https://forem.com/portkey/three-prompt-libraries-you-should-know-as-a-ai-engineer-32m8</link>
      <guid>https://forem.com/portkey/three-prompt-libraries-you-should-know-as-a-ai-engineer-32m8</guid>
      <description>&lt;p&gt;As developers we write code to develop logic that eventually helps solve larger problems or automate a workflow that is unproductive for humans. &lt;/p&gt;

&lt;p&gt;When LLMs came into picture, prompting obviously became famous. Prompt Engineering became a art! &lt;/p&gt;

&lt;p&gt;Prompting became one of the key components in Generative AI and so the use of prompt libraries. These libraries provide predefined prompts that can be used to train AI models, making the development process more efficient and effective.&lt;/p&gt;

&lt;p&gt;In this Blog, we’ll explore What is a Prompt library, how it boosts our workflow, Is it safe or not? Finally we will take a look at three Prompt Libraries to maximise productivity as an AI Engineer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7w7odhq26jfmdmap458.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx7w7odhq26jfmdmap458.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Prompt Library?
&lt;/h2&gt;

&lt;p&gt;A prompt library is not just a repository for prompts; it serves as a powerful solution for collaboration and knowledge sharing within your organisation.&lt;/p&gt;

&lt;p&gt;Prompt libraries provide centralised platforms to store, organise, and access AI prompts, enabling teams to collaborate and streamline workflows. Therefore, the overarching purpose of a prompt library is to improve efficiency, performance, and collaboration. &lt;/p&gt;

&lt;p&gt;It enables your teams to discover and reuse prompts rapidly, avoiding duplicate work and accelerating development cycles. By providing access to highly optimised, pre-tested prompts, a prompt library ensures that the output quality of your projects is consistently high.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Do Prompt Libraries Boost Your Workflow?
&lt;/h2&gt;

&lt;p&gt;Prompt libraries can significantly streamline AI development by providing ready-to-use prompts that can be easily integrated into your projects. &lt;/p&gt;

&lt;p&gt;Here are some ways prompt libraries can enhance your workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Simplified Task Execution: Prompt libraries provide a collection of predefined prompts we can use for various tasks such as text generation, sentiment analysis, and more. With this, we don't have to create prompts from scratch.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Increased Productivity: Focus on higher-level tasks rather than spending time on prompt creation. This improves the overall productivity of the team.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Consistency and Quality: Prompt libraries ensure consistency in the prompts used across different projects. This consistency helps to produce higher quality AI outputs and reduces the chances of errors.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Is it Safe to Use AI Prompt Libraries?
&lt;/h2&gt;

&lt;p&gt;While prompt libraries offer numerous benefits, it is important to consider their safety and reliability. Here are some points we should keep in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Potential Risks: Using pre-defined prompts may introduce biases or inaccuracies if the prompts are not well-designed. It is crucial to review and test the prompts thoroughly before using them in production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Best Practices: To ensure safe and ethical use of AI prompt libraries, follow best practices such as regularly updating the libraries, validating the prompts, and monitoring the AI outputs for any anomalies.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Reliability: Choose prompt libraries from reputable sources and communities. This ensures that the prompts are well-maintained and updated regularly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Three Prompt Libraries
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://github.com/anysphere/priompt" rel="noopener noreferrer"&gt;Priompt&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Priompt is a prompt library designed for creating prompts specifically for large language models (LLMs). It uses JSX syntax, similar to what we use in React development, to structure our prompts.&lt;/p&gt;

&lt;p&gt;Here are some key features of Priompt: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;JSX-based syntax: This makes building prompts more intuitive and easier to read, especially for those familiar with React. &lt;/li&gt;
&lt;li&gt;Priorities: Priompt lets us define priorities for different parts of a prompt. When the context window fills up, lower-priority parts are dropped first, so the most important information always makes it in. &lt;/li&gt;
&lt;li&gt;Control flow: Control-flow components let you shape how information flows through your prompt. For instance, you can define fallbacks or shorten prompts that become too long.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Priompt aims to streamline the process of designing prompts for LLMs by providing a familiar and structured approach.&lt;/p&gt;
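
&lt;p&gt;Here’s a minimal sketch of what a Priompt-style prompt can look like. The package name, the &lt;code&gt;render()&lt;/code&gt; options, and the priority prop are assumptions based on the project’s README, so treat it as illustrative rather than exact:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Illustrative sketch only: the package name, render() options, and the
// p= (priority) prop are assumptions drawn from Priompt's README.
import { render, SystemMessage, UserMessage } from "priompt";

function PartyPrompt(props) {
  return (
    &amp;lt;&amp;gt;
      &amp;lt;SystemMessage&amp;gt;You are an experienced event planner.&amp;lt;/SystemMessage&amp;gt;
      {/* lower-priority content is dropped first when the context overflows */}
      &amp;lt;scope p={5}&amp;gt;Earlier conversation history could go here.&amp;lt;/scope&amp;gt;
      &amp;lt;UserMessage&amp;gt;{props.request}&amp;lt;/UserMessage&amp;gt;
    &amp;lt;/&amp;gt;
  );
}

// Render to a concrete prompt that fits the model's context window
const prompt = await render(PartyPrompt({ request: "Plan a birthday party" }), {
  tokenLimit: 4096,
  tokenizer: "cl100k_base",
});
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;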

&lt;h3&gt;
  
  
  &lt;a href="https://www.promptfoo.dev/" rel="noopener noreferrer"&gt;Promptfoo&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;Promptfoo is an open-source toolkit designed to help developers improve the performance of large language models (LLMs) through prompt engineering.&lt;/p&gt;

&lt;p&gt;Here are some of the key features of Promptfoo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Systematic Evaluation: Promptfoo allows us to establish benchmarks and test cases to systematically evaluate the outputs of LLMs. This eliminates the need for time-consuming trial-and-error approaches. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Side-by-Side Comparisons: It enables you to compare the outputs of various prompts and see which ones generate the best results for your specific use case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Automatic Scoring: It can automatically score the outputs of LLMs based on the metrics you define. This helps you objectively assess the quality and effectiveness of the LLM's responses.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Multiple LLM Support: Promptfoo works with a wide range of LLM APIs, including OpenAI, Anthropic, Azure, Google, and HuggingFace.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, Promptfoo offers a structured approach to prompt engineering, helping developers build reliable prompts and tune their LLM applications for specific use cases.&lt;/p&gt;
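
&lt;p&gt;Here’s a small sketch of what an evaluation can look like with Promptfoo’s Node API. The &lt;code&gt;evaluate()&lt;/code&gt; call and option names are based on Promptfoo’s docs; verify them against the current documentation:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch only: evaluate() and the option names below are assumptions
// drawn from promptfoo's docs; check the current version before use.
import promptfoo from "promptfoo";

const results = await promptfoo.evaluate({
  // two candidate prompts compared side by side
  prompts: [
    "Summarize in one sentence: {{text}}",
    "Explain to a beginner: {{text}}",
  ],
  providers: ["openai:gpt-3.5-turbo"],
  tests: [
    {
      vars: { text: "LLMs generate text one token at a time." },
      // automatic scoring: fail the output if it misses the key term
      assert: [{ type: "contains", value: "token" }],
    },
  ],
});

console.log(JSON.stringify(results, null, 2));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;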

&lt;h3&gt;
  
  
  &lt;a href="https://www.prompthub.us/" rel="noopener noreferrer"&gt;PromptHub&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;PromptHub is a platform designed to specifically address prompt testing and evaluation for large language models.&lt;/p&gt;

&lt;p&gt;Here are some of the key features of PromptHub:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Prompt Collection: Provides a library of pre-built prompts for common Natural Language Processing (NLP) tasks like text summarization, question answering, and code generation. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Prompt Testing: Allows you to test your own prompts or those from the library with different LLMs. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Evaluation Metrics: Offers various metrics to assess prompt performance, such as accuracy, relevance, and coherence of LLM outputs. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Hyperparameter Tuning: Enables you to experiment with different hyperparameters within a prompt (e.g., wording, examples) to optimize LLM performance. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Collaboration Features: May provide functionalities for sharing prompts and test results with team members (depending on the specific offering).&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Overall, PromptHub is a valuable tool for those working with LLMs and prompt engineering. It streamlines the process of testing and evaluating prompts, leading to better-performing LLMs for various NLP tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  To Summarise:
&lt;/h2&gt;

&lt;p&gt;Prompt Libraries play a vital role in enhancing the efficiency and effectiveness of Generative AI App development. By providing ready-to-use prompts, these libraries can simplify tasks, increase productivity, and ensure consistency and quality in AI outputs. &lt;/p&gt;

&lt;p&gt;We at Portkey have been building an &lt;a href="https://github.com/Portkey-AI/gateway" rel="noopener noreferrer"&gt;open-source AI Gateway&lt;/a&gt; that helps you build resilient LLM-powered applications in production. Join our community of AI practitioners to learn together and share more interesting updates.   &lt;/p&gt;

&lt;p&gt;Happy Building!&lt;/p&gt;

</description>
      <category>generativeai</category>
      <category>prompting</category>
      <category>webdev</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Stream LLM Responses from Cache</title>
      <dc:creator>Vrushank</dc:creator>
      <pubDate>Wed, 06 Mar 2024 10:38:45 +0000</pubDate>
      <link>https://forem.com/portkey/stream-llm-responses-from-cache-5f5o</link>
      <guid>https://forem.com/portkey/stream-llm-responses-from-cache-5f5o</guid>
<description>&lt;p&gt;LLMs become more expensive as your app consumes more tokens. Portkey's AI gateway allows you to cache LLM responses and serve users from the cache to save costs. Here's the best part: caching now works with streaming enabled.&lt;/p&gt;

&lt;p&gt;Streams are an efficient way to work with large responses because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;They reduce the perceived latency when users are using your app.&lt;/li&gt;
&lt;li&gt;Your app doesn't have to buffer the entire response in memory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's check out how to get cached responses to your app through streams, chunk by chunk. Every time Portkey serves a request from the cache, you save on token costs.&lt;/p&gt;

&lt;p&gt;With streaming and caching enabled, we will make a chat completion call to OpenAI through Portkey. &lt;/p&gt;

&lt;p&gt;Import and instantiate Portkey.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Portkey&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;portkey-ai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;portkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Portkey&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PORTKEYAI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;virtualKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;:{&lt;/span&gt;
      &lt;span class="na"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;semantic&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;apiKey&lt;/th&gt;
&lt;th&gt;Sign up for Portkey and copy API key&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;virtualKey&lt;/td&gt;
&lt;td&gt;Securely store in the vault and reference it using Virtual Keys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;config&lt;/td&gt;
&lt;td&gt;Pass configurations to enable caching&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Our app will list the tasks to help with planning a birthday party.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are very good program manager and have organised many events before. You can break every task in simple and means for others to pick it up.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Help me plan a birthday party for my 8 yr old kid?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Portkey follows the same signature as OpenAI's SDK, so enabling streaming is as simple as passing the &lt;code&gt;stream: true&lt;/code&gt; option.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;portkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;});&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;stdout&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]?.&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Errors usually happen:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can iterate over the response object, processing each chunk and presenting it to the user as soon as it's received. &lt;/p&gt;

&lt;p&gt;Here's a tip: you can bypass a cache HIT and force a fresh response by passing &lt;code&gt;cacheForceRefresh&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;  &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;portkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;cacheForceRefresh&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Combined with caching, streaming delivers a smoother user experience while keeping your app's memory usage efficient. &lt;/p&gt;

&lt;p&gt;Put this into practice today!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Open Source Models for your next AI use-case</title>
      <dc:creator>Vrushank</dc:creator>
      <pubDate>Wed, 06 Mar 2024 10:35:06 +0000</pubDate>
      <link>https://forem.com/portkey/open-source-models-for-your-next-ai-use-case-4le6</link>
      <guid>https://forem.com/portkey/open-source-models-for-your-next-ai-use-case-4le6</guid>
<description>&lt;p&gt;Open-source models offer greater control, tools for quality improvement, and cost savings. They are becoming a favored choice for developers building apps. This blog looks at popular models and the most straightforward way to use them in your apps: consuming inference endpoints. &lt;/p&gt;

&lt;p&gt;Inference APIs allow you to generate predictions, decisions, or outputs from a trained AI model. Providers like Anyscale, Perplexity &amp;amp; TogetherAI expose open-source models through inference endpoints that your apps can call to get responses from LLMs. We will use the &lt;a href="https://portkey.ai/docs/api-reference/portkey-sdk-client"&gt;Portkey SDK&lt;/a&gt; to make API calls to inference endpoints; since it follows the OpenAI signature, we can switch between LLMs simply by swapping the model name and endpoint URL in our app.&lt;/p&gt;

&lt;h2&gt;
  
  
  Plan a birthday party.
&lt;/h2&gt;

&lt;p&gt;Suppose you are building an app that suggests steps for planning a birthday party. It should give users a checklist of items to take care of to organize a successful birthday party. &lt;/p&gt;

&lt;p&gt;Here's how your implementation might look:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Construct a prompt properly with a suitable system &amp;amp; user query.&lt;/li&gt;
&lt;li&gt;Make an API call to the LLM.&lt;/li&gt;
&lt;li&gt;Transform the response to be suitable for our app.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://portkey.ai/"&gt;Portkey&lt;/a&gt; acts as the control panel. To avoid managing multiple API keys (Anyscale, Perplexity, and TogetherAI), securely save them in the vault and reference them using virtual keys.&lt;/p&gt;

&lt;p&gt;Import portkey-ai and compose your prompt.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;Portkey&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;portkey-ai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;portkey&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Portkey&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;PORTKEYAI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;virtualKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;TOGETHERAI_VIRTUAL_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="c1"&gt;// or Anyscale or Perplexity&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are very good program manager and have organised many events before. You can break every task in simple and means for others to pick it up. You each step as short as possible. Keep the response under 1000 words.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Help me plan a birthday party?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;];&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make the chat completions call to llama-2-70b-chat.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;  &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;portkey&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;togethercomputer/llama-2-70b-chat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will notice &lt;a href="https://portkey.ai/docs/product/observability-modern-monitoring-for-llms/logs"&gt;logs&lt;/a&gt; for every request on Portkey’s control panel, with useful data such as the timestamp, request type, LLM used, tokens generated, and cost.&lt;/p&gt;

&lt;p&gt;When instantiating Portkey, choose a provider by passing its virtual key (Perplexity or Anyscale, for example), then pass any of that provider's model names in the chat completion call, as in the sketch below. See the &lt;a href="https://portkey.ai/docs/welcome/readme#ai-providers-supported"&gt;complete list of models supported&lt;/a&gt; through Portkey.&lt;/p&gt;
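
&lt;p&gt;Here's a minimal sketch of pointing the same call at Perplexity instead of TogetherAI. The virtual key name and the model id below are illustrative values, not fixed identifiers:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Switching providers is just a different virtual key and model name.
// PERPLEXITY_VIRTUAL_KEY and the model id below are illustrative.
const portkeyPerplexity = new Portkey({
  apiKey: process.env.PORTKEYAI_API_KEY,
  virtualKey: process.env.PERPLEXITY_VIRTUAL_KEY, // instead of TOGETHERAI_VIRTUAL_KEY
});

var response = await portkeyPerplexity.chat.completions.create({
  messages,
  model: "pplx-70b-chat", // pick any model from the supported list
  max_tokens: 1000,
});
console.info(response.choices[0].message.content);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;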

&lt;h2&gt;
  
  
  Inference Engines
&lt;/h2&gt;

&lt;p&gt;Different LLM providers can serve the same model to our applications. For example, Llama 2 is available on both Anyscale and TogetherAI. Although the model is the same, the inference engines serving it are different, and these engines handle every API call our app makes through the inference endpoints. Since each engine is optimized differently for performance and quality, consider these differences as you finalize the model best suited to your app. &lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Need suggestions to plan a birthday party? You can get them from different LLMs through the inference endpoints of various providers. We explored how a query, a prompt, an LLM, an LLM provider, and Portkey come together in a chat completion call. Have fun experimenting with the many prompts, language models, and features available now!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>OpenAI Model Deprecation Guide</title>
      <dc:creator>Vrushank</dc:creator>
      <pubDate>Wed, 03 Jan 2024 13:59:47 +0000</pubDate>
      <link>https://forem.com/portkey/openai-model-deprecation-guide-aaj</link>
      <guid>https://forem.com/portkey/openai-model-deprecation-guide-aaj</guid>
<description>&lt;p&gt;On Jan 4, OpenAI will retire 33 models, including GPT-3 (&lt;code&gt;text-davinci-003&lt;/code&gt;). This is OpenAI's biggest model deprecation so far. Here's what you need to know:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GPT-3 Model Retirement&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;text-davinci-003&lt;/code&gt; model (commonly known as &lt;code&gt;GPT-3&lt;/code&gt;) will be unavailable from Jan 4.&lt;br&gt;
→ You must manually transition to the replacement model, &lt;code&gt;gpt-3.5-turbo-instruct&lt;/code&gt; .&lt;/p&gt;
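
&lt;p&gt;Migrating is usually just a change of model id in the same completions call. Here's a minimal sketch using the OpenAI Node SDK (v4-style API shown; adjust to your SDK version):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;// Sketch of the migration using the openai Node SDK (v4-style API).
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

const completion = await openai.completions.create({
  model: "gpt-3.5-turbo-instruct", // was: "text-davinci-003"
  prompt: "Write a haiku about model deprecations.",
  max_tokens: 64,
});

console.log(completion.choices[0].text);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;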

&lt;p&gt;&lt;strong&gt;/fine-tunes Endpoint Retirement&lt;/strong&gt;&lt;br&gt;
The &lt;code&gt;ada&lt;/code&gt;, &lt;code&gt;babbage&lt;/code&gt;, &lt;code&gt;curie&lt;/code&gt;, and &lt;code&gt;davinci&lt;/code&gt; models on the &lt;code&gt;/fine-tunes&lt;/code&gt; endpoint will be retired on Jan 4.&lt;br&gt;
→ You must manually transition to the new endpoint and models, on &lt;code&gt;/fine-tuning&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Old Embedding Models Retirement&lt;/strong&gt;&lt;br&gt;
All older embedding models except &lt;code&gt;text-embedding-ada-002&lt;/code&gt; will be shut down on Jan 4.&lt;br&gt;
→ You must manually transition to the new model &lt;code&gt;text-embedding-ada-002&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No Deprecations for Chat Models&lt;/strong&gt;&lt;br&gt;
All &lt;code&gt;GPT-4&lt;/code&gt; and &lt;code&gt;gpt-3.5-turbo&lt;/code&gt; models remain active. The earliest deprecation is happening on June 13th for &lt;code&gt;gpt-4-0314&lt;/code&gt;, &lt;code&gt;gpt-4-0613&lt;/code&gt;, &lt;code&gt;gpt-3.5-turbo-0613&lt;/code&gt;, &lt;code&gt;gpt-3.5-turbo-0301&lt;/code&gt;. Azure OpenAI will deprecate the &lt;code&gt;-0314&lt;/code&gt; and &lt;code&gt;-0301&lt;/code&gt; models on July 5th. (Azure shutdown for &lt;code&gt;-0613&lt;/code&gt; models is to be announced.)&lt;br&gt;
→ It is recommended to start shifting your &lt;code&gt;gpt-4&lt;/code&gt; and &lt;code&gt;gpt-3.5-turbo&lt;/code&gt; workloads to the newer, cheaper &lt;code&gt;-1106&lt;/code&gt; models.&lt;/p&gt;

&lt;p&gt;We've put all of these updates in a simple-to-understand OpenAI Model Map. (Click on the image to expand)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Chat Models&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YmeoiFVM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rf2nfwfr264z54uznpti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YmeoiFVM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rf2nfwfr264z54uznpti.png" alt="Image description" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;For Text models &amp;amp; /fine-tunes&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qBnBIysJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2i2kd5ffdz4a3tlrm7j3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qBnBIysJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2i2kd5ffdz4a3tlrm7j3.png" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>openai</category>
      <category>gpt3</category>
      <category>deprecation</category>
      <category>gpt4</category>
    </item>
    <item>
      <title>Find AI Grants, Compute Credits, and Investments</title>
      <dc:creator>Vrushank</dc:creator>
      <pubDate>Wed, 16 Aug 2023 20:51:22 +0000</pubDate>
      <link>https://forem.com/vrv/find-ai-grants-compute-credits-and-investments-35e8</link>
      <guid>https://forem.com/vrv/find-ai-grants-compute-credits-and-investments-35e8</guid>
      <description>&lt;p&gt;Did you know? Over $8 Million USD in investments &amp;amp; credits are up for grabs for early-stage AI builders today.&lt;/p&gt;

&lt;p&gt;But all this information is scattered across the web, making it challenging to track down. On top of that, if you want to check your eligibility for any program, you again have to browse various FAQs, rules (or sometimes even clarifications on Twitter) to make sense of it all.&lt;/p&gt;

&lt;p&gt;We've tried to simplify it for you in the AI Grants Finder. In the tool, you can easily browse:&lt;br&gt;
🎖️ Program perks&lt;br&gt;
📝 List of qualifiers&lt;br&gt;
🌍 Accepted countries&lt;br&gt;
📅 Deadlines&lt;br&gt;
all in a single place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Check it out here: &lt;a href="https://grantsfinder.portkey.ai/"&gt;https://grantsfinder.portkey.ai/&lt;/a&gt;
&lt;/h2&gt;

&lt;p&gt;Give it a whirl, and if you stumble upon a game-changing program for your builder journey, we'd love to hear about it🙏&lt;/p&gt;

</description>
      <category>ai</category>
      <category>investment</category>
      <category>azure</category>
      <category>aws</category>
    </item>
  </channel>
</rss>
