<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Drishti Shah (Ext)</title>
    <description>The latest articles on Forem by Drishti Shah (Ext) (@drishti_shah_portkey).</description>
    <link>https://forem.com/drishti_shah_portkey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2152730%2Fd37f52c6-3fe2-4d33-aee0-339a18c4899b.png</url>
      <title>Forem: Drishti Shah (Ext)</title>
      <link>https://forem.com/drishti_shah_portkey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/drishti_shah_portkey"/>
    <language>en</language>
    <item>
      <title>FrugalGPT: Reducing LLM Costs &amp; Improving Performance</title>
      <dc:creator>Drishti Shah (Ext)</dc:creator>
      <pubDate>Mon, 07 Oct 2024 16:56:37 +0000</pubDate>
      <link>https://forem.com/portkey/frugalgpt-reducing-llm-costs-improving-performance-25h2</link>
      <guid>https://forem.com/portkey/frugalgpt-reducing-llm-costs-improving-performance-25h2</guid>
      <description>&lt;p&gt;FrugalGPT is a framework proposed by Lingjiao Chen, Matei Zaharia, and James Zou from Stanford University in their 2023 paper "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance". The paper outlines strategies for more cost-effective and performant usage of large language model (LLM) APIs.&lt;/p&gt;

&lt;p&gt;A year after its initial publication, FrugalGPT remains highly relevant and widely discussed in the AI community. Its enduring popularity stems from the pressing need to make LLM API usage more affordable and efficient as these models grow larger and more expensive.&lt;/p&gt;

&lt;p&gt;The core of FrugalGPT revolves around three key techniques for reducing LLM inference costs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prompt Adaptation - Using concise, optimized prompts to minimize prompt processing costs&lt;/li&gt;
&lt;li&gt;LLM Approximation - Utilizing caches and model fine-tuning to avoid repeated queries to expensive models&lt;/li&gt;
&lt;li&gt;LLM Cascade - Dynamically selecting the optimal set of LLMs to query based on the input&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The authors demonstrate the potential of these techniques, showing that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this post, we'll delve into the practical implementation of these FrugalGPT strategies. We'll provide concrete code examples of how you can employ prompt adaptation, LLM approximation, and LLM cascade in your own AI applications to get the most out of LLMs while managing costs effectively. By adopting FrugalGPT techniques, you can significantly reduce your LLM operating expenses without sacrificing performance.&lt;/p&gt;

&lt;p&gt;Let's put the theory of FrugalGPT into practice:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Prompt Adaptation
&lt;/h2&gt;

&lt;p&gt;FrugalGPT wants us to either reduce the size of the prompt OR combine similar prompts together. The core idea is to minimize tokens and thus reduce LLM costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Decrease the prompt size
&lt;/h3&gt;

&lt;p&gt;FrugalGPT proposes that instead of sending a lot of examples in a few-shot prompt, we could pick and choose the best ones, and thus reduce the prompt size.&lt;/p&gt;

&lt;p&gt;While many larger models today don't strictly need few-shot prompts, the technique still significantly increases accuracy across multiple tasks.&lt;/p&gt;

&lt;p&gt;Let's take a classification example where we want to classify our incoming email based on the body into folders - Personal, Work, Events, Updates, Todos.&lt;/p&gt;

&lt;p&gt;A few-shot prompt would look something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[{"role": "user", "content": "Identify the intent of the incoming email by its summary and classify it as Personal, Work, Events, Updates or Todos"},
{{examples}},
{"role": "user", "content": "Email: {{email_body}}"}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where each example is formatted as a user/assistant message pair, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[{"role": "user", "content": "Email: Hey, this is a reminder from your future self to take flowers home today"},
{"role": "assistant", "content": "Personal, Todos"}]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prompt with 20 examples is approximately 623 tokens and would cost 0.1 cents per request on &lt;code&gt;gpt-3.5-turbo&lt;/code&gt;. Also, we might want to keep adding examples when emails are mislabeled to improve accuracy. This would further increase the cost of the prompt tokens.&lt;/p&gt;
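&lt;p&gt;To sanity-check these numbers, here's a minimal sketch of the arithmetic. The $0.0015-per-1K-input-tokens price for gpt-3.5-turbo is an assumption based on pricing at the time of writing:&lt;/p&gt;

```python
def prompt_cost_cents(num_tokens: int, price_per_1k_usd: float = 0.0015) -> float:
    """Input-token cost of a prompt, in cents (price is an assumed rate)."""
    return num_tokens / 1000 * price_per_1k_usd * 100

# 623 prompt tokens -> roughly 0.09 cents, i.e. ~0.1 cents per request
print(round(prompt_cost_cents(623), 3))
```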

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvrd69kjufdur5birezb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvrd69kjufdur5birezb.png" alt="prompt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;FrugalGPT suggests identifying the best examples to be used instead of all of them.&lt;/p&gt;

&lt;p&gt;In this case, we could do a semantic similarity test between the incoming email and the example email bodies. Then, only pick the top k similar examples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

def get_embeddings(texts):
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Use the [CLS] token embedding (mean pooling is also common for this model)
    embeddings = model_output.last_hidden_state[:, 0, :]
    return embeddings.numpy()

# Assuming examples is a list of dictionaries with 'email_body' and 'labels' keys
example_embeddings = get_embeddings([ex['email_body'] for ex in examples]) 

def select_best_examples(query, examples, k=5):
    query_embedding = get_embeddings([query])[0]
    similarities = cosine_similarity([query_embedding], example_embeddings)[0]
    top_k_indices = np.argsort(similarities)[-k:]
    return [examples[i] for i in top_k_indices]

# Example usage
query_email = "Don't forget our lunch meeting at the new Italian place today at noon!"
best_examples = select_best_examples(query_email, examples)
few_shot_prompt = generate_prompt(best_examples, query_email) # Omitted for brevity
# Few-shot prompt now contains only the most relevant examples, reducing token count and cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we pick only the top 5, we reduce our prompt token cost to 0.03 cents which is already a 70% reduction.&lt;/p&gt;

&lt;p&gt;This technique works especially well in production scenarios that rely on few-shot prompts for high accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Combine similar requests together
&lt;/h3&gt;

&lt;p&gt;LLMs can handle multiple tasks within a single context, and FrugalGPT proposes exploiting this by grouping several requests together, eliminating the redundant prompt examples that would otherwise be repeated in each request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feooabx72n27omltgxnun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feooabx72n27omltgxnun.png" alt="combine_requests"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the same example as above, we could now try to classify multiple emails in a single request by tweaking the prompt like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    {"role": "user", "content": "Identify the intent of each incoming email by its summary and classify it as 'Personal', 'Work', 'Events', 'Updates' or 'Todos'"},
    {{examples}},
    {"role": "user", "content": "Emails:\n - {{email_body_1}}\n - {{email_body_2}}\n - {{email_body_3}}"}
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, modify the few-shot examples to be in batches of 2 &amp;amp; 3.&lt;/p&gt;

&lt;p&gt;This reduces the cost for 3 requests from 0.06 cents to 0.03 cents, a 50% decrease.&lt;/p&gt;

&lt;p&gt;This approach is particularly useful when processing data in batches using a few-shot prompt.&lt;/p&gt;
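&lt;p&gt;A minimal sketch of the batching step, assuming the prompt template above (the helper names are illustrative):&lt;/p&gt;

```python
def batch_emails(emails, batch_size=3):
    """Split a list of email bodies into batches of at most batch_size."""
    return [emails[i:i + batch_size] for i in range(0, len(emails), batch_size)]

def build_batched_message(email_batch):
    """Format one user message listing several emails, as in the prompt above."""
    body = "\n".join(f" - {email}" for email in email_batch)
    return {"role": "user", "content": f"Emails:\n{body}"}

# Each batch shares one copy of the instruction and few-shot examples
batches = batch_emails(["email one", "email two", "email three", "email four"])
```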

&lt;h3&gt;
  
  
  Bonus 1.3: Better utilize a smaller model with a more optimized prompt
&lt;/h3&gt;

&lt;p&gt;Certain tasks may seem to require bigger models, because prompting a bigger model is easier: you can write a more general-purpose prompt, do zero-shot prompting without giving any examples, and still get reliable results.&lt;/p&gt;

&lt;p&gt;If we can convert some zero-shot prompts for bigger models into few-shot prompts for smaller models, we can get the same level of accuracy at a faster, cheaper rate.&lt;/p&gt;

&lt;p&gt;Matt Shumer proposed an interesting way to use Claude Opus to &lt;a href="https://x.com/mattshumer_/status/1770942240191373770?ref=portkey.ai" rel="noopener noreferrer"&gt;convert a zero-shot prompt into a few-shot prompt&lt;/a&gt; which could then be run on a much smaller model without a significant decrease in accuracy.&lt;/p&gt;

&lt;p&gt;This can lead to significant savings in both latency &amp;amp; cost. For the example Matt used, the original zero-shot prompt contained 281 tokens and was optimized for Claude-3-Opus, which cost 3.2 cents.&lt;/p&gt;

&lt;p&gt;When converted to a few-shot prompt with enough instructions, the prompt size increased to 1600 tokens. But, since we could now run this on Claude Haiku, our overall cost for the request was reduced to 0.05 cents, a 98.5% cost reduction with a 78% speed-up!&lt;/p&gt;


&lt;p&gt;The approach works well across XXL and S-sized models. Check out a more &lt;a href="https://github.com/mshumer/gpt-prompt-engineer?ref=portkey.ai" rel="noopener noreferrer"&gt;general-purpose notebook&lt;/a&gt; here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus 1.4: Compress the prompt
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/pdf/2403.12968" rel="noopener noreferrer"&gt;LLMLingua paper&lt;/a&gt; published in April 2024 talks about an interesting concept to reduce LLM costs called prompt compression. Prompt compression aims to shorten the input prompts fed into LLMs while preserving the key information, to make LLM inference more efficient and less costly.&lt;/p&gt;

&lt;p&gt;LLMLingua is a task-agnostic prompt compression method proposed in the paper. It works by estimating the information entropy of tokens in the prompt using a smaller language model like LLaMa-7B and then removing low-entropy tokens that contribute less to the overall meaning. The authors demonstrated that LLMLingua can achieve compression ratios of 2.5x-5x on datasets like MeetingBank, LongBench, and GSM8K. Importantly, the compressed prompts still allow the target LLM (e.g. GPT-3.5-Turbo) to produce outputs comparable in quality to using the original full-length prompts.&lt;/p&gt;
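&lt;p&gt;As a toy illustration of the idea (not LLMLingua's actual implementation, which scores tokens with a small LM's perplexity), here's a crude pruner that drops the most frequent, lowest-information words first while preserving word order:&lt;/p&gt;

```python
from collections import Counter

def toy_compress(prompt, keep_ratio=0.5):
    """Crude stand-in for entropy-based pruning: frequent words carry less
    information, so drop them first and keep rarer words in their original
    order. (Real LLMLingua scores tokens with a small LM's perplexity.)"""
    words = prompt.split()
    counts = Counter(w.lower() for w in words)
    # Rank positions by rarity: rare words first, frequent words last
    ranked = sorted(range(len(words)), key=lambda i: counts[words[i].lower()])
    keep = set(ranked[: max(1, int(len(words) * keep_ratio))])
    return " ".join(w for i, w in enumerate(words) if i in keep)
```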

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpiw1rf97vou0iajnbgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpiw1rf97vou0iajnbgf.png" alt="LLMLingua"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By reducing prompt length through compression techniques like LLMLingua, we can substantially cut down on the computational cost and latency of LLM inference, without sacrificing too much on the quality of the model's outputs. This is a promising approach to make LLMs more practical and accessible for various AI applications. As research on prompt compression advances, we can expect LLMs to become more cost-efficient to deploy and use at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. LLM Approximation
&lt;/h2&gt;

&lt;p&gt;LLM approximation is another key strategy proposed in FrugalGPT for reducing the costs associated with querying large language models. The idea is to approximate an expensive LLM using cheaper models or infrastructure when possible, without significantly sacrificing performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Cache LLM Requests
&lt;/h3&gt;

&lt;p&gt;The age-old technique to reduce costs applies to LLM requests as well. When the prompt is exactly the same, we can save the inference time and cost by serving the request from the cache.&lt;/p&gt;

&lt;p&gt;At Portkey, we've seen that adding a cache to a co-pilot results, on average, in 8% cache hits with 99% cost savings, along with a 95% decrease in latency.&lt;/p&gt;

&lt;p&gt;If using the &lt;a href="https://docs.portkey.ai/docs/product/ai-gateway-streamline-llm-integrations?ref=portkey.ai" rel="noopener noreferrer"&gt;Portkey AI gateway&lt;/a&gt;, you can turn on the cache by adding the cache object to your gateway configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"cache": {
    "mode": "simple",
    "max-age": "3600"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
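&lt;p&gt;If you're not using a gateway, the same idea can be sketched in-app in a few lines; &lt;code&gt;call_llm&lt;/code&gt; below is a placeholder for any completion function:&lt;/p&gt;

```python
import hashlib
import json

llm_cache = {}

def cached_completion(messages, call_llm):
    """Serve exact-repeat prompts from an in-memory cache. `call_llm` is a
    placeholder for any function that takes messages and returns a string."""
    # Hash the full message list so identical prompts map to the same key
    key = hashlib.sha256(json.dumps(messages, sort_keys=True).encode()).hexdigest()
    if key not in llm_cache:
        llm_cache[key] = call_llm(messages)
    return llm_cache[key]
```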



&lt;p&gt;The generative AI twist on this is an emerging technique called &lt;a href="https://portkey.ai/blog/reducing-llm-costs-and-latency-semantic-cache/" rel="noopener noreferrer"&gt;Semantic Caching&lt;/a&gt;. It finds prompts similar to ones we've served in the past and, if the similarity clears a threshold, returns the stored response from the cache.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femmfngzxseyo50esf25m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femmfngzxseyo50esf25m.png" alt="semantic cache"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At Portkey, we've seen that upgrading to semantic cache, on average results in a 21% cache hit rate with a 95% decrease in costs and latency! In some cases, we've seen semantic cache efficiencies to be as high as 60% 🤯&lt;/p&gt;

&lt;p&gt;If using the Portkey &lt;a href="https://portkey.ai/features/ai-gateway" rel="noopener noreferrer"&gt;AI gateway&lt;/a&gt;, you can upgrade to a semantic cache.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"cache": {
    "mode": "semantic",
    "max-age": "3600"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
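&lt;p&gt;Under the hood, a semantic cache can be sketched like this; &lt;code&gt;embed_fn&lt;/code&gt; stands in for any sentence-embedding model, and the 0.95 threshold is illustrative (see the caveats below on tuning it):&lt;/p&gt;

```python
import numpy as np

class SemanticCache:
    """Toy semantic cache: serve a stored response when the cosine similarity
    between prompt embeddings clears the threshold."""

    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn = embed_fn      # placeholder for any embedding model
        self.threshold = threshold
        self.entries = []             # list of (embedding, response) pairs

    def get(self, prompt):
        query = self.embed_fn(prompt)
        for emb, response in self.entries:
            sim = float(np.dot(query, emb) /
                        (np.linalg.norm(query) * np.linalg.norm(emb)))
            if sim >= self.threshold:
                return response
        return None  # cache miss: caller queries the LLM and calls put()

    def put(self, prompt, response):
        self.entries.append((self.embed_fn(prompt), response))
```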



&lt;h3&gt;
  
  
  Caveats with caching in production
&lt;/h3&gt;

&lt;p&gt;Having served over 250M cache requests, we've developed an understanding of some of the caveats to keep in mind while implementing a cache in production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Cache can leak data&lt;br&gt;
We've built a lot of RBAC rules around databases so a user cannot access the data of another user. This falls flat with a cache, since a user could mimic the request of another user and access information from the LLM cache.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To avoid this, on Portkey, you can add metadata keys to your requests and partition the cache as per user/org/session/etc to ensure that the cache store for each metadata key is separate.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Semantic similarity threshold cannot be arbitrary&lt;br&gt;
While a 0.95 similarity threshold is a good starting point, it's advisable to run a backtest with ~5k requests and find the threshold at which accuracy stays above 99%. After all, we wouldn't want to serve incorrect replies in the name of cost savings!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We could be caching the wrong results&lt;br&gt;
Since LLMs are probabilistic, the responses can be unsatisfactory at times. Caching these results would only lead to more frustration as the wrong response will be served again and again.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To solve this, it's recommended to create a workflow where you force refresh the cache when the user gives negative feedback (like a thumbs down) or clicks on refresh/regenerate. In Portkey, you can implement this using the &lt;code&gt;force-refresh&lt;/code&gt; config parameter.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Fine-tune a smaller model in parallel
&lt;/h3&gt;

&lt;p&gt;We know that switching to smaller models decreases inference cost and latency. But, this is usually at the expense of accuracy. Fine-tuning is a very effective middle ground, where you can train a smaller model on a specific task and have it perform as well or even better than the larger model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57ru9evomcukweg4m6nu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57ru9evomcukweg4m6nu.png" alt="mistral7b_vs_gpt-4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a production environment, it can be massively beneficial to keep serving requests through a bigger model while continuously logging and fine-tuning a smaller model on those responses. We can then evaluate the results from the fine-tuned model and the larger model to determine when it makes sense to switch.&lt;/p&gt;
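&lt;p&gt;One way to sketch the logging half of this loop: append each large-model response as a chat-format JSONL record (the shape OpenAI's fine-tuning endpoint accepts), keeping only positively rated examples. The helper below is hypothetical, not a Portkey API:&lt;/p&gt;

```python
import json

def log_for_finetune(messages, response, feedback, path="finetune_log.jsonl"):
    """Append one chat-format training example per request served by the
    larger model; keep only examples with positive user feedback."""
    if feedback <= 0:
        return False  # skip negatively rated responses
    record = {"messages": messages + [{"role": "assistant", "content": response}]}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return True
```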

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3hf3jf3bjzvu0bpnayz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3hf3jf3bjzvu0bpnayz.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We've observed cost decreases of as much as 94% without a large accuracy decline. The latency also decreases significantly, thus improving user satisfaction.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: As with caching, it's beneficial to use human feedback when picking the examples to fine-tune the smaller model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Portkey, you can create an &lt;a href="https://portkey.ai/docs/product/autonomous-fine-tuning?ref=portkey.ai" rel="noopener noreferrer"&gt;autonomous fine-tune&lt;/a&gt; using these principles.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick the model you want to fine-tune across providers.&lt;/li&gt;
&lt;li&gt;Select the training set for the model from the filtered Portkey logs. We add filters to only pick successful requests (Status: 200) where the user gave positive feedback (Avg Weighted Feedback &amp;gt; 0).&lt;/li&gt;
&lt;li&gt;Pick the validation dataset (optional)&lt;/li&gt;
&lt;li&gt;Start the fine-tune&lt;/li&gt;
&lt;li&gt;Set a frequency to automatically keep improving the model - Portkey can use the filter criteria to continuously add more training examples to your fine-tuned model and create multiple checkpoints for the same.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  3. LLM Cascade
&lt;/h2&gt;

&lt;p&gt;The LLM cascade is a powerful technique proposed in FrugalGPT that leverages the diversity of available LLMs to optimize for both cost and performance. The key idea is to sequentially query different LLMs based on the confidence of the previous LLM's response. If a cheaper LLM can provide a satisfactory answer, there's no need to query the more expensive models, thus saving costs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i6g9bb10racmvxsynxs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i6g9bb10racmvxsynxs.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In essence, the LLM cascade makes a request to the smallest model first, evaluates the response, and returns it if it's good enough. Otherwise, it requests the next larger model and so on until a satisfactory response is obtained or the largest model is reached.&lt;/p&gt;

&lt;p&gt;The LLM cascade consists of two main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generation Scoring Function: This function, denoted as g(q, a), assigns a reliability score between 0 and 1 to an LLM's response a for a given query q. It helps determine if the response is satisfactory or if we need to query the next LLM in the cascade.&lt;/li&gt;
&lt;li&gt;LLM Router: The router is responsible for selecting the optimal sequence of m LLMs to query for a given task. It learns this sequence by optimizing for a combination of cost and performance on a validation set.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's see how we can implement an LLM cascade in practice. We'll continue with our email classification example from earlier.&lt;/p&gt;

&lt;p&gt;Suppose we have access to three LLMs: GPT-J (lowest cost), GPT-3.5-Turbo (medium cost), and GPT-4 (highest cost). Our goal is to classify incoming emails into the categories: Personal, Work, Events, Updates, or Todos.&lt;/p&gt;

&lt;p&gt;First, we need to train our generation scoring function. We can use a smaller model like DistilBERT for this. The input to the model will be the concatenation of the query (email body) and the generated classification. The output is a score between 0 and 1 indicating confidence in the classification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import pipeline

# Placeholder model; in practice, fine-tune a DistilBERT classifier for this scoring task
scorer = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

def generation_score(query, generated_class):
    input_text = f"Email: {query}\nGenerated Class: {generated_class}"
    score = scorer(input_text)[0]['score']
    return score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need to learn the optimal LLM sequence and threshold values for our cascade. This can be done by evaluating different combinations on a validation set and optimizing for a given cost budget.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def evaluate_cascade(llm_sequence, threshold_values, val_set, cost_budget):
    total_cost = 0
    correct_predictions = 0

    for email, true_class in val_set:
        for llm, threshold in zip(llm_sequence, threshold_values):
            generated_class = llm(email)
            score = generation_score(email, generated_class)

            total_cost += llm_cost[llm]
            if score &amp;gt; threshold:
                break

        # Score the accepted answer (or the final LLM's answer if none passed)
        if generated_class == true_class:
            correct_predictions += 1

    accuracy = correct_predictions / len(val_set)

    return accuracy, total_cost

# Optimize for llm_sequence and threshold_values using techniques like grid search
best_llm_sequence, best_threshold_values = optimize(evaluate_cascade, cost_budget)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the trained scoring function and optimized LLM cascade, we can now efficiently classify incoming emails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def classify_email(email):
    for llm, threshold in zip(best_llm_sequence, best_threshold_values):
        generated_class = llm(email)
        score = generation_score(email, generated_class)

        if score &amp;gt; threshold:
            return generated_class

    return generated_class  # Return the final LLM's prediction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By implementing an LLM cascade, we can dynamically adapt to each query, using more powerful LLMs only when necessary. This allows us to optimize for both cost and performance on a per-query basis.&lt;/p&gt;

&lt;p&gt;In their experiments, the &lt;a href="https://portkey.ai/blog/implementing-frugalgpt-smarter-llm-usage-for-lower-costs/" rel="noopener noreferrer"&gt;FrugalGPT&lt;/a&gt; authors show that an LLM cascade can match GPT-4's performance while reducing costs by up to 98% on some datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  All Together Now
&lt;/h2&gt;

&lt;p&gt;In conclusion, FrugalGPT offers a comprehensive set of strategies for optimizing LLM API usage while reducing costs and maintaining high performance. By implementing techniques such as prompt adaptation, LLM approximation, and LLM cascade, developers and businesses can significantly reduce their LLM operating expenses without compromising on the quality of their AI-powered applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yfydkedsqwmlwvgz035.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yfydkedsqwmlwvgz035.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The practical examples and code snippets provided in this post demonstrate how to put FrugalGPT's theory into practice. By adopting these techniques and adapting them to your specific use case, you can create more efficient, cost-effective, and performant LLM-based solutions.&lt;/p&gt;

&lt;p&gt;As LLMs continue to grow in size and capability, the strategies proposed in FrugalGPT will become increasingly important for ensuring the accessibility and sustainability of these powerful tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started Today
&lt;/h2&gt;

&lt;p&gt;You can put FrugalGPT's core principles into practice using Portkey's product suite for observability, gateway, and fine-tuning.&lt;/p&gt;

&lt;p&gt;To meet other practitioners and engineers who are pushing the boundaries of what’s possible with LLMs, join our close-knit community of practitioners putting &lt;a href="https://discord.com/invite/kXYKpPGasJ?ref=portkey.ai" rel="noopener noreferrer"&gt;LLMs in Prod&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
    </item>
    <item>
      <title>⭐️ Decoding OpenAI Evals</title>
      <dc:creator>Drishti Shah (Ext)</dc:creator>
      <pubDate>Sun, 06 Oct 2024 16:58:49 +0000</pubDate>
      <link>https://forem.com/portkey/decoding-openai-evals-4hpb</link>
      <guid>https://forem.com/portkey/decoding-openai-evals-4hpb</guid>
      <description>&lt;p&gt;Learn how to use the eval framework to evaluate models &amp;amp; prompts to optimize LLM systems for the best outputs.&lt;/p&gt;

&lt;p&gt;Conversation on &lt;a href="https://x.com/jumbld/status/1656295008083779586?s=20&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's been a lot of buzz around model evaluations since OpenAI open-sourced their eval framework and Anthropic released their datasets.&lt;/p&gt;

&lt;p&gt;While both provide superb documentation for their evals, understanding them well enough for a production implementation is hard. My goal was to use these evals in my own LLM apps. So, we'll break down the concepts from the libraries and use them in real-life systems.&lt;/p&gt;

&lt;p&gt;Ready? Let's focus on the &lt;code&gt;openai/evals&lt;/code&gt; library to start with.&lt;/p&gt;

&lt;p&gt;It contains 2 distinct parts:&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;framework&lt;/strong&gt; to evaluate an LLM or a system built on top of an LLM.&lt;br&gt;
A &lt;strong&gt;registry&lt;/strong&gt; of challenging evals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We'll only focus on the framework in this blog.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The aim of this blog is to use the eval framework to evaluate models &amp;amp; prompts to optimize LLM systems for the best outputs.&lt;br&gt;
The goal of the blog is not to learn how to submit an eval to OpenAI :)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  What is an Eval?
&lt;/h2&gt;

&lt;p&gt;An eval is a task used to measure the quality of output of an LLM or LLM system.&lt;/p&gt;

&lt;p&gt;Given an input &lt;code&gt;prompt&lt;/code&gt;, an &lt;code&gt;output&lt;/code&gt; is generated. We evaluate this &lt;code&gt;output&lt;/code&gt; against a set of &lt;code&gt;ideal_answers&lt;/code&gt; to find the quality of the LLM system.&lt;/p&gt;

&lt;p&gt;If we do this a bunch of times, we can find the accuracy.&lt;/p&gt;
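&lt;p&gt;That loop can be sketched in a few lines; &lt;code&gt;system_fn&lt;/code&gt; and &lt;code&gt;match_fn&lt;/code&gt; are hypothetical placeholders for your LLM call and comparison method:&lt;/p&gt;

```python
def run_eval(system_fn, dataset, match_fn):
    """dataset: (prompt, ideal_answers) pairs; system_fn: the LLM system under
    test; match_fn: decides whether an output counts as correct."""
    correct = sum(
        1 for prompt, ideal_answers in dataset
        if match_fn(system_fn(prompt), ideal_answers)
    )
    return correct / len(dataset)  # accuracy over the whole dataset
```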
&lt;h2&gt;
  
  
  Why use Evals?
&lt;/h2&gt;

&lt;p&gt;While we use evals to measure the accuracy of any LLM system, there are 3 key ways they become extremely useful for any AI app in production.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;As part of the CI/CD pipeline&lt;br&gt;
Given a dataset, we can make evals a part of our CI/CD pipeline to make sure we achieve the desired accuracy before we deploy. This is especially helpful if we've changed models or parameters, whether by mistake or intentionally.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We could set the CI/CD block to fail in case the accuracy does not meet our standards on the provided dataset.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Finding blind spots of a model in real-time&lt;br&gt;
In real-time, we could keep judging the output of models based on real user input and find areas or use cases where the model may not be performing well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To compare fine-tunes to foundational models&lt;br&gt;
We can also use evals to find if the accuracy of the model improves as we fine-tune it with examples. However, it becomes important to separate out the test &amp;amp; train data so that we don't introduce a bias in our evaluations.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
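&lt;p&gt;The CI/CD gate in point 1 can be sketched as a small Python check that fails the build when accuracy drops below a bar (a hypothetical sketch; the threshold and the way the accuracy number is obtained are placeholders, not part of the evals library):&lt;/p&gt;

```python
def gate_on_accuracy(accuracy: float, threshold: float = 0.9) -> bool:
    """True when the measured eval accuracy meets the deployment bar."""
    return accuracy >= threshold


def ci_gate(accuracy: float, threshold: float = 0.9) -> int:
    """Exit-code style result for a CI step: 0 passes, 1 fails the build."""
    if gate_on_accuracy(accuracy, threshold):
        return 0
    print(f"Eval accuracy {accuracy:.0%} is below the {threshold:.0%} bar.")
    return 1
```

&lt;p&gt;In a pipeline, calling &lt;code&gt;sys.exit(ci_gate(accuracy))&lt;/code&gt; after an eval run would block the deploy.&lt;/p&gt;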
&lt;h2&gt;
  
  
  Eval Templates
&lt;/h2&gt;

&lt;p&gt;OpenAI has defined 2 types of eval templates that can be used out of the box:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Basic Eval Templates&lt;br&gt;
These contain deterministic functions to compare the output to the ideal_answers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Model-Graded Templates&lt;br&gt;
These contain functions where an LLM compares the output to the ideal_answers and attempts to judge the accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's look at the various functions for these 2 templates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Basic Eval Templates&lt;/strong&gt;&lt;br&gt;
These are most helpful when the outputs we're evaluating have very little variation in content &amp;amp; structure.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;When the output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;is boolean (true/false),&lt;/li&gt;
&lt;li&gt;is one of many choices (options in a multiple-choice question),&lt;/li&gt;
&lt;li&gt;or is very straightforward (a fact-based answer).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are 3 methods you can use for comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/openai/evals/blob/main/evals/api.py?ref=portkey.ai#L55" rel="noopener noreferrer"&gt;match&lt;/a&gt;&lt;br&gt;
Checks if any of the ideal_answers start with the output&lt;br&gt;
&lt;a href="https://github.com/openai/evals/blob/main/evals/elsuite/basic/includes.py?ref=portkey.ai#L32" rel="noopener noreferrer"&gt;includes&lt;/a&gt;&lt;br&gt;
Checks if the output is contained within any of the ideal_answers&lt;br&gt;
&lt;a href="https://github.com/openai/evals/blob/main/evals/elsuite/utils.py?ref=portkey.ai#L47" rel="noopener noreferrer"&gt;fuzzy_match&lt;/a&gt;&lt;br&gt;
Checks if the output is contained within any of the ideal_answers OR any ideal_answer is contained within the output&lt;/p&gt;
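&lt;p&gt;In Python, the comparison semantics described above boil down to roughly this (a simplified sketch; the actual library implementations add string normalization and result bookkeeping):&lt;/p&gt;

```python
def match(output: str, ideal_answers: list[str]) -> bool:
    # True if any of the ideal answers start with the output
    return any(ideal.startswith(output) for ideal in ideal_answers)


def includes(output: str, ideal_answers: list[str]) -> bool:
    # True if the output is contained within any ideal answer
    return any(output in ideal for ideal in ideal_answers)


def fuzzy_match(output: str, ideal_answers: list[str]) -> bool:
    # True if the output contains an ideal answer, or vice versa
    return any(output in ideal or ideal in output for ideal in ideal_answers)
```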

&lt;p&gt;The workflow for a basic eval looks like this&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuab6n6hafv7c3opshd3p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuab6n6hafv7c3opshd3p.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an input prompt and&lt;/li&gt;
&lt;li&gt;ideal_answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We generate the output from the LLM system and then compare it to the ideal_answers set using one of the matching algorithms. This is very close to how we do QA on deterministic systems today, with the exception that we may not get exact matches, but can rely on the output being contained in the ideal_answers or vice versa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Model-Graded Eval Templates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These are useful when the outputs being generated have significant variations and might even have different structures.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;answering an open-ended question,&lt;/li&gt;
&lt;li&gt;summarising a large piece of text,&lt;/li&gt;
&lt;li&gt;or searching through a set of text.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these use cases, it has been found that we can use a model to grade itself. This is especially interesting as we're now exploiting the reasoning capabilities of an LLM. I'd imagine GPT-4 would be especially good at complex comparisons, while GPT-3.5 (faster, cheaper) would work for simpler comparisons.&lt;/p&gt;

&lt;p&gt;We could use the same model being used for the generation OR a different model. (In a webinar, @kamilelukosiute mentioned that it might be prudent to use a different one to reduce bias)&lt;/p&gt;

&lt;p&gt;There's a generic classification method for model-graded eval templates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/openai/evals/blob/main/evals/elsuite/modelgraded/classify.py?ref=portkey.ai#L26" rel="noopener noreferrer"&gt;ModelBasedClassify&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This accepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;input prompt&lt;/code&gt; being used for the generation,&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;output&lt;/code&gt; generated for the prompt,&lt;/li&gt;
&lt;li&gt;and optionally a reference &lt;code&gt;ideal_answer&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It then prompts an LLM with these 3 parts and expects it to classify if the output is good or not. There are 3 classification methods specified:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;cot_classify&lt;/code&gt; - The model is asked to define a chain of thought and then arrive at an answer (Reason, then answer). This is the recommended classification method.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;classify_cot&lt;/code&gt; - The model is asked to provide an answer and then explain the reasoning behind it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;classify&lt;/code&gt; expects only the final answer as the output.&lt;/li&gt;
&lt;/ol&gt;
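&lt;p&gt;As a rough sketch of the reason-then-answer (&lt;code&gt;cot_classify&lt;/code&gt;) flow, we can build a grading prompt and then parse the grader's final line. The prompt wording and the parsing below are illustrative assumptions, not the library's exact implementation:&lt;/p&gt;

```python
def build_cot_classify_prompt(input_prompt: str, output: str, ideal_answer: str) -> str:
    # Reason-then-answer: ask the grader to think step by step,
    # then emit a single final choice on the last line.
    return (
        "You are comparing a submitted answer to an expert answer.\n\n"
        f"[Question]: {input_prompt}\n"
        f"[Submitted answer]: {output}\n"
        f"[Expert answer]: {ideal_answer}\n\n"
        "Reason step by step about whether the submission matches the expert "
        "answer, then finish with a single line containing only Y or N."
    )


def parse_final_choice(grader_response: str, choices=("Y", "N")) -> str:
    # The final answer is expected on the last non-empty line;
    # fall back to "N" if the grader returned something unexpected.
    last_line = grader_response.strip().splitlines()[-1].strip()
    return last_line if last_line in choices else "N"
```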

&lt;p&gt;Essentially, this is the super simplified workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyz93svw6el3y6na1jms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyz93svw6el3y6na1jms.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's look at a few examples of this in the real world.&lt;/p&gt;

&lt;p&gt;Eg. 1: Fact-checking (&lt;a href="https://github.com/openai/evals/blob/main/evals/registry/modelgraded/fact.yaml?ref=portkey.ai" rel="noopener noreferrer"&gt;fact.yaml&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Given an &lt;code&gt;input&lt;/code&gt; question, the generated output, and a reference ideal_answer, the model outputs one of 5 options.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"A"&lt;/code&gt; if &lt;code&gt;a&lt;/code&gt; ⊆ &lt;code&gt;b&lt;/code&gt;, i.e., the &lt;code&gt;output&lt;/code&gt; is a subset of the ideal_answer and is fully consistent with it.&lt;br&gt;
&lt;code&gt;"B"&lt;/code&gt; if &lt;code&gt;a&lt;/code&gt; ⊇ &lt;code&gt;b&lt;/code&gt;, i.e., the &lt;code&gt;output&lt;/code&gt; is a superset of the ideal_answer and is fully consistent with it.&lt;br&gt;
&lt;code&gt;"C"&lt;/code&gt; if &lt;code&gt;a&lt;/code&gt; = &lt;code&gt;b&lt;/code&gt;, i.e., the &lt;code&gt;output&lt;/code&gt; contains all the same details as the ideal_answer.&lt;br&gt;
&lt;code&gt;"D"&lt;/code&gt; if &lt;code&gt;a&lt;/code&gt; ≠ &lt;code&gt;b&lt;/code&gt;, i.e., there is a disagreement between the &lt;code&gt;output&lt;/code&gt; and the &lt;code&gt;ideal_answer&lt;/code&gt;.&lt;br&gt;
&lt;code&gt;"E"&lt;/code&gt; if &lt;code&gt;a&lt;/code&gt; ≈ &lt;code&gt;b&lt;/code&gt;, i.e., the &lt;code&gt;output&lt;/code&gt; and &lt;code&gt;ideal_answer differ&lt;/code&gt;, but these differences don't matter factually.&lt;/p&gt;

&lt;p&gt;Eg. 2: Closed Domain Q&amp;amp;A (&lt;a href="https://github.com/openai/evals/blob/main/evals/registry/modelgraded/closedqa.yaml?ref=portkey.ai" rel="noopener noreferrer"&gt;closedqa.yaml&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;What is closed domain Q&amp;amp;A?&lt;br&gt;
Closed domain Q&amp;amp;A is a way to use an LLM system to answer a question, given all the context needed to answer the question.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is the most common type of Q&amp;amp;A implemented today where you&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;pull context about a user query (mostly from a vector database),&lt;/li&gt;
&lt;li&gt;feed the question and the context to an LLM system, and&lt;/li&gt;
&lt;li&gt;expect the system to synthesize the correct answer.
Here's a cookbook by OpenAI detailing how you could do the same.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Given an &lt;code&gt;input_prompt&lt;/code&gt;, the &lt;code&gt;context&lt;/code&gt; or &lt;code&gt;criteria&lt;/code&gt; used to answer the question, and the generated output - the model generates reasoning and then classifies the eval as a &lt;strong&gt;Y&lt;/strong&gt; or &lt;strong&gt;N&lt;/strong&gt; representing the accuracy of the &lt;code&gt;output&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For all search and Q&amp;amp;A use cases, this would be a good way to evaluate the completion of an LLM.&lt;/p&gt;

&lt;p&gt;Eg. 3: Naughty Strings Eval (&lt;a href="https://github.com/openai/evals/blob/main/evals/registry/modelgraded/security.yaml?ref=portkey.ai" rel="noopener noreferrer"&gt;security.yaml&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Given an output, we try to evaluate whether it is malicious or not. The model returns one of "Yes", "No", or "Unsure", which can be used to grade our LLM system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More Examples&lt;/strong&gt;&lt;br&gt;
There are more examples of eval templates mentioned here. The idea is to use these as starting points to build eval templates of our own and judge the accuracy of our responses.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using the Evals Framework
&lt;/h2&gt;

&lt;p&gt;Armed with the basics of how evals work (both basic and model-graded), we can use the evals library to evaluate models based on our requirements. We'll create a custom eval for our use case, try running it with a set of prompts and generations, and then also see how we could run evals in real time.&lt;/p&gt;

&lt;p&gt;To start creating an eval, we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the test dataset in the &lt;a href="https://jsonlines.org/?ref=portkey.ai" rel="noopener noreferrer"&gt;JSONL format&lt;/a&gt;, and&lt;/li&gt;
&lt;li&gt;the eval template to be used.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's create an eval to test an LLM system's capability to extract countries from a passage of text.&lt;/p&gt;
&lt;h3&gt;
  
  
  Creating the dataset
&lt;/h3&gt;

&lt;p&gt;It's recommended to use ChatCompletions for evals, and thus the data set should be in the chat message format and contain&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the input - the prompt to be fed to the completion system&lt;/li&gt;
&lt;li&gt;and the ideal output - optional, denotes the ideal answer
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"input": [{"role": "system", "content": "&amp;lt;input prompt&amp;gt;","name":"example-user"}, "ideal": "correct answer"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;Note from OpenAI&lt;br&gt;
The library does support interoperability between regular text prompts and Chat prompts. So, the chat datasets could then be run on the non chat models (text-davinci-003) or any other completion function.&lt;br&gt;
It's just more clear how to downcast from chat to text rather than the other way around since we would make some decisions around where to put the text in system/user/assistant prompts. But both are supported!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We'll use the following prompt template to extract country information from a text passage.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;List all the countries you can find in the following text.&lt;br&gt;
Text: {{passage_text}}&lt;br&gt;
Countries:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can create our input dataset by filling in passages in the prompt template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"input": [{"role": "user", "content": "List all the countries you can find in the following text.\n\nText: Australia is a continent country surrounded by the Indian and Pacific Oceans, known for its beautiful beaches, diverse wildlife, and vast outback. \n\nCountries:"}], "ideal": "Australia"}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the file as &lt;code&gt;inputs.jsonl&lt;/code&gt; to be used later in the eval.&lt;/p&gt;
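&lt;p&gt;The same file can also be generated programmatically by filling passages into the prompt template, which helps as the dataset grows (a small sketch using only the standard library; the sample passage is illustrative):&lt;/p&gt;

```python
import json

TEMPLATE = "List all the countries you can find in the following text.\n\nText: {passage}\n\nCountries:"

# (passage, ideal answer) pairs; extend with more test cases as needed
samples = [
    ("Australia is a continent country surrounded by the Indian and Pacific Oceans.", "Australia"),
]

with open("inputs.jsonl", "w") as f:
    for passage, ideal in samples:
        row = {
            "input": [{"role": "user", "content": TEMPLATE.format(passage=passage)}],
            "ideal": ideal,
        }
        f.write(json.dumps(row) + "\n")
```

&lt;p&gt;Point the eval's &lt;code&gt;test_jsonl&lt;/code&gt; argument at wherever you save the file.&lt;/p&gt;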

&lt;h3&gt;
  
  
  Creating a custom eval
&lt;/h3&gt;

&lt;p&gt;We extend the &lt;code&gt;evals.Eval&lt;/code&gt; base class to create our custom eval. We need to override two methods in the class&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;eval_sample&lt;/code&gt;: The main method that evaluates a single sample from our dataset. This is where we create a prompt, get a completion (using the default completion function) from our chosen model, and evaluate if the answer is satisfactory or not.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;run&lt;/code&gt;:  This method is called by the &lt;code&gt;oaieval&lt;/code&gt; CLI to run the eval. The &lt;code&gt;eval_all_samples&lt;/code&gt; function in turn will call the &lt;code&gt;eval_sample&lt;/code&gt; function iteratively.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import evals
import evals.metrics
import random
import openai

openai.api_key = "&amp;lt;YOUR OPENAI KEY&amp;gt;"

class ExtractCountries(evals.Eval):
    def __init__(self, test_jsonl, **kwargs):
        super().__init__(**kwargs)
        self.test_jsonl = test_jsonl

    def run(self, recorder):
        test_samples = evals.get_jsonl(self.test_jsonl)
        self.eval_all_samples(recorder, test_samples)
        # Record overall metrics
        return {
            "accuracy": evals.metrics.get_accuracy(recorder.get_events("match")),
        }

    def eval_sample(self, test_sample,rng: random.Random):
        prompt = test_sample["input"]
        result = self.completion_fn(
            prompt=prompt,
            max_tokens=25
        )
        sampled = result.get_completions()[0]
        evals.record_and_check_match(
            prompt,
            sampled,
            expected=test_sample["ideal"]
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;evals/elsuite/extract_countries.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To run the ExtractCountries eval, we can register it for the CLI to be able to run. Let's create a file called &lt;code&gt;extract_countries.yaml&lt;/code&gt;under the &lt;code&gt;evals/registry/evals&lt;/code&gt; folder and add an entry for our eval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!--- Define a base eval --&amp;gt;
extract_countries:
  &amp;lt;!--- id specifies the eval that this eval is an alias for --&amp;gt;
  &amp;lt;!--- When you run oaieval gpt-3.5-turbo extract_countries, you are actually running oaieval gpt-3.5-turbo extract_countries.dev.match-v1 --&amp;gt;

  id: extract_countries.dev.match-v1

  &amp;lt;!--- The metrics that this eval records --&amp;gt;
  &amp;lt;!--- The first metric will be considered to be the primary metric --&amp;gt;
  metrics: [accuracy]
  description: Evaluate if all the countries were extracted from the passage

Define the eval
extract_countries.dev.match-v1:
  &amp;lt;!---Specify the class name as a dotted path to the module and class --&amp;gt;
  class: evals.elsuite.extract_countries:ExtractCountries
  &amp;lt;!--- Specify the arguments as a dictionary of JSONL URIs --&amp;gt;
  &amp;lt;!--- These arguments can be anything that you want to pass to the class constructor --&amp;gt;
  args:
    test_jsonl: /tmp/inputs.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Running the Eval
&lt;/h3&gt;

&lt;p&gt;Now, we can run this eval using the &lt;code&gt;oaieval&lt;/code&gt; CLI like this&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;pip install evals&lt;br&gt;
oaieval gpt-3.5-turbo extract_countries&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This will run our evaluation in parallel on multiple threads and produce an accuracy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe38r3qsf8ko19sjw2diq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe38r3qsf8ko19sjw2diq.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our case, we got an accuracy of 76%, which is not great for an operation like this. Possible reasons for the low accuracy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We might not be using the right evaluation spec.&lt;/li&gt;
&lt;li&gt;The test data isn't very clean.&lt;/li&gt;
&lt;li&gt;The model doesn't work as expected.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Going through the eval logs can be helpful here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Going through Eval Logs
&lt;/h3&gt;

&lt;p&gt;The eval logs are located at &lt;code&gt;/tmp/evallogs&lt;/code&gt; and different log files are created for each evaluation run.&lt;/p&gt;

&lt;p&gt;We could go through the text file in an editor, or there are open-source projects like &lt;a href="https://github.com/zeno-ml/zeno-evals?ref=portkey.ai" rel="noopener noreferrer"&gt;Zeno&lt;/a&gt; that help us visualize these logs and analyze them better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using custom completion functions
&lt;/h3&gt;

&lt;p&gt;"Completion Functions" are the completion methods of any model (LLM or otherwise). And a &lt;code&gt;completion&lt;/code&gt; is the text output that would be the LLM system's answer to the &lt;code&gt;prompt input&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the example above, we chose to use the default completion function (text-davinci-003) of the library for our eval.&lt;/p&gt;

&lt;p&gt;We could also write our own completion functions as &lt;a href="https://github.com/openai/evals/blob/main/docs/completion-fns.md?ref=portkey.ai" rel="noopener noreferrer"&gt;explained here&lt;/a&gt;. These completion functions could be&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;any LLM model of choice,&lt;/li&gt;
&lt;li&gt;a chain of prompts (as popularised by Langchain)&lt;/li&gt;
&lt;li&gt;or &lt;a href="https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks?ref=portkey.ai" rel="noopener noreferrer"&gt;even AutoGPT&lt;/a&gt;!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Useful tips as you run your evals
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If you notice &lt;a href="https://portkey.ai/blog/decoding-openai-evals/" rel="noopener noreferrer"&gt;evals&lt;/a&gt; has cached your data and you need to clear that cache, you can do so with &lt;code&gt;rm -rf /tmp/filecache&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wherever Basic Templates can work, avoid Model Graded Templates as they will have lower reliability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/the-most-common-evaluation-metrics-in-nlp-ced6a763ac8b?ref=portkey.ai" rel="noopener noreferrer"&gt;Common evaluation metrics in NLP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;[ARXIV] &lt;a href="https://portkey.ai/blog/discovering-language-model-behaviors-with-model-written-evaluations-summary/" rel="noopener noreferrer"&gt;Discovering Language Model Behaviors with Model-Written&lt;/a&gt; Evaluations and Anthropic's &lt;a href="https://github.com/anthropics/evals?ref=portkey.ai" rel="noopener noreferrer"&gt;Model-Written Evaluation Datasets&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.eleuther.ai/projects/large-language-model-evaluation?ref=portkey.ai" rel="noopener noreferrer"&gt;Evaluating LLMs by EleutherAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Significant-Gravitas/AutoGPT/tree/master/tests/integration/challenges?ref=portkey.ai" rel="noopener noreferrer"&gt;Challenges by AutoGPT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/llamaindex-blog/building-and-evaluating-a-qa-system-with-llamaindex-3f02e9d87ce1" rel="noopener noreferrer"&gt;Building and Evaluating a QA System with LlamaIndex&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?ref=portkey.ai" rel="noopener noreferrer"&gt;Leaderboard to track, rank and evaluate LLMs and chatbots by HuggingFace&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>openai</category>
    </item>
    <item>
      <title>Ranking LLMs with Elo Ratings</title>
      <dc:creator>Drishti Shah (Ext)</dc:creator>
      <pubDate>Tue, 01 Oct 2024 14:59:28 +0000</pubDate>
      <link>https://forem.com/portkey/ranking-llms-with-elo-ratings-38g2</link>
      <guid>https://forem.com/portkey/ranking-llms-with-elo-ratings-38g2</guid>
<description>&lt;p&gt;Large language models (LLMs) are becoming increasingly popular for various use cases, from natural language processing and text generation to creating hyper-realistic videos.&lt;/p&gt;

&lt;p&gt;Tech giants like OpenAI, Google, and Facebook are all vying for dominance in the LLM space, offering their own unique models and capabilities. Stanford, AI21, Anthropic, Cerebras, and more companies also have very interesting projects coming live!&lt;/p&gt;

&lt;p&gt;For text generation alone, nat.dev has over 20 models you can choose from.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04pd55z1vzfnsbsa7u2x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04pd55z1vzfnsbsa7u2x.png" alt="nat.dev interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Choosing a model for your use case can be challenging. You could just play it safe and choose ChatGPT or GPT-4, but other models might be cheaper or better suited for your use case.&lt;/p&gt;

&lt;p&gt;So how do you compare outputs?&lt;/p&gt;

&lt;p&gt;Let's try leveraging the &lt;a href="https://en.wikipedia.org/wiki/Elo_rating_system?ref=portkey.ai" rel="noopener noreferrer"&gt;Elo rating system&lt;/a&gt;, originally designed to rank chess players, to evaluate and rank different LLMs based on their performance in head-to-head comparisons.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Elo Ratings
&lt;/h2&gt;

&lt;p&gt;So, what are Elo ratings? They were created to rank chess players around the world. Ratings range from around 1000 Elo (beginners) to 2800 Elo and above (the very best players). When a player wins a match, their rating goes up by an amount based on their opponent’s Elo rating.&lt;/p&gt;

&lt;p&gt;You might remember that scene from The Social Network where Zuck and Saverin scribble the Elo formula on their dorm window. Yeah, that’s the same thing we’re about to use to rank LLMs!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng88zqd3e5dqxmniiajm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng88zqd3e5dqxmniiajm.png" alt="the social network"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s how it works: if you start with a 1000 Elo score and somehow beat Magnus Carlsen, who has a rating of 2882, your rating would jump 32 points to 1032 Elo, while his would drop 32 points to 2850.&lt;/p&gt;

&lt;p&gt;The increase and decrease are based on a formula, but we won’t get too deep into the math here. Just know that there are libraries for all that stuff, and the Elo scoring system has been proven to work well. In fact, similar pairwise comparisons are likely used when collecting human preference data for RLHF training of instruction-following models like ChatGPT.&lt;/p&gt;
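&lt;p&gt;For the curious, the formula is short: player A’s expected score against player B is 1 / (1 + 10^((Rb - Ra)/400)), and after a game the rating moves by K × (actual score - expected score), with K commonly set to 32. A quick sketch shows this reproduces the 32-point swing from the example above:&lt;/p&gt;

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    # Expected score (win probability) of player A against player B
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    # score_a is 1 for a win, 0.5 for a draw, 0 for a loss
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# A 1000-rated newcomer beats a 2882-rated grandmaster
new_a, new_b = elo_update(1000, 2882, score_a=1)
```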

&lt;h2&gt;
  
  
  Expanding Elo Ratings for Multiple LLMs
&lt;/h2&gt;

&lt;p&gt;While traditional Elo ratings are designed for comparing two players, our goal is to rank multiple LLMs. To do this, we can adapt the Elo rating system, and we have &lt;a href="https://towardsdatascience.com/developing-a-generalized-elo-rating-system-for-multiplayer-games-b9b495e87802" rel="noopener noreferrer"&gt;Danny Cunningham’s awesome method&lt;/a&gt; to thank for that. With this expansion, we can rank multiple models at the same time, based on their performance in head-to-head matchups.&lt;/p&gt;

&lt;p&gt;This adaptation allows us to have a more comprehensive view of how each model stacks up against the others. By comparing the models’ performances in various combinations, we can gather enough data to determine the most effective model for our use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the Test
&lt;/h2&gt;

&lt;p&gt;Alright, it’s time to see our method in action! We’re going to create “Satoshi tweets about …” using three different generation models to compare their performance. Our contenders are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Davinci (GPT-3)&lt;/li&gt;
&lt;li&gt;GPT-4&lt;/li&gt;
&lt;li&gt;Llama 7B&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these models will generate its own version of the tweet based on the same prompt.&lt;/p&gt;

&lt;p&gt;To make things organized, we’ll save the outputs in a CSV file. The file will have columns for the prompt, Davinci, GPT-4, and Llama, so it’s easy to see the results generated by each model. This setup will help us compare the different LLMs effectively and determine which one is the best fit for generating content in this specific scenario.&lt;/p&gt;

&lt;p&gt;By conducting this test, we’ll gather valuable insights into each model’s capabilities and strengths, giving us a clearer picture of which LLM comes out on top.&lt;/p&gt;
&lt;h2&gt;
  
  
  Let’s Start Ranking
&lt;/h2&gt;

&lt;p&gt;To make the comparison process smooth and enjoyable, we’ll create a simple user interface (UI) for uploading the CSV file and ranking the outputs. This UI will allow for a blind test, which means we won’t know which model generated each output. This way, we can minimize any potential bias while evaluating the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Friv5kzs6kv8b79yqkau2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Friv5kzs6kv8b79yqkau2.png" alt="retool"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After we’ve made about six comparisons, we can start to notice patterns emerging in the rankings. (Six is a fairly small number, but then we did rig the test in the choice of our models 😉)&lt;/p&gt;

&lt;p&gt;Here’s what’s going on in the background while you’re ranking the outputs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;All models start with a base level of 1500 Elo&lt;/strong&gt;: They all begin with an equal footing, ensuring a fair comparison.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New ranks are calculated for all LLMs after each ranking input&lt;/strong&gt;: As we evaluate and rank the outputs, the system will update the Elo ratings for each model based on their performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A line chart identifies trends in ranking changes&lt;/strong&gt;: Visualizing the ranking changes over time will help us spot trends and better understand which LLM consistently outperforms the others.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9aa0j5m4c8d59sqmfvi1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9aa0j5m4c8d59sqmfvi1.png" alt="line chart for LLM elo ratings"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s a code snippet to try out a simulation for multiple models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from multielo import MultiElo

# Initialize the Elo ratings and history
models = ['Model 1', 'Model 2', 'Model 3']
elo_ratings = np.array([1500, 1500, 1500])
elo_history = [np.array([1500]), np.array([1500]), np.array([1500])]

elo = MultiElo()

while True:
    print("Please rank the outputs from 1 (best) to 3 (worst).")
    ranks = [int(input(f"Rank for {model}: ")) for model in models]

    # Update ratings based on user input
    new_ratings = elo.get_new_ratings(elo_ratings, ranks)

    # Update the rating history
    for idx, rating in enumerate(new_ratings):
        elo_history[idx] = np.append(elo_history[idx], rating)

    elo_ratings = new_ratings
    print("Current Elo Ratings:", elo_ratings)

    plot_choice = input("Would you like to plot the Elo rating changes? (yes/no): ")

    if plot_choice.lower() == 'yes':
        for idx, model in enumerate(models):
            plt.plot(elo_history[idx], label=model)

        plt.xlabel("Number of Iterations")
        plt.ylabel("Elo Rating")
        plt.title("Elo Rating Changes")
        plt.legend()
        plt.show()

    continue_choice = input("Do you want to continue ranking outputs? (yes/no): ")
    if continue_choice.lower() == 'no':
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wrapping Up the Test&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We want to be confident in declaring a winner before wrapping up the A/B/C test. To reach this point, consider the following factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Choose your confidence level&lt;/strong&gt;: Many people opt for a 95% confidence level, but we can adjust it based on our specific needs and preferences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep an eye on &lt;a href="https://portkey.ai/blog/comparing-llm-outputs-with-elo-ratings/" rel="noopener noreferrer"&gt;Elo LLM ratings&lt;/a&gt;&lt;/strong&gt;: As you conduct more and more tests, the differences in ratings between the models will become more stable. A larger Elo rating difference between the two options means you can be more certain about the winner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Carry out enough matches&lt;/strong&gt;: It’s important to strike a balance between the number of matches and the duration of your test. You might decide to end the test when the Elo rating difference between the options meets your chosen confidence level.&lt;/li&gt;
&lt;/ol&gt;
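&lt;p&gt;To make point 2 concrete: an Elo gap maps directly to an expected win probability, so one possible stopping rule is to end the test once the gap implies your chosen confidence level. Here’s a sketch of that mapping (the stopping rule itself is our assumption, not a standard part of the Elo system):&lt;/p&gt;

```python
import math


def win_probability(elo_diff: float) -> float:
    # Expected probability that the higher-rated model beats the lower-rated one
    return 1 / (1 + 10 ** (-elo_diff / 400))


def gap_for_confidence(confidence: float) -> float:
    # Smallest Elo gap whose implied win probability reaches `confidence`
    return 400 * math.log10(confidence / (1 - confidence))


# With a 95% bar, we would keep ranking until the leader is ~512 Elo ahead
required_gap = gap_for_confidence(0.95)
```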

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;Conducting quick tests can help us pick an LLM, but we can also use real user feedback to optimize the model in real time. By integrating this approach into our application, we'd be able to identify the winning and losing models as they emerge, adapting on the fly to improve performance.&lt;/p&gt;

&lt;p&gt;We could also pick models for segments of a user base depending on the incoming feedback, which would create different Elo ratings for different cohorts of users.&lt;/p&gt;

&lt;p&gt;This could also be used as a starting point to identify fine-tuning and training opportunities for companies looking to get the extra edge from base LLMs.&lt;/p&gt;

&lt;p&gt;While there are tons of ways to run A/B tests on LLMs, this simple Elo LLM rating method is a fun and effective way to refine our choices and make sure we pick the best option for our project.&lt;/p&gt;

&lt;p&gt;Happy testing!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>portkey</category>
    </item>
  </channel>
</rss>
