<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Drishti Shah (Ext)</title>
    <description>The latest articles on Forem by Drishti Shah (Ext) (@drishti_shah_portkey).</description>
    <link>https://forem.com/drishti_shah_portkey</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2152730%2Fd37f52c6-3fe2-4d33-aee0-339a18c4899b.png</url>
      <title>Forem: Drishti Shah (Ext)</title>
      <link>https://forem.com/drishti_shah_portkey</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/drishti_shah_portkey"/>
    <language>en</language>
    <item>
      <title>FrugalGPT: Reducing LLM Costs &amp; Improving Performance</title>
      <dc:creator>Drishti Shah (Ext)</dc:creator>
      <pubDate>Mon, 07 Oct 2024 16:56:37 +0000</pubDate>
      <link>https://forem.com/portkey/frugalgpt-reducing-llm-costs-improving-performance-25h2</link>
      <guid>https://forem.com/portkey/frugalgpt-reducing-llm-costs-improving-performance-25h2</guid>
      <description>&lt;p&gt;FrugalGPT is a framework proposed by Lingjiao Chen, Matei Zaharia, and James Zou from Stanford University in their 2023 paper "FrugalGPT: How to Use Large Language Models While Reducing Cost and Improving Performance". The paper outlines strategies for more cost-effective and performant usage of large language model (LLM) APIs.&lt;/p&gt;

&lt;p&gt;A year after its initial publication, FrugalGPT remains highly relevant and widely discussed in the AI community. Its enduring popularity stems from the pressing need to make LLM API usage more affordable and efficient as these models grow larger and more expensive.&lt;/p&gt;

&lt;p&gt;The core of FrugalGPT revolves around three key techniques for reducing LLM inference costs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Prompt Adaptation - Using concise, optimized prompts to minimize prompt processing costs&lt;/li&gt;
&lt;li&gt;LLM Approximation - Utilizing caches and model fine-tuning to avoid repeated queries to expensive models&lt;/li&gt;
&lt;li&gt;LLM Cascade - Dynamically selecting the optimal set of LLMs to query based on the input&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The authors demonstrate the potential of these techniques, showing that:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;FrugalGPT can match the performance of the best individual LLM (e.g. GPT-4) with up to 98% cost reduction or improve the accuracy over GPT-4 by 4% with the same cost.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In this post, we'll delve into the practical implementation of these FrugalGPT strategies. We'll provide concrete code examples of how you can employ prompt adaptation, LLM approximation, and LLM cascade in your own AI applications to get the most out of LLMs while managing costs effectively. By adopting FrugalGPT techniques, you can significantly reduce your LLM operating expenses without sacrificing performance.&lt;/p&gt;

&lt;p&gt;Let's put the theory of FrugalGPT into practice:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Prompt Adaptation
&lt;/h2&gt;

&lt;p&gt;FrugalGPT wants us to either reduce the size of the prompt OR combine similar prompts together. The core idea is to minimize tokens and thus reduce LLM costs.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.1 Decrease the prompt size
&lt;/h3&gt;

&lt;p&gt;FrugalGPT proposes that instead of sending a lot of examples in a few-shot prompt, we could pick and choose the best ones, and thus reduce the prompt size.&lt;/p&gt;

&lt;p&gt;While many larger models today don't strictly need few-shot prompts, the technique still significantly increases accuracy across multiple tasks.&lt;/p&gt;

&lt;p&gt;Let's take a classification example where we want to classify our incoming email based on the body into folders - Personal, Work, Events, Updates, Todos.&lt;/p&gt;

&lt;p&gt;A few-shot prompt would look something like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[{"role": "user", "content": "Identify the intent of the incoming email by its summary and classify it as Personal, Work, Events, Updates or Todos"},
{{examples}},
{"role": "user", "content": "Email: {{email_body}}"}]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;where each example is formatted as a user/assistant message pair, like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[{"role": "user", "content": "Email: Hey, this is a reminder from your future self to take flowers home today"},
{"role": "assistant", "content": "Personal, Todos"}]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The prompt with 20 examples is approximately 623 tokens and would cost 0.1 cents per request on &lt;code&gt;gpt-3.5-turbo&lt;/code&gt;. Also, we might want to keep adding examples when emails are mislabeled to improve accuracy. This would further increase the cost of the prompt tokens.&lt;/p&gt;
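&lt;p&gt;To sanity-check these numbers, here's a minimal sketch of the arithmetic. The $0.0015-per-1K-input-tokens price for gpt-3.5-turbo is an assumption based on pricing at the time of writing:&lt;/p&gt;

```python
def prompt_cost_cents(num_tokens: int, price_per_1k_usd: float = 0.0015) -> float:
    """Input-token cost of a prompt, in cents (price is an assumed rate)."""
    return num_tokens / 1000 * price_per_1k_usd * 100

# 623 prompt tokens -> roughly 0.09 cents, i.e. ~0.1 cents per request
print(round(prompt_cost_cents(623), 3))
```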

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvrd69kjufdur5birezb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvrd69kjufdur5birezb.png" alt="prompt"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;FrugalGPT suggests identifying the best examples to be used instead of all of them.&lt;/p&gt;

&lt;p&gt;In this case, we could do a semantic similarity test between the incoming email and the example email bodies. Then, only pick the top k similar examples.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import torch
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')
model = AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2')

def get_embeddings(texts):
    encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    with torch.no_grad():
        model_output = model(**encoded_input)
    # Use the [CLS] token embedding (mean pooling is also common for this model)
    embeddings = model_output.last_hidden_state[:, 0, :]
    return embeddings.numpy()

# Assuming examples is a list of dictionaries with 'email_body' and 'labels' keys
example_embeddings = get_embeddings([ex['email_body'] for ex in examples]) 

def select_best_examples(query, examples, k=5):
    query_embedding = get_embeddings([query])[0]
    similarities = cosine_similarity([query_embedding], example_embeddings)[0]
    top_k_indices = np.argsort(similarities)[-k:]
    return [examples[i] for i in top_k_indices]

# Example usage
query_email = "Don't forget our lunch meeting at the new Italian place today at noon!"
best_examples = select_best_examples(query_email, examples)
few_shot_prompt = generate_prompt(best_examples, query_email) # Omitted for brevity
# Few-shot prompt now contains only the most relevant examples, reducing token count and cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we pick only the top 5, we reduce our prompt token cost to 0.03 cents which is already a 70% reduction.&lt;/p&gt;

&lt;p&gt;This technique works especially well in production scenarios that rely on few-shot prompts for high accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  1.2 Combine similar requests together
&lt;/h3&gt;

&lt;p&gt;LLMs can handle multiple tasks within a single context, and FrugalGPT proposes exploiting this by grouping several requests together, eliminating the redundant prompt examples that would otherwise be repeated in each request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feooabx72n27omltgxnun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feooabx72n27omltgxnun.png" alt="combine_requests"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Using the same example as above, we could now try to classify multiple emails in a single request by tweaking the prompt like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[
    {"role": "user", "content": "Identify the intent of each incoming email by its summary and classify it as 'Personal', 'Work', 'Events', 'Updates' or 'Todos'"},
    {{examples}},
    {"role": "user", "content": "Emails:\n - {{email_body_1}}\n - {{email_body_2}}\n - {{email_body_3}}"}
]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Similarly, modify the few-shot examples to be in batches of 2 &amp;amp; 3.&lt;/p&gt;

&lt;p&gt;This reduces the cost for 3 requests from 0.06 cents to 0.03 cents, a 50% decrease.&lt;/p&gt;

&lt;p&gt;This approach is particularly useful when processing data in batches using a few-shot prompt.&lt;/p&gt;
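&lt;p&gt;A minimal sketch of the batching step, assuming the prompt template above (the helper names are illustrative):&lt;/p&gt;

```python
def batch_emails(emails, batch_size=3):
    """Split a list of email bodies into batches of at most batch_size."""
    return [emails[i:i + batch_size] for i in range(0, len(emails), batch_size)]

def build_batched_message(email_batch):
    """Format one user message listing several emails, as in the prompt above."""
    body = "\n".join(f" - {email}" for email in email_batch)
    return {"role": "user", "content": f"Emails:\n{body}"}

# Each batch shares one copy of the instruction and few-shot examples
batches = batch_emails(["email one", "email two", "email three", "email four"])
```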

&lt;h3&gt;
  
  
  Bonus 1.3: Better utilize a smaller model with a more optimized prompt
&lt;/h3&gt;

&lt;p&gt;Certain tasks may seem to require bigger models, because prompting a bigger model is easier: you can write a more general-purpose prompt, do zero-shot prompting without giving any examples, and still get reliable results.&lt;/p&gt;

&lt;p&gt;If we can convert some zero-shot prompts for bigger models into few-shot prompts for smaller models, we can get the same level of accuracy at a faster, cheaper rate.&lt;/p&gt;

&lt;p&gt;Matt Shumer proposed an interesting way to use Claude Opus to &lt;a href="https://x.com/mattshumer_/status/1770942240191373770?ref=portkey.ai" rel="noopener noreferrer"&gt;convert a zero-shot prompt into a few-shot prompt&lt;/a&gt; which could then be run on a much smaller model without a significant decrease in accuracy.&lt;/p&gt;

&lt;p&gt;This can lead to significant savings in both latency &amp;amp; cost. For the example Matt used, the original zero-shot prompt contained 281 tokens and was optimized for Claude-3-Opus, which cost 3.2 cents.&lt;/p&gt;

&lt;p&gt;When converted to a few-shot prompt with enough instructions, the prompt size increased to 1600 tokens. But, since we could now run this on Claude Haiku, our overall cost for the request was reduced to 0.05 cents, a 98.5% cost reduction with a 78% speed-up!&lt;/p&gt;


&lt;p&gt;The approach works well across XXL and S-sized models. Check out a more &lt;a href="https://github.com/mshumer/gpt-prompt-engineer?ref=portkey.ai" rel="noopener noreferrer"&gt;general-purpose notebook&lt;/a&gt; here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bonus 1.4: Compress the prompt
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/pdf/2403.12968" rel="noopener noreferrer"&gt;LLMLingua paper&lt;/a&gt; published in April 2024 talks about an interesting concept to reduce LLM costs called prompt compression. Prompt compression aims to shorten the input prompts fed into LLMs while preserving the key information, to make LLM inference more efficient and less costly.&lt;/p&gt;

&lt;p&gt;LLMLingua is a task-agnostic prompt compression method proposed in the paper. It works by estimating the information entropy of tokens in the prompt using a smaller language model like LLaMa-7B and then removing low-entropy tokens that contribute less to the overall meaning. The authors demonstrated that LLMLingua can achieve compression ratios of 2.5x-5x on datasets like MeetingBank, LongBench, and GSM8K. Importantly, the compressed prompts still allow the target LLM (e.g. GPT-3.5-Turbo) to produce outputs comparable in quality to using the original full-length prompts.&lt;/p&gt;
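&lt;p&gt;As a toy illustration of the idea (not LLMLingua's actual implementation, which scores tokens with a small LM's perplexity), here's a crude pruner that drops the most frequent, lowest-information words first while preserving word order:&lt;/p&gt;

```python
from collections import Counter

def toy_compress(prompt, keep_ratio=0.5):
    """Crude stand-in for entropy-based pruning: frequent words carry less
    information, so drop them first and keep rarer words in their original
    order. (Real LLMLingua scores tokens with a small LM's perplexity.)"""
    words = prompt.split()
    counts = Counter(w.lower() for w in words)
    # Rank positions by rarity: rare words first, frequent words last
    ranked = sorted(range(len(words)), key=lambda i: counts[words[i].lower()])
    keep = set(ranked[: max(1, int(len(words) * keep_ratio))])
    return " ".join(w for i, w in enumerate(words) if i in keep)
```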

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpiw1rf97vou0iajnbgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhpiw1rf97vou0iajnbgf.png" alt="LLMLingua"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By reducing prompt length through compression techniques like LLMLingua, we can substantially cut down on the computational cost and latency of LLM inference, without sacrificing too much on the quality of the model's outputs. This is a promising approach to make LLMs more practical and accessible for various AI applications. As research on prompt compression advances, we can expect LLMs to become more cost-efficient to deploy and use at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. LLM Approximation
&lt;/h2&gt;

&lt;p&gt;LLM approximation is another key strategy proposed in FrugalGPT for reducing the costs associated with querying large language models. The idea is to approximate an expensive LLM using cheaper models or infrastructure when possible, without significantly sacrificing performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.1 Cache LLM Requests
&lt;/h3&gt;

&lt;p&gt;The age-old technique to reduce costs applies to LLM requests as well. When the prompt is exactly the same, we can save the inference time and cost by serving the request from the cache.&lt;/p&gt;

&lt;p&gt;At Portkey, we've seen that adding a cache to a co-pilot results, on average, in 8% cache hits with 99% cost savings, along with a 95% decrease in latency.&lt;/p&gt;

&lt;p&gt;If using the &lt;a href="https://docs.portkey.ai/docs/product/ai-gateway-streamline-llm-integrations?ref=portkey.ai" rel="noopener noreferrer"&gt;Portkey AI gateway&lt;/a&gt;, you can turn on the cache by adding the cache object to your gateway configuration.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"cache": {
    "mode": "simple",
    "max-age": "3600"
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
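&lt;p&gt;If you're not using a gateway, the same idea can be sketched in-app in a few lines; &lt;code&gt;call_llm&lt;/code&gt; below is a placeholder for any completion function:&lt;/p&gt;

```python
import hashlib
import json

llm_cache = {}

def cached_completion(messages, call_llm):
    """Serve exact-repeat prompts from an in-memory cache. `call_llm` is a
    placeholder for any function that takes messages and returns a string."""
    # Hash the full message list so identical prompts map to the same key
    key = hashlib.sha256(json.dumps(messages, sort_keys=True).encode()).hexdigest()
    if key not in llm_cache:
        llm_cache[key] = call_llm(messages)
    return llm_cache[key]
```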



&lt;p&gt;The generative AI twist on this is an emerging technique called &lt;a href="https://portkey.ai/blog/reducing-llm-costs-and-latency-semantic-cache/" rel="noopener noreferrer"&gt;Semantic Caching&lt;/a&gt;. It finds prompts similar to ones we've served in the past and, if the similarity clears a threshold, returns the stored response from the cache.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femmfngzxseyo50esf25m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Femmfngzxseyo50esf25m.png" alt="semantic cache"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At Portkey, we've seen that upgrading to semantic cache, on average results in a 21% cache hit rate with a 95% decrease in costs and latency! In some cases, we've seen semantic cache efficiencies to be as high as 60% 🤯&lt;/p&gt;

&lt;p&gt;If using the Portkey &lt;a href="https://portkey.ai/features/ai-gateway" rel="noopener noreferrer"&gt;AI gateway&lt;/a&gt;, you can upgrade to a semantic cache.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"cache": {
    "mode": "semantic",
    "max-age": "3600"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
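&lt;p&gt;Under the hood, a semantic cache can be sketched like this; &lt;code&gt;embed_fn&lt;/code&gt; stands in for any sentence-embedding model, and the 0.95 threshold is illustrative (see the caveats below on tuning it):&lt;/p&gt;

```python
import numpy as np

class SemanticCache:
    """Toy semantic cache: serve a stored response when the cosine similarity
    between prompt embeddings clears the threshold."""

    def __init__(self, embed_fn, threshold=0.95):
        self.embed_fn = embed_fn      # placeholder for any embedding model
        self.threshold = threshold
        self.entries = []             # list of (embedding, response) pairs

    def get(self, prompt):
        query = self.embed_fn(prompt)
        for emb, response in self.entries:
            sim = float(np.dot(query, emb) /
                        (np.linalg.norm(query) * np.linalg.norm(emb)))
            if sim >= self.threshold:
                return response
        return None  # cache miss: caller queries the LLM and calls put()

    def put(self, prompt, response):
        self.entries.append((self.embed_fn(prompt), response))
```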



&lt;h3&gt;
  
  
  Caveats with caching in production
&lt;/h3&gt;

&lt;p&gt;Having served over 250M cache requests, we've developed an understanding of some of the caveats to keep in mind while implementing a cache in production:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Cache can leak data&lt;br&gt;
We've built a lot of RBAC rules around databases so a user cannot access the data of another user. This falls flat with a cache, since a user could mimic the request of another user and access information from the LLM cache.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To avoid this, on Portkey, you can add metadata keys to your requests and partition the cache as per user/org/session/etc to ensure that the cache store for each metadata key is separate.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Semantic similarity threshold cannot be arbitrary&lt;br&gt;
While a 0.95 similarity threshold is a good starting point, it's advisable to run a backtest with ~5k requests and find the threshold at which accuracy stays above 99%. After all, we wouldn't want to serve incorrect replies in the name of cost savings!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We could be caching the wrong results&lt;br&gt;
Since LLMs are probabilistic, the responses can be unsatisfactory at times. Caching these results would only lead to more frustration as the wrong response will be served again and again.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To solve this, it's recommended to create a workflow where you force refresh the cache when the user gives negative feedback (like a thumbs down) or clicks on refresh/regenerate. In Portkey, you can implement this using the &lt;code&gt;force-refresh&lt;/code&gt; config parameter.&lt;/p&gt;

&lt;h3&gt;
  
  
  2.2 Fine-tune a smaller model in parallel
&lt;/h3&gt;

&lt;p&gt;We know that switching to smaller models decreases inference cost and latency. But, this is usually at the expense of accuracy. Fine-tuning is a very effective middle ground, where you can train a smaller model on a specific task and have it perform as well or even better than the larger model.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57ru9evomcukweg4m6nu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F57ru9evomcukweg4m6nu.png" alt="mistral7b_vs_gpt-4"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a production environment, it can be massively beneficial to keep serving requests through a bigger model while continuously logging and fine-tuning a smaller model on those responses. We can then evaluate the results from the fine-tuned model and the larger model to determine when it makes sense to switch.&lt;/p&gt;
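&lt;p&gt;One way to sketch the logging half of this loop: append each large-model response as a chat-format JSONL record (the shape OpenAI's fine-tuning endpoint accepts), keeping only positively rated examples. The helper below is hypothetical, not a Portkey API:&lt;/p&gt;

```python
import json

def log_for_finetune(messages, response, feedback, path="finetune_log.jsonl"):
    """Append one chat-format training example per request served by the
    larger model; keep only examples with positive user feedback."""
    if feedback <= 0:
        return False  # skip negatively rated responses
    record = {"messages": messages + [{"role": "assistant", "content": response}]}
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return True
```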

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3hf3jf3bjzvu0bpnayz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft3hf3jf3bjzvu0bpnayz.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We've observed cost decreases of as much as 94% without a large accuracy decline. The latency also decreases significantly, thus improving user satisfaction.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Note: As with caching, it's beneficial to use human feedback when picking the examples to fine-tune the smaller model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In Portkey, you can create an &lt;a href="https://portkey.ai/docs/product/autonomous-fine-tuning?ref=portkey.ai" rel="noopener noreferrer"&gt;autonomous fine-tune&lt;/a&gt; using these principles.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pick the model you want to fine-tune across providers.&lt;/li&gt;
&lt;li&gt;Select the training set for the model from the filtered Portkey logs. We add filters to only pick successful requests (Status: 200) where the user gave positive feedback (Avg Weighted Feedback &amp;gt; 0).&lt;/li&gt;
&lt;li&gt;Pick the validation dataset (optional)&lt;/li&gt;
&lt;li&gt;Start the fine-tune&lt;/li&gt;
&lt;li&gt;Set a frequency to automatically keep improving the model - Portkey can use the filter criteria to continuously add more training examples to your fine-tuned model and create multiple checkpoints for the same.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  3. LLM Cascade
&lt;/h2&gt;

&lt;p&gt;The LLM cascade is a powerful technique proposed in FrugalGPT that leverages the diversity of available LLMs to optimize for both cost and performance. The key idea is to sequentially query different LLMs based on the confidence of the previous LLM's response. If a cheaper LLM can provide a satisfactory answer, there's no need to query the more expensive models, thus saving costs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i6g9bb10racmvxsynxs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6i6g9bb10racmvxsynxs.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In essence, the LLM cascade makes a request to the smallest model first, evaluates the response, and returns it if it's good enough. Otherwise, it requests the next larger model and so on until a satisfactory response is obtained or the largest model is reached.&lt;/p&gt;

&lt;p&gt;The LLM cascade consists of two main components:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generation Scoring Function: This function, denoted as g(q, a), assigns a reliability score between 0 and 1 to an LLM's response a for a given query q. It helps determine if the response is satisfactory or if we need to query the next LLM in the cascade.&lt;/li&gt;
&lt;li&gt;LLM Router: The router is responsible for selecting the optimal sequence of m LLMs to query for a given task. It learns this sequence by optimizing for a combination of cost and performance on a validation set.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's see how we can implement an LLM cascade in practice. We'll continue with our email classification example from earlier.&lt;/p&gt;

&lt;p&gt;Suppose we have access to three LLMs: GPT-J (lowest cost), GPT-3.5-Turbo (medium cost), and GPT-4 (highest cost). Our goal is to classify incoming emails into the categories: Personal, Work, Events, Updates, or Todos.&lt;/p&gt;

&lt;p&gt;First, we need to train our generation scoring function. We can use a smaller model like DistilBERT for this. The input to the model will be the concatenation of the query (email body) and the generated classification. The output is a score between 0 and 1 indicating confidence in the classification.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import pipeline

# Placeholder model; in practice, fine-tune a DistilBERT classifier for this scoring task
scorer = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english")

def generation_score(query, generated_class):
    input_text = f"Email: {query}\nGenerated Class: {generated_class}"
    score = scorer(input_text)[0]['score']
    return score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we need to learn the optimal LLM sequence and threshold values for our cascade. This can be done by evaluating different combinations on a validation set and optimizing for a given cost budget.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def evaluate_cascade(llm_sequence, threshold_values, val_set, cost_budget):
    total_cost = 0
    correct_predictions = 0

    for email, true_class in val_set:
        for llm, threshold in zip(llm_sequence, threshold_values):
            generated_class = llm(email)
            score = generation_score(email, generated_class)

            total_cost += llm_cost[llm]
            if score &amp;gt; threshold:
                break

        # Score the accepted answer (or the final LLM's answer if none passed)
        if generated_class == true_class:
            correct_predictions += 1

    accuracy = correct_predictions / len(val_set)

    return accuracy, total_cost

# Optimize for llm_sequence and threshold_values using techniques like grid search
best_llm_sequence, best_threshold_values = optimize(evaluate_cascade, cost_budget)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the trained scoring function and optimized LLM cascade, we can now efficiently classify incoming emails:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def classify_email(email):
    for llm, threshold in zip(best_llm_sequence, best_threshold_values):
        generated_class = llm(email)
        score = generation_score(email, generated_class)

        if score &amp;gt; threshold:
            return generated_class

    return generated_class  # Return the final LLM's prediction
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By implementing an LLM cascade, we can dynamically adapt to each query, using more powerful LLMs only when necessary. This allows us to optimize for both cost and performance on a per-query basis.&lt;/p&gt;

&lt;p&gt;In their experiments, the &lt;a href="https://portkey.ai/blog/implementing-frugalgpt-smarter-llm-usage-for-lower-costs/" rel="noopener noreferrer"&gt;FrugalGPT&lt;/a&gt; authors show that an LLM cascade can match GPT-4's performance while reducing costs by up to 98% on some datasets.&lt;/p&gt;

&lt;h2&gt;
  
  
  All Together Now
&lt;/h2&gt;

&lt;p&gt;In conclusion, FrugalGPT offers a comprehensive set of strategies for optimizing LLM API usage while reducing costs and maintaining high performance. By implementing techniques such as prompt adaptation, LLM approximation, and LLM cascade, developers and businesses can significantly reduce their LLM operating expenses without compromising on the quality of their AI-powered applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yfydkedsqwmlwvgz035.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5yfydkedsqwmlwvgz035.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The practical examples and code snippets provided in this post demonstrate how to put FrugalGPT's theory into practice. By adopting these techniques and adapting them to your specific use case, you can create more efficient, cost-effective, and performant LLM-based solutions.&lt;/p&gt;

&lt;p&gt;As LLMs continue to grow in size and capability, the strategies proposed in FrugalGPT will become increasingly important for ensuring the accessibility and sustainability of these powerful tools.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started Today
&lt;/h2&gt;

&lt;p&gt;You can put FrugalGPT's core principles into practice using Portkey's product suite for observability, gateway, and fine-tuning.&lt;/p&gt;

&lt;p&gt;To meet other practitioners and engineers who are pushing the boundaries of what’s possible with LLMs, join our close-knit community of practitioners putting &lt;a href="https://discord.com/invite/kXYKpPGasJ?ref=portkey.ai" rel="noopener noreferrer"&gt;LLMs in Prod&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>llm</category>
    </item>
    <item>
      <title>⭐️ Decoding OpenAI Evals</title>
      <dc:creator>Drishti Shah (Ext)</dc:creator>
      <pubDate>Sun, 06 Oct 2024 16:58:49 +0000</pubDate>
      <link>https://forem.com/portkey/decoding-openai-evals-4hpb</link>
      <guid>https://forem.com/portkey/decoding-openai-evals-4hpb</guid>
      <description>&lt;p&gt;Learn how to use the eval framework to evaluate models &amp;amp; prompts to optimize LLM systems for the best outputs.&lt;/p&gt;

&lt;p&gt;Conversation on &lt;a href="https://x.com/jumbld/status/1656295008083779586?s=20&amp;amp;ref=portkey.ai" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's been a lot of buzz around model evaluations since OpenAI open-sourced their eval framework and Anthropic released their datasets.&lt;/p&gt;

&lt;p&gt;While both provide superb documentation for their evals, understanding them well enough for a production implementation is hard. My goal was to use these evals in my own LLM apps. So, we'll break down the concepts from the libraries and use them in real-life systems.&lt;/p&gt;

&lt;p&gt;Ready? Let's focus on the &lt;code&gt;openai/evals&lt;/code&gt; library to start with.&lt;/p&gt;

&lt;p&gt;It contains 2 distinct parts:&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;framework&lt;/strong&gt; to evaluate an LLM or a system built on top of an LLM.&lt;br&gt;
A &lt;strong&gt;registry&lt;/strong&gt; of challenging evals.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We'll only focus on the framework in this blog.&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The aim of this blog is to use the eval framework to evaluate models &amp;amp; prompts to optimize LLM systems for the best outputs.&lt;br&gt;
The goal of the blog is not to learn how to submit an eval to OpenAI :)&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  What is an Eval?
&lt;/h2&gt;

&lt;p&gt;An eval is a task used to measure the quality of output of an LLM or LLM system.&lt;/p&gt;

&lt;p&gt;Given an input &lt;code&gt;prompt&lt;/code&gt;, an &lt;code&gt;output&lt;/code&gt; is generated. We evaluate this &lt;code&gt;output&lt;/code&gt; against a set of &lt;code&gt;ideal_answers&lt;/code&gt; to find the quality of the LLM system.&lt;/p&gt;

&lt;p&gt;If we do this a bunch of times, we can find the accuracy.&lt;/p&gt;
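&lt;p&gt;That loop can be sketched in a few lines; &lt;code&gt;system_fn&lt;/code&gt; and &lt;code&gt;match_fn&lt;/code&gt; are hypothetical placeholders for your LLM call and comparison method:&lt;/p&gt;

```python
def run_eval(system_fn, dataset, match_fn):
    """dataset: (prompt, ideal_answers) pairs; system_fn: the LLM system under
    test; match_fn: decides whether an output counts as correct."""
    correct = sum(
        1 for prompt, ideal_answers in dataset
        if match_fn(system_fn(prompt), ideal_answers)
    )
    return correct / len(dataset)  # accuracy over the whole dataset
```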
&lt;h2&gt;
  
  
  Why use Evals?
&lt;/h2&gt;

&lt;p&gt;While we use evals to measure the accuracy of any LLM system, there are 3 key ways they become extremely useful for any AI app in production.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;As part of the CI/CD pipeline&lt;br&gt;
Given a dataset, we can make evals a part of our CI/CD pipeline to make sure we achieve the desired accuracy before we deploy. This is especially helpful if we've changed models or parameters, whether by mistake or intentionally.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We could set the CI/CD block to fail in case the accuracy does not meet our standards on the provided dataset.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Finding blind spots of a model in real-time&lt;br&gt;
In real-time, we could keep judging the output of models based on real user input and find areas or use cases where the model may not be performing well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;To compare fine-tunes to foundational models&lt;br&gt;
We can also use evals to find if the accuracy of the model improves as we fine-tune it with examples. However, it becomes important to separate out the test &amp;amp; train data so that we don't introduce a bias in our evaluations.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
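&lt;p&gt;The CI/CD gate in point 1 can be sketched as a small Python check that fails the build when accuracy drops below a bar (a hypothetical sketch; the threshold and the way the accuracy number is obtained are placeholders, not part of the evals library):&lt;/p&gt;

```python
def gate_on_accuracy(accuracy: float, threshold: float = 0.9) -> bool:
    """True when the measured eval accuracy meets the deployment bar."""
    return accuracy >= threshold


def ci_gate(accuracy: float, threshold: float = 0.9) -> int:
    """Exit-code style result for a CI step: 0 passes, 1 fails the build."""
    if gate_on_accuracy(accuracy, threshold):
        return 0
    print(f"Eval accuracy {accuracy:.0%} is below the {threshold:.0%} bar.")
    return 1
```

&lt;p&gt;In a pipeline, calling &lt;code&gt;sys.exit(ci_gate(accuracy))&lt;/code&gt; after an eval run would block the deploy.&lt;/p&gt;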
&lt;h2&gt;
  
  
  Eval Templates
&lt;/h2&gt;

&lt;p&gt;OpenAI has defined 2 types of eval templates that can be used out of the box:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Basic Eval Templates&lt;br&gt;
These contain deterministic functions to compare the output to the ideal_answers.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Model-Graded Templates&lt;br&gt;
These contain functions where an LLM compares the output to the ideal_answers and attempts to judge the accuracy.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's look at the various functions for these 2 templates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Basic Eval Templates&lt;/strong&gt;&lt;br&gt;
These are most helpful when the outputs we're evaluating have very little variation in content &amp;amp; structure.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;When the output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;is boolean (true/false),&lt;/li&gt;
&lt;li&gt;is one of many choices (options in a multiple-choice question),&lt;/li&gt;
&lt;li&gt;or is very straightforward (a fact-based answer).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are 3 methods you can use for comparison:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/openai/evals/blob/main/evals/api.py?ref=portkey.ai#L55" rel="noopener noreferrer"&gt;match&lt;/a&gt;&lt;br&gt;
Checks if any of the ideal_answers start with the output&lt;br&gt;
&lt;a href="https://github.com/openai/evals/blob/main/evals/elsuite/basic/includes.py?ref=portkey.ai#L32" rel="noopener noreferrer"&gt;includes&lt;/a&gt;&lt;br&gt;
Checks if the output is contained within any of the ideal_answers&lt;br&gt;
&lt;a href="https://github.com/openai/evals/blob/main/evals/elsuite/utils.py?ref=portkey.ai#L47" rel="noopener noreferrer"&gt;fuzzy_match&lt;/a&gt;&lt;br&gt;
Checks if the output is contained within any of the ideal_answers OR any ideal_answer is contained within the output&lt;/p&gt;
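&lt;p&gt;In Python, the comparison semantics described above boil down to roughly this (a simplified sketch; the actual library implementations add string normalization and result bookkeeping):&lt;/p&gt;

```python
def match(output: str, ideal_answers: list[str]) -> bool:
    # True if any of the ideal answers start with the output
    return any(ideal.startswith(output) for ideal in ideal_answers)


def includes(output: str, ideal_answers: list[str]) -> bool:
    # True if the output is contained within any ideal answer
    return any(output in ideal for ideal in ideal_answers)


def fuzzy_match(output: str, ideal_answers: list[str]) -> bool:
    # True if the output contains an ideal answer, or vice versa
    return any(output in ideal or ideal in output for ideal in ideal_answers)
```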

&lt;p&gt;The workflow for a basic eval looks like this&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuab6n6hafv7c3opshd3p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuab6n6hafv7c3opshd3p.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Given,&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;an input prompt and&lt;/li&gt;
&lt;li&gt;ideal_answers&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We generate the output from the LLM system and then compare it to the ideal_answers set using one of the matching algorithms. This is very close to how we do QA on deterministic systems today, with the exception that we may not get exact matches, but can rely on the output being contained in the ideal_answers or vice versa.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Model-Graded Eval Templates&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;These are useful when the outputs being generated have significant variations and might even have different structures.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;answering an open-ended question,&lt;/li&gt;
&lt;li&gt;summarising a large piece of text,&lt;/li&gt;
&lt;li&gt;or searching through a set of text.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these use cases, it has been found that we can use a model to grade itself. This is especially interesting as we're now exploiting the reasoning capabilities of an LLM. I'd imagine GPT-4 would be especially good at complex comparisons, while GPT-3.5 (faster, cheaper) would work for simpler comparisons.&lt;/p&gt;

&lt;p&gt;We could use the same model being used for the generation OR a different model. (In a webinar, @kamilelukosiute mentioned that it might be prudent to use a different one to reduce bias)&lt;/p&gt;

&lt;p&gt;There's a generic classification method for model-graded eval templates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/openai/evals/blob/main/evals/elsuite/modelgraded/classify.py?ref=portkey.ai#L26" rel="noopener noreferrer"&gt;ModelBasedClassify&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This accepts:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the &lt;code&gt;input prompt&lt;/code&gt; being used for the generation,&lt;/li&gt;
&lt;li&gt;the &lt;code&gt;output&lt;/code&gt; generated for the prompt,&lt;/li&gt;
&lt;li&gt;and optionally a reference &lt;code&gt;ideal_answer&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It then prompts an LLM with these 3 parts and expects it to classify if the output is good or not. There are 3 classification methods specified:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;cot_classify&lt;/code&gt; - The model is asked to define a chain of thought and then arrive at an answer (Reason, then answer). This is the recommended classification method.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;classify_cot&lt;/code&gt; - The model is asked to provide an answer and then explain the reasoning behind it.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;classify&lt;/code&gt; expects only the final answer as the output.&lt;/li&gt;
&lt;/ol&gt;
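&lt;p&gt;As a rough sketch of the reason-then-answer (&lt;code&gt;cot_classify&lt;/code&gt;) flow, we can build a grading prompt and then parse the grader's final line. The prompt wording and the parsing below are illustrative assumptions, not the library's exact implementation:&lt;/p&gt;

```python
def build_cot_classify_prompt(input_prompt: str, output: str, ideal_answer: str) -> str:
    # Reason-then-answer: ask the grader to think step by step,
    # then emit a single final choice on the last line.
    return (
        "You are comparing a submitted answer to an expert answer.\n\n"
        f"[Question]: {input_prompt}\n"
        f"[Submitted answer]: {output}\n"
        f"[Expert answer]: {ideal_answer}\n\n"
        "Reason step by step about whether the submission matches the expert "
        "answer, then finish with a single line containing only Y or N."
    )


def parse_final_choice(grader_response: str, choices=("Y", "N")) -> str:
    # The final answer is expected on the last non-empty line;
    # fall back to "N" if the grader returned something unexpected.
    last_line = grader_response.strip().splitlines()[-1].strip()
    return last_line if last_line in choices else "N"
```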

&lt;p&gt;Essentially, this is the super simplified workflow:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyz93svw6el3y6na1jms.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzyz93svw6el3y6na1jms.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's look at a few examples of this in the real world.&lt;/p&gt;

&lt;p&gt;Eg. 1: Fact-checking (&lt;a href="https://github.com/openai/evals/blob/main/evals/registry/modelgraded/fact.yaml?ref=portkey.ai" rel="noopener noreferrer"&gt;fact.yaml&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Given an &lt;code&gt;input&lt;/code&gt; question, the generated output, and a reference ideal_answer, the model outputs one of 5 options.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;"A"&lt;/code&gt; if &lt;code&gt;a&lt;/code&gt; ⊆ &lt;code&gt;b&lt;/code&gt;, i.e., the &lt;code&gt;output&lt;/code&gt; is a subset of the ideal_answer and is fully consistent with it.&lt;br&gt;
&lt;code&gt;"B"&lt;/code&gt; if &lt;code&gt;a&lt;/code&gt; ⊇ &lt;code&gt;b&lt;/code&gt;, i.e., the &lt;code&gt;output&lt;/code&gt; is a superset of the ideal_answer and is fully consistent with it.&lt;br&gt;
&lt;code&gt;"C"&lt;/code&gt; if &lt;code&gt;a&lt;/code&gt; = &lt;code&gt;b&lt;/code&gt;, i.e., the &lt;code&gt;output&lt;/code&gt; contains all the same details as the ideal_answer.&lt;br&gt;
&lt;code&gt;"D"&lt;/code&gt; if &lt;code&gt;a&lt;/code&gt; ≠ &lt;code&gt;b&lt;/code&gt;, i.e., there is a disagreement between the &lt;code&gt;output&lt;/code&gt; and the &lt;code&gt;ideal_answer&lt;/code&gt;.&lt;br&gt;
&lt;code&gt;"E"&lt;/code&gt; if &lt;code&gt;a&lt;/code&gt; ≈ &lt;code&gt;b&lt;/code&gt;, i.e., the &lt;code&gt;output&lt;/code&gt; and &lt;code&gt;ideal_answer differ&lt;/code&gt;, but these differences don't matter factually.&lt;/p&gt;

&lt;p&gt;Eg. 2: Closed Domain Q&amp;amp;A (&lt;a href="https://github.com/openai/evals/blob/main/evals/registry/modelgraded/closedqa.yaml?ref=portkey.ai" rel="noopener noreferrer"&gt;closedqa.yaml&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;What is closed domain Q&amp;amp;A?&lt;br&gt;
Closed domain Q&amp;amp;A is a way to use an LLM system to answer a question, given all the context needed to answer the question.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is the most common type of Q&amp;amp;A implemented today where you&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;pull context about a user query (mostly from a vector database),&lt;/li&gt;
&lt;li&gt;feed the question and the context to an LLM system, and&lt;/li&gt;
&lt;li&gt;expect the system to synthesize the correct answer.
Here's a cookbook by OpenAI detailing how you could do the same.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Given an &lt;code&gt;input_prompt&lt;/code&gt;, the &lt;code&gt;context&lt;/code&gt; or &lt;code&gt;criteria&lt;/code&gt; used to answer the question, and the generated output - the model generates reasoning and then classifies the eval as a &lt;strong&gt;Y&lt;/strong&gt; or &lt;strong&gt;N&lt;/strong&gt; representing the accuracy of the &lt;code&gt;output&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For all search and Q&amp;amp;A use cases, this would be a good way to evaluate the completion of an LLM.&lt;/p&gt;

&lt;p&gt;Eg. 3: Naughty Strings Eval (&lt;a href="https://github.com/openai/evals/blob/main/evals/registry/modelgraded/security.yaml?ref=portkey.ai" rel="noopener noreferrer"&gt;security.yaml&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Given an output, we try to evaluate whether it is malicious or not. The model returns one of "Yes", "No", or "Unsure", which can be used to grade our LLM system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;More Examples&lt;/strong&gt;&lt;br&gt;
There are more examples of eval templates mentioned here. The idea is to use these as starting points to build eval templates of our own and judge the accuracy of our responses.&lt;/p&gt;
&lt;h2&gt;
  
  
  Using the Evals Framework
&lt;/h2&gt;

&lt;p&gt;Armed with the basics of how evals work (both basic and model-graded), we can use the evals library to evaluate models based on our requirements. We'll create a custom eval for our use case, try running it with a set of prompts and generations, and then also see how we could run evals in real time.&lt;/p&gt;

&lt;p&gt;To start creating an eval, we need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;the test dataset in the &lt;a href="https://jsonlines.org/?ref=portkey.ai" rel="noopener noreferrer"&gt;JSONL format&lt;/a&gt;, and&lt;/li&gt;
&lt;li&gt;the eval template to be used.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's create an eval to test an LLM system's capability to extract countries from a passage of text.&lt;/p&gt;
&lt;h3&gt;
  
  
  Creating the dataset
&lt;/h3&gt;

&lt;p&gt;It's recommended to use ChatCompletions for evals, and thus the data set should be in the chat message format and contain&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;the input - the prompt to be fed to the completion system&lt;/li&gt;
&lt;li&gt;and the ideal output - optional, denotes the ideal answer
&lt;/li&gt;
&lt;/ol&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"input": [{"role": "system", "content": "&amp;lt;input prompt&amp;gt;","name":"example-user"}, "ideal": "correct answer"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;blockquote&gt;
&lt;p&gt;Note from OpenAI&lt;br&gt;
The library does support interoperability between regular text prompts and Chat prompts. So, the chat datasets could then be run on the non chat models (text-davinci-003) or any other completion function.&lt;br&gt;
It's just more clear how to downcast from chat to text rather than the other way around since we would make some decisions around where to put the text in system/user/assistant prompts. But both are supported!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We'll use the following prompt template to extract country information from a text passage.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;List all the countries you can find in the following text.&lt;br&gt;
Text: {{passage_text}}&lt;br&gt;
Countries:&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We can create our input dataset by filling in passages in the prompt template.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{"input": [{"role": "user", "content": "List all the countries you can find in the following text.\n\nText: Australia is a continent country surrounded by the Indian and Pacific Oceans, known for its beautiful beaches, diverse wildlife, and vast outback. \n\nCountries:"}], "ideal": "Australia"}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Save the file as &lt;code&gt;inputs.jsonl&lt;/code&gt; to be used later in the eval.&lt;/p&gt;
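&lt;p&gt;The same file can also be generated programmatically by filling passages into the prompt template, which helps as the dataset grows (a small sketch using only the standard library; the sample passage is illustrative):&lt;/p&gt;

```python
import json

TEMPLATE = "List all the countries you can find in the following text.\n\nText: {passage}\n\nCountries:"

# (passage, ideal answer) pairs; extend with more test cases as needed
samples = [
    ("Australia is a continent country surrounded by the Indian and Pacific Oceans.", "Australia"),
]

with open("inputs.jsonl", "w") as f:
    for passage, ideal in samples:
        row = {
            "input": [{"role": "user", "content": TEMPLATE.format(passage=passage)}],
            "ideal": ideal,
        }
        f.write(json.dumps(row) + "\n")
```

&lt;p&gt;Point the eval's &lt;code&gt;test_jsonl&lt;/code&gt; argument at wherever you save the file.&lt;/p&gt;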

&lt;h3&gt;
  
  
  Creating a custom eval
&lt;/h3&gt;

&lt;p&gt;We extend the &lt;code&gt;evals.Eval&lt;/code&gt; base class to create our custom eval. We need to override two methods in the class&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;eval_sample&lt;/code&gt;: The main method that evaluates a single sample from our dataset. This is where we create a prompt, get a completion (using the default completion function) from our chosen model, and evaluate if the answer is satisfactory or not.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;run&lt;/code&gt;:  This method is called by the &lt;code&gt;oaieval&lt;/code&gt; CLI to run the eval. The &lt;code&gt;eval_all_samples&lt;/code&gt; function in turn will call the &lt;code&gt;eval_sample&lt;/code&gt; function iteratively.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import evals
import evals.metrics
import random
import openai

openai.api_key = "&amp;lt;YOUR OPENAI KEY&amp;gt;"

class ExtractCountries(evals.Eval):
    def __init__(self, test_jsonl, **kwargs):
        super().__init__(**kwargs)
        self.test_jsonl = test_jsonl

    def run(self, recorder):
        test_samples = evals.get_jsonl(self.test_jsonl)
        self.eval_all_samples(recorder, test_samples)
        # Record overall metrics
        return {
            "accuracy": evals.metrics.get_accuracy(recorder.get_events("match")),
        }

    def eval_sample(self, test_sample,rng: random.Random):
        prompt = test_sample["input"]
        result = self.completion_fn(
            prompt=prompt,
            max_tokens=25
        )
        sampled = result.get_completions()[0]
        evals.record_and_check_match(
            prompt,
            sampled,
            expected=test_sample["ideal"]
        )
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;evals/elsuite/extract_countries.py&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;To run the ExtractCountries eval, we can register it for the CLI to be able to run. Let's create a file called &lt;code&gt;extract_countries.yaml&lt;/code&gt;under the &lt;code&gt;evals/registry/evals&lt;/code&gt; folder and add an entry for our eval.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;lt;!--- Define a base eval --&amp;gt;
extract_countries:
  &amp;lt;!--- id specifies the eval that this eval is an alias for --&amp;gt;
  &amp;lt;!--- When you run oaieval gpt-3.5-turbo extract_countries, you are actually running oaieval gpt-3.5-turbo extract_countries.dev.match-v1 --&amp;gt;

  id: extract_countries.dev.match-v1

  &amp;lt;!--- The metrics that this eval records --&amp;gt;
  &amp;lt;!--- The first metric will be considered to be the primary metric --&amp;gt;
  metrics: [accuracy]
  description: Evaluate if all the countries were extracted from the passage

Define the eval
extract_countries.dev.match-v1:
  &amp;lt;!---Specify the class name as a dotted path to the module and class --&amp;gt;
  class: evals.elsuite.extract_countries:ExtractCountries
  &amp;lt;!--- Specify the arguments as a dictionary of JSONL URIs --&amp;gt;
  &amp;lt;!--- These arguments can be anything that you want to pass to the class constructor --&amp;gt;
  args:
    test_jsonl: /tmp/inputs.jsonl
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Running the Eval
&lt;/h3&gt;

&lt;p&gt;Now, we can run this eval using the &lt;code&gt;oaieval&lt;/code&gt; CLI like this&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;pip install evals&lt;br&gt;
oaieval gpt-3.5-turbo extract_countries&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This will run our evaluation in parallel on multiple threads and produce an accuracy.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe38r3qsf8ko19sjw2diq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe38r3qsf8ko19sjw2diq.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In our case, we got an accuracy of 76%, which is not great for an operation like this. Possible reasons for the low accuracy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We might not be using the right evaluation spec.&lt;/li&gt;
&lt;li&gt;The test data isn't very clean.&lt;/li&gt;
&lt;li&gt;The model doesn't work as expected.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Going through the eval logs can be helpful here.&lt;/p&gt;

&lt;h3&gt;
  
  
  Going through Eval Logs
&lt;/h3&gt;

&lt;p&gt;The eval logs are located at &lt;code&gt;/tmp/evallogs&lt;/code&gt; and different log files are created for each evaluation run.&lt;/p&gt;

&lt;p&gt;We could go through the text file in an editor, or there are open-source projects like &lt;a href="https://github.com/zeno-ml/zeno-evals?ref=portkey.ai" rel="noopener noreferrer"&gt;Zeno&lt;/a&gt; that help us visualize these logs and analyze them better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using custom completion functions
&lt;/h3&gt;

&lt;p&gt;"Completion Functions" are the completion methods of any model (LLM or otherwise). And a &lt;code&gt;completion&lt;/code&gt; is the text output that would be the LLM system's answer to the &lt;code&gt;prompt input&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In the example above, we chose to use the default completion function (text-davinci-003) of the library for our eval.&lt;/p&gt;

&lt;p&gt;We could also write our own completion functions as &lt;a href="https://github.com/openai/evals/blob/main/docs/completion-fns.md?ref=portkey.ai" rel="noopener noreferrer"&gt;explained here&lt;/a&gt;. These completion functions could be&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;any LLM model of choice,&lt;/li&gt;
&lt;li&gt;a chain of prompts (as popularised by Langchain)&lt;/li&gt;
&lt;li&gt;or &lt;a href="https://github.com/Significant-Gravitas/Auto-GPT-Benchmarks?ref=portkey.ai" rel="noopener noreferrer"&gt;even AutoGPT&lt;/a&gt;!&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Useful tips as you run your evals
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;If you notice &lt;a href="https://portkey.ai/blog/decoding-openai-evals/" rel="noopener noreferrer"&gt;evals&lt;/a&gt; has cached your data and you need to clear that cache, you can do so with &lt;code&gt;rm -rf /tmp/filecache&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Wherever Basic Templates can work, avoid Model Graded Templates as they will have lower reliability.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/the-most-common-evaluation-metrics-in-nlp-ced6a763ac8b?ref=portkey.ai" rel="noopener noreferrer"&gt;Common evaluation metrics in NLP&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;[ARXIV] &lt;a href="https://portkey.ai/blog/discovering-language-model-behaviors-with-model-written-evaluations-summary/" rel="noopener noreferrer"&gt;Discovering Language Model Behaviors with Model-Written&lt;/a&gt; Evaluations and Anthropic's &lt;a href="https://github.com/anthropics/evals?ref=portkey.ai" rel="noopener noreferrer"&gt;Model-Written Evaluation Datasets&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.eleuther.ai/projects/large-language-model-evaluation?ref=portkey.ai" rel="noopener noreferrer"&gt;Evaluating LLMs by EleutherAI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/Significant-Gravitas/AutoGPT/tree/master/tests/integration/challenges?ref=portkey.ai" rel="noopener noreferrer"&gt;Challenges by AutoGPT&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/llamaindex-blog/building-and-evaluating-a-qa-system-with-llamaindex-3f02e9d87ce1" rel="noopener noreferrer"&gt;Building and Evaluating a QA System with LlamaIndex&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard?ref=portkey.ai" rel="noopener noreferrer"&gt;Leaderboard to track, rank and evaluate LLMs and chatbots by HuggingFace&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>openai</category>
    </item>
    <item>
      <title>Ranking LLMs with Elo Ratings</title>
      <dc:creator>Drishti Shah (Ext)</dc:creator>
      <pubDate>Tue, 01 Oct 2024 14:59:28 +0000</pubDate>
      <link>https://forem.com/portkey/ranking-llms-with-elo-ratings-38g2</link>
      <guid>https://forem.com/portkey/ranking-llms-with-elo-ratings-38g2</guid>
<description>&lt;p&gt;Large language models (LLMs) are becoming increasingly popular for various use cases, from natural language processing and text generation to creating hyper-realistic videos.&lt;/p&gt;

&lt;p&gt;Tech giants like OpenAI, Google, and Facebook are all vying for dominance in the LLM space, offering their own unique models and capabilities. Stanford, AI21, Anthropic, Cerebras, and more companies also have very interesting projects coming live!&lt;/p&gt;

&lt;p&gt;For text generation alone, nat.dev has over 20 models you can choose from.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04pd55z1vzfnsbsa7u2x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04pd55z1vzfnsbsa7u2x.png" alt="nat.dev interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Choosing a model for your use case can be challenging. You could just play it safe and choose ChatGPT or GPT-4, but other models might be cheaper or better suited for your use case.&lt;/p&gt;

&lt;p&gt;So how do you compare outputs?&lt;/p&gt;

&lt;p&gt;Let's try leveraging the &lt;a href="https://en.wikipedia.org/wiki/Elo_rating_system?ref=portkey.ai" rel="noopener noreferrer"&gt;Elo rating system&lt;/a&gt;, originally designed to rank chess players, to evaluate and rank different LLMs based on their performance in head-to-head comparisons.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Elo Ratings
&lt;/h2&gt;

&lt;p&gt;So, what are Elo ratings? They were created to rank chess players around the world. Ratings range from around 1000 Elo (beginners) to 2800 Elo and above (the very best players). When a player wins a match, their rating goes up by an amount based on their opponent’s Elo rating.&lt;/p&gt;

&lt;p&gt;You might remember that scene from The Social Network where Zuck and Saverin scribble the Elo formula on their dorm window. Yeah, that’s the same thing we’re about to use to rank LLMs!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng88zqd3e5dqxmniiajm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fng88zqd3e5dqxmniiajm.png" alt="the social network"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s how it works: if you start with a 1000 Elo score and somehow beat Magnus Carlsen, who has a rating of 2882, your rating would jump 32 points to 1032 Elo, while his would drop 32 points to 2850.&lt;/p&gt;

&lt;p&gt;The increase and decrease are based on a formula, but we won’t get too deep into the math here. Just know that there are libraries for all that stuff, and the Elo scoring system has been proven to work well. In fact, similar pairwise comparisons are likely used when collecting human preference data for RLHF training of instruction-following models like ChatGPT.&lt;/p&gt;
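&lt;p&gt;For the curious, the formula is short: player A’s expected score against player B is 1 / (1 + 10^((Rb - Ra)/400)), and after a game the rating moves by K × (actual score - expected score), with K commonly set to 32. A quick sketch shows this reproduces the 32-point swing from the example above:&lt;/p&gt;

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    # Expected score (win probability) of player A against player B
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def elo_update(rating_a: float, rating_b: float, score_a: float, k: float = 32):
    # score_a is 1 for a win, 0.5 for a draw, 0 for a loss
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta


# A 1000-rated newcomer beats a 2882-rated grandmaster
new_a, new_b = elo_update(1000, 2882, score_a=1)
```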

&lt;h2&gt;
  
  
  Expanding Elo Ratings for Multiple LLMs
&lt;/h2&gt;

&lt;p&gt;While traditional Elo ratings are designed for comparing two players, our goal is to rank multiple LLMs. To do this, we can adapt the Elo rating system, and we have &lt;a href="https://towardsdatascience.com/developing-a-generalized-elo-rating-system-for-multiplayer-games-b9b495e87802" rel="noopener noreferrer"&gt;Danny Cunningham’s awesome method&lt;/a&gt; to thank for that. With this expansion, we can rank multiple models at the same time, based on their performance in head-to-head matchups.&lt;/p&gt;

&lt;p&gt;This adaptation allows us to have a more comprehensive view of how each model stacks up against the others. By comparing the models’ performances in various combinations, we can gather enough data to determine the most effective model for our use case.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the Test
&lt;/h2&gt;

&lt;p&gt;Alright, it’s time to see our method in action! We’re going to create “Satoshi tweets about …” using three different generation models to compare their performance. Our contenders are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Davinci (GPT-3)&lt;/li&gt;
&lt;li&gt;GPT-4&lt;/li&gt;
&lt;li&gt;Llama 7B&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each of these models will generate its own version of the tweet based on the same prompt.&lt;/p&gt;

&lt;p&gt;To make things organized, we’ll save the outputs in a CSV file. The file will have columns for the prompt, Davinci, GPT-4, and Llama, so it’s easy to see the results generated by each model. This setup will help us compare the different LLMs effectively and determine which one is the best fit for generating content in this specific scenario.&lt;/p&gt;

&lt;p&gt;By conducting this test, we’ll gather valuable insights into each model’s capabilities and strengths, giving us a clearer picture of which LLM comes out on top.&lt;/p&gt;
&lt;h2&gt;
  
  
  Let’s Start Ranking
&lt;/h2&gt;

&lt;p&gt;To make the comparison process smooth and enjoyable, we’ll create a simple user interface (UI) for uploading the CSV file and ranking the outputs. This UI will allow for a blind test, which means we won’t know which model generated each output. This way, we can minimize any potential bias while evaluating the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Friv5kzs6kv8b79yqkau2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Friv5kzs6kv8b79yqkau2.png" alt="retool"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After we’ve made about six comparisons, we can start to notice patterns emerging in the rankings. (Six is a fairly small number, but then we did rig the test in the choice of our models 😉)&lt;/p&gt;

&lt;p&gt;Here’s what’s going on in the background while you’re ranking the outputs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;All models start with a base level of 1500 Elo&lt;/strong&gt;: They all begin with an equal footing, ensuring a fair comparison.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;New ranks are calculated for all LLMs after each ranking input&lt;/strong&gt;: As we evaluate and rank the outputs, the system will update the Elo ratings for each model based on their performance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;A line chart identifies trends in ranking changes&lt;/strong&gt;: Visualizing the ranking changes over time will help us spot trends and better understand which LLM consistently outperforms the others.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9aa0j5m4c8d59sqmfvi1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9aa0j5m4c8d59sqmfvi1.png" alt="line chart for LLM elo ratings"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s a code snippet to try out a simulation for multiple models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%matplotlib inline

import numpy as np
import matplotlib.pyplot as plt
from multielo import MultiElo

# Initialize the Elo ratings and history
models = ['Model 1', 'Model 2', 'Model 3']
elo_ratings = np.array([1500, 1500, 1500])
elo_history = [np.array([1500]), np.array([1500]), np.array([1500])]

elo = MultiElo()

while True:
    print("Please rank the outputs from 1 (best) to 3 (worst).")
    ranks = [int(input(f"Rank for {model}: ")) for model in models]

    # Update ratings based on user input
    new_ratings = elo.get_new_ratings(elo_ratings, ranks)

    # Update the rating history
    for idx, rating in enumerate(new_ratings):
        elo_history[idx] = np.append(elo_history[idx], rating)

    elo_ratings = new_ratings
    print("Current Elo Ratings:", elo_ratings)

    plot_choice = input("Would you like to plot the Elo rating changes? (yes/no): ")

    if plot_choice.lower() == 'yes':
        for idx, model in enumerate(models):
            plt.plot(elo_history[idx], label=model)

        plt.xlabel("Number of Iterations")
        plt.ylabel("Elo Rating")
        plt.title("Elo Rating Changes")
        plt.legend()
        plt.show()

    continue_choice = input("Do you want to continue ranking outputs? (yes/no): ")
    if continue_choice.lower() == 'no':
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Wrapping Up the Test&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;We want to be confident in declaring a winner before wrapping up the A/B/C test. To reach this point, consider the following factors:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Choose your confidence level&lt;/strong&gt;: Many people opt for a 95% confidence level, but we can adjust it based on our specific needs and preferences.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep an eye on &lt;a href="https://portkey.ai/blog/comparing-llm-outputs-with-elo-ratings/" rel="noopener noreferrer"&gt;Elo LLM ratings&lt;/a&gt;&lt;/strong&gt;: As you conduct more and more tests, the differences in ratings between the models will become more stable. A larger Elo rating difference between the two options means you can be more certain about the winner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Carry out enough matches&lt;/strong&gt;: It’s important to strike a balance between the number of matches and the duration of your test. You might decide to end the test when the Elo rating difference between the options meets your chosen confidence level.&lt;/li&gt;
&lt;/ol&gt;
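&lt;p&gt;To make point 2 concrete: an Elo gap maps directly to an expected win probability, so one possible stopping rule is to end the test once the gap implies your chosen confidence level. Here’s a sketch of that mapping (the stopping rule itself is our assumption, not a standard part of the Elo system):&lt;/p&gt;

```python
import math


def win_probability(elo_diff: float) -> float:
    # Expected probability that the higher-rated model beats the lower-rated one
    return 1 / (1 + 10 ** (-elo_diff / 400))


def gap_for_confidence(confidence: float) -> float:
    # Smallest Elo gap whose implied win probability reaches `confidence`
    return 400 * math.log10(confidence / (1 - confidence))


# With a 95% bar, we would keep ranking until the leader is ~512 Elo ahead
required_gap = gap_for_confidence(0.95)
```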

&lt;h2&gt;
  
  
  What's next?
&lt;/h2&gt;

&lt;p&gt;Conducting quick tests can help us pick an LLM, but we can also use real user feedback to optimize the model in real time. By integrating this approach into our application, we'd be able to identify the winning and losing models as they emerge, adapting on the fly to improve performance.&lt;/p&gt;

&lt;p&gt;We could also pick models for segments of a user base depending on the incoming feedback, which would create different Elo ratings for different cohorts of users.&lt;/p&gt;

&lt;p&gt;This could also be used as a starting point to identify fine-tuning and training opportunities for companies looking to get the extra edge from base LLMs.&lt;/p&gt;

&lt;p&gt;While there are tons of ways to run A/B tests on LLMs, this simple Elo LLM rating method is a fun and effective way to refine our choices and make sure we pick the best option for our project.&lt;/p&gt;

&lt;p&gt;Happy testing!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>portkey</category>
    </item>
  </channel>
</rss>
