<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Pezzo</title>
    <description>The latest articles on Forem by Pezzo (@pezzo).</description>
    <link>https://forem.com/pezzo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F7461%2Fe8cd1601-c897-4d92-b0d0-89b29bd0d057.png</url>
      <title>Forem: Pezzo</title>
      <link>https://forem.com/pezzo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/pezzo"/>
    <language>en</language>
    <item>
      <title>Build an AI Powered Moderation System in Under 10 Minutes Using JavaScript</title>
      <dc:creator>Matan Abramovich</dc:creator>
      <pubDate>Mon, 30 Oct 2023 12:30:00 +0000</pubDate>
      <link>https://forem.com/pezzo/build-an-ai-powered-moderation-system-in-under-10-minutes-using-javascript-a6d</link>
      <guid>https://forem.com/pezzo/build-an-ai-powered-moderation-system-in-under-10-minutes-using-javascript-a6d</guid>
      <description>&lt;p&gt;Inappropriate or abusive content online can be a major headache. As a developer, you may have struggled with building effective content moderation into your applications. Manual moderation simply doesn’t scale. But what if you could quickly implement an AI-powered moderation system to automatically detect and filter out toxic comments?&lt;/p&gt;

&lt;p&gt;In this guide, you'll learn how to leverage OpenAI's API to build a simple yet robust moderation system in under 10 minutes. Whether you're working on a social platform, forum, or any user-generated content site, you can easily integrate this into your stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L1V-DZYS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgflip.com/849mbr.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L1V-DZYS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://i.imgflip.com/849mbr.jpg" alt="AI Moderation meme" width="750" height="500"&gt;&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;&lt;strong&gt;Pezzo: Open-Source LLMOps Platform&lt;/strong&gt; 🚀&lt;br&gt;
Just a quick background about us: Pezzo is the fastest-growing open-source LLMOps platform, and the only one built for full-stack developers with first-class TypeScript support.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/pezzolabs/pezzo"&gt;Like our mission? Check Pezzo out and give us a star. We're building for developers, by developers 🌟&lt;/a&gt;.&lt;/p&gt;


&lt;h2&gt;Getting set up&lt;/h2&gt;
&lt;h3&gt;Getting an OpenAI API key&lt;/h3&gt;

&lt;p&gt;First you’ll need to sign up at &lt;a href="https://platform.openai.com/account/api-keys"&gt;OpenAI&lt;/a&gt; and obtain an API key. Once obtained, make sure you set it as an environment variable (&lt;code&gt;OPENAI_API_KEY&lt;/code&gt;).&lt;/p&gt;
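&lt;p&gt;The official &lt;code&gt;openai&lt;/code&gt; Node client reads &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; from the environment on its own, but it helps to fail fast with a clear message if the key is missing. A minimal sketch (the helper name is our own, not part of the library):&lt;/p&gt;

```typescript
// Hypothetical helper: fail fast when the API key is absent, rather than
// letting the first API call fail with a less obvious error.
function requireApiKey(env: { [name: string]: string | undefined }): string {
  const key = env.OPENAI_API_KEY;
  if (!key) {
    throw new Error("OPENAI_API_KEY is not set. Get a key at platform.openai.com");
  }
  return key;
}

// Usage: const apiKey = requireApiKey(process.env);
// Note that `new OpenAI()` will also pick the key up from the environment by itself.
```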
&lt;h3&gt;Setting up the project&lt;/h3&gt;

&lt;p&gt;Create an &lt;code&gt;app.ts&lt;/code&gt; somewhere in your file system. Initialize a new NPM project (&lt;code&gt;npm init -y&lt;/code&gt;) and make sure to install the OpenAI client (&lt;code&gt;npm i openai&lt;/code&gt;). You should be good to go!&lt;br&gt;
For an in-depth guide on how OpenAI API works check out &lt;a href="https://pezzo.ai/blog/add-ai-to-toolkit-in-10-minutes"&gt;this post&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;Let's start simple&lt;/h2&gt;

&lt;p&gt;We're going to start by writing a simple prompt. We'll have a system message that provides guidelines for moderation, and a user message that contains the user's input (imagine this comes from a UI of some sort). Here's a code example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;is this text inappropriate?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],});&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AI response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chatcmpl-8F9sKbcaPkWUJSc9gv3M1LBqGJmzf&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chat.completion&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;created&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1698623572&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-3.5-turbo-0613&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="na"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;stop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;61&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;33&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;94&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Yes, this text is inappropriate. It contains insults, name-calling, and derogatory language. It is disrespectful and does not promote healthy communication or constructive dialogue.&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;stop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's break this down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The user message is&lt;/strong&gt;: &lt;code&gt;"You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself."&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The system message is&lt;/strong&gt;: &lt;code&gt;"is this text inappropriate?"&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The AI response&lt;/strong&gt;: &lt;code&gt;Yes, this text is inappropriate. It contains insults, name-calling, and derogatory language. It is disrespectful and does not promote healthy communication or constructive dialogue.&lt;/code&gt;&lt;/p&gt;
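&lt;p&gt;The raw response object is verbose; in practice you only need the assistant's message. A minimal sketch of pulling it out (the helper name is our own; the response shape follows the dump above):&lt;/p&gt;

```typescript
// Extract the assistant's verdict from a chat completion response.
// The reply lives in the first element of the `choices` array.
function extractVerdict(response: {
  choices: { message: { content: string | null } }[];
}): string {
  const content = response.choices[0]?.message?.content;
  return content ?? "";
}

// Example against a response shaped like the dump above:
const verdict = extractVerdict({
  choices: [{ message: { content: "Yes, this text is inappropriate." } }],
});
```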

&lt;h2&gt;Better moderation granularity&lt;/h2&gt;

&lt;p&gt;Simply understanding if the text is inappropriate isn't enough. We want to understand &lt;em&gt;what's inappropriate&lt;/em&gt; about it.&lt;/p&gt;

&lt;p&gt;We can guide the AI to be more granular, and categorize its response as &lt;code&gt;Toxicity&lt;/code&gt;, &lt;code&gt;Hate Speech&lt;/code&gt;, or &lt;code&gt;Threats&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Toxicity covers rude, disrespectful comments. Hate speech involves racist, sexist, or discriminatory language. Threats are violent, harmful statements.&lt;/p&gt;

&lt;p&gt;(For ethical reasons, this guide will not include examples of actual hate speech or threats - but the concepts can be applied to address these policy violations.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Lable this text as: Toxicity - Rude, disrespectful comments OR Hate Speech - Racist, sexist, discriminatory OR Threats - Violent threats&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AI response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chatcmpl-8FAUdmvD2yECuhbbKGgRX6d1MgO5J&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chat.completion&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;created&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1698625947&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-3.5-turbo-0613&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="na"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;stop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;84&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;93&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Toxicity - Rude, disrespectful comments&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;stop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI response is now more granular. In a real-world app, this allows us to take different automatic moderation actions based on the type of violation.&lt;/p&gt;
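&lt;p&gt;Once the label is granular, each category can drive a different automatic action. A minimal sketch (the action names are our own invention, not part of any API; note the model may echo the criteria, e.g. "Toxicity - Rude, disrespectful comments", so we match on the prefix):&lt;/p&gt;

```typescript
// Map each moderation label to a hypothetical platform action.
// The model may append the criteria after the label (as in the response
// above), so we match on the label prefix rather than exact equality.
function actionFor(label: string): string {
  if (label.startsWith("Toxicity")) return "hide_comment";
  if (label.startsWith("Hate Speech")) return "hide_comment_and_warn_user";
  if (label.startsWith("Threats")) return "escalate_to_human_review";
  return "allow";
}
```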

&lt;h2&gt;Stricter instructions via system prompts&lt;/h2&gt;

&lt;p&gt;We can achieve stricter and more accurate results by making fuller use of the system message. In short, LLMs behave according to the instructions they are given. We'll apply some prompt engineering techniques to guide the AI to behave the way we want.&lt;/p&gt;

&lt;p&gt;In the example below, we:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Assign a role to the AI - Content Moderator&lt;/li&gt;
&lt;li&gt;State a clear task to be achieved&lt;/li&gt;
&lt;li&gt;Define a limited set of results and criteria for each&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Your role is to act as a content moderator for an online platform. Your task is to label comments as 'Toxicity', 'Hate Speech', or 'Threats' based on if they contain rude, discriminatory, or threatening language. Use the following criteria: Toxicity - Rude, disrespectful, overly negative comments, Hate Speech - Racist, sexist, homophobic, discriminatory language, Threats - Violent, graphic, or directly harmful statements&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AI response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chatcmpl-8FBP8kRFB5NTuhspJLQAbDwZDdJXQ&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chat.completion&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;created&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1698629450&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-3.5-turbo-0613&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="na"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;stop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;145&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;148&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;Toxicity&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;stop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The AI's accuracy has improved. It is now able to distinguish between specific violation types.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;There is a trade-off:&lt;/strong&gt; more detailed instructions require more tokens upfront, but enable more precise results.&lt;/p&gt;

&lt;p&gt;Elaborate prompts cost more tokens, and their benefits eventually taper off. &lt;em&gt;The key is optimizing prompts to be just as informative as needed - not as long as possible.&lt;/em&gt; We want to give the AI sufficient guidance without diminishing returns on token efficiency.&lt;/p&gt;

&lt;p&gt;Additionally, overly long messages increase the risk of hallucinations by the AI (in short, the AI making things up).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Did you know?&lt;/strong&gt; There is a way to get better results from an AI model that is also cheaper. Let me know in the comments if you want me to write a post about it 👇&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;Structured JSON responses&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;The AI returns free-form, human-readable text, which is hard for a program to act on.&lt;/em&gt; Let's see how we can easily retrieve a JSON response, so that the result is processable. This is useful if you want to render the result in a user interface, or store it in a database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It's as simple as adding a short instruction to our system prompt!&lt;/strong&gt; &lt;br&gt;
 Here it is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You must respond in JSON, always following this schema: 

{
  label: string[];
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Your role is to act as a content moderator for an online platform. Your task is to label comments as 'Toxicity', 'Hate Speech', or 'Threats' based on if they contain rude, discriminatory, or threatening language. Use the following criteria: Toxicity - Rude, disrespectful, overly negative comments, Hate Speech - Racist, sexist, homophobic, discriminatory language, Threats - Violent, graphic, or directly harmful statements.

You must respond in JSON, always following this schema:

{
  label: string[];
}
&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are such an idiot! Only a moron would think that way. People like you don't deserve to have an opinion with such stupid ideas. Do everyone a favor and keep your dumb thoughts to yourself.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;

    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;AI response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chatcmpl-8FBkEFCJMQVpWIWQoR6Zho53k0DoU&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chat.completion&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;created&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1698630758&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-3.5-turbo-0613&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="na"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;stop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;165&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;173&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{"label": ["Toxicity"]}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="na"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;stop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
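&lt;p&gt;With the schema in place, the assistant's content can be parsed directly. Models can still occasionally stray from the schema or emit malformed JSON, so a defensive parse is worthwhile. A minimal sketch (the helper name is our own):&lt;/p&gt;

```typescript
// Parse the assistant's JSON reply into a list of labels.
// Falls back to an empty list if the model strays from the schema.
function parseLabels(raw: string): string[] {
  try {
    const parsed = JSON.parse(raw);
    return Array.isArray(parsed.label) ? parsed.label : [];
  } catch (err) {
    return [];
  }
}

// Example with the content from the response above:
const labels = parseLabels('{"label": ["Toxicity"]}');
```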



</description>
      <category>openai</category>
      <category>beginners</category>
      <category>javascript</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Developers, Add AI To Your Toolkit in 10 Minutes</title>
      <dc:creator>Ariel Weinberger</dc:creator>
      <pubDate>Mon, 23 Oct 2023 16:22:23 +0000</pubDate>
      <link>https://forem.com/pezzo/developers-add-ai-to-your-toolkit-in-10-minutes-mdn</link>
      <guid>https://forem.com/pezzo/developers-add-ai-to-your-toolkit-in-10-minutes-mdn</guid>
      <description>&lt;p&gt;&lt;strong&gt;This post was originally posted as a &lt;a href="https://www.builder.io/blog/add-ai-in-10-minutes" rel="noopener noreferrer"&gt;guest post in the Builder.io blog&lt;/a&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Generative AI has been exploding recently, and we encounter the terms “ChatGPT”, “LLMs” and “Agents” several times a day. With so many new developments and powerful tools, it’s hard to keep up. In this article, you’re going to learn all the basics so you can officially add AI to your toolbox.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now is our time&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’m a developer at heart. And when I say “our”, I mean us — developers. The recent (and upcoming) advancements in AI can safely be called a paradigm shift. Here’s why.&lt;/p&gt;

&lt;p&gt;Traditionally, for a business to use AI, they would have to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hire top talent across various fields (data science, AI/ML) for model development.&lt;/li&gt;
&lt;li&gt;Gather, scrape, or buy a lot of data to train the model.&lt;/li&gt;
&lt;li&gt;Buy/rent expensive hardware for each training run.&lt;/li&gt;
&lt;li&gt;Test, reinforce/fine-tune, and deploy the model to production in a scalable way.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today, anyone can benefit from AI — it’s one API call away. These APIs tend to be affordable, easy to consume, and reliable for most tasks.&lt;/p&gt;

&lt;p&gt;This makes AI very attractive for projects at all stages. &lt;strong&gt;Now that AI is not exclusive to fortunate enterprises, we developers are going to spearhead the implementation of AI at a world scale.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is an LLM?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLMs (Large Language Models) are models with billions of parameters, trained on vast amounts of text. They differ from traditional AI models, which are trained to accomplish one very specific task.&lt;/p&gt;

&lt;p&gt;LLMs are trained to understand &lt;strong&gt;natural language&lt;/strong&gt;. This is very powerful because such models can connect more dots. You can use LLMs to produce content, analyze sentiment, write code, validate outputs, provide customer support, and much much more.&lt;/p&gt;

&lt;p&gt;Some LLMs are open-source — such as Falcon, Mistral, and Llama 2 — and some are closed-source and served through an API — such as OpenAI GPT and Anthropic Claude. In this article, we’ll focus on OpenAI.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Getting started&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Getting an OpenAI API key&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;First you’ll need to sign up at &lt;a href="https://platform.openai.com/" rel="noopener noreferrer"&gt;OpenAI&lt;/a&gt; and obtain an API key. Once obtained, make sure you set it as an environment variable (&lt;code&gt;OPENAI_API_KEY&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Setting up the project&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Create an &lt;code&gt;app.ts&lt;/code&gt; somewhere in your file system. Initialize a new NPM project (&lt;code&gt;npm init -y&lt;/code&gt;) and make sure to install the OpenAI client (&lt;code&gt;npm i openai&lt;/code&gt;). You should be good to go!&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Calling the OpenAI API&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here’s an example of how we’d call the OpenAI API using the OpenAI Client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;completion&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-3.5-turbo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messsages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="c1"&gt;// messages go here&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="c1"&gt;// ... other options&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Completion&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s quickly go through what’s going on here.&lt;/p&gt;

&lt;p&gt;First, we import &lt;code&gt;OpenAI&lt;/code&gt; from the &lt;code&gt;openai&lt;/code&gt; NPM package.&lt;/p&gt;

&lt;p&gt;Then, we initialize a new OpenAI client. We don’t provide an API key explicitly, as it is automatically fetched from the &lt;code&gt;OPENAI_API_KEY&lt;/code&gt; environment variable set earlier.&lt;/p&gt;
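The environment-variable fallback can also be made explicit. Below is a sketch of that resolution logic in plain TypeScript — the `resolveApiKey` helper is hypothetical, for illustration only; the real client performs this lookup internally:

```typescript
// Hypothetical helper illustrating key resolution: an explicitly passed
// key wins, otherwise fall back to the OPENAI_API_KEY environment variable.
function resolveApiKey(explicit?: string): string {
  const key = explicit ?? process.env.OPENAI_API_KEY;
  if (!key) {
    throw new Error("Set OPENAI_API_KEY or pass an apiKey explicitly");
  }
  return key;
}

process.env.OPENAI_API_KEY = "sk-placeholder"; // placeholder, not a real key
console.log(resolveApiKey());              // "sk-placeholder"
console.log(resolveApiKey("sk-explicit")); // "sk-explicit"
```

With the real client, the explicit form is `new OpenAI({ apiKey: "..." })`.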

&lt;p&gt;Finally, we create a Chat Completion. “&lt;em&gt;A Chat Completion? What is that?&lt;/em&gt;” you might be thinking. Let me explain.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Chat Completions&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;OpenAI provides various APIs. DALL-E for image generation, Whisper for audio transcription, Embeddings API, and so on. Probably the most well-known and used API is the &lt;strong&gt;Chat Completions&lt;/strong&gt; API. Basically, &lt;strong&gt;creating a chat completion means having a chat with the AI model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Despite the term “chat”, it is not used exclusively in chat applications. Chat Completions can be used for single operations/tasks as well. It’s just the most capable API, and it supports the most capable models (&lt;code&gt;gpt-3.5-turbo&lt;/code&gt; and &lt;code&gt;gpt-4&lt;/code&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Options&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When creating a new Chat Completion, you’ll provide some options. Let’s overview &lt;strong&gt;some of them&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;model&lt;/code&gt;: The model you want to use for this particular call. In this example, we use &lt;code&gt;gpt-3.5-turbo&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;temperature&lt;/code&gt;: How creative we want the AI model to be. Zero means as deterministic as possible (no added randomness), while higher values produce more varied, creative output. If your tasks require precision, attention to detail, and factuality, definitely set this to 0.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;max_tokens&lt;/code&gt;: The maximum number of tokens to generate in the response. We’ll talk about tokens later in this article. In short, this is your opportunity to limit the response length, save on costs, and reduce latency.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;messages&lt;/code&gt;: This is where the magic happens. Here, you’ll provide a set of messages. This can be anything from one message for a basic task/operation, to a set of messages to represent a full chat history. You’ll spend most of your time here.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For the full list of options, check out the &lt;a href="https://platform.openai.com/docs/introduction" rel="noopener noreferrer"&gt;OpenAI API documentation&lt;/a&gt;.&lt;/p&gt;
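The options above can be summarized as a simplified type — a sketch for orientation, not the full SDK type:

```typescript
// Simplified sketch of the Chat Completion request options discussed above.
type Role = "system" | "user" | "assistant";

interface ChatCompletionRequest {
  model: string;        // e.g. "gpt-3.5-turbo"
  temperature?: number; // 0 = as deterministic as possible
  max_tokens?: number;  // cap on response length
  messages: { role: Role; content: string }[];
}

const request: ChatCompletionRequest = {
  model: "gpt-3.5-turbo",
  temperature: 0,
  max_tokens: 64,
  messages: [{ role: "user", content: "Hi, my name is Ariel. Greet me!" }],
};

console.log(request.messages.length); // 1
```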

&lt;h2&gt;
  
  
  &lt;strong&gt;Messages&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;As I mentioned earlier, the &lt;code&gt;messages&lt;/code&gt; property is where the magic happens. Each item in this array represents a message. A message holds two properties: the &lt;code&gt;content&lt;/code&gt;, and the &lt;code&gt;role&lt;/code&gt;, which can be one of the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;system&lt;/code&gt;: Use system messages to provide guidelines, set boundaries, provide additional knowledge or set the tone. Imagine this as some “inner voice” that the AI model will take into account when generating responses.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;user&lt;/code&gt;: This represents messages sent by the user. For example, if you are building a chat app, you want to send the user’s messages as user messages.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;assistant&lt;/code&gt;: These messages represent the AI model’s responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Basic Example&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hi, my name is Ariel. Greet me!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response to that would be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hi Ariel, how are you doing?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, let’s see how we can add a System Message to control the behavior of the AI model:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;You are a rude support agent. Only respond with uppercase.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hi, my name is Ariel. Greet me!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;HELLO ARIEL, WHAT DO YOU WANT?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty cool! We can use the system message to dictate how the AI model should behave, depending on our needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;More examples&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let me share a few more examples in which utilizing the System Prompt is useful:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;1. Providing knowledge&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Consider this example — a user asks an AI customer support bot for stock information on shoes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`
      You are an AI assistant for a shoe store, "Best Shoes".
      You introduce yourself as "Shoezzer".

      Here is stock information in CSV format:

      model,size,stock
      nike,12,4
      nike,11,0
      nike,10,4
      nike,9,1
    `&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hi, do you have Nike shoes, size 11?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hi, my name is Shoezzer and I'm here to help you. Unfortunately, we do not currently have Nike shoes size 11 in stock.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this example, we provided the AI model with stock availability information in a simple CSV format. It could also be JSON, XML, or anything else — LLMs can handle it!&lt;/p&gt;
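In a real application, you would likely generate that CSV block from live data rather than hard-coding it into the prompt. A minimal sketch (the inventory data below is made up):

```typescript
// Build the stock section of the system prompt from structured data.
interface StockRow {
  model: string;
  size: number;
  stock: number;
}

const inventory: StockRow[] = [
  { model: "nike", size: 12, stock: 4 },
  { model: "nike", size: 11, stock: 0 },
  { model: "nike", size: 10, stock: 4 },
];

const stockCsv =
  "model,size,stock\n" +
  inventory.map((r) => `${r.model},${r.size},${r.stock}`).join("\n");

console.log(stockCsv.split("\n").length); // 4 lines: header + 3 rows
```

You can then interpolate `stockCsv` into the system message's content string.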

&lt;h4&gt;
  
  
  &lt;strong&gt;2. Setting boundaries&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;LLMs will try to satisfy the user no matter what. They are trained to do that. What if our business use case requires stricter boundaries? Take a look at the following example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`
      You are a Doctor's Assistant. You only do the intake. The user will share some of the
      symptoms with you and you will help identify potential diagnosis to be shared with the doctor.

      You are not allowed to provide any assistance to patients under the age of 16.
      In such a case, politely decline and ask them to call the practice at 1-800-PRACTICE. No additional text.
    `&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Good afternoon. In order to help you, please tell me your age and any symptoms you're suffering from.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I'm 15 years old. I have pretty bad stomach pain and have been coughing since yesterday.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I'm sorry, but as a Doctor Assistant, I am not able to provide assistance to patients under the age of 16. I recommend that you call the practice at 1-800-PRACTICE for further assistance.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  &lt;strong&gt;3. Structured JSON Output&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Imagine that we want to use AI for a single task, rather than a chat app, and we want to render the output of the AI response in some UI. That is not practical with free-form text responses. Here’s how we can approach it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;system&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`
      You are a talented task planner. The user will tell you their goal and
      you will generate exactly 3 tasks to help them accomplish it.

      You must strictly follow this schema:

      {
        tasks: {
          title: string; // max 50 characters
          description: string; // max 100 characters
          difficulty: number; // on a scale of 1 to 5
        }[]
      }
    `&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I want to visit to Amsterdam!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;  "tasks": [&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;    {&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;      "title": "Research Amsterdam attractions",&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;      "description": "Find out about the top attractions in Amsterdam and create a list of places you want to visit.",&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;      "difficulty": 2&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;    },&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;    {&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;      "title": "Book accommodation in Amsterdam",&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;      "description": "Search for and book a suitable accommodation in Amsterdam that fits your budget and preferences.",&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;      "difficulty": 3&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;    },&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;    {&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;      "title": "Plan transportation in Amsterdam",&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="s2"&gt;`      "description": "Figure out the best way to get around Amsterdam, whether it's by public transportation, bike, or walking, and plan your routes.",\n`&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;      "difficulty": 2&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;    }&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;  ]&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt;
    &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check that out! It’s a perfectly valid JSON response that you can &lt;code&gt;JSON.parse&lt;/code&gt;, return to a front end, and render!&lt;/p&gt;
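A minimal sketch of consuming that response: parse the string, then sanity-check it against the schema before rendering. The `Task` interface mirrors the schema from the system message, and the `content` string below is abbreviated from the output above:

```typescript
interface Task {
  title: string;       // max 50 characters
  description: string; // max 100 characters
  difficulty: number;  // on a scale of 1 to 5
}

// Abbreviated version of the assistant's JSON content from above.
const content = `{
  "tasks": [
    { "title": "Research Amsterdam attractions",
      "description": "Find out about the top attractions in Amsterdam.",
      "difficulty": 2 }
  ]
}`;

const parsed: { tasks: Task[] } = JSON.parse(content);

// Validate before trusting the model's output — LLMs can drift from a schema.
for (const task of parsed.tasks) {
  if (task.title.length > 50) throw new Error("title too long");
  if (task.difficulty < 1 || task.difficulty > 5) throw new Error("difficulty out of range");
}

console.log(parsed.tasks[0].title); // "Research Amsterdam attractions"
```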

&lt;blockquote&gt;
&lt;p&gt;If you require structured responses, keep temperature at 0, and &lt;a href="https://platform.openai.com/docs/guides/gpt/function-calling" rel="noopener noreferrer"&gt;check out the OpenAI Function Calling feature&lt;/a&gt;. It’s very powerful. Let me know if you want me to write an article about OpenAI Function Calling!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Keep track of usage&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;A response from OpenAI is something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;...&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;object&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;chat.completion&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;created&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1696431344&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-3.5-turbo-0613&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;index&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="na"&gt;finish_reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;stop&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="nx"&gt;usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;prompt_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;144&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;completion_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;total_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;189&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice the &lt;code&gt;usage&lt;/code&gt; property. It mentions the number of tokens used in the request, the response, and in total. But what are those tokens?&lt;/p&gt;

&lt;p&gt;Sometimes it’s easy to think that LLMs truly understand words. However, that’s not exactly how it works. It’s far easier for LLMs to understand &lt;strong&gt;tokens&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Tokens are numeric representations of strings, parts of strings, or even individual characters. Essentially, the text we provide to the LLM becomes a sequence of integer token IDs, which the model can process in an easier, more performant way.&lt;/p&gt;

&lt;p&gt;For example, the text &lt;code&gt;Hello World, I am learning about AI&lt;/code&gt; equals 9 tokens. How exactly?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.builder.io%2Fapi%2Fv1%2Fimage%2Fassets%252FYJIGb4i01jvw0SRdL5Bt%252Fe97004133889494293ba78e3f5556a42%3Fwidth%3D800" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.builder.io%2Fapi%2Fv1%2Fimage%2Fassets%252FYJIGb4i01jvw0SRdL5Bt%252Fe97004133889494293ba78e3f5556a42%3Fwidth%3D800" alt="TokentsNum,&amp;lt;br&amp;gt;
  remote"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The algorithm used for tokenization has tokenized this sentence as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;15496&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2159&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;314&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;716&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4673&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;546&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9552&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is a fascinating topic, but for the sake of this tutorial, just know that your text inputs are handled post-tokenization, and &lt;strong&gt;you are billed per token&lt;/strong&gt;, &lt;strong&gt;for both the input (request/prompt) and the output (response).&lt;/strong&gt; Pricing is usually quoted per 1,000 tokens, and output tokens tend to be more expensive than input tokens. &lt;a href="https://openai.com/pricing" rel="noopener noreferrer"&gt;Check out the OpenAI Pricing page to view the exact cost.&lt;/a&gt;&lt;/p&gt;
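Given a `usage` object like the one above, estimating the cost of a call is simple arithmetic. The per-1,000-token prices below are placeholders, not current rates — always check the pricing page:

```typescript
// Estimate the cost of a single call from its usage object.
// Prices are placeholders per 1,000 tokens — check the OpenAI pricing page.
function estimateCostUSD(
  usage: { prompt_tokens: number; completion_tokens: number },
  promptPricePer1K: number,
  completionPricePer1K: number,
): number {
  return (
    (usage.prompt_tokens / 1000) * promptPricePer1K +
    (usage.completion_tokens / 1000) * completionPricePer1K
  );
}

// The usage from the response above: 144 prompt tokens, 45 completion tokens.
const cost = estimateCostUSD({ prompt_tokens: 144, completion_tokens: 45 }, 0.0015, 0.002);
console.log(cost.toFixed(6)); // "0.000306"
```

Logging this per request is a cheap way to keep an eye on spend as traffic grows.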

&lt;p&gt;It’s important to mention that different models also have different token limits. For example, at the time of writing this, &lt;code&gt;gpt-3.5-turbo&lt;/code&gt; has a token limit of 4097 tokens in total (request and response combined). You can find more information about &lt;a href="https://platform.openai.com/docs/models/overview" rel="noopener noreferrer"&gt;token limits per model in the OpenAI documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How can I calculate the tokens myself?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;There are various tools available online to help you calculate the tokens. I really like the &lt;a href="https://platform.openai.com/tokenizer" rel="noopener noreferrer"&gt;OpenAI Tokenizer&lt;/a&gt;. However, depending on the model you’re using, it might not always be 100% accurate.&lt;/p&gt;

&lt;p&gt;The Python ecosystem has a package called &lt;code&gt;tiktoken&lt;/code&gt; that helps with exactly this, and talented folks have ported it to JavaScript/TypeScript! My favorite port is &lt;a href="https://www.npmjs.com/package/@dqbd/tiktoken" rel="noopener noreferrer"&gt;@dqbd/tiktoken&lt;/a&gt;. It works very well and is very reliable.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you are counting tokens in a production environment, I suggest you allow a 5%-10% margin of error. These tokenizers are not always accurate. Better safe than sorry!&lt;/p&gt;
&lt;/blockquote&gt;
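&lt;p&gt;If you only need a rough upper bound (say, to guard against exceeding a model's token limit), a crude character-based heuristic plus that safety margin can be sketched like this. This is an assumption-heavy approximation, not a real tokenizer - use &lt;code&gt;@dqbd/tiktoken&lt;/code&gt; when you need accurate counts:&lt;/p&gt;

```javascript
// Very rough rule of thumb for English text: ~4 characters per token.
// This is NOT a real tokenizer; use @dqbd/tiktoken for accurate counts.
function roughTokenEstimate(text, marginPct = 10) {
  const base = Math.ceil(text.length / 4);
  // Pad by a safety margin (5-10%), since estimates are never exact.
  return Math.ceil(base * (1 + marginPct / 100));
}

const estimate = roughTokenEstimate("Tokenizers are fun!");
```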

&lt;h2&gt;
  
  
  &lt;strong&gt;Tips&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Use the &lt;a href="https://platform.openai.com/tokenizer" rel="noopener noreferrer"&gt;OpenAI Tokenizer&lt;/a&gt; to learn how tokens work and get an (almost accurate) idea of token usage.&lt;/li&gt;
&lt;li&gt;Use the &lt;a href="https://platform.openai.com/playground" rel="noopener noreferrer"&gt;OpenAI Playground&lt;/a&gt; to practice prompt engineering without writing a single line of code.&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://pezzo.ai/" rel="noopener noreferrer"&gt;Pezzo&lt;/a&gt; as a centralized prompt management platform to collaborate with your team and iterate quickly, as well as observe and monitor your AI operations and costs. It’s open-source! (disclaimer: I am the founder and CEO).&lt;/li&gt;
&lt;li&gt;Consider taking my &lt;a href="https://www.udemy.com/course/ai-for-js-devs/" rel="noopener noreferrer"&gt;AI For JavaScript Developers&lt;/a&gt; course on Udemy. I’ve so far educated over 200,000 students on Udemy, and this 2-hour crash course is meant for developers like you and me, who want to add AI to their toolbox. We build real-world apps powered by AI and cover Function Calling, Real-time Data, Hallucinations, Vector Stores, Vercel AI SDK, LlamaIndex, and more!&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>javascript</category>
      <category>openai</category>
      <category>chatgpt</category>
    </item>
    <item>
      <title>6 Go-To Techniques to Master AI's Hidden Language: My Playbook for Winning the AI Chat Game</title>
      <dc:creator>Matan Abramovich</dc:creator>
      <pubDate>Mon, 16 Oct 2023 16:40:52 +0000</pubDate>
      <link>https://forem.com/pezzo/6-go-to-techniques-to-master-ais-hidden-language-my-playbook-for-winning-the-ai-chat-game-5dpl</link>
      <guid>https://forem.com/pezzo/6-go-to-techniques-to-master-ais-hidden-language-my-playbook-for-winning-the-ai-chat-game-5dpl</guid>
      <description>&lt;p&gt;Have you ever tried asking Siri to recommend a good takeout place, only to get a list of web search results?&lt;br&gt;
Or perhaps you eagerly queried ChatGPT about the latest AI advancements, but got an incoherent mess of text back.&lt;br&gt;
I've been there too! That was me before discovering the art of prompt engineering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExcTRrc2RyZG42cGwzNGtndGM1ZHQwcHM2NGRpZmx0Z2hsbmcxbGwwbSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/UVSRCtcXu6NODEt6Cw/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExcTRrc2RyZG42cGwzNGtndGM1ZHQwcHM2NGRpZmx0Z2hsbmcxbGwwbSZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/UVSRCtcXu6NODEt6Cw/giphy.gif" alt="isItHardGif,&amp;lt;br&amp;gt;
  remote" width="480" height="270"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine how much easier life would be if Siri could actually hold a conversation and make personalized recommendations.&lt;br&gt;
Or if ChatGPT could engage intelligently on niche topics. Prompt skills let you make that AI dream a reality!&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Should Be On Your Radar
&lt;/h2&gt;

&lt;p&gt;Here are 3 big benefits I've experienced:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;More productive workflows. Now I can get ChatGPT to generate great content fast with the right prompts.&lt;/li&gt;
&lt;li&gt;Improved home life. My new prompt-savvy smart speaker actually gives me solid advice now like a virtual buddy!&lt;/li&gt;
&lt;li&gt;Next-level tech skills. I've become the AI whisperer among my friends. Talk about street cred!&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Secret Sauce: 6 Go-To Prompt Engineering Techniques
&lt;/h2&gt;

&lt;p&gt;Through much trial and error, I've identified 6 techniques that transform my AI conversations:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Chain-of-Thought Prompting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How It Works:&lt;/strong&gt; Break down complex instructions into easy step-by-steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt; Cooking bots serve up way better recipes when I prompt them to lay out each step individually.&lt;br&gt;
No more massive, confusing blocks of text!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Prompts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Walk me through baking chocolate chip cookies from start to finish in simple, clear steps"&lt;/li&gt;
&lt;li&gt;"Explain how to change a bike tire one step at a time"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Generated Knowledge Prompting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How It Works:&lt;/strong&gt; Give bots background info to complete tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt; Now my travel assistant tells me cool historical facts about landmarks before discussing them.&lt;br&gt;
Almost like having a tour guide!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Prompts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Give a 2-3 sentence intro to Picasso before analyzing his paintings"&lt;/li&gt;
&lt;li&gt;"Provide context about D-Day before explaining the events"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Least-to-Most Prompting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How It Works:&lt;/strong&gt; Start with simple aspects before complex ones.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt; For recipes, I list ingredients/tools first so the instructions make more sense.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Prompts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"List ingredients and tools for chocolate chip cookies, then give step-by-steps"&lt;/li&gt;
&lt;li&gt;"Give 3 plot points in Romeo and Juliet before summarizing the full story"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Self-Consistency Decoding
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How It Works:&lt;/strong&gt; Generate multiple responses and pick the best.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt; Now translations sound much more legit in any language!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Translate 'She loves traveling' into Spanish 3 ways, then analyze and choose the most accurate"&lt;/li&gt;
&lt;/ul&gt;
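&lt;p&gt;In code, self-consistency usually boils down to sampling several responses and taking a majority vote. Here's a minimal sketch - the hard-coded candidates stand in for outputs you'd sample from the model:&lt;/p&gt;

```javascript
// Pick the most frequent answer among several sampled responses.
// In a real app the candidates would come from multiple LLM calls
// (e.g. the same prompt sampled with a non-zero temperature).
function majorityVote(candidates) {
  const counts = new Map();
  for (const answer of candidates) {
    counts.set(answer, (counts.get(answer) || 0) + 1);
  }
  let best = null;
  let bestCount = 0;
  for (const [answer, count] of counts) {
    if (count > bestCount) {
      best = answer;
      bestCount = count;
    }
  }
  return best;
}

const winner = majorityVote([
  "Le ama viajar",
  "A ella le encanta viajar",
  "A ella le encanta viajar",
]);
```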

&lt;h3&gt;
  
  
  5. Complexity-Based Prompting
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How It Works:&lt;/strong&gt; Ask for a response that covers multiple distinct facets or criteria in a single prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt; My Spotifybot makes personalized playlists with songs tailored to my ever-changing moods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Suggest a dinner playlist with happy folk songs, somber jazz, and peaceful classical"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Self-Refine
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;How It Works:&lt;/strong&gt; Have bots critique and improve their own work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-World Example:&lt;/strong&gt; I had an AI write a website bio of me, critique its tone/content, and refine it. Work smarter not harder!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sample Prompt:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;"Write a professional LinkedIn summary for me, then give feedback on how to improve it"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExMGRkajh6dHMzOWpkMTdxOXRidGZrZW9mc2M0M3NzNjl5MG12aGUwYiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/PxSFAnuubLkSA/giphy.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://i.giphy.com/media/v1.Y2lkPTc5MGI3NjExMGRkajh6dHMzOWpkMTdxOXRidGZrZW9mc2M0M3NzNjl5MG12aGUwYiZlcD12MV9pbnRlcm5hbF9naWZfYnlfaWQmY3Q9Zw/PxSFAnuubLkSA/giphy.gif" alt="talkWithComp,&amp;lt;br&amp;gt;
  remote" width="320" height="253"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Now It's Your Turn to Chat with Bots&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There you have my best tips for prompt engineering!&lt;/p&gt;

&lt;p&gt;With time and practice, you'll be able to effortlessly chat about complex topics with AI.&lt;br&gt;
Try incorporating one of these techniques next time you ask ChatGPT for help. Share your victories with us on @Pezzo.ai.&lt;/p&gt;

&lt;p&gt;Now get out there and show those bots who's boss with your new conversation skills!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Pezzo v0.5 - Dashboards, Caching, Python Client, and More!</title>
      <dc:creator>Ariel Weinberger</dc:creator>
      <pubDate>Sat, 02 Sep 2023 17:15:53 +0000</pubDate>
      <link>https://forem.com/pezzo/pezzo-v05-dashboards-caching-python-client-and-more-2enk</link>
      <guid>https://forem.com/pezzo/pezzo-v05-dashboards-caching-python-client-and-more-2enk</guid>
      <description>&lt;p&gt;This version brings a lot of new features and improvements to Pezzo. We're excited to share them with you!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N0epPRV4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ieb2cgilqz0dtphb2fw6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N0epPRV4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ieb2cgilqz0dtphb2fw6.png" alt="Version 0.5.0 Banner" width="800" height="419"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What's Pezzo?
&lt;/h2&gt;

&lt;p&gt;Pezzo is a fully open-source (Apache 2.0) LLMOps platform built for developers and teams. It was designed to streamline Generative AI adoption, delivery, monitoring, observability and more.&lt;/p&gt;

&lt;p&gt;Wanna know more? &lt;a href="https://github.com/pezzolabs/pezzo"&gt;Check Pezzo out on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's New
&lt;/h2&gt;

&lt;h3&gt;
  
  
  📈 Project Dashboard
&lt;/h3&gt;

&lt;p&gt;We've added a new screen to the Pezzo Console. The Project Dashboard gives you a quick overview of your project's performance. It includes several useful capabilities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Timeframe Selector:&lt;/strong&gt; Select from hourly, daily, weekly, monthly, yearly and even custom timeframes for analytics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Vital Metrics:&lt;/strong&gt; You can now see the number of requests, cost, average request duration, and success rate. You can even see how they change over time.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Useful Charts:&lt;/strong&gt; We've implemented two charts - one for total requests (as well as errors), and one for average request duration over time. We'll add more charts in the future.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MPDR98N7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f7aea7cz0zday6s7vrud.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MPDR98N7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f7aea7cz0zday6s7vrud.png" alt="Pezzo Project Dashboard" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🏎️ Request Caching
&lt;/h3&gt;

&lt;p&gt;We've implemented a caching mechanism. This feature can help you &lt;strong&gt;save up to 90% of your LLM API costs and time&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;Some practical use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Development:&lt;/strong&gt; During development, developers tend to go through flows very frequently. This usually involves the same set of LLM API calls with the same input data. With Pezzo, your entire organization can share the same cache and focus on value!&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Production:&lt;/strong&gt; If you're building a support chatbot, for example, many queries are highly repetitive, such as "What is your return policy?" or "What are your opening hours?"&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
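&lt;p&gt;To illustrate the idea (this is a simplified sketch, not Pezzo's actual implementation), a request cache keys on the prompt plus model settings, so identical requests skip the LLM API entirely:&lt;/p&gt;

```javascript
// Illustration of the idea behind request caching (not Pezzo's actual
// implementation): key the cache on the prompt plus model settings, so
// identical requests are served without calling the LLM API again.
const cache = new Map();

async function cachedCompletion(callLLM, prompt, settings) {
  const key = JSON.stringify({ prompt, settings });
  if (cache.has(key)) {
    return cache.get(key); // cache hit: no API cost, instant response
  }
  const response = await callLLM(prompt, settings);
  cache.set(key, response);
  return response;
}
```

In the development use case above, the second teammate who runs the same flow gets the cached response instead of paying for a fresh API call.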

&lt;p&gt;You can read more in the &lt;a href="https://docs.pezzo.ai/client/request-caching"&gt;Request Caching documentation&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We're planning to add more caching features in the future, such as &lt;em&gt;semantic caching&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Gng5S6-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zzlvpfxdh6v5hgax5nwb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Gng5S6-Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zzlvpfxdh6v5hgax5nwb.png" alt="Requests page with cached requests" width="800" height="644"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🐍 Python Client
&lt;/h3&gt;

&lt;p&gt;We're excited to share that Pezzo now features a Python client! Here are some useful links to help you get started:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.pezzo.ai/client/pezzo-client-python"&gt;Pezzo Client Documentation (Python)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/pezzolabs/client-python"&gt;Source code&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://pypi.org/"&gt;PyPi package&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've also made sure to add a copy-pastable Python code snippet in the console, to make it even easier for you to get started.&lt;/p&gt;




&lt;h2&gt;
  
  
  Thank you!
&lt;/h2&gt;

&lt;p&gt;We hope you enjoy this version. We're working hard on the next one, which will feature a lot of exciting features and improvements.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;⭐️ Consider &lt;a href="https://github.com/pezzolabs/pezzo"&gt;giving us a star on GitHub&lt;/a&gt; to support our mission &lt;/li&gt;
&lt;li&gt;👾 Consider joining our &lt;a href="https://pezzo.cc"&gt;Discord server&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🎓 Read the &lt;a href="https://docs.pezzo.ai"&gt;documentation&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;🌎 Check out &lt;a href="https://pezzo.ai"&gt;our website&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>openai</category>
      <category>typescript</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>The 5 Pillars for taking LLM to production</title>
      <dc:creator>eylonmiz</dc:creator>
      <pubDate>Fri, 01 Sep 2023 11:59:13 +0000</pubDate>
      <link>https://forem.com/pezzo/the-5-pillars-for-taking-llm-to-production-1olg</link>
      <guid>https://forem.com/pezzo/the-5-pillars-for-taking-llm-to-production-1olg</guid>
      <description>&lt;p&gt;LLMs (Large Language Models) have tremendous potential to enable new types of AI applications. The truth is, turning simple prototypes into robust, production-ready applications is quite challenging.&lt;/p&gt;

&lt;p&gt;We've been supporting dozens of companies in bringing applications to production, and we're excited to share our learnings with you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pillars
&lt;/h2&gt;

&lt;p&gt;When building LLM applications for production use, certain capabilities rise above the rest in importance.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prompt Engineering
&lt;/h3&gt;

&lt;p&gt;Carefully crafted prompts are key to achieving reliable performance from LLMs. Think about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Where and how to store your prompts for quick iteration.&lt;/li&gt;
&lt;li&gt;Ability to experiment with prompts (A/B testing, user segmentation).&lt;/li&gt;
&lt;li&gt;Collaboration. Stakeholders can contribute immensely to prompt engineering.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Observability
&lt;/h3&gt;

&lt;p&gt;We've all been there - we designed some prompts that worked really well with test data. Then we went live and disaster struck. Bad handling of LLMs can have negative effects such as long waiting times, inappropriate responses, lack of context and more. This translates to bad user experience, and could negatively affect your brand/churn rate.&lt;/p&gt;

&lt;p&gt;It's important to carefully think about the observability and monitoring aspects of your LLM operations, and have the ability to quickly identify issues and troubleshoot them. Think about tracing, the ability to track an entire conversation, replay it and improve it over time. Consider anomaly detection as well as emerging trends.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt;&lt;br&gt;
  It's important to know "what good looks like". Having the ability to mark good (e.g. converting) LLM responses versus bad (e.g. churn) will really pay off in the long run.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Cost Control
&lt;/h3&gt;

&lt;p&gt;LLM API costs can quickly spiral out of control. It's important to be prepared and budget accordingly. With that being said, we've seen cases where a trivial parameter change has increased costs by 25% overnight.&lt;/p&gt;

&lt;p&gt;Granular tracking of API usage and billing helps identify expensive calls. With detailed visibility into LLM costs, you can set custom budgets and alerts to proactively manage spend. By analyzing logs and performance data, expensive queries using excessive tokens can be identified and reworked to be more efficient. With rigorous cost management tools, LLM costs can be predictable and optimized.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt;&lt;br&gt;
You often find yourself chaining multiple calls. Think about the models in use. Do you really need to use GPT-4 for everything? If you can, save GPT-4 calls for scoring/labeling/classification calls, where the output is short. This will save you plenty of money. When you need to generate long responses, GPT-3.5-Turbo might be more appropriate from a cost perspective.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Evaluation
&lt;/h3&gt;

&lt;p&gt;Rigorous evaluation using datasets and metrics is key for reliability when building LLM applications. With a centralized dataset store, relevant datasets can be easily logged from real application queries and used to frequently evaluate production models. Built-in integration with open source evaluation libraries makes it simple to assess critical metrics like accuracy, response consistency, and more.&lt;/p&gt;

&lt;p&gt;Evaluation frameworks help you efficiently validate new prompts, chains, and workflows before deploying them to production. Ongoing evaluation using real user data helps identify areas for improvement and ensures optimal performance over time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt;&lt;br&gt;
Evaluation doesn't have to be too complicated. You can sample X% of your LLM responses and run them through another, simple scoring prompt. Over time, this will give you valuable data.&lt;/p&gt;
&lt;/blockquote&gt;
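&lt;p&gt;That sampling tip can be sketched in a few lines. The &lt;code&gt;rng&lt;/code&gt; parameter is injectable purely for illustration, and the scoring call itself is left out:&lt;/p&gt;

```javascript
// Sample a percentage of LLM responses for evaluation by a scoring
// prompt. The rng parameter is injectable (defaults to Math.random)
// so the sampling logic itself stays testable and deterministic.
function sampleForScoring(responses, percent, rng = Math.random) {
  // Keep a response when the draw falls inside the sampling window.
  return responses.filter(() => percent > rng() * 100);
}

// Deterministic example: a fake rng cycling through fixed draws.
const draws = [0.05, 0.5, 0.95, 0.02];
let i = 0;
const fakeRng = () => draws[i++ % draws.length];
const sampled = sampleForScoring(["r1", "r2", "r3", "r4"], 10, fakeRng);
```

Each sampled response would then be sent to your scoring prompt; aggregating those scores over time gives you the evaluation signal described above.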

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;p&gt;There's a limit to what you can do with off-the-shelf models. If using LLMs becomes an important aspect of your operations, you'll likely resort to fine-tuning at some point.&lt;/p&gt;

&lt;p&gt;With integrated data pipelines, real user queries can be efficiently logged and processed into clean training datasets. These datasets empower on-going learning - models can be fine-tuned to better handle terminology and scenarios unique to your business use-case.&lt;/p&gt;

&lt;p&gt;Invest in tooling to generate datasets and fine-tune models early to ensure LLMs deliver maximum value by keeping them closely aligned with evolving business needs.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt;&lt;br&gt;
Apart from yielding better results, fine-tuning can dramatically improve costs. For example, you can train the gpt-3.5-turbo model based on data produced by GPT-4, or other capable (and expensive) models.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Additional Considerations
&lt;/h2&gt;

&lt;p&gt;Besides the pillars mentioned above, there are a few more concepts you need to consider when building production-grade, LLM-powered applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance:&lt;/strong&gt; Depending on your application, it might be crucial to optimize for fast response times and minimal latency. Make sure to design your prompt chains for maximum throughput.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-Model Support:&lt;/strong&gt; If you use multiple LLMs like GPT-3.5, GPT-4, Claude, LLaMA-2, consider how you consume these. Adopting a unified, abstracted way to consume various models will make your application more maintainable as you scale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Feedback:&lt;/strong&gt; Understanding how real users interact with your LLMs is invaluable for guiding improvements. Make sure to capture real usage data and feedback so you can improve the user experience over time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enterprise Readiness:&lt;/strong&gt; Depending on your target market, enterprise-grade capabilities might be important. Think about fine-grained access controls and permissions, predictability and reliability SLAs, data security, privacy, and compliance assurance, automated testing and validation frameworks to ensure reliability, and more.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  LLMOps Platform - Give Pezzo a Try!
&lt;/h2&gt;

&lt;p&gt;Pezzo is the open-source (Apache 2.0) LLMOps platform. It addresses prompt management, versioning, instant delivery, A/B testing, fine-tuning, observability, monitoring, evaluation, collaboration and more.&lt;/p&gt;

&lt;p&gt;Regardless of where you’re at in your LLM adoption journey, consider using Pezzo. It takes exactly one minute to integrate, and endless value will come your way.&lt;/p&gt;

&lt;p&gt;If you’d like to learn more about Pezzo:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/pezzolabs/pezzo"&gt;Check out the Pezzo GitHub repository&lt;/a&gt; and consider giving us a star! ⭐️&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://pezzo.ai"&gt;Check out our website&lt;/a&gt; and try Pezzo Cloud&lt;/li&gt;
&lt;li&gt;&lt;a href="https://docs.pezzo.ai"&gt;Read the Pezzo Documentation&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>llm</category>
      <category>webdev</category>
      <category>opensource</category>
      <category>openai</category>
    </item>
  </channel>
</rss>
