<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Ken Collins</title>
    <description>The latest articles on Forem by Ken Collins (@metaskills).</description>
    <link>https://forem.com/metaskills</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F410373%2F64169ce5-6358-45f4-ae5e-2ee7e317601d.jpg</url>
      <title>Forem: Ken Collins</title>
      <link>https://forem.com/metaskills</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/metaskills"/>
    <language>en</language>
    <item>
      <title>Multi AI Agent Systems using OpenAI's new GPT-4o Model</title>
      <dc:creator>Ken Collins</dc:creator>
      <pubDate>Sat, 18 May 2024 02:52:12 +0000</pubDate>
      <link>https://forem.com/metaskills/multi-ai-agent-systems-using-openais-new-gpt-4o-model-15gi</link>
      <guid>https://forem.com/metaskills/multi-ai-agent-systems-using-openais-new-gpt-4o-model-15gi</guid>
<description>&lt;p&gt;A few weeks ago &lt;a href="https://www.unremarkable.ai/consistent-on-brand-artwork-using-ideogram-with-openai-assistants/"&gt;we explored&lt;/a&gt; using OpenAI's new Assistants API to build a personal creative assistant capable of creating consistent on-brand artwork using Ideogram. Back then I promised we would explore expert-based architectures in a future post... and today is that day. 🥳 &lt;/p&gt;

&lt;p&gt;Two major updates have happened since then. First, the Assistants API now supports &lt;a href="https://x.com/OpenAIDevs/status/1788693943544135864"&gt;vision&lt;/a&gt; 👀 allowing messages in a thread to become truly multi-modal. Second, and most important, OpenAI finally released a new model, &lt;a href="https://openai.com/index/hello-gpt-4o/"&gt;GPT-4o&lt;/a&gt;. The "o" stands for omni, and the model delivers.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Experts.js
&lt;/h2&gt;

&lt;p&gt;The new Assistants API (still in beta) from OpenAI sets a new industry standard, significantly advancing beyond the widely adopted Chat Completions API. It represents a major leap in the usability of AI agents and the way engineers interact with LLMs. Paired with the cutting-edge GPT-4o model, Assistants can now reference attached files &amp;amp; images as knowledge sources within a managed context window called a &lt;a href="https://github.com/metaskills/experts"&gt;Thread&lt;/a&gt;. Unlike &lt;a href="https://openai.com/index/introducing-gpts/"&gt;Custom GPTs&lt;/a&gt;, Assistants support instructions up to 256,000 characters, integrate with up to 128 tools, and utilize the innovative &lt;a href="https://platform.openai.com/docs/assistants/tools/file-search/vector-stores"&gt;Vector Store&lt;/a&gt; API for efficient file search on up to 10,000 files per assistant!&lt;/p&gt;

&lt;p&gt;Experts.js aims to simplify the usage of this new API by removing the complexity of managing Run objects and allowing Assistants to be linked together as Tools.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;thread&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;Thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;assistant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;MyAssistant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;assistant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;ask&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Say hello.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;thread&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;// Hello&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://github.com/metaskills/experts"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhyp6c2hjmw486zh6p3os.png" alt="Experts.js" width="800" height="208"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Experts.js
&lt;/h2&gt;

&lt;p&gt;Please read over the project's documentation on GitHub for a full breakdown of Experts.js capabilities and options. I think you will find the library small and easy to understand, and you will immediately see the value in using it. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/metaskills/experts"&gt;https://github.com/metaskills/experts&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, for our group, I wanted to explore a very real use case for Experts.js, where a company assistant acts as a main router for an entire panel of experts. This sales and routing expert has one tool, a merchandising expert. The merchandising tool in turn has its own tool, one capable of searching an OpenSearch vector database. The idea here is that each Assistant owns its domain and context. Why would a company sales assistant need to know (and waste tokens on) how to perform amazing OpenSearch queries? Likewise, being an amazing accounts or orders assistant requires context and tools that would likely confuse another. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd24t3es0kyfzkp01fcz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgd24t3es0kyfzkp01fcz.png" alt="Multi AI Agent System Example with Company Product Catalog BEFORE" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yet, given this architecture, there are some critical flaws that need to be addressed. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Data loss moving from left to right. The grapevine effect. See how messages get truncated or reinterpreted? Some of that behavior is good, you want experts to contextualize. However, this is clearly a problem.&lt;/li&gt;
&lt;li&gt;Assistants-only outputs. The typical mental model for most Multi-Agent Systems (MAS) takes the output of one LLM as the input or results to another. See how the Products Tool got all the great aggregate category information? But the main assistant only knows what it was told. If asked a follow-up question, it would not have the true data to respond. Worse, it may summarize a summary to the user. Also, that Products Tool's output is just wasted tokens. &lt;/li&gt;
&lt;li&gt;Some Assistants can leverage many tools, and some of those tools' outputs should flow into their parent's context. In this case, an image created by Code Interpreter has no way to make it to the parent company assistant.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxcccsfu1cwg9jliejze.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkxcccsfu1cwg9jliejze.png" alt="Multi AI Agent System Example with Company Product Catalog AFTER" width="800" height="286"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The fix is pretty simple. The Experts.js framework allows Tools to control their output, so we can redirect or pipe all knowledge where it needs to go. The grapevine data loss is an easy fix. Models such as gpt-4o are great at following instructions. A little prompt engineering ensures messages or tool calls have all the context they need.&lt;/p&gt;
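&lt;p&gt;Here is one way to picture that output control, sketched with made-up names rather than the real Experts.js API: the tool returns both a short summary and its raw data, and the parent pipes the raw data into its own context so follow-up questions are answered from true data.&lt;/p&gt;

```javascript
// Sketch of the "pipe the tool output" idea, with made-up names — not the
// real Experts.js API. No OpenAI calls are made.
class ProductsTool {
  ask() {
    const raw = { categories: { Bags: 12, Shoes: 30 } };
    return {
      summary: "We sell bags and shoes.",
      raw, // exposed to the parent instead of being wasted tokens
    };
  }
}

class ParentAssistant {
  constructor() {
    this.context = []; // stands in for the parent's thread messages
  }
  ask(question, tool) {
    const { summary, raw } = tool.ask(question);
    // Redirect the full tool output into our own context so follow-ups
    // are answered from true data, not a summary of a summary.
    this.context.push({ role: "tool", content: JSON.stringify(raw) });
    return summary;
  }
}

const parent = new ParentAssistant();
parent.ask("What do we sell?", new ProductsTool());
console.log(parent.context.length); // 1
```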

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fze2bqsikcgvd6yovgi7n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fze2bqsikcgvd6yovgi7n.png" alt="Multi AI Agent System Example with Company Product Catalog Thread Management" width="800" height="134"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lastly, thread management. By default, each &lt;a href="https://github.com/metaskills/experts#tools"&gt;Tool&lt;/a&gt; in Experts.js has its own thread &amp;amp; context. This avoids a potential &lt;a href="https://platform.openai.com/docs/assistants/how-it-works/thread-locks"&gt;thread locking&lt;/a&gt; issue, which happens if a Tool were to share an &lt;a href="https://github.com/metaskills/experts#assistants"&gt;Assistant's&lt;/a&gt; thread still waiting for tool outputs to be submitted. The diagram above illustrates how Experts.js manages threads on your behalf to avoid this problem.&lt;/p&gt;

&lt;p&gt;All questions to your experts require a thread ID. For chat applications, the ID would be stored on the client, such as in a URL path parameter. With Experts.js, no other client-side IDs are needed. As each &lt;a href="https://github.com/metaskills/experts#assistants"&gt;Assistant&lt;/a&gt; calls an LLM backed &lt;a href="https://github.com/metaskills/experts#tools"&gt;Tool&lt;/a&gt;, it will find or create a thread for that tool as needed. Experts.js stores this parent -&amp;gt; child thread relationship for you using OpenAI's &lt;a href="https://platform.openai.com/docs/api-reference/threads/modifyThread"&gt;thread metadata&lt;/a&gt;.&lt;/p&gt;
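&lt;p&gt;The bookkeeping can be pictured like this, using a plain object in place of OpenAI's thread metadata API. The helper names are my own, not the library's:&lt;/p&gt;

```javascript
// Sketch of the parent -> child thread bookkeeping, using a plain object in
// place of OpenAI's thread metadata API. Names are illustrative.
const threads = {}; // threadId -> { metadata: { ... } }
let nextId = 1;

function createThread(metadata = {}) {
  const id = `thread_${nextId++}`;
  threads[id] = { metadata };
  return id;
}

// Find or create the child thread for a given tool, recording the link in
// the parent thread's metadata so later runs reuse the same child thread.
function findOrCreateToolThread(parentId, toolName) {
  const meta = threads[parentId].metadata;
  if (!meta[toolName]) meta[toolName] = createThread({ parent: parentId });
  return meta[toolName];
}

const parentThread = createThread();
const childA = findOrCreateToolThread(parentThread, "products");
const childB = findOrCreateToolThread(parentThread, "products");
console.log(childA === childB); // true — the same child thread is reused
```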

&lt;h2&gt;
  
  
  Other Multi-Agent Systems (MAS)
&lt;/h2&gt;

&lt;p&gt;A lot of research has been done in this area, and we can expect a lot more in 2024. I promise to share some clarity around where I think this industry is headed. In personal talks I have warned that multi-agent systems are complex and hard to get right, and I've seen little evidence of real-world use cases. So if you are considering exploring MAS, put your &lt;a href="https://a16z.com/the-future-of-prosumer-the-rise-of-ai-native-workflows/"&gt;prosumer hat on&lt;/a&gt;, roll up your sleeves, and prepare to get your hands dirty with Python ☹️ &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.crewai.com"&gt;https://www.crewai.com&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/microsoft/autogen"&gt;https://github.com/microsoft/autogen&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/geekan/MetaGPT"&gt;https://github.com/geekan/MetaGPT&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In my opinion, exploration of multi-agent systems is going to require a broader audience of engineers. For AI to become a true commodity, it needs to move beyond its Python origins and into more popular languages like JavaScript 🟨, a major factor in why I wrote Experts.js.&lt;/p&gt;

&lt;p&gt;I very much hope folks enjoy this framework and that it helps the community at large figure out where and how Multi-Agent AI Systems can be effective. 💕&lt;/p&gt;

</description>
      <category>openai</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Consistent On-Brand Artwork using Ideogram + OpenAI Assistants</title>
      <dc:creator>Ken Collins</dc:creator>
      <pubDate>Fri, 26 Apr 2024 12:07:41 +0000</pubDate>
      <link>https://forem.com/metaskills/consistent-on-brand-artwork-using-ideogram-openai-assistants-55n</link>
      <guid>https://forem.com/metaskills/consistent-on-brand-artwork-using-ideogram-openai-assistants-55n</guid>
      <description>&lt;h2&gt;
  
  
  Better AI Images
&lt;/h2&gt;

&lt;p&gt;Chances are, if you're browsing an article online today, you're looking at an AI-generated image somewhere in that post. These images are typically clichéd and readily recognizable. They often feature a blue hue and depict scenes using robots or futuristic landscapes with brains and glowing cities, reminiscent of science fiction movies or modern cyberpunk themes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpi39tyhon221vq6cdltu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpi39tyhon221vq6cdltu.png" alt="Casual analysis of AI related content posts on LinkedIn with a featured image. Of which 70% used an AI-Generated image of which more than half are still using older AI tropes." width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwydkmbm6obefq1a5xzi5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwydkmbm6obefq1a5xzi5.png" alt="A series of 3 cliche AI photos from LinkedIn" width="714" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Having your brand or content stand out with AI-generated artwork could be a key differentiator. Larger media companies learned this last year when a collaborative group built a platform called &lt;a href="https://betterimagesofai.org/?ref=unremarkable.ai"&gt;Better Images of AI&lt;/a&gt; to promote more engaging representations of AI. Today, our text-to-image models are way more sophisticated, yet some folks may not have put in the work ✨ to get the outputs they need or still stick to older tropes.&lt;/p&gt;

&lt;p&gt;So for today, I'd like to share my latest experiment utilizing a tool called &lt;a href="https://ideogram.ai/?ref=unremarkable.ai"&gt;Ideogram&lt;/a&gt;. They promote themselves as 'Helping People Become More Creative' and indeed, they do deliver on this promise. Especially with images requiring properly spelled and stylized text. This service will become our illustrator agent. Our creative concept artist assistant will be built using OpenAI's latest &lt;a href="https://platform.openai.com/docs/assistants/how-it-works?ref=unremarkable.ai"&gt;Assistants API&lt;/a&gt;. Our goal is to develop an interactive copilot capable of leveraging human feedback and past experiences to seamlessly ideate and execute our creative requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture &amp;amp; Process
&lt;/h2&gt;

&lt;p&gt;This post will focus on the agentic architecture of our workflow vs. a technical deep dive into the code or the Assistants API behind it. This API (still in beta) is a foundational shift from their completions APIs, which required you to manage all messages &amp;amp; tool calls within an LLM's context window, a.k.a. memory. However, feel free to explore the code on GitHub. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bt9ydcmj35dujjaqjfw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9bt9ydcmj35dujjaqjfw.png" alt="OpenAI Assistants API with Ideogram Agentic Architecture" width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Explore the code on GitHub. &lt;a href="https://github.com/metaskills/unremarkable-ideogram-assistant"&gt;https://github.com/metaskills/unremarkable-ideogram-assistant&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take a moment and zoom into the image, this is not a formal sequence diagram. Each white box (except for the browser tool) represents an LLM-backed Assistant with its own thread. Similar to &lt;a href="https://openai.com/blog/introducing-gpts?ref=unremarkable.ai"&gt;Custom GPTs&lt;/a&gt;, each Assistant can leverage tools such as File Search (retrieval), Code Interpreter, and custom tools (functions). Unlike Custom GPTs, Assistants can leverage Instructions (system prompt) up to 256,000 characters! Simply massive new capabilities y'all.&lt;/p&gt;

&lt;p&gt;Why break this out into multiple assistants? Mainly because I am exploring various &lt;a href="https://towardsdatascience.com/generative-ai-design-patterns-a-comprehensive-guide-41425a40d7d0?ref=unremarkable.ai#fc13"&gt;Panel of Experts&lt;/a&gt; architectures. However, there is a more practical answer here, attention via separation of responsibilities. Consider that both the &lt;strong&gt;Creative&lt;/strong&gt; &amp;amp; &lt;strong&gt;Magic Prompts&lt;/strong&gt; assistants need to embody a role with varying levels of knowledge about the customer. Yet only one needs to know about their brand, color schemes, and elements. All of which are required context to translate brand requirements into 3rd party prompts. That context could easily skew the &lt;strong&gt;Creative&lt;/strong&gt; assistant's capabilities to be abstract or "think outside the box" when creating concepts &amp;amp; illustration instructions.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We will explore the Assistants API and expert-based architectures in future posts.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Everything else is straightforward: the user interface here is a simple Node.js console session supporting an interactive chat. What might not be obvious is why use a local browser executor? Ideogram, as amazing as they are, still lacks an API. So our agent will display your artwork using a macOS script to open the browser. Since the chat is ongoing, you can provide feedback on any or all Ideogram images. Repeat this cycle for as long as it takes to be happy with your creative. &lt;/p&gt;

&lt;h2&gt;
  
  
  Brand Identity
&lt;/h2&gt;

&lt;p&gt;If you do not already have brand guidelines for your illustration needs, here are a few helpful ways to wrangle them together. The brand guidelines are part of our &lt;strong&gt;Magic Prompts&lt;/strong&gt; assistant's instructions. This is where the assistant turns concepts &amp;amp; illustration details into on-brand magic prompts. Use descriptive color names that reflect your brand vs. their technical HEX or RGB values. &lt;/p&gt;

&lt;p&gt;Here is a ChatGPT prompt that can help you talk about your brand identity and identify useful guidelines.&lt;/p&gt;

&lt;p&gt;⬆️ PROMPT: Please come up with a plan to help me identify my brand guidelines by asking a series of questions, using, but not limited to, the target audience, brand personality, key messages, and visual style preferences such as color.&lt;/p&gt;

&lt;p&gt;Most won't be needed. Try to stay high level. In the next section we will focus more on the artistic and illustration style needs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example
&lt;/h3&gt;

&lt;p&gt;The unRemarkable.ai brand embodies simplicity, catering specifically to AI practitioners rather than focusing on the broader origins of the industry in machine learning or data science. It uses a basic color palette and a visual style that leverages visual metaphors. Here is a list of things you may need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Red Sharpie: #DF413F&lt;/li&gt;
&lt;li&gt;Yellow Sharpie: #FFC43C&lt;/li&gt;
&lt;li&gt;Hand-drawn with a heavy marker is preferred. &lt;/li&gt;
&lt;li&gt;Avoid paintbrush texture effects.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Brand Illustration Style
&lt;/h2&gt;

&lt;p&gt;We will need a list of detailed illustration styles for your brand to include in our Magic Prompts assistant's instructions. If you are starting from scratch, here is a way you can turn your current artwork or any creative inspiration into descriptive guidelines. I'm going to use &lt;a href="https://www.anthropic.com/?ref=unremarkable.ai"&gt;Anthropic's&lt;/a&gt; artwork as an example.&lt;/p&gt;

&lt;h3&gt;
  
  
  Describe with Ideogram
&lt;/h3&gt;

&lt;p&gt;Earlier this month, Ideogram hit several new milestones. Included was the ability to describe your images, turning them into detailed prompts which presumably can be used to generate the same image with Ideogram. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://about.ideogram.ai/1.0-upgrade"&gt;https://about.ideogram.ai/1.0-upgrade&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's get a peek at a few of Anthropic's illustrations from their &lt;a href="https://www.anthropic.com/news?ref=unremarkable.ai"&gt;news&lt;/a&gt; posts. I'll feed these into Ideogram, capture their description, then render an image using that description as the prompt.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;colgroup&gt;
    &lt;col&gt;
    &lt;col&gt;
    &lt;col&gt;
  &lt;/colgroup&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Anthropic&lt;/th&gt;
      &lt;th&gt;Ideogram Describe&lt;/th&gt;
      &lt;th&gt;Ideogram&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uJHPYQt_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/metaskills/unremarkable-ideogram-assistant/main/public/anthropic-1-rockpile.webp" width="384" height="384"&gt;&lt;/td&gt;
      &lt;td&gt;A &lt;strong&gt;minimalist illustration&lt;/strong&gt; of a hand holding a pink-colored circle. The hand is positioned on the left side of the image, and the circle is being gently held between the thumb and index finger. The background is a soft beige color, and the overall design is simplistic and elegant.&lt;/td&gt;
      &lt;td&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Hcr7EETu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/metaskills/unremarkable-ideogram-assistant/main/public/anthropic-1-rockpile-ideogram.webp" width="800" height="800"&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NaPnbPwQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/metaskills/unremarkable-ideogram-assistant/main/public/anthropic-2-todolist.webp" width="384" height="384"&gt;&lt;/td&gt;
      &lt;td&gt;A hand holding a clipboard with a checklist. The clipboard is set against a vibrant red background. The checklist contains three squiggly lines, each marked with a check. The hand appears to be in the process of checking off the first line.&lt;/td&gt;
      &lt;td&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JUCgtSKC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/metaskills/unremarkable-ideogram-assistant/main/public/anthropic-2-todolist-ideogram.webp" width="800" height="800"&gt;&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2Fvg43PH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/metaskills/unremarkable-ideogram-assistant/main/public/anthropic-3-newtonscradle.webp" width="384" height="384"&gt;&lt;/td&gt;
      &lt;td&gt;A &lt;strong&gt;simple, hand-drawn&lt;/strong&gt; depiction of a rectangular frame. Within this frame, there are three white spheres hanging vertically. One of the spheres is distinctly colored in a shade of orange, making it stand out from the others. The frame is set against a plain, light gray background.&lt;/td&gt;
      &lt;td&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---cxrLhaN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://raw.githubusercontent.com/metaskills/unremarkable-ideogram-assistant/main/public/anthropic-3-newtonscradle-ideogram.webp" width="800" height="800"&gt;&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Like most vision-capable LLMs, Ideogram is not capturing the conceptual details of each object needed to entirely recreate the source image. However, this is really good. Notice how it captured the colors and key descriptions? Also, in two cases it correctly inferred enough of the creative style (highlighted in bold) to almost capture Anthropic's visual brand. Take notes when you see patterns like these.&lt;/p&gt;

&lt;h3&gt;
  
  
  Describe with GPT-4V
&lt;/h3&gt;

&lt;p&gt;Just like Ideogram, ChatGPT users who have access to GPT-4V, OpenAI's vision model, can upload several images and ask the following.&lt;/p&gt;

&lt;p&gt;⬆️ PROMPT: Classify the artwork style in these images using bullet points.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Minimalist&lt;/strong&gt;: The images employ a very minimal amount of shapes and colors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modernist&lt;/strong&gt;: The straightforward depiction with little to no embellishment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Geometric&lt;/strong&gt;: Use of geometric forms to abstract real-life objects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flat&lt;/strong&gt;: The absence of shading or depth and the use of solid colors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Line Art&lt;/strong&gt;: The artwork relies on the clever use of lines to outline &amp;amp; define.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Instruction's Examples
&lt;/h2&gt;

&lt;p&gt;Both the &lt;strong&gt;Creative&lt;/strong&gt; and &lt;strong&gt;Magic Prompts&lt;/strong&gt; agents are going to need feedback loops: examples of what makes a great concept &amp;amp; illustration or Ideogram prompt. Commonly called few-shot prompting or in-context learning, these help "fine tune" your agent's behavior over time. Start with the Creative assistant, whose job it is to come up with a concept, the creative thinking behind it, and a detailed illustration description.&lt;/p&gt;

&lt;h3&gt;
  
  
  Starting from Scratch
&lt;/h3&gt;

&lt;p&gt;When starting out, it could be helpful to come up with a few examples manually based on styles you like. You can use Ideogram's or GPT-4V's describe capabilities to help you. Focus first on the Creative assistant's concept needs by writing very clear concept names, thinking and illustration descriptions. For example, using Anthropic's first image above.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Concept&lt;/strong&gt;: A hand removing a stone from the middle of a structured pile.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking&lt;/strong&gt;: Like the puzzle game of Jenga, the hand is grabbing a stone which would cause the ones above it to fall if removed. This illustrates a basic concept of 'Safety' as the stones above could hurt your hand. Or it could illustrate the 'How' of safety and if done wrong could cause negative impacts in other areas.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Illustration Description&lt;/strong&gt;: A haphazardly stacked small pile of circular stones in the shape of a triangular pile. The stones get smaller as they are stacked up 3 or 4 high. The pile consists roughly of 7 to 9 stones of varying sizes. The stone being pulled out has a few others resting on top. An arm extends from the left side with a hand holding onto a stone in the middle of the triangular pile indicating some might fall when it is pulled from the pile.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Capturing good Ideograms
&lt;/h3&gt;

&lt;p&gt;While chatting with the assistant, occasionally an amazing concept and illustration will surface. I'll use this Ideogram illustration I really liked when exploring the subject of a post tag called "Emergent Behavior".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgh6vtsxs8m0y8qm9aeo2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgh6vtsxs8m0y8qm9aeo2.png" alt="My unRemarkable.ai Emergent Behavior Ideogram Example" width="800" height="444"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When I saw the assistant come up with this idea, I captured the relevant bits and added them to the examples in the &lt;strong&gt;Creative&lt;/strong&gt; assistant's instructions. Note how the concept and illustration description are abstract and lack brand context. That's the goal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Concept&lt;/strong&gt;: Digital Vineyard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinking&lt;/strong&gt;: Depicts data as vine plants spreading across a digital landscape, symbolizing how information grows and intertwines, creating new pathways and connections, much like vines in a vineyard, representing the organic proliferation of digital networks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Illustration Description&lt;/strong&gt;: This image features two rows of utility poles connected by multiple horizontal wires, with green vines and leaves intertwining with the wires. Glowing orbs representing data are interspersed among the leaves, giving the impression of lights along the wires.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is the magic prompt that was created for the creative concept. I added this one along with the concept above to the Magic Prompt assistant's instructions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Magic Prompt&lt;/strong&gt;: A minimalist and abstract illustration, hand-drawn with bold, heavy strokes in black marker on a yellow background. Wires and poles stretch across the canvas, with vines in dark green carrying glowing data nodes, intertwining and expanding, depicting the organic spread of digital networks.&lt;/li&gt;
&lt;/ul&gt;
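&lt;p&gt;Folding captured examples like these into an assistant's instructions can be as simple as string assembly. The template and field names below are my own illustration, not code from the repo:&lt;/p&gt;

```javascript
// Illustrative sketch: build an instructions string from brand guidelines
// plus captured few-shot examples. Template and field names are made up.
const brandGuidelines = [
  "Hand-drawn with a heavy marker is preferred.",
  "Avoid paintbrush texture effects.",
];

const examples = [
  {
    concept: "Digital Vineyard",
    magicPrompt:
      "A minimalist and abstract illustration, hand-drawn with bold, heavy strokes in black marker on a yellow background.",
  },
];

function buildInstructions(role, guidelines, shots) {
  const shotText = shots
    .map((s) => `Concept: ${s.concept}\nMagic Prompt: ${s.magicPrompt}`)
    .join("\n\n");
  return `${role}\n\nBrand guidelines:\n- ${guidelines.join("\n- ")}\n\nExamples:\n${shotText}`;
}

const instructions = buildInstructions(
  "You turn creative concepts into on-brand Ideogram prompts.",
  brandGuidelines,
  examples
);
console.log(instructions.includes("Digital Vineyard")); // true
```

&lt;p&gt;Each newly captured example then appends one more entry to the list, and the assistant is recreated with the updated instructions.&lt;/p&gt;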

&lt;h2&gt;
  
  
  Using the Creative Assistant
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/metaskills/unremarkable-ideogram-assistant?ref=unremarkable.ai"&gt;https://github.com/metaskills/unremarkable-ideogram-assistant?ref=unremarkable.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Simply run the npm command in your terminal to start your creative copilot. Our demo code is written in such a way that any changes to the assistants or their instructions will recreate the underlying OpenAI Assistants, so you can get immediate feedback after adding new instructions such as brand guidelines or examples to learn from.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ npm run assistant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A demonstration of me chatting with my Ideogram assistant for this blog post. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7hfxflcp9hgb66mx8gi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7hfxflcp9hgb66mx8gi.png" alt="A screenshot of my unRemarkable.ai Ideogram assistant at work" width="800" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Tips &amp;amp; Improvements
&lt;/h2&gt;

&lt;p&gt;Overall this is a very iterative process and I'm still learning how to make this assistant work for me with less feedback. Here are some things I recommend if you are doing something similar.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Look for artistic hints in the prompts generated that work for you. Adding them to your &lt;strong&gt;Magic Prompt&lt;/strong&gt; assistant's instructions as brand guidelines or examples could help future iterations.&lt;/li&gt;
&lt;li&gt;When starting, your Ideogram prompts might not be magic ✨. Try setting Ideogram's "Magic Prompt" to on and see if it helps. &lt;/li&gt;
&lt;li&gt;Leverage Ideogram's editor to correct spelling mistakes using their remix feature with an 80-90% image weight. &lt;/li&gt;
&lt;li&gt;While our examples focus on illustrations, this process should work for any creative image type, such as realistic photography.&lt;/li&gt;
&lt;li&gt;Post process with your favorite image editor. Try not to make Ideogram do everything. For example, all my images are color corrected with Pixelmator's replace color feature.&lt;/li&gt;
&lt;li&gt;If you are not on a Mac, feel free to swap my usage of AppleScript for something else. ⚠️  Ideogram uses MUI and I found it nearly impossible to automate their UI with JavaScript.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;I'm certain there are numerous tools available for accomplishing this task, and I encourage you to share your methods in the comments. 💞&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Other Examples
&lt;/h2&gt;

&lt;p&gt;Here are a few Ideograms created by my assistant while working on this post.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqwrraxsuew8hdz99r8a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdqwrraxsuew8hdz99r8a.png" alt="Image description" width="713" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks everyone. Please let me know if you found this useful; I would love to hear from folks who are solving this type of problem with other tools. Remember, to get updates on future posts, you &lt;a href="https://www.unremarkable.ai/#/portal"&gt;can sign up for my newsletter&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>openai</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
    <item>
      <title>RAGs To Riches - Part #2 Building On Lambda</title>
      <dc:creator>Ken Collins</dc:creator>
      <pubDate>Tue, 05 Sep 2023 01:10:25 +0000</pubDate>
      <link>https://forem.com/aws-heroes/rags-to-riches-part-2-building-on-lambda-2c9g</link>
      <guid>https://forem.com/aws-heroes/rags-to-riches-part-2-building-on-lambda-2c9g</guid>
      <description>&lt;p&gt;Welcome to the second part of this two-part series on using AWS Lambda to build a retrieval-augmented generation (RAG) application with OpenAI. In this part, we will cover creating a ChatGPT proxy application that you can run locally and explore integration patterns with OpenAI. Please read the first part of this series &lt;a href="https://dev.to/aws-heroes/rags-to-riches-part-1-generative-ai-retrieval-4pd7"&gt;Generative AI &amp;amp; Retrieval&lt;/a&gt; which covers the basics of generative AI and retrieval and the purpose of this demo application.&lt;/p&gt;

&lt;h2&gt;
  
  
  Demo Application
&lt;/h2&gt;

&lt;p&gt;Want to jump right to the end and play around with your new LambdaRAG chat application? Head on over to the GitHub repo (&lt;a href="https://github.com/metaskills/lambda-rag" rel="noopener noreferrer"&gt;https://github.com/metaskills/lambda-rag&lt;/a&gt;), clone or fork it, and follow the instructions in the README. Keep reading if you want to dig into more details of how this application works to retrieve external knowledge and generate responses. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/metaskills/lambda-rag" rel="noopener noreferrer"&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmetaskills%2Flambda-rag%2Fmain%2Fpublic%2Flambda-rag-start-light.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmetaskills%2Flambda-rag%2Fmain%2Fpublic%2Flambda-rag-start-light.png" alt="Screenshot of the LambdaRAG Demo application."&gt;&lt;/a&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Technologies Used
&lt;/h2&gt;

&lt;p&gt;This demo application uses a split-stack architecture. Meaning there is a distinct front-end and back-end. The front-end is a &lt;a href="https://vuejs.org/" rel="noopener noreferrer"&gt;💚 Vue.js&lt;/a&gt; application with &lt;a href="https://pinia.vuejs.org" rel="noopener noreferrer"&gt;🍍 Pinia&lt;/a&gt; for state and &lt;a href="https://vitejs.dev/" rel="noopener noreferrer"&gt;⚡️ Vite&lt;/a&gt; for development. The front-end also uses &lt;a href="https://tailwindcss.com/" rel="noopener noreferrer"&gt;🌊 Tailwind CSS&lt;/a&gt; along with &lt;a href="https://daisyui.com" rel="noopener noreferrer"&gt;🌼 daisyUI&lt;/a&gt; for styling. The back-end is a &lt;a href="https://nodejs.org/" rel="noopener noreferrer"&gt;🟨 Node.js&lt;/a&gt; application that uses &lt;a href="https://expressjs.com/" rel="noopener noreferrer"&gt;❎ Express&lt;/a&gt; for the HTTP framework, and &lt;a href="https://github.com/asg017/sqlite-vss" rel="noopener noreferrer"&gt;🪶 SQLite3 VSS&lt;/a&gt; along with &lt;a href="https://github.com/WiseLibs/better-sqlite3" rel="noopener noreferrer"&gt;🏆 better-sqlite3&lt;/a&gt; for vector storage and search. &lt;/p&gt;

&lt;p&gt;Throughout the post we will explore various technologies in more detail and how they help us build a RAG application while learning the basics of AI-driven integrations and prompt engineering. This is such a fun space. I hope you enjoy it as much as I do! &lt;/p&gt;

&lt;p&gt;⚠️ DISCLAIMER: I used ChatGPT to build most of this application. It has been several years since I did any heavy client-side JavaScript, so I used this RAG application as an opportunity to learn Vue.js with AI's help.&lt;/p&gt;

&lt;h2&gt;
  
  
  Working Backwards - Why Lambda?
&lt;/h2&gt;

&lt;p&gt;So let's start with the end in mind. Our &lt;a href="https://github.com/metaskills/lambda-rag" rel="noopener noreferrer"&gt;LambdaRAG Demo&lt;/a&gt; runs locally to make it easy to develop and learn. At some point though you may want to ship it to production or share your work with others. So why deploy to Lambda and what benefits does that deployment option offer? A few thoughts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Lambda makes it easy to deploy &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/images-create.html" rel="noopener noreferrer"&gt;containerized applications&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Lambda's &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/lambda-urls.html" rel="noopener noreferrer"&gt;Function URLs&lt;/a&gt; are managed API Gateway reverse proxies.&lt;/li&gt;
&lt;li&gt;The &lt;a href="https://github.com/awslabs/aws-lambda-web-adapter" rel="noopener noreferrer"&gt;Lambda Web Adapter&lt;/a&gt; makes streaming API responses simple.&lt;/li&gt;
&lt;li&gt;Container tools like &lt;a href="https://github.com/rails-lambda/crypteia" rel="noopener noreferrer"&gt;Crypteia&lt;/a&gt; make secure SSM-backed secrets easy.&lt;/li&gt;
&lt;li&gt;Lambda containers allow images up to 10GB in size. Great for an embedded SQLite DB.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Of all of these, I think &lt;a href="https://aws.amazon.com/blogs/compute/introducing-aws-lambda-response-streaming/" rel="noopener noreferrer"&gt;Response Streaming&lt;/a&gt; is the most powerful. A relatively new feature for Lambda, it enables our RAG to stream text back to the web client just like ChatGPT. It also allows Lambda to break the 6MB response payload and 30s timeout limits. These few lines in the project's &lt;code&gt;template.yaml&lt;/code&gt;, along with the Lambda Web Adapter, make it all possible.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;FunctionUrlConfig&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;AuthType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;NONE&lt;/span&gt;
  &lt;span class="na"&gt;InvokeMode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;RESPONSE_STREAM&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before you run &lt;code&gt;./bin/deploy&lt;/code&gt; for the first time, make sure to log into the AWS Console and navigate to &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/systems-manager-parameter-store.html" rel="noopener noreferrer"&gt;SSM Parameter Store&lt;/a&gt;. From there, create a secret string parameter with the path &lt;code&gt;/lambda-rag/OPENAI_API_KEY&lt;/code&gt; and paste in your OpenAI API key. &lt;/p&gt;
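If you prefer the command line, the same parameter can be created with the AWS CLI (assuming your CLI is already configured with credentials; replace the placeholder value with your real key):

```shell
# Create the SecureString parameter the deploy expects.
# The path must match what the application reads at runtime.
aws ssm put-parameter \
  --name "/lambda-rag/OPENAI_API_KEY" \
  --type "SecureString" \
  --value "sk-..."
```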

&lt;h2&gt;
  
  
  OpenAI API Basics
&lt;/h2&gt;

&lt;p&gt;Our backend has a very basic &lt;a href="https://github.com/metaskills/lambda-rag/blob/main/src/utils/openai.js" rel="noopener noreferrer"&gt;&lt;code&gt;src/utils/openai.js&lt;/code&gt;&lt;/a&gt; module. This exports an OpenAI client as well as a helper function to create embeddings. We cover &lt;a href="https://platform.openai.com/docs/guides/embeddings" rel="noopener noreferrer"&gt;Embeddings&lt;/a&gt; briefly in the &lt;a href="https://dev.to/aws-heroes/rags-to-riches-part-1-generative-ai-retrieval-4pd7"&gt;Basic Architecture&lt;/a&gt; section of the first part of this series. This helper simply turns a user's query into a vector embedding, which is later queried against our SQLite database. There are numerous ways to create and query embeddings; for now we are going to keep it simple and use OpenAI's &lt;code&gt;text-embedding-ada-002&lt;/code&gt; model, which outputs 1536-dimensional embeddings.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;process&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;createEmbedding&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text-embedding-ada-002&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So how does OpenAI's API work to create a chat interface and how does the &lt;a href="https://dev.to/aws-heroes/rags-to-riches-part-1-generative-ai-retrieval-4pd7"&gt;Context Window&lt;/a&gt; discussed in part one come into play? Consider the following screenshot where I tell LambdaRAG my name and then ask if it remembers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmetaskills%2Flambda-rag%2Fmain%2Fpublic%2Flambda-rag-name-light.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmetaskills%2Flambda-rag%2Fmain%2Fpublic%2Flambda-rag-name-light.png" alt="Screenshot of the LambdaRAG Demo application."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;ChatGPT is stateless, like most web applications; there is no server-side session for the LLM. Every time you send a message, you have to send all the previous messages (the context) to the &lt;a href="https://platform.openai.com/docs/api-reference/completions" rel="noopener noreferrer"&gt;Completions&lt;/a&gt; endpoint. This is why we use &lt;a href="https://pinia.vuejs.org" rel="noopener noreferrer"&gt;🍍 Pinia&lt;/a&gt; for client-side state management. From an API perspective, the request looks something like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
 &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-3.5-turbo-16k&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
 &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hello my name is Ken Collins.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;assistant&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hello Ken Collins! How can I...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Do you remember my name?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
 &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Did you notice how the assistant responded not only with my name but also knew it was here to help us with Luxury Apparel? This is a technique called &lt;a href="https://learnprompting.org/docs/basics/roles" rel="noopener noreferrer"&gt;Role Prompting&lt;/a&gt;. We do this in the LambdaRAG Demo by prepending this role to the user's first message in the &lt;a href="https://github.com/metaskills/lambda-rag/blob/main/src-frontend/utils/roleprompt.js" rel="noopener noreferrer"&gt;&lt;code&gt;src-frontend/utils/roleprompt.js&lt;/code&gt;&lt;/a&gt; file.&lt;/p&gt;
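The role-prompting idea can be sketched as a small helper. This is a hypothetical simplification of what `roleprompt.js` does; the function name and the prompt text here are my own illustration, not the demo's exact code:

```javascript
// Sketch of role prompting: prepend persona instructions to the
// user's first message before sending the thread to OpenAI.
const ROLE_PROMPT =
  "You are a helpful shopping assistant for a Luxury Apparel store.";

function withRolePrompt(messages) {
  // Only the first user message gets the role prefix; later turns
  // already carry it because the full context is resent every time.
  return messages.map((msg, i) =>
    i === 0 && msg.role === "user"
      ? { ...msg, content: `${ROLE_PROMPT}\n\n${msg.content}` }
      : msg
  );
}
```

Because the whole thread is resent on every request, prepending the role once to the first message is enough for the model to stay in character.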

&lt;p&gt;You may have noticed that the LambdaRAG Demo is written entirely in 💛 JavaScript vs. Python. As you learn more about building AI applications, you may eventually need Python as well as more advanced frameworks like &lt;a href="https://js.langchain.com/docs/get_started/introduction/" rel="noopener noreferrer"&gt;🦜️🔗 LangChain&lt;/a&gt; or Hugging Face's &lt;a href="https://huggingface.co/docs/transformers.js/index" rel="noopener noreferrer"&gt;🤗 Transformers.js&lt;/a&gt;, both of which have JavaScript versions. I hope this trend of providing JavaScript clients continues; it feels like a more accessible language.&lt;/p&gt;

&lt;p&gt;In the next section, we will cover how to create embeddings with your data and query for documents using SQLite's new VSS extension.&lt;/p&gt;

&lt;h2&gt;
  
  
  Proprietary Data &amp;amp; Embeddings
&lt;/h2&gt;

&lt;p&gt;💁‍♂️ The LambdaRAG Demo application contains a ready-to-use SQLite database with ~5,000 products from the &lt;a href="https://www.kaggle.com/datasets/chitwanmanchanda/luxury-apparel-data" rel="noopener noreferrer"&gt;Luxury Apparel Dataset&lt;/a&gt; on Kaggle. It also has vector embeddings pre-seeded and ready to use!&lt;/p&gt;

&lt;p&gt;Before we dig into &lt;a href="https://github.com/asg017/sqlite-vss" rel="noopener noreferrer"&gt;sqlite-vss&lt;/a&gt;, I'd like to explain why I think this extension is so amazing. To date, I have found sqlite-vss the easiest and quickest way to explore vector embeddings. Many GenAI projects use &lt;a href="https://supabase.com" rel="noopener noreferrer"&gt;Supabase&lt;/a&gt; which seems great but is difficult to run locally. The goal here is to learn!&lt;/p&gt;

&lt;p&gt;As your application grows, I highly recommend looking at &lt;a href="https://aws.amazon.com/blogs/big-data/introducing-the-vector-engine-for-amazon-opensearch-serverless-now-in-preview/" rel="noopener noreferrer"&gt;Amazon OpenSearch Serverless&lt;/a&gt;. It is a fully managed, highly scalable, and cost-effective service that supports vector similarity search. It even supports &lt;a href="https://opensearch.org/docs/latest/search-plugins/knn/filter-search-knn/" rel="noopener noreferrer"&gt;pre-filtering with FAISS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Let's look at &lt;a href="https://github.com/asg017/sqlite-vss" rel="noopener noreferrer"&gt;sqlite-vss&lt;/a&gt; a bit closer. This article &lt;a href="https://observablehq.com/@asg017/introducing-sqlite-vss" rel="noopener noreferrer"&gt;A SQLite Extension for Vector Search&lt;/a&gt; does an amazing job covering the creation of standard tables as well as virtual tables for embeddings and how to query them both. The LambdaRAG Demo follows all these patterns closely in our &lt;a href="https://github.com/metaskills/lambda-rag/blob/main/db/create.js" rel="noopener noreferrer"&gt;&lt;code&gt;db/create.js&lt;/code&gt;&lt;/a&gt; file. Our resulting schema is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="nb"&gt;INTEGER&lt;/span&gt; &lt;span class="k"&gt;PRIMARY&lt;/span&gt; &lt;span class="k"&gt;KEY&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;subCategory&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="nb"&gt;TEXT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt; &lt;span class="nb"&gt;BLOB&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="nv"&gt;"vss_products_index"&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rowid&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="n"&gt;autoincrement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;sqlite_sequence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;seq&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;EXISTS&lt;/span&gt; &lt;span class="nv"&gt;"vss_products_data"&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rowid&lt;/span&gt; &lt;span class="nb"&gt;integer&lt;/span&gt; &lt;span class="k"&gt;primary&lt;/span&gt; &lt;span class="k"&gt;key&lt;/span&gt; &lt;span class="n"&gt;autoincrement&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="n"&gt;VIRTUAL&lt;/span&gt; &lt;span class="k"&gt;TABLE&lt;/span&gt; &lt;span class="n"&gt;vss_products&lt;/span&gt; &lt;span class="k"&gt;using&lt;/span&gt; &lt;span class="n"&gt;vss0&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;embedding&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1536&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
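To see how retrieval against this schema might look, here is a hedged sketch using the `better-sqlite3` API the demo depends on. The `searchProducts` function, the injected `db` handle, and the JOIN shape are my own illustration rather than the demo's exact query:

```javascript
// Sketch: nearest-neighbor lookup with sqlite-vss, joined back to the
// products table for the human-readable columns. The db handle is
// injected so the wiring can be exercised without the native extension.
function searchProducts(db, embeddingJson, limit = 5) {
  const sql = `
    SELECT p.id, p.name, p.description, v.distance
    FROM vss_products v
    JOIN products p ON p.id = v.rowid
    WHERE vss_search(v.embedding, ?)
    LIMIT ${limit}`;
  return db.prepare(sql).all(embeddingJson);
}
```

The embedding parameter is the JSON string our `createEmbedding` helper returns, and `distance` lets you rank or filter the matches before handing them to the LLM.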



&lt;p&gt;If you want to re-create the SQLite database or build a custom dataset, you can do so by editing &lt;code&gt;db/create.js&lt;/code&gt; and running &lt;code&gt;npm run db:create&lt;/code&gt;. This will drop the existing database and re-create it with data from any CSV file(s), supporting schema, or process you are willing to code up.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; npm run db:create
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; lambda-rag@1.0.0 db:create
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; db/lambdarag.db &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; node db/create.js
Using sqlite-vss version: v0.1.1
Inserting product data...
 ██████████████████████████████████░░░░░░ 84% | ETA: 2s | 4242/5001
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Afterward, you need to run the &lt;code&gt;npm run db:embeddings&lt;/code&gt; script, which uses the OpenAI API to create embeddings for each product. This takes a few minutes to complete all the API calls. The task includes a local cache to make re-runs faster. Lastly, there is an &lt;code&gt;npm run db:clean&lt;/code&gt; script that runs a &lt;code&gt;VACUUM&lt;/code&gt; on the DB to reclaim wasted space from the virtual tables. Again, all of this is only required if you want to re-create the database or build a custom dataset. There is a &lt;code&gt;./bin/setup-db&lt;/code&gt; wrapper script that does all these steps for you.&lt;/p&gt;
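The cache-backed embedding pass can be sketched like this. It is a simplified illustration, not the demo's exact code: `embedAll`, the injected `embedFn`, and the `Map` cache are assumptions standing in for the real script and its on-disk cache:

```javascript
// Sketch: create an embedding per product, skipping anything already
// cached so an interrupted or repeated run only pays for new items.
async function embedAll(products, embedFn, cache = new Map()) {
  for (const product of products) {
    if (cache.has(product.id)) continue; // cached from a previous run
    const text = `${product.name}: ${product.description}`;
    cache.set(product.id, await embedFn(text)); // one API call per product
  }
  return cache;
}
```

With ~5,000 products this pattern matters: a failed run resumes where it left off instead of re-billing every OpenAI call.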

&lt;h2&gt;
  
  
  Retrieval with Function Calling
&lt;/h2&gt;

&lt;p&gt;OK, so we have a database of products and their matching vector embeddings to use for semantic search. How do we code up going from chat to retrieving items from the database? OpenAI has this amazing feature named &lt;a href="https://platform.openai.com/docs/guides/gpt/function-calling" rel="noopener noreferrer"&gt;Function Calling&lt;/a&gt;. In our demo, it allows the LLM to search for products and describe the results to you.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmetaskills%2Flambda-rag%2Fmain%2Fpublic%2Flambda-rag-hats-light.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmetaskills%2Flambda-rag%2Fmain%2Fpublic%2Flambda-rag-hats-light.png" alt="Screenshot of the LambdaRAG Demo application."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But how does it know? You simply describe an &lt;a href="https://github.com/metaskills/lambda-rag/blob/main/src/utils/functions.json" rel="noopener noreferrer"&gt;array of functions&lt;/a&gt; that your application implements during a chat completion API call. OpenAI will 1) automatically determine whether a function should be called, and 2) return the name of the function to call along with the needed parameters. Your request looks something like this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-3.5-turbo-16k&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;functions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[{"search_products":{"parameters": {"query": "string"}}}]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I need a cool trucker hat.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a function has been selected, the response will include the name of the function and its parameters. Your responsibility is to check for this, then call your application's code matching that function and those parameters. For LambdaRAG, this means querying the database and returning any matching rows. We do this in our &lt;a href="https://github.com/metaskills/lambda-rag/blob/main/src/models/products.js" rel="noopener noreferrer"&gt;&lt;code&gt;src/models/products.js&lt;/code&gt;&lt;/a&gt; file.&lt;/p&gt;
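That check-and-dispatch step can be sketched as follows. The response shape mirrors the `function_call` field used by this era of the Chat Completions API; `maybeCallFunction` and the `handlers` lookup table are hypothetical names for illustration:

```javascript
// Sketch: inspect a chat completion for a requested function call and
// dispatch to the matching application handler with parsed arguments.
async function maybeCallFunction(response, handlers) {
  const message = response.choices[0].message;
  const call = message.function_call;
  if (!call) return null; // plain text answer, nothing to dispatch
  const args = JSON.parse(call.arguments); // arguments arrive as JSON text
  return handlers[call.name](args);
}
```

A handler like `search_products` would embed the query, run the vector search, and return the matching rows to feed back into the next completion request.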

&lt;p&gt;For OpenAI to respond with the results, we send it another request that now has two additional messages included. The first has the role "function" and includes the name and parameters of the function you were asked to call. The second has the role "user" and includes the JSON data of the products returned from our retrieval process. OpenAI will now respond as if it had this knowledge all along!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-3.5-turbo-16k&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;functions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[{"search_products":{"parameters": {"query": "string"}}}]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I need a cool trucker hat.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;search_products&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{"query":"trucker hats"}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[{"id":3582,"name":"Mens Patagonia Logo Trucker Hat..."}]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
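
&lt;p&gt;Before that second request can be built, the first response must be inspected for a function call and the retrieval actually performed. Here is a minimal sketch of that dispatch step; the &lt;code&gt;searchProducts&lt;/code&gt; helper is a hypothetical stand-in for your own retrieval code, not part of the demo:&lt;/p&gt;

```javascript
// Build the two extra messages from the model's function_call response,
// following the same pattern as the request above: the "function" message
// carries the arguments, the "user" message carries the retrieved JSON.
// `searchProducts` is a hypothetical stand-in for your retrieval code.
function buildFunctionMessages(assistantMessage, searchProducts) {
  const call = assistantMessage.function_call;
  if (!call) return []; // No function call; nothing to append.
  const args = JSON.parse(call.arguments);
  const results = searchProducts(args.query);
  return [
    { role: "function", name: call.name, content: call.arguments },
    { role: "user", content: JSON.stringify(results) },
  ];
}
```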



&lt;p&gt;Since all messages are maintained in client-side state, you can see them using a neat debug technique. Open up the &lt;a href="https://github.com/metaskills/lambda-rag/blob/main/src-frontend/components/Message.vue" rel="noopener noreferrer"&gt;&lt;code&gt;src-frontend/components/Message.vue&lt;/code&gt;&lt;/a&gt; file and make the following change.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt;  'border-b-base-300': true,
  'bg-base-200': data.role === 'user',
&lt;span class="gd"&gt;- 'hidden': data.hidden,
&lt;/span&gt;&lt;span class="gi"&gt;+ 'hidden': false,
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can now see all the messages' state in the UI. This is a great way to debug your application and see what is happening. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmetaskills%2Flambda-rag%2Fmain%2Fpublic%2Flambda-rag-hats-wfun-light.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmetaskills%2Flambda-rag%2Fmain%2Fpublic%2Flambda-rag-hats-wfun-light.png" alt="Screenshot of the LambdaRAG Demo application."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  More To Explore
&lt;/h2&gt;

&lt;p&gt;I hope this quick overview of how OpenAI's chat completions can be augmented for knowledge retrieval was helpful. There is so much more to explore and do. Here are some ideas to get you started:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All responses are streamed from the server. The &lt;code&gt;fetchResponse&lt;/code&gt; function in the &lt;a href="https://github.com/metaskills/lambda-rag/blob/main/src-frontend/stores/messages.js" rel="noopener noreferrer"&gt;&lt;code&gt;src-frontend/stores/messages.js&lt;/code&gt;&lt;/a&gt; Pinia store does all the work here and manages client-side state.&lt;/li&gt;
&lt;li&gt;That same file also converts the streaming responses' Markdown into HTML. This is how the demo can build tables just like ChatGPT does.&lt;/li&gt;
&lt;li&gt;Sometimes the keywords passed to the search products function can be sparse. Consider making an API call to extend the keywords of the query using the original message. You can use functions here too!&lt;/li&gt;
&lt;li&gt;Consider adding more retrieval methods to the &lt;a href="https://github.com/metaskills/lambda-rag/blob/main/src/utils/functions.json" rel="noopener noreferrer"&gt;&lt;code&gt;src/utils/functions.json&lt;/code&gt;&lt;/a&gt; file. For example, a &lt;code&gt;find_style&lt;/code&gt; by ID method that would directly query the database.&lt;/li&gt;
&lt;/ul&gt;
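
&lt;p&gt;As a sketch of the keyword-expansion idea above, you could build a small extra chat completion request like this. The prompt wording and the &lt;code&gt;buildKeywordExpansionRequest&lt;/code&gt; shape are assumptions for illustration, not part of the demo:&lt;/p&gt;

```javascript
// Sketch: build a request asking the model to expand sparse search keywords
// using the user's original message. Prompt wording is illustrative only.
function buildKeywordExpansionRequest(originalMessage, keywords) {
  return {
    model: "gpt-3.5-turbo",
    messages: [
      {
        role: "system",
        content:
          "Expand the given search keywords using the user's message. " +
          "Reply with a short comma-separated keyword list only.",
      },
      { role: "user", content: "Message: " + originalMessage },
      { role: "user", content: "Keywords: " + keywords.join(", ") },
    ],
  };
}
```

The returned object would then be passed to `openai.chat.completions.create` before running the product search with the expanded keywords.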

&lt;p&gt;❤️ I hope you enjoyed these posts and find the LambdaRAG Demo application useful in learning how to use AI for knowledge retrieval. Feel free to ask questions and share your thoughts on this post. Thank you!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>serverless</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>RAGs To Riches - Part #1 Generative AI &amp; Retrieval</title>
      <dc:creator>Ken Collins</dc:creator>
      <pubDate>Mon, 04 Sep 2023 03:30:00 +0000</pubDate>
      <link>https://forem.com/aws-heroes/rags-to-riches-part-1-generative-ai-retrieval-4pd7</link>
      <guid>https://forem.com/aws-heroes/rags-to-riches-part-1-generative-ai-retrieval-4pd7</guid>
      <description>&lt;p&gt;LambdaRAG Demo: &lt;a href="https://github.com/metaskills/lambda-rag" rel="noopener noreferrer"&gt;https://github.com/metaskills/lambda-rag&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Welcome to my two-part series on using AWS Lambda to build a retrieval-augmented generation (RAG) application with OpenAI. In this first part, we will explore the basics of generative AI and retrieval. In the second part, we will build a RAG application using AWS Lambda, Express, and SQLite VSS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Knowledge Navigator
&lt;/h2&gt;

&lt;p&gt;If you have interacted at all this year with OpenAI's &lt;a href="https://openai.com/chatgpt" rel="noopener noreferrer"&gt;💬 ChatGPT&lt;/a&gt;, you may think we are closer than ever to &lt;a href="https://www.youtube.com/watch?v=mE2Z30pyw8c" rel="noopener noreferrer"&gt;🎥 Apple's 1987 Knowledge Navigator&lt;/a&gt;. A way to interact with a computer that most of us have only seen in SciFi movies. At first glance today, these language models' knowledge does feel eerily expansive and unlimited. With very little context they can solve interesting problems in highly probabilistic ways.&lt;/p&gt;

&lt;p&gt;Yet for knowledge workers, these models are still very limited in their ability to help us. We have all seen the "As of my last update..." or "I don't have real-time..." messages when asking about current events or within a highly specific domain. For those looking to use data, often proprietary, with our friendly Large Language Models (LLMs), it is frustrating to hit these roadblocks. After all, half of the video above is an AI resembling Mark Zuckerberg in a bow tie responding to new data. But how?&lt;/p&gt;

&lt;h2&gt;
  
  
  Retrieval-Augmented Generation (RAG)
&lt;/h2&gt;

&lt;p&gt;Did you know that "GPT" in ChatGPT stands for &lt;a href="https://en.wikipedia.org/wiki/Generative_pre-trained_transformer" rel="noopener noreferrer"&gt;ℹ️ Generative Pre-Trained Transformers&lt;/a&gt;? The key words here are "generative" and "pre-trained". From these terms it is easy to understand that ChatGPT is pre-trained on massive amounts of knowledge and generates real language responses based on that knowledge. &lt;/p&gt;

&lt;p&gt;A GPT model can learn new knowledge in one of two ways. The first is via model weights on a training set (fine-tuning). The other is via model inputs, inserting knowledge into a context window (retrieval). While fine-tuning may seem like a straightforward method for teaching GPT, it is typically not recommended for factual recall, but rather for &lt;a href="https://www.youtube.com/watch?v=539DTibvZuA" rel="noopener noreferrer"&gt;💩 specialized tasks&lt;/a&gt;. OpenAI has a great cookbook titled &lt;a href="https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb" rel="noopener noreferrer"&gt;📚 Question answering using embeddings-based search&lt;/a&gt; where they make these points about your choices.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt;Fine-Tuning&lt;/th&gt;
      &lt;th&gt;Retrieval&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;
        As an analogy, model weights are like long-term memory. When you fine-tune a model, it's like studying for an exam a week away. When the exam arrives, the model may forget details, or misremember facts it never read.
      &lt;/td&gt;
      &lt;td&gt;
        In contrast, message inputs are like short-term memory. When you insert knowledge into a message, it's like taking an exam with open notes. With notes in hand, the model is more likely to arrive at correct answers.
      &lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Open book exam? I'm sold. After all, this is how the internet works today. Information is retrieved over many different protocols, locations, and APIs. If you would like to keep exploring this topic, I have included some references below on how retrieval-augmented generation fits into the current market and on the opportunities and tools available today.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://ai.meta.com/blog/retrieval-augmented-generation-streamlining-the-creation-of-intelligent-natural-language-processing-models/" rel="noopener noreferrer"&gt;Meta: Retrieval Augmented Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://research.ibm.com/blog/retrieval-augmented-generation-RAG" rel="noopener noreferrer"&gt;IBM: What is retrieval-augmented generation?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://build.microsoft.com/en-US/sessions/038984b3-7c5d-4cc6-b24e-5d9f62bc2f0e?wt.mc_ID=Build2023_esc_corp_em_oo_mto_Marketo_FPnews_Elastic" rel="noopener noreferrer"&gt;Microsoft: Vector Search Isn’t Enough&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=dzChvuZI6D4" rel="noopener noreferrer"&gt;YouTube: Retrieval-Augmented Generation (RAG)&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Basic Architecture
&lt;/h2&gt;

&lt;p&gt;Our RAG's architecture with OpenAI is going to follow this diagram. Our demo application described in the next article of this series will only focus on using OpenAI's &lt;a href="https://platform.openai.com/docs/guides/gpt/function-calling" rel="noopener noreferrer"&gt;Function Calling&lt;/a&gt; feature since we are building a stand-alone chat application. If you were building a &lt;a href="https://openai.com/blog/chatgpt-plugins" rel="noopener noreferrer"&gt;ChatGPT Plugin&lt;/a&gt;, that is when you would use the &lt;a href="https://platform.openai.com/docs/plugins/getting-started/openapi-definition" rel="noopener noreferrer"&gt;OpenAPI Definition&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmetaskills%2Flambda-rag%2Fmain%2Fpublic%2Fai-knowledge-retrieval-light.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmetaskills%2Flambda-rag%2Fmain%2Fpublic%2Fai-knowledge-retrieval-light.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our demo will also showcase using &lt;a href="https://observablehq.com/@asg017/introducing-sqlite-vss" rel="noopener noreferrer"&gt;SQLite for Semantic Search&lt;/a&gt; as a fast and easy-to-use vector database that can store and query our &lt;a href="https://dev.to/dandv/understanding-vector-embeddings-18p0"&gt;Vector Embeddings&lt;/a&gt;. Vector embeddings are numerical representations of complex data like words or images, simplifying high-dimensional data into a lower-dimensional space for easier processing and analysis.&lt;/p&gt;
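
&lt;p&gt;Under the hood, semantic search compares these vectors, most commonly with cosine similarity. SQLite VSS handles this for you; the sketch below is only to illustrate the idea:&lt;/p&gt;

```javascript
// Cosine similarity between two equal-length embedding vectors.
// 1.0 means the vectors point the same direction (very similar meaning);
// values near 0 mean the texts are unrelated.
function cosineSimilarity(a, b) {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i !== a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```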

&lt;p&gt;Lastly, we are going to need some proprietary data to use with our RAG. I chose the &lt;a href="https://www.kaggle.com/datasets/chitwanmanchanda/luxury-apparel-data" rel="noopener noreferrer"&gt;Luxury Apparel Dataset&lt;/a&gt; from Kaggle. It contains ~5,000 products with good descriptions and categories to facet.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context Windows
&lt;/h2&gt;

&lt;p&gt;So how much data can we insert into a message when retrieving data? The answer depends on the size of the model's &lt;a href="https://www.hopsworks.ai/dictionary/context-window-for-llms" rel="noopener noreferrer"&gt;Context Window&lt;/a&gt;. This refers to the maximum number of tokens the model can consider as input for generating outputs. There is a growing trend and demand for LLMs with larger context windows. For instance, some previous generation models could only consider 2,000 token inputs, while some more advanced versions can handle up to 32,000 tokens.&lt;/p&gt;
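
&lt;p&gt;When inserting retrieved data, it helps to budget tokens before building the request. The sketch below uses the rough heuristic of about four characters per token for English text; a real implementation would use a proper tokenizer such as tiktoken:&lt;/p&gt;

```javascript
// Rough token estimate: ~4 characters per token for English text.
// This is a heuristic, not a real tokenizer.
function estimateTokens(text) {
  return Math.ceil(text.length / 4);
}

// Keep adding retrieved snippets until the token budget would be exceeded,
// leaving room in the context window for the prompt and the response.
function fitToContextWindow(snippets, budgetTokens) {
  const kept = [];
  let used = 0;
  for (const snippet of snippets) {
    const cost = estimateTokens(snippet);
    if (used + cost > budgetTokens) break;
    kept.push(snippet);
    used += cost;
  }
  return kept;
}
```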

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmetaskills%2Flambda-rag%2Fmain%2Fpublic%2Fcontext-window-siccos-light.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fmetaskills%2Flambda-rag%2Fmain%2Fpublic%2Fcontext-window-siccos-light.png" alt="Word Cloud: data, experience, business, analysis, etc. To the left is the Siccos meme of a man pressed up against a window with a word bubble: Yes... Ha ha ha... YES! On the man's shirt says Generative Pre-Trained Models."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So is the context window our Moore's Law for LLMs? Some think so. There is even thinking that we get diminishing returns with larger models. While newer Claude models are pushing 100K tokens, some think &lt;a href="https://www.pinecone.io/blog/why-use-retrieval-instead-of-larger-context/" rel="noopener noreferrer"&gt;Less is More&lt;/a&gt; and solid retrieval patterns are the key to good results. We should all keep an eye out as we rapidly explore this area of AI.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>serverless</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>💔 Goodbye Cold Starts ❤️Hello Proactive Initialization</title>
      <dc:creator>Ken Collins</dc:creator>
      <pubDate>Sun, 16 Jul 2023 16:29:38 +0000</pubDate>
      <link>https://forem.com/aws-heroes/goodbye-cold-starts-hello-proactive-initialization-28j7</link>
      <guid>https://forem.com/aws-heroes/goodbye-cold-starts-hello-proactive-initialization-28j7</guid>
      <description>&lt;p&gt;&lt;strong&gt;💁 Full Rails &amp;amp; Lambda Details at:&lt;br&gt; &lt;a href="https://lamby.cloud/docs/cold-starts" rel="noopener noreferrer"&gt;https://lamby.cloud/docs/cold-starts&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As described in &lt;a href="https://twitter.com/astuyve" rel="noopener noreferrer"&gt;AJ Stuyvenberg's&lt;/a&gt; post on the topic, &lt;a href="https://aaronstuyvenberg.com/posts/understanding-proactive-initialization" rel="noopener noreferrer"&gt;Understanding AWS Lambda Proactive Initialization&lt;/a&gt;, AWS Lambda may have solved some of your cold start issues for you since March 2023. As stated in an excerpt &lt;a href="https://aaronstuyvenberg.com/posts/understanding-proactive-initialization" rel="noopener noreferrer"&gt;from AWS' docs&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For functions using unreserved (on-demand) concurrency, Lambda occasionally pre-initializes execution environments to reduce the number of cold start invocations. For example, Lambda might initialize a new execution environment to replace an execution environment that is about to be shut down. If a pre-initialized execution environment becomes available while Lambda is initializing a new execution environment to process an invocation, Lambda can use the pre-initialized execution environment.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This means the &lt;a href="https://lamby.cloud/docs/cold-starts#monitoring-with-cloudwatch" rel="noopener noreferrer"&gt;Monitoring with CloudWatch&lt;/a&gt; guidance is just half the picture. But how much is your application potentially benefiting from proactive inits? Since &lt;a href="https://github.com/rails-lambda/lamby/pull/169" rel="noopener noreferrer"&gt;Lamby v5.1.0&lt;/a&gt;, you can now find out easily using CloudWatch Metrics. To turn metrics on, enable the config like so:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lamby&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cold_start_metrics&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kp"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lamby will now publish &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Embedded_Metric_Format.html" rel="noopener noreferrer"&gt;CloudWatch Embedded Metrics&lt;/a&gt; in the &lt;code&gt;Lamby&lt;/code&gt; namespace with a custom dimension for each application's name. Captured metrics include counts for Cold Starts vs. Proactive Initializations. Here is an example running sum of 3 days of data for a large Rails application in the &lt;code&gt;us-east-1&lt;/code&gt; region.&lt;/p&gt;
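
&lt;p&gt;For the curious, an embedded metric is just a structured JSON log line with a special &lt;code&gt;_aws&lt;/code&gt; envelope that CloudWatch extracts metrics from. The sketch below, written in JavaScript for illustration, shows the general EMF shape for a counter in the &lt;code&gt;Lamby&lt;/code&gt; namespace with an application-name dimension; the exact payload Lamby emits may differ:&lt;/p&gt;

```javascript
// Build a CloudWatch Embedded Metric Format (EMF) log line for one counter.
// The envelope follows the EMF spec; the field values here are examples.
function buildEmfPayload(appName, metricName, value) {
  return JSON.stringify({
    _aws: {
      Timestamp: Date.now(),
      CloudWatchMetrics: [
        {
          Namespace: "Lamby",
          Dimensions: [["AppName"]],
          Metrics: [{ Name: metricName, Unit: "Count" }],
        },
      ],
    },
    AppName: appName,
    [metricName]: value,
  });
}
```

Writing this JSON to stdout inside a Lambda function is enough for CloudWatch to record the metric; no SDK calls are required.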

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flamby.cloud%2Fimg%2Fdocs%2Flamby-cloud-watch-metrics-cold-start-v-proactive-init-light.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flamby.cloud%2Fimg%2Fdocs%2Flamby-cloud-watch-metrics-cold-start-v-proactive-init-light.png" alt="A CloudWatch Metrics graph showing a running sum of cold starts vs proactive inits for a large Rails application on Lambda."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This data shows that the vast majority of your initialized Lambda containers are proactively initialized. Hence, no cold starts are felt by end users or consumers of your function. If you need to customize the name of your Rails application in the CloudWatch Metrics dimension, you can do so using this config.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lamby&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;metrics_app_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'MyServiceName'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



</description>
      <category>tailscale</category>
      <category>lambda</category>
      <category>containers</category>
      <category>rails</category>
    </item>
    <item>
      <title>The Elusive Lambda Console; A Specification Proposal.</title>
      <dc:creator>Ken Collins</dc:creator>
      <pubDate>Sat, 17 Jun 2023 20:15:35 +0000</pubDate>
      <link>https://forem.com/aws-heroes/the-elusive-lambda-console-a-specification-proposal-35po</link>
      <guid>https://forem.com/aws-heroes/the-elusive-lambda-console-a-specification-proposal-35po</guid>
      <description>&lt;p&gt;After years of smashing Cloud &amp;amp; Rails together, I've come up with an idea. Better than an idea, a working specification! One where us &lt;a href="https://lamby.cloud" rel="noopener noreferrer"&gt;Rails &amp;amp; Lambda&lt;/a&gt; enthusiasts can once again "console into" our "servers" and execute CLI tasks like migrations or interact via our beloved IRB friend, the Rails console. Today, I would like to present, the &lt;a href="https://github.com/rails-lambda/lambda-console" rel="noopener noreferrer"&gt;Lambda Console&lt;/a&gt; project. An open specification proposal for any AWS Lambda runtime to adopt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flamby.cloud%2Fimg%2Fdocs%2Flambda-console-cli-dark.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flamby.cloud%2Fimg%2Fdocs%2Flambda-console-cli-dark.png" alt="Lambda Console CLI Screenshot"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Lambda Console
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; lambda-console-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Lambda Console is a CLI written in Node.js that will interactively create an AWS SDK session for you to invoke your Lambda functions in one of two modes. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;CLI Runner&lt;/li&gt;
&lt;li&gt;Interactive Commands&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Think of the CLI Runner as a bash prompt. You can run any process command or interact with the filesystem or environment. For Rails users, this means running rake tasks or DB migrations. These tasks assume the Lambda task root as the present working directory.&lt;/p&gt;

&lt;p&gt;Interactive commands however are evaluated in the context of your running application. For Ruby and Rails applications, this simulates IRB (Interactive Ruby Shell). For &lt;a href="https://lamby.cloud" rel="noopener noreferrer"&gt;Lamby&lt;/a&gt; users, this mode simulates the Rails console. Making it easy for users to query their DB or poke their models and code.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Proposal
&lt;/h2&gt;

&lt;p&gt;There is nothing about the &lt;a href="https://github.com/rails-lambda/lambda-console" rel="noopener noreferrer"&gt;Lambda Console&lt;/a&gt; that is coupled to Ruby or Rails. The idea is simple: as a Lambda community, could we do the following?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Finalize a Lambda Console request/response specification.&lt;/li&gt;
&lt;li&gt;Create more runtime-specific language implementations.&lt;/li&gt;
&lt;li&gt;Build an amazing CLI client for any runtime.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is what we have today: the request specification, a simple &lt;a href="https://github.com/rails-lambda/lambda-console#event-structure" rel="noopener noreferrer"&gt;event structure&lt;/a&gt; that is only a few dozen lines of JSON schema. The first example below runs a CLI command; the second evaluates code interactively.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"X_LAMBDA_CONSOLE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"run"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"cat /etc/os-release"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"X_LAMBDA_CONSOLE"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"interact"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User.find(1)"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any Lambda runtime code or framework could implement the handling of these events in their own language-specific packages. You can find the first reference implementations, written in Ruby, below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ruby: The &lt;a href="https://github.com/rails-lambda/lambda-console-ruby" rel="noopener noreferrer"&gt;lambda-console-ruby&lt;/a&gt; gem for any Ruby Lambda.&lt;/li&gt;
&lt;li&gt;Rails: Integrated into the &lt;a href="https://github.com/rails-lambda/lamby" rel="noopener noreferrer"&gt;Lamby&lt;/a&gt; v5.0.0 for Rails on Lambda.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Possibilities
&lt;/h2&gt;

&lt;p&gt;What I really want is an amazing CLI client. The current Lambda Console CLI was hacked together in a few days using some amazing Node.js tools that make building interactive CLIs so easy. But I've never done this before. If this type of tooling sounds interesting to you and you like Node.js, let me know! It would be amazing to see implementation packages for Node, PHP, Python, and the frameworks built on those languages. Here are some ideas on where I could see this going.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Live STDOUT &amp;amp; STDERR:&lt;/strong&gt; We could take advantage of Lambda's new &lt;a href="https://aws.amazon.com/blogs/compute/introducing-aws-lambda-response-streaming/" rel="noopener noreferrer"&gt;Response Streaming&lt;/a&gt; and send output buffers as they happen. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pseudo TTY:&lt;/strong&gt; Is there a way to better simulate a real TTY session? Could this even include ANSI colors?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Quality of Life Improvements:&lt;/strong&gt; Everything from allowing the CLI tool to switch modes without restarting it, to a command buffer for up-arrow history navigation, to a prettier UI. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Formal Response JSON Schema:&lt;/strong&gt; As the features grow, should the response JSON be standardized? For example, if the client wanted to syntax highlight interactive language commands, how would it know what language was being used? We could have a &lt;code&gt;X_LAMBDA_CONSOLE_LANG&lt;/code&gt; response header.&lt;/p&gt;

&lt;p&gt;What else would you like to see in a Lambda Console client?&lt;/p&gt;

</description>
      <category>lambda</category>
      <category>bash</category>
      <category>cli</category>
      <category>ruby</category>
    </item>
    <item>
      <title>Using Tailscale on Lambda for a Live Development Proxy</title>
      <dc:creator>Ken Collins</dc:creator>
      <pubDate>Sat, 03 Jun 2023 11:55:06 +0000</pubDate>
      <link>https://forem.com/aws-heroes/using-tailscale-on-lambda-for-a-live-development-proxy-3hkc</link>
      <guid>https://forem.com/aws-heroes/using-tailscale-on-lambda-for-a-live-development-proxy-3hkc</guid>
      <description>&lt;p&gt;⚠️ DISCLAIMER: In no way am I advocating for the use of live proxies as a normal way to develop against cloud resources. However in some edge cases, such as developing a new system, live dev proxies or the general use of Tailscale in Lambda could be useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  🐋 Tailscale on Lambda
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://tailscale.com"&gt;Tailscale&lt;/a&gt; makes networking easy. Like really easy. It shines in situations where private networks do not allow inbound connections. Tailscale can connect your devices and development environments for easy access to remote resources, or allow those remote systems to access your home or office network devices.&lt;/p&gt;

&lt;p&gt;A few years ago Corey Quinn wrote a Tailscale &lt;a href="https://www.lastweekinaws.com/blog/corey-writes-open-source-code-for-lambda-and-tailscale/"&gt;Lambda Extension&lt;/a&gt;. It is great and helped a lot of folks. Today, I'd like to share a new project based on Corey's work that makes it even easier to use Tailscale in a Lambda container. Check it out here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/rails-lambda/tailscale-extension"&gt;🔗 Tailscale Lambda Extension for Containers&lt;/a&gt;&lt;/strong&gt; on GitHub 🐙&lt;/p&gt;

&lt;p&gt;This new version tries to improve upon Corey's work. Initialization is now stable, there are more configuration options, and we have multi-platform Docker container packages for both &lt;code&gt;x86_64&lt;/code&gt; and &lt;code&gt;arm64&lt;/code&gt;, with both Amazon Linux 2 and Debian/Ubuntu variants. Installation is really easy: simply add one line to your Dockerfile. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight docker"&gt;&lt;code&gt;&lt;span class="k"&gt;FROM&lt;/span&gt;&lt;span class="s"&gt; public.ecr.aws/lambda/ruby:3.2&lt;/span&gt;
&lt;span class="k"&gt;RUN &lt;/span&gt;yum &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; curl
&lt;span class="k"&gt;COPY&lt;/span&gt;&lt;span class="s"&gt; --from=ghcr.io/rails-lambda/tailscale-extension-amzn:1 /opt /opt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once your container starts, talking to any device within your tailnet can be done using the local &lt;a href="https://en.wikipedia.org/wiki/SOCKS"&gt;SOCKS5&lt;/a&gt; proxy. In the example below, we are using Ruby's &lt;a href="https://github.com/astro/socksify-ruby"&gt;socksify&lt;/a&gt; gem.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="nb"&gt;require&lt;/span&gt; &lt;span class="s1"&gt;'socksify/http'&lt;/span&gt;
&lt;span class="no"&gt;Net&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;HTTP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;socks_proxy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'localhost'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1055&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;http&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt;
  &lt;span class="c1"&gt;# your http code here...&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🔌 ActionCable on Lambda
&lt;/h2&gt;

&lt;p&gt;How did I use Tailscale for the &lt;a href="https://lamby.cloud"&gt;Rails on Lambda&lt;/a&gt; work? A few months ago, I &lt;a href="https://twitter.com/metaskills/status/1647714842550185985"&gt;started work&lt;/a&gt; on the last critical part of the Rails ecosystem which did not work on Lambda... &lt;a href="https://guides.rubyonrails.org/action_cable_overview.html"&gt;ActionCable&lt;/a&gt; &amp;amp; WebSockets. Specifically, I wanted &lt;a href="https://hotwired.dev"&gt;Hotwire&lt;/a&gt; to work.&lt;/p&gt;

&lt;p&gt;So far, everything is &lt;a href="https://twitter.com/metaskills/status/1651067256242151424"&gt;working great&lt;/a&gt; with our new LambdaCable gem. Eventually it will be a drop-in adapter for ActionCable and join the ranks of other popular alternatives like &lt;a href="https://anycable.io"&gt;AnyCable&lt;/a&gt;. To bring the project to completion faster, I needed feedback loops that were much faster than deploying code to the cloud. I needed a development proxy! One where my Rails application would receive events from both Lambda's Function URLs and the WebSocket events from API Gateway. Illustrated below with a demo video.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7U22VKmy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lamby.cloud/img/blog/tailscale/live-development-proxy-overview.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7U22VKmy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lamby.cloud/img/blog/tailscale/live-development-proxy-overview.png" alt="Architecture diagram of the use of a Lambda development proxy for WebSockets with API Gateway." width="800" height="459"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/k02k38o4ih8"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;If you are curious to learn more about how Rails &amp;amp; Lambda work together, check out our &lt;a href="https://lamby.cloud"&gt;Lamby&lt;/a&gt; project. The architecture of Lambda Containers works so well with Rails since our framework distills everything from HTTP, Jobs, Events, &amp;amp; WebSocket connections down to Docker's &lt;code&gt;CMD&lt;/code&gt; interface. The architecture above at the proxy layer was easy to build and connect up to our single delegate function, &lt;code&gt;Lamby.cmd&lt;/code&gt;. Shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4xkSh-j4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lamby.cloud/img/blog/tailscale/live-development-proxy-detail.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4xkSh-j4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://lamby.cloud/img/blog/tailscale/live-development-proxy-detail.png" alt="Architecture diagram of the use of a Lambda development proxy for WebSockets with API Gateway." width="800" height="144"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For our Rails application on Lambda, here are the changes we made to leverage this, all outlined in our &lt;a href="https://github.com/rails-lambda/websocket-demo/pull/4"&gt;WebSockets Demo Pull Request&lt;/a&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Created a &lt;code&gt;.localdev&lt;/code&gt; folder with a copy of our SAM &lt;code&gt;template.yaml&lt;/code&gt; for all AWS resources.&lt;/li&gt;
&lt;li&gt;Made a simple &lt;code&gt;.localdev/Dockerfile&lt;/code&gt; that included the Tailscale Extension along with basic proxy code.&lt;/li&gt;
&lt;li&gt;Leveraged Lamby's &lt;a href="https://github.com/rails-lambda/lamby/pull/164"&gt;Local Development Proxy Server&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Ensured our Devcontainers exposed port 3000 to all local network devices so Tailscale could detect the service.&lt;/li&gt;
&lt;/ul&gt;
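&lt;p&gt;To make the Dockerfile bullet above concrete, here is a rough sketch of what a &lt;code&gt;.localdev/Dockerfile&lt;/code&gt; could look like. The base image tag and extension image path are illustrative placeholders, not the demo project's actual values; the pull request linked above has the real files.&lt;/p&gt;

```dockerfile
# Illustrative sketch only: image names below are placeholders.
FROM ruby:3.2
# Copy a prebuilt Tailscale Lambda Extension into the image.
COPY --from=ghcr.io/rails-lambda/tailscale-extension:latest /opt /opt
# Basic proxy code and an entrypoint for local development would follow here.
```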

&lt;p&gt;I hope you find reasons to learn more about Tailscale and how using a SOCKS5 proxy from Lambda could help your development or production needs. Even more, I hope you like our new Lambda Extension project, which makes Tailscale easy for containerized applications to use. Drop us a comment if you do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/rails-lambda/tailscale-extension"&gt;🔗 Tailscale Lambda Extension for Containers&lt;/a&gt;&lt;/strong&gt; on GitHub 🐙&lt;/p&gt;

</description>
      <category>tailscale</category>
      <category>lambda</category>
      <category>container</category>
      <category>rails</category>
    </item>
    <item>
      <title>Trigger CircleCI Workflow. AKA Simple Deploy Button</title>
      <dc:creator>Ken Collins</dc:creator>
      <pubDate>Fri, 03 Mar 2023 16:52:17 +0000</pubDate>
      <link>https://forem.com/aws-heroes/trigger-circleci-workflow-aka-simple-deploy-button-3ocl</link>
      <guid>https://forem.com/aws-heroes/trigger-circleci-workflow-aka-simple-deploy-button-3ocl</guid>
<description>&lt;p&gt;Very simple, no parameters needed, no enums, no booleans... just a really easy way to trigger a deploy with CircleCI. We can do this by making use of the &lt;a href="https://circleci.com/docs/variables/#pipeline-values" rel="noopener noreferrer"&gt;trigger_source&lt;/a&gt; pipeline value. When you click CircleCI's "Trigger Pipeline" button, the value is &lt;code&gt;api&lt;/code&gt; vs. something like &lt;code&gt;webhook&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2.1&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-2204:current&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo 'Deploying...'&lt;/span&gt;
&lt;span class="na"&gt;workflows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;equal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;&amp;lt;&amp;lt; pipeline.trigger_source &amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
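&lt;p&gt;For what it's worth, the "Trigger Pipeline" button is not the only way to get a &lt;code&gt;trigger_source&lt;/code&gt; of &lt;code&gt;api&lt;/code&gt;; my understanding from CircleCI's docs is that a request to the v2 pipeline API yields the same value. A sketch of that request, with the project slug as a placeholder:&lt;/p&gt;

```
POST https://circleci.com/api/v2/project/gh/ORG/REPO/pipeline
Circle-Token: (your personal API token, sent as a request header)
```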



&lt;p&gt;If your workflow needs a test job, consider doing something a bit more complicated. Here we use two &lt;code&gt;when&lt;/code&gt; conditions to work with a parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2.1&lt;/span&gt;
&lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;workflow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;enum&lt;/span&gt;
    &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;The workflow to trigger.&lt;/span&gt;
    &lt;span class="na"&gt;enum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test-job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-2204:current&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo 'Testing...'&lt;/span&gt;  
  &lt;span class="na"&gt;deploy-job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-2204:current&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo 'Deploying...'&lt;/span&gt;
&lt;span class="na"&gt;workflows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;equal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;&amp;lt;&amp;lt; pipeline.parameters.workflow &amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;test-job&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;equal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;&amp;lt;&amp;lt; pipeline.parameters.workflow &amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deploy-job&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
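&lt;p&gt;The same v2 pipeline API can set the &lt;code&gt;workflow&lt;/code&gt; parameter directly, so a deploy can be kicked off from a script as well as from the UI. A sketch of the request, with the project slug as a placeholder:&lt;/p&gt;

```
POST https://circleci.com/api/v2/project/gh/ORG/REPO/pipeline
Circle-Token: (your personal API token, sent as a request header)

{ "parameters": { "workflow": "deploy" } }
```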



&lt;p&gt;Now your CircleCI config will run tests by default and you can easily trigger a deploy via any branch using the "Trigger Pipeline" button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5b23gl0yk10sqpv3hur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5b23gl0yk10sqpv3hur.png" alt="Screen capture of the CircleCI application. This shows the trigger pipeline UI which has the Add Parameter disclosure open. The options Parameter type, Name, and Value have been set to string, workflow, deploy." width="661" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>frontend</category>
      <category>career</category>
      <category>productivity</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Ruby on Rails on Lambda on Arm64/Graviton2!</title>
      <dc:creator>Ken Collins</dc:creator>
      <pubDate>Sat, 18 Feb 2023 15:40:37 +0000</pubDate>
      <link>https://forem.com/aws-heroes/ruby-on-rails-on-lambda-on-arm64graviton2-154e</link>
      <guid>https://forem.com/aws-heroes/ruby-on-rails-on-lambda-on-arm64graviton2-154e</guid>
      <description>&lt;p&gt;Today I am happy to announce that &lt;a href="https://lamby.custominktech.com" rel="noopener noreferrer"&gt;Lamby&lt;/a&gt; (Simple Rails &amp;amp; AWS Lambda Integration using Rack) now demonstrates just how easy it is to use multi-platform &lt;code&gt;arm64&lt;/code&gt; images on AWS Lambda. If this sounds interesting to you, jump right into our &lt;a href="https://lamby.custominktech.com/docs/quick-start" rel="noopener noreferrer"&gt;Quick Start&lt;/a&gt; guide and deploy a new Rails 7 on Ruby 3.2 Ubuntu image to see it for yourself.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5k83x9ltwfxw3kbm7757.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5k83x9ltwfxw3kbm7757.jpeg" alt="Yay! You're on Rails! welcome page which has been extended by Lamby to show the Rails version, Ruby version, and machine architecture. There is some markup drawing on this page which uses an mechanical arm emoji and a big arrow pointing to the text at the bottom that says aarch64-linux. Cool!"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works
&lt;/h2&gt;

&lt;p&gt;First, AWS has made this &lt;a href="https://aws.amazon.com/blogs/aws/aws-lambda-functions-powered-by-aws-graviton2-processor-run-your-functions-on-arm-and-get-up-to-34-better-price-performance/" rel="noopener noreferrer"&gt;incredibly easy&lt;/a&gt; since the September 2021 release of Graviton2 support. AWS SAM can simply switch the deployment architecture in your serverless project's &lt;code&gt;template.yml&lt;/code&gt; file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight diff"&gt;&lt;code&gt; Properties:
   Architectures:
&lt;span class="gd"&gt;-    - arm64
&lt;/span&gt;&lt;span class="gi"&gt;+    - x86_64
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Second, make sure your base Docker image supports the &lt;code&gt;arm64&lt;/code&gt; architecture. Most popular images use &lt;a href="https://www.docker.com/blog/faster-multi-platform-builds-dockerfile-cross-compilation-guide/" rel="noopener noreferrer"&gt;multi-platform builds&lt;/a&gt; already. For example, here is the official Ruby image we use in Lamby's demo project.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;docker manifest inspect ruby:3.2 | &lt;span class="nb"&gt;grep arch&lt;/span&gt;
            &lt;span class="s2"&gt;"architecture"&lt;/span&gt;: &lt;span class="s2"&gt;"amd64"&lt;/span&gt;,
            &lt;span class="s2"&gt;"architecture"&lt;/span&gt;: &lt;span class="s2"&gt;"arm64"&lt;/span&gt;,
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Lastly, make sure your deployment machine matches the production target architecture. This is needed to ensure native dependencies (like the MySQL client) are built to match the Docker image architecture eventually run in production. If you are on an M1/M2 Mac, you can deploy from your own machine. &lt;/p&gt;

&lt;p&gt;However, for real production CI/CD, you are better off using something like CircleCI's &lt;a href="https://circleci.com/docs/using-arm/" rel="noopener noreferrer"&gt;Arm Execution Environment&lt;/a&gt;. Currently GitHub Actions lacks native support for Arm64 runners, but that &lt;a href="https://github.com/actions/runner-images/issues/5631" rel="noopener noreferrer"&gt;issue is being tracked&lt;/a&gt; and I suspect support is soon to come.&lt;/p&gt;

&lt;p&gt;In the meantime, Lamby's demo project includes a working &lt;a href="https://github.com/customink/lamby-cookiecutter/blob/master/%7B%7Bcookiecutter.project_name%7D%7D/.circleci/config.yml" rel="noopener noreferrer"&gt;CircleCI CI/CD&lt;/a&gt; example for you that leverages the &lt;code&gt;arm.large&lt;/code&gt; machine type.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;default-machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="nl"&gt;&amp;amp;default-machine&lt;/span&gt;
  &lt;span class="na"&gt;machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-2204:current&lt;/span&gt;
    &lt;span class="na"&gt;docker_layer_caching&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
  &lt;span class="na"&gt;resource_class&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;arm.large&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Thanks! Please take the time to learn more about Rails on Lambda using Lamby, along with how to use arm64 and Graviton2 in your Lambda applications, on our site:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://lamby.custominktech.com/docs/anatomy" rel="noopener noreferrer"&gt;How Lamby Works&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://lamby.custominktech.com/docs/cpu" rel="noopener noreferrer"&gt;Lamby CPU Architecture&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
    <item>
      <title>Trigger CircleCI Workflow. AKA Simple Deploy Button</title>
      <dc:creator>Ken Collins</dc:creator>
      <pubDate>Sun, 05 Feb 2023 14:29:28 +0000</pubDate>
      <link>https://forem.com/customink/trigger-circleci-workflow-aka-simple-deploy-button-hf0</link>
      <guid>https://forem.com/customink/trigger-circleci-workflow-aka-simple-deploy-button-hf0</guid>
<description>&lt;p&gt;Very simple, no parameters needed, no enums, no booleans... just a really easy way to trigger a deploy with CircleCI. We can do this by making use of the &lt;a href="https://circleci.com/docs/variables/#pipeline-values" rel="noopener noreferrer"&gt;trigger_source&lt;/a&gt; pipeline value. When you click CircleCI's "Trigger Pipeline" button, the value is &lt;code&gt;api&lt;/code&gt; vs. something like &lt;code&gt;webhook&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2.1&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-2204:current&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo 'Deploying...'&lt;/span&gt;
&lt;span class="na"&gt;workflows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;equal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;api&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;&amp;lt;&amp;lt; pipeline.trigger_source &amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your workflow needs a test job, consider doing something a bit more complicated. Here we use two &lt;code&gt;when&lt;/code&gt; conditions to work with a parameter.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2.1&lt;/span&gt;
&lt;span class="na"&gt;parameters&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;workflow&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;enum&lt;/span&gt;
    &lt;span class="na"&gt;default&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;test&lt;/span&gt;
    &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;The workflow to trigger.&lt;/span&gt;
    &lt;span class="na"&gt;enum&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test-job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-2204:current&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo 'Testing...'&lt;/span&gt;  
  &lt;span class="na"&gt;deploy-job&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;machine&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ubuntu-2204:current&lt;/span&gt;
    &lt;span class="na"&gt;steps&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;run&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;echo 'Deploying...'&lt;/span&gt;
&lt;span class="na"&gt;workflows&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;equal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;test&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;&amp;lt;&amp;lt; pipeline.parameters.workflow &amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;test-job&lt;/span&gt;
  &lt;span class="na"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;when&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;equal&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt; &lt;span class="nv"&gt;deploy&lt;/span&gt;&lt;span class="pi"&gt;,&lt;/span&gt; &lt;span class="nv"&gt;&amp;lt;&amp;lt; pipeline.parameters.workflow &amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="pi"&gt;]&lt;/span&gt; &lt;span class="pi"&gt;}&lt;/span&gt;
    &lt;span class="na"&gt;jobs&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;deploy-job&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now your CircleCI config will run tests by default and you can easily trigger a deploy via any branch using the "Trigger Pipeline" button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5b23gl0yk10sqpv3hur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr5b23gl0yk10sqpv3hur.png" alt="Screen capture of the CircleCI application. This shows the trigger pipeline UI which has the Add Parameter disclosure open. The options Parameter type, Name, and Value have been set to string, workflow, deploy." width="661" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>announcement</category>
      <category>devto</category>
      <category>offers</category>
    </item>
    <item>
      <title>New Amazon Linux Dev Container Features</title>
      <dc:creator>Ken Collins</dc:creator>
      <pubDate>Mon, 31 Oct 2022 01:48:24 +0000</pubDate>
      <link>https://forem.com/customink/new-amazon-linux-dev-container-features-3j0c</link>
      <guid>https://forem.com/customink/new-amazon-linux-dev-container-features-3j0c</guid>
      <description>&lt;p&gt;🆕 &lt;strong&gt;Want to use &lt;a href="https://github.com/features/codespaces"&gt;Codespaces&lt;/a&gt; with Amazon Linux 2?&lt;/strong&gt; Check out &lt;a href="https://github.com/customink/codespaces-features"&gt;customink/codespaces-features&lt;/a&gt; for two custom features. 1) &lt;a href="https://github.com/customink/codespaces-features/tree/main/src/common-amzn"&gt;common-amzn&lt;/a&gt; 2) &lt;a href="https://github.com/customink/codespaces-features/tree/main/src/docker-in-docker-amzn"&gt;docker-in-docker-amzn&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So, last year I shared how we could &lt;a href="https://dev.to/aws-heroes/getting-started-with-github-codespaces-from-a-serverless-perspective-171k"&gt;integrate Codespaces&lt;/a&gt; into our AWS Lambda &lt;a href="https://dev.to/aws-heroes/lambda-containers-with-rails-a-perfect-match-4lgb"&gt;docker compose patterns&lt;/a&gt;. Since then Microsoft's Development Containers specification has come a LONG way. Everything is wrapped up nice and neatly at the &lt;a href="https://containers.dev"&gt;containers.dev&lt;/a&gt; site. Take a look if you have not already seen it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dev Containers?
&lt;/h2&gt;

&lt;p&gt;So why are Development Containers &amp;amp; Codespaces such a big deal? I can illustrate some Lambda &amp;amp; Kubernetes use cases below, but first I would like to spell out a few features that may be new to some, including existing Codespaces users.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Dev Container &lt;a href="https://containers.dev"&gt;specification&lt;/a&gt; at the lowest level of Codespaces is open to everyone, and a growing community has built lots of tooling around it.&lt;/li&gt;
&lt;li&gt;The specification has a reference implementation via a node &lt;a href="https://github.com/devcontainers/cli"&gt;Command Line Interface&lt;/a&gt;. Think of this &lt;code&gt;devcontainer&lt;/code&gt; CLI as a higher-order docker compose. You can use it to run Codespaces projects locally!&lt;/li&gt;
&lt;li&gt;Atop of the CLI, there is CI tooling for &lt;a href="https://github.com/devcontainers/ci"&gt;GitHub Actions&lt;/a&gt;. This means you can use the same development container as your CI/CD environment.&lt;/li&gt;
&lt;/ol&gt;
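&lt;p&gt;Pulling these pieces together, a minimal &lt;code&gt;devcontainer.json&lt;/code&gt; might look like the sketch below. The feature identifiers here are illustrative; check the customink/codespaces-features repository for the actual published IDs.&lt;/p&gt;

```json
{
  "name": "amazonlinux-dev",
  "image": "public.ecr.aws/amazonlinux/amazonlinux:2",
  "features": {
    "ghcr.io/customink/codespaces-features/common-amzn:1": {},
    "ghcr.io/customink/codespaces-features/docker-in-docker-amzn:1": {}
  }
}
```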

&lt;h2&gt;
  
  
  Containers Usage Areas
&lt;/h2&gt;

&lt;p&gt;So where are containers used in your organization or projects? Here are some phases that most of us can identify with, where projects move from left to right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FWFdqDhk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mfaiqd32kl7juqatzkgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FWFdqDhk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mfaiqd32kl7juqatzkgt.png" alt="Container Areas" width="880" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Development:&lt;/strong&gt; Most of us have tried to use docker or compose at some point. For example, the most common use of this area would be running a database like MySQL. Docker makes this easy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI/CD:&lt;/strong&gt; Typically where we run tests and hopefully build and/or deploy our code to production. If you have used CircleCI before, again a database container here might feel familiar. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime:&lt;/strong&gt; Which is often our final container environment. We can think of this as production for most of us but it could be any container orchestration like Kubernetes, Lambda, or Fargate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Old AWS SAM Patterns with Docker Compose
&lt;/h2&gt;

&lt;p&gt;Today our Lambda SAM cookiecutters leverage SAM's build image via docker compose to ensure local development happens in the same environment as our CI/CD tooling. We ended up with something like the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cB-vpIR7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/brm2tkldsfln5adf75oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cB-vpIR7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/brm2tkldsfln5adf75oy.png" alt="AWS Lambda Before Dev Containers" width="880" height="469"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the bottom we can see the host platform typically associated with each of these stages. Because we use Docker, we can be cross-platform and consistent. &lt;strong&gt;The problem?&lt;/strong&gt; ⚠️ Making up your own docker/compose patterns is a huge drag, from SSH patterns to Docker-in-Docker gotchas. &lt;/p&gt;

&lt;h2&gt;
  
  
  New AWS SAM Patterns with Dev Containers
&lt;/h2&gt;

&lt;p&gt;In the coming weeks the &lt;a href="https://github.com/customink/lamby-cookiecutter/tree/master/%7B%7Bcookiecutter.project_name%7D%7D"&gt;Lamby Cookiecutter&lt;/a&gt; will be updated to use Development Containers so folks with (or without) Codespaces can easily work with the project. The result will be something like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eUTmwaFF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9jxmjokl4nm48mehgzpm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eUTmwaFF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9jxmjokl4nm48mehgzpm.png" alt="AWS Lambda After Dev Containers" width="880" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Development Containers, so much docker compose boilerplate can be removed, thanks in large part to our newly released &lt;a href="https://github.com/customink/codespaces-features"&gt;common &amp;amp; docker-in-docker Amazon Linux 2 features&lt;/a&gt; work. If you want to see an example of how this helps everyone, including running Codespaces locally with VS Code, check out our &lt;a href="https://github.com/customink/crypteia#development"&gt;Crypteia Project's Development&lt;/a&gt; section. You can even use all this without VS Code &amp;amp; GitHub Codespaces. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;devcontainer build &lt;span class="nt"&gt;--workspace-folder&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
devcontainer up &lt;span class="nt"&gt;--workspace-folder&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
devcontainer run-user-commands &lt;span class="nt"&gt;--workspace-folder&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
devcontainer &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;--workspace-folder&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; ./bin/setup
devcontainer &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;--workspace-folder&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; ./bin/test-local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Unexplored Development Container Space
&lt;/h2&gt;

&lt;p&gt;So can Development Containers be used in your projects without the Lambda patterns above? Yes! Consider the following diagram that has a Platform Engineering team building base images. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Sx2gpWxy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fklx3ef7j0ypbnf3fff7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Sx2gpWxy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fklx3ef7j0ypbnf3fff7.png" alt="Where could Development Container fit into your Kubernetes/K8s Projects" width="880" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These teams typically approach containers from right to left, where base OS images are made into language-specific images with variants for CI/CD, just like SAM has build and runtime images. Technically, for them "Runtime" is some container registry like Amazon ECR, but you get the idea.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://www.customink.com"&gt;Custom Ink&lt;/a&gt; we are using our CircleCI images for our Kubernetes projects with Codespaces. The Microsoft team makes this easy since all of their features work with Ubuntu out of the box.&lt;/p&gt;

&lt;p&gt;If your development stages look something like the image above, please consider adopting Development Containers based on your CI/CD images and explore that big purple space by connecting your container value chain in a beautiful little circle. Thanks for reading!&lt;/p&gt;

</description>
    </item>
    <item>
      <title>New Amazon Linux Dev Container Features</title>
      <dc:creator>Ken Collins</dc:creator>
      <pubDate>Mon, 31 Oct 2022 01:45:43 +0000</pubDate>
      <link>https://forem.com/aws-heroes/new-amazon-linux-dev-container-features-5db9</link>
      <guid>https://forem.com/aws-heroes/new-amazon-linux-dev-container-features-5db9</guid>
      <description>&lt;p&gt;🆕 &lt;strong&gt;Want to use &lt;a href="https://github.com/features/codespaces" rel="noopener noreferrer"&gt;Codespaces&lt;/a&gt; with Amazon Linux 2?&lt;/strong&gt; Check out &lt;a href="https://github.com/customink/codespaces-features" rel="noopener noreferrer"&gt;customink/codespaces-features&lt;/a&gt; for two custom features. 1) &lt;a href="https://github.com/customink/codespaces-features/tree/main/src/common-amzn" rel="noopener noreferrer"&gt;common-amzn&lt;/a&gt; 2) &lt;a href="https://github.com/customink/codespaces-features/tree/main/src/docker-in-docker-amzn" rel="noopener noreferrer"&gt;docker-in-docker-amzn&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So, last year I shared how we could &lt;a href="https://dev.to/aws-heroes/getting-started-with-github-codespaces-from-a-serverless-perspective-171k"&gt;integrate Codespaces&lt;/a&gt; into our AWS Lambda &lt;a href="https://dev.to/aws-heroes/lambda-containers-with-rails-a-perfect-match-4lgb"&gt;docker compose patterns&lt;/a&gt;. Since then Microsoft's Development Containers specification has come a LONG way. Everything is wrapped up nice and neatly at the &lt;a href="https://containers.dev" rel="noopener noreferrer"&gt;containers.dev&lt;/a&gt; site. Take a look if you have not already seen it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Dev Containers?
&lt;/h2&gt;

&lt;p&gt;So why are Development Containers &amp;amp; Codespaces such a big deal? I can illustrate some Lambda &amp;amp; Kubernetes use cases below, but first I would like to spell out a few features that may be new to some, including existing Codespaces users.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The Dev Container &lt;a href="https://containers.dev" rel="noopener noreferrer"&gt;specification&lt;/a&gt; at the lowest level of Codespaces is open to everyone and lots of tooling exists around it by a growing community.&lt;/li&gt;
&lt;li&gt;The specification has a reference implementation via a node &lt;a href="https://github.com/devcontainers/cli" rel="noopener noreferrer"&gt;Command Line Interface&lt;/a&gt;. Think of this &lt;code&gt;devcontainer&lt;/code&gt; CLI as a higher-order docker compose. You can use it to run Codespaces projects locally!&lt;/li&gt;
&lt;li&gt;Atop of the CLI, there is CI tooling for &lt;a href="https://github.com/devcontainers/ci" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt;. This means you can use the same development container as your CI/CD environment.&lt;/li&gt;
&lt;/ol&gt;
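
As a sketch of how these pieces fit together, here is a hypothetical `devcontainer.json` that pulls in Amazon Linux 2 features like those mentioned above. The image tag and the feature identifiers/versions below are assumptions for illustration; the published names live in the customink/codespaces-features repository.

```jsonc
// .devcontainer/devcontainer.json — hypothetical sketch.
// The image tag and feature ids are assumptions; confirm the published
// names in the customink/codespaces-features README before using them.
{
  "name": "my-lambda-project",
  "image": "public.ecr.aws/sam/build-ruby2.7",
  "features": {
    "ghcr.io/customink/codespaces-features/common-amzn:1": {},
    "ghcr.io/customink/codespaces-features/docker-in-docker-amzn:1": {}
  },
  // Runs once after the container is created.
  "postCreateCommand": "./bin/setup"
}
```

Because this file is part of the open specification, the same definition drives VS Code, GitHub Codespaces, the `devcontainer` CLI, and CI tooling alike.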

&lt;h2&gt;
  
  
  Containers Usage Areas
&lt;/h2&gt;

&lt;p&gt;So where are containers used in your organization or projects? Here are some phases that most of us can identify with, where projects move from left to right.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfaiqd32kl7juqatzkgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmfaiqd32kl7juqatzkgt.png" alt="Container Areas"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Development:&lt;/strong&gt; Most of us have tried to use docker or compose at some point. For example, the most common use of this area would be running a database like MySQL. Docker makes this easy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CI/CD:&lt;/strong&gt; Typically where we run tests and hopefully build and/or deploy our code to production. If you have used CircleCI before, again a database container here might feel familiar. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime:&lt;/strong&gt; Which is often our final container environment. We can think of this as production for most of us but it could be any container orchestration like Kubernetes, Lambda, or Fargate.&lt;/p&gt;

&lt;h2&gt;
  
  
  Old AWS SAM Patterns with Docker Compose
&lt;/h2&gt;

&lt;p&gt;Today our Lambda SAM cookiecutters leverage SAM's build image via docker compose to ensure local development happens within the same environment as our CI/CD tooling. We ended up with something like this image:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrm2tkldsfln5adf75oy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrm2tkldsfln5adf75oy.png" alt="AWS Lambda Before Dev Containers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the bottom we can see the host platform typically associated with each of these stages. Because we use Docker, we can be cross-platform and consistent. &lt;strong&gt;The problem?&lt;/strong&gt; ⚠️ Making up your own docker/compose patterns is a huge drag, from SSH patterns to Docker-in-Docker gotchas. &lt;/p&gt;
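
For concreteness, the hand-rolled boilerplate being replaced looked something like this minimal compose sketch; the service layout, image tag, and script names are illustrative assumptions, not the exact cookiecutter output.

```yaml
# docker-compose.yml — illustrative sketch of the old pattern.
# Image tag, mounts, and commands are assumptions for illustration only.
version: "3.7"
services:
  app:
    # Reuse SAM's build image so local dev matches CI/CD.
    image: public.ecr.aws/sam/build-ruby2.7
    volumes:
      - .:/app
    working_dir: /app
    command: ./bin/test
```

Every project ended up re-inventing variations of this file, which is exactly the drag Dev Containers remove.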

&lt;h2&gt;
  
  
  New AWS SAM Patterns with Dev Containers
&lt;/h2&gt;

&lt;p&gt;In the coming weeks the &lt;a href="https://github.com/customink/lamby-cookiecutter/tree/master/%7B%7Bcookiecutter.project_name%7D%7D" rel="noopener noreferrer"&gt;Lamby Cookiecutter&lt;/a&gt; will be updated to use Development Containers so folks with (or without) Codespaces can easily work with the project. The result will be something like this.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jxmjokl4nm48mehgzpm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9jxmjokl4nm48mehgzpm.png" alt="AWS Lambda After Dev Containers"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Development Containers, much of the docker compose boilerplate can be removed, thanks in huge part to our newly released &lt;a href="https://github.com/customink/codespaces-features" rel="noopener noreferrer"&gt;common &amp;amp; docker-in-docker Amazon Linux 2 features&lt;/a&gt;. If you want to see an example of how this helps everyone, including running Codespaces locally with VS Code, check out our &lt;a href="https://github.com/customink/crypteia#development" rel="noopener noreferrer"&gt;Crypteia Project's Development&lt;/a&gt; section. You can even use all of this without VS Code &amp;amp; GitHub Codespaces. For example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;devcontainer build &lt;span class="nt"&gt;--workspace-folder&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
devcontainer up &lt;span class="nt"&gt;--workspace-folder&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
devcontainer run-user-commands &lt;span class="nt"&gt;--workspace-folder&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;
devcontainer &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;--workspace-folder&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; ./bin/setup
devcontainer &lt;span class="nb"&gt;exec&lt;/span&gt; &lt;span class="nt"&gt;--workspace-folder&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt; ./bin/test-local
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
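
The same CLI underpins the GitHub Actions tooling mentioned earlier, so CI can run inside the identical container. A hypothetical workflow sketch, assuming the devcontainers/ci action; the script names are borrowed from the commands above:

```yaml
# .github/workflows/ci.yml — hypothetical sketch. Runs the project's
# tests inside the same Dev Container used for local development.
name: CI
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # Builds the Dev Container and executes runCmd inside it.
      - uses: devcontainers/ci@v0.3
        with:
          runCmd: |
            ./bin/setup
            ./bin/test-local
```

One container definition now serves development and CI alike, which is the whole point of the spec.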



&lt;h2&gt;
  
  
  Unexplored Development Container Space
&lt;/h2&gt;

&lt;p&gt;So can Development Containers be used in your projects without the Lambda patterns above? Yes! Consider the following diagram that has a Platform Engineering team building base images. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffklx3ef7j0ypbnf3fff7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffklx3ef7j0ypbnf3fff7.png" alt="Where could Development Container fit into your Kubernetes/K8s Projects"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These teams typically approach containers from right to left, where base OS images are made into language-specific images with variants for CI/CD, just like SAM has build and runtime images. Technically, for them "Runtime" is some container registry like Amazon ECR, but you get the idea.&lt;/p&gt;

&lt;p&gt;At &lt;a href="https://www.customink.com" rel="noopener noreferrer"&gt;Custom Ink&lt;/a&gt; we are using our CircleCI images for our Kubernetes projects with Codespaces. The Microsoft team makes this easy since all of their features work with Ubuntu out of the box.&lt;/p&gt;

&lt;p&gt;If your development stages look something like the image above, please consider adopting Development Containers based on your CI/CD images and explore that big purple space by connecting your container value chain in a beautiful little circle. Thanks for reading!&lt;/p&gt;

</description>
      <category>docker</category>
      <category>serverless</category>
      <category>lambda</category>
    </item>
  </channel>
</rss>
