<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Savannah Norem</title>
    <description>The latest articles on Forem by Savannah Norem (@savannah_norem).</description>
    <link>https://forem.com/savannah_norem</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1923685%2F16e47f68-9f48-4cf9-b739-efe059aa2037.JPG</url>
      <title>Forem: Savannah Norem</title>
      <link>https://forem.com/savannah_norem</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/savannah_norem"/>
    <language>en</language>
    <item>
      <title>Let’s build an AI app - pt. 2</title>
      <dc:creator>Savannah Norem</dc:creator>
      <pubDate>Tue, 17 Sep 2024 05:54:48 +0000</pubDate>
      <link>https://forem.com/savannah_norem/lets-build-an-ai-app-pt-2-2e77</link>
      <guid>https://forem.com/savannah_norem/lets-build-an-ai-app-pt-2-2e77</guid>
      <description>&lt;p&gt;As AI continues to make headlines, with Mistral releasing their first multi-modal model and OpenAI releasing a “reasoning” model, becoming an AI developer continues to be at the forefront of a lot of people's minds. And for most developers, knowing how to go from nothing to a functional website is the goal, so that’s what I’m here to attempt to teach you. If you’re already developing AI apps, I would love to hear about the tools you’re using, what you’re trying to do, and especially anything you’re struggling with.&lt;/p&gt;

&lt;p&gt;If you want the quick and dirty of it - &lt;a href="https://github.com/sav-norem/vl_demo/blob/main/vl_demo_anime.py" rel="noopener noreferrer"&gt;here&lt;/a&gt; you’ll find a demo and a RedisVL data loader. RedisVL is the Redis Vector Library for Python, which is currently the go-to language for AI development. In the last week I went from searching images of strawberries and returning the names of the closest images to searching anime posters and returning both the actual closest matched image and the name of the anime it came from. Below you'll see that a search for "swords" brings up the poster for Kimetsu no Yaiba: Katanakaji no Sato-hen, which prominently features...drumroll please...everyone holding a sword!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9z2xi0h28w0e5yxyluai.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9z2xi0h28w0e5yxyluai.png" alt="A search for the word "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How'd we get here?
&lt;/h3&gt;

&lt;p&gt;So &lt;a href="https://dev.to/savannah_norem/lets-build-an-ai-app-542h"&gt;last week&lt;/a&gt; I started looking for a dataset. If you’re getting into data science or machine learning, you’ve probably already heard of &lt;a href="https://www.kaggle.com/" rel="noopener noreferrer"&gt;Kaggle&lt;/a&gt;, and that’s where I started. I was thinking about some anime recommendations I’ve gotten recently, so I started searching for anime datasets. I'm starting with &lt;a href="https://www.kaggle.com/code/yasminebenj/anime-reviews" rel="noopener noreferrer"&gt;this one&lt;/a&gt; but will have to do a bit of image gathering, since what's in the CSV is a link to each image rather than the image itself, which is simply a limitation of CSVs.&lt;/p&gt;

&lt;h3&gt;
  
  
  So you have some data, now what?
&lt;/h3&gt;

&lt;p&gt;The dataset I'm using has a &lt;strong&gt;lot&lt;/strong&gt; more in it than I want, but filtering is a lot easier than supplementing and trying to tie two datasets together. There are 24 columns - &lt;code&gt;anime_id, Name, English name, Other name, Score, Genres, Synopsis, Type, Episodes, Aired, Premiered, Status, Producers, Licensors, Studios, Source, Duration, Rating, Rank, Popularity, Favorites, Scored By, Members, Image URL&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;I decided I only wanted to look at TV shows and movies, so I filtered for that while only storing the values I care about (for now, always subject to change). The other thing about this dataset is that it has over 20,000 rows of anime in it. While that's totally great for them, I don't enjoy the time needed to fetch 20,000 images from their URLs. So I chopped mine down to the top-rated 1,000, starting with this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

sorted_anime = sorted(reader, key=lambda x: x[4])
anime_reader = sorted_anime[:1000]


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;And if you're curious, in that particular dataset, the lowest rated is Aki no Puzzle at 2.37/10. Then I added a &lt;code&gt;reverse=True&lt;/code&gt; to my sorting and found that &lt;code&gt;UNKNOWN&lt;/code&gt; is a higher value than any number - because the CSV reader hands everything back as strings, and string comparison sorts "UNKNOWN" after every digit. Since I wanted this code to be follow-able and clean, I got rid of all of my &lt;code&gt;print(row[5], type(row[5]))&lt;/code&gt; calls, but just know that they were there! This is when I decided that the easiest way to handle some of the filtering I wanted to do would be to throw the whole CSV into a data frame, get rid of the stuff I don't want, then write it back out to a CSV for some secondary parsing. If you go look at the code you may notice that I actually chop the CSV at 1,010 rows, so I have slightly more than 1,000. That's because URLs can fail, and they might do so unpredictably; I figured a 1% buffer would be enough to make sure I have &lt;em&gt;at least&lt;/em&gt; 1,000 in the end.&lt;/p&gt;
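
&lt;p&gt;As a side note, if you want the numeric sort without the &lt;code&gt;UNKNOWN&lt;/code&gt; surprise, here's one way to do it - a minimal sketch with made-up rows, not the exact code from the repo:&lt;/p&gt;

```python
# Sketch: sort CSV rows by the Score column (index 4) numerically,
# treating non-numeric values like "UNKNOWN" as the lowest possible score.
# These rows are invented stand-ins; the real dataset has 24 columns.

def score_key(row):
    try:
        return float(row[4])
    except ValueError:
        return float("-inf")  # "UNKNOWN" and friends sink to the bottom

rows = [
    ["1", "Aki no Puzzle", "", "", "2.37"],
    ["2", "Some Hidden Gem", "", "", "UNKNOWN"],
    ["3", "Kimetsu no Yaiba", "", "", "8.50"],
]

# plain string sorting would have ranked "UNKNOWN" above every number,
# since strings compare character by character and 'U' sorts after '9'
top_rated = sorted(rows, key=score_key, reverse=True)
```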

&lt;p&gt;The problem with a CSV reader is that the only way to really use it is row by row. While that'll work just fine for changing the genres from a string to a list, it doesn't really work for sorting the CSV while excluding certain values. This should also save me a lot of time, since instead of having to check on each row for TV or Movie, and an actual numeric rating, I'll only be dealing with rows that have already cleared a few checks.&lt;/p&gt;
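
&lt;p&gt;That round trip looks roughly like this - a minimal pandas sketch with a made-up three-column CSV, not the repo's exact code:&lt;/p&gt;

```python
import io
import pandas as pd

# Sketch of the filter-then-rewrite idea: load the CSV into a DataFrame,
# keep only TV shows and movies with a numeric score, sort, and write a
# trimmed CSV back out for the row-by-row pass. The inline CSV here is
# invented; the real file has 24 columns and 20,000+ rows.
csv_text = """Name,Type,Score
Kimetsu no Yaiba,TV,8.5
Some Hidden Gem,OVA,7.1
Aki no Puzzle,Movie,2.37
Mystery Show,TV,UNKNOWN
"""

df = pd.read_csv(io.StringIO(csv_text))
df = df[df["Type"].isin(["TV", "Movie"])]                 # only TV and movies
df["Score"] = pd.to_numeric(df["Score"], errors="coerce") # UNKNOWN becomes NaN
df = df.dropna(subset=["Score"])                          # drop the NaNs
df = df.sort_values("Score", ascending=False).head(1010)  # ~1% headroom

out = io.StringIO()
df.to_csv(out, index=False)  # back to a CSV for the secondary parsing
```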

&lt;p&gt;&lt;strong&gt;After getting the data how I want it, let’s make it searchable.&lt;/strong&gt;&lt;br&gt;
RedisVL uses a schema to define how your data looks, and for what I have currently, this is how my schema looks:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

schema = {
    "index": {
        "name": "anime_demo",
        "prefix": "anime",
    },
    "fields": [
        {"name": "title", "type": "text"},
        {"name": "english_name", "type": "text"},
        {"name": "episodes", "type": "numeric"},
        {"name": "rating", "type": "numeric"},
        {"name": "synopsis", "type": "text"},
        {"name": "genres", "type": "tag"},
        {"name": "popularity_rank", "type": "numeric"},
        {"name": "poster_vector", "type": "vector", 
            "attrs": {
                 "dims": 768,
                 "distance_metric": "cosine",
                 "algorithm": "flat",
                 "datatype": "float32"
            }
         }
    ]
}


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We’ll have a title, an English name, the number of episodes, the numeric rating, a synopsis, the genres of the show or movie, the popularity ranking, and a vector embedding of the poster image. The dimensions of your vector will change based on what model you’re using - but knowing that information and correctly passing it along to Redis are critical steps to make vector search work correctly.&lt;/p&gt;
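
&lt;p&gt;One cheap guard here - a sketch assuming the schema dict above and any vectorizer that returns a plain list of floats (the &lt;code&gt;vector_dims&lt;/code&gt; helper and &lt;code&gt;fake_embedding&lt;/code&gt; are mine, not part of RedisVL) - is to check the embedding length against the schema before loading anything, so a model mismatch fails loudly instead of at query time:&lt;/p&gt;

```python
# Sketch: verify an embedding's length matches the dims declared in the
# schema before writing it. fake_embedding stands in for real model output
# (clip-ViT-L-14 produces 768-dimensional vectors).

schema = {
    "fields": [
        {"name": "title", "type": "text"},
        {"name": "poster_vector", "type": "vector",
         "attrs": {"dims": 768, "distance_metric": "cosine",
                   "algorithm": "flat", "datatype": "float32"}},
    ]
}

def vector_dims(schema):
    # find the first vector field and report its declared dimensions
    for field in schema["fields"]:
        if field["type"] == "vector":
            return field["attrs"]["dims"]
    raise ValueError("no vector field in schema")

fake_embedding = [0.0] * 768  # stand-in for vectorizer.embed(...)
assert len(fake_embedding) == vector_dims(schema), "model/schema dims mismatch"
```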

&lt;p&gt;After that the changes to the Gradio portion were pretty simple. I changed up the structure a bit due to needing to pass the image in a format that would work for Gradio. So now we have a demo that contains the elements for the webpage and the search functionality, then we launch it.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

&lt;p&gt;with gr.Blocks() as demo:&lt;br&gt;
   search_term = gr.Textbox(label="Search the top 100 anime by their &lt;br&gt;
      posters")&lt;br&gt;
   search_results = gr.Textbox(label="Closest Anime (by poster &lt;br&gt;
      search)")&lt;br&gt;
   poster = gr.Image(label="Closest Anime Poster")&lt;/p&gt;

&lt;p&gt;def anime_search(text):&lt;br&gt;
       embedding = vectorizer.embed(text, as_buffer=True)&lt;br&gt;
       query = VectorQuery(vector = embedding, vector_field_name = &lt;br&gt;
           "poster_vector", return_fields=["title", "img_name"])&lt;br&gt;
       results = index.query(query)[0]&lt;br&gt;
       title = results['title']&lt;br&gt;
       img = results['img_name']&lt;br&gt;
       return title, Image.open(f'anime_posters/{img}')&lt;/p&gt;

&lt;p&gt;gr.Button("Search").click(fn=anime_search, inputs=search_term, &lt;br&gt;
      outputs=[search_results, poster])&lt;/p&gt;

&lt;p&gt;demo.launch()&lt;/p&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h3&gt;
  
  
  Things I'll do next
&lt;/h3&gt;

&lt;p&gt;I'm still on the hunt for a bit more data about these anime - I would love a longer description and potentially more images. I definitely want to see how many I can add, and how many fields can be vectors, in a free tier Redis database.&lt;/p&gt;

&lt;p&gt;I’m also planning to see what all Gradio can do - ideally I’d like to be able to cycle through the results and display both the poster and the anime title for about the top 5 results from the vector search.&lt;/p&gt;

&lt;p&gt;The other thing I’m going to do this week is a bit of refactoring. I’d like to add some of the standard best practices to this repository, like a real README and a requirements.txt that will help you try it out if you want.&lt;/p&gt;

&lt;p&gt;I would love to hear from you about the AI apps you’re building, if you know how to scroll through images in Gradio, and if you have favorite tools for scraping dynamic web content. Check out the RedisVL project yourself if you’re considering an AI app backed by Redis, as it really does make this whole thing so much easier.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>tutorial</category>
      <category>programming</category>
    </item>
    <item>
      <title>Let’s build an AI app</title>
      <dc:creator>Savannah Norem</dc:creator>
      <pubDate>Tue, 10 Sep 2024 12:15:00 +0000</pubDate>
      <link>https://forem.com/savannah_norem/lets-build-an-ai-app-542h</link>
      <guid>https://forem.com/savannah_norem/lets-build-an-ai-app-542h</guid>
      <description>&lt;p&gt;AI is everywhere these days, and all of us are feeling the pressure to become AI developers. We all probably have an idea or two somewhere in the noggin that we think could make it big, but getting started in this new and unstable landscape can be incredibly daunting. As new tools pop up left and right, as information changes and hallucinations happen, it can be hard to figure out where to even start. So, let’s build an AI app.&lt;/p&gt;

&lt;p&gt;Why let’s? Because your feedback &lt;em&gt;can be&lt;/em&gt; incorporated! I’m not saying it &lt;em&gt;will be&lt;/em&gt;, but it certainly could be. Because I don’t already have a finished project sitting in GitHub somewhere waiting for me to show it to you all nice and neat and tidy. I’m building an AI application out, week by week, and sharing with y’all what I did, what I’m going to attempt next, and what I learned along the way.&lt;/p&gt;

&lt;h3&gt;
  
  
  So, what have I done already?
&lt;/h3&gt;

&lt;p&gt;The first page of almost any instruction manual is going to list out all the parts and tools you have and what you need, and I’ve already made a few decisions on what you’ll need to follow me along. Now the big caveat is that since this is not already a finished product, some of these tools might not actually be the ones that get used in the end! Maybe something new will come out in the next few weeks. Well, there’s almost a guarantee that more than one new AI framework tool will come out in the next few weeks, but maybe it’ll easily drop in and be perfect for what I need. &lt;/p&gt;

&lt;p&gt;So this is a “the things I’m starting with, and the direction I’m heading” kickoff.&lt;/p&gt;

&lt;p&gt;Tools I’m planning on using (and why):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/redis/redis-vl-python" rel="noopener noreferrer"&gt;RedisVL&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/sentence-transformers/clip-ViT-L-14" rel="noopener noreferrer"&gt;HuggingFace&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.gradio.app/guides/quickstart" rel="noopener noreferrer"&gt;Gradio&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;RedisVL (Redis Vector Library) is “the ultimate Python client designed for AI applications harnessing the power of Redis”; so stated in the readme, so shall it be. This tool makes starting an AI app backed by Redis incredibly simple. RedisVL has built-in integrations with popular embedding providers like HuggingFace, Cohere, and OpenAI (among others). This means that while you still have to pick a model and install it, instead of you as the developer trying to get two different things to talk to each other, RedisVL developers have figured it out for you. &lt;/p&gt;

&lt;p&gt;The model itself is the piece that’s most subject to change. I chose HuggingFace for now because they have a multi-modal model (clip-ViT-L-14) and RedisVL has a HuggingFace integration. I want to do both text and image embeddings for this project, and a model that does both makes that sound a lot easier. HuggingFace is well known and doesn’t seem to be going anywhere. Which is important! Not that the stakes for this project are anywhere near as high, but I’ll always avoid tech that seems so buzzwordy and out there that it risks leaving its users with &lt;a href="https://spectrum.ieee.org/bionic-eye-obsolete#:~:text=Second%20Sight%20left%20users%20of%20its%20retinal%20implants%20in%20the%20dark&amp;amp;text=Ross%20Doerr%20%5Bleft%5D%20and%20Barbara,Second%20Sight%2C%20stopped%20making%20them." rel="noopener noreferrer"&gt;bionic eyes&lt;/a&gt; and no support for them.&lt;/p&gt;

&lt;p&gt;I’ve seen Gradio all over recently - and am frankly just wanting to try it out. Their website makes it seem so incredibly simple and as a primarily Python developer, frontends scare me.&lt;/p&gt;

&lt;h4&gt;
  
  
  So what else?
&lt;/h4&gt;

&lt;p&gt;I don’t know if you’ve ever gathered all your supplies, gotten super excited and started working on a project, only to realize that the glue sticks you bought don’t actually fit into the glue gun you're using, but I have. So the next step was to see if I could get the most minimally viable project in the history of MVPs working. RedisVL &lt;em&gt;currently&lt;/em&gt; only supports embeddings for text, but you’ll find an experimental fork on my GitHub that literally only has a check for string type commented out to make this work for now. But don’t forget, type checks keep things from falling over like they got hit by a tank and help ensure things fail gracefully. I’m hoping for some changes to be implemented in that library that would make it so you don’t have to run an experimental fork for this to work and will be updating my code as they update RedisVL.&lt;/p&gt;

&lt;p&gt;Now feast your eyes on the gorgeous application that lets you do text search over a whole host of images (there are currently six) and returns the vector distance and the image names in search relevance order! You then get to go lookup what they look like and decide for yourself. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6hb6row9n8ckdn4c9wz.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy6hb6row9n8ckdn4c9wz.jpeg" alt="an incredibly basic web application with currently only text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But here you can see I searched “strawberries in a bowl” and the top result is in fact the image I so creatively titled “strawberries_white_bowl_brown_background”.&lt;/p&gt;

&lt;p&gt;Also on &lt;a href="https://github.com/sav-norem/redis-vl-python" rel="noopener noreferrer"&gt;my GitHub&lt;/a&gt; you’ll find a demo for this app - now, don’t forget what you’re looking at right above this. There’s not even a requirements.txt just yet, and you’d need to use the experimental fork of RedisVL that’s down a sanity check to make sure the models can do what you’re asking them to. But you can check it out yourself and see that by using Gradio and RedisVL, getting from images to embeddings to searching took ~60 lines of code.&lt;/p&gt;

&lt;h3&gt;
  
  
  What’s next?
&lt;/h3&gt;

&lt;p&gt;Up next I’ll be searching for larger datasets, looking at what makes a dataset good for this type of project, and going into the different types of vector similarity (but only to a level that’s relevant). I’ll also be getting (at least) the closest result image to display on the web page so you can see how you think the search went without having to cross reference what you named your images.&lt;/p&gt;

&lt;p&gt;So keep reading for things I learned this week, and drop a comment if you have recommendations for dataset searching (aside from Kaggle as that’s where I’m starting), have thoughts on what makes a good dataset for an image / text vector search app, or just to say hey. &lt;/p&gt;

&lt;p&gt;And in case you’re curious why all the images I have this week are of strawberries, you can check out my first post on why LLMs don’t know &lt;a href="https://dev.to/savannah_norem/how-many-rs-are-in-strawberry-and-do-llms-know-how-to-spell-2513"&gt;how many “r”s are in “strawberry”&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Things I learned this week
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A lot of “how to build an AI app” articles are outdated. You no longer need to know the difference between KNN and decision trees, you don’t have to use TensorFlow (or PyTorch or Google AutoML), you no longer need to build or fine tune a model on your own. And while some articles are catching up with the new tools, right now it seems your best bet is to pick a tool to work with, and check out their documentation.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;That you can &lt;code&gt;pip install&lt;/code&gt; a local library. Whenever I’d previously worked with a library I was also editing, I would typically import it with a path. With a virtual environment, though, my workflow became &lt;code&gt;pip install&lt;/code&gt; with the path to my local library followed by &lt;code&gt;python3 vl_demo.py&lt;/code&gt;, so I was just pressing up in my CLI to alternate between the two until I made progress, which was really easy and convenient. Instead of figuring out the import path, rebuilding the directory after every change, and worrying about my sys.path all the time, this worked for me. It may not be your favorite workflow, but I didn’t know that as long as there’s a setup.py or pyproject.toml, you can pip install it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;That embeddings from a specific model are deterministic. While getting started on this I was searching using the embedding of one of the images I’d already saved. I found that I got a vector_distance of 5.36441802979e-07, which is essentially zero (0.00000053…). But I wasn’t sure if this was because the embedding had changed ever so slightly (if they were non-deterministic) or if it was a simple case of floating-point math not always dividing and rounding perfectly. So I learned that they are in fact deterministic and so the vector distance should be zero, but math.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The idea of Command Query Responsibility Segregation (&lt;a href="https://www.techtarget.com/searchapparchitecture/definition/CQRS-command-query-responsibility-segregation#:~:text=CQRS%20is%20a%20software%20architecture,operations%20for%20a%20data%20store." rel="noopener noreferrer"&gt;CQRS&lt;/a&gt;). This idea says that you don’t necessarily have to read and update your data in the same way. It makes sense, especially for applications that are going to be very heavy on one xor the other. It does potentially add complexity, and it’s very likely I won’t touch this at all for this project. But it’s something I learned about this week that is definitely tangentially related.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
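
&lt;p&gt;On point 3: you can convince yourself that a near-zero-but-not-zero distance is just a float32 artifact with a few lines. Numpy is assumed here, and this stands in for the model's output - it isn't the demo code:&lt;/p&gt;

```python
import math
import numpy as np

# Cosine distance between a float32 vector and itself. Mathematically it is
# exactly 0; in float32 the dot product and norms are each rounded, so the
# division can leave a residue on the order of 1e-7, like the 5.36e-07 above.
rng = np.random.default_rng(0)
v = rng.standard_normal(768).astype(np.float32)

def cosine_distance(a, b):
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

d = cosine_distance(v, v)
assert math.isclose(d, 0.0, abs_tol=1e-5)  # essentially zero, maybe not exactly
```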

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>How many 'r's are in strawberry? And do LLMs know how to spell?</title>
      <dc:creator>Savannah Norem</dc:creator>
      <pubDate>Fri, 30 Aug 2024 13:00:00 +0000</pubDate>
      <link>https://forem.com/savannah_norem/how-many-rs-are-in-strawberry-and-do-llms-know-how-to-spell-2513</link>
      <guid>https://forem.com/savannah_norem/how-many-rs-are-in-strawberry-and-do-llms-know-how-to-spell-2513</guid>
      <description>&lt;p&gt;Well the short answers are three and kind of… but not really.&lt;/p&gt;

&lt;p&gt;Any which way you cut it, there are three ‘r’s in strawberry. But different large language models (LLMs) are evidently struggling with this question. So let’s take a look at what they’re “saying”, how they’re justifying it, what some of the flaws are, and some of the broader implications for LLMs. &lt;/p&gt;

&lt;p&gt;All these screenshots are from me simply asking different LLMs the question of “how many ‘r’s are in the word strawberry?”, and since LLMs are not deterministic, you may or may not get the same answers. But it’s &lt;a href="https://community.openai.com/t/incorrect-count-of-r-characters-in-the-word-strawberry/829618" rel="noopener noreferrer"&gt;definitely&lt;/a&gt; &lt;a href="https://www.reddit.com/r/ClaudeAI/comments/1dlgbq3/claude_respectfully_failing_the_strawberry_test/" rel="noopener noreferrer"&gt;not&lt;/a&gt; just a &lt;a href="https://www.reddit.com/r/ChatGPT/comments/1e5fqln/my_solution_to_the_strawberry_problem/" rel="noopener noreferrer"&gt;me&lt;/a&gt; problem, and given that (&lt;a href="https://www.linkedin.com/posts/genai-works_ai-tech-chatgpt-activity-7234862746972954626-9InY?utm_source=share&amp;amp;utm_medium=member_desktop" rel="noopener noreferrer"&gt;reportedly&lt;/a&gt;) OpenAI’s new model coming this fall is being called Strawberry, it seems that people have taken note.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Can large language models spell?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;LLMs from GPT to Jamba are saying that “strawberry” has two “r”s in it.&lt;/p&gt;

&lt;p&gt;Side note: if you haven’t checked out &lt;a href="//lmarena.ai"&gt;lmarena.ai&lt;/a&gt;, you should. You’re able to pit models against each other anonymously and vote on which is better, which can also be super helpful for showing how bad they all are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnnsi9v6d07hz9or783h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgnnsi9v6d07hz9or783h.png" alt="lmarena - two models both responding that "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4duqy1qcinx4c1p9wke.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4duqy1qcinx4c1p9wke.png" alt="pi.ai also responding with two"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To be fair, they're only sometimes wrong. But when they are, they sometimes double down, sometimes correct themselves, but never seem to know where the “r”s actually are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26gva9b3gc9nstab4u8g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F26gva9b3gc9nstab4u8g.png" alt="gpt-4o responds with two, then changes and goes with three"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpozskv9t8ncr60s7lsla.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpozskv9t8ncr60s7lsla.png" alt="pi.ai tells us there's a second "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq38bqr2fdrayxw6526lp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq38bqr2fdrayxw6526lp.png" alt="Claude responds with two, simply skips an "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Pi actually adding a “second e” into strawberry...somewhere...? And Claude simply skips over one.&lt;/p&gt;

&lt;p&gt;Claude had a particularly interesting response and said that there is no third “r” because it’s part of a “double r”. To be fair, when prompted about why that didn’t count, Claude backed up and said that for both spelling and counting purposes, a double “r” is in fact two “r”s, and therefore strawberry has three. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feibkecnu9wid338tx75o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feibkecnu9wid338tx75o.png" alt="Claude responds that there is no third "&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;So why is this hard?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It’s so intuitive to humans to count how many “r”s are in strawberry, or how many “e”s are in timekeeper or “o”s in bookshop, but for LLMs there’s clearly a different story. There are a few different potential reasons for this, and they center around how LLMs decide what to say.&lt;/p&gt;

&lt;p&gt;If you’re not familiar with how LLMs work, the briefest explanation is that they’ve looked at a lot of data and have an idea of what words go together and how sentences are supposed to look, but hallucinations and spelling mistakes occur because they’re basically playing a probability game of what word will come next. So for a counting task, like determining the number of "r"s in "strawberry," the model isn't directly calculating this count. Instead, it's predicting what the most likely correct answer would be based on similar patterns it has seen during training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Probability&lt;/strong&gt;&lt;br&gt;
The most basic explanation is just that the model determined two is the most likely word to come next. This could be particularly affected with words like strawberry and others with double consonants, since if I was googling how to spell strawberry, I’m probably wondering whether “berry” has one “r” or two, which could lead to their training data being skewed towards two. &lt;/p&gt;

&lt;p&gt;If you’re not deep in tech, this is probably the only answer you really need to know. Since LLMs are trained off of data, and asking “how many “r”s are in the word strawberry?” is not a particularly common question, there’s probably not a lot of webpages out there that explicitly state “strawberry has three ‘r’s”, so it’s probably not something LLMs know a lot about. &lt;/p&gt;

&lt;p&gt;See how much probability can be involved?&lt;/p&gt;
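
&lt;p&gt;If you like toy code, the probability game can be caricatured in a few lines - the numbers here are invented, nothing is trained:&lt;/p&gt;

```python
# Caricature of next-token prediction: pick the most probable continuation
# of "the number of r's in strawberry is ___". The probabilities are made
# up to mimic training data skewed toward "two" by berry-spelling queries.
continuations = {"two": 0.55, "three": 0.40, "four": 0.05}
answer = max(continuations, key=continuations.get)
# the model lands on "two" - confidently wrong, by construction
```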

&lt;p&gt;&lt;strong&gt;Tokenization&lt;/strong&gt;&lt;br&gt;
Digging a bit deeper is the issue of tokenization. LLMs work by creating “tokens” to represent words since they can’t actually read the same way you and I do. When it comes to "strawberry," the model might tokenize it in a way that splits the word into chunks like "straw" and "berry." Now, here's where the fun begins (or the confusion, depending on how you look at it). The model might focus on the "berry" part and count the 'r's in that token, potentially ignoring "straw" altogether. Or, it could tokenize "strawberry" into even smaller chunks, like "str," "aw," and "berry," which might lead to even more confusion about how many 'r's are actually in there. And here's another twist—different LLMs might tokenize the word differently depending on how they were trained and what algorithms they use. This means that one model might handle "strawberry" fairly well, while another could completely fumble the task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So how do you fix it?&lt;/strong&gt;&lt;br&gt;
There’s a whole field right now around prompt engineering—basically, the art and science of figuring out how to ask LLMs questions in a way that gets you the best possible answers. When it comes to getting LLMs to count the 'r's in "strawberry" correctly, a few tricks can help.&lt;/p&gt;

&lt;p&gt;One approach is to be super specific in your prompt. Instead of just asking, “How many 'r's are in the word strawberry?” you might say something like, “Can you spell out the word 'strawberry' and count each 'r' as you go?” This way, you’re guiding the model to break down the word step by step, reducing the chances of it glossing over those pesky 'r's.&lt;/p&gt;

&lt;p&gt;Pressing the models you’re interacting with to verify the information, asking them to prove it with code, or telling the model that they’re wrong to see how they respond are all skills to build up if you’re trying to get the most out of the LLMs you use.&lt;/p&gt;

&lt;p&gt;But here’s the thing: even with these strategies, LLMs might still trip up. Not all LLMs can run code in every environment, and so even when code is generated that would objectively give a different answer, the LLM doesn’t necessarily know that.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0idk9lzka4fmq7za1nr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj0idk9lzka4fmq7za1nr.png" alt="gpt-4o initially responds with two, but corrects to three after providing a code solution"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkeurjt5a5nv4h6ycsv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzkeurjt5a5nv4h6ycsv4.png" alt="lmarena shows two models, one with three and one with two that both say their code correctly produces their answer, even though their answers are different and the code is essentially the same"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That’s because the root of the problem lies in how these models are built and trained. They’re not perfect, and they weren’t designed to be perfect at tasks like counting letters. So the real fix isn’t just about tweaking your prompts—it’s about understanding what LLMs are good at (and what they’re not so good at) and knowing when to step in and double-check their work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;While it might seem like just a fun quirk, these errors underscore some significant challenges in relying on LLMs for more critical tasks. If an LLM struggles with something as simple as counting letters in a word, what does that say about its reliability in more complex, nuanced situations? These quirks highlight the importance of human oversight and the need for users to be aware of the limitations of AI, especially when accuracy is crucial.&lt;/p&gt;

&lt;p&gt;The “strawberry” question is a fun, if not slightly concerning, example of how even the most advanced AI can trip up on simple tasks. As developers, users, and enthusiasts, it’s essential to approach LLMs with both excitement and caution. Understanding their strengths and weaknesses allows us to leverage these tools effectively while avoiding potential pitfalls.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try it yourself&lt;/strong&gt;: Experiment with other words or phrases and see how LLMs handle them. Ask them to count letters in "bookkeeper," "committee," or any other word with a tricky spelling. Share your findings in the comments—I’d love to see what results you get!&lt;br&gt;
&lt;strong&gt;Think critically&lt;/strong&gt;: As you use LLMs in your work or daily life, keep in mind that they’re not infallible. Use these tools wisely, and always be prepared to double-check their output.&lt;br&gt;
&lt;strong&gt;Join the conversation&lt;/strong&gt;: What do you think about the broader implications of these errors? Have you encountered similar quirks in other AI models? Share your thoughts and experiences in the comments below.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>beginners</category>
      <category>learning</category>
    </item>
  </channel>
</rss>
