<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jimmy Guerrero</title>
    <description>The latest articles on Forem by Jimmy Guerrero (@jimmyguerrero).</description>
    <link>https://forem.com/jimmyguerrero</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F676003%2F2af86bca-41aa-4146-8f22-e2a315d6ea93.jpeg</url>
      <title>Forem: Jimmy Guerrero</title>
      <link>https://forem.com/jimmyguerrero</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jimmyguerrero"/>
    <language>en</language>
    <item>
      <title>How to Build a Semantic Search Engine for Emojis</title>
      <dc:creator>Jimmy Guerrero</dc:creator>
      <pubDate>Wed, 10 Jan 2024 20:13:56 +0000</pubDate>
      <link>https://forem.com/voxel51/how-to-build-a-semantic-search-engine-for-emojis-31ff</link>
      <guid>https://forem.com/voxel51/how-to-build-a-semantic-search-engine-for-emojis-31ff</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Author:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/jacob-marks/" rel="noopener noreferrer"&gt;Jacob Marks&lt;/a&gt; - MLE &amp;amp; Developer Evangelist at Voxel51&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;Find The Sentiment You’re Looking For 🔍🤔😀🚀&lt;/h1&gt;

&lt;p&gt;If you’ve ever used Google Docs or Slack, you may have noticed that when you type a “:” immediately followed by another character, a list of emojis pops up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fpsgrlj0p7ydjm5qqw3.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fpsgrlj0p7ydjm5qqw3.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since I discovered this, I’ve been making heavy use of the feature. I add emojis to far more of my messages, blog posts, and other writing than I ever imagined I would. In fact, I became so accustomed to this way of adding emojis that I installed &lt;a href="https://matthewpalmer.net/rocket/" rel="noopener noreferrer"&gt;Rocket&lt;/a&gt; — a free app that brings the same emoji searchability to all text boxes and text editors on the computer. It’s a game changer. &lt;/p&gt;

&lt;p&gt;But as I’ve used these emoji search engines more and more, I’ve noticed a frustrating limitation: every search matches the exact text of your query against the name and description of each emoji. Essentially, you need to phrase your query very precisely for any results to show up. &lt;/p&gt;

&lt;p&gt;Here’s an example: if we search for “audio”, not a single result shows up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqjyg2c7n2qicza7d5vt.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqjyg2c7n2qicza7d5vt.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This isn’t because the set of emojis is lacking in the audio category. If we were to type in “music” or “speaker”, we would get a long list of results. Instead, it has to do with the fact that the specific string of text “audio” does not show up in the name or textual description associated with any of the emojis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozfn1l04jx23lxhv6pn2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozfn1l04jx23lxhv6pn2.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This relatively minor inconvenience bothered me so much that I decided to build this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoc5b2rmt230wp5h35rk.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoc5b2rmt230wp5h35rk.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By “this”, I mean an open-source semantic emoji search engine, with both UI-centric and CLI versions. The Python CLI library can be found &lt;a href="https://github.com/jacobmarks/emoji_search" rel="noopener noreferrer"&gt;here&lt;/a&gt;, and the UI-centric version can be found &lt;a href="https://github.com/jacobmarks/emoji-search-plugin" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You can also play around with a hosted (also free) version of the UI emoji search engine online &lt;a href="http://try.fiftyone.ai/datasets/emojis" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosug0bm5p5fvsdz7y43j.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosug0bm5p5fvsdz7y43j.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Command line version of the Semantic Emoji Search Engine&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Building this was not as simple or straightforward as I initially hoped. It took a lot of experimentation, and a lot of ideas I thought were quite clever fell essentially flat. But in the end, I was able to create an emoji search engine that works fairly well.&lt;/p&gt;

&lt;p&gt;Here’s how I built it, what worked, and what didn’t, and the lessons learned along the way.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is an Emoji&lt;/li&gt;
&lt;li&gt;The Data&lt;/li&gt;
&lt;li&gt;Emojis versus Images and Text&lt;/li&gt;
&lt;li&gt;Bridging the Modality Gap&lt;/li&gt;
&lt;li&gt;Using the Emoji Search Engine&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="1"&gt;What is an Emoji&lt;/h2&gt;

&lt;p&gt;Before building a semantic search engine for emojis, it’s worth briefly explaining what exactly an emoji is. The term emoji derives from the Japanese kanji 絵 (eh) meaning picture, and 文字 (moji) meaning letter or character. Etymologically, then, an emoji is a pictogram: despite the resemblance, the word has no connection to the English word emotion, and an emoji is not an “emotion icon” — that is an &lt;a href="https://en.wikipedia.org/wiki/Emoticon" rel="noopener noreferrer"&gt;emoticon&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Along with &lt;a href="https://en.wikipedia.org/wiki/List_of_Unicode_characters#Latin_script" rel="noopener noreferrer"&gt;alphanumeric characters&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Latin_Extended-B#African_letters_for_clicks" rel="noopener noreferrer"&gt;African click sounds&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/List_of_Unicode_characters#Mathematical_symbols" rel="noopener noreferrer"&gt;mathematical&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/List_of_Unicode_characters#Geometric_Shapes" rel="noopener noreferrer"&gt;geometric symbols&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/List_of_Unicode_characters#Dingbats" rel="noopener noreferrer"&gt;dingbats&lt;/a&gt;, and &lt;a href="https://en.wikipedia.org/wiki/List_of_Unicode_characters#Control_codes" rel="noopener noreferrer"&gt;computer control sequences&lt;/a&gt;, emojis can be represented as Unicode characters, making them computer-readable. Unlike alphanumeric characters and other symbols, however, emojis are maintained by the &lt;a href="https://home.unicode.org/" rel="noopener noreferrer"&gt;Unicode Consortium&lt;/a&gt;. The consortium solicits proposals for new emojis, and regularly selects which emojis will be added to the standard.&lt;/p&gt;

&lt;p&gt;At the time of writing, in November 2023, there are more than &lt;a href="https://home.unicode.org/emoji/about-emoji/" rel="noopener noreferrer"&gt;3,600 recognized emojis&lt;/a&gt;, symbolizing a wide range of ideas and sentiments. Some emojis are represented by a single unicode character, or code-point. For example, the “grinning face” emoji, 😀, is represented in unicode as U+1F600. &lt;/p&gt;

&lt;p&gt;Others are represented with sequences of code-points. These sequences, which combine single code-point emojis with the zero-width-joiner unicode character, are known as ZWJ sequences, and allow for the combining of concepts, in much the same way as Chinese radicals can be combined to create a character that tells a story. As an example, the emoji 👨‍👩‍👧 is a zero-width joining of the emojis for man 👨 (U+1F468), woman 👩 (U+1F469), and girl 👧 (U+1F467), connected by the ZWJ code-point U+200D:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;👨‍👩‍👧 = U+1F468 U+200D U+1F469 U+200D U+1F467&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
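&lt;p&gt;This composition can be checked directly in Python using nothing but the standard library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## build the family emoji from its constituent code points
man, woman, girl, zwj = "\U0001F468", "\U0001F469", "\U0001F467", "\u200D"
family = man + zwj + woman + zwj + girl
print(family)  ## 👨‍👩‍👧
print([f"U+{ord(c):04X}" for c in family])
## ['U+1F468', 'U+200D', 'U+1F469', 'U+200D', 'U+1F467']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;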

&lt;p&gt;According to the Unicode Consortium, 92% of the world’s online population uses emojis in their communications, and the ten most-used emojis in 2021 were: 😂 ❤️ 🤣 👍 😭 🙏 😘 🥰 😍 😊.&lt;/p&gt;

&lt;h2 id="2"&gt;Starting with the Data&lt;/h2&gt;

&lt;p&gt;Given that emojis are pictographs of sorts, I wanted to utilize both textual and visual information in the search process. My initial hypothesis was that for many emojis, the name — the text string used to invoke the emoji — conveys but a fraction of its meaning. This can be due to many reasons, from the limitations of natural language, to the additional meanings imbued by cultures and visual similarities. In order to truly bring the full essence of the emoji to bear, I needed to make use of visual information. &lt;/p&gt;

&lt;p&gt;I found this &lt;a href="https://www.kaggle.com/datasets/subinium/emojiimage-dataset" rel="noopener noreferrer"&gt;Kaggle Emojis dataset&lt;/a&gt; from 2021, which has data about 1816 emojis, including the emoji representation, the text associated with it, the Unicode code point (or code points), and a &lt;a href="https://en.wikipedia.org/wiki/Base64" rel="noopener noreferrer"&gt;base64&lt;/a&gt; encoded image. Here’s what the first few rows of the dataset look like, loaded as a pandas DataFrame:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sco6qs2ijn28c0z55ys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sco6qs2ijn28c0z55ys.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are separate columns with names &lt;code&gt;Apple&lt;/code&gt;, &lt;code&gt;Google&lt;/code&gt;, &lt;code&gt;Facebook&lt;/code&gt;, etc. because the emoji renders differently depending on the computer, website, or application. I decoded the images from base64 and converted them into &lt;a href="https://pypi.org/project/Pillow/" rel="noopener noreferrer"&gt;Pillow&lt;/a&gt; images. Here is the first image from the Kaggle dataset (grinning face):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import base64
from io import BytesIO
from PIL import Image
## decode and convert first row Apple image
im_str = df.Apple[0].replace('data:image/png;base64,', '')
im = Image.open(BytesIO(base64.b64decode(im_str)))
im
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqamhoyq2weynfafbfmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqamhoyq2weynfafbfmc.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Upon conversion, however, it became clear that the images were very low resolution. This one, for instance, is only 72×72 pixels. To improve the quality of the images that I was going to pass into downstream models, and to improve the quality of the experience in the eventual UI-based application, I passed all of these low-resolution images into &lt;a href="https://replicate.com/nightmareai/real-esrgan" rel="noopener noreferrer"&gt;Real-ESRGAN&lt;/a&gt; to 10x the resolution. &lt;/p&gt;

&lt;p&gt;This is what the resulting images looked like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sdw3gbkxtvars1koxmf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sdw3gbkxtvars1koxmf.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not all of the emojis had images for all of the image columns in the pandas DataFrame, so I used the first viable base64 encoding for each row.&lt;/p&gt;
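&lt;p&gt;That fallback can be sketched with a small helper (a minimal illustration; the vendor column list here is an assumption based on the DataFrame above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

## vendor image columns, in order of preference (assumed subset)
VENDOR_COLS = ["Apple", "Google", "Facebook", "Twitter"]

def first_viable_image(row):
    """Return the first non-null base64 image string for this emoji row."""
    for col in VENDOR_COLS:
        if col in row and pd.notna(row[col]):
            return row[col]
    return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;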

&lt;h2 id="3"&gt;Emojis Versus Images and Text&lt;/h2&gt;

&lt;p&gt;Before diving any deeper, I want to emphasize one crucial element of emojis that makes them so special, and deserving of their own semantic search engine: in a sense, they are both images and text. From the human perspective, we can represent each emoji as a unicode character, on the same playing field as text characters, and we can represent it as a standalone image, both of which we saw in the previous section. Said another way, if we squint with one eye, we can see a pictogram as a picture, and if we squint with the other eye, we can see the same pictogram as text.&lt;/p&gt;

&lt;p&gt;Computers, however, are not known for their ability to squint. While a computer may be able to display a unicode code-point as an emoji, a machine learning model may not have a good way of interpreting the emoji as text or images.&lt;/p&gt;

&lt;p&gt;Whenever I’m working on semantic search applications that connect images and text, I start with a family of models known as &lt;a href="https://github.com/openai/CLIP" rel="noopener noreferrer"&gt;contrastive language image pre-training&lt;/a&gt; (CLIP). These models are trained on image-text pairs to generate similar vector representations or &lt;a href="https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526" rel="noopener noreferrer"&gt;embeddings&lt;/a&gt; for images and their captions, and dissimilar vectors when images are paired with other text strings. There are multiple CLIP-style models, including &lt;a href="https://github.com/mlfoundations/open_clip" rel="noopener noreferrer"&gt;OpenCLIP&lt;/a&gt; and &lt;a href="https://github.com/facebookresearch/metaclip" rel="noopener noreferrer"&gt;MetaCLIP&lt;/a&gt;, but for simplicity we’ll focus on the original CLIP model from OpenAI. No model is perfect, and at a fundamental level there is no right way to compare images and text, but CLIP certainly provides a good starting point.&lt;/p&gt;

&lt;h2&gt;Interpreting Emojis as Text&lt;/h2&gt;

&lt;p&gt;At a high level, language models process input text by converting it into an ordered sequence of tokens, and then encoding the tokens and positional information in a dense numerical vector. Each language model has its own vocabulary of tokens to decompose a text string into, spanning from individual letters to complete words. Some tokens are easily interpretable by a human, while others are not, and in the case of CLIP, the vocabulary has 49,408 entries.&lt;/p&gt;

&lt;p&gt;Let’s see an explicit example. Assuming the CLIP library is installed, we can tokenize a text string “a dog” with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import clip
text_tokens = clip.tokenize("a dog")
print(text_tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([[49406,   320,  1929, 49407,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], dtype=torch.int32)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output tensor contains four nonzero entries: 49406, 320, 1929, and 49407. To make sense of these, we can map these values back to keys in the &lt;a href="https://huggingface.co/openai/clip-vit-base-patch32/resolve/main/vocab.json" rel="noopener noreferrer"&gt;CLIP vocabulary dictionary&lt;/a&gt;. The first number, 49406, corresponds to the key “&amp;lt;|startoftext|&amp;gt;”, and the last number, 49407, corresponds to the key “&amp;lt;|endoftext|&amp;gt;”. These are special tokens denoting the beginning and end of the text string to be encoded. The second number, 320, maps back to “a”, which signifies the character “a” followed by a new word. Finally, 1929 is the value for the key “dog”.&lt;/p&gt;
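&lt;p&gt;A reverse lookup of this kind is just an inverted dictionary. A minimal sketch, using a toy two-entry stand-in for CLIP’s 49,408-entry vocabulary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## toy stand-in for CLIP's vocabulary: token string to token ID
vocab = {"a": 320, "dog": 1929}

## invert the mapping so token IDs can be looked up
id_to_token = {v: k for k, v in vocab.items()}
print(id_to_token[1929])  ## dog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;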

&lt;p&gt;If we try to tokenize a string containing an emoji, however, we quickly run into a hitch: emojis don’t get tokenized in the same way as other characters do. Let’s start with the dog emoji 🐶:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clip.tokenize("🐶")
## [49406, 10631, 49407, 0, ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Doing a reverse lookup for the key associated with 10,631, we get the token “ðŁĲ¶”. But if we pass this string into the tokenizer, we get a completely different set of token IDs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clip.tokenize("ðŁĲ¶")
## [49406, 127, 108, 40419, 72, 329, 126, 370, 49407, 0, ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An even more curious case concerns the flag emojis. If we take the emoji for the flag of Cameroon, for instance, we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clip.tokenize("🇨🇲")
## [49406, 8989, 366, 49407, 0, ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two non-start/end tokens here correspond to “ðŁĩ¨ðŁĩ” and “²”. If we plug the first of these back into the tokenizer, we get another completely different set of token IDs, but the second maps back to itself.&lt;/p&gt;

&lt;p&gt;Things get even more precarious when we start comparing embeddings of text strings with embeddings of emojis, parsed as text strings via this tokenizer. After all, we want to find the most relevant emojis given a text query. We can use the &lt;a href="https://medium.com/@milana.shxanukova15/cosine-distance-and-cosine-similarity-a5da0e4d9ded" rel="noopener noreferrer"&gt;cosine distance&lt;/a&gt; as a way to measure how similar or different two vectors are — and by proxy the inputs that generated those embedding vectors are. A distance of 0 means that two vectors are completely aligned, and a distance of 1 implies that two vectors are orthogonal. If we wanted to treat emojis as text, we would want the name for an emoji to be relatively close to the tokenized emoji in the embedding space, but this is not always the case!&lt;/p&gt;
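&lt;p&gt;Concretely, the cosine distance is one minus the cosine similarity of two vectors. A quick NumPy check of the two extremes mentioned above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; 0 for aligned vectors, 1 for orthogonal ones."""
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_distance([1, 0], [2, 0]))  ## 0.0 (aligned)
print(cosine_distance([1, 0], [0, 3]))  ## 1.0 (orthogonal)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;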

&lt;p&gt;The utility below will compare an emoji and a list of text prompts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install fiftyone
from scipy.spatial.distance import cosine
import fiftyone.zoo as foz
model = foz.load_zoo_model("clip-vit-base32-torch")
def compare_emoji_to_texts(emoji, texts):
    emoji_embedding = model.embed_prompt(emoji)
    text_embeddings = model.embed_prompts(texts)
    for text, text_embedding in zip(texts, text_embeddings):
        print(f"Dist b/w {emoji} and {text}: {cosine(emoji_embedding, text_embedding):.4f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s an example, where according to CLIP, the encoding for the “birthday” emoji 🎂 is closer to “man” than “birthday”, closer to “dog” than “birthday present”, and closer to “car” than “candle”, “date”, or “holiday”:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;texts=​​["birthday", "birthday present", "cake", "candle", "car", "date", "dog", "holiday", "man"]
compare_emoji_to_texts("🎂", texts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dist b/w 🎂 and birthday: 0.1205
Dist b/w 🎂 and birthday present: 0.1385
Dist b/w 🎂 and cake: 0.1238
Dist b/w 🎂 and candle: 0.2030
Dist b/w 🎂 and car: 0.1610
Dist b/w 🎂 and date: 0.1921
Dist b/w 🎂 and dog: 0.1344
Dist b/w 🎂 and holiday: 0.1844
Dist b/w 🎂 and man: 0.0849
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sometimes, the emoji and its name (and similar concepts) are close together in the embedding space, but sometimes they are most certainly not.&lt;/p&gt;

&lt;p&gt;We can also go the other way and retrieve the emojis whose embeddings most closely match the embedding of an input text prompt. For instance, for the input “love”, we get the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo28kwmsslc7fs1o7r9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo28kwmsslc7fs1o7r9h.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, we can do way better than this!&lt;/p&gt;

&lt;h2&gt;Interpreting Emojis as Images&lt;/h2&gt;

&lt;p&gt;The high-resolution images of emojis that we generated using Real-ESRGAN provide an alternative pathway to searching through our emojis: treating emojis as images. We can use CLIP’s vision encoder to embed the images into the same vector space, and then query these image embeddings with our input text prompt.&lt;/p&gt;
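&lt;p&gt;Querying image embeddings with a text embedding is a nearest-neighbor lookup under cosine similarity. A minimal sketch with toy 2-D vectors (real CLIP embeddings are 512-dimensional):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

## toy stand-in: rows are emoji image embeddings, normalized to unit length
emoji_names = ["❤️", "🐶", "🚗"]
emoji_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
emoji_embs /= np.linalg.norm(emoji_embs, axis=1, keepdims=True)

def top_k(query_emb, k=2):
    """Return the k emoji names whose embeddings best match the query."""
    q = np.asarray(query_emb) / np.linalg.norm(query_emb)
    sims = emoji_embs @ q  ## cosine similarity, since rows are unit vectors
    return [emoji_names[i] for i in np.argsort(-sims)[:k]]

print(top_k([0.9, 0.1]))  ## ['❤️', '🚗']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;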

&lt;p&gt;For applications like cross-modal retrieval (or semantically searching images with text), CLIP typically works best when the image embeddings are compared to a text prompt that is the user’s query wrapped in the phrase “A photo of ”. As an example, the image embedding for a photo of a dog will be closer (in terms of the angle between the vectors) to the embedding of “A photo of a dog” than the embedding of the raw query “dog”.&lt;/p&gt;

&lt;p&gt;However, when I used this template, the results were underwhelming. For instance, here are the 25 top results for the query “A photo of a dog”:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla7vk21w3kagjqx8bv91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla7vk21w3kagjqx8bv91.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because emojis aren’t exactly photos, I decided to dig a little deeper and try out a few templating (or wrapping) strategies. To cover my bases, I tested five formats for text prompts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;emoji_name&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;A photo of a &lt;code&gt;&amp;lt;emoji_name&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;An emoji of &lt;code&gt;&amp;lt;emoji_name&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A photo of a &lt;code&gt;&amp;lt;emoji_name&amp;gt;&lt;/code&gt; emoji&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;&amp;lt;emoji_name&amp;gt;&lt;/code&gt; emoji&lt;/li&gt;
&lt;/ol&gt;
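&lt;p&gt;These five wrappings can be generated mechanically (a small sketch; &lt;code&gt;emoji_name&lt;/code&gt; stands in for the name column in the dataset):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def make_prompts(emoji_name):
    """Build the five prompt templates tested for a given emoji name."""
    return [
        emoji_name,
        f"A photo of a {emoji_name}",
        f"An emoji of {emoji_name}",
        f"A photo of a {emoji_name} emoji",
        f"A {emoji_name} emoji",
    ]

print(make_prompts("dog"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;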

&lt;p&gt;I generated embeddings for all 1816 emojis with each of these methods, and computed the CLIPScore (cosine similarity multiplied by 100) of these vectors with the corresponding image embedding vectors.&lt;/p&gt;
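&lt;p&gt;CLIPScore, as used here, is just the cosine similarity scaled by 100. A minimal NumPy version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def clip_score(u, v):
    """Cosine similarity between two embedding vectors, multiplied by 100."""
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 100 * sim

print(round(clip_score([1.0, 0.0], [1.0, 1.0]), 2))  ## 70.71
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;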

&lt;p&gt;Here were the aggregate results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Method        Min       Mean     Max
A             16.96     29.04    37.49
B             15.85     29.47    38.43
C             18.94     33.25    44.60
D             19.47     32.59    42.57
E             18.95     31.83    43.35
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From these statistics, I thought that the “An emoji of” descriptors were the best fit of the bunch, as they had the highest mean and max. But when I tried to use this, the results were again less than ideal. They seemed to favor faces (e.g. 😄😢🙃👦👧), to the detriment of other emojis like symbols, animals, and flags. When it came to semantic emoji searches, I found that entering the raw text tended to work best. In other words, the CLIP embedding of “dog” worked better than that of “A photo of a dog” or “An emoji of a dog”.&lt;/p&gt;

&lt;p&gt;There were a few takeaways from this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Overall image-text “alignment” isn’t necessarily important for semantic search&lt;/li&gt;
&lt;li&gt;The images of the emojis encode (to some degree) the fact that they are not photos&lt;/li&gt;
&lt;li&gt;The word “emoji” biases CLIP toward faces&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="4"&gt;Bridging the Modality Gap&lt;/h2&gt;

&lt;p&gt;By this point, I had come to the conclusion that treating emojis as just images or just text leaves a lot of rich information on the table. To build a robust semantic emoji search engine, I wanted to incorporate both textual and image information, and bridge the gap between these two modalities.&lt;/p&gt;

&lt;p&gt;I tried generating descriptions of the emoji images using Adept’s multimodal &lt;a href="https://www.adept.ai/blog/fuyu-8b" rel="noopener noreferrer"&gt;Fuyu-8b&lt;/a&gt; model, but these descriptions proved far too detailed; I tried using other CLIP-style models like &lt;a href="https://github.com/facebookresearch/metaclip" rel="noopener noreferrer"&gt;MetaCLIP&lt;/a&gt;, but saw the same behavior as in CLIP; I even tried using &lt;a href="https://openai.com/research/gpt-4v-system-card" rel="noopener noreferrer"&gt;GPT-4V&lt;/a&gt; to generate captions for the emoji images, but was cut off by OpenAI because the rate limit for the model is 100 queries per day.&lt;/p&gt;

&lt;p&gt;In the end, I was able to pass the emoji unicode characters into the base GPT-4 API with the prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;QUERY_TEXT = """
Your task is to write a brief description of the emoji {emoji}, in the format 'A photo of a ...'.  For example, 'A photo of a dog'. Do not include the emoji name or unicode in your description. Do not include the skin tone of the emoji. Do not include the word yellow in your response.  You may include the word 'emoji' in your description, but it is not necessary. Your description should be a single phrase, of no more than 10 words.
""" 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After post-processing these captions, I removed the “A photo of” prefix and used these descriptions in the semantic search pipeline.&lt;/p&gt;

&lt;p&gt;The emoji search engine works as follows, taking in an input query:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate a set of 100 candidate emojis (out of 1816) with an image similarity search that compares the image embeddings to the query embedding. Save this ordering, &lt;em&gt;clip_image_ordering&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Order these candidate emojis by the similarity of the CLIP embeddings of the emoji names to the query’s embedding (&lt;em&gt;clip_name_ordering&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;Using a &lt;a href="https://ai.plainenglish.io/decoding-sentence-representations-a-comprehensive-guide-to-cross-encoders-and-bi-encoders-67c4ac16e35f" rel="noopener noreferrer"&gt;cross-encoder&lt;/a&gt;, order the emojis based on the similarity of their name (&lt;em&gt;cross_encoder_name_ordering&lt;/em&gt;) and their description generated by GPT-4 (&lt;em&gt;cross_encoder_description_ordering&lt;/em&gt;) to the query.&lt;/li&gt;
&lt;li&gt;Combine all four orderings using &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html" rel="noopener noreferrer"&gt;reciprocal rank fusion&lt;/a&gt;, and return the top results!&lt;/li&gt;
&lt;/ol&gt;
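&lt;p&gt;Reciprocal rank fusion scores each candidate by summing 1/(k + rank) across the orderings, where k is a smoothing constant (60 is a common default). A minimal sketch, not the plugin’s exact implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(item) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

## e.g. fuse a (hypothetical) image ordering with a name ordering
clip_image_ordering = ["🐶", "🐕", "🦮"]
clip_name_ordering = ["🐶", "🐕", "🐺"]
print(reciprocal_rank_fusion([clip_image_ordering, clip_name_ordering]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;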

&lt;p&gt;The resulting search engine isn’t perfect, but it does a decent job at incorporating textual and visual information. Because using a cross-encoder is more computationally expensive (and higher latency), this is reserved for the pared-down set of candidates. I use the &lt;code&gt;distilroberta-base&lt;/code&gt; checkpoint with the &lt;code&gt;CrossEncoder&lt;/code&gt; class from the &lt;a href="https://www.sbert.net/index.html" rel="noopener noreferrer"&gt;Sentence Transformers&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;When all of these steps are combined, this is the result:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/5vHi0c5Z89w"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Again, it isn’t perfect. But it’s not bad!&lt;/p&gt;

&lt;h2 id="5"&gt;Using the Emoji Search Engine&lt;/h2&gt;

&lt;p&gt;There are three ways to use this emoji search engine: hosted (free), locally via UI (open source), or locally via command line (also open source). All three options are quite easy!&lt;/p&gt;

&lt;h2&gt;Online&lt;/h2&gt;

&lt;p&gt;Head over to &lt;a href="https://try.fiftyone.ai/datasets/emojis/samples" rel="noopener noreferrer"&gt;try.fiftyone.ai/datasets/emojis&lt;/a&gt;, sign in (it’s free), and click on the emoji button in the menu above the grid of images. That’s it!&lt;/p&gt;

&lt;h2&gt;Locally via the UI&lt;/h2&gt;

&lt;p&gt;If you want to perform emoji searches locally with the same visual interface, you can do so with the &lt;a href="https://github.com/jacobmarks/emoji-search-plugin" rel="noopener noreferrer"&gt;Emoji Search plugin&lt;/a&gt; for &lt;a href="https://github.com/voxel51/fiftyone" rel="noopener noreferrer"&gt;FiftyOne&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First, install FiftyOne:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install fiftyone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then download the Emoji Search plugin and install its requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fiftyone plugins download https://github.com/jacobmarks/emoji-search-plugin
fiftyone plugins requirements @jacobmarks/emoji_search --install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Launch the FiftyOne App:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fiftyone app launch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Click on the “browse operations” text, search for “emoji”, and click on the entry “Create Emoji Dataset”. This will download the high-resolution images of the emojis, along with embeddings and all other relevant data. At the top left of the app, click in the “Select dataset” box and select “Emojis”. Now you should see the same UI as in the hosted version.&lt;/p&gt;

&lt;h2&gt;Locally via the CLI&lt;/h2&gt;

&lt;p&gt;Finally, you can search via the command line using the &lt;a href="https://github.com/jacobmarks/emoji_search" rel="noopener noreferrer"&gt;Emoji Search&lt;/a&gt; Python CLI library. Install the package from the GitHub repository with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install git+https://github.com/jacobmarks/emoji_search.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you can start searching using the &lt;code&gt;emoji-search&lt;/code&gt; command, followed by the text query (with or without quotation marks).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;emoji-search beautiful sunset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------+-----------------+---------+
| Emoji |       Name       Unicode  |
+-------+-----------------+---------+
|  🌞   |   sun with face | U+1F31E |
|  🌇   |      sunset     | U+1F307 |
|  🌅   |      sunrise    | U+1F305 |
|  🔆   |   bright button | U+1F506 |
|  🌆   |cityscape at dusk| U+1F306 |
+-------+-----------------+---------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first search you perform will download embeddings to your device if necessary. All three versions support copying an emoji to clipboard with &lt;a href="https://pypi.org/project/pyperclip/" rel="noopener noreferrer"&gt;pyperclip&lt;/a&gt;. In the UI, click on the image for an emoji, and you’ll see a copy button appear in the menu. In the CLI, pass the &lt;code&gt;-c&lt;/code&gt; argument to copy the top result to clipboard.&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Emojis might seem like a silly subject to obsess over. And in practice, the utility of a semantic emoji search engine over lexical emoji search may be somewhat limited. The real value in this endeavor lies in understanding the boundaries and overlaps between two modalities we traditionally think of as distinct: images and text. Emojis sit squarely in this intersection, and as such they let us probe the strengths and weaknesses, the capabilities and limitations, of today’s multimodal models.&lt;/p&gt;

&lt;p&gt;The Semantic Emoji Search Engine I ended up building is far from perfect. Frankly, emojis are subjective, connoting different things to different people, and that nuance is impossible to bottle up precisely. But going back to the motivating example, when I type in “an audio player”, I get some solid results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm7yykfsrcd01o3z785v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm7yykfsrcd01o3z785v.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ll end with a quote from &lt;a href="https://en.wikipedia.org/wiki/Nancy_Gibbs" rel="noopener noreferrer"&gt;Nancy Gibbs&lt;/a&gt;, a Professor at the Harvard Kennedy School and former managing editor for TIME magazine:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What makes emojis special is the fact that [they have] helped millions express themselves better than even the wide array of words in the Oxford dictionary [could].&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Nancy Gibbs&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>computervision</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Why 2023 was the most exciting year in computer vision history (so far)</title>
      <dc:creator>Jimmy Guerrero</dc:creator>
      <pubDate>Wed, 03 Jan 2024 20:47:37 +0000</pubDate>
      <link>https://forem.com/voxel51/why-2023-was-the-most-exciting-year-in-computer-vision-history-so-far-5ae0</link>
      <guid>https://forem.com/voxel51/why-2023-was-the-most-exciting-year-in-computer-vision-history-so-far-5ae0</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Author:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/jacob-marks/"&gt;Jacob Marks&lt;/a&gt; - MLE &amp;amp; Developer Evangelist at Voxel51&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;The 10 developments that reshaped computer vision&lt;/h2&gt;

&lt;p&gt;2023 was the year of chatbots and large language models (LLMs). From GPT-4 to Mixtral, to Llamas, Vicunas, Dolphins, and Orcas, it seemed like every day saw a new state-of-the-art model on some benchmark. At the same time, every week brought a new breakthrough in prompting, fine-tuning, quantizing, or serving LLMs. There was so much chatter about chatbots that it was hard to keep track of everything!&lt;/p&gt;

&lt;p&gt;The LLM-mania was so intense and the headline-grabbing so severe that it dominated the public discourse on AI. But in reality, machine learning research and applications saw progress across many modalities, from images to audio!&lt;/p&gt;

&lt;p&gt;Computer vision had a banner year in 2023. From new foundation models to accurate real-time detection, there’s far too much to fit in a single blog post. Nevertheless, I’ve selected the ten developments that I believe best capture computer vision’s progress. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;YOLO Is Reborn: NextGen Object Detection&lt;/li&gt;
&lt;li&gt;SAM: The Foundation Model for Segmentation&lt;/li&gt;
&lt;li&gt;DINOv2: SOTA Models from Self-supervised Learning&lt;/li&gt;
&lt;li&gt;Gaussian Splats Give NeRFs a Run for their Money&lt;/li&gt;
&lt;li&gt;T2I Models Turn the Corner&lt;/li&gt;
&lt;li&gt;LoRA: Flexible and Affordable Fine-tuning&lt;/li&gt;
&lt;li&gt;Ego-Exo4D: The Foundation Dataset for Video Perception&lt;/li&gt;
&lt;li&gt;T2V Models Arrive&lt;/li&gt;
&lt;li&gt;Multimodal LLMs&lt;/li&gt;
&lt;li&gt;LLM-aided visual reasoning&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="1"&gt;YOLO Is Reborn: NextGen Object Detection&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yJvXHinx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3bu0tf5czyn8wf5e9ybp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yJvXHinx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3bu0tf5czyn8wf5e9ybp.png" alt="Image description" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;YOLO-NAS predictions for an image from the MS COCO dataset, visualized in the FiftyOne App. Image courtesy of the author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the greater part of a decade, the &lt;a href="https://arxiv.org/abs/1506.02640"&gt;You Only Look Once&lt;/a&gt; (YOLO) family of models has been an incredibly popular choice for near real-time object detection tasks. Prior to 2023, YOLO had already gone through many iterations, with popular variants including &lt;a href="https://github.com/ultralytics/yolov5"&gt;YOLOv5&lt;/a&gt; and &lt;a href="https://github.com/ultralytics/ultralytics"&gt;YOLOv8&lt;/a&gt; (December 2022) from Ultralytics, &lt;a href="https://github.com/meituan/YOLOv6"&gt;YOLOv6&lt;/a&gt; from Meituan, and &lt;a href="https://github.com/WongKinYiu/yolov7"&gt;YOLOv7&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In May, Deci AI released YOLO-NAS, a YOLO-style model created with the help of &lt;a href="https://en.wikipedia.org/wiki/Neural_architecture_search"&gt;Neural Architecture Search&lt;/a&gt; (NAS). The model is faster and significantly more accurate than previous YOLO models, and has strong support for quantization. The smallest, quantized variant achieves a mean average precision (mAP) of 47.03 at just &lt;a href="https://docs.ultralytics.com/models/yolo-nas/"&gt;2.36ms latency!&lt;/a&gt; YOLO-NAS also forms the basis for Deci’s state-of-the-art (SOTA) pose estimation model, &lt;a href="https://github.com/Deci-AI/super-gradients/blob/master/YOLONAS-POSE.md"&gt;YOLO-NAS-Pose&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.voxel51.com/tutorials/yolov8.html"&gt;Tutorial on Fine-tuning YOLOv8&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/voxel51/state-of-the-art-object-detection-with-yolo-nas-fiftyone-f1530826b28e"&gt;Blog post on SOTA object detection with YOLO-NAS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2309.11331.pdf"&gt;Gold-YOLO&lt;/a&gt; (&lt;a href="https://www.linkedin.com/posts/jacob-marks_yolo-computervision-objectdetection-activity-7112145772095123456-CU9o?utm_source=share&amp;amp;utm_medium=member_desktop"&gt;TL;DR: version&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://deci.ai/blog/decicoder-efficient-and-accurate-code-generation-llm/"&gt;DeciCoder&lt;/a&gt; and &lt;a href="https://huggingface.co/Deci/DeciLM-7B"&gt;DeciLM-7B&lt;/a&gt; from Deci AI&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="2"&gt;Segment Anything: The Foundation Model for Segmentation&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PxEN529a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fv8xqay8lzyi43re715r.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PxEN529a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fv8xqay8lzyi43re715r.jpeg" alt="Image description" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Auto-segmentation with Meta AI’s Segment Anything Model. Image originally from SAM GitHub repo.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/facebookresearch/segment-anything"&gt;Segment Anything Model&lt;/a&gt; (SAM) from Meta AI Research is arguably the first foundation model for segmentation tasks in computer vision. In the past, if you wanted to generate high-quality pixel-level classification predictions for your data, you would need to train a segmentation model from scratch. &lt;/p&gt;

&lt;p&gt;SAM has completely changed the game. Now you can segment everything in an image, or instance-segment individual objects, by prompting the model with a bounding box or positive/negative keypoints. The GitHub repository has 40k stars and counting!&lt;/p&gt;

&lt;p&gt;SAM and the &lt;a href="https://ai.meta.com/datasets/segment-anything/"&gt;1.1 billion mask dataset&lt;/a&gt; co-developed with the model have already spawned tons of incredible related projects, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller, derivative segmentation models: &lt;a href="https://github.com/CASIA-IVA-Lab/FastSAM"&gt;FastSAM&lt;/a&gt;, &lt;a href="https://github.com/ChaoningZhang/MobileSAM"&gt;MobileSAM&lt;/a&gt;, &lt;a href="https://github.com/NVIDIA-AI-IOT/nanosam"&gt;NanoSAM&lt;/a&gt;, &lt;a href="https://github.com/chongzhou96/EdgeSAM"&gt;EdgeSAM&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Composite applications: &lt;a href="https://github.com/xinyu1205/recognize-anything"&gt;Recognize Anything&lt;/a&gt;, &lt;a href="https://github.com/geekyutao/Inpaint-Anything"&gt;Inpaint Anything&lt;/a&gt;, &lt;a href="https://github.com/gaomingqi/Track-Anything"&gt;Track-Anything&lt;/a&gt;, &lt;a href="https://github.com/IDEA-Research/Grounded-Segment-Anything"&gt;Grounded-Segment-Anything&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;3D segmentation applications: &lt;a href="https://github.com/Jumpat/SegmentAnythingin3D"&gt;Segment Anything in 3D with NeRFs&lt;/a&gt;, &lt;a href="https://github.com/Pointcept/SegmentAnything3D"&gt;Segment Anything 3D&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;And medical segmentation models: &lt;a href="https://github.com/bowang-lab/MedSAM"&gt;MedSAM&lt;/a&gt;, &lt;a href="https://github.com/uni-medical/SAM-Med3D"&gt;SAM-Med3D&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 For specialized applications and deployed solutions, you will likely still want to train or fine-tune your own model!&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/see-what-you-sam-4eea9ad9a5de"&gt;See What You Segment: SAM Blog Post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/voxel51/facet-a-benchmark-dataset-for-fairness-in-computer-vision-2260c82e1662"&gt;FACET benchmark blog post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once"&gt;SEEM: Segment Everything Everywhere All at Once&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/SysCV/sam-hq"&gt;SAM-HQ: Segment Anything in High Quality&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="3"&gt;DINOv2: SOTA Models from Self-supervised Learning&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LhanmIQR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s6d198hfadevm8a8dk2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LhanmIQR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s6d198hfadevm8a8dk2n.png" alt="Image description" width="600" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example of depth estimation from a single image with DINOv2. Image originally from &lt;a href="https://dinov2.metademolab.com/"&gt;DINOv2 demo site&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A standard technique in natural language processing applications is &lt;a href="https://en.wikipedia.org/wiki/Self-supervised_learning"&gt;self-supervised learning&lt;/a&gt;, wherein the model is trained on signals generated from the input data itself, rather than pre-specified annotations. In LLM pretraining, for instance, the model can be trained to predict which token comes next in a text sequence. Self-supervised approaches like this can be helpful in reducing reliance on human-annotated data.&lt;/p&gt;
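&lt;p&gt;The next-token setup makes “signals generated from the input data itself” concrete: training pairs are manufactured directly from a raw token stream, with no human labels involved. A minimal sketch (the context size is arbitrary):&lt;/p&gt;

```python
def next_token_pairs(tokens, context_size=3):
    """Build (context, target) training pairs from a raw token stream.
    The labels come from the data itself, not from annotators."""
    pairs = []
    for i in range(context_size, len(tokens)):
        pairs.append((tokens[i - context_size:i], tokens[i]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)
# First pair: (["the", "cat", "sat"], "on")
```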

&lt;p&gt;In the context of computer vision, approaches like contrastive learning (see CLIP) rely heavily on human-provided captions and metadata, restricting the model’s understanding to the quality of captions and the diversity of annotated images. &lt;a href="https://ai.meta.com/blog/dino-v2-computer-vision-self-supervised-learning/"&gt;DINOv2&lt;/a&gt; overcomes these limitations by applying a self-supervised approach to vision tasks.&lt;/p&gt;

&lt;p&gt;When pre-trained on a diverse set of 142M images and combined with basic task-specific heads, Meta’s DINOv2 backbone achieves state-of-the-art performance across a variety of vision tasks, from depth estimation to semantic segmentation. More to the point, the DINOv2 approach provides a template for anyone to train a high-quality model with few labeled images!&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dinov2.metademolab.com/demos"&gt;DINOv2 demo website
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/facebook/dinov2-large"&gt;DINOv2 checkpoint on Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="4"&gt;Gaussian Splatting&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RPmgbAWa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yveyafape3le14xwn0d7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RPmgbAWa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yveyafape3le14xwn0d7.png" alt="Image description" width="800" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Comparison of 3D Gaussian Splatting technique (labeled “Ours”) with other competitive techniques for view synthesis, demonstrating advantages in training time, latency, and accuracy. Image originally from 3D Gaussian Splatting GitHub repo.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the first half of 2023, &lt;a href="https://www.matthewtancik.com/nerf"&gt;Neural Radiance Fields&lt;/a&gt; (NeRFs) dominated discussions on &lt;a href="https://paperswithcode.com/task/novel-view-synthesis"&gt;novel view synthesis&lt;/a&gt;. As we &lt;a href="https://voxel51.com/blog/cvpr-2023-and-the-state-of-computer-vision/#whats-trending"&gt;documented in May&lt;/a&gt;, in advance of CVPR, the term “radiance” appeared 80% more often in CVPR 2023 paper titles than in CVPR 2022 paper titles.&lt;/p&gt;

&lt;p&gt;The second half of 2023 has seen the emergence of an alternative method called &lt;a href="https://huggingface.co/blog/gaussian-splatting"&gt;Gaussian Splatting&lt;/a&gt;, which represents scenes with, well, 3-dimensional (or higher) Gaussians. The rasterization technique achieves SOTA visual quality and real-time (&amp;gt;100 fps) rendering. Gaussian splatting also has additional benefits compared to NeRFs, including much faster training.&lt;/p&gt;
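&lt;p&gt;The heart of the rasterization idea is simple even though the full method is not: project the Gaussians, sort them by depth, and alpha-composite them front to back at each pixel. A single-pixel, grayscale sketch with isotropic 2D Gaussians (all numbers illustrative; the real method uses anisotropic, learned Gaussians and a tile-based GPU rasterizer):&lt;/p&gt;

```python
import math

def gaussian_alpha(px, py, gx, gy, sigma, opacity):
    """Opacity contribution of an isotropic 2D Gaussian at a pixel."""
    d2 = (px - gx) ** 2 + (py - gy) ** 2
    return opacity * math.exp(-d2 / (2 * sigma ** 2))

def composite_pixel(px, py, gaussians):
    """Front-to-back alpha compositing of depth-sorted splats.
    Each splat: (depth, gx, gy, sigma, opacity, gray_value)."""
    color, transmittance = 0.0, 1.0
    for _, gx, gy, sigma, opacity, c in sorted(gaussians):
        a = gaussian_alpha(px, py, gx, gy, sigma, opacity)
        color += transmittance * a * c
        transmittance *= 1.0 - a
        if transmittance < 1e-4:  # early exit once the pixel saturates
            break
    return color

# A near, nearly-opaque white splat hides a far black one behind it
splats = [(2.0, 0.0, 0.0, 1.0, 0.99, 1.0),   # near, white
          (5.0, 0.0, 0.0, 1.0, 0.99, 0.0)]   # far, black
c = composite_pixel(0.0, 0.0, splats)
```

&lt;p&gt;Terminating early once a pixel is effectively opaque is one of the tricks that keeps rendering fast.&lt;/p&gt;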

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/"&gt;3D Gaussian Splatting&lt;/a&gt; (original project)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dynamic3dgaussians.github.io/"&gt;Dynamic 3D Gaussians&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://guanjunwu.github.io/4dgs/index.html"&gt;4D Gaussian Splatting&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="5"&gt;Text-to-Image Models Turn the Corner&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5aG-Bl61--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/65fkaygwzyxr77d3coyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5aG-Bl61--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/65fkaygwzyxr77d3coyy.png" alt="Image description" width="800" height="957"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Comparison of image generations across different Midjourney versions for the same prompt: “dungeons and dragons, female knight, of the rolling plains, full body, dark azure, victorian genre paintings, serene face, realistic depiction of light, golden light –seed 5”. Image courtesy of &lt;a href="https://aituts.com/midjourney-versions/"&gt;aituts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In 2022, DALL-E 2 &lt;a href="https://time.com/collection/best-inventions-2022/6225486/dall-e-2/"&gt;was named one of TIME Magazine’s 100 inventions of the year&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Midjourney#:~:text=The%20Midjourney%20image%20generation%20platform%20first%20entered%20open%20beta%20on%20July%2012%2C%202022"&gt;Midjourney launched their v1&lt;/a&gt;, and &lt;a href="https://en.wikipedia.org/wiki/Stable_Diffusion#:~:text=to%2Dimage%20model-,released%20in%202022,-based%20on%20diffusion"&gt;Stable Diffusion was released&lt;/a&gt;, paving the way for &lt;a href="https://en.wikipedia.org/wiki/Text-to-image_model"&gt;text-to-image&lt;/a&gt; (T2I) models. The promise was there, but results were mixed — hands with six fingers, undesired spatial compositions, and unsatisfying aesthetic characteristics were all common. What’s more, image generation inference times could be substantial, slowing experimentation.&lt;/p&gt;

&lt;p&gt;This year, T2I models have taken massive leaps forward. &lt;a href="https://aituts.com/midjourney-versions/"&gt;Midjourney creations&lt;/a&gt; have gone from somewhat recognizable to breathtakingly lifelike; &lt;a href="https://openai.com/dall-e-3"&gt;DALL-E 3&lt;/a&gt; refines your text prompt for you; &lt;a href="https://stablediffusionxl.com/"&gt;Stable Diffusion XL&lt;/a&gt; can generate realistic faces and legible text; and &lt;a href="https://deepmind.google/technologies/imagen-2/"&gt;Imagen 2&lt;/a&gt; allows you to add invisible watermarks into your AI generated images.&lt;/p&gt;

&lt;p&gt;I want to call attention to two tranches of innovation worth keeping an eye on as we move into 2024:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The push toward real-time T2I generation: &lt;a href="https://github.com/luosiallen/latent-consistency-model"&gt;latent consistency models&lt;/a&gt; (LCM) and &lt;a href="https://stability.ai/news/stability-ai-sdxl-turbo"&gt;SDXL Turbo’s Adversarial Diffusion Distillation &lt;/a&gt;(ADD)&lt;/li&gt;
&lt;li&gt;Efforts to improve alignment of T2I generated images with human preferences: &lt;a href="https://github.com/THUDM/ImageReward"&gt;ImageReward&lt;/a&gt; and &lt;a href="https://github.com/yuvalkirstain/PickScore"&gt;Pick-a-Pic&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In spite of all of these improvements, T2I models are still far from perfect. A team of researchers recently created the first holistic evaluation benchmark for T2I models (&lt;a href="https://medium.com/voxel51/neurips-2023-survival-guide-2f957d5b07c9#b237"&gt;HEIM&lt;/a&gt;), and found that no single model excels across the board!&lt;/p&gt;

&lt;h3&gt;ControlNet&lt;/h3&gt;

&lt;p&gt;The dominant generative modeling technique underlying T2I models in 2023 is the &lt;a href="https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/"&gt;diffusion model&lt;/a&gt;. In the context of image generation, a diffusion model is essentially tasked with iteratively turning a noisy initial image into a coherent, lower-noise image. This technique is incredibly powerful, but in a vacuum, exerting control over the final generated image via just a text prompt can be imprecise and unwieldy. &lt;/p&gt;
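&lt;p&gt;Concretely, the forward (“noising”) half of the process that a diffusion model learns to invert can be written in a few lines. This is a scalar sketch; real models operate on image tensors, with a learned network predicting the noise to remove at each step:&lt;/p&gt;

```python
import math
import random

def forward_diffusion(x0, alphas_cumprod, seed=0):
    """Forward process: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps.
    As a_t shrinks toward 0, the signal is destroyed and x_t
    approaches pure Gaussian noise; generation runs this in reverse."""
    rng = random.Random(seed)
    trajectory = []
    for a_t in alphas_cumprod:
        eps = rng.gauss(0.0, 1.0)  # fresh standard Gaussian noise
        trajectory.append(math.sqrt(a_t) * x0 + math.sqrt(1 - a_t) * eps)
    return trajectory

# One "pixel" progressively noised over four timesteps
traj = forward_diffusion(x0=1.0, alphas_cumprod=[0.99, 0.9, 0.5, 0.01])
```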

&lt;p&gt;&lt;a href="https://github.com/lllyasviel/ControlNet"&gt;ControlNet&lt;/a&gt; enables a far greater degree of control over composition and style of the output of T2I diffusion models. With ControlNet, you can expressly control the contours of objects in the generated image from &lt;a href="https://github.com/lllyasviel/ControlNet?tab=readme-ov-file#controlnet-with-canny-edge"&gt;Canny edge maps&lt;/a&gt; or &lt;a href="https://github.com/lllyasviel/ControlNet?tab=readme-ov-file#controlnet-with-user-scribbles"&gt;scribbles&lt;/a&gt;, the &lt;a href="https://github.com/lllyasviel/ControlNet?tab=readme-ov-file#controlnet-with-human-pose"&gt;pose&lt;/a&gt; of a generated person, and so much more. If you’ve seen some of the &lt;a href="https://arstechnica.com/information-technology/2023/06/redditor-creates-working-anime-qr-codes-using-stable-diffusion/"&gt;stunning generative artwork&lt;/a&gt; that also functions as a working QR code, &lt;a href="https://huggingface.co/DionTimmer/controlnet_qrcode"&gt;ControlNet is the technique behind it&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/jacobmarks/text-to-image"&gt;Plugin to Add T2I images directly to your dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://try.fiftyone.ai/datasets/llava-instruct/samples"&gt;Browse the (cleaned) ImageRewardDB dataset online&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/voxel51/conquering-controlnet-voxel51-c665dc8dd358"&gt;Conquering ControlNet: Harness the Power of Diffusion Models with Higher-Quality Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/gligen/GLIGEN"&gt;GLIGEN: Open-Set Grounded Text-to-Image Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/damo-vilab/composer"&gt;Composer: Creative and Controllable Image Synthesis with Composable Conditions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="6"&gt;LoRA&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BIZXGR6s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gxb1kg3jl8tbdttnxuk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BIZXGR6s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gxb1kg3jl8tbdttnxuk9.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image generated in the style of emojis with an &lt;a href="https://huggingface.co/SvenN/sdxl-emoji"&gt;SDXL LoRA emoji fine-tune&lt;/a&gt;, with the text prompt: “A TOK emoji of a tiger face, white background”. Image originally from &lt;a href="https://replicate.com/fofr/sdxl-emoji/examples"&gt;Replicate&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Originally developed for fine-tuning LLMs, &lt;a href="https://arxiv.org/abs/2106.09685"&gt;LoRA&lt;/a&gt; is a technique for &lt;a href="https://huggingface.co/docs/peft/index"&gt;parameter-efficient fine-tuning&lt;/a&gt; which makes adaptation of existing models easy, affordable, and accessible. The method works by inserting small, low-rank matrices in the base model, and learning a good configuration for these weights over the fine-tuning data while keeping the weights of the original model fixed. In effect, LoRA adapts the original model for a new purpose — all while adding just megabytes to GB-sized models!&lt;/p&gt;
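&lt;p&gt;A toy, pure-Python illustration of the idea (not a real implementation; in practice LoRA is applied to transformer weight matrices via libraries like Hugging Face PEFT). The frozen weight W is only ever read, while training updates just the small factors A and B:&lt;/p&gt;

```python
def matmul(A, B):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x @ (W + alpha * B @ A): the frozen weight W plus a
    low-rank update B @ A that is learned during fine-tuning."""
    delta = matmul(B, A)  # (d x r) @ (r x k) -> d x k update
    W_eff = [[W[i][j] + alpha * delta[i][j] for j in range(len(W[0]))]
             for i in range(len(W))]
    return matmul(x, W_eff)

d = k = 4
r = 1
W = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]  # frozen
B = [[0.1] for _ in range(d)]   # d x r trainable factor
A = [[1.0, 0.0, 0.0, 0.0]]      # r x k trainable factor
y = lora_forward([[1.0, 2.0, 3.0, 4.0]], W, A, B)

full_params = d * k        # fine-tuning W directly: 16 numbers
lora_params = r * (d + k)  # fine-tuning A and B:    8 numbers
```

&lt;p&gt;For a d x k layer, a rank-r adapter trains only r(d + k) numbers instead of dk, which is where the megabytes-instead-of-gigabytes savings comes from.&lt;/p&gt;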

&lt;p&gt;The predominant application of LoRA in computer vision has been for fine-tuning diffusion models to match certain styles, from &lt;a href="https://huggingface.co/nerijs/pixel-art-xl"&gt;pixel art&lt;/a&gt; to &lt;a href="https://huggingface.co/Fictiverse/Voxel_XL_Lora"&gt;voxels&lt;/a&gt;. There’s even a gallery of LoRA fine-tunes on &lt;a href="https://huggingface.co/spaces/huggingface-projects/diffusers-gallery"&gt;Hugging Face&lt;/a&gt;! LoRA models have also been used to bring the reduced inference steps of latent consistency models to stable diffusion (&lt;a href="https://huggingface.co/blog/lcm_lora"&gt;LCM-LoRA&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;But LoRA models are applicable in other vision contexts as well, from &lt;a href="https://huggingface.co/docs/peft/task_guides/semantic_segmentation_lora"&gt;semantic segmentation&lt;/a&gt; to &lt;a href="https://huggingface.co/docs/peft/task_guides/dreambooth_lora"&gt;DreamBooth fine-tuning&lt;/a&gt;. One particularly interesting application of LoRA is &lt;a href="https://dreamsim-nights.github.io/"&gt;DreamSim&lt;/a&gt;, where the technique is used to learn a SOTA human visual similarity metric.&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/diffusers/training/lora"&gt;LoRA in Hugging Face Diffusers library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rohitgandikota/sliders"&gt;Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/voxel51/teaching-androids-to-dream-of-sheep-18d72f44f2b"&gt;DreamSim blog: learning perceptual similarity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2311.03285v2"&gt;S-LoRA: Serving Thousands of Concurrent LoRA Adapters&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="7"&gt;Ego-Exo4D: The Foundation Dataset for Video Perception&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XnyxVkOf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wa48jkpja66wu0bhudwv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XnyxVkOf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wa48jkpja66wu0bhudwv.gif" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Video footage from egocentric and exocentric vantage points from the &lt;a href="https://ai.meta.com/blog/ego-exo4d-video-learning-perception/"&gt;Ego-Exo4D&lt;/a&gt; dataset. Video originally from &lt;a href="https://arxiv.org/abs/2311.03285v2"&gt;Ego-Exo4D project page&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Over the past two years, researchers across Meta AI and 15 universities have worked together to collect the largest and highest-quality dataset to date for video learning and multimodal perception. The resulting Ego-Exo4D dataset contains 1,400 hours of footage of 800+ participants performing skill-based human activities, from cooking to dancing, and has the potential to impact how both humans and robots learn and acquire skills.&lt;/p&gt;

&lt;p&gt;For each scene, the dataset contains synchronized video footage from the first-person (egocentric) perspective, captured with Meta’s &lt;a href="https://www.projectaria.com/glasses/"&gt;Aria glasses&lt;/a&gt;, and third-person (exocentric) perspective. This video data is augmented with first-person narrations, third-person play-by-plays, and annotations for tasks like 3D body and hand pose estimation. In conjunction with the dataset, Meta provides a benchmark suite, and plans to host a benchmark challenge in 2024.&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.projectaria.com/"&gt;Project Aria&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ego-exo4d-data.org/paper/ego-exo4d.pdf"&gt;Ego-Exo4D Paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://try.fiftyone.ai/datasets/egoobjects-val/samples"&gt;Browse Meta’s related EgoObjects dataset online&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="8"&gt;T2V&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hcFwCREU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q3tuciln2yfkbr4y5xwn.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hcFwCREU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q3tuciln2yfkbr4y5xwn.gif" alt="Image description" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Video generated by Emu Video for prompt “A hamster wearing virtual reality headsets is a dj in a disco“. Video originally from &lt;a href="https://emu-video.metademolab.com/"&gt;Emu Video Project Page&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If generating images from text is hard, generating high-quality videos from text verges on impossible. Or at least that’s what many people thought heading into 2023. Over the past twelve months, however, the question has gone from an “if” to a “when”. &lt;/p&gt;

&lt;p&gt;AI Creativity Tools powerhouse &lt;a href="https://runwayml.com/"&gt;Runway&lt;/a&gt; has been leading the charge, releasing both &lt;a href="https://research.runwayml.com/gen1"&gt;Gen-1&lt;/a&gt; and &lt;a href="https://research.runwayml.com/gen2"&gt;Gen-2&lt;/a&gt; of its text-to-video (T2V) model, as well as tools for &lt;a href="https://runwayml.com/ai-magic-tools/frame-interpolation/"&gt;frame interpolation&lt;/a&gt; and &lt;a href="https://help.runwayml.com/hc/en-us/articles/21423169912595-Gen-2-Motion-Brush"&gt;generating motion from masked regions&lt;/a&gt;. But Runway is far from the only player in the T2V game: in November, &lt;a href="https://pika.art/"&gt;Pika Labs&lt;/a&gt; announced their “idea-to-video” platform and &lt;a href="https://techcrunch.com/2023/11/28/pika-labs-which-is-building-ai-tools-to-generate-and-edit-videos-raises-55m/"&gt;a $55M funding round&lt;/a&gt;, and Meta announced their SOTA model &lt;a href="https://emu-video.metademolab.com/"&gt;Emu Video&lt;/a&gt;, which splits T2V tasks into (i) text-conditioned image generation, and (ii) video generation conditioned on image and text prompt.&lt;/p&gt;
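&lt;p&gt;Emu Video’s two-stage factorization can be sketched as a tiny pipeline. The functions below are illustrative stubs standing in for the actual diffusion models, not Meta’s API:&lt;/p&gt;

```python
# Toy sketch of Emu Video's factorized text-to-video recipe:
# (i) generate an image from the text prompt, then
# (ii) generate video conditioned on that image plus the prompt.
# Both stage functions are stand-ins for real diffusion models.

def text_to_image(prompt):
    # Stage (i): text-conditioned image generation (stubbed).
    return {"kind": "image", "prompt": prompt}

def image_and_text_to_video(image, prompt, num_frames=16):
    # Stage (ii): video generation conditioned on image and text (stubbed).
    return {"kind": "video", "first_frame": image,
            "prompt": prompt, "num_frames": num_frames}

def generate_video(prompt):
    image = text_to_image(prompt)
    return image_and_text_to_video(image, prompt)

clip = generate_video("A hamster DJ in a disco")
```

&lt;p&gt;The appeal of this factorization is that each stage is a simpler, better-understood conditional generation problem than direct text-to-video.&lt;/p&gt;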

&lt;p&gt;It is also worth mentioning a few open-source T2V models introduced in 2023: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://modelscope.cn/models/damo/text-to-video-synthesis/summary"&gt;ModelScope’s Text-to-Video&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Picsart-AI-Research/Text2Video-Zero?tab=readme-ov-file"&gt;Text2Video-Zero&lt;/a&gt; from PicsArt&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/AILab-CVC/VideoCrafter"&gt;VideoCrafter1: Open Diffusion Models for High-Quality Video Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://vvictoryuki.github.io/animatezero.github.io/"&gt;AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the quality of the videos these open-source models generate still lags behind that of their commercial counterparts, they form a solid foundation for open-source T2V efforts to come!&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/diffusers/api/pipelines/text_to_video"&gt;Text-to-video model in Hugging Face Diffusers library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/diffusers/api/pipelines/text_to_video_zero"&gt;Text2Video-Zero model in Hugging Face Diffusers library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.synthesia.io/main"&gt;Synthesia: T2V platform for avatars&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="9"&gt;Multimodal LLMs&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yPDwXzqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eonb583o68cqws9pxrpd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yPDwXzqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eonb583o68cqws9pxrpd.png" alt="Image description" width="755" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Illustration of LLaVA-1.5’s multimodal capabilities. Image originally from &lt;a href="https://arxiv.org/abs/2310.03744"&gt;LLaVA-1.5 paper.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;2023 was the year of LLMs after all, so we were bound to see LLMs go multimodal. LLMs are so powerful, the argument went, but they can only natively process text. If we want to let LLMs loose in the real world as agents, they will need other senses with which to perceive the world. &lt;/p&gt;

&lt;p&gt;Multimodal LLMs (MLLMs) bridge this modality gap by giving an LLM the ability to accept tokens of more than one modality as input. In most cases, a pre-trained LLM is connected to a vision module with an adapter, whose weights are tuned through a multimodal task like image-text matching or contrastive learning.&lt;/p&gt;
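&lt;p&gt;The adapter pattern can be sketched in a few lines. This is an illustrative LLaVA-style linear projection with random weights, not any particular model’s implementation:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm = 768, 4096    # vision feature dim, LLM embedding dim
n_patches, n_text = 16, 8      # image patch tokens, text tokens

# Output of a frozen vision encoder: one feature vector per image patch.
patch_features = rng.normal(size=(n_patches, d_vision))

# The "adapter": a learned linear projection that maps vision features
# into the LLM's token-embedding space (the LLaVA-style design).
W_adapter = 0.02 * rng.normal(size=(d_vision, d_llm))
visual_tokens = patch_features @ W_adapter

# Text tokens embedded by the LLM's usual embedding table (stubbed here).
text_embeddings = rng.normal(size=(n_text, d_llm))

# The LLM then consumes a single interleaved sequence: image tokens first,
# followed by the text tokens of the prompt.
input_sequence = np.concatenate([visual_tokens, text_embeddings], axis=0)
```

&lt;p&gt;During multimodal tuning, only the adapter weights (and sometimes the LLM) are updated, which is why this recipe is so much cheaper than training a multimodal model from scratch.&lt;/p&gt;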

&lt;p&gt;The MLLMs which made the most noise were OpenAI’s &lt;a href="https://openai.com/research/gpt-4v-system-card"&gt;GPT-4 Vision&lt;/a&gt; and Google DeepMind’s &lt;a href="https://blog.google/technology/ai/google-gemini-ai/"&gt;Gemini&lt;/a&gt;. Additional noteworthy (and open-source!) multimodal LLMs include &lt;a href="https://llava-vl.github.io/"&gt;LLaVA&lt;/a&gt;, &lt;a href="https://github.com/THUDM/CogVLM"&gt;CogVLM&lt;/a&gt;, &lt;a href="https://medium.com/voxel51/neurips-2023-survival-guide-2f957d5b07c9#c8e2"&gt;InstructBLIP&lt;/a&gt;, &lt;a href="https://www.adept.ai/blog/fuyu-8b"&gt;Fuyu-8B&lt;/a&gt;, and &lt;a href="https://medium.com/voxel51/neurips-2023-survival-guide-2f957d5b07c9#e062"&gt;IDEFICS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://try.fiftyone.ai/datasets/llava-instruct/samples"&gt;Browse the LLaVA-Instruct dataset online&lt;/a&gt; and &lt;a href="https://voxel51.com/blog/understanding-llava-large-language-and-vision-assistant/"&gt;dive deep into LLaVA&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jacobmarks/gpt4-vision-plugin"&gt;Chat with your images using GPT-4 Vision&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jacobmarks/vqa-plugin"&gt;Run VQA on your images using LLaVA-13B, Fuyu-8B, and BLIPv2 &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="10"&gt;LLM-Aided Visual Reasoning&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SLsmCPnT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cxgo4i8envxo8dozct3c.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SLsmCPnT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cxgo4i8envxo8dozct3c.gif" alt="Image description" width="600" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Illustration of ViperGPT combining the general reasoning capabilities of LLMs with expert vision models to answer visual questions. Video originally from &lt;a href="https://viper.cs.columbia.edu/"&gt;ViperGPT project page&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An alternative approach to bridging the modality gap is to use LLMs as reasoning engines, and allow them to invoke vision models. This approach disentangles the visual understanding and logical reasoning generally present in multimodal tasks, reducing the burden placed on vision models. &lt;/p&gt;

&lt;p&gt;LLMs can act as reasoning engines, determining what specific vision tasks need to be performed, delegating the execution of these tasks to expert models, and drawing conclusions based on the outputs from these models. Such an approach is inherently modular (vision models can be added or replaced) and more interpretable (failures can be traced back to specific reasoning steps). &lt;/p&gt;
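&lt;p&gt;In miniature, the pattern looks like the sketch below. The planner and the “expert models” are stubs standing in for real detectors and captioners; in a real system (à la ViperGPT or HuggingGPT), the LLM itself would emit the plan:&lt;/p&gt;

```python
# Minimal sketch of the LLM-as-reasoning-engine pattern.
# The plan() step stands in for an LLM deciding which expert vision
# tools to invoke; the tools stand in for real detection/captioning models.

def detect_objects(image):
    # Stand-in for an expert detector (returns canned output).
    return ["dog", "frisbee"]

def caption_image(image):
    # Stand-in for an expert captioner (returns canned output).
    return "a dog catching a frisbee in a park"

TOOLS = {"detect": detect_objects, "caption": caption_image}

def plan(question):
    # Stand-in for the LLM's reasoning step: choose tools for the question.
    if "happening" in question.lower():
        return ["caption"]
    return ["detect"]

def answer(question, image):
    observations = {tool: TOOLS[tool](image) for tool in plan(question)}
    # The LLM would reason over these observations; we summarize directly.
    if "detect" in observations:
        return "Objects found: " + ", ".join(observations["detect"])
    return observations["caption"]
```

&lt;p&gt;Because each step is an explicit tool call, a wrong answer can be traced to a specific model or reasoning step, and any tool can be swapped out without retraining the system.&lt;/p&gt;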

&lt;p&gt;In 2023, we saw a variety of viral projects fitting this mold, including &lt;a href="https://chameleon-llm.github.io/"&gt;Chameleon&lt;/a&gt;, &lt;a href="https://huggingface.co/spaces/microsoft/HuggingGPT"&gt;HuggingGPT&lt;/a&gt;, &lt;a href="https://prior.allenai.org/projects/visprog"&gt;VisProg&lt;/a&gt;, and &lt;a href="https://viper.cs.columbia.edu/"&gt;ViperGPT&lt;/a&gt;. The last of these, ViperGPT, set a new SOTA for zero-shot visual question answering and grounded question answering tasks!&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/voxel51/voxelgpt"&gt;VoxelGPT: An AI Assistant for Computer Vision&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/spaces/microsoft/visual_chatgpt"&gt;Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/blog/vision_language_pretraining"&gt;A Dive into Vision-Language Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=5joBkbTy2yQ"&gt;YouTube Video: How LLMs are Transforming Computer Vision&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This post only scratches the surface of the vast ocean of advances we saw in 2023. If you enjoyed this and want to dive deeper into projects from specific conferences, check out these collections of 10 papers you won’t want to miss from &lt;a href="https://medium.com/voxel51/cvpr-2023-survival-guide-504e965e1f8b"&gt;CVPR 2023&lt;/a&gt;, &lt;a href="https://medium.com/voxel51/iccv-2023-survival-guide-10-papers-you-wont-want-to-miss-97b8a8e14e52"&gt;ICCV 2023&lt;/a&gt;, or &lt;a href="https://medium.com/voxel51/neurips-2023-survival-guide-2f957d5b07c9"&gt;NeurIPS 2023&lt;/a&gt;. For last year’s recap of the top computer vision developments, check out &lt;a href="https://medium.com/voxel51/why-2022-was-the-most-exciting-year-in-computer-vision-history-so-far-7a4ab8693b27"&gt;Why 2022 was the most exciting year in computer vision history (so far)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here are 10 other incredibly cool developments that I did not have space to cover but that still deserve recognition (in alphabetical order):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/HumanAIGC/AnimateAnyone"&gt;Animate Anyone&lt;/a&gt;: Consistent and Controllable Image-to-Video Synthesis for Character Animation (Note: code not yet available)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/voxel51/iccv-2023-survival-guide-10-papers-you-wont-want-to-miss-97b8a8e14e52#ffb3"&gt;DEVA&lt;/a&gt;: Tracking Anything with Decoupled Video Segmentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://emu-edit.metademolab.com/"&gt;Emu Edit&lt;/a&gt;: Precise Image Editing via Recognition and Generation Tasks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/IDEA-Research/GroundingDINO"&gt;GroundingDINO&lt;/a&gt;: State-of-the-art zero-shot object detector&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/facebookresearch/ImageBind"&gt;ImageBind&lt;/a&gt;: One Embedding Space To Bind Them All&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://leditsplusplus-project.static.hf.space/index.html"&gt;LEDITS++&lt;/a&gt;: Limitless Image Editing using Text-to-Image Models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://osu-nlp-group.github.io/MagicBrush/"&gt;MagicBrush&lt;/a&gt;: A Manually Annotated Dataset for Instruction-Guided Image Editing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/magic-research/magic-animate"&gt;MagicAnimate&lt;/a&gt;: Temporally Consistent Human Image Animation using Diffusion Model&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sadtalker.github.io/"&gt;SadTalker&lt;/a&gt;: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://stability.ai/news/stable-video-diffusion-open-ai-video-model"&gt;Stable Video Diffusion&lt;/a&gt; (Note: add SVD videos directly to your dataset with &lt;a href="https://voxel51.com/blog/computer-vision-generating-videos-from-images-with-stable-video-diffusion-and-fiftyone/"&gt;this workflow&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s see what 2024 has in store!&lt;/p&gt;

</description>
      <category>computervision</category>
      <category>ai</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
