<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jimmy Guerrero</title>
    <description>The latest articles on Forem by Jimmy Guerrero (@jimmyguerrero).</description>
    <link>https://forem.com/jimmyguerrero</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F676003%2F2af86bca-41aa-4146-8f22-e2a315d6ea93.jpeg</url>
      <title>Forem: Jimmy Guerrero</title>
      <link>https://forem.com/jimmyguerrero</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jimmyguerrero"/>
    <language>en</language>
    <item>
      <title>How to Build a Semantic Search Engine for Emojis</title>
      <dc:creator>Jimmy Guerrero</dc:creator>
      <pubDate>Wed, 10 Jan 2024 20:13:56 +0000</pubDate>
      <link>https://forem.com/voxel51/how-to-build-a-semantic-search-engine-for-emojis-31ff</link>
      <guid>https://forem.com/voxel51/how-to-build-a-semantic-search-engine-for-emojis-31ff</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Author:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/jacob-marks/" rel="noopener noreferrer"&gt;Jacob Marks&lt;/a&gt; - MLE &amp;amp; Developer Evangelist at Voxel51&lt;/em&gt;&lt;/p&gt;

&lt;h1&gt;Find The Sentiment You’re Looking For 🔍🤔😀🚀&lt;/h1&gt;

&lt;p&gt;If you’ve ever used Google Docs or Slack, you may have noticed that when you type a “:” immediately followed by another character, a list of emojis pops up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fpsgrlj0p7ydjm5qqw3.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2fpsgrlj0p7ydjm5qqw3.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since I discovered this, I’ve been making heavy use of the feature. I add emojis to far more of my messages, blog posts, and other writing than I ever imagined I would. In fact, I became so accustomed to this way of adding emojis that I installed &lt;a href="https://matthewpalmer.net/rocket/" rel="noopener noreferrer"&gt;Rocket&lt;/a&gt; — a free app that brings the same emoji searchability to all text boxes and text editors on the computer. It’s a game changer. &lt;/p&gt;

&lt;p&gt;But as I’ve used these emoji search engines more and more, I’ve noticed a frustrating limitation: every search matches the exact text of your query against the name and description of each emoji. Essentially, you need to phrase your query very precisely for any results to show up. &lt;/p&gt;

&lt;p&gt;Here’s an example: if we search for “audio”, not a single result shows up:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqjyg2c7n2qicza7d5vt.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqjyg2c7n2qicza7d5vt.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This isn’t because the set of emojis is lacking in the audio category. If we were to type in “music” or “speaker”, we would get a long list of results. Instead, it has to do with the fact that the specific string of text “audio” does not show up in the name or textual description associated with any of the emojis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozfn1l04jx23lxhv6pn2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozfn1l04jx23lxhv6pn2.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This relatively minor inconvenience bothered me so much that I decided to build this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoc5b2rmt230wp5h35rk.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyoc5b2rmt230wp5h35rk.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;By “this”, I mean an open-source semantic emoji search engine, with both UI-centric and CLI versions. The Python CLI library can be found &lt;a href="https://github.com/jacobmarks/emoji_search" rel="noopener noreferrer"&gt;here&lt;/a&gt;, and the UI-centric version can be found &lt;a href="https://github.com/jacobmarks/emoji-search-plugin" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You can also play around with a hosted (also free) version of the UI emoji search engine online &lt;a href="http://try.fiftyone.ai/datasets/emojis" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosug0bm5p5fvsdz7y43j.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fosug0bm5p5fvsdz7y43j.gif" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Command line version of the Semantic Emoji Search Engine&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Building this was not as simple or straightforward as I initially hoped. It took a lot of experimentation, and a lot of ideas I thought were quite clever fell essentially flat. But in the end, I was able to create an emoji search engine that works fairly well.&lt;/p&gt;

&lt;p&gt;Here’s how I built it, what worked, and what didn’t, and the lessons learned along the way.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What is an Emoji&lt;/li&gt;
&lt;li&gt;The Data&lt;/li&gt;
&lt;li&gt;Emojis versus Images and Text&lt;/li&gt;
&lt;li&gt;Bridging the Modality Gap&lt;/li&gt;
&lt;li&gt;Using the Emoji Search Engine&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="1"&gt;What is an Emoji&lt;/h2&gt;

&lt;p&gt;Before building a semantic search engine for emojis, it’s worth briefly explaining what exactly an emoji is. The term emoji derives from the Japanese kanji 絵 (eh) meaning picture, and 文字 (moji) meaning letter or character. Etymologically, then, an emoji is a pictogram: despite the resemblance, the word has no connection to the English word emotion, and an emoji is not an “emotion icon” — that is an &lt;a href="https://en.wikipedia.org/wiki/Emoticon" rel="noopener noreferrer"&gt;emoticon&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Along with &lt;a href="https://en.wikipedia.org/wiki/List_of_Unicode_characters#Latin_script" rel="noopener noreferrer"&gt;alphanumeric characters&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Latin_Extended-B#African_letters_for_clicks" rel="noopener noreferrer"&gt;African click sounds&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/List_of_Unicode_characters#Mathematical_symbols" rel="noopener noreferrer"&gt;mathematical&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/List_of_Unicode_characters#Geometric_Shapes" rel="noopener noreferrer"&gt;geometric symbols&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/List_of_Unicode_characters#Dingbats" rel="noopener noreferrer"&gt;dingbats&lt;/a&gt;, and &lt;a href="https://en.wikipedia.org/wiki/List_of_Unicode_characters#Control_codes" rel="noopener noreferrer"&gt;computer control sequences&lt;/a&gt;, emojis can be represented as Unicode characters, making them computer-readable. Unlike alphanumeric characters and other symbols, however, emojis are maintained by the &lt;a href="https://home.unicode.org/" rel="noopener noreferrer"&gt;Unicode Consortium&lt;/a&gt;. The consortium solicits proposals for new emojis, and regularly selects which emojis will be added to the standard.&lt;/p&gt;

&lt;p&gt;At the time of writing, in November 2023, there are more than &lt;a href="https://home.unicode.org/emoji/about-emoji/" rel="noopener noreferrer"&gt;3,600 recognized emojis&lt;/a&gt;, symbolizing a wide range of ideas and sentiments. Some emojis are represented by a single unicode character, or code-point. For example, the “grinning face” emoji, 😀, is represented in unicode as U+1F600. &lt;/p&gt;

&lt;p&gt;Others are represented with sequences of code-points. These sequences, which combine single code-point emojis with the zero-width-joiner unicode character, are known as ZWJ sequences, and allow for the combining of concepts, in much the same way as Chinese radicals can be combined to create a character that tells a story. As an example, the emoji 👨‍👩‍👧 is a zero-width joining of the emojis for man 👨 (U+1F468), woman 👩 (U+1F469), and girl 👧 (U+1F467), connected by the ZWJ code-point U+200D:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;👨‍👩‍👧 = U+1F468 U+200D U+1F469 U+200D U+1F467&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;
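&lt;p&gt;This composition can be checked directly in Python using nothing but the standard library:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## build the family emoji from its constituent code points
man, woman, girl, zwj = "\U0001F468", "\U0001F469", "\U0001F467", "\u200D"
family = man + zwj + woman + zwj + girl
print(family)  ## 👨‍👩‍👧
print([f"U+{ord(c):04X}" for c in family])
## ['U+1F468', 'U+200D', 'U+1F469', 'U+200D', 'U+1F467']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;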

&lt;p&gt;According to the Unicode Consortium, 92% of the world’s online population uses emojis in their communications, and the ten most-used emojis in 2021 were: 😂 ❤️ 🤣 👍 😭 🙏 😘 🥰 😍 😊.&lt;/p&gt;

&lt;h2 id="2"&gt;Starting with the Data&lt;/h2&gt;

&lt;p&gt;Given that emojis are pictographs of sorts, I wanted to utilize both textual and visual information in the search process. My initial hypothesis was that for many emojis, the name — the text string used to invoke the emoji — conveys but a fraction of its meaning. This can be due to many reasons, from the limitations of natural language, to the additional meanings imbued by cultures and visual similarities. In order to truly bring the full essence of the emoji to bear, I needed to make use of visual information. &lt;/p&gt;

&lt;p&gt;I found this &lt;a href="https://www.kaggle.com/datasets/subinium/emojiimage-dataset" rel="noopener noreferrer"&gt;Kaggle Emojis dataset&lt;/a&gt; from 2021, which has data about 1816 emojis, including the emoji representation, the text associated with it, the Unicode code point (or code points), and a &lt;a href="https://en.wikipedia.org/wiki/Base64" rel="noopener noreferrer"&gt;base64&lt;/a&gt; encoded image. Here’s what the first few rows of the dataset look like, loaded as a pandas DataFrame:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sco6qs2ijn28c0z55ys.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5sco6qs2ijn28c0z55ys.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are separate columns with names &lt;code&gt;Apple&lt;/code&gt;, &lt;code&gt;Google&lt;/code&gt;, &lt;code&gt;Facebook&lt;/code&gt;, etc. because the emoji renders differently depending on the computer, website, or application. I decoded the images from base64 and converted them into &lt;a href="https://pypi.org/project/Pillow/" rel="noopener noreferrer"&gt;Pillow&lt;/a&gt; images. Here is the first image from the Kaggle dataset (grinning face):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import base64
from io import BytesIO
from PIL import Image
## decode and convert first row Apple image
im_str = df.Apple[0].replace('data:image/png;base64,', '')
im = Image.open(BytesIO(base64.b64decode(im_str)))
im
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqamhoyq2weynfafbfmc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzqamhoyq2weynfafbfmc.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Upon conversion, however, it became clear that the images were very low resolution. This one, for instance, is only 72×72 pixels. To improve the quality of the images that I was going to pass into downstream models, and to improve the quality of the experience in the eventual UI-based application, I passed all of these low-resolution images into &lt;a href="https://replicate.com/nightmareai/real-esrgan" rel="noopener noreferrer"&gt;Real-ESRGAN&lt;/a&gt; to 10x the resolution. &lt;/p&gt;

&lt;p&gt;This is what the resulting images looked like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sdw3gbkxtvars1koxmf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6sdw3gbkxtvars1koxmf.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not all of the emojis had images for all of the image columns in the pandas DataFrame, so I used the first viable base64 encoding for each row.&lt;/p&gt;
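&lt;p&gt;That fallback can be sketched with a small helper (a minimal illustration; the vendor column list here is an assumption based on the DataFrame above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

## vendor image columns, in order of preference (assumed subset)
VENDOR_COLS = ["Apple", "Google", "Facebook", "Twitter"]

def first_viable_image(row):
    """Return the first non-null base64 image string for this emoji row."""
    for col in VENDOR_COLS:
        if col in row and pd.notna(row[col]):
            return row[col]
    return None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;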

&lt;h2 id="3"&gt;Emojis Versus Images and Text&lt;/h2&gt;

&lt;p&gt;Before diving any deeper, I want to emphasize one crucial element of emojis that makes them so special, and deserving of their own semantic search engine: in a sense, they are both images and text. From the human perspective, we can represent each emoji as a unicode character, on the same playing field as text characters, and we can represent it as a standalone image, both of which we saw in the previous section. Said another way, if we squint with one eye, we can see a pictogram as a picture, and if we squint with the other eye, we can see the same pictogram as text.&lt;/p&gt;

&lt;p&gt;Computers, however, are not known for their ability to squint. While a computer may be able to display a unicode code-point as an emoji, a machine learning model may not have a good way of interpreting the emoji as text or images.&lt;/p&gt;

&lt;p&gt;Whenever I’m working on semantic search applications that connect images and text, I start with a family of models known as &lt;a href="https://github.com/openai/CLIP" rel="noopener noreferrer"&gt;contrastive language image pre-training&lt;/a&gt; (CLIP). These models are trained on image-text pairs to generate similar vector representations or &lt;a href="https://towardsdatascience.com/neural-network-embeddings-explained-4d028e6f0526" rel="noopener noreferrer"&gt;embeddings&lt;/a&gt; for images and their captions, and dissimilar vectors when images are paired with other text strings. There are multiple CLIP-style models, including &lt;a href="https://github.com/mlfoundations/open_clip" rel="noopener noreferrer"&gt;OpenCLIP&lt;/a&gt; and &lt;a href="https://github.com/facebookresearch/metaclip" rel="noopener noreferrer"&gt;MetaCLIP&lt;/a&gt;, but for simplicity we’ll focus on the original CLIP model from OpenAI. No model is perfect, and at a fundamental level there is no right way to compare images and text, but CLIP certainly provides a good starting point.&lt;/p&gt;

&lt;h2&gt;Interpreting Emojis as Text&lt;/h2&gt;

&lt;p&gt;At a high level, language models process input text by converting it into an ordered sequence of tokens, and then encoding the tokens and positional information in a dense numerical vector. Each language model has its own vocabulary of tokens to decompose a text string into, spanning from individual letters to complete words. Some tokens are easily interpretable by a human, while others are not, and in the case of CLIP, the vocabulary has 49,408 entries.&lt;/p&gt;

&lt;p&gt;Let’s see an explicit example. Assuming the CLIP library is installed, we can tokenize a text string “a dog” with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import clip
text_tokens = clip.tokenize("a dog")
print(text_tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tensor([[49406,   320,  1929, 49407,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     0,     0,     0]], dtype=torch.int32)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output tensor contains four nonzero entries: 49406, 320, 1929, and 49407. To make sense of these, we can map these values back to keys in the &lt;a href="https://huggingface.co/openai/clip-vit-base-patch32/resolve/main/vocab.json" rel="noopener noreferrer"&gt;CLIP vocabulary dictionary&lt;/a&gt;. The first number, 49406, corresponds to the key “&amp;lt;|startoftext|&amp;gt;”, and the last number, 49407, corresponds to the key “&amp;lt;|endoftext|&amp;gt;”. These are special tokens denoting the beginning and end of the text string to be encoded. The second number, 320, maps back to “a”, which signifies the character “a” followed by a new word. Finally, 1929 is the value for the key “dog”.&lt;/p&gt;
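&lt;p&gt;A reverse lookup of this kind is just an inverted dictionary. A minimal sketch, using a toy two-entry stand-in for CLIP’s 49,408-entry vocabulary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## toy stand-in for CLIP's vocabulary: token string to token ID
vocab = {"a": 320, "dog": 1929}

## invert the mapping so token IDs can be looked up
id_to_token = {v: k for k, v in vocab.items()}
print(id_to_token[1929])  ## dog
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;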

&lt;p&gt;If we try to tokenize a string containing an emoji, however, we quickly run into a hitch: emojis don’t get tokenized in the same way as other characters do. Let’s start with the dog emoji 🐶:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clip.tokenize("🐶")
## [49406, 10631, 49407, 0, ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Doing a reverse lookup for the key associated with 10,631, we get the token “ðŁĲ¶”. But if we pass this string into the tokenizer, we get a completely different set of token IDs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clip.tokenize("ðŁĲ¶")
## [49406, 127, 108, 40419, 72, 329, 126, 370, 49407, 0, ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An even more curious case concerns the flag emojis. If we take the emoji for the flag of Cameroon, for instance, we get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;clip.tokenize("🇨🇲")
## [49406, 8989, 366, 49407, 0, ...]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The two non-start/end tokens here correspond to “ðŁĩ¨ðŁĩ” and “²”. If we plug the first of these back into the tokenizer, we get another completely different set of token IDs, but the second maps back to itself.&lt;/p&gt;

&lt;p&gt;Things get even more precarious when we start comparing embeddings of text strings with embeddings of emojis, parsed as text strings via this tokenizer. After all, we want to find the most relevant emojis given a text query. We can use the &lt;a href="https://medium.com/@milana.shxanukova15/cosine-distance-and-cosine-similarity-a5da0e4d9ded" rel="noopener noreferrer"&gt;cosine distance&lt;/a&gt; as a way to measure how similar or different two vectors are — and by proxy the inputs that generated those embedding vectors are. A distance of 0 means that two vectors are completely aligned, and a distance of 1 implies that two vectors are orthogonal. If we wanted to treat emojis as text, we would want the name for an emoji to be relatively close to the tokenized emoji in the embedding space, but this is not always the case!&lt;/p&gt;
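&lt;p&gt;Concretely, the cosine distance is one minus the cosine similarity of two vectors. A quick NumPy check of the two extremes mentioned above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def cosine_distance(u, v):
    """1 - cosine similarity; 0 for aligned vectors, 1 for orthogonal ones."""
    return 1 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_distance([1, 0], [2, 0]))  ## 0.0 (aligned)
print(cosine_distance([1, 0], [0, 3]))  ## 1.0 (orthogonal)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;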

&lt;p&gt;The utility below will compare an emoji and a list of text prompts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install fiftyone
from scipy.spatial.distance import cosine
import fiftyone.zoo as foz
model = foz.load_zoo_model("clip-vit-base32-torch")
def compare_emoji_to_texts(emoji, texts):
    emoji_embedding = model.embed_prompt(emoji)
    text_embeddings = model.embed_prompts(texts)
    for text, text_embedding in zip(texts, text_embeddings):
        print(f"Dist b/w {emoji} and {text}: {cosine(emoji_embedding, text_embedding):.4f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here’s an example, where according to CLIP, the encoding for the “birthday” emoji 🎂 is closer to “man” than “birthday”, closer to “dog” than “birthday present”, and closer to “car” than “candle”, “date”, or “holiday”:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;texts=​​["birthday", "birthday present", "cake", "candle", "car", "date", "dog", "holiday", "man"]
compare_emoji_to_texts("🎂", texts)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dist b/w 🎂 and birthday: 0.1205
Dist b/w 🎂 and birthday present: 0.1385
Dist b/w 🎂 and cake: 0.1238
Dist b/w 🎂 and candle: 0.2030
Dist b/w 🎂 and car: 0.1610
Dist b/w 🎂 and date: 0.1921
Dist b/w 🎂 and dog: 0.1344
Dist b/w 🎂 and holiday: 0.1844
Dist b/w 🎂 and man: 0.0849
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sometimes, the emoji and its name (and similar concepts) are close together in the embedding space, but sometimes they are most certainly not.&lt;/p&gt;

&lt;p&gt;We can also go the other way and retrieve the emojis whose embeddings most closely match the embedding of an input text prompt. For instance, for the input “love”, we get the following:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo28kwmsslc7fs1o7r9h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwo28kwmsslc7fs1o7r9h.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, we can do way better than this!&lt;/p&gt;

&lt;h2&gt;Interpreting Emojis as Images&lt;/h2&gt;

&lt;p&gt;The high-resolution images of emojis that we generated using Real-ESRGAN provide an alternative pathway to searching through our emojis: treating emojis as images. We can use CLIP’s vision encoder to embed the images into the same vector space, and then query these image embeddings with our input text prompt.&lt;/p&gt;
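&lt;p&gt;Querying image embeddings with a text embedding is a nearest-neighbor lookup under cosine similarity. A minimal sketch with toy 2-D vectors (real CLIP embeddings are 512-dimensional):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

## toy stand-in: rows are emoji image embeddings, normalized to unit length
emoji_names = ["❤️", "🐶", "🚗"]
emoji_embs = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])
emoji_embs /= np.linalg.norm(emoji_embs, axis=1, keepdims=True)

def top_k(query_emb, k=2):
    """Return the k emoji names whose embeddings best match the query."""
    q = np.asarray(query_emb) / np.linalg.norm(query_emb)
    sims = emoji_embs @ q  ## cosine similarity, since rows are unit vectors
    return [emoji_names[i] for i in np.argsort(-sims)[:k]]

print(top_k([0.9, 0.1]))  ## ['❤️', '🚗']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;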

&lt;p&gt;For applications like cross-modal retrieval (or semantically searching images with text), CLIP typically works best when the image embeddings are compared to a text prompt that is the user’s query wrapped in the phrase “A photo of ”. As an example, the image embedding for a photo of a dog will be closer (in terms of the angle between the vectors) to the embedding of “A photo of a dog” than the embedding of the raw query “dog”.&lt;/p&gt;

&lt;p&gt;However, when I used this template, the results were underwhelming. For instance, here are the 25 top results for the query “A photo of a dog”:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla7vk21w3kagjqx8bv91.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fla7vk21w3kagjqx8bv91.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because emojis aren’t exactly photos, I decided to dig a little deeper and try out a few templating (or wrapping) strategies. To cover my bases, I tested five formats for text prompts:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;&amp;lt;emoji_name&amp;gt;&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;A photo of a &lt;code&gt;&amp;lt;emoji_name&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;An emoji of &lt;code&gt;&amp;lt;emoji_name&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;A photo of a &lt;code&gt;&amp;lt;emoji_name&amp;gt;&lt;/code&gt; emoji&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;&amp;lt;emoji_name&amp;gt;&lt;/code&gt; emoji&lt;/li&gt;
&lt;/ol&gt;
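&lt;p&gt;These five wrappings can be generated mechanically (a small sketch; &lt;code&gt;emoji_name&lt;/code&gt; stands in for the name column in the dataset):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def make_prompts(emoji_name):
    """Build the five prompt templates tested for a given emoji name."""
    return [
        emoji_name,
        f"A photo of a {emoji_name}",
        f"An emoji of {emoji_name}",
        f"A photo of a {emoji_name} emoji",
        f"A {emoji_name} emoji",
    ]

print(make_prompts("dog"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;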

&lt;p&gt;I generated embeddings for all 1816 emojis with each of these methods, and computed the CLIPScore (cosine similarity multiplied by 100) of these vectors with the corresponding image embedding vectors.&lt;/p&gt;
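&lt;p&gt;CLIPScore, as used here, is just the cosine similarity scaled by 100. A minimal NumPy version:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np

def clip_score(u, v):
    """Cosine similarity between two embedding vectors, multiplied by 100."""
    sim = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return 100 * sim

print(round(clip_score([1.0, 0.0], [1.0, 1.0]), 2))  ## 70.71
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;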

&lt;p&gt;Here were the aggregate results:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Method        Min       Mean     Max
A             16.96     29.04    37.49
B             15.85     29.47    38.43
C             18.94     33.25    44.60
D             19.47     32.59    42.57
E             18.95     31.83    43.35
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From these statistics, I thought that the “An emoji of” descriptors were the best fit of the bunch, as they had the highest mean and max. But when I tried to use this, the results were again less than ideal. They seemed to favor faces (e.g. 😄😢🙃👦👧), to the detriment of other emojis like symbols, animals, and flags. When it came to semantic emoji searches, I found that entering the raw text tended to work best. In other words, the CLIP embedding of “dog” worked better than that of “A photo of a dog” or “An emoji of a dog”.&lt;/p&gt;

&lt;p&gt;There were a few takeaways from this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Overall image-text “alignment” isn’t necessarily important for semantic search&lt;/li&gt;
&lt;li&gt;The images of the emojis encode (to some degree) the fact that they are not photos&lt;/li&gt;
&lt;li&gt;The word “emoji” biases CLIP toward faces&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="4"&gt;Bridging the Modality Gap&lt;/h2&gt;

&lt;p&gt;By this point, I had come to the conclusion that treating emojis as just images or just text leaves a lot of rich information on the table. To build a robust semantic emoji search engine, I wanted to incorporate both textual and image information, and bridge the gap between these two modalities.&lt;/p&gt;

&lt;p&gt;I tried generating descriptions of the emoji images using Adept’s multimodal &lt;a href="https://www.adept.ai/blog/fuyu-8b" rel="noopener noreferrer"&gt;Fuyu-8b&lt;/a&gt; model, but these descriptions proved far too detailed; I tried using other CLIP-style models like &lt;a href="https://github.com/facebookresearch/metaclip" rel="noopener noreferrer"&gt;MetaCLIP&lt;/a&gt;, but saw the same behavior as in CLIP; I even tried using &lt;a href="https://openai.com/research/gpt-4v-system-card" rel="noopener noreferrer"&gt;GPT-4V&lt;/a&gt; to generate captions for the emoji images, but was cut off by OpenAI because the rate limit for the model is 100 queries per day.&lt;/p&gt;

&lt;p&gt;In the end, I was able to pass the emoji unicode characters into the base GPT-4 API with the prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;QUERY_TEXT = """
Your task is to write a brief description of the emoji {emoji}, in the format 'A photo of a ...'.  For example, 'A photo of a dog'. Do not include the emoji name or unicode in your description. Do not include the skin tone of the emoji. Do not include the word yellow in your response.  You may include the word 'emoji' in your description, but it is not necessary. Your description should be a single phrase, of no more than 10 words.
""" 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After post-processing these captions, I removed the “A photo of” prefix and used these descriptions in the semantic search pipeline.&lt;/p&gt;

&lt;p&gt;The emoji search engine works as follows, taking in an input query:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate a set of 100 candidate emojis (out of 1816) with an image similarity search that compares the image embeddings to the query embedding. Save this ordering, &lt;em&gt;clip_image_ordering&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Order these candidate emojis by the similarity of the CLIP embeddings of the emoji names to the query’s embedding (&lt;em&gt;clip_name_ordering&lt;/em&gt;).&lt;/li&gt;
&lt;li&gt;Using a &lt;a href="https://ai.plainenglish.io/decoding-sentence-representations-a-comprehensive-guide-to-cross-encoders-and-bi-encoders-67c4ac16e35f" rel="noopener noreferrer"&gt;cross-encoder&lt;/a&gt;, order the emojis based on the similarity of their name (&lt;em&gt;cross_encoder_name_ordering&lt;/em&gt;) and their description generated by GPT-4 (&lt;em&gt;cross_encoder_description_ordering&lt;/em&gt;) to the query.&lt;/li&gt;
&lt;li&gt;Combine all four orderings using &lt;a href="https://www.elastic.co/guide/en/elasticsearch/reference/current/rrf.html" rel="noopener noreferrer"&gt;reciprocal rank fusion&lt;/a&gt;, and return the top results!&lt;/li&gt;
&lt;/ol&gt;
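&lt;p&gt;Reciprocal rank fusion scores each candidate by summing 1/(k + rank) across the orderings, where k is a smoothing constant (60 is a common default). A minimal sketch, not the plugin’s exact implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists: score(item) = sum over lists of 1 / (k + rank)."""
    scores = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

## e.g. fuse a (hypothetical) image ordering with a name ordering
clip_image_ordering = ["🐶", "🐕", "🦮"]
clip_name_ordering = ["🐶", "🐕", "🐺"]
print(reciprocal_rank_fusion([clip_image_ordering, clip_name_ordering]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;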

&lt;p&gt;The resulting search engine isn’t perfect, but it does a decent job at incorporating textual and visual information. Because using a cross-encoder is more computationally expensive (and higher latency), this is reserved for the pared-down set of candidates. I use the &lt;code&gt;distilroberta-base&lt;/code&gt; checkpoint with the &lt;code&gt;CrossEncoder&lt;/code&gt; class from the &lt;a href="https://www.sbert.net/index.html" rel="noopener noreferrer"&gt;Sentence Transformers&lt;/a&gt; library.&lt;/p&gt;

&lt;p&gt;When all of these steps are combined, this is the result:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/5vHi0c5Z89w"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Again, it isn’t perfect. But it’s not bad!&lt;/p&gt;

&lt;h2 id="5"&gt;Using the Emoji Search Engine&lt;/h2&gt;

&lt;p&gt;There are three ways to use this emoji search engine: hosted (free), locally via UI (open source), or locally via command line (also open source). All three options are quite easy!&lt;/p&gt;

&lt;h2&gt;Online&lt;/h2&gt;

&lt;p&gt;Head over to &lt;a href="https://try.fiftyone.ai/datasets/emojis/samples" rel="noopener noreferrer"&gt;try.fiftyone.ai/datasets/emojis&lt;/a&gt;, sign in (it’s free), and click on the emoji button in the menu above the grid of images. That’s it!&lt;/p&gt;

&lt;h2&gt;Locally via the UI&lt;/h2&gt;

&lt;p&gt;If you want to perform emoji searches locally with the same visual interface, you can do so with the &lt;a href="https://github.com/jacobmarks/emoji-search-plugin" rel="noopener noreferrer"&gt;Emoji Search plugin&lt;/a&gt; for &lt;a href="https://github.com/voxel51/fiftyone" rel="noopener noreferrer"&gt;FiftyOne&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;First, install FiftyOne:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install fiftyone
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then download the Emoji Search plugin and install its requirements:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fiftyone plugins download https://github.com/jacobmarks/emoji-search-plugin
fiftyone plugins requirements @jacobmarks/emoji_search --install
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Launch the FiftyOne App:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fiftyone app launch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Click on the “browse operations” text, search for “emoji”, and click on the entry “Create Emoji Dataset”. This will download the high-resolution images of the emojis, along with embeddings and all other relevant data. At the top left of the app, click in the “Select dataset” box and select “Emojis”. Now you should see the same UI as in the hosted version.&lt;/p&gt;

&lt;h2&gt;Locally via the CLI&lt;/h2&gt;

&lt;p&gt;Finally, you can search via the command line using the &lt;a href="https://github.com/jacobmarks/emoji_search" rel="noopener noreferrer"&gt;Emoji Search&lt;/a&gt; Python CLI library. Install the package from the GitHub repository with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install git+https://github.com/jacobmarks/emoji_search.git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then you can start searching using the &lt;code&gt;emoji-search&lt;/code&gt; command, followed by the text query (with or without quotation marks).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;emoji-search beautiful sunset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+-------+-----------------+---------+
| Emoji |       Name       Unicode  |
+-------+-----------------+---------+
|  🌞   |   sun with face | U+1F31E |
|  🌇   |      sunset     | U+1F307 |
|  🌅   |      sunrise    | U+1F305 |
|  🔆   |   bright button | U+1F506 |
|  🌆   |cityscape at dusk| U+1F306 |
+-------+-----------------+---------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first search you perform will download embeddings to your device if necessary. All three versions support copying an emoji to clipboard with &lt;a href="https://pypi.org/project/pyperclip/" rel="noopener noreferrer"&gt;pyperclip&lt;/a&gt;. In the UI, click on the image for an emoji, and you’ll see a copy button appear in the menu. In the CLI, pass the &lt;code&gt;-c&lt;/code&gt; argument to copy the top result to clipboard.&lt;/p&gt;

&lt;h1&gt;Conclusion&lt;/h1&gt;

&lt;p&gt;Emojis might seem like a silly subject to obsess over. And in practice, the utility of a semantic emoji search engine over lexical emoji search may be somewhat limited. The real value in this endeavor lies in understanding the boundaries and overlaps between two modalities we traditionally think of as distinct: images and text. Emojis sit squarely in this intersection, and as such they let us probe the strengths and weaknesses, the capabilities and limitations, of today’s multimodal models.&lt;/p&gt;

&lt;p&gt;The Semantic Emoji Search Engine I ended up building is far from perfect. Frankly, emojis are subjective, connoting different things to different people, and that nuance is impossible to bottle up precisely. But going back to the motivating example, when I type in “an audio player”, I get some solid results:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm7yykfsrcd01o3z785v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgm7yykfsrcd01o3z785v.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ll end with a quote from &lt;a href="https://en.wikipedia.org/wiki/Nancy_Gibbs" rel="noopener noreferrer"&gt;Nancy Gibbs&lt;/a&gt;, a Professor at the Harvard Kennedy School and former managing editor for TIME magazine:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What makes emojis special is the fact that [they have] helped millions express themselves better than even the wide array of words in the Oxford dictionary [could].&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Nancy Gibbs&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>computervision</category>
      <category>ai</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Why 2023 was the most exciting year in computer vision history (so far)</title>
      <dc:creator>Jimmy Guerrero</dc:creator>
      <pubDate>Wed, 03 Jan 2024 20:47:37 +0000</pubDate>
      <link>https://forem.com/voxel51/why-2023-was-the-most-exciting-year-in-computer-vision-history-so-far-5ae0</link>
      <guid>https://forem.com/voxel51/why-2023-was-the-most-exciting-year-in-computer-vision-history-so-far-5ae0</guid>
      <description>&lt;p&gt;&lt;em&gt;&lt;strong&gt;Author:&lt;/strong&gt; &lt;a href="https://www.linkedin.com/in/jacob-marks/"&gt;Jacob Marks&lt;/a&gt; - MLE &amp;amp; Developer Evangelist at Voxel51&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;The 10 developments that reshaped computer vision&lt;/h2&gt;

&lt;p&gt;2023 was the year of chatbots and large language models (LLMs). From GPT-4 to Mixtral, to Llamas, Vicunas, Dolphins, and Orcas, it seemed like every day saw a new state-of-the-art model on some benchmark. At the same time, every week brought a new breakthrough in prompting, fine-tuning, quantizing, or serving LLMs. There was so much chatter about chatbots that it was hard to keep track of everything!&lt;/p&gt;

&lt;p&gt;The LLM-mania was so intense and the headline-grabbing so severe that it dominated the public discourse on AI. But in reality, machine learning research and applications saw progress across many modalities, from images to audio!&lt;/p&gt;

&lt;p&gt;Computer vision had a banner year in 2023. From new foundation models to accurate real-time detection, there’s far too much to fit in a single blog post. Nevertheless, I’ve selected the ten developments that I believe best capture computer vision’s progress. &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;YOLO Is Reborn: NextGen Object Detection&lt;/li&gt;
&lt;li&gt;SAM: The Foundation Model for Segmentation&lt;/li&gt;
&lt;li&gt;DINOv2: SOTA Models from Self-supervised Learning&lt;/li&gt;
&lt;li&gt;Gaussian Splats Give NeRFs a Run for their Money&lt;/li&gt;
&lt;li&gt;T2I Models Turn the Corner&lt;/li&gt;
&lt;li&gt;LoRA: Flexible and Affordable Fine-tuning&lt;/li&gt;
&lt;li&gt;Ego-Exo4D: The Foundation Dataset for Video Perception&lt;/li&gt;
&lt;li&gt;T2V Models Arrive&lt;/li&gt;
&lt;li&gt;Multimodal LLMs&lt;/li&gt;
&lt;li&gt;LLM-aided visual reasoning&lt;/li&gt;
&lt;/ol&gt;

&lt;h2 id="1"&gt;YOLO Is Reborn: NextGen Object Detection&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yJvXHinx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3bu0tf5czyn8wf5e9ybp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yJvXHinx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3bu0tf5czyn8wf5e9ybp.png" alt="Image description" width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;YOLO-NAS predictions for an image from the MS COCO dataset, visualized in the FiftyOne App. Image courtesy of the author.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the greater part of a decade, the &lt;a href="https://arxiv.org/abs/1506.02640"&gt;You Only Look Once&lt;/a&gt; (YOLO) family of models has been an incredibly popular choice for near real-time object detection tasks. Prior to 2023, YOLO had already gone through many iterations, with popular variants including &lt;a href="https://github.com/ultralytics/yolov5"&gt;YOLOv5&lt;/a&gt; and &lt;a href="https://github.com/ultralytics/ultralytics"&gt;YOLOv8&lt;/a&gt; (December 2022) from Ultralytics, &lt;a href="https://github.com/meituan/YOLOv6"&gt;YOLOv6&lt;/a&gt; from Meituan, and &lt;a href="https://github.com/WongKinYiu/yolov7"&gt;YOLOv7&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In May, Deci AI released YOLO-NAS, a YOLO-style model created with the help of &lt;a href="https://en.wikipedia.org/wiki/Neural_architecture_search"&gt;Neural Architecture Search&lt;/a&gt; (NAS). The model is faster and significantly more accurate than previous YOLO models, and has strong support for quantization. The smallest, quantized variant achieves a mean average precision (mAP) of 47.03 at just &lt;a href="https://docs.ultralytics.com/models/yolo-nas/"&gt;2.36ms latency!&lt;/a&gt; YOLO-NAS also forms the basis for Deci’s state-of-the-art (SOTA) pose estimation model, &lt;a href="https://github.com/Deci-AI/super-gradients/blob/master/YOLONAS-POSE.md"&gt;YOLO-NAS-Pose&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.voxel51.com/tutorials/yolov8.html"&gt;Tutorial on Fine-tuning YOLOv8&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/voxel51/state-of-the-art-object-detection-with-yolo-nas-fiftyone-f1530826b28e"&gt;Blog post on SOTA object detection with YOLO-NAS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2309.11331.pdf"&gt;Gold-YOLO&lt;/a&gt; (&lt;a href="https://www.linkedin.com/posts/jacob-marks_yolo-computervision-objectdetection-activity-7112145772095123456-CU9o?utm_source=share&amp;amp;utm_medium=member_desktop"&gt;TL;DR: version&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://deci.ai/blog/decicoder-efficient-and-accurate-code-generation-llm/"&gt;DeciCoder&lt;/a&gt; and &lt;a href="https://huggingface.co/Deci/DeciLM-7B"&gt;DeciLM-7B&lt;/a&gt; from Deci AI&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="2"&gt;Segment Anything: The Foundation Model for Segmentation&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PxEN529a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fv8xqay8lzyi43re715r.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PxEN529a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fv8xqay8lzyi43re715r.jpeg" alt="Image description" width="800" height="534"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Auto-segmentation with Meta AI’s Segment Anything Model. Image originally from SAM GitHub repo.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/facebookresearch/segment-anything"&gt;Segment Anything Model&lt;/a&gt; (SAM) from Meta AI Research is arguably the first foundation model for segmentation tasks in computer vision. In the past, if you wanted to generate high-quality pixel-level classification predictions for your data, you would need to train a segmentation model from scratch. &lt;/p&gt;

&lt;p&gt;SAM has completely changed the game. Now you can segment everything in an image, or instance-segment individual objects, by prompting the model with a bounding box or positive/negative keypoints. The GitHub repository has 40k stars and counting!&lt;/p&gt;

&lt;p&gt;SAM and the &lt;a href="https://ai.meta.com/datasets/segment-anything/"&gt;1.1 billion mask dataset&lt;/a&gt; co-developed with the model have already spawned tons of incredible related projects, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smaller, derivative segmentation models: &lt;a href="https://github.com/CASIA-IVA-Lab/FastSAM"&gt;FastSAM&lt;/a&gt;, &lt;a href="https://github.com/ChaoningZhang/MobileSAM"&gt;MobileSAM&lt;/a&gt;, &lt;a href="https://github.com/NVIDIA-AI-IOT/nanosam"&gt;NanoSAM&lt;/a&gt;, &lt;a href="https://github.com/chongzhou96/EdgeSAM"&gt;EdgeSAM&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Composite applications: &lt;a href="https://github.com/xinyu1205/recognize-anything"&gt;Recognize Anything&lt;/a&gt;, &lt;a href="https://github.com/geekyutao/Inpaint-Anything"&gt;Inpaint Anything&lt;/a&gt;, &lt;a href="https://github.com/gaomingqi/Track-Anything"&gt;Track-Anything&lt;/a&gt;, &lt;a href="https://github.com/IDEA-Research/Grounded-Segment-Anything"&gt;Grounded-Segment-Anything&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;3D segmentation applications: &lt;a href="https://github.com/Jumpat/SegmentAnythingin3D"&gt;Segment Anything in 3D with NeRFs&lt;/a&gt;, &lt;a href="https://github.com/Pointcept/SegmentAnything3D"&gt;Segment Anything 3D&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;And medical segmentation models: &lt;a href="https://github.com/bowang-lab/MedSAM"&gt;MedSAM&lt;/a&gt;, &lt;a href="https://github.com/uni-medical/SAM-Med3D"&gt;SAM-Med3D&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;💡 For specialized applications and deployed solutions, you will likely still want to train or fine-tune your own model!&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://towardsdatascience.com/see-what-you-sam-4eea9ad9a5de"&gt;See What You Segment: SAM Blog Post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/voxel51/facet-a-benchmark-dataset-for-fairness-in-computer-vision-2260c82e1662"&gt;FACET benchmark blog post&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/UX-Decoder/Segment-Everything-Everywhere-All-At-Once"&gt;SEEM: Segment Everything Everywhere All at Once&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/SysCV/sam-hq"&gt;SAM-HQ: Segment Anything in High Quality&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="3"&gt;DINOv2: SOTA Models from Self-supervised Learning&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LhanmIQR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s6d198hfadevm8a8dk2n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LhanmIQR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/s6d198hfadevm8a8dk2n.png" alt="Image description" width="600" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example of depth estimation from a single image with DINOv2. Image originally from &lt;a href="https://dinov2.metademolab.com/"&gt;DINOv2 demo site&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;A standard technique in natural language processing applications is &lt;a href="https://en.wikipedia.org/wiki/Self-supervised_learning"&gt;self-supervised learning&lt;/a&gt;, wherein the model is trained on signals generated from the input data itself, rather than pre-specified annotations. In LLM pretraining, for instance, the model can be trained to predict which token comes next in a text sequence. Self-supervised approaches like this can be helpful in reducing reliance on human-annotated data.&lt;/p&gt;
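&lt;p&gt;The next-token setup makes “signals generated from the input data itself” concrete: training pairs are manufactured directly from a raw token stream, with no human labels involved. A minimal sketch (the context size is arbitrary):&lt;/p&gt;

```python
def next_token_pairs(tokens, context_size=3):
    """Build (context, target) training pairs from a raw token stream.
    The labels come from the data itself, not from annotators."""
    pairs = []
    for i in range(context_size, len(tokens)):
        pairs.append((tokens[i - context_size:i], tokens[i]))
    return pairs

tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)
# First pair: (["the", "cat", "sat"], "on")
```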

&lt;p&gt;In the context of computer vision, approaches like contrastive learning (see CLIP) rely heavily on human-provided captions and metadata, restricting the model’s understanding to the quality of captions and the diversity of annotated images. &lt;a href="https://ai.meta.com/blog/dino-v2-computer-vision-self-supervised-learning/"&gt;DINOv2&lt;/a&gt; overcomes these limitations by applying a self-supervised approach to vision tasks.&lt;/p&gt;

&lt;p&gt;When pre-trained on a diverse set of 142M images and combined with basic task-specific heads, Meta’s DINOv2 backbone achieves state-of-the-art performance across a variety of vision tasks, from depth estimation to semantic segmentation. More to the point, the DINOv2 approach provides a template for anyone to train a high-quality model with few labeled images!&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dinov2.metademolab.com/demos"&gt;DINOv2 demo website
&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/facebook/dinov2-large"&gt;DINOv2 checkpoint on Hugging Face&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="4"&gt;Gaussian Splatting&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RPmgbAWa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yveyafape3le14xwn0d7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RPmgbAWa--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/yveyafape3le14xwn0d7.png" alt="Image description" width="800" height="183"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Comparison of 3D Gaussian Splatting technique (labeled “Ours”) with other competitive techniques for view synthesis, demonstrating advantages in training time, latency, and accuracy. Image originally from 3D Gaussian Splatting GitHub repo.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For the first half of 2023, &lt;a href="https://www.matthewtancik.com/nerf"&gt;Neural Radiance Fields&lt;/a&gt; (NeRFs) dominated discussions on &lt;a href="https://paperswithcode.com/task/novel-view-synthesis"&gt;novel view synthesis&lt;/a&gt;. As we &lt;a href="https://voxel51.com/blog/cvpr-2023-and-the-state-of-computer-vision/#whats-trending"&gt;documented in May&lt;/a&gt;, in advance of CVPR, the term “radiance” appeared 80% more often in CVPR 2023 paper titles than in CVPR 2022 paper titles.&lt;/p&gt;

&lt;p&gt;The second half of 2023 has seen the emergence of an alternative method called &lt;a href="https://huggingface.co/blog/gaussian-splatting"&gt;Gaussian Splatting&lt;/a&gt;, which represents scenes with, well, 3-dimensional (or higher) Gaussians. The rasterization technique achieves SOTA visual quality and real-time (&amp;gt;100 fps) rendering. Gaussian splatting also has additional benefits compared to NeRFs, including much faster training.&lt;/p&gt;
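&lt;p&gt;The heart of the rasterization idea is simple even though the full method is not: project the Gaussians, sort them by depth, and alpha-composite them front to back at each pixel. A single-pixel, grayscale sketch with isotropic 2D Gaussians (all numbers illustrative; the real method uses anisotropic, learned Gaussians and a tile-based GPU rasterizer):&lt;/p&gt;

```python
import math

def gaussian_alpha(px, py, gx, gy, sigma, opacity):
    """Opacity contribution of an isotropic 2D Gaussian at a pixel."""
    d2 = (px - gx) ** 2 + (py - gy) ** 2
    return opacity * math.exp(-d2 / (2 * sigma ** 2))

def composite_pixel(px, py, gaussians):
    """Front-to-back alpha compositing of depth-sorted splats.
    Each splat: (depth, gx, gy, sigma, opacity, gray_value)."""
    color, transmittance = 0.0, 1.0
    for _, gx, gy, sigma, opacity, c in sorted(gaussians):
        a = gaussian_alpha(px, py, gx, gy, sigma, opacity)
        color += transmittance * a * c
        transmittance *= 1.0 - a
        if transmittance < 1e-4:  # early exit once the pixel saturates
            break
    return color

# A near, nearly-opaque white splat hides a far black one behind it
splats = [(2.0, 0.0, 0.0, 1.0, 0.99, 1.0),   # near, white
          (5.0, 0.0, 0.0, 1.0, 0.99, 0.0)]   # far, black
c = composite_pixel(0.0, 0.0, splats)
```

&lt;p&gt;Terminating early once a pixel is effectively opaque is one of the tricks that keeps rendering fast.&lt;/p&gt;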

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/"&gt;3D Gaussian Splatting&lt;/a&gt; (original project)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dynamic3dgaussians.github.io/"&gt;Dynamic 3D Gaussians&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://guanjunwu.github.io/4dgs/index.html"&gt;4D Gaussian Splatting&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="5"&gt;Text-to-Image Models Turn the Corner&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5aG-Bl61--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/65fkaygwzyxr77d3coyy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5aG-Bl61--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/65fkaygwzyxr77d3coyy.png" alt="Image description" width="800" height="957"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Comparison of image generations across different Midjourney versions for the same prompt: “dungeons and dragons, female knight, of the rolling plains, full body, dark azure, victorian genre paintings, serene face, realistic depiction of light, golden light –seed 5”. Image courtesy of &lt;a href="https://aituts.com/midjourney-versions/"&gt;aituts&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In 2022, DALL-E 2 &lt;a href="https://time.com/collection/best-inventions-2022/6225486/dall-e-2/"&gt;was named one of TIME Magazine’s 100 inventions of the year&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Midjourney#:~:text=The%20Midjourney%20image%20generation%20platform%20first%20entered%20open%20beta%20on%20July%2012%2C%202022"&gt;Midjourney launched their v1&lt;/a&gt;, and &lt;a href="https://en.wikipedia.org/wiki/Stable_Diffusion#:~:text=to%2Dimage%20model-,released%20in%202022,-based%20on%20diffusion"&gt;Stable Diffusion was released&lt;/a&gt;, paving the way for &lt;a href="https://en.wikipedia.org/wiki/Text-to-image_model"&gt;text-to-image&lt;/a&gt; (T2I) models. The promise was there, but results were mixed — hands with six fingers, undesired spatial compositions, and unsatisfying aesthetic characteristics were all common. What’s more, image generation inference times could be substantial, slowing experimentation.&lt;/p&gt;

&lt;p&gt;This year, T2I models have taken massive leaps forward. &lt;a href="https://aituts.com/midjourney-versions/"&gt;Midjourney creations&lt;/a&gt; have gone from somewhat recognizable to breathtakingly lifelike; &lt;a href="https://openai.com/dall-e-3"&gt;DALL-E 3&lt;/a&gt; refines your text prompt for you; &lt;a href="https://stablediffusionxl.com/"&gt;Stable Diffusion XL&lt;/a&gt; can generate realistic faces and legible text; and &lt;a href="https://deepmind.google/technologies/imagen-2/"&gt;Imagen 2&lt;/a&gt; allows you to add invisible watermarks into your AI generated images.&lt;/p&gt;

&lt;p&gt;I want to call attention to two tranches of innovation worth keeping an eye on as we move into 2024:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The push toward real-time T2I generation: &lt;a href="https://github.com/luosiallen/latent-consistency-model"&gt;latent consistency models&lt;/a&gt; (LCM) and &lt;a href="https://stability.ai/news/stability-ai-sdxl-turbo"&gt;SDXL Turbo’s Adversarial Diffusion Distillation &lt;/a&gt;(ADD)&lt;/li&gt;
&lt;li&gt;Efforts to improve alignment of T2I generated images with human preferences: &lt;a href="https://github.com/THUDM/ImageReward"&gt;ImageReward&lt;/a&gt; and &lt;a href="https://github.com/yuvalkirstain/PickScore"&gt;Pick-a-Pic&lt;/a&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In spite of all of these improvements, T2I models are still far from perfect. A team of researchers recently created the first holistic evaluation benchmark for T2I models (&lt;a href="https://medium.com/voxel51/neurips-2023-survival-guide-2f957d5b07c9#b237"&gt;HEIM&lt;/a&gt;), and found that no single model excels across the board!&lt;/p&gt;

&lt;h3&gt;ControlNet&lt;/h3&gt;

&lt;p&gt;The dominant generative modeling technique underlying T2I models in 2023 is the &lt;a href="https://www.assemblyai.com/blog/diffusion-models-for-machine-learning-introduction/"&gt;diffusion model&lt;/a&gt;. In the context of image generation, a diffusion model is essentially tasked with iteratively turning a noisy initial image into a coherent, lower-noise image. This technique is incredibly powerful, but in a vacuum, exerting control over the final generated image via just a text prompt can be imprecise and unwieldy. &lt;/p&gt;
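&lt;p&gt;Concretely, the forward (“noising”) half of the process that a diffusion model learns to invert can be written in a few lines. This is a scalar sketch; real models operate on image tensors, with a learned network predicting the noise to remove at each step:&lt;/p&gt;

```python
import math
import random

def forward_diffusion(x0, alphas_cumprod, seed=0):
    """Forward process: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps.
    As a_t shrinks toward 0, the signal is destroyed and x_t
    approaches pure Gaussian noise; generation runs this in reverse."""
    rng = random.Random(seed)
    trajectory = []
    for a_t in alphas_cumprod:
        eps = rng.gauss(0.0, 1.0)  # fresh standard Gaussian noise
        trajectory.append(math.sqrt(a_t) * x0 + math.sqrt(1 - a_t) * eps)
    return trajectory

# One "pixel" progressively noised over four timesteps
traj = forward_diffusion(x0=1.0, alphas_cumprod=[0.99, 0.9, 0.5, 0.01])
```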

&lt;p&gt;&lt;a href="https://github.com/lllyasviel/ControlNet"&gt;ControlNet&lt;/a&gt; enables a far greater degree of control over composition and style of the output of T2I diffusion models. With ControlNet, you can expressly control the contours of objects in the generated image from &lt;a href="https://github.com/lllyasviel/ControlNet?tab=readme-ov-file#controlnet-with-canny-edge"&gt;Canny edge maps&lt;/a&gt; or &lt;a href="https://github.com/lllyasviel/ControlNet?tab=readme-ov-file#controlnet-with-user-scribbles"&gt;scribbles&lt;/a&gt;, the &lt;a href="https://github.com/lllyasviel/ControlNet?tab=readme-ov-file#controlnet-with-human-pose"&gt;pose&lt;/a&gt; of a generated person, and so much more. If you’ve seen some of the &lt;a href="https://arstechnica.com/information-technology/2023/06/redditor-creates-working-anime-qr-codes-using-stable-diffusion/"&gt;stunning generative artwork&lt;/a&gt; that also functions as a working QR code, &lt;a href="https://huggingface.co/DionTimmer/controlnet_qrcode"&gt;ControlNet is the technique behind it&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/jacobmarks/text-to-image"&gt;Plugin to Add T2I images directly to your dataset&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://try.fiftyone.ai/datasets/llava-instruct/samples"&gt;Browse the (cleaned) ImageRewardDB dataset online&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/voxel51/conquering-controlnet-voxel51-c665dc8dd358"&gt;Conquering ControlNet: Harness the Power of Diffusion Models with Higher-Quality Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/gligen/GLIGEN"&gt;GLIGEN: Open-Set Grounded Text-to-Image Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/damo-vilab/composer"&gt;Composer: Creative and Controllable Image Synthesis with Composable Conditions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="6"&gt;LoRA&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BIZXGR6s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gxb1kg3jl8tbdttnxuk9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BIZXGR6s--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/gxb1kg3jl8tbdttnxuk9.png" alt="Image description" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Image generated in the style of emojis with an &lt;a href="https://huggingface.co/SvenN/sdxl-emoji"&gt;SDXL LoRA emoji fine-tune&lt;/a&gt;, with the text prompt: “A TOK emoji of a tiger face, white background”. Image originally from &lt;a href="https://replicate.com/fofr/sdxl-emoji/examples"&gt;Replicate&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Originally developed for fine-tuning LLMs, &lt;a href="https://arxiv.org/abs/2106.09685"&gt;LoRA&lt;/a&gt; is a technique for &lt;a href="https://huggingface.co/docs/peft/index"&gt;parameter-efficient fine-tuning&lt;/a&gt; which makes adaptation of existing models easy, affordable, and accessible. The method works by inserting small, low-rank matrices in the base model, and learning a good configuration for these weights over the fine-tuning data while keeping the weights of the original model fixed. In effect, LoRA adapts the original model for a new purpose — all while adding just megabytes to GB-sized models!&lt;/p&gt;
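&lt;p&gt;A toy, pure-Python illustration of the idea (not a real implementation; in practice LoRA is applied to transformer weight matrices via libraries like Hugging Face PEFT). The frozen weight W is only ever read, while training updates just the small factors A and B:&lt;/p&gt;

```python
def matmul(A, B):
    """Multiply two matrices represented as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x @ (W + alpha * B @ A): the frozen weight W plus a
    low-rank update B @ A that is learned during fine-tuning."""
    delta = matmul(B, A)  # (d x r) @ (r x k) -> d x k update
    W_eff = [[W[i][j] + alpha * delta[i][j] for j in range(len(W[0]))]
             for i in range(len(W))]
    return matmul(x, W_eff)

d = k = 4
r = 1
W = [[1.0 if i == j else 0.0 for j in range(k)] for i in range(d)]  # frozen
B = [[0.1] for _ in range(d)]   # d x r trainable factor
A = [[1.0, 0.0, 0.0, 0.0]]      # r x k trainable factor
y = lora_forward([[1.0, 2.0, 3.0, 4.0]], W, A, B)

full_params = d * k        # fine-tuning W directly: 16 numbers
lora_params = r * (d + k)  # fine-tuning A and B:    8 numbers
```

&lt;p&gt;For a d x k layer, a rank-r adapter trains only r(d + k) numbers instead of dk, which is where the megabytes-instead-of-gigabytes savings comes from.&lt;/p&gt;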

&lt;p&gt;The predominant application of LoRA in computer vision has been for fine-tuning diffusion models to match certain styles, from &lt;a href="https://huggingface.co/nerijs/pixel-art-xl"&gt;pixel art&lt;/a&gt; to &lt;a href="https://huggingface.co/Fictiverse/Voxel_XL_Lora"&gt;voxels&lt;/a&gt;. There’s even a gallery of LoRA fine-tunes on &lt;a href="https://huggingface.co/spaces/huggingface-projects/diffusers-gallery"&gt;Hugging Face&lt;/a&gt;! LoRA models have also been used to bring the reduced inference steps of latent consistency models to stable diffusion (&lt;a href="https://huggingface.co/blog/lcm_lora"&gt;LCM-LoRA&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;But LoRA models are applicable in other vision contexts as well, from &lt;a href="https://huggingface.co/docs/peft/task_guides/semantic_segmentation_lora"&gt;semantic segmentation&lt;/a&gt; to &lt;a href="https://huggingface.co/docs/peft/task_guides/dreambooth_lora"&gt;DreamBooth fine-tuning&lt;/a&gt;. One particularly interesting application of LoRA is &lt;a href="https://dreamsim-nights.github.io/"&gt;DreamSim&lt;/a&gt;, where the technique is used to learn a SOTA human visual similarity metric.&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/diffusers/training/lora"&gt;LoRA in Hugging Face Diffusers library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/rohitgandikota/sliders"&gt;Concept Sliders: LoRA Adaptors for Precise Control in Diffusion Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/voxel51/teaching-androids-to-dream-of-sheep-18d72f44f2b"&gt;DreamSim blog: learning perceptual similarity&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2311.03285v2"&gt;S-LoRA: Serving Thousands of Concurrent LoRA Adapters&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="7"&gt;Ego-Exo4D: The Foundation Dataset for Video Perception&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XnyxVkOf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wa48jkpja66wu0bhudwv.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XnyxVkOf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wa48jkpja66wu0bhudwv.gif" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Video footage from egocentric and exocentric vantage points from the &lt;a href="https://ai.meta.com/blog/ego-exo4d-video-learning-perception/"&gt;Ego-Exo4D&lt;/a&gt; dataset. Video originally from &lt;a href="https://arxiv.org/abs/2311.03285v2"&gt;Ego-Exo4D project page&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Over the past two years, researchers across Meta AI and 15 universities have worked together to collect the largest and highest-quality dataset to date for video learning and multimodal perception. The resulting Ego-Exo4D dataset contains 1,400 hours of footage of 800+ participants performing skill-based human activities, from cooking to dancing, and has the potential to impact how both humans and robots learn and acquire skills.&lt;/p&gt;

&lt;p&gt;For each scene, the dataset contains synchronized video footage from the first-person (egocentric) perspective, captured with Meta’s &lt;a href="https://www.projectaria.com/glasses/"&gt;Aria glasses&lt;/a&gt;, and third-person (exocentric) perspective. This video data is augmented with first-person narrations, third-person play-by-plays, and annotations for tasks like 3D body and hand pose estimation. In conjunction with the dataset, Meta provides a benchmark suite, and plans to host a benchmark challenge in 2024.&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.projectaria.com/"&gt;Project Aria&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ego-exo4d-data.org/paper/ego-exo4d.pdf"&gt;Ego-Exo4D Paper&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://try.fiftyone.ai/datasets/egoobjects-val/samples"&gt;Browse Meta’s related EgoObjects dataset online&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="8"&gt;T2V&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--hcFwCREU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q3tuciln2yfkbr4y5xwn.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--hcFwCREU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/q3tuciln2yfkbr4y5xwn.gif" alt="Image description" width="512" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Video generated by Emu Video for prompt “A hamster wearing virtual reality headsets is a dj in a disco“. Video originally from &lt;a href="https://emu-video.metademolab.com/"&gt;Emu Video Project Page&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If generating images from text is hard, generating high-quality videos from text verges on impossible. Or at least that’s what many people thought heading into 2023. Over the past twelve months, however, the question has gone from an “if” to a “when”. &lt;/p&gt;

&lt;p&gt;AI Creativity Tools powerhouse &lt;a href="https://runwayml.com/"&gt;Runway&lt;/a&gt; has been leading the charge, releasing both &lt;a href="https://research.runwayml.com/gen1"&gt;Gen-1&lt;/a&gt; and &lt;a href="https://research.runwayml.com/gen2"&gt;Gen-2&lt;/a&gt; of its text-to-video (T2V) model, as well as tools for &lt;a href="https://runwayml.com/ai-magic-tools/frame-interpolation/"&gt;frame interpolation&lt;/a&gt; and &lt;a href="https://help.runwayml.com/hc/en-us/articles/21423169912595-Gen-2-Motion-Brush"&gt;generating motion from masked regions&lt;/a&gt;. But Runway is far from the only player in the T2V game: in November, &lt;a href="https://pika.art/"&gt;Pika Labs&lt;/a&gt; announced their “idea-to-video” platform and &lt;a href="https://techcrunch.com/2023/11/28/pika-labs-which-is-building-ai-tools-to-generate-and-edit-videos-raises-55m/"&gt;a $55M funding round&lt;/a&gt;, and Meta announced their SOTA model &lt;a href="https://emu-video.metademolab.com/"&gt;Emu Video&lt;/a&gt;, which splits T2V tasks into (i) text-conditioned image generation, and (ii) video generation conditioned on image and text prompt.&lt;/p&gt;
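&lt;p&gt;Emu Video’s two-stage factorization can be sketched as a tiny pipeline. The functions below are illustrative stubs standing in for the actual diffusion models, not Meta’s API:&lt;/p&gt;

```python
# Toy sketch of Emu Video's factorized text-to-video recipe:
# (i) generate an image from the text prompt, then
# (ii) generate video conditioned on that image plus the prompt.
# Both stage functions are stand-ins for real diffusion models.

def text_to_image(prompt):
    # Stage (i): text-conditioned image generation (stubbed).
    return {"kind": "image", "prompt": prompt}

def image_and_text_to_video(image, prompt, num_frames=16):
    # Stage (ii): video generation conditioned on image and text (stubbed).
    return {"kind": "video", "first_frame": image,
            "prompt": prompt, "num_frames": num_frames}

def generate_video(prompt):
    image = text_to_image(prompt)
    return image_and_text_to_video(image, prompt)

clip = generate_video("A hamster DJ in a disco")
```

&lt;p&gt;The appeal of this factorization is that each stage is a simpler, better-understood conditional generation problem than direct text-to-video.&lt;/p&gt;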

&lt;p&gt;It is also worth mentioning a few open-source T2V models introduced in 2023: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://modelscope.cn/models/damo/text-to-video-synthesis/summary"&gt;ModelScope’s Text-to-Video&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Picsart-AI-Research/Text2Video-Zero?tab=readme-ov-file"&gt;Text2Video-Zero&lt;/a&gt; from PicsArt&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/AILab-CVC/VideoCrafter"&gt;VideoCrafter1: Open Diffusion Models for High-Quality Video Generation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://vvictoryuki.github.io/animatezero.github.io/"&gt;AnimateZero: Video Diffusion Models are Zero-Shot Image Animators
&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While the quality of the videos these open-source models generate still lags behind that of their commercial counterparts, they form a solid foundation for open-source T2V efforts to come!&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/diffusers/api/pipelines/text_to_video"&gt;Text-to-video model in Hugging Face Diffusers library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/docs/diffusers/api/pipelines/text_to_video_zero"&gt;Text2Video-Zero model in Hugging Face Diffusers library&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.synthesia.io/main"&gt;Synthesia: T2V platform for avatars&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="9"&gt;Multimodal LLMs&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yPDwXzqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eonb583o68cqws9pxrpd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yPDwXzqx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/eonb583o68cqws9pxrpd.png" alt="Image description" width="755" height="1024"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Illustration of LLaVA-1.5’s multimodal capabilities. Image originally from &lt;a href="https://arxiv.org/abs/2310.03744"&gt;LLaVA-1.5 paper.&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;2023 was the year of LLMs after all, so we were bound to see LLMs go multimodal. LLMs are so powerful, the argument went, but they can only natively process text. If we want to let LLMs loose in the real world as agents, they will need other senses with which to perceive the world. &lt;/p&gt;

&lt;p&gt;Multimodal LLMs (MLLMs) bridge this modality gap by giving an LLM the ability to accept tokens of more than one modality as input. In most cases, a pre-trained LLM is connected to a vision module with an adapter, whose weights are tuned through a multimodal task like image-text matching or contrastive learning.&lt;/p&gt;
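&lt;p&gt;The adapter pattern can be sketched in a few lines. This is an illustrative LLaVA-style linear projection with random weights, not any particular model’s implementation:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
d_vision, d_llm = 768, 4096    # vision feature dim, LLM embedding dim
n_patches, n_text = 16, 8      # image patch tokens, text tokens

# Output of a frozen vision encoder: one feature vector per image patch.
patch_features = rng.normal(size=(n_patches, d_vision))

# The "adapter": a learned linear projection that maps vision features
# into the LLM's token-embedding space (the LLaVA-style design).
W_adapter = 0.02 * rng.normal(size=(d_vision, d_llm))
visual_tokens = patch_features @ W_adapter

# Text tokens embedded by the LLM's usual embedding table (stubbed here).
text_embeddings = rng.normal(size=(n_text, d_llm))

# The LLM then consumes a single interleaved sequence: image tokens first,
# followed by the text tokens of the prompt.
input_sequence = np.concatenate([visual_tokens, text_embeddings], axis=0)
```

&lt;p&gt;During multimodal tuning, only the adapter weights (and sometimes the LLM) are updated, which is why this recipe is so much cheaper than training a multimodal model from scratch.&lt;/p&gt;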

&lt;p&gt;The MLLMs which made the most noise were OpenAI’s &lt;a href="https://openai.com/research/gpt-4v-system-card"&gt;GPT-4 Vision&lt;/a&gt; and Google DeepMind’s &lt;a href="https://blog.google/technology/ai/google-gemini-ai/"&gt;Gemini&lt;/a&gt;. Additional noteworthy (and open-source!) multimodal LLMs include &lt;a href="https://llava-vl.github.io/"&gt;LLaVA&lt;/a&gt;, &lt;a href="https://github.com/THUDM/CogVLM"&gt;CogVLM&lt;/a&gt;, &lt;a href="https://medium.com/voxel51/neurips-2023-survival-guide-2f957d5b07c9#c8e2"&gt;InstructBLIP&lt;/a&gt;, &lt;a href="https://www.adept.ai/blog/fuyu-8b"&gt;Fuyu-8B&lt;/a&gt;, and &lt;a href="https://medium.com/voxel51/neurips-2023-survival-guide-2f957d5b07c9#e062"&gt;IDEFICS&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://try.fiftyone.ai/datasets/llava-instruct/samples"&gt;Browse the LLaVA-Instruct dataset online&lt;/a&gt; and &lt;a href="https://voxel51.com/blog/understanding-llava-large-language-and-vision-assistant/"&gt;dive deep into LLaVA&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jacobmarks/gpt4-vision-plugin"&gt;Chat with your images using GPT-4 Vision&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/jacobmarks/vqa-plugin"&gt;Run VQA on your images using LLaVA-13B, Fuyu-8B, and BLIPv2 &lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2 id="10"&gt;LLM-Aided Visual Reasoning&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SLsmCPnT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cxgo4i8envxo8dozct3c.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SLsmCPnT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cxgo4i8envxo8dozct3c.gif" alt="Image description" width="600" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Illustration of ViperGPT combining the general reasoning capabilities of LLMs with expert vision models to answer visual questions. Video originally from &lt;a href="https://viper.cs.columbia.edu/"&gt;ViperGPT project page&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An alternative approach to bridging the modality gap is to use LLMs as reasoning engines, and allow them to invoke vision models. This approach disentangles the visual understanding and logical reasoning generally present in multimodal tasks, reducing the burden placed on vision models. &lt;/p&gt;

&lt;p&gt;LLMs can act as reasoning engines, determining what specific vision tasks need to be performed, delegating the execution of these tasks to expert models, and drawing conclusions based on the outputs from these models. Such an approach is inherently modular (vision models can be added or replaced) and more interpretable (failures can be traced back to specific reasoning steps). &lt;/p&gt;
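&lt;p&gt;In miniature, the pattern looks like the sketch below. The planner and the “expert models” are stubs standing in for real detectors and captioners; in a real system (à la ViperGPT or HuggingGPT), the LLM itself would emit the plan:&lt;/p&gt;

```python
# Minimal sketch of the LLM-as-reasoning-engine pattern.
# The plan() step stands in for an LLM deciding which expert vision
# tools to invoke; the tools stand in for real detection/captioning models.

def detect_objects(image):
    # Stand-in for an expert detector (returns canned output).
    return ["dog", "frisbee"]

def caption_image(image):
    # Stand-in for an expert captioner (returns canned output).
    return "a dog catching a frisbee in a park"

TOOLS = {"detect": detect_objects, "caption": caption_image}

def plan(question):
    # Stand-in for the LLM's reasoning step: choose tools for the question.
    if "happening" in question.lower():
        return ["caption"]
    return ["detect"]

def answer(question, image):
    observations = {tool: TOOLS[tool](image) for tool in plan(question)}
    # The LLM would reason over these observations; we summarize directly.
    if "detect" in observations:
        return "Objects found: " + ", ".join(observations["detect"])
    return observations["caption"]
```

&lt;p&gt;Because each step is an explicit tool call, a wrong answer can be traced to a specific model or reasoning step, and any tool can be swapped out without retraining the system.&lt;/p&gt;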

&lt;p&gt;In 2023, we saw a variety of viral projects fitting this mold, including &lt;a href="https://chameleon-llm.github.io/"&gt;Chameleon&lt;/a&gt;, &lt;a href="https://huggingface.co/spaces/microsoft/HuggingGPT"&gt;HuggingGPT&lt;/a&gt;, &lt;a href="https://prior.allenai.org/projects/visprog"&gt;VisProg&lt;/a&gt;, and &lt;a href="https://viper.cs.columbia.edu/"&gt;ViperGPT&lt;/a&gt;. The last of these, ViperGPT, set a new SOTA for zero-shot visual question answering and grounded question answering tasks!&lt;/p&gt;

&lt;p&gt;📚 Additional Resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/voxel51/voxelgpt"&gt;VoxelGPT: An AI Assistant for Computer Vision&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/spaces/microsoft/visual_chatgpt"&gt;Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/blog/vision_language_pretraining"&gt;A Dive into Vision-Language Models&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=5joBkbTy2yQ"&gt;YouTube Video: How LLMs are Transforming Computer Vision&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This post only scratches the surface of the vast ocean of advances we saw in 2023. If you enjoyed this and want to dive deeper into projects from specific conferences, check out these collections of 10 papers you won’t want to miss from &lt;a href="https://medium.com/voxel51/cvpr-2023-survival-guide-504e965e1f8b"&gt;CVPR 2023&lt;/a&gt;, &lt;a href="https://medium.com/voxel51/iccv-2023-survival-guide-10-papers-you-wont-want-to-miss-97b8a8e14e52"&gt;ICCV 2023&lt;/a&gt;, or &lt;a href="https://medium.com/voxel51/neurips-2023-survival-guide-2f957d5b07c9"&gt;NeurIPS 2023&lt;/a&gt;. For last year’s recap of the top computer vision developments, check out &lt;a href="https://medium.com/voxel51/why-2022-was-the-most-exciting-year-in-computer-vision-history-so-far-7a4ab8693b27"&gt;Why 2022 was the most exciting year in computer vision history (so far)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here are 10 other incredibly cool developments that I did not have space to cover but that still deserve recognition (in alphabetical order):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://github.com/HumanAIGC/AnimateAnyone"&gt;Animate Anyone&lt;/a&gt;: Consistent and Controllable Image-to-Video Synthesis for Character Animation (Note: code not yet available)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://medium.com/voxel51/iccv-2023-survival-guide-10-papers-you-wont-want-to-miss-97b8a8e14e52#ffb3"&gt;DEVA&lt;/a&gt;: Tracking Anything with Decoupled Video Segmentation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://emu-edit.metademolab.com/"&gt;Emu Edit&lt;/a&gt;: Precise Image Editing via Recognition and Generation Tasks&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/IDEA-Research/GroundingDINO"&gt;GroundingDINO&lt;/a&gt;: State-of-the-art zero-shot object detector&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/facebookresearch/ImageBind"&gt;ImageBind&lt;/a&gt;: One Embedding Space To Bind Them All&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://leditsplusplus-project.static.hf.space/index.html"&gt;LEDITS++&lt;/a&gt;: Limitless Image Editing using Text-to-Image Models&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://osu-nlp-group.github.io/MagicBrush/"&gt;MagicBrush&lt;/a&gt;: A Manually Annotated Dataset for Instruction-Guided Image Editing&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/magic-research/magic-animate"&gt;MagicAnimate&lt;/a&gt;: Temporally Consistent Human Image Animation using Diffusion Model&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://sadtalker.github.io/"&gt;SadTalker&lt;/a&gt;: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://stability.ai/news/stable-video-diffusion-open-ai-video-model"&gt;Stable Video Diffusion&lt;/a&gt; (Note: add SVD videos directly to your dataset with &lt;a href="https://voxel51.com/blog/computer-vision-generating-videos-from-images-with-stable-video-diffusion-and-fiftyone/"&gt;this workflow&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s see what 2024 has in store!&lt;/p&gt;

</description>
      <category>computervision</category>
      <category>ai</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
