<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Markus Stoll</title>
    <description>The latest articles on Forem by Markus Stoll (@markusstoll).</description>
    <link>https://forem.com/markusstoll</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1110648%2F58c91185-7335-4808-9c32-47879f9d272b.jpeg</url>
      <title>Forem: Markus Stoll</title>
      <link>https://forem.com/markusstoll</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/markusstoll"/>
    <language>en</language>
    <item>
      <title>How to build an interactive HF 🤗 Space to visualize an Image Dataset</title>
      <dc:creator>Markus Stoll</dc:creator>
      <pubDate>Mon, 18 Dec 2023 13:30:57 +0000</pubDate>
      <link>https://forem.com/markusstoll/how-to-build-an-interactive-hf-space-to-visualize-an-image-dataset-3mng</link>
      <guid>https://forem.com/markusstoll/how-to-build-an-interactive-hf-space-to-visualize-an-image-dataset-3mng</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s99I5nmy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/645e27fa0c5080bd7750a315/N4VecZlbX1whma5CGdLeX.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s99I5nmy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/645e27fa0c5080bd7750a315/N4VecZlbX1whma5CGdLeX.gif" alt="image/gif" width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interactive visualization of CIFAR-10 [1] with &lt;a href="https://github.com/Renumics/spotlight"&gt;Spotlight&lt;/a&gt;; Source: created by the author.&lt;br&gt;
  &lt;/p&gt;

&lt;p&gt;The &lt;a href="https://huggingface.co/"&gt;Hugging Face&lt;/a&gt; ecosystem provides a rich range of datasets, including unstructured data types like images, videos, and audio. These datasets are widely used for training and validating many models available inside and outside the Hugging Face Hub.&lt;/p&gt;

&lt;p&gt;Datasets with unstructured data are often overwhelming in size, containing far more images than anyone could review individually. Using foundation models to create embeddings brings structure to this data. By employing dimension reduction techniques like t-SNE or UMAP, you can generate similarity maps that make it easier to navigate through the data.&lt;/p&gt;

&lt;p&gt;This article offers a tutorial on creating a Hugging Face space with an interactive visualization of an image dataset using &lt;a href="https://github.com/Renumics/spotlight"&gt;Renumics Spotlight&lt;/a&gt;. The visualization includes a similarity map, filters, and statistics to navigate the data along with the ability to review each image in detail.&lt;/p&gt;

&lt;h2&gt;
  
  
  1 Load the dataset
&lt;/h2&gt;

&lt;p&gt;First install the required dependencies:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install renumics-spotlight datasets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now you can load the image dataset for which you wish to create the visualization. As an example, &lt;a href="https://www.cs.toronto.edu/~kriz/cifar.html"&gt;CIFAR-10&lt;/a&gt; [1] is used here. CIFAR-10 is a benchmark dataset in computer vision for image classification: it contains 60,000 small color images of 32x32 pixels across 10 classes. For our analysis, we will focus on the 10,000 test images. You can also choose your own dataset or any of the &lt;a href="https://huggingface.co/datasets?task_categories=task_categories:image-classification&amp;amp;sort=trending"&gt;image classification datasets from Hugging Face&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;
    &lt;span class="c1"&gt;# load dataset containing raw data (images and labels)
&lt;/span&gt;    &lt;span class="n"&gt;ds&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load_dataset&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cifar10&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;split&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;test&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  2 Create embeddings for the dataset
&lt;/h2&gt;

&lt;p&gt;Embeddings created with foundation models bring structure to unstructured image data. They carry semantic information for tasks like data exploration, generating insights, and detecting outliers. By projecting images into a lower-dimensional space, these embeddings let you explore similarities in the data through similarity maps created with techniques like &lt;a href="https://medium.com/towards-data-science/tsne-vs-umap-global-structure-4d8045acba17"&gt;t-SNE or UMAP&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AcDLAxVg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/645e27fa0c5080bd7750a315/qStfY_pcR5XQnSZcu4Kod.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AcDLAxVg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/645e27fa0c5080bd7750a315/qStfY_pcR5XQnSZcu4Kod.png" alt="image/png" width="800" height="569"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;UMAP of CIFAR-10 with selected clusters of similar images; Source: created by the author.&lt;br&gt;
  &lt;/p&gt;
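
&lt;p&gt;A similarity map like the one above comes from reducing the high-dimensional embeddings to 2D. A minimal sketch using scikit-learn's t-SNE on random stand-in vectors (the real embeddings are produced by the ViT model below; 768 matches ViT-base's hidden size):&lt;/p&gt;

```python
import numpy as np
from sklearn.manifold import TSNE

# stand-in embeddings; in practice these come from a foundation model
embeddings = np.random.rand(100, 768).astype(np.float32)

# project to 2D; nearby points correspond to similar images
points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(points.shape)  # (100, 2)
```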

&lt;p&gt;We recommend storing your embeddings in a second Hugging Face dataset, separate from the original image dataset. You can use the &lt;a href="https://huggingface.co/docs/transformers"&gt;transformers library&lt;/a&gt; to compute the embeddings, e.g., with the &lt;a href="https://huggingface.co/google/vit-base-patch16-224-in21k"&gt;google/vit-base-patch16-224-in21k&lt;/a&gt; [2] model. Use the following infer function&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;# load model and define inference functions
&lt;/span&gt;    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;
    &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt;

    &lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;device&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cuda:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cuda&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_available&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cpu&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;google/vit-base-patch16-224-in21k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ViTImageProcessor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;cls_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ViTForImageClassification&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;device&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;fe_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ViTModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;infer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;images&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;batch&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;processor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_tensors&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;to&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;no_grad&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;outputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;cls_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;probs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;functional&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;softmax&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;logits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dim&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;fe_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;last_hidden_state&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;cpu&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;numpy&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;to extract the embeddings. They are stored in a new dataset, ds_enrichments, with a single column, embedding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="c1"&gt;# enrich dataset with predictions and embeddings
&lt;/span&gt;    &lt;span class="n"&gt;ds_enrichments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;infer&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;input_columns&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;img&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batched&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;batch_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;remove_columns&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;img&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3 Try the visualization locally
&lt;/h2&gt;

&lt;p&gt;Before publishing the embeddings, we can review the results in Spotlight:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;renumics&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;spotlight&lt;/span&gt;
    &lt;span class="n"&gt;ds_enriched&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;datasets&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate_datasets&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;ds&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ds_enrichments&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;spotlight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;show&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ds_enriched&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;embedding&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="n"&gt;spotlight&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Embedding&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will open a new browser window:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8t7BIOF2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/645e27fa0c5080bd7750a315/J7p-hwQ_FUUzDp1bvMNIN.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8t7BIOF2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/645e27fa0c5080bd7750a315/J7p-hwQ_FUUzDp1bvMNIN.png" alt="image/png" width="800" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the visualization, the top left displays a table showing all the fields present in the dataset. On the top right, you can see the UMAP representation of the embeddings generated by the foundation model. At the bottom, the selected images are displayed.&lt;/p&gt;

&lt;h2&gt;
  
  
  4 Publish the embeddings on the Hugging Face Hub
&lt;/h2&gt;

&lt;p&gt;When you are satisfied with the results, you can publish the embeddings as a new dataset on Hugging Face:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;login&lt;/span&gt;
    &lt;span class="nf"&gt;login&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_repo&lt;/span&gt;
    &lt;span class="n"&gt;USERNAME&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_ACCOUNT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nf"&gt;create_repo&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;USERNAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/cifar10-enrichments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;repo_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;dataset&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;ds_enrichments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;push_to_hub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;USERNAME&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/cifar10-enrichments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  5 Create a Hugging Face Space
&lt;/h2&gt;

&lt;p&gt;To showcase your dataset together with the embeddings on the Hugging Face Hub, you can use Hugging Face Spaces to launch a Spotlight visualization for it. Duplicate the prepared &lt;a href="https://huggingface.co/spaces/renumics/spotlight-mnist"&gt;example&lt;/a&gt; Space for the MNIST image dataset and specify your datasets in the HF_DATASET and HF_ENRICHMENT variables:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wVL57kCE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/645e27fa0c5080bd7750a315/lM_VzSUo5k4OsKxuzc3kX.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wVL57kCE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/645e27fa0c5080bd7750a315/lM_VzSUo5k4OsKxuzc3kX.png" alt="image/png" width="800" height="433"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After a few minutes the space will be ready:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Nd0iK11y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/645e27fa0c5080bd7750a315/jDJR0f1rMvABQLzqkoLG-.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Nd0iK11y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-uploads.huggingface.co/production/uploads/645e27fa0c5080bd7750a315/jDJR0f1rMvABQLzqkoLG-.png" alt="image/png" width="800" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  6 Summary
&lt;/h2&gt;

&lt;p&gt;The article demonstrates how foundation models can structure large, unstructured image datasets like CIFAR-10 with embeddings. The use of &lt;a href="https://github.com/Renumics/spotlight"&gt;Renumics Spotlight&lt;/a&gt; in a Hugging Face space allows an interactive visualization of image datasets. This includes creating similarity maps using dimension reduction techniques like t-SNE or UMAP, enabling easier analysis and navigation of the data.&lt;/p&gt;

&lt;p&gt;Try this workflow on your own image dataset and explore the possibilities. After applying these techniques, feel free to &lt;a href="https://github.com/Renumics/spotlight"&gt;share your experience and feedback&lt;/a&gt; with us.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Alex Krizhevsky, &lt;a href="https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf"&gt;Learning Multiple Layers of Features from Tiny Images&lt;/a&gt; (2009), University of Toronto&lt;/p&gt;

&lt;p&gt;[2] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, &lt;a href="https://arxiv.org/abs/2010.11929"&gt;An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale&lt;/a&gt; (2020), arXiv&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Hacktoberfest Machine Learning Projects for JS/TS Developers 🎃</title>
      <dc:creator>Markus Stoll</dc:creator>
      <pubDate>Fri, 20 Oct 2023 05:27:52 +0000</pubDate>
      <link>https://forem.com/markusstoll/hacktoberfest-machine-learning-projects-for-jsts-developers-58cb</link>
      <guid>https://forem.com/markusstoll/hacktoberfest-machine-learning-projects-for-jsts-developers-58cb</guid>
      <description>&lt;p&gt;Finding machine learning projects that are suitable for JS/TS developers during &lt;a href="https://hacktoberfest.com/"&gt;Hacktoberfest&lt;/a&gt; can be daunting due to the overwhelming abundance of open-source projects. We’ve simplified this process, offering you a refined selection of opportunities where your coding skills can shine and make a real impact.&lt;/p&gt;

&lt;p&gt;Welcome to our curated list of open-source machine learning GUI projects built with JavaScript and TypeScript, all of which are currently open to contributions! If you are interested in visualization and GUIs for machine learning and are proficient in JavaScript or TypeScript, you’re in for a treat! Each project listed is not only aligned with Hacktoberfest but is also actively maintained and has a handful of issues for you to tackle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Selection Criteria:
&lt;/h3&gt;

&lt;p&gt;To guarantee the quality and relevance of the projects, we’ve applied a selection process. Each repository in this list meets the following criteria:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Repo Topics:&lt;/strong&gt; The projects are centered around machine learning and are participating in Hacktoberfest. They are developed using JavaScript or TypeScript.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recent Activity:&lt;/strong&gt; We’ve ensured that each listed project has had activity within the last 7 days.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Availability of Issues:&lt;/strong&gt; Each project has more than 3 open issues that are suitable for contributors to tackle during Hacktoberfest.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With these criteria in mind, we present to you a handpicked selection of projects where your contributions can make a significant impact. Not only will you be contributing to meaningful projects, but you’ll also be honing your skills and learning more about machine learning and web development. So, let’s dive into these exciting opportunities!&lt;/p&gt;

&lt;h2&gt;
  
  
  Renumics Spotlight — Explore Unstructured ML Data
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/Renumics/spotlight"&gt;Renumics Spotlight&lt;/a&gt; is a powerful tool for intuitively exploring unstructured datasets directly from dataframes. It’s designed to simplify data comprehension, allowing users to create interactive visualizations and leverage data enrichments like embeddings, predictions, and uncertainties to identify critical data clusters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jDx1sLUQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/2664/0%2AtZerHFMB-ehrpZg5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jDx1sLUQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/2664/0%2AtZerHFMB-ehrpZg5.gif" alt="Renumics Spotlight GUI, Source: [https://github.com/Renumics/spotlight](https://github.com/Renumics/spotlight)" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Diverse Data Support:&lt;/strong&gt; Seamlessly handles various unstructured data types, including images, audio, text, videos, time-series, and geometric data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interactive Visualizations:&lt;/strong&gt; Enables the quick generation of interactive visuals for enhanced data understanding and communication.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Python Integration:&lt;/strong&gt; Offers easy integration and usage with intuitive Python commands and compatibility with pandas DataFrames.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Recent Issues
&lt;/h3&gt;

&lt;p&gt;Renumics is currently participating in &lt;a href="https://hacktoberfest.com/"&gt;Hacktoberfest 2023&lt;/a&gt;. If you would like to contribute to Spotlight, the easiest way to start is to have a look at the &lt;a href="https://renumics.com/docs/development"&gt;Contribution Docs&lt;/a&gt; and the &lt;a href="https://github.com/Renumics/spotlight/blob/main/CONTRIBUTING.md"&gt;CONTRIBUTING.md&lt;/a&gt;. Every accepted PR earns you a limited-edition T-shirt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FyTmXYvu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AYxPxgatsAYB7D9C8HUyvCQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FyTmXYvu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/1%2AYxPxgatsAYB7D9C8HUyvCQ.png" alt="Every accepted PR [earns you a limited-edition T-Shirt](https://renumics.com/blog/hacktoberfest_2023#-whats-in-store)." width="709" height="858"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Some current issues include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/Renumics/spotlight/issues/283"&gt;Support sequnces in spectrogram lens&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/Renumics/spotlight/issues/280"&gt;Support Windows in Sequence(s) Lens&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/Renumics/spotlight/issues/279"&gt;ROUGE Score Lens&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  CML (Continuous Machine Learning)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/iterative/cml"&gt;Continuous Machine Learning (CML)&lt;/a&gt; is an open-source command-line interface tool designed to enhance continuous integration and delivery (CI/CD) workflows, with a focus on Machine Learning Operations (MLOps). The tool facilitates automated development workflows, including machine provisioning, model training and evaluation, comparing machine learning experiments across your project’s history, and monitoring changing datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qRQTY_vR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2400/0%2Al3nMgbur0sppScBs" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qRQTY_vR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2400/0%2Al3nMgbur0sppScBs" alt="*An example report for a [neural style transfer model](https://github.com/iterative/cml_cloud_case) (source [*https://github.com/iterative/cml](https://github.com/iterative/cml))" width="800" height="754"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Core Features and Principles
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GitFlow for Data Science:&lt;/strong&gt; CML encourages data scientists and engineers to leverage GitLab or GitHub for managing ML experiments. Track modifications in data, monitor who trained models, and when, and codify data and models with DVC.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Auto Reports for ML Experiments:&lt;/strong&gt; Automatically generate visual reports containing metrics and plots with each pull request to make informed, data-driven decisions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-contained:&lt;/strong&gt; Build your customized ML platform using existing resources without the need for additional databases, services, or complex setups.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Recent Issues
&lt;/h3&gt;

&lt;p&gt;With 67 active issues, the community and contributors have ample opportunities to contribute during this Hacktoberfest. Some current issues include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/iterative/cml/issues/1369"&gt;cml publish ignores paths with spaces&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/iterative/cml/issues/1282"&gt;Be able to create arm-based CML runners on AWS&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/iterative/cml/issues/1167"&gt;Allow -cloud-startup-script to specify a file&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Inclusive Code Reviews: Browser Extension
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/jonathanpeppers/inclusive-code-reviews-browser"&gt;Inclusive Code Reviews&lt;/a&gt; is a prototype Chrome and Edge web extension aimed at improving the quality and inclusivity of online comments, particularly in the context of code reviews on platforms like GitHub or Azure DevOps. The extension prompts users with suggestions before they post a comment, allowing developers an opportunity to reevaluate and refine their phrasing to ensure it is constructive and inclusive.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6dUwgdsT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AB9PxFB_evQYD5AIB.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6dUwgdsT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/2000/0%2AB9PxFB_evQYD5AIB.png" alt="Inclusive Code Reviews: Browser Extension, source: [https://github.com/jonathanpeppers/inclusive-code-reviews-browser](https://github.com/jonathanpeppers/inclusive-code-reviews-browser)" width="450" height="618"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Word and Terminology Suggestions:&lt;/strong&gt; The extension provides alternative terms that are more inclusive, fostering a positive and welcoming environment for collaborators.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Sentiment Analysis:&lt;/strong&gt; Empowered by a custom machine learning model, the extension classifies comments to identify and moderate negative sentiments, ensuring communications are respectful and encouraging.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;OpenAI Integration:&lt;/strong&gt; An experimental feature that allows the utilization of OpenAI API for enhancing the identification and suggestion process of negative sentiments.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  How It Works
&lt;/h3&gt;

&lt;p&gt;As you draft a comment or code review, the extension evaluates the content and provides real-time feedback. For instance, if you use the term “whitelist,” the extension will suggest the more inclusive term “allowlist.” Furthermore, it employs a custom machine learning model that runs within the browser extension to classify comments and ensure they are positive and constructive.&lt;/p&gt;
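&lt;p&gt;The term-suggestion step can be sketched in a few lines of Python. This is an illustration of the idea only, not the extension’s code (the extension itself runs in the browser with its own word list and an in-browser ML model), and the mapping below is a made-up example:&lt;/p&gt;

```python
# Minimal sketch of inclusive-term suggestions (illustrative only; the
# real extension ships its own word list and an in-browser model).
import re

# Hypothetical mapping of flagged terms to more inclusive alternatives.
SUGGESTIONS = {
    "whitelist": "allowlist",
    "blacklist": "denylist",
}

def suggest(comment: str):
    """Return (term, suggestion) pairs found in a draft comment."""
    found = []
    for term, alternative in SUGGESTIONS.items():
        if re.search(r"\b" + term + r"\b", comment, re.IGNORECASE):
            found.append((term, alternative))
    return found

print(suggest("Please add the host to the whitelist."))
```

&lt;p&gt;A complete implementation would also need the sentiment-classification half; only the term-replacement half is shown here.&lt;/p&gt;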

&lt;h3&gt;
  
  
  Contribution Opportunities
&lt;/h3&gt;

&lt;p&gt;With 27 open issues, there is plenty of room for developers to contribute and refine the extension’s functionalities. Some highlighted issues include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/jonathanpeppers/inclusive-code-reviews-browser/issues/246"&gt;Sometimes bad sentences at end of line are not underlined?&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/jonathanpeppers/inclusive-code-reviews-browser/issues/245"&gt;New PC, privacy policy?&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/jonathanpeppers/inclusive-code-reviews-browser/issues/216"&gt;Add ability only opt into certain websites&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  BeatBridge — A Music Player with a Recommendation Engine
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/pooranjoyb/BeatBridge"&gt;BeatBridge&lt;/a&gt; is a dynamic web application, a music player that not only allows users to enjoy their favorite tunes but also offers personalized song recommendations, enhancing the musical experience. Developed with React and TailwindCSS, and empowered by the Spotify API, BeatBridge is a convergence of aesthetic design, seamless user experience, and intelligent recommendations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Features
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spotify API Integration:&lt;/strong&gt; Fetch and play songs directly via the highly reliable and extensive Spotify API.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Interactive GUI:&lt;/strong&gt; A user-friendly and intuitive graphical user interface ensuring an enjoyable user experience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recommendation Engine:&lt;/strong&gt; A clustering-based recommendation engine that suggests songs tailored to the users’ musical taste.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
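&lt;p&gt;To illustrate what a clustering-based recommendation engine can look like, here is a minimal scikit-learn sketch. The track names and audio features are invented for the example; this is not BeatBridge’s actual pipeline:&lt;/p&gt;

```python
# Sketch of a clustering-based song recommender (illustrative only;
# the feature values are made up, not real Spotify audio features).
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical audio features per track: (danceability, energy, valence)
tracks = {
    "Track A": [0.90, 0.80, 0.70],
    "Track B": [0.85, 0.75, 0.65],
    "Track C": [0.20, 0.30, 0.10],
    "Track D": [0.15, 0.25, 0.20],
}
names = list(tracks)
X = np.array([tracks[n] for n in names])

# Group songs into clusters of similar audio profiles.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

def recommend(seed: str):
    """Recommend tracks from the same cluster as the seed track."""
    label = kmeans.labels_[names.index(seed)]
    return [n for n, l in zip(names, kmeans.labels_) if l == label and n != seed]

print(recommend("Track A"))
```

&lt;p&gt;With real listening data, the clusters would be fit on per-user or per-track feature vectors instead of this toy table.&lt;/p&gt;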

&lt;h3&gt;
  
  
  Getting Involved
&lt;/h3&gt;

&lt;p&gt;BeatBridge invites developers and music enthusiasts to contribute and enhance the app’s features and functionalities. Here are some current open issues you can work on:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/pooranjoyb/BeatBridge/issues/27"&gt;Create a Footer&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/pooranjoyb/BeatBridge/issues/24"&gt;Music Player UI breaks in Full Screen&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://github.com/pooranjoyb/BeatBridge/issues/33"&gt;Using Framer-motion’s whileInView prop&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Your Next Steps Forward
&lt;/h2&gt;

&lt;p&gt;Each project offers a unique blend of challenges and learning opportunities, inviting you to contribute and grow your skills and knowledge in the dynamic world of open source. Choose a project that resonates with you, select an issue, and make an impact 🚀.&lt;/p&gt;

</description>
      <category>javascript</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How I Created an Animation Of the Embeddings During Fine-Tuning</title>
      <dc:creator>Markus Stoll</dc:creator>
      <pubDate>Thu, 05 Oct 2023 10:32:02 +0000</pubDate>
      <link>https://forem.com/markusstoll/how-i-created-an-animation-of-the-embeddings-during-fine-tuning-17ln</link>
      <guid>https://forem.com/markusstoll/how-i-created-an-animation-of-the-embeddings-during-fine-tuning-17ln</guid>
      <description>&lt;h2&gt;
  
  
  How I Created an Animation Of the Embeddings During Fine-Tuning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Using Cleanlab, PCA, and Procrustes to visualize ViT fine-tuning on CIFAR-10
&lt;/h3&gt;

&lt;p&gt;In the field of machine learning, &lt;a href="https://medium.com/analytics-vidhya/vision-transformers-bye-bye-convolutions-e929d022e4ab" rel="noopener noreferrer"&gt;Vision Transformers&lt;/a&gt; (ViT) are a type of model used for image classification. Unlike traditional convolutional neural networks, ViTs use the transformer architecture, which was originally designed for natural language processing tasks, to process images. Fine-tuning these models for optimal performance can be a complex process.&lt;/p&gt;

&lt;p&gt;In a &lt;a href="https://dev.to/markusstoll/changes-of-embeddings-during-fine-tuning-of-transformers-4e2i"&gt;previous article&lt;/a&gt;, I used an animation to demonstrate changes in the embeddings during the fine-tuning process. This was achieved by performing Principal Component Analysis (PCA) on the embeddings. These embeddings were generated from models at various stages of fine-tuning and their corresponding checkpoints.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2320%2F1%2AjYdWl_8UM6ecV1ux8_qr1Q.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2320%2F1%2AjYdWl_8UM6ecV1ux8_qr1Q.gif"&gt;&lt;/a&gt;&lt;br&gt;Projection of embeddings with PCA during fine-tuning of a Vision Transformer (ViT) model [1] on CIFAR10 [3]; Source: created by the author
  &lt;/p&gt;

&lt;p&gt;The animation gathered over 200,000 impressions and was well received, with many readers expressing interest in how it was created. This article is here to support those readers and anyone else interested in creating similar visualizations.&lt;/p&gt;

&lt;p&gt;In this article, I aim to provide a comprehensive guide on how to create such an animation, detailing the steps involved: fine-tuning, creation of embeddings, outlier detection, PCA, Procrustes, review, and creation of the animation.&lt;/p&gt;

&lt;p&gt;The complete code for the animation is also available in &lt;a href="https://github.com/Renumics/spotlight/blob/main/playbook/stories/making_of_embeddings_animation.ipynb" rel="noopener noreferrer"&gt;the accompanying notebook on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Preparation: Fine-tuning
&lt;/h2&gt;

&lt;p&gt;The first step is to fine-tune the pre-trained &lt;a href="https://huggingface.co/google/vit-base-patch16-224-in21k" rel="noopener noreferrer"&gt;google/vit-base-patch16-224-in21k&lt;/a&gt; Vision Transformer (ViT) model [1]. We use the &lt;a href="https://www.cs.toronto.edu/~kriz/cifar.html" rel="noopener noreferrer"&gt;CIFAR-10 dataset&lt;/a&gt; [2] for this, which contains 60,000 images classified into ten classes: airplanes, cars, birds, cats, deer, dogs, frogs, horses, ships, and trucks.&lt;/p&gt;

&lt;p&gt;You can follow the steps outlined in the &lt;a href="https://huggingface.co/docs/transformers/tasks/image_classification" rel="noopener noreferrer"&gt;Hugging Face tutorial for image classification with transformers&lt;/a&gt; to carry out the fine-tuning process for CIFAR-10 as well. Additionally, we utilize a TrainerCallback to store the loss values during training in a CSV file for later use in the animation.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from transformers import TrainerCallback

class PrinterCallback(TrainerCallback):
    def on_log(self, args, state, control, logs=None, **kwargs):
        _ = logs.pop("total_flos", None)
        if state.is_local_process_zero:
            if len(logs) == 3:  # skip last row
                with open("log.csv", "a") as f:
                    f.write(",".join(map(str, logs.values())) + "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;It’s important to save checkpoints frequently by setting save_strategy="steps" and a low value for save_steps in the TrainingArguments to ensure enough checkpoints for the animation. Each frame in the animation corresponds to one checkpoint. A folder for each checkpoint and the CSV file are created during the training and are ready for further use.&lt;/p&gt;
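&lt;p&gt;A sketch of the corresponding TrainingArguments; the values below are illustrative, and save_steps should be chosen to yield the number of frames you want:&lt;/p&gt;

```python
# Illustrative values only; tune save_steps to the desired frame count.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./vit-cifar10-checkpoints",  # one subfolder per checkpoint
    save_strategy="steps",   # save a checkpoint every save_steps steps
    save_steps=100,          # low value -> many checkpoints -> many frames
    logging_strategy="steps",
    logging_steps=100,       # log the loss at the same interval
)
```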

&lt;h2&gt;
  
  
  Creation of the Embeddings
&lt;/h2&gt;

&lt;p&gt;We use the AutoFeatureExtractor and AutoModel from the Transformers library to generate embeddings from the CIFAR-10 dataset’s test split using different model checkpoints.&lt;/p&gt;

&lt;p&gt;Each embedding is a 768-dimensional vector representing one of the 10,000 test images for one model checkpoint. These embeddings can be stored in the same folder as the checkpoints to maintain a good overview.&lt;/p&gt;
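&lt;p&gt;Storing and reloading the embeddings is straightforward with NumPy. A minimal sketch (the folder layout and file name are assumptions for illustration):&lt;/p&gt;

```python
# Sketch of saving/loading embeddings next to a checkpoint
# (the "embeddings.npy" file name is an assumed convention).
import os
import tempfile
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10000, 768))  # stand-in for real ViT embeddings

# One .npy file per checkpoint folder keeps frames and models together.
folder = tempfile.mkdtemp()
path = os.path.join(folder, "embeddings.npy")
np.save(path, embeddings)

loaded = np.load(path)
print(loaded.shape)  # (10000, 768)
```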

&lt;h2&gt;
  
  
  Extracting Outliers
&lt;/h2&gt;

&lt;p&gt;We can use the OutOfDistribution class provided by the &lt;a href="https://github.com/cleanlab/cleanlab" rel="noopener noreferrer"&gt;Cleanlab&lt;/a&gt; library to identify outliers based on the embeddings for each checkpoint. The resulting scores can then be used to pick the top 10 outliers for the animation.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from cleanlab.outlier import OutOfDistribution

&lt;p&gt;def get_ood(sorted_checkpoint_folder, df):&lt;br&gt;
  ...&lt;br&gt;
  ood = OutOfDistribution()&lt;br&gt;
  ood_train_feature_scores = ood.fit_score(features=embedding_np)&lt;br&gt;
  df["scores"] = ood_train_feature_scores&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
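&lt;p&gt;Selecting the top 10 outliers from the scores is then a one-liner with NumPy. A sketch with random stand-in scores; note that Cleanlab’s outlier scores lie in [0, 1] and lower values indicate more outlier-like examples:&lt;/p&gt;

```python
# Sketch: select the 10 most outlier-like examples.
# Cleanlab's outlier scores lie in [0, 1]; LOWER means more atypical.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.uniform(size=10000)  # stand-in for ood.fit_score(...) output

top10_idx = np.argsort(scores)[:10]  # indices of the 10 lowest scores
print(scores[top10_idx])
```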
&lt;h2&gt;
  
  
  Applying PCA and Procrustes Analysis
&lt;/h2&gt;

&lt;p&gt;With a &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html" rel="noopener noreferrer"&gt;Principal Component Analysis (PCA)&lt;/a&gt; from the scikit-learn package, we visualize the embeddings in a 2D space by reducing the 768-dimensional vectors to 2 dimensions. When the PCA is recalculated for each timestep, large jumps caused by axis flips or rotations can occur in the animation. To address this issue, we apply an additional &lt;a href="https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.procrustes.html" rel="noopener noreferrer"&gt;Procrustes Analysis&lt;/a&gt; [3] from the SciPy package to geometrically map each frame onto the last one using only translation, rotation, and uniform scaling. This enables smoother transitions in the animation.&lt;/p&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.decomposition import PCA&lt;br&gt;
from scipy.spatial import procrustes

&lt;p&gt;def make_pca(sorted_checkpoint_folder, pca_np):&lt;br&gt;
  ...&lt;br&gt;
  embedding_np_flat = embedding_np.reshape(-1, 768)&lt;br&gt;
  pca = PCA(n_components=2)&lt;br&gt;
  pca_np_new = pca.fit_transform(embedding_np_flat)&lt;br&gt;
  _, pca_np_new, disparity = procrustes(pca_np, pca_np_new)&lt;br&gt;
&lt;/p&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
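&lt;p&gt;The combination of PCA and Procrustes alignment can be tried end-to-end on synthetic data. This self-contained sketch is not the notebook’s code; it only demonstrates aligning two consecutive 2D projections:&lt;/p&gt;

```python
# Self-contained sketch of PCA followed by Procrustes alignment
# (synthetic data; the real pipeline uses 768-dim ViT embeddings).
import numpy as np
from sklearn.decomposition import PCA
from scipy.spatial import procrustes

rng = np.random.default_rng(0)
embeddings_prev = rng.normal(size=(500, 768))                    # previous checkpoint
embeddings_curr = embeddings_prev + 0.1 * rng.normal(size=(500, 768))

# Reduce each checkpoint's embeddings to 2D independently.
pca_prev = PCA(n_components=2).fit_transform(embeddings_prev)
pca_curr = PCA(n_components=2).fit_transform(embeddings_curr)

# Procrustes removes translation, rotation/reflection, and uniform
# scaling, so consecutive frames don't jump when PCA flips an axis.
prev_aligned, curr_aligned, disparity = procrustes(pca_prev, pca_curr)
print(curr_aligned.shape)
```

&lt;p&gt;Note that scipy.spatial.procrustes returns standardized versions of both matrices plus a disparity value that measures the remaining mismatch.&lt;/p&gt;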
&lt;h2&gt;
  
  
  Review in Spotlight
&lt;/h2&gt;

&lt;p&gt;Before finalizing the entire animation, we conduct a review in Spotlight. In this process, we utilize the first and last checkpoints to perform embedding generation, PCA, and outlier detection. We load the resulting DataFrame in Spotlight:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2512%2F1%2AJaL0nOH7W_OYOoCHXynjFw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2512%2F1%2AJaL0nOH7W_OYOoCHXynjFw.png"&gt;&lt;/a&gt;&lt;br&gt;Embeddings for CIFAR-10: PCA and 8 worst outliers for the first and the last checkpoint of a short fine-tuning— visualized with spotlight, source: created by the author
  &lt;/p&gt;

&lt;p&gt;Spotlight provides a comprehensive table in the top left, showcasing all the fields present in the dataset. On the top right, two PCA representations are displayed: one for the embeddings generated using the first checkpoint and one for the last checkpoint. Finally, in the bottom section, selected images are presented.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: The author of this article is also one of the developers of Spotlight.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Create the animation
&lt;/h2&gt;

&lt;p&gt;For each checkpoint, we create an image, which we then store alongside its corresponding checkpoint.&lt;/p&gt;

&lt;p&gt;This is achieved using the make_pca(...) and get_ood(...) functions, which generate the 2D points representing the embeddings and extract the top 8 outliers, respectively. The 2D points are plotted with colors corresponding to their respective classes. The outliers are sorted by score, and their corresponding images are displayed in a high-score leaderboard. The training loss is loaded from the CSV file and plotted as a line graph.&lt;/p&gt;

&lt;p&gt;Finally, all the images can be compiled into a GIF using libraries such as imageio or similar.&lt;/p&gt;
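&lt;p&gt;A minimal sketch with Pillow, assuming the frames are available as images (imageio works just as well); the frames here are solid-color placeholders:&lt;/p&gt;

```python
# Sketch: compile per-checkpoint frames into an animated GIF with Pillow.
# Frame generation is faked with solid-color images for illustration.
import os
import tempfile
from PIL import Image

frames = [Image.new("RGB", (320, 240), color=(i * 20, 80, 160)) for i in range(10)]

out_path = os.path.join(tempfile.mkdtemp(), "embeddings_animation.gif")
frames[0].save(
    out_path,
    save_all=True,            # write all frames, not just the first
    append_images=frames[1:],
    duration=200,             # milliseconds per frame
    loop=0,                   # loop forever
)
print(os.path.exists(out_path))
```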

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F1%2ADt6ahw5N9PMTEgzIvdDtsw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F1%2ADt6ahw5N9PMTEgzIvdDtsw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F1%2A0Parw7i8X05l39S5-oK1vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F1%2A0Parw7i8X05l39S5-oK1vg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F1%2AaTxbgtVql9zxntpcTYmtZQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3200%2F1%2AaTxbgtVql9zxntpcTYmtZQ.png" alt="Three generated frames from the initial three written checkpoints of the fine-tuning process show slight clustering. It is expected that more pronounced clustering will occur in later steps. source: created by the author"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article has provided a detailed guide on how to create an animation that visualizes the fine-tuning process of a Vision Transformer (ViT) model. We’ve walked through the steps of generating and analyzing embeddings, visualizing the results, and creating an animation that brings these elements together.&lt;/p&gt;

&lt;p&gt;Creating such an animation not only helps in understanding the complex process of fine-tuning a ViT model but also serves as a powerful tool for communicating these concepts to others.&lt;/p&gt;

&lt;p&gt;The complete code for the animation is available in the &lt;a href="https://github.com/Renumics/spotlight/tree/main/playbook/stories/making_of_embeddings_animation.ipynb" rel="noopener noreferrer"&gt;accompanying notebook on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I am a professional with expertise in creating advanced software solutions for the interactive exploration of unstructured data. I write about unstructured data and use powerful visualization tools to analyze and make informed decisions.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby, &lt;a href="https://arxiv.org/abs/2010.11929" rel="noopener noreferrer"&gt;An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale&lt;/a&gt; (2020), arXiv&lt;/p&gt;

&lt;p&gt;[2] Alex Krizhevsky, &lt;a href="https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf" rel="noopener noreferrer"&gt;Learning Multiple Layers of Features from Tiny Images&lt;/a&gt; (2009), University of Toronto&lt;/p&gt;

&lt;p&gt;[3] John C. Gower, &lt;a href="https://link.springer.com/article/10.1007/BF02291478" rel="noopener noreferrer"&gt;Generalized Procrustes Analysis&lt;/a&gt; (1975), Psychometrika&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>github</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Changes of Embeddings during Fine-Tuning of Transformers</title>
      <dc:creator>Markus Stoll</dc:creator>
      <pubDate>Fri, 14 Jul 2023 14:59:55 +0000</pubDate>
      <link>https://forem.com/markusstoll/changes-of-embeddings-during-fine-tuning-of-transformers-4e2i</link>
      <guid>https://forem.com/markusstoll/changes-of-embeddings-during-fine-tuning-of-transformers-4e2i</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qjbP132W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/2320/1%2AjYdWl_8UM6ecV1ux8_qr1Q.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qjbP132W--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_800/https://cdn-images-1.medium.com/max/2320/1%2AjYdWl_8UM6ecV1ux8_qr1Q.gif" alt="Projection of embeddings with PCA during fine-tuning of a Vision Transformer (ViT) model [1] on CIFAR10 [3]; Source: created by the author." width="800" height="806"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;Fine-tuning significantly influences embeddings in image classification. Pre-fine-tuning embeddings offer general-purpose representations, whereas post-fine-tuning embeddings capture task-specific features. This distinction can lead to varying outcomes in outlier detection and other tasks. Both pre-fine-tuning and post-fine-tuning embeddings have their unique strengths and should be used in combination to achieve a comprehensive analysis in image classification and analysis tasks.&lt;/p&gt;

&lt;p&gt;Check out one of the online demos of the &lt;a href="https://huggingface.co/datasets/renumics/cifar10-outlier"&gt;CIFAR-10&lt;/a&gt; dataset [3] for this article:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/spaces/renumics/cifar10-outlier"&gt;https://huggingface.co/spaces/renumics/cifar10-outlier&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  1 Introduction
&lt;/h2&gt;

&lt;p&gt;The use of pre-trained models on large datasets, such as &lt;a href="https://www.image-net.org/"&gt;ImageNet&lt;/a&gt;, followed by fine-tuning on specific target datasets, has become the default approach in image classification. However, when dealing with real-world target datasets, it is important to consider their inherent noise, which includes outliers, label errors, and other anomalies. Interactive exploration of datasets plays a crucial role in gaining a comprehensive understanding of the data, enabling the identification and resolution of critical data segments through the utilization of data enrichments.&lt;/p&gt;

&lt;p&gt;Embeddings play a crucial role in analyzing unstructured image data. They provide high-level semantic information and support various tasks such as data analysis, insight generation, and outlier detection. By representing images in a lower-dimensional space, embeddings make it easier to explore similarities and differences within the data and allow for the creation of similarity maps using techniques like &lt;a href="https://medium.com/towards-data-science/tsne-vs-umap-global-structure-4d8045acba17"&gt;t-SNE or UMAP&lt;/a&gt;. We will use &lt;a href="https://github.com/Renumics/spotlight"&gt;Spotlight&lt;/a&gt; to interactively explore the enriched datasets we create:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: The author of this article is also one of the developers of Spotlight. Some of the code snippets in this article are also available in the Spotlight repository.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In this article, we will delve into the differences between pre- and post-fine-tuning embeddings, with an additional focus on outlier detection. While it is important to note that using embeddings from fine-tuned models may not always yield the best results for &lt;a href="https://docs.cleanlab.ai/stable/tutorials/outliers.html"&gt;outlier detection as we could also use the probabilities&lt;/a&gt;, it still presents an intriguing approach. The visualization of embeddings adds a visually appealing dimension to the analysis process.&lt;/p&gt;

&lt;p&gt;To assess the performance and effectiveness of embeddings in outlier detection tasks, we will examine exemplary datasets that are widely used in image classification. Moreover, we will utilize two common foundation models. Through this exploration, we aim to gain insights into the effect of model fine-tuning on the embeddings, providing a better understanding of their capabilities and limitations.&lt;/p&gt;

&lt;h2&gt;
  
  
  2 Preparations
&lt;/h2&gt;

&lt;p&gt;Install the required Python Packages:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;!pip install renumics-spotlight datasets torch pandas cleanlab annoy
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  2.1 Extract Embeddings
&lt;/h2&gt;

&lt;p&gt;We will use the following models based on &lt;a href="https://huggingface.co/google/vit-base-patch16-224-in21k"&gt;google/vit-base-patch16-224-in21k&lt;/a&gt; [1] and &lt;a href="https://huggingface.co/microsoft/swin-base-patch4-window7-224"&gt;microsoft/swin-base-patch4-window7-224&lt;/a&gt; [2], available on Hugging Face, to extract pre-fine-tuning embeddings, together with the most popular fine-tuned models for each dataset: &lt;a href="https://huggingface.co/aaraki/vit-base-patch16-224-in21k-finetuned-cifar10"&gt;aaraki/vit-base-patch16-224-in21k-finetuned-cifar10&lt;/a&gt;, &lt;a href="https://huggingface.co/MazenAmria/swin-tiny-finetuned-cifar100"&gt;MazenAmria/swin-tiny-finetuned-cifar100&lt;/a&gt;, &lt;a href="https://huggingface.co/nateraw/vit-base-beans"&gt;nateraw/vit-base-beans&lt;/a&gt;, &lt;a href="https://huggingface.co/farleyknight/mnist-digit-classification-2022-09-04"&gt;farleyknight/mnist-digit-classification-2022-09-04&lt;/a&gt;.&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;case = {
    "cifar10": {
        "base_model_name": "google/vit-base-patch16-224-in21k",
        "ft_model_name": "aaraki/vit-base-patch16-224-in21k-finetuned-cifar10",
    },
    "beans": {
        "base_model_name": "google/vit-base-patch16-224-in21k",
        "ft_model_name": "nateraw/vit-base-beans",
    },
    "mnist": {
        "base_model_name": "google/vit-base-patch16-224-in21k",
        "ft_model_name": "farleyknight/mnist-digit-classification-2022-09-04",
    },
    "cifar100": {
        "base_model_name": "microsoft/swin-base-patch4-window7-224",
        "ft_model_name": "MazenAmria/swin-tiny-finetuned-cifar100",
    },
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To load the dataset, we utilize the load_dataset function from the datasets module and prepare it for the image classification task. You can choose from the datasets tested and reported in this article, &lt;a href="https://huggingface.co/datasets/cifar10"&gt;CIFAR-10&lt;/a&gt; [3], &lt;a href="https://huggingface.co/datasets/cifar100"&gt;CIFAR-100&lt;/a&gt; [3], &lt;a href="https://huggingface.co/datasets/mnist"&gt;MNIST&lt;/a&gt; [4] and &lt;a href="https://huggingface.co/datasets/beans"&gt;Beans&lt;/a&gt; [5], or try different &lt;a href="https://huggingface.co/models?pipeline_tag=image-classification"&gt;image classification datasets from Hugging Face&lt;/a&gt; with corresponding models.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import datasets
# choose from cifar10, cifar100, mnist or beans.
# corresponding model will be selected automatically
DATASET = "cifar10"
ds = datasets.load_dataset(DATASET, split="train").prepare_for_task(
    "image-classification"
)
df = ds.to_pandas()
# df = df.iloc[:1000] # uncomment to limit the dataset size for testing
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We define the huggingface_embedding function to extract embeddings from both the fine-tuned model and the base/foundation model. The embeddings are stored in separate columns ("embedding_ft" and "embedding_foundation") in the original dataframe (df):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import datasets
from transformers import AutoFeatureExtractor, AutoModel
import torch
import pandas as pd

ft_model_name = case[DATASET]["ft_model_name"]
base_model_name = case[DATASET]["base_model_name"]
def extract_embeddings(model, feature_extractor, image_name="image"):
    """
    Utility to compute embeddings.
    Args:
        model: huggingface model
        feature_extractor: huggingface feature extractor
        image_name: name of the image column in the dataset
    Returns:
        function to compute embeddings
    """
    device = model.device
    def pp(batch):
        images = batch[image_name]
        inputs = feature_extractor(
            images=[x.convert("RGB") for x in images], return_tensors="pt"
        ).to(device)
        embeddings = model(**inputs).last_hidden_state[:, 0].cpu()
        return {"embedding": embeddings}
    return pp

def huggingface_embedding(
    df,
    image_name="image",
    modelname="google/vit-base-patch16-224",
    batched=True,
    batch_size=24,
):
    """
    Compute embeddings using huggingface models.
    Args:
        df: dataframe with images
        image_name: name of the image column in the dataset
        modelname: huggingface model name
        batched: whether to compute embeddings in batches
        batch_size: batch size
    Returns:
        new dataframe with embeddings
    """
    # initialize huggingface model
    feature_extractor = AutoFeatureExtractor.from_pretrained(modelname)
    model = AutoModel.from_pretrained(modelname, output_hidden_states=True)
    # create huggingface dataset from df
    dataset = datasets.Dataset.from_pandas(df).cast_column(image_name, datasets.Image())
    # compute embedding
    device = "cuda" if torch.cuda.is_available() else "cpu"
    extract_fn = extract_embeddings(model.to(device), feature_extractor, image_name)
    updated_dataset = dataset.map(extract_fn, batched=batched, batch_size=batch_size)
    df_temp = updated_dataset.to_pandas()
    df_emb = pd.DataFrame()
    df_emb["embedding"] = df_temp["embedding"]
    return df_emb

embeddings_df = huggingface_embedding(
    df,
    modelname=ft_model_name,
    batched=True,
    batch_size=24,
)
embeddings_df_found = huggingface_embedding(
    df, modelname=base_model_name, batched=True, batch_size=24
)
df["embedding_ft"] = embeddings_df["embedding"]
df["embedding_foundation"] = embeddings_df_found["embedding"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  2.2 Calculate outlier score
&lt;/h2&gt;

&lt;p&gt;Next, we use &lt;a href="https://github.com/cleanlab/cleanlab"&gt;Cleanlab&lt;/a&gt; to calculate outlier scores for both the fine-tuned model and the base/foundation model based on the embeddings. We utilize the OutOfDistribution class to compute the outlier scores. The resulting outlier scores are stored in the original dataframe (df):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from cleanlab.outlier import OutOfDistribution
import numpy as np
import pandas as pd
def outlier_score_by_embeddings_cleanlab(df, embedding_name="embedding"):
    """
    Calculate outlier score by embeddings using cleanlab
        Args:
            df: dataframe with embeddings
            embedding_name: name of the column with embeddings
        Returns:
            new df_out: dataframe with outlier score
    """
    embs = np.stack(df[embedding_name].to_numpy())
    ood = OutOfDistribution()
    ood_train_feature_scores = ood.fit_score(features=np.stack(embs))
    df_out = pd.DataFrame()
    df_out["outlier_score_embedding"] = ood_train_feature_scores
    return df_out

df["outlier_score_ft"] = outlier_score_by_embeddings_cleanlab(
    df, embedding_name="embedding_ft"
)["outlier_score_embedding"]
df["outlier_score_found"] = outlier_score_by_embeddings_cleanlab(
    df, embedding_name="embedding_foundation"
)["outlier_score_embedding"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  2.3 Find nearest neighbor
&lt;/h2&gt;

&lt;p&gt;To evaluate the outliers, we calculate the nearest neighbor image with the &lt;a href="https://github.com/spotify/annoy"&gt;Annoy library&lt;/a&gt;, using the fine-tuned model only. The resulting images are stored in the original DataFrame (df):&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from annoy import AnnoyIndex
import pandas as pd
def nearest_neighbor_annoy(
    df, embedding_name="embedding", threshold=0.3, tree_size=100
):
    """
    Find nearest neighbor using annoy.
    Args:
        df: dataframe with embeddings
        embedding_name: name of the embedding column
        threshold: threshold for outlier detection
        tree_size: tree size for annoy
    Returns:
        new dataframe with nearest neighbor information
    """
    embs = df[embedding_name]
    t = AnnoyIndex(len(embs[0]), "angular")
    for idx, x in enumerate(embs):
        t.add_item(idx, x)
    t.build(tree_size)
    images = df["image"]
    df_nn = pd.DataFrame()
    nn_id = [t.get_nns_by_item(i, 2)[1] for i in range(len(embs))]
    df_nn["nn_id"] = nn_id
    df_nn["nn_image"] = [images[i] for i in nn_id]
    df_nn["nn_distance"] = [t.get_distance(i, nn_id[i]) for i in range(len(embs))]
    df_nn["nn_flag"] = df_nn.nn_distance &amp;lt; threshold
    return df_nn

df_nn = nearest_neighbor_annoy(
    df, embedding_name="embedding_ft", threshold=0.3, tree_size=100
)
df["nn_image"] = df_nn["nn_image"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  2.4 Visualize
&lt;/h2&gt;

&lt;p&gt;For visualization with &lt;a href="https://github.com/Renumics/spotlight"&gt;Spotlight&lt;/a&gt;, a new “label_str” column is created in the DataFrame by mapping the integer labels to their string representations using a lambda function. The dtypes dictionary specifies the data type of each column to get the proper visualization, while the layout determines the arrangement and the displayed columns:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from renumics import spotlight
df["label_str"] = df["labels"].apply(lambda x: ds.features["labels"].int2str(x))
dtypes = {
    "nn_image": spotlight.Image,
    "image": spotlight.Image,
    "embedding_ft": spotlight.Embedding,
    "embedding_foundation": spotlight.Embedding,
}
spotlight.show(
    df,
    dtype=dtypes,
    layout="https://spotlight.renumics.com/resources/layout_pre_post_ft.json",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This will open a new browser window:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2uvSXQ_2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4966/1%2ALY0o_i9H6AFRCIAO9hNp4A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2uvSXQ_2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/4966/1%2ALY0o_i9H6AFRCIAO9hNp4A.png" alt="Pre and post fine-tuning embeddings for CIFAR-10: UMAP and 8 worst outliers and their nearest neighbor in the dataset— visualized with [github.com/renumics/spotlight](http://github.com/renumics/spotlight), source: created by the author." width="800" height="424"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the visualization, the top left displays a comprehensive table showing all the fields present in the dataset. Images classified as outliers based on the foundation model's embeddings are selected. On the top right, you can observe two UMAP representations: the first represents the embeddings generated by the foundation model, while the second represents the embeddings from the fine-tuned model. At the bottom, the selected images are displayed together with their nearest neighbors in the dataset.&lt;/p&gt;
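&lt;p&gt;To pick such a selection programmatically rather than in the UI, you can sort the DataFrame by the outlier score column from section 2.2. Note that in cleanlab's convention, lower scores indicate stronger outliers; the helper below is a hypothetical sketch, not part of the original code:&lt;/p&gt;

```python
# Hypothetical helper (not in the article): pick the k strongest outliers.
# In cleanlab's convention, LOWER outlier scores mark more atypical examples,
# so the "worst" outliers are the rows with the smallest scores.
import pandas as pd

def worst_outliers(df, score_column="outlier_score_found", k=8):
    """Return the k rows with the lowest outlier scores, most atypical first."""
    return df.nsmallest(k, score_column)

# Toy frame to illustrate the call:
df = pd.DataFrame({"outlier_score_found": [0.1, 0.9, 0.5, 0.7]})
print(worst_outliers(df, k=2).index.tolist())  # [0, 2]
```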

&lt;h2&gt;
  
  
  3 Results
&lt;/h2&gt;

&lt;p&gt;Now let's check the results for all datasets. To reproduce them, you can go through all the steps of section 2 with the different input datasets, load the preprocessed datasets using the code snippets below, or check out the linked online demos.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.1 CIFAR-10
&lt;/h2&gt;

&lt;p&gt;Load the &lt;a href="https://huggingface.co/datasets/renumics/cifar10-outlier"&gt;prepared CIFAR-10 dataset&lt;/a&gt; [3] with&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from renumics import spotlight
import datasets
ds = datasets.load_dataset("renumics/cifar10-outlier", split="train")
df = ds.rename_columns({"img": "image", "label": "labels"}).to_pandas()
df["label_str"] = df["labels"].apply(lambda x: ds.features["label"].int2str(x))
dtypes = {
    "nn_image": spotlight.Image,
    "image": spotlight.Image,
    "embedding_ft": spotlight.Embedding,
    "embedding_foundation": spotlight.Embedding,
}
spotlight.show(
    df,
    dtype=dtypes,
    layout="https://spotlight.renumics.com/resources/layout_pre_post_ft.json",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;or check out the online demo at &lt;a href="https://huggingface.co/spaces/renumics/cifar10-outlier"&gt;https://huggingface.co/spaces/renumics/cifar10-outlier&lt;/a&gt; to examine the outliers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5Xo29js4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2APkpnDzxFX0EppIVGKtDldw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5Xo29js4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2APkpnDzxFX0EppIVGKtDldw.png" alt="Pre fine-tuning embeddings for cifar 10: UMAP and 6 worst outliers — visualized with [github.com/renumics/spotlight](http://github.com/renumics/spotlight), source: created by the author." width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sF6-IGj3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2AljtZ5LxIht8nR8ECVvKNWw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sF6-IGj3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2AljtZ5LxIht8nR8ECVvKNWw.png" alt="Post fine-tuning embeddings for cifar10: UMAP and 6 worst outliers — visualized with [github.com/renumics/spotlight](http://github.com/renumics/spotlight); source: created by the author." width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The UMAP visualization of the embeddings after fine-tuning reveals distinct patterns where certain classes are completely separated from all others, while some may be connected to only one or two other classes.&lt;/p&gt;

&lt;p&gt;The outliers detected in CIFAR-10 using pre-fine-tuning embeddings do not appear to be significantly uncommon, as they have relatively similar neighboring images. In contrast, the outliers identified with post-fine-tuning embeddings are distinct and highly uncommon within the dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.2 CIFAR-100
&lt;/h2&gt;

&lt;p&gt;Load the &lt;a href="https://huggingface.co/datasets/renumics/cifar100-outlier"&gt;prepared CIFAR-100 dataset&lt;/a&gt; [3] with&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from renumics import spotlight
import datasets
ds = datasets.load_dataset("renumics/cifar100-outlier", split="train")
df = ds.rename_columns({"img": "image", "fine_label": "labels"}).to_pandas()
df["label_str"] = df["labels"].apply(lambda x: ds.features["fine_label"].int2str(x))
dtypes = {
    "nn_image": spotlight.Image,
    "image": spotlight.Image,
    "embedding_ft": spotlight.Embedding,
    "embedding_foundation": spotlight.Embedding,
}
spotlight.show(
    df,
    dtype=dtypes,
    layout="https://spotlight.renumics.com/resources/layout_pre_post_ft.json",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;or check out the online demo at &lt;a href="https://huggingface.co/spaces/renumics/cifar100-outlier"&gt;huggingface.co/spaces/renumics/cifar100-outlier&lt;/a&gt; to examine the outliers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qrED_Jlv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2AEbiUlzXdL5_KpBM1mH4eBQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qrED_Jlv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2AEbiUlzXdL5_KpBM1mH4eBQ.png" alt="Pre fine-tuning embeddings for cifar100: UMAP and 6 worst outliers — visualized with [github.com/renumics/spotlight](http://github.com/renumics/spotlight); source: created by the author." width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jfGrIH79--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2Abr9-uLDC41rsVlD7VpwDEw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jfGrIH79--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2Abr9-uLDC41rsVlD7VpwDEw.png" alt="Post fine-tuning embeddings for cifar100: UMAP and 6 worst outliers — visualized with [github.com/renumics/spotlight](http://github.com/renumics/spotlight); source: created by the author." width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When examining the embeddings of CIFAR-100, which consists of 100 classes, we observe that even after fine-tuning, more classes remain connected than for CIFAR-10. However, the structure within the embedding space becomes noticeably more defined and organized.&lt;/p&gt;

&lt;p&gt;The pre-fine-tuning embeddings do not show clear outliers that stand out from their neighboring images, indicating limited effectiveness in outlier detection. However, when utilizing post-fine-tuning embeddings, the performance improves. Out of the six outliers identified, the first three are effectively detected as uncommon within the dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.3 MNIST
&lt;/h2&gt;

&lt;p&gt;Load the &lt;a href="https://huggingface.co/datasets/renumics/mnist-outlier"&gt;prepared MNIST dataset&lt;/a&gt; [4] with&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from renumics import spotlight
import datasets
ds = datasets.load_dataset("renumics/mnist-outlier", split="train")
df = ds.rename_columns({"label": "labels"}).to_pandas()
df["label_str"] = df["labels"].apply(lambda x: ds.features["label"].int2str(x))
dtypes = {
    "nn_image": spotlight.Image,
    "image": spotlight.Image,
    "embedding_ft": spotlight.Embedding,
    "embedding_foundation": spotlight.Embedding,
}
spotlight.show(
    df,
    dtype=dtypes,
    layout="https://spotlight.renumics.com/resources/layout_pre_post_ft.json",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;or check out the online demo at &lt;a href="https://huggingface.co/spaces/renumics/mnist-outlier"&gt;huggingface.co/spaces/renumics/mnist-outlier&lt;/a&gt; to examine the outliers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vh1ZI3Es--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2ASdiv-7gOdkhvXt4wZjG-BA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vh1ZI3Es--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2ASdiv-7gOdkhvXt4wZjG-BA.png" alt="Pre fine-tuning embeddings for mnist: UMAP and 6 worst outliers — visualized with [github.com/renumics/spotlight](http://github.com/renumics/spotlight); source: created by the author." width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NSiilvAM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2Arn85yW6nO3OZ_P1vsvcuug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NSiilvAM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2Arn85yW6nO3OZ_P1vsvcuug.png" alt="Post fine-tuning embeddings for mnist: UMAP and 6 worst outliers — visualized with [github.com/renumics/spotlight](http://github.com/renumics/spotlight); source: created by the author." width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;During the fine-tuning of MNIST, the embeddings experience significant changes. Pre-fine-tuning, there may be overlapping regions between different digit classes, making it challenging to distinguish them based on embedding proximity alone. However, after fine-tuning, the embeddings exhibit clearer separations between the digit classes.&lt;/p&gt;

&lt;p&gt;The pre-fine-tuning embeddings reveal only one outlier that stands out from the neighboring images, indicating a moderate performance in outlier detection. However, when utilizing post-fine-tuning embeddings, the detection of outliers improves. Approximately 3 to 4 outliers could be identified as highly uncommon within the dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  3.4 Beans
&lt;/h2&gt;

&lt;p&gt;Load the &lt;a href="https://huggingface.co/datasets/renumics/beans-outlier"&gt;prepared beans dataset&lt;/a&gt; [5] with&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from renumics import spotlight
import datasets
ds = datasets.load_dataset("renumics/beans-outlier", split="train")
df = ds.to_pandas()
df["label_str"] = df["labels"].apply(lambda x: ds.features["labels"].int2str(x))
dtypes = {
    "nn_image": spotlight.Image,
    "image": spotlight.Image,
    "embedding_ft": spotlight.Embedding,
    "embedding_foundation": spotlight.Embedding,
}
spotlight.show(
    df,
    dtype=dtypes,
    layout="https://spotlight.renumics.com/resources/layout_pre_post_ft.json",
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;or check out the online demo at &lt;a href="https://huggingface.co/spaces/renumics/beans-outlier"&gt;huggingface.co/spaces/renumics/beans-outlier&lt;/a&gt; to examine the outliers:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5ZVWypDv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2AQJLkk0o_Js7hmWLp92aCXQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5ZVWypDv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2AQJLkk0o_Js7hmWLp92aCXQ.png" alt="Pre fine-tuning embeddings for beans: UMAP and 6 worst outliers — visualized with [github.com/renumics/spotlight](http://github.com/renumics/spotlight); source: created by the author." width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0PB8smak--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2AM96J8JZhLfUb--OAdCuhsQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0PB8smak--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://cdn-images-1.medium.com/max/3768/1%2AM96J8JZhLfUb--OAdCuhsQ.png" alt="Post fine-tuning embeddings for beans: UMAP and 6 worst outliers — visualized with [github.com/renumics/spotlight](http://github.com/renumics/spotlight); source: created by the author." width="800" height="311"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the Beans dataset, after fine-tuning, most of the embeddings exhibit complete separation between the three classes. However, a few cases still show slight overlaps, possibly due to similarities between certain types of beans or misclassifications.&lt;/p&gt;

&lt;p&gt;The outlier detection using both pre-fine-tuning and post-fine-tuning embeddings does not yield significant outliers that deviate from the norm. The identified outliers are not distinct or uncommon within the dataset.&lt;/p&gt;

&lt;h2&gt;
  
  
  4 Conclusion
&lt;/h2&gt;

&lt;p&gt;In conclusion, fine-tuning has a significant impact on embeddings in image classification. Before fine-tuning, embeddings provide general-purpose representations, while after fine-tuning, they capture specific features for the task at hand.&lt;/p&gt;

&lt;p&gt;This distinction is clearly reflected in the UMAP visualizations, where post-fine-tuning embeddings exhibit more structured patterns, with certain classes completely separated from others.&lt;/p&gt;

&lt;p&gt;For outlier detection, using post-fine-tuning embeddings can be more effective. However, it’s worth noting that calculating outliers based on the probabilities obtained from fine-tuning might yield even better results compared to relying solely on the embeddings.&lt;/p&gt;

&lt;p&gt;Both pre-fine-tuning and post-fine-tuning embeddings have their unique strengths and should be used in combination to achieve a comprehensive analysis in image classification and analysis tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;p&gt;[1] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby &lt;a href="https://arxiv.org/abs/2010.11929"&gt;An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale&lt;/a&gt; (2020), arXiv&lt;/p&gt;

&lt;p&gt;[2] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, Baining Guo &lt;a href="https://arxiv.org/abs/2103.14030"&gt;Swin Transformer: Hierarchical Vision Transformer using Shifted Windows&lt;/a&gt; (2021), arXiv&lt;/p&gt;

&lt;p&gt;[3] Alex Krizhevsky, &lt;a href="https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf"&gt;Learning Multiple Layers of Features from Tiny Images&lt;/a&gt; (2009), University of Toronto&lt;/p&gt;

&lt;p&gt;[4] Yann LeCun, Corinna Cortes, Christopher J.C. Burges, &lt;a href="http://yann.lecun.com/exdb/mnist/"&gt;MNIST handwritten digit database&lt;/a&gt; (2010), ATT Labs [Online]&lt;/p&gt;

&lt;p&gt;[5] Makerere AI Lab, &lt;a href="http://github.com/AI-Lab-Makerere/ibean/"&gt;Bean disease dataset&lt;/a&gt; (2020), AI Lab, Makerere University&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
  </channel>
</rss>
