<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: IgorSusmelj</title>
    <description>The latest articles on Forem by IgorSusmelj (@igorsusmelj).</description>
    <link>https://forem.com/igorsusmelj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F344770%2Fa5333efd-feeb-420d-9f64-f02396c86f97.jpeg</url>
      <title>Forem: IgorSusmelj</title>
      <link>https://forem.com/igorsusmelj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/igorsusmelj"/>
    <language>en</language>
    <item>
      <title>RustyNum Follow-Up: Fresh Insights and Ongoing Development</title>
      <dc:creator>IgorSusmelj</dc:creator>
      <pubDate>Sun, 16 Feb 2025 20:38:17 +0000</pubDate>
      <link>https://forem.com/igorsusmelj/rustynum-follow-up-fresh-insights-and-ongoing-development-18f9</link>
      <guid>https://forem.com/igorsusmelj/rustynum-follow-up-fresh-insights-and-ongoing-development-18f9</guid>
      <description>&lt;p&gt;Hey Dev Community!&lt;/p&gt;

&lt;p&gt;As a follow-up to my previous introduction to &lt;a href="https://github.com/IgorSusmelj/rustynum" rel="noopener noreferrer"&gt;RustyNum&lt;/a&gt;, I want to share a developer-focused update on what I’ve been working on over the last few weeks. RustyNum, as you might recall, is my lightweight, Rust-powered alternative to NumPy, published on GitHub under the MIT license. It uses Rust’s portable SIMD features for faster numerical computations while staying small (around 300 kB for the Python wheel). In this post, I’ll share a few insights gained during development, point out where RustyNum really helps, and highlight recent additions to the documentation and tutorials.&lt;/p&gt;

&lt;h2&gt;
  
  
  Brief Recap
&lt;/h2&gt;

&lt;p&gt;If you missed the initial announcement, RustyNum focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High performance using Rust’s SIMD&lt;/li&gt;
&lt;li&gt;Memory safety in Rust, without GC overhead&lt;/li&gt;
&lt;li&gt;Small distribution size (much smaller than NumPy wheels)&lt;/li&gt;
&lt;li&gt;NumPy-like interface to reduce friction for Python users&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a more detailed overview, head over to the &lt;a href="https://rustynum.com/" rel="noopener noreferrer"&gt;official RustyNum website&lt;/a&gt; or check out &lt;a href="https://dev.to/igorsusmelj/building-rustynum-crafting-a-numpy-alternative-with-rust-and-python-48ad"&gt;my previous post on dev.to&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Developer’s Perspective: What’s New?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Working with Matrix Operations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’ve spent a good chunk of time ensuring matrix operations feel familiar. Being able to do something like matrix-vector or matrix-matrix multiplication with minimal code changes from NumPy was a primary goal. A highlight is the &lt;code&gt;.dot()&lt;/code&gt; function and the &lt;code&gt;@&lt;/code&gt; operator, which both support these operations.&lt;/p&gt;

&lt;p&gt;Check out the dedicated tutorial:&lt;br&gt;
&lt;a href="https://rustynum.com/tutorials/better-matrix-operations/" rel="noopener noreferrer"&gt;Better Matrix Operations with RustyNum&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s a quick snippet:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rustynum&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rnp&lt;/span&gt;

&lt;span class="n"&gt;matrix&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rnp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NumArray&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="n"&gt;vector&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rnp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NumArray&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Use the dot function
&lt;/span&gt;&lt;span class="n"&gt;result_vec&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;matrix&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;Matrix&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;Vector&lt;/span&gt; &lt;span class="n"&gt;Multiplication&lt;/span&gt; &lt;span class="n"&gt;Result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result_vec&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s neat to see how close this is to NumPy’s workflow. Benchmarks suggest RustyNum can often handle these tasks at speeds comparable to, and sometimes faster than, NumPy on smaller or medium-sized datasets. For very large matrices, I’m still optimizing the approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Speeding Up Common Analytics Tasks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Apart from matrix multiplications, I’ve kept refining operations like &lt;code&gt;mean()&lt;/code&gt;, &lt;code&gt;min()&lt;/code&gt;, &lt;code&gt;max()&lt;/code&gt;, and &lt;code&gt;dot()&lt;/code&gt;. These straightforward methods are prime candidates for SIMD acceleration. There’s also a &lt;a href="https://rustynum.com/tutorials/replacing-numpy-for-faster-analytics/" rel="noopener noreferrer"&gt;tutorial on how to replace specific NumPy calls with RustyNum for analytics&lt;/a&gt;, which might be useful if you’re bottlenecked by Python loops.&lt;/p&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rustynum&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rnp&lt;/span&gt;

&lt;span class="c1"&gt;# Generate test data
&lt;/span&gt;
&lt;span class="n"&gt;data_np&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1_000_000&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;data_rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rnp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NumArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# NumPy approach
&lt;/span&gt;&lt;span class="n"&gt;mean_np&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# RustyNum approach
&lt;/span&gt;&lt;span class="n"&gt;mean_rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data_rn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;item&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;NumPy&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_np&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="err"&gt;“&lt;/span&gt;&lt;span class="n"&gt;RustyNum&lt;/span&gt; &lt;span class="n"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="err"&gt;”&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mean_rn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Python overhead can sometimes offset the raw Rust speed, but in many cases, RustyNum still shows advantages.&lt;/p&gt;
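&lt;p&gt;To make that overhead concrete, here’s a quick way to observe the fixed per-call cost. (This uses NumPy alone, since the effect is general and doesn’t require RustyNum; exact timings will vary by machine.)&lt;/p&gt;

```python
import timeit

import numpy as np

small = np.random.rand(100).astype(np.float32)
large = np.random.rand(1_000_000).astype(np.float32)

# Average per-call time for mean() on a tiny vs. a large array.
t_small = timeit.timeit(small.mean, number=10_000) / 10_000
t_large = timeit.timeit(large.mean, number=100) / 100

# The small array holds 10,000x less data, but the call is nowhere near
# 10,000x faster: fixed per-call overhead dominates at small sizes.
print(f"mean() on 100 elements:       {t_small * 1e6:.2f} us")
print(f"mean() on 1,000,000 elements: {t_large * 1e6:.2f} us")
```

The same fixed cost applies when calling into Rust from Python, which is why batching work into fewer, larger calls usually pays off.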

&lt;h2&gt;
  
  
  New Tutorials: Real-World Examples
&lt;/h2&gt;

&lt;p&gt;One of the best ways to see RustyNum in action is through practical examples. I’ve added several new tutorials with real-world coding scenarios:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Better Matrix Operations&lt;/strong&gt; – Focus on dot products, matrix-vector, and matrix-matrix tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Replacing Core NumPy Calls&lt;/strong&gt; – Demonstrates how to switch from NumPy’s mean, min, dot to RustyNum.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streamlining ML Preprocessing&lt;/strong&gt; – Explores scaling, normalization, and feature engineering for machine learning.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The last tutorial is a personal favorite. &lt;a href="https://rustynum.com/tutorials/streamlining-machine-learning-preprocessing/" rel="noopener noreferrer"&gt;It covers the typical data transformations you’d do in a machine learning pipeline—just swapping out NumPy calls for RustyNum&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Check out a snippet of scaling code from that guide:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;min_max_scale&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;col_mins&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;col_maxes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col_idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;col_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;col_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;col_mins&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
        &lt;span class="n"&gt;col_maxes&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;scaled_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col_idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;col_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;col_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;numerator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col_data&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;col_mins&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;denominator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col_maxes&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;col_mins&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;col_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
        &lt;span class="n"&gt;scaled_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;numerator&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;denominator&lt;/span&gt;
        &lt;span class="n"&gt;scaled_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scaled_col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rnp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rnp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NumArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scaled_data&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It’s a small snippet, but it shows how RustyNum can do row/column manipulations quite effectively. After scaling, you can still feed the data into your favorite machine learning frameworks. The overhead of converting RustyNum arrays back into NumPy or direct arrays is minimal compared to the cost of big model training steps.&lt;/p&gt;
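&lt;p&gt;To put a rough number on that conversion cost, here’s a small sanity check. (NumPy-only, since the round trip goes through plain Python lists either way; treat the timing as illustrative.)&lt;/p&gt;

```python
import time

import numpy as np

data = np.random.rand(1000, 10).astype(np.float32)

# Round-trip: ndarray -> nested Python lists -> ndarray, mimicking how a
# RustyNum result would be handed back to NumPy for model training.
start = time.perf_counter()
as_lists = data.tolist()
restored = np.array(as_lists, dtype=np.float32)
elapsed = time.perf_counter() - start

# The values survive the round trip unchanged.
assert np.allclose(data, restored)
print(f"Round trip for a 1000x10 float32 array: {elapsed * 1e3:.2f} ms")
```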

&lt;h2&gt;
  
  
  Ongoing Work
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;1. Large Matrix Optimizations&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I’ve noticed that for very large matrices (like 10k×10k), RustyNum’s current code paths aren’t yet fully optimized compared to NumPy. This area remains an active project. RustyNum is still young, and I’m hoping to introduce further parallelization or block-based multiplication techniques for better large-scale performance.&lt;/p&gt;
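&lt;p&gt;For the curious, the tiling idea can be sketched in plain NumPy. (This is only an illustration of block-based multiplication for cache locality, not RustyNum’s actual Rust implementation.)&lt;/p&gt;

```python
import numpy as np

def blocked_matmul(a: np.ndarray, b: np.ndarray, block: int = 64) -> np.ndarray:
    """Multiply a @ b tile by tile so each working set stays cache-resident."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((n, m), dtype=a.dtype)
    for i in range(0, n, block):
        for j in range(0, m, block):
            for p in range(0, k, block):
                # Accumulate one output tile from a row-panel of a
                # and a column-panel of b.
                out[i:i + block, j:j + block] += (
                    a[i:i + block, p:p + block] @ b[p:p + block, j:j + block]
                )
    return out

a = np.random.rand(256, 256).astype(np.float32)
b = np.random.rand(256, 256).astype(np.float32)

# Matches the direct product up to float32 accumulation-order differences.
assert np.allclose(blocked_matmul(a, b), a @ b, atol=1e-3)
```

Real implementations additionally parallelize across tiles and tune the block size to the cache hierarchy, which is the direction I’m exploring for RustyNum.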

&lt;p&gt;&lt;strong&gt;2. Expanded Data Types&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RustyNum supports float32 and float64 well, plus some integer types. I’m considering strengthening integer support for data science tasks such as indexing and small transformations. More advanced data types (e.g., complex numbers) might appear further down the line if the community needs them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Documentation and API Enhancements&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The docs site at &lt;a href="https://rustynum.com/" rel="noopener noreferrer"&gt;rustynum.com&lt;/a&gt; has an API reference and a roadmap. I’m continuously adding to it. If you spot anything missing or if you have a specific use case in mind, feel free to open a GitHub issue or submit a pull request.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. The Big Goal of RustyNum&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;RustyNum is, at its core, a learning exercise in combining Rust and Python. Since I work with machine learning every day, I’d love for RustyNum to replace part of my daily NumPy routines, and we’re slowly getting there. I’ve started adding more and more methods aimed at integrating RustyNum into ML pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Code Example: ML Integration
&lt;/h2&gt;

&lt;p&gt;To demonstrate how RustyNum fits into a data pipeline, here’s a condensed example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;rustynum&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;rnp&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;sklearn.linear_model&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LogisticRegression&lt;/span&gt;

&lt;span class="c1"&gt;# 1) Create synthetic data in NumPy
&lt;/span&gt;&lt;span class="n"&gt;train_np&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rand&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;labels_np&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;random&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;randint&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2) Convert to RustyNum for fast scaling
&lt;/span&gt;&lt;span class="n"&gt;train_rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rnp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NumArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;flatten&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Basic scaling (compute min and max per column)
&lt;/span&gt;&lt;span class="n"&gt;scaled_rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col_idx&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_rn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
    &lt;span class="n"&gt;col_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;train_rn&lt;/span&gt;&lt;span class="p"&gt;[:,&lt;/span&gt; &lt;span class="n"&gt;col_idx&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;mn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;mx&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;col_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;rng&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;mx&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mn&lt;/span&gt; &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mx&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="n"&gt;mn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;
    &lt;span class="n"&gt;scaled_col&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col_data&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;mn&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;rng&lt;/span&gt;
    &lt;span class="n"&gt;scaled_rn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;scaled_col&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="n"&gt;train_scaled_rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rnp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;concatenate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;rnp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;NumArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;col&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;float32&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;reshape&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;col&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;scaled_rn&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3) Convert back to NumPy for scikit-learn
&lt;/span&gt;&lt;span class="n"&gt;train_scaled_np&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_scaled_rn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tolist&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;dtype&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;float32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 4) Train a logistic regression model
&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LogisticRegression&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;fit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_scaled_np&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;labels_np&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model Coefficients:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;coef_&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script highlights that RustyNum can handle data transformations with a Pythonic feel, after which you can pass the arrays into other libraries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;It’s been fun to expand RustyNum’s features and see how well Rust can integrate with Python for high-performance tasks. The recent tutorials are a window into how RustyNum might replace parts of NumPy in data science or ML tasks, especially when smaller array sizes or mid-range tasks are involved.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Check out the tutorials at rustynum.com&lt;/li&gt;
&lt;li&gt;Contribute or report issues on GitHub&lt;/li&gt;
&lt;li&gt;Share feedback if there’s a feature you’d love to see&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Thanks for tuning in to this developer-focused update, and I look forward to hearing how RustyNum helps you in your own projects!&lt;/p&gt;

&lt;p&gt;Happy Coding!&lt;br&gt;
Igor&lt;/p&gt;

</description>
      <category>rust</category>
      <category>python</category>
      <category>opensource</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Building RustyNum: a NumPy Alternative with Rust and Python</title>
      <dc:creator>IgorSusmelj</dc:creator>
      <pubDate>Sun, 22 Sep 2024 14:14:20 +0000</pubDate>
      <link>https://forem.com/igorsusmelj/building-rustynum-crafting-a-numpy-alternative-with-rust-and-python-48ad</link>
      <guid>https://forem.com/igorsusmelj/building-rustynum-crafting-a-numpy-alternative-with-rust-and-python-48ad</guid>
      <description>&lt;p&gt;Hey Dev Community!&lt;/p&gt;

&lt;p&gt;I wanted to share a side project I’ve been working on called &lt;a href="https://github.com/IgorSusmelj/rustynum" rel="noopener noreferrer"&gt;RustyNum&lt;/a&gt;. As someone who uses NumPy daily for data processing and scientific computing, I often wondered how challenging it would be to create a similar library from scratch using Rust and Python. This curiosity sparked the development of RustyNum—a lightweight alternative to NumPy that leverages Rust’s powerful features.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is RustyNum?
&lt;/h2&gt;

&lt;p&gt;RustyNum combines the speed and memory safety of Rust with the simplicity and flexibility of Python. One of the standout features is that &lt;a href="https://doc.rust-lang.org/std/simd/index.html" rel="noopener noreferrer"&gt;it uses Rust’s portable SIMD&lt;/a&gt; (Single Instruction, Multiple Data) feature, which allows RustyNum to optimize computations across different CPU architectures seamlessly. This means you can achieve high-performance array manipulations without leaving the Python ecosystem. I also wanted to learn how to build a library from scratch, so RustyNum doesn’t rely on any third-party dependencies.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why RustyNum?
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Performance Boost: By utilizing Rust’s portable SIMD, RustyNum can handle performance-critical tasks more efficiently than traditional Python libraries.&lt;/li&gt;
&lt;li&gt;Memory Safety: Rust ensures memory safety without a garbage collector, reducing the risk of memory leaks and segmentation faults.&lt;/li&gt;
&lt;li&gt;Learning Experience: This project has been a fantastic way for me to dive deeper into Rust-Python interoperability and explore the intricacies of building numerical libraries.&lt;/li&gt;
&lt;li&gt;Small Footprint: Because no external dependencies are used, the Python wheels are tiny (around 300 kB) compared to alternatives such as NumPy (&amp;gt;10 MB).&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  When to Consider RustyNum:
&lt;/h2&gt;

&lt;p&gt;If you’re working on data analysis, scientific computing, or small-scale machine learning projects and find NumPy a bit heavy for your needs, RustyNum might be the perfect fit. It’s especially useful when you need optimized performance across various hardware without the complexity of integrating with C-based libraries. However, be aware that the library is still in its early days and, as of today, only covers a subset of NumPy’s basic operations.&lt;/p&gt;

&lt;p&gt;You can &lt;a href="https://github.com/IgorSusmelj/rustynum" rel="noopener noreferrer"&gt;check out RustyNum on GitHub&lt;/a&gt;. I’d love to hear your feedback, suggestions, or contributions!&lt;/p&gt;

&lt;p&gt;Update January 28th: &lt;a href="https://rustynum.com/" rel="noopener noreferrer"&gt;RustyNum also has its own website!&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Thanks for reading, and happy coding!&lt;/p&gt;

&lt;p&gt;Cheers,&lt;br&gt;
Igor&lt;/p&gt;

</description>
      <category>rust</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Self-Supervised Models are More Robust and Fair</title>
      <dc:creator>IgorSusmelj</dc:creator>
      <pubDate>Thu, 07 Apr 2022 18:00:26 +0000</pubDate>
      <link>https://forem.com/igorsusmelj/self-supervised-models-are-more-robust-and-fair-3f4h</link>
      <guid>https://forem.com/igorsusmelj/self-supervised-models-are-more-robust-and-fair-3f4h</guid>
      <description>&lt;p&gt;&lt;strong&gt;‍A recent paper from Meta AI Research shows that their new 10 billion parameter model trained using self-supervised learning breaks new ground in robustness and fairness.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In spring 2021 Meta AI (former Facebook AI) published &lt;a href="https://arxiv.org/abs/2103.01988"&gt;SEER (Self-supervised Pretraining of Visual Features in the Wild)&lt;/a&gt;. SEER showed that training models using self-supervision works well on large-scale uncurated datasets and their model reached state-of-the-art when it was published.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3yEnErKA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pmm5alce7vcta91sq5or.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3yEnErKA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pmm5alce7vcta91sq5or.png" alt="Accuracy plot from SEER paper. Although the models are fine-tuned and evaluated on ImageNet they use different datasets for pre-training and different models. SEER for example uses a RegNet and has been pre-trained on a dataset of 1B images." width="880" height="810"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model has been pre-trained on 1 billion random and uncurated images from Instagram. The accompanying blog post created some noise around the model as it set the ground for going further with larger models and larger datasets using self-supervised learning: &lt;a href="https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision/"&gt;https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In contrast to supervised learning, self-supervised learning does not require vast amounts of labeled data, which significantly reduces costs.&lt;br&gt;
Basically, the two main ingredients, lots of data and lots of compute, are enough, as shown in this paper. Besides the independence from labeled data, there are further &lt;a href="https://www.lightly.ai/post/the-advantage-of-self-supervised-learning"&gt;advantages in using self-supervised learning&lt;/a&gt; we talked about in another post.&lt;/p&gt;

&lt;h2&gt;
  
  
  SEER is more robust and fair
&lt;/h2&gt;

&lt;p&gt;In this post, we’re more interested in the new follow-up paper to the initial SEER paper. The new paper has been published in late February 2022:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2202.08360"&gt;Vision Models Are More Robust And Fair When Pretrained On Uncurated Images Without Supervision, 2022&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The original SEER paper showed that larger datasets combined with self-supervised pre-training result in higher model accuracy on downstream tasks such as ImageNet. The new paper takes this one step further and investigates what happens if we train even larger models with respect to robustness and fairness.&lt;/p&gt;

&lt;p&gt;Model robustness and fairness have recently gained more attention in the ML community.&lt;/p&gt;

&lt;p&gt;With &lt;strong&gt;robustness&lt;/strong&gt;, we are interested in how reliable the model works when facing changes in its input data distribution. This is a common problem as models once deployed might face scenarios they have never seen during training.&lt;/p&gt;

&lt;p&gt;Model &lt;strong&gt;fairness&lt;/strong&gt; focuses on evaluating how a model performs across groups, for example with respect to gender, skin tone, age, and geography.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;But what exactly is fairness and how can we measure it?&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Previous work in Fairness Indicators
&lt;/h2&gt;

&lt;p&gt;Although fairness in ML is an older research area, the field has recently gained a lot of interest. For example, the paper &lt;a href="https://arxiv.org/abs/2202.07603"&gt;Fairness Indicators for Systematic Assessments of Visual Feature Extractors, 2022&lt;/a&gt; introduces three rather simple indicators one can use to evaluate model fairness.&lt;/p&gt;

&lt;p&gt;The approach is to fine-tune (think about transfer learning) a trained backbone to make predictions across three indicators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;harmful mislabeling&lt;/strong&gt; of images of people by training a classifier
&lt;em&gt;Harmful mislabeling happens when a model associates attributes like “crime” or non-human attributes like “ape” or “puppet” with a human. Apart from being low overall, the mislabeling rate should be independent of gender or skin color. If however, people with a certain gender or skin color are mislabeled more often than others, the model is unfair.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;geographical disparity&lt;/strong&gt; in object recognition by training a classifier
&lt;em&gt;How well do we recognize objects all around the world? Common objects like chairs, streets, and houses look different across the globe.&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;disparities in learned visual representations&lt;/strong&gt; of social memberships of people by using similarity search to retrieve similar examples
&lt;em&gt;If we do a similarity lookup for people of a certain skin tone, do we also get similar skin tones in the set of nearest neighbors?&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
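&lt;p&gt;The third indicator boils down to a nearest-neighbor lookup on image embeddings. A toy, illustration-only sketch (the embeddings and group tags below are made up; real evaluations use learned representations):&lt;/p&gt;

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

# Hypothetical embeddings with an attribute tag per sample.
gallery = [
    ([1.0, 0.1], "group_a"),
    ([0.9, 0.2], "group_a"),
    ([0.1, 1.0], "group_b"),
]
query = [1.0, 0.0]

# Retrieve neighbors sorted by similarity; a fair representation should not
# systematically push one group out of the top results.
ranked = sorted(gallery, key=lambda item: cosine(item[0], query), reverse=True)
print([tag for _, tag in ranked])  # ['group_a', 'group_a', 'group_b']
```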

&lt;p&gt;In their paper, three different training methodologies and datasets are used for the evaluation.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Supervised training on ImageNet&lt;/li&gt;
&lt;li&gt;Weakly-supervised training on filtered Instagram data&lt;/li&gt;
&lt;li&gt;Self-Supervised training on ImageNet or uncurated Instagram data (SEER model)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three training paradigms use the same model architecture: a RegNetY-128 backbone with 700M parameters.&lt;/p&gt;

&lt;p&gt;The results of the evaluation show that training models with less supervision seems to improve the fairness of the trained models:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9_7AkNeO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/413ikvch81euxq8kzb7f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9_7AkNeO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/413ikvch81euxq8kzb7f.png" alt="Label association results. For harmful association, lower hit-rate is better. For non-harmful association, higher hit-rate is better. From Fairness Indicators paper, 2022." width="880" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1UBY7kFf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0cn7pqd8wk865omc3j78.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1UBY7kFf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/0cn7pqd8wk865omc3j78.png" alt="Geodiversity, hit rates for Supervised, WSL and SSL. From Fairness Indicators paper, 2022." width="880" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Going Deeper into Fairness Evaluation
&lt;/h2&gt;

&lt;p&gt;Now, how does the new SEER paper build on top of the &lt;a href="https://arxiv.org/abs/2202.07603"&gt;Fairness Indicators paper&lt;/a&gt; we just discussed?&lt;/p&gt;

&lt;p&gt;There are two main additions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A larger model (a RegNetY-10B model with 10B parameters, over 14x more)&lt;/li&gt;
&lt;li&gt;Evaluation across 50+ benchmarks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In order to fit the model onto the GPUs, tricks like model sharding, Fully Sharded Data Parallel (FSDP), and activation checkpointing are used. Finally, the authors used a batch size of 7,936 across 496 NVIDIA A100 GPUs. Note that according to the paper, only one epoch is used for the self-supervised pretraining. On ImageNet, these models are often trained for 800 epochs or more. This means that even though the dataset is almost 1,000 times larger, the number of images seen during training is comparable.&lt;/p&gt;
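&lt;p&gt;This claim is easy to sanity-check with back-of-the-envelope numbers (dataset sizes below are the approximate published figures):&lt;/p&gt;

```python
# Back-of-the-envelope check: images seen during pre-training
# equals dataset size multiplied by the number of epochs.
seer_images_seen = 1_000_000_000 * 1    # ~1B uncurated images, 1 epoch
imagenet_images_seen = 1_281_167 * 800  # ImageNet train set, 800 epochs

print(seer_images_seen)      # 1000000000
print(imagenet_images_seen)  # 1024933600
# Both regimes see roughly the same number of images during training,
# even though the SEER dataset itself is about 780x larger than ImageNet.
```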

&lt;p&gt;Since the authors used a very large dataset (1 Billion images) they argue that they can also scale the model size: &lt;em&gt;We scale our model size to dense 10 billion parameters to avoid underfitting on a large data size.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;p&gt;The paper is full of interesting plots and tables. We will just highlight a few of them in this post. For more information, please have a look at the paper.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fairness
&lt;/h3&gt;

&lt;p&gt;This benchmark is the one we looked at previously when talking about the Fairness Indicators paper. The table breaks results down by gender, skin tone, and age group. Interestingly, all SSL pre-trained models perform much better than their supervised counterpart.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--nBTp_OkL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/448v6f4z1hf19w5lmbjx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--nBTp_OkL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/448v6f4z1hf19w5lmbjx.png" alt="Image description" width="880" height="608"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;h3&gt;
  
  
  Out-of-distribution Performance
&lt;/h3&gt;

&lt;p&gt;Out-of-distribution performance, commonly called robustness to data distribution shifts, is a common problem in ML. A model trained on one set of images might encounter slightly different images once deployed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sHAGCvVv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zbl6bc80nyui9k6igp8m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sHAGCvVv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zbl6bc80nyui9k6igp8m.png" alt="Previous papers on SSL have already shown that SSL pre-training results in more robust models compared to supervised pre-training. The large 10B SEER model shows that even with larger models the performance still increases." width="880" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;While self-supervised learning (SSL) is still in its infancy, the research direction looks very promising. Not having to rely on large, fully labeled datasets allows training models that are more robust to data distribution shifts (data drift) and fairer than their supervised counterparts. The new paper offers very interesting insights.&lt;/p&gt;

&lt;p&gt;If you’re interested in self-supervised learning and want to try it out yourself you can check out our &lt;a href="https://github.com/lightly-ai/lightly"&gt;open-source repository for self-supervised learning&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We’ve been using self-supervised learning at Lightly from day one, as our initial experiments showed its benefits for large-scale data curation. We’re happy that new research papers support our approach, and we hope we can help curate datasets that lead to less biased data and fairer models.&lt;/p&gt;

&lt;p&gt;Igor, co-founder&lt;br&gt;
&lt;a href="https://lightly.ai/"&gt;Lightly.ai&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Data Annotation and What Data Annotation Companies do</title>
      <dc:creator>IgorSusmelj</dc:creator>
      <pubDate>Mon, 21 Feb 2022 22:07:02 +0000</pubDate>
      <link>https://forem.com/igorsusmelj/data-annotation-and-what-data-annotation-companies-do-5ej1</link>
      <guid>https://forem.com/igorsusmelj/data-annotation-and-what-data-annotation-companies-do-5ej1</guid>
      <description>&lt;p&gt;&lt;strong&gt;Data annotation is one of the core functions of machine learning. The more data an ML model is trained with, the more accurate it will become.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Just like humans learn through training and practice, machine learning models are also trained by feeding them with huge volumes of data.&lt;/p&gt;

&lt;p&gt;One of the reasons Google is still the best search engine is because it has a lot of data compared to its competitors, including Yahoo and Bing (Microsoft’s search engine). With this data, Google is able to give users the best search results that match their search queries. Several other web apps also rely on data annotation to improve their algorithms in order to enhance their users’ experience.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2cE8R-Mk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x8g25cg0tebrj1ojkk8t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2cE8R-Mk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/x8g25cg0tebrj1ojkk8t.png" alt="An autonomous robot learns to navigate and understand its surrounding after learning from annotated data." width="880" height="660"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  So, what is data annotation?
&lt;/h2&gt;

&lt;p&gt;Data annotation refers to the process of categorizing and labeling information or data so that machine learning models can use it. The data used to train machine learning models has to be accurately labeled and categorized for specific use cases. For instance, the categorization and labeling of data to be used by a search engine ML model is different from a speech recognition ML model.&lt;/p&gt;

&lt;p&gt;Data annotation involves assessing four primary types of data: text, audio, video, and image. This article focuses mainly on image and text annotation, since they are the most popular types of data used to train machine learning models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Text annotation
&lt;/h2&gt;

&lt;p&gt;A 2020 State of AI and Machine Learning report shows that over 70% of companies relied on text to train their AI and machine learning models. The common types of text annotation include sentiment, intent, and query annotation. Let’s discuss each of these in detail.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sentiment Annotation&lt;/strong&gt;&lt;br&gt;
Sentiment annotation involves labeling emotions, attitudes, and opinions, which makes having the proper training data crucial for machine learning models. Sentiment annotation is typically done by humans because it involves judging content and sentiment on platforms such as social media and eCommerce sites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query annotation&lt;/strong&gt;&lt;br&gt;
This type of text annotation involves training search algorithms by tagging the various components within product titles and search queries to improve the relevance of search results. Algorithms that use query annotation are usually found in search engines for eCommerce platforms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Intent annotation&lt;/strong&gt;&lt;br&gt;
This type of text annotation involves training machine learning models to identify intention in a particular text. Intent annotations help ML models to differentiate various inputs into categories, including requests, commands, bookings, recommendations, and confirmations. This type of text annotation is mainly used to train search engine Machine Learning models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Image annotation
&lt;/h2&gt;

&lt;p&gt;Image annotation involves training machine learning models with many labeled images to help them learn the features in those images. Some of the applications that use such algorithms include computer vision, robotic vision, and apps with facial recognition functionality.&lt;/p&gt;

&lt;p&gt;For effective training of ML models with image annotation, metadata has to be attached to all the images used. This metadata usually includes identifiers, captions, and keywords. Some of the popular use cases that take advantage of image annotation include health apps that auto-identify medical conditions, computer vision systems in self-driving cars, machines used for sorting goods, and many more.&lt;/p&gt;

&lt;p&gt;Image annotation is more intensive and requires more computation power than text annotation, simply because images carry far more data than text. Training ML models with images involves learning from all the pixels in the various images fed into the model.&lt;/p&gt;

&lt;p&gt;Image annotation has five main types:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bounding boxes annotation&lt;/strong&gt;&lt;br&gt;
With bounding boxes, human annotators are tasked with drawing boxes around specific subjects within the image. This type of annotation is mainly used to train autonomous vehicle algorithms to detect objects such as road signs, traffic, and potholes.&lt;/p&gt;
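&lt;p&gt;A bounding-box label is typically stored as pixel coordinates plus a class name and exchanged as JSON between annotation tools and training code. A minimal, COCO-style sketch (the exact schema depends on the tool; the file names and field names here are illustrative):&lt;/p&gt;

```python
import json

# One image with two box annotations, in a COCO-like [x, y, width, height] format.
annotation = {
    "image": "frame_0001.jpg",
    "boxes": [
        {"label": "traffic_sign", "bbox": [34, 120, 48, 48]},
        {"label": "pothole", "bbox": [210, 300, 90, 40]},
    ],
}

# Annotations are typically exchanged as JSON between tools and training code.
serialized = json.dumps(annotation)
print(json.loads(serialized)["boxes"][0]["label"])  # traffic_sign
```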

&lt;p&gt;&lt;strong&gt;3D cuboids annotation&lt;/strong&gt;&lt;br&gt;
This type of image annotation involves drawing 3D boxes around specific objects in an image. Unlike bounding boxes that only consider length and width, 3D cuboids include the height or depth of the object.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Polygons&lt;/strong&gt;&lt;br&gt;
At times some objects may not fit well in a bounding box or 3D cuboid because not all things are rectangular. Objects such as cars, humans, and buildings are usually not perfectly rectangular, so they can’t fit in a rectangle or cuboid. In this case, human annotators have to draw polygons around the non-rectangular objects before feeding this data to an ML model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lines and spines&lt;/strong&gt;&lt;br&gt;
These are used to train machine learning models to identify lanes and boundaries. Annotators draw the lane lines and boundaries that the ML model should learn.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Semantic segmentation&lt;/strong&gt;&lt;br&gt;
This is a much more precise and deeper type of annotation that involves associating every pixel in a given image with a tag. This annotation type is mainly used in machine learning models for autonomous vehicles and medical image diagnostics.&lt;/p&gt;
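&lt;p&gt;Per-pixel labels are usually stored as a mask with the same dimensions as the image, where each value is a class id. A toy sketch with made-up class ids:&lt;/p&gt;

```python
# A 4x4 "image" where every pixel holds a class id:
# 0 = background, 1 = road, 2 = vehicle.
mask = [
    [0, 0, 1, 1],
    [0, 2, 1, 1],
    [0, 2, 1, 1],
    [0, 0, 1, 1],
]

# Count how many pixels belong to each class.
counts = {}
for row in mask:
    for class_id in row:
        counts[class_id] = counts.get(class_id, 0) + 1

print(counts)  # {0: 6, 1: 8, 2: 2}
```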

&lt;h2&gt;
  
  
  What do data annotation companies do?
&lt;/h2&gt;

&lt;p&gt;One of the major challenges in training machine learning models is finding the right quality and quantity of data to feed them. Remember, the quality and amount of data you provide these models determine how well they will perform on the tasks they are finally deployed to do.&lt;/p&gt;

&lt;p&gt;To help fix these issues, data annotation companies provide the appropriate amount and quality of data needed to train various types of AI and ML models. These companies combine human annotators with machine-learning assistance to produce high-quality training data.&lt;/p&gt;

&lt;p&gt;Besides providing training data for AI and ML models, data annotation companies also offer deployment and maintenance services for AI and ML projects. These are follow-up services meant to ensure the provided data delivers the desired results wherever the ML algorithm trained on it is deployed.&lt;/p&gt;

&lt;p&gt;For instance, if it is a search algorithm deployed in an eCommerce site, the data annotation company has to ensure the algorithm provides the best search results for the various user queries.&lt;/p&gt;

&lt;p&gt;Check out our &lt;a href="https://data-annotation.com/list-of-data-annotation-companies/"&gt;list of data annotation companies&lt;/a&gt; to learn more!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post was originally posted here: &lt;a href="https://data-annotation.com/data-annotation-and-what-data-annotation-companies-do/"&gt;https://data-annotation.com/data-annotation-and-what-data-annotation-companies-do/&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Train Test Split in Deep Learning</title>
      <dc:creator>IgorSusmelj</dc:creator>
      <pubDate>Sun, 20 Feb 2022 18:59:11 +0000</pubDate>
      <link>https://forem.com/igorsusmelj/train-test-split-in-deep-learning-4gbl</link>
      <guid>https://forem.com/igorsusmelj/train-test-split-in-deep-learning-4gbl</guid>
      <description>&lt;p&gt;&lt;strong&gt;One of the golden rules in machine learning is to split your dataset into train, validation, and test set. Learn how to bypass the most common caveats!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The reason we do this is simple: if we did not split the data into different sets, the model would be evaluated on the same data it saw during training. We could then run into problems such as overfitting without even knowing it.&lt;/p&gt;

&lt;p&gt;Back before deep learning, we typically used three different sets:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;train set&lt;/strong&gt; is used for training the model&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;validation set&lt;/strong&gt; that is used to evaluate the model during the training process&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;test set&lt;/strong&gt; that is used to evaluate the final model accuracy before deployment&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  How do we use the train, validation, and test set?
&lt;/h2&gt;

&lt;p&gt;Usually, we use the different sets as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We split the dataset randomly into three subsets called the &lt;strong&gt;train&lt;/strong&gt;, &lt;strong&gt;validation&lt;/strong&gt;, and &lt;strong&gt;test set&lt;/strong&gt;. Splits could be 60/20/20 or 70/20/10 or any other ratio you desire.&lt;/li&gt;
&lt;li&gt;We train a model using the &lt;strong&gt;train set&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;During the training process, we evaluate the model on the &lt;strong&gt;validation set&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If we are not happy with the results, we can change the hyperparameters or pick another model and &lt;em&gt;go back to step 2&lt;/em&gt;.
&lt;/li&gt;
&lt;li&gt;Finally, once we’re happy with the results on the &lt;strong&gt;validation set&lt;/strong&gt; we can evaluate our model on the &lt;strong&gt;test set&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;If we’re happy with the results, we can now train our model again on the &lt;strong&gt;train&lt;/strong&gt; and &lt;strong&gt;validation set&lt;/strong&gt; combined, using the final hyperparameters we derived.&lt;/li&gt;
&lt;li&gt;We can again evaluate the model accuracy on the &lt;strong&gt;test set&lt;/strong&gt; and if we’re happy deploy the model.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most ML frameworks provide built-in methods for random train/test splits of a dataset. The most well-known example is the &lt;a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html"&gt;train_test_split function of scikit-learn&lt;/a&gt;.&lt;/p&gt;
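&lt;p&gt;As a sketch of what such a split does under the hood, here is a minimal pure-Python version with a 70/20/10 ratio (illustrative only, not scikit-learn’s implementation):&lt;/p&gt;

```python
import random

def train_val_test_split(samples, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle, then cut the dataset into train/validation/test chunks."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps splits reproducible
    n = len(shuffled)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # everything left over goes to test
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 70 20 10
```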

&lt;h2&gt;
  
  
  Are there any issues when using a very small dataset?
&lt;/h2&gt;

&lt;p&gt;Yes, this could be a problem. With very small datasets the test set will be tiny and therefore a single wrong prediction has a strong impact on the test accuracy. Fortunately, there is a way to work around this problem.&lt;/p&gt;

&lt;p&gt;The solution to this problem is called &lt;a href="https://en.wikipedia.org/wiki/Cross-validation_(statistics)"&gt;cross-validation&lt;/a&gt;. We essentially create partitions of our dataset as shown in the image below. We always hold out a set for testing and use all the other data for training. Finally, we gather and average all the results from the test sets. We essentially train k models and, with this trick, obtain evaluation statistics over the full dataset (as every sample has been part of one of the k test sets).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uYY_cKoF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9agmdxw2iycf2igvld1m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uYY_cKoF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9agmdxw2iycf2igvld1m.png" alt="Illustration from Wikipedia showing how k-fold cross-validation works. We iteratively shuffle the data that is used for training and testing and evaluate the overall statistics." width="880" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This approach is rarely used with recent deep learning methods, as it’s very expensive to train a model k times.&lt;/p&gt;

&lt;p&gt;With the rise of deep learning and the massive increase in dataset sizes, the need for techniques such as cross-validation or having a separate validation set has diminished. One reason for this is that experiments are very expensive and take a long time. Another one is that due to the large datasets and nature of most deep learning methods the models got less affected by overfitting.&lt;/p&gt;

&lt;p&gt;Overfitting is still a problem in deep learning. But overfitting to 50 samples with 10 features happens much faster than overfitting to 100k images with millions of pixels each.&lt;/p&gt;

&lt;p&gt;One could argue that researchers and practitioners got lazy or sloppy. It would be interesting to see a recent paper investigating such effects again. For example, it could be that researchers in the past years have heavily overfitted their models to the test set of ImageNet, as there has been an ongoing race to improve on it and claim state-of-the-art.&lt;/p&gt;

&lt;h2&gt;
  
  
  How should I pick my train, validation, and test set?
&lt;/h2&gt;

&lt;p&gt;Naively, one could just manually split the dataset into three chunks. The problem with this approach is that we humans are very biased and this bias would get introduced into the three sets.&lt;/p&gt;

&lt;p&gt;In academia, we learn that we should pick them randomly. A random split into the three sets guarantees that all three sets follow the same statistical distribution. And that’s what we want since ML is all about statistics.&lt;/p&gt;

&lt;p&gt;Deriving the three sets from completely different distributions would yield some unwanted results. There is not much value in training a model on pictures of cats if we want to use it to classify flowers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9ujq7jcB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7z6e8urmeke9bc9j9jbt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9ujq7jcB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7z6e8urmeke9bc9j9jbt.png" alt="How should I pick my train, validation, and test set?" width="880" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, the underlying assumption of a random split is that the initial dataset already matches the statistical distribution of the problem we want to solve. That would mean that for problems such as autonomous driving the assumption is that our dataset covers all sorts of cities, weather conditions, vehicles, seasons of the year, special situations, etc.&lt;/p&gt;

&lt;p&gt;As you might suspect, this assumption is actually not valid for most practical deep learning applications. Whenever we collect data using sensors in an uncontrolled environment, we might not get the desired data distribution.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;But that’s bad. What am I supposed to do if I’m not able to collect a representative dataset of the problem I try to solve?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;What you’re looking for is the research area around finding and dealing with &lt;strong&gt;domain gaps&lt;/strong&gt;, &lt;strong&gt;distributional shifts&lt;/strong&gt;, or &lt;strong&gt;data drift&lt;/strong&gt;. All these terms have their own specific definition. I’m listing them here so you can search for the relevant problems easily.&lt;/p&gt;

&lt;p&gt;With a &lt;em&gt;domain&lt;/em&gt;, we refer to the data domain: the source and type of the data we use. There are three ways to move forward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Solve the data gap by collecting more representative data&lt;/li&gt;
&lt;li&gt;Use data curation methods to make the data already collected more representative&lt;/li&gt;
&lt;li&gt;Focus on building a robust enough model to handle such domain gaps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The latter approach is focusing on building models for out-of-distribution tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Picking a train test split for out-of-distribution tasks
&lt;/h2&gt;

&lt;p&gt;In machine learning, we refer to out-of-distribution whenever our model has to perform well in a situation where the new input data is from a different distribution than the training data. Going back to our autonomous driving example from before, we could say that for a model that has only been trained on sunny California weather, doing predictions in Europe is out of distribution.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Now, how should we do the split of the dataset for such a task?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Since we collected the data using different sensors we also might have additional information about the source for each of the samples (a sample could be an image, lidar frame, video, etc.).&lt;/p&gt;

&lt;p&gt;We can solve this problem by splitting the dataset in the following way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;we train on a set of data from cities in list A&lt;/li&gt;
&lt;li&gt;and evaluate the model on a set of data from cities in list B&lt;/li&gt;
&lt;/ul&gt;
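&lt;p&gt;In code, such a source-aware split partitions on the metadata field rather than on individual samples. A minimal sketch (the city values and file names below are made up for illustration):&lt;/p&gt;

```python
# Each sample carries metadata about where it was recorded.
samples = [
    {"frame": "a.png", "city": "zurich"},
    {"frame": "b.png", "city": "zurich"},
    {"frame": "c.png", "city": "berlin"},
    {"frame": "d.png", "city": "paris"},
]

train_cities = {"zurich", "berlin"}  # list A
test_cities = {"paris"}              # list B

# Split on the grouping field, never on individual frames, so that no
# city appears in both the training and the evaluation set.
train = [s for s in samples if s["city"] in train_cities]
test = [s for s in samples if s["city"] in test_cities]

print(len(train), len(test))  # 3 1
```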

&lt;p&gt;There is a &lt;a href="https://medium.com/yandex-self-driving-car/yandex-publishes-industrys-largest-av-dataset-launches-prediction-challenge-at-neurips-28d6bdfde78d"&gt;great article from Yandex research about their new dataset to tackle distributional shifts in datasets&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things that could go wrong
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The validation set and test set accuracy differ a lot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You very likely overfitted your model to the validation set, or the validation and test sets are very different. But how?&lt;/p&gt;

&lt;p&gt;You likely did several iterations of tweaking the parameters to squeeze out the last bit of accuracy your model can yield on the validation set. The validation set is no longer fulfilling its purpose. At this point, you should relax some of your hyperparameters or introduce regularization methods.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After deriving my final hyperparameters I want to retrain my model on the full dataset (train + validation + test) before shipping&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;No, don’t do this. The hyperparameters have been tuned for the train (or maybe the train + validation) set and might yield a different result when used on the full dataset.&lt;br&gt;
Furthermore, you will no longer be able to answer the question of how well your model really performs, as the test set no longer exists.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I have a video dataset and want to split the frames randomly into train, validation, and test set&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since video frames are very likely highly correlated (e.g. two neighboring frames look almost the same), this is a bad idea. It’s almost the same as evaluating the model on the training data. Instead, you should split the dataset across videos (e.g. videos 1, 3, and 5 are used for training and videos 2 and 4 for validation). You can again use a random train test split, but this time on the video level instead of the frame level.&lt;/p&gt;
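&lt;p&gt;A minimal sketch of such a video-level split (the video ids and frame counts are hypothetical):&lt;/p&gt;

```python
import random

# Hypothetical frame records: 5 videos with 100 frames each.
frames = [
    {"video_id": vid, "frame_idx": i}
    for vid in range(1, 6)
    for i in range(100)
]

# Split on the video level, not the frame level, so that highly
# correlated neighboring frames never end up on both sides.
video_ids = sorted({f["video_id"] for f in frames})
random.seed(42)
random.shuffle(video_ids)

train_videos = set(video_ids[:3])  # videos used for training
val_videos = set(video_ids[3:])    # remaining videos for validation

train_frames = [f for f in frames if f["video_id"] in train_videos]
val_frames = [f for f in frames if f["video_id"] in val_videos]
```

&lt;p&gt;The randomness happens over whole videos, so every frame of a given video lands entirely in either train or validation.&lt;/p&gt;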

&lt;p&gt;Igor, co-founder&lt;br&gt;
&lt;a href="https://www.lightly.ai/"&gt;Lightly.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This blog post has originally been published here: &lt;a href="https://www.lightly.ai/post/train-test-split-in-deep-learning"&gt;https://www.lightly.ai/post/train-test-split-in-deep-learning&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>deeplearning</category>
    </item>
    <item>
      <title>Active Learning using Detectron2</title>
      <dc:creator>IgorSusmelj</dc:creator>
      <pubDate>Sun, 30 May 2021 09:12:24 +0000</pubDate>
      <link>https://forem.com/igorsusmelj/active-learning-using-detectron2-3i5g</link>
      <guid>https://forem.com/igorsusmelj/active-learning-using-detectron2-3i5g</guid>
      <description>&lt;p&gt;&lt;strong&gt;Tired of labeling all your data? Learn more about how model predictions and embeddings can help you select the right data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Supervised machine learning requires labeled data. In computer vision applications such as autonomous driving, labeling a single frame can cost up to $10. The fast growth of new, connected devices and cheaper sensors leads to a continuous increase in new data. Labeling everything is simply not possible anymore. Many companies in fact only label between 0.1% and 1% of the data they collect. But finding the right 0.1% of data is like finding a needle in a haystack without knowing what the needle looks like. So, how can we do it efficiently?&lt;/p&gt;

&lt;p&gt;One approach to tackle the problem is active learning. When doing active learning, we use a pre-trained model and its predictions to select the next batch of data for labeling. Different algorithms exist that help you select the right data based on model predictions. For example, the well-known approach of uncertainty sampling selects new data based on low model confidence. Let's assume a scenario where we have two images with cats: for one, the model says it's 60% sure it's a cat; for the other, the model is 90% certain. We would now pick the image where the model has only 60% confidence. We essentially pick the "harder" example. &lt;br&gt;
With active learning, we iterate this prediction and selection process until we reach our target metrics.&lt;/p&gt;
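&lt;p&gt;The two-cat example boils down to sorting images by model confidence and picking from the bottom. A tiny sketch (filenames and confidence values are made up for illustration):&lt;/p&gt;

```python
# Least-confidence selection for the two-cat example.
predictions = {
    "cat_a.jpg": 0.90,  # model is 90% sure it's a cat
    "cat_b.jpg": 0.60,  # model is only 60% sure
}

def select_for_labeling(preds, n=1):
    """Return the n images with the lowest model confidence."""
    return sorted(preds, key=preds.get)[:n]

picked = select_for_labeling(predictions)
```

&lt;p&gt;With a larger pool of predictions, the same sorting step picks the n "hardest" examples for the next labeling batch.&lt;/p&gt;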

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vsM--pIj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/71k4chnwcdleyhyxw3rj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vsM--pIj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/71k4chnwcdleyhyxw3rj.jpg" alt="Example image from Comma10k with model predictions of a Faster R-CNN model with a ResNet-50 backbone trained on MS COCO."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, we won't go into detail about how active learning works. There are many great resources about active learning. Instead, we will focus on how you can use active learning with just a few lines of code using the &lt;a href="https://docs.lightly.ai/getting_started/active_learning.html"&gt;Active Learning feature of Lightly&lt;/a&gt;. &lt;a href="https://lightly.ai/"&gt;Lightly&lt;/a&gt; is a data curation platform for computer vision. It leverages recent advances in self-supervised learning and active learning to help you work with unlabeled datasets.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Datasets: From MS COCO to Comma10k
&lt;/h2&gt;

&lt;p&gt;It is very common these days to use pre-trained models and fine-tune them on new tasks using transfer learning. Since we are interested in object detection here we use a pre-trained model from &lt;a href="https://cocodataset.org/#home"&gt;MS COCO&lt;/a&gt; (or COCO). Consisting of more than 100k labeled images, it is a very common dataset used for transfer learning for image segmentation, object detection, or keypoint/pose estimation.&lt;/p&gt;

&lt;p&gt;Our goal is to use active learning to use a COCO pre-trained model and fine-tune it on a dataset for autonomous driving. For this transfer task, we are using the &lt;a href="https://github.com/commaai/comma10k"&gt;Comma10k dataset&lt;/a&gt;. From the &lt;a href="https://github.com/commaai/comma10k"&gt;repository&lt;/a&gt;: "It's 10,000 PNGs of real driving captured from the comma fleet. It's MIT license, no academic-only restrictions or anything."&lt;/p&gt;

&lt;p&gt;As you might have noticed already, the Comma10k dataset has annotations for training "segnets" (semantic segmentation networks). However, there are no bounding box annotations, which we require for our transfer task. We therefore have to add the missing annotations. Instead of annotating all 10k images, we will use active learning to pick the 100 images where we expect the highest return in model improvement and annotate them first.&lt;/p&gt;

&lt;p&gt;Let's have a look at how active learning can help us select the first 100 images for annotation.&lt;/p&gt;
&lt;h2&gt;
  
  
  Let's get started
&lt;/h2&gt;

&lt;p&gt;This post is based on the &lt;a href="https://docs.lightly.ai/tutorials/platform/tutorial_active_learning_detectron2.html"&gt;Active Learning using Detectron2 on Comma10k tutorial&lt;/a&gt;. If you want to run the code yourself there is also a &lt;a href="https://colab.research.google.com/drive/1r0KDqIwr6PV3hFhREKgSjRaEbQa5N_5I?usp=sharing"&gt;ready-to-use Google Colab Notebook&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Getting active learning to work can be really hard. Many companies fail to implement it properly and get little to no value out of it. One of the main reasons is that they focus only on uncertainty sampling, which is just one of the two big categories of active learning algorithms. The following illustration shows the Knowledge Quadrant (&lt;a href="https://medium.com/pytorch/https-medium-com-robert-munro-active-learning-with-pytorch-2f3ee8ebec"&gt;see Active Learning with PyTorch&lt;/a&gt;); you find the two active learning approaches in the right column.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pfrutGzX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/4800/0%2AnROAlmspxWcLEEpq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pfrutGzX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/4800/0%2AnROAlmspxWcLEEpq.png" alt="Knowledge Quadrant - The right column is Active Learning. (see Active Learning with PyTorch)"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Uncertainty sampling is probably the most common approach. You pick new samples based on where model predictions have low confidence.&lt;/p&gt;

&lt;p&gt;The Lightly Platform supports both uncertainty sampling and diversity sampling algorithms.&lt;/p&gt;

&lt;p&gt;The second approach is diversity sampling. You can use it to diversify the dataset: we pick images that are visually/semantically distinct from each other.&lt;/p&gt;

&lt;p&gt;Uncertainty sampling can be used with a variety of scores (least confidence, margin, entropy…).&lt;br&gt;
For diversity sampling Lightly uses the coreset algorithm and &lt;a href="https://github.com/lightly-ai/lightly"&gt;embeddings obtained from its open-source self-supervised learning framework lightly&lt;/a&gt;.&lt;/p&gt;
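&lt;p&gt;To make the three uncertainty scores concrete, here is a minimal sketch over a per-image class-probability vector (this is not Lightly's implementation; higher score means "more worth labeling"):&lt;/p&gt;

```python
import math

def least_confidence(probs):
    # Higher score = the top class has low probability.
    return 1.0 - max(probs)

def margin(probs):
    # A small gap between the two most likely classes means high
    # uncertainty, so return 1 - gap to make higher mean more uncertain.
    top2 = sorted(probs, reverse=True)[:2]
    return 1.0 - (top2[0] - top2[1])

def entropy(probs):
    # Near-uniform distributions (maximal uncertainty) score highest.
    return -sum(p * math.log(p) for p in probs if p > 0)

confident = [0.90, 0.05, 0.05]
uncertain = [0.40, 0.35, 0.25]
```

&lt;p&gt;All three scores agree that the second distribution is the more valuable one to label; they differ in how much weight they give to the classes below the top two.&lt;/p&gt;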

&lt;p&gt;However, there is more.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Lightly has another algorithm for active learning called CORAL (&lt;strong&gt;COR&lt;/strong&gt;eset &lt;strong&gt;A&lt;/strong&gt;ctive &lt;strong&gt;L&lt;/strong&gt;earning) which uses a combination of diversity and uncertainty sampling.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The goal is to overcome the limitations of the individual methods by selecting images with low model confidence but at the same time making sure that they are visually distinct from each other.&lt;/p&gt;

&lt;p&gt;Let's see how we can make use of active learning and the Lightly Platform.&lt;/p&gt;
&lt;h2&gt;
  
  
  Embed and Upload your Dataset
&lt;/h2&gt;

&lt;p&gt;Let's start by creating embeddings and uploading the dataset to the Lightly Platform. We will use the embeddings later for the diversification part of the CORAL algorithm.&lt;/p&gt;

&lt;p&gt;You can easily train, embed, and upload a dataset using the &lt;a href="https://github.com/lightly-ai/lightly"&gt;lightly Python package&lt;/a&gt;. &lt;br&gt;
First, we need to install the package. We recommend using pip for this. Make sure you’re in a Python 3.6+ environment. If you’re on Windows, you should create a conda environment.&lt;/p&gt;

&lt;p&gt;Run the following command in your shell to install the latest version of lightly:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
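&lt;p&gt;Installing the latest release from PyPI is a single command:&lt;/p&gt;

```shell
# Install the latest release of lightly from PyPI
# (assumes the Python 3.6+ environment mentioned above).
pip install lightly
```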


&lt;p&gt;Now that we have lightly installed, we can run the &lt;code&gt;lightly-magic&lt;/code&gt; CLI command to train, embed, and upload our dataset. You need to pass a token and a dataset_id argument to the command. You can find both in the Lightly Platform after creating a new dataset.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Once you have run the &lt;code&gt;lightly-magic&lt;/code&gt; CLI command, you should see the uploaded dataset in the Lightly Platform. You can have a look at the 2D visualizations of your dataset. Do you spot the two clusters forming images of day and night?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--toA1AoUw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jdji8lzhx7zltg25jye2.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--toA1AoUw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/jdji8lzhx7zltg25jye2.gif" alt="2D visualization of the Comma10k dataset on the Lightly Platform"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Active Learning Workflow
&lt;/h2&gt;

&lt;p&gt;Now that we have a dataset with embeddings uploaded to the Lightly Platform, we can start with the active learning workflow. We are interested in the part where you have a trained model and are ready to run predictions on unlabeled data. We start by creating an &lt;code&gt;ActiveLearningAgent&lt;/code&gt;. This agent helps us manage the unlabeled images and makes sure we interface with the platform properly.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In our case, we don’t have a model in memory yet. Let’s load the pre-trained model from disk and get it ready to run predictions on unlabeled data.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Finally, we can use our pre-trained model to run predictions on the unlabeled data. It’s important that we use the same order of the individual files as we have on the Lightly Platform. We can simply do this by iterating over &lt;code&gt;al_agent.query_set&lt;/code&gt;, which contains a list of filenames in the right order.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In order to upload the predictions, we need to turn them into scores. Since we’re working on an object detection problem here, we use the &lt;code&gt;ScorerObjectDetection&lt;/code&gt;.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
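&lt;p&gt;To build intuition for what such a scorer computes, here is a self-contained stand-in (not Lightly's actual ScorerObjectDetection) that rates each image by its least confident box; the filenames and confidences are hypothetical:&lt;/p&gt;

```python
def detection_uncertainty(box_confidences):
    """Score an image by its least confident detection."""
    if not box_confidences:
        # No detections at all: treat the image as maximally uncertain.
        return 1.0
    return 1.0 - min(box_confidences)

# Hypothetical per-image box confidences from an object detector.
predictions = {
    "frame_001.png": [0.98, 0.95, 0.91],
    "frame_002.png": [0.55, 0.97],
    "frame_003.png": [],
}

scores = {name: detection_uncertainty(c) for name, c in predictions.items()}
```

&lt;p&gt;Images with a shaky detection (or none at all) end up with high scores and are prioritized for labeling.&lt;/p&gt;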


&lt;p&gt;We're finally ready to query the first batch of images.&lt;/p&gt;

&lt;h2&gt;
  
  
  Query the first 100 images for labeling
&lt;/h2&gt;

&lt;p&gt;To query data based on the model predictions and our embeddings on the Lightly Platform, we can use the &lt;code&gt;.query(...)&lt;/code&gt; method of the agent. We pass it a &lt;code&gt;SamplerConfig&lt;/code&gt; object describing the kind of sampling algorithm we want to run and its parameters.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
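&lt;p&gt;The sampling itself runs on the Lightly Platform, but the diversification idea can be sketched with a tiny k-center-greedy (coreset) routine on toy 2D embeddings (this is an illustration, not Lightly's production code):&lt;/p&gt;

```python
def k_center_greedy(embeddings, n_samples, first=0):
    """Greedily pick embeddings that are far from everything picked so far."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    selected = [first]
    for _ in range(n_samples - 1):
        # Among unselected points, take the one whose distance to its
        # nearest selected neighbor is largest (the most novel sample).
        remaining = [i for i in range(len(embeddings)) if i not in selected]
        best = max(
            remaining,
            key=lambda i: min(dist(embeddings[i], embeddings[j]) for j in selected),
        )
        selected.append(best)
    return selected

# Toy embeddings: two tight clusters far apart.
embeddings = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (10.0, 10.0), (10.1, 10.0)]
picked = k_center_greedy(embeddings, n_samples=2)
```

&lt;p&gt;Even with only two picks, the selection covers both clusters instead of drawing twice from the denser one; CORAL additionally weighs in the uncertainty scores.&lt;/p&gt;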


&lt;p&gt;After querying the new 100 images we can simply access their filenames using the &lt;code&gt;added_set&lt;/code&gt; of the active learning agent.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Congratulations, you did your (first) active learning iteration! &lt;br&gt;
Now you can label the 100 images and train your model with them. &lt;br&gt;
Active learning is usually done in a continuous feedback loop. After training your model using the new data you would do another iteration and predict + select another batch of images for labeling.&lt;/p&gt;

&lt;p&gt;I hope you got a good idea of how you can use active learning for your next computer vision project. For more information check out the &lt;a href="https://docs.lightly.ai/tutorials/platform/tutorial_active_learning_detectron2.html"&gt;Active Learning using Detectron2 on Comma10k tutorial&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Igor Susmelj, co-founder &lt;br&gt;
&lt;a href="https://lightly.ai/"&gt;Lightly&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The original post has been published here: &lt;a href="https://www.lightly.ai/post/active-learning-using-detectron2"&gt;https://www.lightly.ai/post/active-learning-using-detectron2&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>datascience</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The Advantage of Self-Supervised Learning
</title>
      <dc:creator>IgorSusmelj</dc:creator>
      <pubDate>Sat, 06 Mar 2021 20:52:08 +0000</pubDate>
      <link>https://forem.com/igorsusmelj/the-advantage-of-self-supervised-learning-2g03</link>
      <guid>https://forem.com/igorsusmelj/the-advantage-of-self-supervised-learning-2g03</guid>
      <description>&lt;p&gt;‍&lt;strong&gt;A few personal thoughts on why self-supervised learning will have a strong impact on AI. From recent NLP to computer vision papers.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is not a prediction but rather a summary of personal findings and trends from research and industry.&lt;/p&gt;

&lt;p&gt;First, let’s discuss the difference between self-supervised learning and &lt;strong&gt;unsupervised learning&lt;/strong&gt;. Whether there actually is a difference between the two is still an open discussion.&lt;br&gt;
Unsupervised learning is the idea of models learning without any supervision. Clustering algorithms are a common example: there is no supervision or training involved in how the clusters are formed (at least not for simple methods such as k-means).&lt;br&gt;
In &lt;strong&gt;self-supervised learning&lt;/strong&gt;, we use the data itself as the label. We essentially turn unsupervised learning into supervised learning by leveraging a so-called proxy task. A proxy task differs from the downstream task in that we are not interested in the proxy task’s output itself.&lt;/p&gt;

&lt;p&gt;In NLP, popular methods such as &lt;a href="https://arxiv.org/pdf/1810.04805.pdf"&gt;Google’s BERT, 2019&lt;/a&gt; use a pre-training procedure where the model predicts missing words within a sentence, or the next sentence based on the current one. We can create a sentence with a missing word by simply removing a single word from it. The ground-truth information (our label) is then the missing word, and we can train the model in a self-supervised way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WECa-ytF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/00cju0hfh2ocm72comq6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WECa-ytF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/00cju0hfh2ocm72comq6.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;computer vision&lt;/strong&gt;, we can apply the very same technique to train a model. We take an image and remove part of it (we essentially color it with a single color). The task of the model is to predict the missing pixels (we call this image inpainting). Since we have access to the original image and the missing pixels (ground truth) we can train the model in a supervised way. The paper &lt;a href="https://openaccess.thecvf.com/content_cvpr_2016/papers/Pathak_Context_Encoders_Feature_CVPR_2016_paper.pdf"&gt;Context Encoders: Feature Learning by Inpainting, CVPR, 2016&lt;/a&gt; is an example of such a self-supervised training procedure using inpainting. Unfortunately, this approach in computer vision doesn’t work that well.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tfnhc0Ty--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3rmdjosj3s2y6l06wpil.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tfnhc0Ty--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3rmdjosj3s2y6l06wpil.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Newer methods use image augmentations. A single image goes twice through an augmentation pipeline, so we end up with two new versions of the original image (we call them views). If we do the same for multiple images, we can train a model to find the pairs that belong to the same original image (before augmentation). We essentially teach the model to be invariant to whatever augmentations we choose.&lt;/p&gt;
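&lt;p&gt;A toy sketch of this pair-matching idea, with the augmentation pipeline simulated as small random noise on feature vectors (nothing like a full SimCLR setup):&lt;/p&gt;

```python
import random

random.seed(0)

def augment(x, strength=0.05):
    # Stand-in for an augmentation pipeline: a small random perturbation.
    return [v + random.uniform(-strength, strength) for v in x]

def similarity(a, b):
    # Cosine similarity between two embedding vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

# Three "images" (here: raw feature vectors for simplicity).
images = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]

# Each image passes through the augmentation pipeline twice: two views.
views_a = [augment(x) for x in images]
views_b = [augment(x) for x in images]

# For each view in A, the most similar view in B should be its own pair.
matches = [
    max(range(len(views_b)), key=lambda j: similarity(va, views_b[j]))
    for va in views_a
]
```

&lt;p&gt;A contrastive loss pushes the model toward exactly this behavior: views of the same image score high, views of different images score low.&lt;/p&gt;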

&lt;p&gt;Now, let’s have a look at the advantages self-supervised learning can bring to the world of AI.&lt;/p&gt;

&lt;h1&gt;
  
  
  Lifelong Learning
&lt;/h1&gt;

&lt;p&gt;When we talk about AI, we all think about some smart system learning over time and improving itself. Unfortunately, this is quite difficult. Supervised learning systems require new labels for new data to be trained on. Improving the system requires continuous re-labeling and re-training.&lt;/p&gt;

&lt;p&gt;However, using self-supervision we don’t require human labels anymore. There has been some great work into that direction from &lt;a href="http://people.eecs.berkeley.edu/~efros/"&gt;Alexey Efros Lab&lt;/a&gt; like the following paper using self-supervised learning for adapting to new environments in reinforcement learning: &lt;a href="https://arxiv.org/pdf/1705.05363.pdf"&gt;Curiosity-driven Exploration by Self-supervised Prediction, ICML, 2017&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Data Labeling
&lt;/h1&gt;

&lt;p&gt;Supervised learning requires ground-truth data. We call it labels or annotations, and in domains such as computer vision, they are mostly generated by humans. A single label can cost between a few cents and multiple dollars, depending on how much time the annotation task takes and how much expertise is required. Whereas lots of people can draw a bounding box around a car or a pedestrian, far fewer can do the same for medical images.&lt;/p&gt;

&lt;p&gt;Self-supervised learning can help reduce the required amount of labeling. On one hand, we can pre-train a model on unlabeled data and fine-tune it on a smaller labeled set. A popular example is &lt;a href="https://arxiv.org/pdf/2002.05709.pdf"&gt;A Simple Framework for Contrastive Learning of Visual Representations, ICML 2020&lt;/a&gt;. By the way, the last author of this paper is none other than Turing Award winner Geoffrey Hinton. On the other hand, we can use the features obtained from a self-supervised model to guide the selection of which data to label, for example by simply picking data samples that are diverse rather than similar. We do this at &lt;a href="https://lightly.ai/"&gt;Lightly&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I hope you got an idea of how self-supervised learning works and why there is a good reason to be excited about it. If you’re interested in self-supervised learning in computer vision don’t forget to check out our &lt;a href="https://github.com/lightly-ai/lightly"&gt;open-source Python framework for self-supervised learning on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Igor, co-founder&lt;br&gt;
&lt;a href="https://lightly.ai/"&gt;Lightly.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post has originally been published here: &lt;a href="https://www.lightly.ai/post/the-advantage-of-self-supervised-learning"&gt;https://www.lightly.ai/post/the-advantage-of-self-supervised-learning&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>ai</category>
    </item>
    <item>
      <title>Embedded COVID mask detection on an Arm Cortex-M7 processor using PyTorch</title>
      <dc:creator>IgorSusmelj</dc:creator>
      <pubDate>Tue, 23 Feb 2021 19:26:00 +0000</pubDate>
      <link>https://forem.com/igorsusmelj/embedded-covid-mask-detection-on-an-arm-cortex-m7-processor-using-pytorch-b4d</link>
      <guid>https://forem.com/igorsusmelj/embedded-covid-mask-detection-on-an-arm-cortex-m7-processor-using-pytorch-b4d</guid>
      <description>&lt;p&gt;&lt;strong&gt;How we built a visual COVID-19 mask quality inspection prototype running on-device on an OpenMV-H7 board and the challenges on the way.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;TLDR; The source code to train and deploy your own image classifier can be found here: &lt;a href="https://github.com/ARM-software/EndpointAI/tree/master/ProofOfConcepts/Vision/OpenMvMaskDefaults"&gt;https://github.com/ARM-software/EndpointAI/tree/master/ProofOfConcepts/Vision/OpenMvMaskDefaults&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the summer of 2020, we worked with Arm to build an easy-to-use tutorial on how to train and deploy an image classifier on an Arm microcontroller. In this post, we show how we approached and solved the following challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Convert a PyTorch ResNet to TensorFlow and quantize it to use 8-bit integer values&lt;/li&gt;
&lt;li&gt;Collect, select, and annotate data of faulty and non-faulty masks&lt;/li&gt;
&lt;li&gt;Use self-supervised pre-training to boost model performance when working with fewer images&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  The Results to Expect
&lt;/h1&gt;

&lt;p&gt;The goal of this project was to show an end-to-end workflow on how to train and deploy a convolutional neural network to an OpenMV-H7 board.&lt;/p&gt;

&lt;p&gt;The video below showcases how our classifier detects faulty masks in real-time.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/ba1c1JkBnNc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h1&gt;
  
  
  The OpenMV-H7 Board
&lt;/h1&gt;

&lt;p&gt;The board consists of an &lt;a href="https://www.st.com/en/microcontrollers/stm32h743vi.html"&gt;STM32H743VI&lt;/a&gt; Arm Cortex-M7 processor running at 480MHz, multiple peripherals, and a camera module mounted on it.&lt;br&gt;
The camera module has an &lt;a href="http://www.ovt.com/products/sensor.php?id=80"&gt;OV7725&lt;/a&gt; sensor from OmniVision and can record in VGA resolution (640x480) at 75 FPS.&lt;/p&gt;

&lt;p&gt;Since the board has limited computing power and memory, we aimed for a very small deep learning model. We call our variant ResNet-9 since it’s essentially a ResNet-18 cut in half. Below you can find some numbers about the model configuration, runtime, and other metrics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Input size:&lt;/strong&gt; 64x64x3&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU Freq.:&lt;/strong&gt; 480 MHz&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations:&lt;/strong&gt; 33.4 MOp&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model size:&lt;/strong&gt; 90 kBytes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inference Time:&lt;/strong&gt; 150 ms&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Operations/s:&lt;/strong&gt; 249 MOp/s&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Detailed specs can be found on the official OpenMV website.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E-kSGCiz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/548pfoiw8rahqtcszi8l.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E-kSGCiz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/548pfoiw8rahqtcszi8l.png" alt="image"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;A close-up picture of the OpenMV H7 Board we used.&lt;/p&gt;
&lt;h1&gt;
  
  
  Data Collection
&lt;/h1&gt;

&lt;p&gt;Neural networks are very data-hungry. In order to efficiently collect enough training data we did the following:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We used the camera on the OpenMV-H7 board to record video sequences. With the USB interface and the OpenMV IDE, we were able to easily record the camera stream and save it as a video file.&lt;/li&gt;
&lt;li&gt;To simulate a real production line, we mounted the camera on cardboard to make sure it is stable. The optics point at the production line, which is a metal plate with tall borders. This setup ensures that the camera sees defect and non-defect masks within the same environment.&lt;/li&gt;
&lt;li&gt;Finally, we moved masks through our inspection line using a combination of push and pull.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eoyLtJnz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/esb13mkd6v8karb5bde7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eoyLtJnz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/esb13mkd6v8karb5bde7.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A picture of our data collection pipeline. We cut a small hole into the cardboard to clamp the USB cable holding the board into it.&lt;/p&gt;
&lt;h1&gt;
  
  
  Data Selection and Annotation
&lt;/h1&gt;

&lt;p&gt;At this stage, we have multiple video files, each a few minutes long. The next challenge is to extract the frames and annotate the data. We use FFmpeg for the frame extraction and Lightly to select a diverse set of frames. Note that we had more than 20k frames but no time to annotate all of them; using Lightly, we selected a few hundred frames covering all relevant scenarios.&lt;br&gt;
Lightly uses self-supervised learning to get good representations of the images. It then uses these representations to select the most interesting images for annotation. The benefit of this method is that we can access the pre-trained model and fine-tune it on only a handful of labeled images.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tiK7a2OP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/brzh8kubztkf945jvrl4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tiK7a2OP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/brzh8kubztkf945jvrl4.png" alt="image"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Example images taken with the OpenMV H7 camera showing the three labels for the data. From left to right: good mask, defect mask, no mask.&lt;/p&gt;
&lt;h3&gt;
  
  
  Model Fine-Tuning
&lt;/h3&gt;

&lt;p&gt;To prevent the model from overfitting, we simply froze the pre-trained backbone and added a linear classification head to the model. We then trained the classifier for 100 epochs on a total of 500 annotated images.&lt;/p&gt;

&lt;h1&gt;
  
  
  From PyTorch to Keras to TensorFlow Lite
&lt;/h1&gt;

&lt;p&gt;Moving the pre-trained PyTorch model to TensorFlow Lite turned out to be the most difficult part of our endeavor.&lt;/p&gt;


&lt;p&gt;We tried out several tricks with ONNX to export our model. A simple library called pytorch2keras worked fine for a model only consisting of linear layers but not for our conv + linear model.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The main problem we encountered was that PyTorch uses the CxHxW (channel, height, width) format for tensors, whereas TensorFlow uses HxWxC. This meant that, after transforming our model to TensorFlow Lite, the output of the layer just before the classifier was permuted, and hence the output of the classifier was incorrect. To address this problem, we considered manually permuting the weights of the linear classifier.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
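&lt;p&gt;The ordering problem can be illustrated with a tiny made-up tensor: flattening the same values in CHW and HWC order gives different vectors, and a fixed index permutation maps one onto the other, which is what permuting the classifier's weight columns would exploit:&lt;/p&gt;

```python
# Why the classifier breaks: flattening a CxHxW tensor and an HxWxC
# tensor of the same data yields differently ordered vectors.
C, H, W = 2, 2, 2

# chw[c][h][w] and hwc[h][w][c] hold the same values.
chw = [[[c * 100 + h * 10 + w for w in range(W)] for h in range(H)] for c in range(C)]
hwc = [[[chw[c][h][w] for c in range(C)] for w in range(W)] for h in range(H)]

flat_chw = [chw[c][h][w] for c in range(C) for h in range(H) for w in range(W)]
flat_hwc = [hwc[h][w][c] for h in range(H) for w in range(W) for c in range(C)]

# The permutation mapping HWC positions back to CHW-flattened indices;
# applying it to the linear layer's weight columns would fix the order.
perm = [c * H * W + h * W + w for h in range(H) for w in range(W) for c in range(C)]
reordered = [flat_chw[i] for i in perm]
```

&lt;p&gt;With global pooling down to Cx1x1, H and W collapse to 1 and both flattening orders coincide, which is why the simpler solution below works.&lt;/p&gt;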


&lt;p&gt;However, we decided to go for a simpler solution. We pooled the output of the last convolutional layer into a Cx1x1 shape. That way, changing the order of the channels does not affect the output of the neural network.&lt;/p&gt;

&lt;p&gt;The final step is to quantize and export the Keras model to TensorFlow Lite. In our case, quantization reduces the model size and speeds up inference at the cost of a few percent lower accuracy.&lt;/p&gt;
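&lt;p&gt;For intuition, here is a toy affine (scale and zero-point) 8-bit quantization of a few made-up weights, showing the roughly 4x storage saving over 32-bit floats and the small reconstruction error it introduces (TensorFlow Lite's actual scheme is more involved):&lt;/p&gt;

```python
# Toy affine int8-style quantization of float weights.
weights = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5]

lo, hi = min(weights), max(weights)
scale = (hi - lo) / 255.0
zero_point = round(-lo / scale)

# Map each float to an integer in [0, 255] and back again.
quantized = [round(w / scale) + zero_point for w in weights]
dequantized = [(q - zero_point) * scale for q in quantized]

# 8-bit storage instead of 32-bit floats, at the cost of a small
# per-weight reconstruction error bounded by half the scale.
max_error = max(abs(w - d) for w, d in zip(weights, dequantized))
```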


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Special thanks to our collaborators at Arm and &lt;a href="https://medium.com/u/297b71d663f3"&gt;Philipp Wirth&lt;/a&gt; from Lightly for making this project possible. The &lt;a href="https://github.com/ARM-software/EndpointAI/tree/master/ProofOfConcepts/Vision/OpenMvMaskDefaults"&gt;full source code&lt;/a&gt; is available here. You can easily train your own classifier and run it on an embedded device. Feel free to reach out or leave a comment if you have any questions!&lt;/p&gt;

&lt;p&gt;Igor, co-founder&lt;br&gt;
&lt;a href="https://lightly.ai"&gt;Lightly.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The original post was published here: &lt;a href="https://lightly.ai/post/embedded-covid-mask-detection-on-an-arm-m7-using-pytorch"&gt;https://lightly.ai/post/embedded-covid-mask-detection-on-an-arm-m7-using-pytorch&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;


</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>Few-Shot Learning with fast.ai</title>
      <dc:creator>IgorSusmelj</dc:creator>
      <pubDate>Fri, 28 Aug 2020 18:12:23 +0000</pubDate>
      <link>https://forem.com/igorsusmelj/few-shot-learning-with-fast-ai-1po9</link>
      <guid>https://forem.com/igorsusmelj/few-shot-learning-with-fast-ai-1po9</guid>
      <description>&lt;p&gt;In few-shot learning, we train a model using only a few labeled examples. Learn how to train your classifier using transfer learning and a novel framework for sample selection.&lt;/p&gt;

&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;Lately, posts and tutorials about new deep learning architectures and training strategies have dominated the community. However, one very interesting research area, namely few-shot learning, is not getting the attention it deserves. If we want widespread adoption of ML, we need ways to train models efficiently, with little data and little code. In this tutorial, we will go through a Google Colab notebook to train an image classification model using only 5 labeled samples per class. Using only 5 examples per class is also called 5-shot learning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S4wYhTkE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1017/1%2AaG6UvkQordnAzMN0OCJbPw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S4wYhTkE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1017/1%2AaG6UvkQordnAzMN0OCJbPw.png" alt="Image showing top losses of our trained classifier"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Don’t forget to check out our Google Colab Notebook for the full code of this tutorial!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h1&gt;
  
  
  Frameworks and libraries we use
&lt;/h1&gt;

&lt;h2&gt;
  
  
  Jupyter Notebook (Google Colab)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/drive/11j4aJv50UxEVZbuUluOGqCMKIZrx7pUM?usp=sharing"&gt;The full code of this tutorial will be provided as a notebook&lt;/a&gt;. Jupyter Notebooks are python programming environments accessible by web browsers and are very useful for fast prototyping and experiments. Colab is a service from Google where you get access to notebooks running on instances for free.&lt;/p&gt;

&lt;h1&gt;
  
  
  Fast.ai
&lt;/h1&gt;

&lt;p&gt;Training a deep learning model can be quite complicated and involve hundreds of lines of code. This is where &lt;a href="https://www.fast.ai/"&gt;fast.ai&lt;/a&gt; comes to the rescue: a library developed by former Kaggler Jeremy Howard, aimed specifically at making the training of deep learning models fast and simple. Using fast.ai, we can train and evaluate our classifier with just a few lines of code. Under the hood, fast.ai uses the PyTorch framework.&lt;/p&gt;

&lt;h1&gt;
  
  
  WhatToLabel and borisml
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.whattolabel.com/"&gt;WhatToLabel&lt;/a&gt; and it’s python package &lt;a href="https://pypi.org/project/borisml/"&gt;borisml&lt;/a&gt; aim to solve the question which samples you should work with. If you only label a few samples out of your dataset one of the key questions arising is how do you pick the samples? WhatToLabel aims at solving exactly this problem by providing you with different methods and metrics for selecting your samples&lt;/p&gt;

&lt;h1&gt;
  
  
  Setup your Notebook
&lt;/h1&gt;

&lt;p&gt;We start by installing the necessary dependencies and downloading the dataset. You can run any shell command in a notebook by starting the line with an &lt;strong&gt;“!”&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;E.g. to install our dependencies we can run the following code within a notebook cell:&lt;/p&gt;


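&lt;p&gt;For example, an install cell might look like this (package names as introduced above):&lt;/p&gt;

```shell
# In a notebook cell, prefix this with "!" to run it as a shell command:
pip install fastai borisml
```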


&lt;p&gt;In this tutorial, we work with a dataset consisting of cat and dog images. You can download it from &lt;a href="https://www.kaggle.com/"&gt;Kaggle&lt;/a&gt; using the fastai CLI (command-line interface). Note that you need to use the API token you get from Kaggle.&lt;/p&gt;




&lt;h1&gt;
  
  
  Select the samples for few-shot learning
&lt;/h1&gt;

&lt;p&gt;In order to get robust results with our few-shot learning algorithm, we want our training set to cover the full space of samples. That means we don’t want lots of similar examples, but rather a very diverse set of images. To achieve this, we can create an embedding of our dataset followed by a sampling method called coreset sampling[1]. Coreset sampling builds up the subset by always adding the sample that lies as far as possible from the already-selected set.&lt;/p&gt;
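&lt;p&gt;A minimal sketch of greedy coreset (k-center) selection on precomputed embeddings (an illustration of the idea, not borisml's implementation):&lt;/p&gt;

```python
import numpy as np

def coreset_greedy(embeddings, n_samples):
    """Greedy k-center selection: repeatedly pick the point farthest
    from the already-selected set (simplified coreset sampling [1])."""
    selected = [0]  # start from an arbitrary point
    # distance from every point to its nearest selected point
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < n_samples:
        idx = int(np.argmax(dists))  # farthest point joins the set
        selected.append(idx)
        new_d = np.linalg.norm(embeddings - embeddings[idx], axis=1)
        dists = np.minimum(dists, new_d)
    return selected

# Two tight clusters: greedy selection covers both, instead of
# drawing all samples from one cluster as random sampling might.
rng = np.random.default_rng(0)
cluster_a = rng.normal(0.0, 0.1, size=(50, 2))
cluster_b = rng.normal(5.0, 0.1, size=(50, 2))
emb = np.concatenate([cluster_a, cluster_b])
picked = coreset_greedy(emb, 10)
assert any(i < 50 for i in picked) and any(i >= 50 for i in picked)
```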

&lt;p&gt;Now we will use &lt;a href="https://www.whattolabel.com/"&gt;WhatToLabel&lt;/a&gt; and its Python package &lt;a href="https://pypi.org/project/borisml/"&gt;borisml&lt;/a&gt; to select the 10 most diverse samples to work with. We first need to create an embedding. Borisml lets us do this without any labels by leveraging recent advances in self-supervised learning: a single CLI command trains a model for a few epochs and creates our embedding.&lt;/p&gt;




&lt;p&gt;Finally, we need to upload our dataset and embedding to the &lt;a href="https://app.whattolabel.com/"&gt;WhatToLabel app&lt;/a&gt; to run our selection algorithm. Since we don’t want to upload the images themselves, we can tell the CLI to consider only the metadata of the samples.&lt;/p&gt;




&lt;p&gt;Once the data and embedding are uploaded, we can go back to the web platform and run our sampling algorithm. This might take a minute to complete. If everything went smoothly, you should see a plot with a slider. Move the slider to the left to keep only 10 samples in the new subset. Hint: you can use the arrow keys to move the slider step by step. Once we have our 10 samples selected, we need to create a new tag (left menu). For this tutorial, we use “tiny” as the name and press the enter key to create it.&lt;/p&gt;

&lt;p&gt;You can then download the newly created subset using the CLI.&lt;/p&gt;




&lt;p&gt;You might notice that the dataset you downloaded is not perfectly balanced, e.g. 4 images of cats and 6 of dogs. This is due to the algorithm we chose for selecting the samples: our goal was to cover the whole embedding/feature space. It might very well be that the cat images in our dataset are more similar to each other than the dog images; since near-duplicates add little coverage, more images of dogs than cats will be selected.&lt;/p&gt;

&lt;h1&gt;
  
  
  Train our model using fast.ai
&lt;/h1&gt;

&lt;p&gt;If you have reached this point, you should have a dataset, obtained using WhatToLabel and coreset sampling, ready for training our classifier.&lt;/p&gt;

&lt;p&gt;Fast.ai requires only a few lines of code to train an image classifier. We first need to create a dataset and then a learner object. Finally, we train the model using the &lt;code&gt;.fit(...)&lt;/code&gt; method.&lt;/p&gt;


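&lt;p&gt;A sketch of what these few lines look like, assuming the fast.ai v1 API that was current when this post was written (paths and hyperparameters are illustrative, not the exact notebook code):&lt;/p&gt;

```python
from fastai.vision import ImageDataBunch, cnn_learner, models, accuracy

# Build a data bunch from an image folder, fine-tune a pretrained
# ResNet-34 (transfer learning), and train for a few epochs.
data = ImageDataBunch.from_folder('data/tiny', valid_pct=0.2, size=224)
learn = cnn_learner(data, models.resnet34, metrics=accuracy)
learn.fit(10)
```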


&lt;h1&gt;
  
  
  Interpreting the results
&lt;/h1&gt;

&lt;p&gt;To evaluate our model we use the test set of the cats and dogs dataset consisting of 2'000 images. Looking at the confusion matrix we see that our model mostly struggles with predicting dogs as being cats.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--y0QLu3qJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/417/1%2AkhfUjAkLFaJxeFqmt799Vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--y0QLu3qJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/417/1%2AkhfUjAkLFaJxeFqmt799Vg.png" alt="Image showing confusion matrix of our trained model"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Fast.ai also helps here by producing interpretable performance plots of our model with just a few lines of code.&lt;/p&gt;




&lt;p&gt;The library also lets us look at the test-set images on which the trained model has the highest loss. You can see that the model struggles with smaller dogs that look more similar to cats. We could improve accuracy by selecting more samples for training. However, the goal of this tutorial was to show that by leveraging transfer learning and a smart data selection process, you can already get high accuracy (&amp;gt;80%) with just a handful of training data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S4wYhTkE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1017/1%2AaG6UvkQordnAzMN0OCJbPw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S4wYhTkE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://miro.medium.com/max/1017/1%2AaG6UvkQordnAzMN0OCJbPw.png" alt="Image showing top losses of our trained classifier"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I hope you enjoyed this brief guide on how to use few-shot learning using fast.ai and WhatToLabel. Follow me for further tutorials on Medium!&lt;/p&gt;

&lt;p&gt;Igor, co-founder&lt;br&gt;
&lt;a href="https://www.whattolabel.com/"&gt;whattolabel.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;[1] Sener O., Savarese S. (2017), Active Learning for Convolutional Neural Networks: A Core-Set Approach&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Rotoscoping: Hollywood's video data segmentation?</title>
      <dc:creator>IgorSusmelj</dc:creator>
      <pubDate>Fri, 15 May 2020 15:32:26 +0000</pubDate>
      <link>https://forem.com/igorsusmelj/rotoscoping-hollywood-s-video-data-segmentation-2ckn</link>
      <guid>https://forem.com/igorsusmelj/rotoscoping-hollywood-s-video-data-segmentation-2ckn</guid>
      <description>&lt;h4&gt;
  
  
  In Hollywood, video data segmentation has been done for decades. Simple tricks such as color keying with green screens can reduce work significantly.
&lt;/h4&gt;

&lt;p&gt;In late 2018 we worked on a video segmentation toolbox. One of the common problems in video editing is an oversaturated or too-bright sky when shooting a scene. Most skies in movies have been replaced by VFX specialists; the task is called “sky replacement”. We thought this was the perfect starting point for introducing automatic segmentation to mask the sky for replacement. Based on the experience we gathered, I will explain the similarities between VFX and data annotation.&lt;/p&gt;

&lt;p&gt;Below you find a comparison of the solution we built against Deeplab v3+, which was at the time considered the best image segmentation model. Our method (left) produced better detail around the buildings and significantly reduced the flickering between frames.&lt;/p&gt;

&lt;p&gt;Comparison of our sky segmentation model and Deeplab v3+&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/zi0qLwqqx28"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Video segmentation techniques of Hollywood
&lt;/h3&gt;

&lt;p&gt;In this section, we take a closer look at color keying (using, for example, green screens) and at rotoscoping.&lt;/p&gt;

&lt;h4&gt;
  
  
  What is color keying?
&lt;/h4&gt;

&lt;p&gt;I’m pretty sure you heard about color keying or green screens. Maybe you even used such tricks yourself when editing a video using a tool such as Adobe After Effects, Nuke, Final Cut, or any other software.&lt;/p&gt;

&lt;p&gt;I did a lot of video editing myself in my childhood: making videos for fun with friends and adding cool effects in tools such as After Effects, watching tutorials from &lt;a href="https://www.videocopilot.net/"&gt;videocopilot.net&lt;/a&gt; and &lt;a href="https://creativecow.com/"&gt;creativecow.com&lt;/a&gt; day and night. I remember playing with a friend and wooden sticks in the backyard of my family’s house just to replace them with lightsabers hours later.&lt;/p&gt;

&lt;p&gt;In case you don’t know how a green screen works, the video below gives a better explanation than I could with words.&lt;/p&gt;

&lt;p&gt;Video explaining how a green screen works&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/A0h_BVLRSeI"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Essentially, a green screen relies on color keying: the color green is masked out of the footage, and the resulting mask can be used to blend in another background. And the beauty is that we don’t need a fancy image segmentation model burning your GPU, but rather a simple algorithm looking for neighboring pixels of the desired color.&lt;/p&gt;
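&lt;p&gt;A toy version of such a color keying algorithm in numpy (real keyers work in other color spaces and produce soft edges, but the principle is the same):&lt;/p&gt;

```python
import numpy as np

def color_key_mask(image, key_color, tolerance=30):
    """Mask all pixels within `tolerance` of the key color (e.g. green).
    Simple per-pixel distance in RGB; real keyers also handle spill
    and transparency."""
    diff = np.linalg.norm(image.astype(np.int32) - np.asarray(key_color), axis=-1)
    return diff < tolerance  # True where the background should be removed

# 4x4 image: green background with a red 2x2 "subject" in the middle
img = np.zeros((4, 4, 3), dtype=np.uint8)
img[...] = (0, 255, 0)          # green screen
img[1:3, 1:3] = (200, 30, 30)   # subject
mask = color_key_mask(img, (0, 255, 0))
assert mask.sum() == 12         # 16 pixels minus the 2x2 subject
```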

&lt;h4&gt;
  
  
  What is rotoscoping?
&lt;/h4&gt;

&lt;p&gt;As you can imagine, special effects in many Hollywood movies require scenes more complex than those where you can simply use a colored background to mask elements. Imagine a scene with animals that might be scared of a strongly colored screen, or a scene with lots of hair flowing in the wind. A simple color keying approach isn’t enough.&lt;/p&gt;

&lt;p&gt;But Hollywood found a technique for this problem many years ago, too: &lt;strong&gt;Rotoscoping&lt;/strong&gt;.&lt;br&gt;
To give you a better idea of what rotoscoping is, I embedded a video below: a tutorial on rotoscoping in After Effects. Using a special toolbox, you can draw splines and polygons around objects throughout a video, and the toolbox automatically interpolates between frames, saving you lots of time.&lt;/p&gt;

&lt;p&gt;After effects tutorial on rotoscoping&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/ZqAyS2AMvG4"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;This technology, introduced in After Effects in 2003, has been around for almost two decades and has been used by many VFX specialists and freelancers since.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Silhouette&lt;/strong&gt;, in contrast to After Effects, is a tool focusing solely on rotoscoping. You get an idea of their latest product updates in &lt;a href="https://www.youtube.com/watch?v=NwbbHFO8Rl0"&gt;this video&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I picked one example to show how detailed the result of rotoscoping can be. The three elements in the following video from MPC Academy that blow my mind are &lt;strong&gt;motion blur, fine-grained detail on hair, and frame-to-frame consistency&lt;/strong&gt;. When we worked on a product for VFX editors, we learned that the quality requirements in this industry go beyond what we have in image segmentation. There is simply no dataset or model in computer vision that fulfills the Hollywood standard.&lt;/p&gt;

&lt;p&gt;Rotoscoping demo reel from MPC Academy.&lt;br&gt;
Search for “roto showreel” on YouTube and you will find many more examples.&lt;br&gt;
&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/PQS9ov636ik"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  How is VFX rotoscoping different from semantic segmentation?
&lt;/h3&gt;

&lt;p&gt;There are &lt;strong&gt;differences in both quality and how the quality assurance/ inspection works.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Tools and workflow comparison
&lt;/h4&gt;

&lt;p&gt;The tools and workflows in VFX and in data annotation are surprisingly similar, since both serve a similar goal. Rotoscoping tools, like professional annotation tools, support tracking objects and working with polygons and splines. Both allow changing brightness and contrast to help you find edges. One key difference is that in rotoscoping you work with transparency for motion blur or hair, whereas in segmentation we usually have a fixed number of classes and no interpolation between them.&lt;/p&gt;

&lt;h4&gt;
  
  
  Quality inspection comparison
&lt;/h4&gt;

&lt;p&gt;In data annotation, quality inspection is usually automated using a simple trick: we let multiple people annotate the same data and compare their results. If all annotators agree, the confidence is high and the annotation is considered good. If they only partially agree and the agreement is below a certain threshold, an additional round of annotation or manual inspection takes place.&lt;br&gt;
In VFX, however, annotation is usually done by a single person. That person has been trained on the task and has to deliver very high quality; the customer or supervisor makes the annotator redo the work if the quality is not good enough. There is no automatically obtained metric. All inspection is done manually by the trained eye of VFX experts. There is even a term, &lt;a href="https://www.urbandictionary.com/define.php?term=pixel-fucking"&gt;“pixel fucking”&lt;/a&gt;, illustrating the required pixel-level perfectionism.&lt;/p&gt;

&lt;h3&gt;
  
  
  How we trained our model for sky segmentation
&lt;/h3&gt;

&lt;p&gt;Let’s get back to our model. In the beginning, you saw a comparison between our result and &lt;a href="https://arxiv.org/abs/1802.02611"&gt;Deeplab v3+, 2018&lt;/a&gt;. You will notice that the quality of our video segmentation is higher and shows less flickering. For high-quality segmentation, we had to create our own dataset. We used Full HD cameras mounted on tripods to record footage of the sky; this way, a detailed segmentation around buildings and static objects can be reused throughout the whole shot. We used &lt;a href="https://www.foundry.com/products/nuke"&gt;Nuke&lt;/a&gt; to create the annotated data.&lt;/p&gt;

&lt;p&gt;Image showing the soft contours used for rotoscoping.&lt;br&gt;
We blurred the edges around the skyline.&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bAZrhmYO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://data-annotation.com/wp-content/uploads/2020/04/zoom.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bAZrhmYO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://data-annotation.com/wp-content/uploads/2020/04/zoom.jpg" alt="video data segmentation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, we used publicly available and license-free videos of trees, people, and other moving elements in front of simple backgrounds. To obtain ground-truth information, we simply used color keying. It worked like a charm, and we had pixel-accurate segmentation of 5-minute shots within a few hours. For additional diversity within the samples, we used our video editing tool to crop out parts of the videos while moving the camera around: a 4K original became a Full HD frame moving across it with smooth motion. For some shots, we even broke out of the typical binary masks and used smooth edges, interpolated between full black and full white. Segmentation masks are usually binary, black or white; we used the gray levels in between when the scene was blurry.&lt;/p&gt;
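&lt;p&gt;The idea of soft, non-binary masks is easiest to see as alpha blending (a minimal numpy sketch):&lt;/p&gt;

```python
import numpy as np

# Intermediate gray values in the mask act as an alpha channel,
# blending foreground and background at blurry or motion-blurred edges
# instead of cutting a hard line.
fg = np.full((2, 2, 3), 200.0)   # foreground color
bg = np.full((2, 2, 3), 20.0)    # replacement background
alpha = np.array([[1.0, 0.5],    # 0.5 = half-transparent edge pixel
                  [0.0, 1.0]])[..., None]
composite = alpha * fg + (1.0 - alpha) * bg
assert composite[0, 0, 0] == 200.0  # pure foreground
assert composite[0, 1, 0] == 110.0  # blended edge: 0.5*200 + 0.5*20
assert composite[1, 0, 0] == 20.0   # pure background
```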

&lt;p&gt;Color keying allowed us to get ground truth data for complicated scenes such as leaves or hair. The following picture of a palm tree has been masked/ labeled using simple color keying.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1OOx3cL3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://data-annotation.com/wp-content/uploads/2020/04/city-colored.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1OOx3cL3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://data-annotation.com/wp-content/uploads/2020/04/city-colored.gif" alt="Recording of city"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eSu1gjBX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://data-annotation.com/wp-content/uploads/2020/04/city-masked.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eSu1gjBX--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://data-annotation.com/wp-content/uploads/2020/04/city-masked.gif" alt="Recording of city masked"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ajWWMPGk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://data-annotation.com/wp-content/uploads/2020/04/palm-tree.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ajWWMPGk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://data-annotation.com/wp-content/uploads/2020/04/palm-tree.jpg" alt="Palm tree segmentation using color keying"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For simple scenes, color keying was more than good enough to get detailed results. One could now also replace the background with a new one to augment the data.&lt;/p&gt;

&lt;p&gt;This worked for all kinds of trees, and it even helped us obtain good results for whole videos, since we could simply adapt the color keying parameters during the clip.&lt;/p&gt;

&lt;p&gt;Also, this frame has been masked using simple color keying methods&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jzWT2kqK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://data-annotation.com/wp-content/uploads/2020/04/tree.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jzWT2kqK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://data-annotation.com/wp-content/uploads/2020/04/tree.jpg" alt="Tree segmentation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To give you an idea of the temporal quality of our color keying experiments, have a look at the gif below. Note that there is a little jitter; we added it on purpose to “simulate” recording with a handheld camera. The camera movement itself is a simple linear interpolation of the crop across the whole scene, so what you see below is just a crop of the full view.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ZllhSHu2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://data-annotation.com/wp-content/uploads/2020/04/trees-masked.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ZllhSHu2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://data-annotation.com/wp-content/uploads/2020/04/trees-masked.gif" alt="tree segmentation animation"&gt;&lt;/a&gt;&lt;br&gt;
This mask has been obtained using color keying for the first frame. The subsequent frames might only need a small modification of the color keying parameters. We did such adaptation every 30-50 frames and let the tool interpolate the parameters between them.&lt;/p&gt;

&lt;h4&gt;
  
  
  Training the model
&lt;/h4&gt;

&lt;p&gt;To train the model, we added an additional loss term on the pixels close to the mask borders. This helped a lot to improve the fine details. We played around with various parameters and architecture changes; the simple U-Net model worked well enough. We trained the model not on the full images but on crops of around 512×512 pixels. We also read up on Kaggle competitions such as the &lt;a href="https://www.kaggle.com/c/carvana-image-masking-challenge"&gt;Carvana image masking challenge from 2017&lt;/a&gt; for additional inspiration.&lt;/p&gt;
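&lt;p&gt;One way to build such a border-weighted loss is to up-weight pixels near the mask boundary. A toy numpy version of the weight map (our actual implementation differed; real code would use a distance transform, e.g. from scipy.ndimage):&lt;/p&gt;

```python
import numpy as np

def border_weight_map(mask, border=1, weight=5.0):
    """Give pixels near the mask border a higher loss weight.
    Shift-based dilation/erosion over the 4-neighborhood."""
    m = mask.astype(bool)
    grown = m.copy()
    shrunk = m.copy()
    for axis in (0, 1):
        for shift in (-border, border):
            rolled = np.roll(m, shift, axis=axis)
            grown |= rolled    # dilation
            shrunk &= rolled   # erosion
    border_px = grown & ~shrunk  # pixels whose neighborhood is mixed
    return np.where(border_px, weight, 1.0)

mask = np.zeros((6, 6), dtype=np.uint8)
mask[2:4, 2:4] = 1                 # small square object
w = border_weight_map(mask)
assert w[2, 2] == 5.0              # on the border: up-weighted
assert w[0, 0] == 1.0              # far from the border: normal weight
```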

&lt;h4&gt;
  
  
  Adversarial training for temporal consistency
&lt;/h4&gt;

&lt;p&gt;Now that we had our dataset we started training the segmentation model. For the model, we used a &lt;a href="https://arxiv.org/pdf/1505.04597.pdf"&gt;U-Net architecture&lt;/a&gt;, since the sky can span the whole image and we don’t need to consider various sizes as we would need to for objects.&lt;/p&gt;

&lt;p&gt;In order to improve the temporal consistency of the model (i.e. to remove the flickering), we co-trained a discriminator that always saw three sequential frames and had to distinguish whether they came from our model or from the dataset. The training procedure was otherwise quite simple; the model trained for only a day on an Nvidia GTX 1080Ti.&lt;/p&gt;

&lt;p&gt;So for your next video data segmentation project, you might want to have a look at whether you can use any of these tricks to collect data and save lots of time. In my other posts, you will find a list of &lt;a href="https://data-annotation.com/tools-and-frameworks/"&gt;data annotation tools&lt;/a&gt;. In case you don’t want to spend any time on manual annotation there is also a list of &lt;a href="https://data-annotation.com/list-of-data-annotation-companies/"&gt;data annotation companies&lt;/a&gt; available.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I’d like to thank Momo and Heiki who worked on the project with me. An additional thank goes to all the VFX artists and studios for their feedback and fruitful discussions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note: This post was originally published on &lt;a href="//data-annotation.com"&gt;data-annotation.com&lt;/a&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Curated List of Data Annotation Companies</title>
      <dc:creator>IgorSusmelj</dc:creator>
      <pubDate>Sat, 04 Apr 2020 15:32:07 +0000</pubDate>
      <link>https://forem.com/igorsusmelj/curated-list-of-data-annotation-companies-2gip</link>
      <guid>https://forem.com/igorsusmelj/curated-list-of-data-annotation-companies-2gip</guid>
      <description>&lt;p&gt;After sharing a list of tools and frameworks around data annotation I decided to also collect and maintain a &lt;a href="https://data-annotation.com/list-of-data-annotation-companies/"&gt;list of data annotation companies and service providers&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Quite often we don't want to spend hours classifying images, drawing bounding boxes, or creating segmentation maps. Since there are plenty of companies that take on these cumbersome tasks, it often makes sense as an ML engineer to use their services and spend your own time on model training and optimization.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>dataannotation</category>
    </item>
    <item>
      <title>Data Annotation Tools and Frameworks</title>
      <dc:creator>IgorSusmelj</dc:creator>
      <pubDate>Mon, 02 Mar 2020 18:34:38 +0000</pubDate>
      <link>https://forem.com/igorsusmelj/data-annotation-tools-and-frameworks-g43</link>
      <guid>https://forem.com/igorsusmelj/data-annotation-tools-and-frameworks-g43</guid>
      <description>&lt;p&gt;I started creating my own &lt;a href="https://data-annotation.com/tools-and-frameworks/"&gt;list of data annotation tools&lt;/a&gt; and frameworks for my machine learning projects. I work a lot with computer vision data and used to build my own little annotation tools. However, there are plenty of open-source tools available!&lt;/p&gt;

&lt;p&gt;I thought it would be helpful for some of you!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vLwQ3dmd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8ks1l1ef5ux1qgj1y2z7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vLwQ3dmd--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/8ks1l1ef5ux1qgj1y2z7.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I will start blogging more about my deep learning related projects here. So stay tuned for more interesting content!&lt;/p&gt;

</description>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
