<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Shamanth Shetty</title>
    <description>The latest articles on Forem by Shamanth Shetty (@shetty_07).</description>
    <link>https://forem.com/shetty_07</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F874280%2Fbd2ac161-abc8-4e2e-b964-1a8ea27487c1.jpg</url>
      <title>Forem: Shamanth Shetty</title>
      <link>https://forem.com/shetty_07</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/shetty_07"/>
    <language>en</language>
    <item>
      <title>10x your active learning via active transfer learning in NLP</title>
      <dc:creator>Shamanth Shetty</dc:creator>
      <pubDate>Fri, 08 Jul 2022 10:35:54 +0000</pubDate>
      <link>https://forem.com/meetkern/10x-your-active-learning-via-active-transfer-learning-in-nlp-3ekl</link>
      <guid>https://forem.com/meetkern/10x-your-active-learning-via-active-transfer-learning-in-nlp-3ekl</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Active learning is an excellent concept: you train a model while you label the data for it. This way, you can automatically label records the model predicts with high confidence and route the low-confidence records to manual labeling, which is especially valuable for imbalanced datasets.&lt;/p&gt;
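&lt;p&gt;As a minimal sketch of that selection step (the data, vectorizer, and model below are illustrative stand-ins, not a specific tool’s API):&lt;/p&gt;

```python
# Uncertainty sampling: route the records the model is least sure about
# to manual labeling. Data and model are illustrative stand-ins.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled = ["great product", "awful support", "love it", "terrible quality"]
labels = ["positive", "negative", "positive", "negative"]
unlabeled = ["pretty decent overall", "not sure how I feel", "simply amazing"]

vectorizer = TfidfVectorizer().fit(labeled + unlabeled)
model = LogisticRegression().fit(vectorizer.transform(labeled), labels)

# Confidence = highest class probability per record; low values mean the
# model is unsure, so those records go to manual labeling first.
probas = model.predict_proba(vectorizer.transform(unlabeled))
confidence = probas.max(axis=1)
ranked = sorted(zip(confidence, unlabeled))  # ascending: least confident first
to_label_manually = [text for _, text in ranked[:2]]
```

&lt;p&gt;Sorting by the highest class probability in ascending order surfaces the records the model is least sure about, which is exactly where a manual label adds the most information.&lt;/p&gt;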

&lt;p&gt;In this post, we want to look at how you can get much more out of active learning by combining it with large, pre-trained language models in natural language processing. After reading this blog post, you’ll know the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;How embeddings can help you boost the performance of simple machine learning models&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;How you can integrate active transfer learning into your weak supervision process&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Compute embeddings once, train often
&lt;/h2&gt;

&lt;p&gt;Transformer models are, without a doubt, among the most significant breakthroughs in machine learning of recent years. Capable of understanding the context of sentences, they can be used in multiple scenarios such as similarity scoring, classifications, or other natural language understanding tasks.&lt;/p&gt;

&lt;p&gt;For active transfer learning, they become relevant as encoders. By cutting off the classification head of these networks, transformers can be used to compute vector representations of texts that contain precise semantic information. In other words, encoders can extract structured (but latent, i.e., unnamed) features from unstructured texts.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yl3shniph93enxz8dht.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0yl3shniph93enxz8dht.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Check out our &lt;a href="https://github.com/code-kern-ai/embedders" rel="noopener noreferrer"&gt;embedders&lt;/a&gt; library if you want to build such embeddings using a high-level, Scikit-Learn-like API.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7wo3ap1jm00q12ja3tj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7wo3ap1jm00q12ja3tj.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In another blog post on neural search, we’ll also show how these embeddings can significantly improve the selection of records during manual labeling.&lt;/p&gt;

&lt;p&gt;Why is this so beneficial for active learning? There are two reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You only need to compute embeddings once. Even though creating them might take some time, you will see nearly no negative impact on the training time of your active learning models.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;They substantially boost the performance of your active learning models. You can expect your models to pick up the first significant patterns from as few as 50 records per class, which is terrific for natural language processing.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
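&lt;p&gt;The “compute once, train often” pattern can be sketched as follows; a TF-IDF vectorizer stands in for a transformer encoder here so the example runs without downloading a model:&lt;/p&gt;

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["good", "bad", "nice", "poor", "fine", "weak"]
labels = np.array([1, 0, 1, 0, 1, 0])

# Expensive step: done exactly once, then cached and reused.
# (In practice this would be a transformer encoder, e.g. via the
# embedders library; TF-IDF is only a stand-in.)
embeddings = TfidfVectorizer().fit_transform(texts)

# Cheap step: refit a lightweight model every time new manual labels
# arrive; on cached embeddings this takes milliseconds.
for n_labeled in (2, 4, 6):
    model = LogisticRegression().fit(embeddings[:n_labeled], labels[:n_labeled])

predictions = model.predict(embeddings)
```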

&lt;p&gt;In our open-source software Kern, we split the process into two steps, as shown in the screenshots below. First, you can choose one of many embedding techniques for your data. We download the pre-trained models directly from the Hugging Face Hub. As we use spaCy for tokenization, you can also use those transformers out-of-the-box to compute token-level embeddings (relevant for extraction tasks such as Named Entity Recognition).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcby54kgphxh5jb3slji3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcby54kgphxh5jb3slji3.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Afterward, we provide a scikit-learn-like interface for your models. You can fit and predict your models and intercept and customize the training process to suit your labeling requirements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uw7eahi28ym1kcbvtfy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uw7eahi28ym1kcbvtfy.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Active learning as continuous heuristics
&lt;/h2&gt;

&lt;p&gt;As you might already know, active learning does not have to stand alone. It especially shines when combined with weak supervision, a technique that integrates several heuristics (such as active transfer learning modules) to compute denoised, automated labels. Active learning naturally provides continuous labels, as you can compute the confidence of your predictions. Combined with regular labeling functions and modules such as zero-shot classification (both covered in separate posts), you can create well-defined confidence scores to estimate your label quality. Because of that, active learning is always a great addition to a weak supervision setup.&lt;/p&gt;
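&lt;p&gt;To make the idea concrete, here is a toy sketch of combining heuristic outputs into one weakly supervised label. The hypothetical combine() below does a simple confidence-weighted vote, a strong simplification of the denoising that real weak supervision frameworks perform:&lt;/p&gt;

```python
# Toy confidence-weighted vote over heuristic outputs for one record.
# Each heuristic returns (label, confidence) or None if it abstains.
from collections import defaultdict

def combine(votes):
    scores = defaultdict(float)
    for vote in votes:
        if vote is not None:
            label, confidence = vote
            scores[label] += confidence
    if not scores:
        return None  # every heuristic abstained
    total = sum(scores.values())
    best = max(scores, key=scores.get)
    return best, scores[best] / total

votes = [
    ("clickbait", 0.9),  # active learning module, high confidence
    ("clickbait", 0.6),  # labeling function
    ("regular", 0.4),    # zero-shot module
    None,                # this heuristic abstained on the record
]
label, confidence = combine(votes)
```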

&lt;h2&gt;
  
  
  We’re going open-source, try it out yourself
&lt;/h2&gt;

&lt;p&gt;Ultimately, it is best to just play around with some data yourself, right? We’re soon launching the system we’ve been building for over a year, so feel free to install it locally and play with it. It comes with a rich set of features, such as integrated transformer models, neural search, and flexible labeling tasks.&lt;/p&gt;

&lt;p&gt;Subscribe to our newsletter &lt;a href="https://www.kern.ai/pages/open-source" rel="noopener noreferrer"&gt;here&lt;/a&gt; and stay up to date with the release so you don’t miss out on the chance to win a GeForce RTX 3090 Ti for our launch! :-)&lt;/p&gt;

&lt;p&gt;Check out the YouTube tutorial &lt;a href="https://www.youtube.com/watch?v=VfQapj5TtUQ" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>opensource</category>
      <category>machinelearning</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Active learning tutorial</title>
      <dc:creator>Shamanth Shetty</dc:creator>
      <pubDate>Fri, 08 Jul 2022 09:46:45 +0000</pubDate>
      <link>https://forem.com/meetkern/active-learning-tutorial-4jjb</link>
      <guid>https://forem.com/meetkern/active-learning-tutorial-4jjb</guid>
      <description>&lt;p&gt;Hey there everyone, hope everyone’s having a fantastic Friday 😊 Today we’ll discuss active learning and show you some cool things that can be done using our active learning interface. Active learning is useful especially in situations where there is a lot of data to be labelled, which can be prioritised to label the data in an efficient manner. In our active learning interface, you can apply few-shot learning for heuristics. These are very valuable for weak supervision as you can integrate them with labeling functions or other heuristics. We integrate HuggingFace embeddings so you can use your favourite transformer models for embeddings. They typically require only little training data to get a reasonable precision.&lt;/p&gt;

&lt;p&gt;If you want to learn more about this, we have uploaded a full tutorial on our Youtube channel, check it out here - &lt;a href="https://www.youtube.com/watch?v=VfQapj5TtUQ&amp;amp;ab_channel=KernAI"&gt;https://www.youtube.com/watch?v=VfQapj5TtUQ&amp;amp;ab_channel=KernAI&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subscribe to our newsletter 👉🏼 &lt;a href="https://www.kern.ai/pages/open-source"&gt;https://www.kern.ai/pages/open-source&lt;/a&gt; and stay up to date with the release so you don’t miss out on the chance to win a GeForce RTX 3090 Ti for our launch&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How Kern differentiates from existing labeling environments</title>
      <dc:creator>Shamanth Shetty</dc:creator>
      <pubDate>Mon, 04 Jul 2022 09:28:54 +0000</pubDate>
      <link>https://forem.com/meetkern/how-kern-differentiates-from-existing-labeling-environments-555m</link>
      <guid>https://forem.com/meetkern/how-kern-differentiates-from-existing-labeling-environments-555m</guid>
      <description>&lt;p&gt;If VS code had a data-centric sibling, what would it look like?&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;We’re just about to launch our software as open source, and with that we want to show why we built it.&lt;/p&gt;

&lt;p&gt;After reading this blog, you’ll have gained some new perspectives on building training data (mainly for NLP, but the concepts are largely task-agnostic).&lt;/p&gt;

&lt;h2&gt;
  
  
  Training data will become a software artifact
&lt;/h2&gt;

&lt;p&gt;We’ve talked extensively about software 1.0 and software 2.0 in the last few years. In a nutshell, it is about how AI-backed software ships training data next to the algorithms that make predictions. We love this framing because it establishes that the data is part of the software. This has some profound implications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;There are two ways to build AI software: model-centric and data-centric. You can focus not only on the algorithms that implement the software but also on the data itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Software needs to be documented. If data is part of the software, it can’t be treated poorly. That starts with label definitions, known inter-annotator agreements, versions, and some documentation about how labels have been set (i.e., turning the labeling procedure from a black box into an explainable instrument).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Labeling is not a one-time job. If it is part of the software, you’ll work on it daily. And for that, you need excellent tooling.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re betting that such tasks require a data-centric development environment that makes things like labeling and cleansing data (especially in NLP) much more straightforward. That is why we’ve built Kern AI. In more detail, we constantly have the following two questions in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;What is needed so that developers and data scientists can create something from scratch much faster?&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;What is required so that developers and data scientists can improve features continuously in a maintainable way?&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the following, we’ll go into more detail and showcase some of the features we’ve built for version 1.0, our open-source release.&lt;/p&gt;

&lt;h2&gt;
  
  
  How can I prototype in a data-centric approach?
&lt;/h2&gt;

&lt;p&gt;Picture this: You’ve got a brand new idea for a feature. You want to differentiate clickbait from regular titles in your content moderation. But: you only have unlabeled records and clearly don’t want to spend too much time on data labeling.&lt;/p&gt;

&lt;p&gt;This is where our data-centric prototyping comes into play. It looks as follows:&lt;br&gt;
First, you label some reference data. It can be as little as 50 records, not too much effort.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_bWpOtMR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dq4nkwrpvvotcgbbisc1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_bWpOtMR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dq4nkwrpvvotcgbbisc1.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Second, with our weak supervision core (we’re explaining that technology in more detail in another post), you can build and combine explicit heuristics such as labeling functions as well as active transfer learning and zero-shot modules. Our application integrates with the Hugging Face Hub, so you can easily plug large-scale language models into your heuristics.&lt;/p&gt;

&lt;p&gt;The heuristics are automatically validated against the reference samples you’ve labeled (of course, we make reasonable splits for the active learning, so you don’t have to worry about the applicability of the statistics).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6rjFnAh6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4rca69wtpa4thvl40vm9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6rjFnAh6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4rca69wtpa4thvl40vm9.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Third, you compute weakly supervised labels. Those are continuous, i.e., they have a confidence attached. Doing so typically takes seconds, so you can do it as often as you’d like.&lt;/p&gt;

&lt;p&gt;Now we have the first batch of weakly supervised labels. But the prototyping doesn’t stop here; we can go further. For instance, say we see on the monitoring page that there are too few weakly supervised labels for clickbait data. With our integrated neural search, powered by large-scale language models, we can simply search for records similar to data we’ve already labeled and find new patterns.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--IkIqu6H0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3pwr71odrdnzntl6dwj7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--IkIqu6H0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3pwr71odrdnzntl6dwj7.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alternatively, we can also use that data to find outliers to see where we might face issues later in the model development.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_UmV1x1n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pass4d45h3tghjapj0vp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_UmV1x1n--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pass4d45h3tghjapj0vp.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In general, this gives us a great estimate of our data: where it’s easy to make predictions, and how our data baseline looks. From here, you can build a prototype using scikit-learn and FastAPI within minutes.&lt;/p&gt;

&lt;p&gt;But we don’t want to stop there, right? We want to be able to improve and maintain our AI software. So now comes the question: how can we do so?&lt;/p&gt;

&lt;h2&gt;
  
  
  How can I continuously improve in a data-centric approach?
&lt;/h2&gt;

&lt;p&gt;You’ll quickly find that it is time to improve your model and let it learn some new patterns - and refine existing ones. The good news is that you can just continue your data project from the prototyping phase!&lt;/p&gt;

&lt;p&gt;First, let’s look at how weak supervision comes into play here. It is handy for many reasons. You can apply any kind of heuristic as a filter for your data and thus slice your records accordingly. This is extremely helpful for debugging and documentation in general: your raw records are enriched with the weakly supervised labels and the heuristic data. Use this to find weak spots in your data quality.&lt;/p&gt;

&lt;p&gt;The weakly supervised labels are continuous, i.e., soft labels, and you can sort them according to your needs. If you want to find potential manual labeling errors, sort by label confidence descending and filter for mismatches between manual and weakly supervised labels. Voilà, those are most likely manual labeling errors. Alternatively, sort ascending to find potential weak spots of your weak supervision, helping you debug your data.&lt;/p&gt;
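&lt;p&gt;That error-hunting workflow can be sketched in plain Python (the record fields below are illustrative, not an actual export format):&lt;/p&gt;

```python
# Find disagreements between manual and weakly supervised labels, then
# sort by weak-supervision confidence descending: high-confidence
# mismatches are likely manual labeling errors, low-confidence ones point
# at weak spots in the heuristics.
records = [
    {"text": "WIN NOW!!!", "manual": "regular", "weak": "clickbait", "conf": 0.97},
    {"text": "Budget report 2022", "manual": "regular", "weak": "regular", "conf": 0.91},
    {"text": "You will not believe this", "manual": "clickbait", "weak": "clickbait", "conf": 0.88},
    {"text": "Quarterly earnings call", "manual": "clickbait", "weak": "regular", "conf": 0.35},
]

mismatches = [r for r in records if r["manual"] != r["weak"]]
likely_manual_errors = sorted(mismatches, key=lambda r: r["conf"], reverse=True)
```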

&lt;p&gt;We have also built quality-of-life features that make it easy to extend your heuristics. For instance, if you label spans in your data, we can automatically generate lookup lists that you can use, for example, in labeling functions. This way, you write your labeling function once but can extend it continuously.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--O-vdOylH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6xho2n3q8eaqx7biumct.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--O-vdOylH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/6xho2n3q8eaqx7biumct.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, as you continue to label data manually, your active transfer learning modules will improve over time, making it even easier for you to find potential mislabeled data or weak spots.&lt;/p&gt;

&lt;p&gt;Lastly, the whole application comes with three graphs for quality monitoring. We’ll showcase one, the inter-annotator agreement, which is automatically generated when you label data with multiple annotators. It shows potential disagreement between annotators, which we’ll also integrate into the heuristics. Ultimately, this helps you understand what human performance you can measure against and where potential bias lies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Q9M0vpBI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/58ax1co2im3a73mwue0t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Q9M0vpBI--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/58ax1co2im3a73mwue0t.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ngCO2nCg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ql2c9ncd8njnibfn4kcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ngCO2nCg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ql2c9ncd8njnibfn4kcm.png" alt="Image description" width="880" height="398"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ha3ZhJMZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/euhouevizvzd9isfr4ur.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ha3ZhJMZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/euhouevizvzd9isfr4ur.png" alt="Image description" width="880" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  There is so much more to come
&lt;/h2&gt;

&lt;p&gt;This is our version 1.0. What we built is the result of many months of closed beta, hundreds of discussions with data scientists, and lots of coffee (as usual ;-)). We’re super excited and can already tell you that there is much more to come. Upcoming features include, for instance, feature programming, extensive prompt engineering in the zero-shot modules, and no-code templates for recurring labeling functions. Stay tuned, and let us know what you think of the application. We couldn’t be more excited!&lt;/p&gt;

&lt;h2&gt;
  
  
  We’re going open-source, try it out yourself
&lt;/h2&gt;

&lt;p&gt;Ultimately, it is best to just play around with some data yourself, right? We’re soon launching the system we’ve been building for over a year, so feel free to install it locally and play with it. It comes with a rich set of features, such as integrated transformer models, neural search, and flexible labeling tasks.&lt;/p&gt;

&lt;p&gt;Subscribe to our newsletter 👉🏼 &lt;a href="https://www.kern.ai/pages/open-source"&gt;https://www.kern.ai/pages/open-source&lt;/a&gt; and stay up to date with the release so you don’t miss out on the chance to win a GeForce RTX 3090 Ti for our launch! :-)&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>datascience</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How to engineer prompts for Zero-Shot modules</title>
      <dc:creator>Shamanth Shetty</dc:creator>
      <pubDate>Thu, 30 Jun 2022 12:58:14 +0000</pubDate>
      <link>https://forem.com/meetkern/how-to-engineer-prompts-for-zero-shot-modules-4l5i</link>
      <guid>https://forem.com/meetkern/how-to-engineer-prompts-for-zero-shot-modules-4l5i</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Zero-shot classification requires only some label names to estimate the topic of a document. For instance:&lt;/p&gt;

&lt;p&gt;I really love football and am a big fan of Liverpool!&lt;br&gt;
Possible topics: [sports, economics, science]&lt;br&gt;
-&amp;gt; Sports (83.52%)&lt;/p&gt;

&lt;p&gt;Now, how does that work? Let’s look under the hood and understand why prompt engineering is so interesting.&lt;/p&gt;

&lt;h2&gt;
  
  
  Using context to predict classes
&lt;/h2&gt;

&lt;p&gt;These models work using context. Let’s say we want to predict the sentiment of a sentence: “The sky is so beautiful”. If we now have the following sentence, “That is [MASK]”, we can try to predict the token for [MASK], i.e., we have a mask prediction. We now provide two options to fill that mask: “positive” and “negative”. If we look at the token likelihood for that prediction, we’ll see that the model picks “positive” due to a higher probability of the chained sentence “The sky is so beautiful. That is positive.”.&lt;/p&gt;
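&lt;p&gt;Here is a toy illustration of those mechanics. The score() function below is a crude stand-in for a real language model’s token likelihood (it just counts sentiment cue words), so only the template mechanics of filling the mask with each candidate label are meant to be realistic:&lt;/p&gt;

```python
# Toy zero-shot classifier: fill the hypothesis template's [MASK] with
# each candidate label and keep the most likely completion. score() is a
# stand-in for a language model's token likelihood.
POSITIVE_CUES = {"beautiful", "great", "lovely"}
NEGATIVE_CUES = {"ugly", "awful", "terrible"}

def score(text, label):
    words = set(text.lower().replace(".", "").split())
    cues = POSITIVE_CUES if label == "positive" else NEGATIVE_CUES
    return len(words.intersection(cues))

def zero_shot(text, template, candidates):
    # Fill the mask with each candidate and pick the best-scoring one.
    scored = [
        (score(text + " " + template.replace("[MASK]", c), c), c)
        for c in candidates
    ]
    return max(scored)[1]

prediction = zero_shot(
    "The sky is so beautiful.", "That is [MASK].", ["positive", "negative"]
)
```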

&lt;p&gt;We often refer to the masked sentence as the hypothesis template. If we change it, we end up with different predictions. They can be very generic but also specific to the task at hand.&lt;/p&gt;

&lt;p&gt;Now, how can we make use of that knowledge of how zero-shot modules work? We can try to enrich the context with as much valuable metadata as possible. For instance, let’s say we have not only label names but also label descriptions. If we want to zero-shot classify hate speech, it is relevant for a classifier to know that hate speech consists of toxic and offensive comments. What could this look like? We add that data to the hypothesis template: “Hate Speech is when people write toxic and offensive comments. The previous paragraph is [MASK]”.&lt;/p&gt;

&lt;p&gt;What’s really interesting is that the order of your hypothesis template actually matters. For instance, you could also split the hypothesis template and put the label description on the front of your context, which might improve the accuracy of your model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning without changing parameters
&lt;/h2&gt;

&lt;p&gt;If you think about it, we can turn zero-shot classifiers into few-shot classifiers without fine-tuning the actual model. That is, we don’t have to change any parameters of a given model. Instead, if we have some labeled data, we can just add it to the context. Let’s say we have some examples for positive sentences. We can add to the hypothesis template, “I ate some pizza yesterday, which was delicious. That was great. I saw the movie Sharknado. That was terrible.”&lt;/p&gt;

&lt;p&gt;Of course, we don’t want to add enormous amounts of examples to the context, as at some point, it would result in too long runtimes. Additionally, zero-shot classifiers usually hit plateaus in learning quite fast, so regular active transfer learning is a better choice in these cases. Still, adding a few samples to your context, turning a zero-shot model into a few-shot, can significantly improve its performance.&lt;/p&gt;
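&lt;p&gt;Building such a few-shot context is just string assembly; here is a sketch (the function and demonstration format are illustrative, not a specific library’s API):&lt;/p&gt;

```python
# Turn a zero-shot prompt into a few-shot one by prepending labeled
# examples to the context. No model parameters change; only the input
# text grows, which is why context length drives inference runtime.
def build_few_shot_context(examples, text, hypothesis):
    demonstrations = " ".join(
        f"{sentence} That was {label}." for sentence, label in examples
    )
    return f"{demonstrations} {text} {hypothesis}"

examples = [
    ("I ate some pizza yesterday, which was delicious.", "great"),
    ("I saw the movie Sharknado.", "terrible"),
]
context = build_few_shot_context(
    examples, "The sky is so beautiful.", "That is [MASK]."
)
```

&lt;p&gt;Keeping the list of examples small keeps inference fast, which is exactly the trade-off described above.&lt;/p&gt;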

&lt;p&gt;In our soon-to-be-released open-source software Kern, we’re integrating zero-shot classifiers as heuristics to apply weak supervision (we’ve also covered what that is, so check it out in our blog). This way, you can make perfect use of these models without worrying about them hitting plateaus or taking too long during inference.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qRDxTItn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9mhrvzfvfxbnfmh2y5mu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qRDxTItn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/9mhrvzfvfxbnfmh2y5mu.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  We’re going open-source, try it out yourself
&lt;/h2&gt;

&lt;p&gt;Ultimately, it is best to just play around with some data yourself, right? We’re soon launching the system we’ve been building for over a year, so feel free to install it locally and play with it. It has a rich set of features, such as integrated transformer models, neural search, and flexible labeling tasks.&lt;/p&gt;

&lt;p&gt;Check out our website for more information 👉🏼 &lt;a href="https://www.kern.ai/"&gt;https://www.kern.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subscribe to our newsletter 👉🏼 &lt;a href="https://www.kern.ai/pages/open-source"&gt;https://www.kern.ai/pages/open-source&lt;/a&gt; and stay up to date with the release so you don’t miss out on the chance to win a GeForce RTX 3090 Ti for our launch 😉&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>opensource</category>
      <category>showdev</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Play around with the data of each record in a Python IDE</title>
      <dc:creator>Shamanth Shetty</dc:creator>
      <pubDate>Wed, 29 Jun 2022 13:09:25 +0000</pubDate>
      <link>https://forem.com/meetkern/play-around-with-the-data-of-each-record-in-a-python-ide-2c9g</link>
      <guid>https://forem.com/meetkern/play-around-with-the-data-of-each-record-in-a-python-ide-2c9g</guid>
      <description>&lt;p&gt;Hey there once again 😊 We would love to address a cool feature today for all the data science enthusiasts who like to experiment and gain insights from their data. In addition to labeling records with kern, you can explore the data of each record in a Python IDE. The record is pre-defined, and comes with spaCy integration making it easy for customization as per the users needs. Our vision at kern is to provide users with an easy interface and interesting features suited for experimental or research work at rapid speeds. This completely changes the nature of the task which will enable our users to build the best labeling functions.&lt;/p&gt;

&lt;p&gt;Check out our website for more information 👉🏼&lt;a href="https://www.kern.ai/"&gt;https://www.kern.ai/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subscribe to our newsletter 👉🏼 &lt;a href="https://www.kern.ai/pages/open-source"&gt;https://www.kern.ai/pages/open-source&lt;/a&gt; and stay up to date with the release so you don’t miss out on the chance to win a GeForce RTX 3090 Ti for our launch 😉&lt;/p&gt;

</description>
      <category>python</category>
      <category>productivity</category>
      <category>opensource</category>
      <category>showdev</category>
    </item>
    <item>
      <title>6 (+1) types of heuristics to automate your labeling process</title>
      <dc:creator>Shamanth Shetty</dc:creator>
      <pubDate>Tue, 28 Jun 2022 10:10:28 +0000</pubDate>
      <link>https://forem.com/meetkern/6-1-types-of-heuristics-to-automate-your-labeling-process-323d</link>
      <guid>https://forem.com/meetkern/6-1-types-of-heuristics-to-automate-your-labeling-process-323d</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Before getting started, here is what you’ll learn in this post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What labeling functions are and how you can implement them&lt;/li&gt;
&lt;li&gt;How to improve your weak supervision using active learning and zero-shot modules&lt;/li&gt;
&lt;li&gt;How to use lookup lists to easily manage keyword- and regex-functions&lt;/li&gt;
&lt;li&gt;What other options you have to automate your labeling using existing resources&lt;/li&gt;
&lt;li&gt;If you want to dive deeper into weak supervision, check out our other blog post 👉🏼 &lt;a href="https://www.kern.ai/post/automated-and-denoised-labels-for-nlp-with-weak-supervision"&gt;https://www.kern.ai/post/automated-and-denoised-labels-for-nlp-with-weak-supervision&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The basic heuristic
&lt;/h2&gt;

&lt;p&gt;We’ll start with the basics of weak supervision: labeling functions. Essentially, these are noisy functions that generate some label signal. They can range from super simple keyword/regex functions to complex formulas encoding deep domain knowledge. For instance, imagine the following function to detect positive sentiment in some text:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;from typing import Any, Dict

def lkp_positive(record: Dict[str, Any]) -&amp;gt; str:
    my_list = [":-)", "awesome", ...]
    for term in my_list:
        if term.lower() in record["text"].lower():
            return "positive"
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Their most significant advantages are explainability and ease of maintenance, and they are straightforward to implement. With our automated heuristic validation, you can quickly build labeling functions, grow their coverage, and estimate their performance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--eE_LFtc3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w6gcpxq8os0wa1oesncd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--eE_LFtc3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/w6gcpxq8os0wa1oesncd.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning heuristics from few reference training data
&lt;/h2&gt;

&lt;p&gt;Of course, we’re not limited to labeling functions only. Especially with the availability of large-scale pre-trained language models such as those from Hugging Face, you can integrate active transfer learning into the weak supervision methodology. You typically need to label only ~50 records per class to get promising initial results.&lt;/p&gt;

&lt;p&gt;In active transfer learning, you compute the embeddings (i.e., vector representations) of a text once and train lightweight models like logistic regression or conditional random fields on top of them. This works for both classification and extraction tasks like named entity recognition.&lt;/p&gt;
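&lt;p&gt;As a rough illustration of the idea (not our actual implementation), here is a pure-Python sketch: embeddings are computed once, and a lightweight model is fit on the few labeled records. A nearest-centroid classifier stands in for logistic regression, and all vectors and labels are made up for the example:&lt;/p&gt;

```python
# Toy sketch of active transfer learning: embeddings are computed once,
# then a lightweight model (here a nearest-centroid classifier standing in
# for logistic regression) is trained on the few labeled records.
# All vectors and labels below are made-up illustration data.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    norm = lambda w: dot(w, w) ** 0.5
    return dot(u, v) / (norm(u) * norm(v))

def centroid(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def fit(labeled):
    """labeled: list of (embedding, label) pairs; returns label centroids."""
    by_label = {}
    for vec, label in labeled:
        by_label.setdefault(label, []).append(vec)
    return {label: centroid(vecs) for label, vecs in by_label.items()}

def predict(model, vec):
    # Pick the label whose centroid is most similar to the embedding.
    return max(model, key=lambda label: cosine(model[label], vec))

labeled = [
    ([0.9, 0.1], "positive"),
    ([0.8, 0.2], "positive"),
    ([0.1, 0.9], "negative"),
]
model = fit(labeled)
print(predict(model, [0.7, 0.3]))  # closest to the "positive" centroid
```

&lt;p&gt;In practice, the embeddings would come from a pre-trained language model, and the lightweight model can be retrained cheaply every time new labels arrive.&lt;/p&gt;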

&lt;p&gt;kern.ai comes with a very easy-to-use interface: within three clicks, you can select your language model and build models in a Python IDE with a scikit-learn-like API. The IDE also comes with several training options, so you can, for instance, filter specific classes or set a minimum confidence level. Applying active transfer learning to your task is always a great idea.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DtNAAwLs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iv6uxfvx7hc6up94j0d2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DtNAAwLs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/iv6uxfvx7hc6up94j0d2.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To learn more about active transfer learning, check out our blog post 👉🏼 &lt;a href="https://www.kern.ai/post/6-1-types-of-heuristics-to-automate-your-labeling-process"&gt;https://www.kern.ai/post/6-1-types-of-heuristics-to-automate-your-labeling-process&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Learning from label definitions
&lt;/h2&gt;

&lt;p&gt;There are also options in the field of zero-shot learning that you can apply as heuristics. In zero-shot scenarios, you typically try to extract any information from high-level metadata, such as the label names and potential descriptions, to provide a pre-trained model with some context to infer predictions.&lt;/p&gt;

&lt;p&gt;Even though this field is still under active research (see also the term “Prompt Engineering”), it is already beneficial in practice, and you can use zero-shot models as great off-the-shelf heuristics for your project. For instance, if you have some news headlines for which you want to apply topic modeling, a zero-shot model can yield promising results without any fine-tuning. However, keep in mind that such models are enormous and computationally expensive, so it typically takes some time to compute their results; they are definitely much slower than active transfer learning or labeling functions.&lt;/p&gt;
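&lt;p&gt;To make the zero-shot idea concrete, here is a deliberately tiny sketch in which the only "supervision" is the label names and hand-written descriptions, matched against the text by word overlap. A real zero-shot model (e.g. an NLI transformer) is far more capable; the labels, descriptions, and headline below are illustration data:&lt;/p&gt;

```python
# Toy illustration of the zero-shot idea: the only "supervision" is the
# label names and short descriptions, which are matched against the text.
# A real zero-shot model generalizes far beyond literal word overlap.

LABEL_DESCRIPTIONS = {
    "sports": "game match team player score win league",
    "economy": "market stock inflation bank trade economy price",
}

def zero_shot_classify(text, label_descriptions):
    words = set(text.lower().split())
    def overlap(label):
        return len(words.intersection(label_descriptions[label].split()))
    # Pick the label whose description shares the most words with the text.
    return max(label_descriptions, key=overlap)

headline = "Central bank signals rate hike as inflation climbs"
print(zero_shot_classify(headline, LABEL_DESCRIPTIONS))  # economy
```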

&lt;p&gt;Our software currently provides a simple zero-shot interface for classifications. We don’t integrate prompt engineering yet but will do so in the future.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1PG4mJiH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j46odsgui6xsg5x5rcej.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1PG4mJiH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/j46odsgui6xsg5x5rcej.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3rd party applications and legacy systems
&lt;/h2&gt;

&lt;p&gt;Another valid option is external heuristics, such as other applications in a similar domain. Let's say we want to build a sentiment analysis for financial data - a sentiment analysis from a provider like Symanto is a tremendous external heuristic that lets you quickly tap into the logic of their systems.&lt;/p&gt;
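&lt;p&gt;A sketch of what wrapping such a provider as a heuristic could look like. The client object and its sentiment method are hypothetical stand-ins for a real SDK, injected so any provider can be swapped in behind the same small interface:&lt;/p&gt;

```python
# Sketch of wrapping an external service as a weak supervision heuristic.
# `client` is a hypothetical object exposing `sentiment(text)`; any real
# provider SDK could be adapted behind this small interface.

def external_sentiment_heuristic(record, client):
    """Map a third-party sentiment score onto our label set."""
    score = client.sentiment(record["text"])  # assumed to return -1.0..1.0
    labels = {1: "positive", 0: "neutral", -1: "negative"}
    return labels[round(score)]

class FakeSentimentClient:
    # Stand-in for a real API client, used here only for illustration.
    def sentiment(self, text):
        return 0.9 if "great" in text.lower() else -0.8

record = {"text": "Great quarterly earnings for the bank"}
print(external_sentiment_heuristic(record, FakeSentimentClient()))
```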

&lt;p&gt;We're extending our API to enable you to integrate any heuristic you like. Don't forget to subscribe to our newsletter here (&lt;a href="https://www.kern.ai/pages/open-source"&gt;https://www.kern.ai/pages/open-source&lt;/a&gt;) to stay up to date with the release, so you don't miss out on the chance to win a GeForce RTX 3090 Ti for our launch!&lt;/p&gt;

&lt;h2&gt;
  
  
  Manual labeling as a heuristic
&lt;/h2&gt;

&lt;p&gt;Maybe now you’re wondering whether to include crowd workers or human annotators in general - well, of course, you can! The great thing about weak supervision is its generic interface, through which you can integrate anything as a heuristic that fits that interface.&lt;/p&gt;

&lt;p&gt;Manual labeling is especially useful if you want to label some critical slices. In Kern, you can easily filter your records to create slices, which you can then label manually.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jgFmlE1G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k5pkez6iodqpd1tg7zbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jgFmlE1G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/k5pkez6iodqpd1tg7zbu.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Their integration will also be enabled via API, so there will be no difference between integrating 3rd party apps and integrating manual labeling as a heuristic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lookup functions
&lt;/h2&gt;

&lt;p&gt;For labeling functions, you often implement keyword- or regex-based pattern matching. Those functions are super helpful, as they are easy to implement and validate. Nevertheless, maintaining them can become quite tedious, as you don’t want to constantly append to some list inside a labeling function. Because of this, we created automated filling of lookup lists (also called knowledge bases) in your projects based on entity labeling.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5JLaZplg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p5uxb2n4g059rmcvl4vd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5JLaZplg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/p5uxb2n4g059rmcvl4vd.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Those lists are super easy to create and are linked to Python variables. Once created and filled, you can integrate them into your functions as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sROPoBvm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fuq6ifang5iyoqd5mgyu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sROPoBvm--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/fuq6ifang5iyoqd5mgyu.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;
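&lt;p&gt;In spirit, such a lookup-backed function might look like the sketch below, which mirrors the earlier keyword function; here, positive_terms stands in for the variable a lookup list would be linked to and is filled by hand for illustration:&lt;/p&gt;

```python
# Sketch of a labeling function backed by a lookup list. In the tool, the
# list is maintained in the UI and exposed as a Python variable; here
# `positive_terms` is a hand-filled stand-in.

positive_terms = [":-)", "awesome", "great", "love"]

def lkp_positive(record):
    text = record["text"].lower()
    for term in positive_terms:
        if term.lower() in text:
            return "positive"

print(lkp_positive({"text": "This product is awesome"}))
```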

&lt;p&gt;Afterward, you don’t have to touch any other code in this function to maintain your heuristic. Instead, you can work on the function only via the lookup list. That’s easy, isn’t it?&lt;/p&gt;

&lt;h2&gt;
  
  
  Templates for labeling functions
&lt;/h2&gt;

&lt;p&gt;As a last additional type, we want to show you some cool recurring labeling functions. They are available in our templates repository, so feel free to check them out - and if you want to add your own, please let us know. You can create an issue or fork the repository anytime.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--994CiV3w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2cfrm3fljhr1nyps2tn3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--994CiV3w--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2cfrm3fljhr1nyps2tn3.png" alt="Image description" width="880" height="590"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For classifications, you can often integrate libraries like spaCy to use pre-computed metadata. For instance, you can check the grammar or named entities of sentences and label your data accordingly. Another typical option is to use TextBlob for sentiment analysis:&lt;/p&gt;

&lt;p&gt;from textblob import TextBlob&lt;br&gt;
def textblob_sentiment(record):&lt;br&gt;
  if TextBlob(record["mail"].text).sentiment.polarity &amp;lt; -0.5:&lt;br&gt;
    return "spam"&lt;/p&gt;

&lt;p&gt;If you’re building labeling functions for extraction tasks, you need to adapt the return statement a bit. In general, entities can occur multiple times within a text, so you yield instead of return. Also, you need to provide the start and end of each span. Similar to the function above, you could use TextBlob in an extraction function to find positive and negative aspects in a review like this:&lt;/p&gt;

&lt;p&gt;from textblob import TextBlob&lt;br&gt;
def aspect_matcher(record):&lt;br&gt;
    window = 4  # choose any window size here&lt;br&gt;
    for chunk in record["details"].noun_chunks:&lt;br&gt;
        left_bound = max(chunk.sent.start, chunk.start - (window // 2) + 1)&lt;br&gt;
        right_bound = min(chunk.sent.end, chunk.end + (window // 2) + 1)&lt;br&gt;
        window_doc = record["details"][left_bound:right_bound]&lt;br&gt;
        sentiment = TextBlob(window_doc.text).polarity&lt;br&gt;
        if sentiment &amp;lt; -0.5:&lt;br&gt;
            yield "negative", chunk.start, chunk.end&lt;br&gt;
        elif sentiment &amp;gt; 0.5:&lt;br&gt;
            yield "positive", chunk.start, chunk.end&lt;/p&gt;

&lt;h2&gt;
  
  
  We’re going open-source, try it out yourself
&lt;/h2&gt;

&lt;p&gt;Ultimately, it is best to just play around with some data yourself, right? Well, we’re soon launching the system we’ve been building for more than a year, so feel free to install it locally and play around with it. It comes with a rich set of features like integrated transformer models, neural search, and flexible labeling tasks.&lt;/p&gt;

&lt;p&gt;Subscribe to our newsletter 👉🏼 &lt;a href="https://www.kern.ai/pages/open-source"&gt;https://www.kern.ai/pages/open-source&lt;/a&gt;, stay up to date with the release so you don’t miss out on the chance to win a GeForce RTX 3090 Ti for our launch! :-)&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>datascience</category>
      <category>python</category>
      <category>opensource</category>
    </item>
    <item>
      <title>AI-assisted newsletter dashboard</title>
      <dc:creator>Shamanth Shetty</dc:creator>
      <pubDate>Mon, 27 Jun 2022 13:58:08 +0000</pubDate>
      <link>https://forem.com/meetkern/ai-assisted-newsletter-dashboard-b89</link>
      <guid>https://forem.com/meetkern/ai-assisted-newsletter-dashboard-b89</guid>
<description>&lt;p&gt;Hey there everyone, hope you’re having a great day 😊 We would like to share a few things about "AI-assisted newsletter dashboard", the workshop we hosted at the datalift summit last week. We used a Streamlit frontend with a minimal FastAPI backend, which is not only really fun to program but also ridiculously fast to set up. The data was collected over a couple of months, during which we settled on three newsletters that were really interesting. BERT embeddings power the recommendations and similarity search, and they are also the foundation for the application's topic classification.&lt;/p&gt;

&lt;p&gt;The best part? It is Python only, and it's all on GitHub ready for you to explore, so check it out! There are also tons of improvements to be made; we left a little inspiration in the repo for you to start with right away. Let’s start collaborating to build better AI solutions together ✌🏻&lt;/p&gt;

&lt;p&gt;Link to our GitHub 👉🏼 &lt;a href="https://github.com/code-kern-ai/datalift-summit"&gt;https://github.com/code-kern-ai/datalift-summit&lt;/a&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>opensource</category>
      <category>news</category>
      <category>github</category>
    </item>
    <item>
      <title>Integrate embeddings easily</title>
      <dc:creator>Shamanth Shetty</dc:creator>
      <pubDate>Fri, 24 Jun 2022 12:23:47 +0000</pubDate>
      <link>https://forem.com/meetkern/integrate-embeddings-easily-5ie</link>
      <guid>https://forem.com/meetkern/integrate-embeddings-easily-5ie</guid>
<description>&lt;p&gt;Hey there, hope everyone’s having a great day 😄 Today we’ll jump right into text embeddings and the integrations you can leverage to build high-quality AI models. With kern, it is possible to build any kind of text embedding with just three clicks! Not just that, it integrates with the Hugging Face transformer hub, so you can pick your favourite language model for vectorization. Now imagine if this could be automatically integrated with your active transfer learning and neural search. Yes, you read that right: this is possible with our soon-to-be-released open-source data labeling solution! Jump right into the comments section with any questions regarding the software 😉&lt;/p&gt;
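&lt;p&gt;To illustrate what "vectorization" means, here is a toy bag-of-words embedder in plain Python. Real transformer embeddings from the Hugging Face hub capture meaning far better; this only shows the text-to-vector idea, and the example texts are made up:&lt;/p&gt;

```python
# Toy bag-of-words vectorizer illustrating what an embedding is: a
# fixed-length numeric vector per text. Transformer embeddings capture
# meaning far better; this only demonstrates the idea.

def build_vocabulary(texts):
    vocab = sorted({word for text in texts for word in text.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def embed(text, vocab):
    # Count how often each known vocabulary word occurs in the text.
    vector = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vector[vocab[word]] += 1
    return vector

texts = ["the cat sat", "the dog ran"]
vocab = build_vocabulary(texts)
print(embed("the cat ran", vocab))
```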

&lt;p&gt;Check out our website for more information 👉🏼&lt;a href="https://www.kern.ai/"&gt;https://www.kern.ai/&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;Subscribe to our newsletter 👉🏼 &lt;a href="https://www.kern.ai/pages/open-source"&gt;https://www.kern.ai/pages/open-source&lt;/a&gt; and stay up to date with the release so you don’t miss out on the chance to win a GeForce RTX 3090 Ti for our launch 😉&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>opensource</category>
      <category>showdev</category>
      <category>datascience</category>
    </item>
    <item>
      <title>AutoML-docker has been launched</title>
      <dc:creator>Shamanth Shetty</dc:creator>
      <pubDate>Thu, 23 Jun 2022 14:40:38 +0000</pubDate>
      <link>https://forem.com/meetkern/automl-docker-has-been-launched-1c1b</link>
      <guid>https://forem.com/meetkern/automl-docker-has-been-launched-1c1b</guid>
<description>&lt;p&gt;Today, we are launching a brand new and completely free open-source tool in our lineup: AutoML-docker by kern. With this CLI-based tool, you can easily use your data to build containerized natural language classifiers in no time. Our mission at kern is to make the world of data and AI more straightforward and accessible. To aid you in your quest of building value from data, we’ve been working on an easy-to-use tool to create natural language classifiers automatically. The possibilities for building high-quality AI solutions for our users are endless. AutoML-docker is open to everyone, but we designed it to be especially useful for people new to machine learning, data, and AI. You can also check out our GitHub repository or our YouTube channel for a tutorial, where we explain everything step-by-step.&lt;/p&gt;

&lt;p&gt;To learn more about our autoML tool check out our GitHub 👉🏼 &lt;a href="https://github.com/code-kern-ai/automl-docker"&gt;https://github.com/code-kern-ai/automl-docker&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Link to the YouTube tutorial 👉🏼 &lt;a href="https://www.youtube.com/watch?v=IUFyCYE6cbc"&gt;https://www.youtube.com/watch?v=IUFyCYE6cbc&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subscribe to our newsletter 👉🏼&lt;a href="https://www.kern.ai/pages/open-source"&gt;https://www.kern.ai/pages/open-source&lt;/a&gt; and stay up to date with the release so you don’t miss out on the chance to win a GeForce RTX 3090 Ti for our launch 😊&lt;/p&gt;

</description>
      <category>programming</category>
      <category>tutorial</category>
      <category>opensource</category>
      <category>github</category>
    </item>
    <item>
      <title>Tokenization of text with spaCy tokenizer</title>
      <dc:creator>Shamanth Shetty</dc:creator>
      <pubDate>Tue, 21 Jun 2022 11:49:04 +0000</pubDate>
      <link>https://forem.com/meetkern/tokenization-of-text-with-spacy-tokenizer-g6m</link>
      <guid>https://forem.com/meetkern/tokenization-of-text-with-spacy-tokenizer-g6m</guid>
<description>&lt;p&gt;In the latest version of our software, text now automatically gets tokenized with the spaCy tokenizer of your choice, saving time and energy with added features exclusively for our users. This not only helps you build better labeling functions via pre-integrated metadata, but also allows you to easily label data manually. Kern sparks innovation with an easy navigation system and features that help you manage labeling tasks rapidly in-house, drastically reducing the time, money, and support needed to deliver high-quality AI solutions in a hassle-free way.&lt;/p&gt;
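&lt;p&gt;To give a feel for what tokenization produces, here is a minimal regex-based sketch that keeps token texts and character offsets, which is exactly the kind of metadata labeling functions and span labeling build on. spaCy's tokenizers are of course far more sophisticated; this is only an illustration:&lt;/p&gt;

```python
# Minimal illustration of tokenization: split text into word and
# punctuation tokens, keeping character offsets. spaCy additionally
# attaches rich linguistic metadata to every token.
import re

def tokenize(text):
    # One token per word (\w+) or per single punctuation mark.
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+|[^\w\s]", text)]

tokens = tokenize("Great product, isn't it?")
print(tokens)
```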

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ybEWq0Z---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nzeee5h4lsrkllihupjl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ybEWq0Z---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nzeee5h4lsrkllihupjl.png" alt="Image description" width="880" height="264"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subscribe to our newsletter 👉🏼 &lt;a href="https://www.kern.ai/pages/open-source"&gt;https://www.kern.ai/pages/open-source&lt;/a&gt; and stay up to date with the release so you don’t miss out on the chance to win a GeForce RTX 3090 Ti for our launch 😉&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>opensource</category>
      <category>showdev</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Applying complex filters to match your data</title>
      <dc:creator>Shamanth Shetty</dc:creator>
      <pubDate>Mon, 20 Jun 2022 14:41:17 +0000</pubDate>
      <link>https://forem.com/meetkern/applying-complex-filters-to-match-your-data-if5</link>
      <guid>https://forem.com/meetkern/applying-complex-filters-to-match-your-data-if5</guid>
<description>&lt;p&gt;&lt;a href="http://kern.ai"&gt;kern.ai&lt;/a&gt; makes complex things easier. I’m sure you are wondering how this is possible? With kern, it’s possible to apply complex filters that fit your data. Another interesting thing you can do is utilize the metadata generated by your heuristics (such as labeling functions) to easily filter your data. Our software makes it easier and faster to analyze interesting sets of data matching your needs. Once the filters have been finalized, they can be saved for later analysis on the project analytics page. The left-hand side consists of extensive filtering options that allow you to slice the data according to your needs. You can also order your results, for example by weak supervision confidence, to refine the records where the label model is still struggling. The right-hand side then displays the records that match the applied filters you can see above. Attribute filters are highlighted for an easy UI experience.&lt;/p&gt;
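&lt;p&gt;As a tiny illustration of the "order by confidence" workflow, the sketch below surfaces the records the label model is least sure about, so they can be reviewed first. The records and confidence values are made up for the example:&lt;/p&gt;

```python
# Toy version of ordering records by weak supervision confidence:
# surface the records the label model is least sure about.
# Records and confidence values are illustration data.

records = [
    {"text": "clear spam", "ws_confidence": 0.97},
    {"text": "borderline offer", "ws_confidence": 0.41},
    {"text": "friendly reminder", "ws_confidence": 0.88},
    {"text": "odd phrasing", "ws_confidence": 0.35},
]

def least_confident(records, k):
    """Return the k records with the lowest weak supervision confidence."""
    return sorted(records, key=lambda r: r["ws_confidence"])[:k]

for record in least_confident(records, 2):
    print(record["text"], record["ws_confidence"])
```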

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gViH0eZc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dgto1eb7t7uavawn3a4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gViH0eZc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/dgto1eb7t7uavawn3a4m.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subscribe to our newsletter 👉🏼 &lt;a href="https://www.kern.ai/pages/open-source"&gt;https://www.kern.ai/pages/open-source&lt;/a&gt; and stay up to date with the release so you don’t miss out on the chance to win a GeForce RTX 3090 Ti for our launch 😉&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>opensource</category>
      <category>github</category>
      <category>news</category>
    </item>
    <item>
      <title>Applying neural search while labelling your data</title>
      <dc:creator>Shamanth Shetty</dc:creator>
      <pubDate>Fri, 17 Jun 2022 11:37:55 +0000</pubDate>
      <link>https://forem.com/meetkern/applying-neural-search-while-labelling-your-data-1d4g</link>
      <guid>https://forem.com/meetkern/applying-neural-search-while-labelling-your-data-1d4g</guid>
<description>&lt;p&gt;Hey there everyone! Today we’ll be talking about applying neural search while labeling your data. When neural search is activated, it enables you to find the closest texts based on transformer embeddings. This helps users quickly identify common patterns and understand their data better. We believe in a platform where users have a high degree of freedom when it comes to customizing their data. Users can now apply any embedding without limitations, which offers maximum customizability.&lt;/p&gt;
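&lt;p&gt;Conceptually, neural search ranks stored records by the similarity of their embeddings to a query embedding. Here is a tiny sketch with hand-made vectors standing in for real transformer embeddings; the corpus and vectors are invented for the example:&lt;/p&gt;

```python
# Sketch of neural search: rank stored records by cosine similarity of
# their embeddings to a query embedding. The tiny vectors below are
# made-up stand-ins for real transformer embeddings.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    def norm(w):
        return sum(x * x for x in w) ** 0.5
    return dot / (norm(u) * norm(v))

corpus = {
    "refund request": [0.9, 0.1, 0.0],
    "shipping delay": [0.1, 0.9, 0.1],
    "login problem": [0.0, 0.2, 0.9],
}

def neural_search(query_vec, corpus, k):
    # Sort texts by similarity to the query, most similar first.
    ranked = sorted(corpus, key=lambda text: cosine(corpus[text], query_vec),
                    reverse=True)
    return ranked[:k]

print(neural_search([0.8, 0.2, 0.1], corpus, 2))
```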

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tQ-HFRpG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ar6ymily6zxnzhj7eo8u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tQ-HFRpG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ar6ymily6zxnzhj7eo8u.png" alt="Image description" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Subscribe to our newsletter 👉🏼 &lt;a href="https://www.kern.ai/pages/open-source"&gt;kern.ai&lt;/a&gt; and stay up to date with the release so you don’t miss out on the chance to win a GeForce RTX 3090 Ti for our launch!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>showdev</category>
      <category>datascience</category>
      <category>design</category>
    </item>
  </channel>
</rss>
