<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Johannes Hötter</title>
    <description>The latest articles on Forem by Johannes Hötter (@jhoetter).</description>
    <link>https://forem.com/jhoetter</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F868806%2Fc0bfd936-75b6-4f66-87ef-29634a6f563a.jpeg</url>
      <title>Forem: Johannes Hötter</title>
      <link>https://forem.com/jhoetter</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jhoetter"/>
    <language>en</language>
    <item>
      <title>Twitter Issues are a mess!!</title>
      <dc:creator>Johannes Hötter</dc:creator>
      <pubDate>Sat, 01 Apr 2023 12:50:49 +0000</pubDate>
      <link>https://forem.com/meetkern/twitter-issues-are-a-mess-37ea</link>
      <guid>https://forem.com/meetkern/twitter-issues-are-a-mess-37ea</guid>
      <description>&lt;p&gt;Ok, you all most likely heard it. Twitter went open-source. That's amazing. Curious as I am, I wanted to dive into their &lt;a href="https://github.com/twitter/the-algorithm" rel="noopener noreferrer"&gt;repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When looking into their issues list, I was laughing out loud. Check this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffauqrbxiavcojqyyp27n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffauqrbxiavcojqyyp27n.png" alt="Funny issues"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;GitHub users are making fun of the whole release, turning the issues list into a joke section.&lt;/p&gt;

&lt;p&gt;As an engineer on Twitter's dev team, however, I would be really annoyed: separating troll issues from genuine ones is now a new to-do on their list. So let's try to help them. I'm going to show a first, very simple version of a classifier for identifying troll issues in the Twitter repo. Of course, I'm sharing the work on GitHub as well. Here's the &lt;a href="https://github.com/code-kern-ai/twitter-issues-classifier" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting the data
&lt;/h2&gt;

&lt;p&gt;I scraped the issues with a simple Python script, which I've also shared in the repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;PAT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add-your-PAT-here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="c1"&gt;# see https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/creating-a-personal-access-token
&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;twitter&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 
&lt;span class="n"&gt;repo&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;the-algorithm&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; 

&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.github.com/repos/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;owner&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;repo&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/issues&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Bearer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;PAT&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;all_issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;issues&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;all_issues&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;extend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;issues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;links&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;next&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to retrieve issues (status code &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;

&lt;span class="n"&gt;issues_reduced&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;all_issues&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;issue_reduced&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions_laugh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;laugh&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions_hooray&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;hooray&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions_confused&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;confused&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions_heart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;heart&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions_rocket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rocket&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions_eyes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;issue&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reactions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;eyes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;issues_reduced&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;issue_reduced&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;twitter-issues.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dump&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;issues_reduced&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;all_issues&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; issues and saved to twitter-issues.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Of course, these days I didn't write this code myself. ChatGPT did, but you already knew that.&lt;/p&gt;

&lt;p&gt;I decided to reduce the downloaded data a bit, since much of the content didn't seem relevant to me. I just wanted the URL of the issue, its title and body, and some potentially interesting metadata in the form of the reactions.&lt;/p&gt;

&lt;p&gt;An example of this looks as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"adding Documentation"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"body"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"html_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://github.com/twitter/the-algorithm/pull/838"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reactions_laugh"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reactions_hooray"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reactions_confused"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reactions_heart"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reactions_rocket"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"reactions_eyes"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Building the classifier
&lt;/h2&gt;

&lt;p&gt;With the data downloaded, I started &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;refinery&lt;/a&gt; on my local machine. With refinery, I can label a little bit of data and build some heuristics to quickly test whether my idea works. It's open source under Apache 2.0, so you can just grab it and follow along.&lt;/p&gt;

&lt;p&gt;Simply upload the &lt;code&gt;twitter-issues.json&lt;/code&gt; file we just created:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qo16d8ceqz608tjrw77.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9qo16d8ceqz608tjrw77.png" alt="Upload data"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For the &lt;code&gt;title&lt;/code&gt; and &lt;code&gt;body&lt;/code&gt; attributes, I added two &lt;code&gt;distilbert-base-uncased&lt;/code&gt; embeddings directly from Hugging Face.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5a1dgvv8qms3p9ur35xo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5a1dgvv8qms3p9ur35xo.png" alt="Project settings"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After that, I set up three labeling tasks, of which for now only the &lt;code&gt;Seriousness&lt;/code&gt; task is relevant.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnox30jrtt0hjxwbswr48.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnox30jrtt0hjxwbswr48.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Diving into the data, I labeled a few examples to see what the data looks like and to get some reference labels for the automations I want to build.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jd82pdkx36t0fma4kk3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jd82pdkx36t0fma4kk3.png" alt="Labeling data"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I realized that quite often, people use issues to search for jobs. So I started building my first heuristic for this, in which I use a lookup list I created to search for occurrences of job terms. Later, I'm going to combine it with other heuristics via &lt;a href="https://www.youtube.com/watch?v=8TusRTqp9uQ&amp;amp;ab_channel=KernAI" rel="noopener noreferrer"&gt;weak supervision&lt;/a&gt; to power my classifier.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ckwz2bnmvufgpc3tffd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ckwz2bnmvufgpc3tffd.png" alt="Job search heuristic"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For reference, this is what the lookup list looks like. Terms are added automatically while labeling spans (which is also why I had three labeling tasks: one for classification and two for span labeling), but I could also have uploaded a CSV file of terms.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujx1p9nl1tmxla6t87it.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fujx1p9nl1tmxla6t87it.png" alt="Lookup list job terms"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since I had already labeled a bit of data, I also created a few active learners:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33rx85uw1tbmha1tpv62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F33rx85uw1tbmha1tpv62.png" alt="Active Learner"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With weak supervision, I can easily combine this active learner with my earlier job-search heuristic without having to worry about conflicts, overlaps, and the like.&lt;/p&gt;
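&lt;p&gt;Under the hood, weak supervision merges the (possibly conflicting) votes of all heuristics into one label per record. refinery weights heuristics by their observed quality; as a rough mental model only, here is a plain majority vote with abstentions:&lt;/p&gt;

```python
from collections import Counter

def combine_votes(votes):
    """Majority vote over heuristic outputs; None means the heuristic abstained."""
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None, 0.0  # every heuristic abstained
    label, n = counts.most_common(1)[0]
    return label, n / sum(counts.values())
```

&lt;p&gt;This simplification ignores heuristic precision, which real weak supervision takes into account, but it shows why conflicts and overlaps stop being something you resolve by hand.&lt;/p&gt;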

&lt;p&gt;I also noticed a couple of issues containing nothing but a link to play chess online:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3fqgvwamhgzs0n78k4v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg3fqgvwamhgzs0n78k4v.png" alt="Play chess"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So I added a heuristic that detects links via spaCy.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffrrs2zfjqli9j90o1wf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fffrrs2zfjqli9j90o1wf.png" alt="Title is link"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, I also wanted to create a GPT-based classifier, since this is publicly available data. However, GPT seemed to be down while I was first building this :(&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfu1tuoe68be0rbj4p5t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flfu1tuoe68be0rbj4p5t.png" alt="GPT-down"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After about 20 minutes of labeling and working with the data, this is how my heuristics tab looked:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uvk32nddy6nmyfaqap1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uvk32nddy6nmyfaqap1.png" alt="All heuristics"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So there are mainly active learners, some lookup lists, and regular-expression-like heuristics. I will add GPT in the comments section as soon as I can access it again :)&lt;/p&gt;

&lt;p&gt;Now, I weakly supervised the results:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktxp9dfxreq3zl9yf7vb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fktxp9dfxreq3zl9yf7vb.png" alt="Distribution"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see that the automation already nicely fits the distribution of trolls vs. non-trolls.&lt;/p&gt;

&lt;p&gt;I also noticed a strong difference in confidence:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr214w76egjq997uumjun.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr214w76egjq997uumjun.png" alt="Confidence distribution"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So I headed over to the data browser and set a confidence filter so that I only see records with above 80% confidence.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24qgjr6ud74zlejqabd9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F24qgjr6ud74zlejqabd9.png" alt="Data browser"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that in here, we could also filter by individual heuristic hits, e.g. to find records where different heuristics vote for different labels:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpkcvwqi362oucgzhvz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftpkcvwqi362oucgzhvz2.png" alt="Heuristics filtering"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the dashboard, I now filter for the high-confidence records and see that our classifier is already performing quite well (note that this isn't even using GPT yet!):&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ohc6kwjjmuchm3qyoq0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6ohc6kwjjmuchm3qyoq0.png" alt="Confusion matrix"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Next steps
&lt;/h2&gt;

&lt;p&gt;I exported the project snapshot and labeled examples into the &lt;a href="https://github.com/code-kern-ai/twitter-issues-classifier" rel="noopener noreferrer"&gt;public repository&lt;/a&gt; (&lt;code&gt;twitter_default_all.json.zip&lt;/code&gt;), so you can play with the labeled data yourself. I'll continue on this topic in the next few days, and we'll add a YouTube video to this article for version 2 of the classifier. There certainly are further attributes we can look into, such as taking the length of the body into account (I already saw that shorter bodies are typically troll-like).&lt;/p&gt;
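&lt;p&gt;The body-length idea could become another abstaining heuristic in version 2. A sketch of what that might look like; the 40-character threshold is a pure guess to be tuned on labeled data:&lt;/p&gt;

```python
def body_length_hint(body):
    """Very short or missing bodies are weak evidence for a troll issue."""
    text = (body or "").strip()
    if len(text) > 40:  # threshold is a guess, to be tuned on labeled data
        return None  # abstain: body is long enough to judge by other heuristics
    return "troll"
```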

&lt;p&gt;Also, keep in mind that this is an excellent way to benchmark how much value GPT can add for your use case. Simply add it as a heuristic, try a few different prompts, and play with including or excluding it in the weak supervision procedure. For instance, here I excluded GPT:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uvk32nddy6nmyfaqap1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7uvk32nddy6nmyfaqap1.png" alt="All heuristics"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;I'm really thrilled about Twitter open-sourcing their algorithm, and I'm sure it will add a lot of benefits. What you can already tell is that, due to the nature of Twitter's community, issues are often written by trolls. Detecting them will be important for Twitter's dev team. Maybe this post can be of help for that :)&lt;/p&gt;

</description>
      <category>twitter</category>
      <category>github</category>
      <category>opensource</category>
      <category>nlp</category>
    </item>
    <item>
      <title>Why and how we started Kern AI (our seed funding announcement)</title>
      <dc:creator>Johannes Hötter</dc:creator>
      <pubDate>Thu, 16 Feb 2023 08:17:15 +0000</pubDate>
      <link>https://forem.com/meetkern/why-and-how-we-started-kern-ai-our-seed-funding-announcement-2lp4</link>
      <guid>https://forem.com/meetkern/why-and-how-we-started-kern-ai-our-seed-funding-announcement-2lp4</guid>
      <description>&lt;p&gt;Our co-founders Henrik and Johannes first met in January during a seminar at the Hasso Plattner Institute in Potsdam. It was a one-week seminar, which went from early in the morning until late at night. During that time, Johannes had just started an AI consultancy and was about to land the big first project. He was euphoric - soon, he would be able to implement a large neural network to process image data on a large scale in a real-world project.&lt;/p&gt;

&lt;p&gt;After an in-depth discussion with Henrik, he realized he wasn’t prepared for it. It was a project deeply rooted in physics. The funny thing is, Johannes failed almost every physics exam in school. He had great knowledge of the latest AI frameworks and architectures and was able to write decent ETL pipelines, but he had zero knowledge about &lt;em&gt;what&lt;/em&gt; he was meant to build.&lt;/p&gt;

&lt;p&gt;Henrik, who studied physics before he started his master's degree, offered to help. They decided to implement the project together, and after they finished it successfully (which, looking back, is a bit of a miracle), they realized two things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Henrik and Johannes would make an awesome founder duo&lt;/li&gt;
&lt;li&gt;To implement a successful project, AI alone isn't going to help. It requires both engineers and business users (you might argue, "that is something you can read on Forbes"; they had read it before, but it was something completely different to realize it hands-on in a project).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;During 2020, both continued to implement projects but realized that they’d like to start building their own software. By the end of 2020, they decided to turn their consultancy into a software startup. This is how Kern AI was born.&lt;/p&gt;

&lt;p&gt;"We know a lot about AI and have built great projects that created value for clients, but we certainly are missing lots of domain knowledge. Why not build a No-Code AI tool, and let the end user implement the AI?", say Johannes and Henrik about their thinking at the time.&lt;/p&gt;

&lt;p&gt;Together, they built the first mockup in November '20, signed an agreement with a client by December '20, and developed the MVP in January '21. It was about to go into production, and Henrik and Johannes were about to witness their first SaaS client succeed, right? ... Wrong ...&lt;/p&gt;

&lt;h2&gt;Our first (failed) product: onetask&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3dx8tsctjq1wr26c97d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3dx8tsctjq1wr26c97d.png" alt="First product"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We called the SaaS onetask (you had to do one task to build the AI). By labeling data, a model was trained in the background, which you could then call in a small playground or via an API.&lt;/p&gt;

&lt;p&gt;In February '21, both received the first feedback from the client and were shocked: The AI was just as good as random guessing. It learned &lt;em&gt;nothing.&lt;/em&gt; And in addition, the people who labeled the data felt insecure about "building an AI". Henrik and Johannes figured out two new things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The AI was fed with training data that, as a data scientist, you wouldn't consider &lt;em&gt;training data&lt;/em&gt;. It simply wasn't good enough (they had done plenty of projects beforehand and faced bad data before, but as they were able to fix data issues with their technical knowledge, they hadn't realized how big this obstacle would be).&lt;/li&gt;
&lt;li&gt;Being involved in building AI doesn't mean &lt;em&gt;building the AI&lt;/em&gt;. The users felt insecure. But doesn't No-Code always win? Well, most often, No-Code applications result in deterministic results. Connecting your Webflow form to Hubspot via Zapier means that a new inbound lead is always sent to the CRM. But building AI means building statistical applications that produce results that are probabilistic. It's a new level of complexity.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As both were trying to figure out with the client how they could improve the AI, one developer from the client’s side asked Henrik why they didn't automatically label the data via some rules and then let the users label only parts of the data. This simple statement was core to the product pivot:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Give superpowers to technical users. Our main user shifted from a non-technical user to a developer. Understand what they require to build AI, and help them build that.&lt;/li&gt;
&lt;li&gt;Optimize the collaboration with end users (or generally where the technical user requires help), but have clear separations of responsibility.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that time (the full team was still enrolled at university), Johannes heard about data-centric AI in research, a concept in which developers focus on building the training data of an AI system in collaboration with domain experts. "Jackpot, that's it!" - they looked for another early client, pitched the concept to their data science team (i.e., again, they went to their end users first), and outlined a project.&lt;/p&gt;

&lt;p&gt;In May '21, we had the next MVP.&lt;/p&gt;

&lt;h2&gt;Early signs of the right direction&lt;/h2&gt;

&lt;p&gt;We saw that the initial training data of our client's data science team was an Excel spreadsheet that had been partially labelled years ago. Think of column A containing the raw content that should be predicted and column B (partially) containing what the model should predict. No documentation at all. Yikes.&lt;/p&gt;

&lt;p&gt;Because of this, in the following project, our goals were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;To give data scientists more control in building the AI&lt;/li&gt;
&lt;li&gt;To let domain experts collaborate actively (as we knew that this is crucial from day one)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our MVP gave the data scientists a toolkit powering labelling automation, initially to fill in missing manual labels. To set up the automation, we asked the domain experts to label some data with us in a Zoom session and to think out loud as they were labelling the data.&lt;/p&gt;

&lt;p&gt;Turns out this 2-hour session was worth a ton. Why?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The data scientists learned more about the data itself. Of course, they weren't completely new to the field, but no domain expert had ever said out loud before what they were thinking about a record.&lt;/li&gt;
&lt;li&gt;In the call, we turned the thoughts into code (think little Python snippets), and ran our software to combine the heuristics with some active learning (i.e., Machine Learning on the data labelled in the session). Seeing how the labelling was turning more and more into automation, the domain experts were excited at the end of the call, feeling they were an active and integral part of the process.&lt;/li&gt;
&lt;li&gt;Lastly, the data scientists had a much better foundation to build models. Their training data now contained &lt;em&gt;more&lt;/em&gt; labels, and it contained &lt;em&gt;better&lt;/em&gt; labels (we found tons of mislabeled data in that process).&lt;/li&gt;
&lt;li&gt;Furthermore, the data was documented via automation and was more and more becoming part of an actual software artifact.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkr5117xgjc1c0i46uiv9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkr5117xgjc1c0i46uiv9.png" alt="Screenshot onetask"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ultimately, the data science team built a new model on top of the iterated training data, resulting in an F1-score increase from 72% to 80%. In non-technical terms, this means that you can trust your model much more.&lt;/p&gt;
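&lt;p&gt;As a quick refresher on what that metric combines: F1 is the harmonic mean of precision and recall, so it only rises when the model produces fewer false positives and/or fewer false negatives. A tiny self-contained sketch:&lt;/p&gt;

```python
# F1 as the harmonic mean of precision and recall
def f1(precision: float, recall: float) -> float:
    return 2 * precision * recall / (precision + recall)

# when precision and recall are balanced, F1 equals them
print(round(f1(0.72, 0.72), 2))  # 0.72
print(round(f1(0.80, 0.80), 2))  # 0.8
# an imbalanced model is punished: high precision cannot mask poor recall
print(round(f1(0.95, 0.50), 2))  # 0.66
```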

&lt;p&gt;We found that we were heading in the right direction. Our next question was: "what do we need to build &lt;em&gt;precisely&lt;/em&gt;, and how can we best ship this to developers?”.&lt;/p&gt;

&lt;p&gt;To answer the first question better than anyone else, we realized in early 2022 that we must win the hearts of developers. And this - for many good reasons - typically means via open-source.&lt;/p&gt;

&lt;h2&gt;We went open-source - version 1.0 of “Kern AI refinery”&lt;/h2&gt;

&lt;p&gt;Fast forward to July ‘22 (after many further product iterations and a full redesign), we open-sourced our product under a new name: Kern AI &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;refinery&lt;/a&gt; (the origin of the name is very simple: we want to improve, i.e., refine, the foundation for building models).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq11vcl46p1bk0oh1i0li.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq11vcl46p1bk0oh1i0li.png" alt="Screenshot refinery"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We decided to fully focus on natural language processing (NLP), as we both saw &lt;strong&gt;refinery&lt;/strong&gt; performing exceptionally well in NLP use cases in the past, and as we got incredibly excited about what the future of NLP might bring (this was before ChatGPT btw).&lt;/p&gt;

&lt;p&gt;On our launch day, we were trending on Hacker News, and quickly gained interest from developers all over the world. From the feedback we got, we saw that &lt;strong&gt;refinery&lt;/strong&gt; was moving exactly in the direction we hoped it would:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3cx6go4n0mvvs6m3mly.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs3cx6go4n0mvvs6m3mly.png" alt="Wall of Love"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Shortly after the release, we had more than 1,000 stars on GitHub (i.e., GitHub users expressing that they like the project), hundreds of thousands of views on the repository, and dozens of people telling us about the use cases they implemented via refinery. We were thrilled and started digging deeper.&lt;/p&gt;

&lt;p&gt;This leads us to today.&lt;/p&gt;

&lt;h2&gt;Announcing our seed funding, co-led by Seedcamp and Faber with participation from xdeck, another.vc and Hasso Plattner Seed Fund&lt;/h2&gt;

&lt;p&gt;We are happy to announce that Seedcamp and Faber co-led our seed funding of €2.7m.&lt;/p&gt;

&lt;p&gt;Our investors share our vision of bringing data-centric NLP into action and trust us in building Kern AI by focusing on the end users first. We’re thrilled to receive their support and backing and now aim to continue expanding our platform.&lt;/p&gt;

&lt;p&gt;With this, we today announce the release of our &lt;a href="https://www.youtube.com/watch?v=7VXqimJvzdU" rel="noopener noreferrer"&gt;data-centric NLP platform&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;It is the result of our insights and efforts since we started Kern AI. What makes it stand out?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;It puts users in their roles, while also sparking collaboration and creativity.&lt;/strong&gt; &lt;strong&gt;&lt;a href="https://github.com/code-kern-ai/bricks" rel="noopener noreferrer"&gt;bricks&lt;/a&gt;&lt;/strong&gt; (our content library) is connected with &lt;strong&gt;refinery&lt;/strong&gt; (database + application logic), such that developers can turn an idea into implementation within literally seconds. Why? Because that way, devs &lt;em&gt;and&lt;/em&gt; domain experts can validate ideas immediately.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is capable of doing the sprint &lt;em&gt;and&lt;/em&gt; the marathon.&lt;/strong&gt; Prototype an idea within an afternoon and automatically have the setup to grow your use cases over time. Just like regular software.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You can use it both for batch data &lt;em&gt;and&lt;/em&gt; real-time streams.&lt;/strong&gt; Start by uploading an Excel spreadsheet into refinery, and over time grow your database via native integrations or by setting up your own data stream via our commercial API (gates).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It is flexible.&lt;/strong&gt; You are using crowd labelling to annotate your training data? No problem, you can integrate crowd labelling into refinery. Do you already have a set of tools? This also works, and refinery even comes with native integrations to tools like Labelstudio.
The more familiar you get with the platform, the more use cases you will see. That’s what gets us excited: sparking creativity.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It can power your own NLP product as the database.&lt;/strong&gt; Or you can use it as the NLP API. Or you can even cover a full end-to-end workflow on it. Use cases range from building sophisticated applications up to implementing a small internal natural language-driven workflow.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Our team is genuinely excited about what comes next. We believe that NLP is just about to get started, and it will disrupt almost anything touched by technology. And we’re confident that our work will contribute to it.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>opensource</category>
      <category>news</category>
    </item>
    <item>
      <title>How we power our open-source neural search with qdrant</title>
      <dc:creator>Johannes Hötter</dc:creator>
      <pubDate>Mon, 03 Oct 2022 09:32:52 +0000</pubDate>
      <link>https://forem.com/meetkern/how-we-power-our-neural-search-with-qdrant-47nl</link>
      <guid>https://forem.com/meetkern/how-we-power-our-neural-search-with-qdrant-47nl</guid>
      <description>&lt;p&gt;This is going to be part of a series of posts in which we show how refinery builds on top of qdrant, an open-source vector search engine.&lt;/p&gt;




&lt;p&gt;At &lt;a href="https://kern.ai/"&gt;Kern AI&lt;/a&gt;, we believe that the biggest breakthrough in deep learning is all about embeddings. If you wonder what that means: computers can’t really understand unstructured texts or images. Instead, they require some numeric representation of e.g. a text. But how can you transform text into numbers? This is precisely what neural networks excel at: they can learn such representations. What is the benefit, you might wonder? You can calculate with, e.g., the meaning of a text. What is the output of “king” - “man” + “woman”? With embeddings, you’ll get the numeric representation of “queen”. And that is nothing less than breakthrough material!&lt;/p&gt;
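&lt;p&gt;The famous analogy can be made concrete with a toy sketch. Real embeddings have hundreds of learned dimensions; the two hand-crafted dimensions here ("royalty", "gender") are made up purely to make the arithmetic visible:&lt;/p&gt;

```python
# toy 2-d "embeddings": first axis = royalty, second axis = gender
import math

vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# compute "king" - "man" + "woman" component-wise
target = [k - m + w for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])]

# the word whose vector is closest to the result
closest = max(vectors, key=lambda word: cosine(vectors[word], target))
print(closest)  # queen
```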

&lt;p&gt;In other terms: embeddings are a generalization of database technologies. Instead of filtering and searching only on structured data such as spreadsheets, we’re currently experiencing search technologies built on top of embeddings. Effectively, you turn text into a query-able structure that embeds the meaning of the text.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qQ9ZkXnu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/61f321fd2dc7db10189dabdb/633aa9f19841f9edf3b6c8a7_tweet_db_extension.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qQ9ZkXnu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/61f321fd2dc7db10189dabdb/633aa9f19841f9edf3b6c8a7_tweet_db_extension.png" alt="Tweet from Francois Chollet" width="880" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s say you’re looking for the answer to a question in a large text corpus. With embeddings, you can turn the text passages into numeric vectors, and compute the closest vector to your question. Statistically speaking, this gives you the most relevant answer to your question. You don’t even need to match keywords here, embeddings understand synonyms and context!&lt;/p&gt;
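&lt;p&gt;A brute-force version of that search is only a few lines, assuming the passages have already been embedded (the vectors below are made up for illustration). It is exactly this computation that vector search engines later optimize at scale:&lt;/p&gt;

```python
# brute-force nearest-neighbor retrieval over pre-computed passage vectors
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# pretend embeddings of three passages (real ones come from a language model)
passages = {
    "The capital of France is Paris.":  [0.9, 0.1, 0.0],
    "Embeddings map text to vectors.":  [0.1, 0.8, 0.3],
    "Rust guarantees memory safety.":   [0.0, 0.2, 0.9],
}

# pretend embedding of the question "What is France's capital?"
question_vector = [0.85, 0.15, 0.05]

# the statistically most relevant answer is the closest vector
best = max(passages, key=lambda p: cosine(passages[p], question_vector))
print(best)  # The capital of France is Paris.
```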

&lt;p&gt;At Kern AI, we have built refinery, an open-source IDE for data-centric NLP. Here, we make use of large-scale language models to compute said embeddings and enable developers to find e.g. similar training samples, to find outliers, or to programmatically apply labeling to the training data. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--C_dRW3ae--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/61f321fd2dc7db10189dabdb/633aa9d3dcb06d49ce8c1109_neural_search.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--C_dRW3ae--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/61f321fd2dc7db10189dabdb/633aa9d3dcb06d49ce8c1109_neural_search.png" alt="Screenshot neural search" width="880" height="550"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But as we’ve built our developer tool, we realized that scalability was a big issue. Retrieving similar pairs based on their cosine similarity (given by their embeddings) was something we couldn’t do on larger scales, e.g. hundreds of thousands of records.&lt;/p&gt;

&lt;p&gt;This is where qdrant, an open-source vector search engine, enters the game. Implemented in Rust, their engine provides an API to their vector database which allows you to retrieve similar pairs within milliseconds, even under high load.&lt;/p&gt;

&lt;p&gt;To get started with qdrant, simply execute the following lines in your CLI:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--8D97khB1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/61f321fd2dc7db10189dabdb/633aaa0e5209928b9f06a213_qdrant_setup.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--8D97khB1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/61f321fd2dc7db10189dabdb/633aaa0e5209928b9f06a213_qdrant_setup.png" alt="Install qdrant" width="880" height="526"&gt;&lt;/a&gt;&lt;/p&gt;
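&lt;p&gt;In text form, the setup shown in the screenshot boils down to pulling and running qdrant's Docker image (the quick-start commands at the time of writing; tags and ports may have changed since):&lt;/p&gt;

```shell
# pull the official image and expose qdrant's default API port
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
```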

&lt;p&gt;Now, let’s look into an example snippet of our application, in which we have set up an endpoint using FastAPI to retrieve the (up to) 100 most similar records given some reference record:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7H9hv-eu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/61f321fd2dc7db10189dabdb/633aaa202c4e884edd4da757_qdrant_python.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7H9hv-eu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://uploads-ssl.webflow.com/61f321fd2dc7db10189dabdb/633aaa202c4e884edd4da757_qdrant_python.png" alt="Implementing with qdrant" width="880" height="1017"&gt;&lt;/a&gt;&lt;br&gt;
You can look further into the code base &lt;a href="https://github.com/code-kern-ai/refinery-neural-search/blob/dev/app.py"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;As you can see, it is rather straightforward. You take the query vector of your reference record and can limit the result set by a similarity threshold (i.e., which cosine similarity do result records need to have to be contained in the filtered set?). Further, you could extend the search via query filters, which we’re not making use of so far.&lt;/p&gt;

&lt;p&gt;With its scalability and stability, qdrant is a core technology in our stack. As similarity search is not limited to a neural search approach, we’re even using it for features such as recommendation engines. We’re going to share insights about this in a future article.&lt;/p&gt;

&lt;p&gt;Now, we encourage you to look into qdrant and their engine. You can see one live example in our online playground, which you can find here: &lt;a href="https://demo.kern.ai"&gt;https://demo.kern.ai&lt;/a&gt;&lt;br&gt;
Check out their GitHub here: &lt;a href="https://github.com/qdrant/qdrant"&gt;https://github.com/qdrant/qdrant&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And don’t forget to leave a star in our refinery: &lt;a href="https://github.com/code-kern-ai/refinery"&gt;https://github.com/code-kern-ai/refinery&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>opensource</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Build for Hugging Face, Rasa or Sklearn</title>
      <dc:creator>Johannes Hötter</dc:creator>
      <pubDate>Tue, 06 Sep 2022 11:24:01 +0000</pubDate>
      <link>https://forem.com/meetkern/build-for-hugging-face-rasa-or-sklearn-5c70</link>
      <guid>https://forem.com/meetkern/build-for-hugging-face-rasa-or-sklearn-5c70</guid>
      <description>&lt;p&gt;We've built our &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;open-source IDE for data-centric NLP&lt;/a&gt; with the belief that data scientists and engineers know best what kind of framework they want to use for their model building. Today, we'll show you three new adapters for the SDK.&lt;/p&gt;

&lt;p&gt;Let's jump right in.&lt;/p&gt;




&lt;h2&gt;Hugging Face&lt;/h2&gt;

&lt;p&gt;Transformer models are arguably one of the most interesting breakthroughs in natural language processing. With Hugging Face, you have access to an abundance of pre-trained models. Ultimately, however, you want to finetune them to your task at hand. This is where Hugging Face datasets come into play, and you can generate them with ease using our adapter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F61f321fd2dc7db10189dabdb%2F63172718fee3175ae70b1523_huggingface.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F61f321fd2dc7db10189dabdb%2F63172718fee3175ae70b1523_huggingface.png" alt="Hugging Face"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dataset is a Hugging Face-native object, which you can use to finetune your model. This could look as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F61f321fd2dc7db10189dabdb%2F6317284f890932a06ef73247_huggingface_2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F61f321fd2dc7db10189dabdb%2F6317284f890932a06ef73247_huggingface_2.png" alt="Hugging Face"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Hugging Face documentation has some further examples on how to finetune your models.&lt;/p&gt;




&lt;h2&gt;Sklearn&lt;/h2&gt;

&lt;p&gt;The Swiss Army knife of machine learning. You most likely have already worked with it, and if so, you surely love the richness of algorithms to select from. We do too, so we decided to add an integration to Sklearn. You can pull the data and train a model as easily as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F61f321fd2dc7db10189dabdb%2F631726f2bf942d076847c99c_huggingface.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F61f321fd2dc7db10189dabdb%2F631726f2bf942d076847c99c_huggingface.png" alt="Sklearn"&gt;&lt;/a&gt;&lt;/p&gt;
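&lt;p&gt;In text form, a hedged sketch of that flow (the example data is made up, and the adapter's actual variable names may differ): vectorize the exported texts, fit a classifier, and predict:&lt;/p&gt;

```python
# sketch: train a text classifier on the exported train split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# illustrative stand-ins for the adapter's train/test splits
train_texts = ["great support", "never again", "works perfectly", "totally broken"]
train_labels = ["positive", "negative", "positive", "negative"]
test_texts = ["works great"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)
print(model.predict(test_texts))
```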

&lt;p&gt;The data object already contains train and test splits derived from the weakly supervised and manually labeled data. You can now fully focus on hyperparameter tuning and model selection. We also highly recommend checking out Truss, an open-source library to quickly serve these models. We'll cover this in a separate article.&lt;/p&gt;




&lt;h2&gt;Rasa&lt;/h2&gt;

&lt;p&gt;If you want to build a chatbot or conversational AI, Rasa is arguably one of the first choices with its strong framework and community. We love building chatbots with Rasa and have already covered a YouTube series on how to do so. Here, we'll introduce you to the Rasa adapter for our SDK, with which you can build the training data for your chatbot with ease.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F61f321fd2dc7db10189dabdb%2F6317272bb243a03c84a9aa8f_rasa_1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F61f321fd2dc7db10189dabdb%2F6317272bb243a03c84a9aa8f_rasa_1.png" alt="Rasa"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This will directly pull your training data into a YAML format, ready off-the-shelf for your chatbots to learn from. This way, you can manage and maintain all your chat data within refinery and have it at hand for your chatbot training.&lt;/p&gt;
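&lt;p&gt;For reference, Rasa's NLU training data format looks roughly like this (the intent names and examples here are illustrative, not what the adapter exports for your project):&lt;/p&gt;

```yaml
version: "3.1"
nlu:
- intent: greet
  examples: |
    - hey there
    - good morning
- intent: check_order_status
  examples: |
    - where is my order?
    - has my package shipped yet?
```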

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F61f321fd2dc7db10189dabdb%2F63172732baf3d91c232e0a10_rasa_2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fuploads-ssl.webflow.com%2F61f321fd2dc7db10189dabdb%2F63172732baf3d91c232e0a10_rasa_2.png" alt="Rasa"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;As you can see, we already cover some integrations to open-source NLP frameworks. If you're missing one, please let us know. We're happy to add more continuously to bridge the gap between building training data and building models.&lt;/p&gt;

&lt;p&gt;If you haven't tried out refinery yet, make sure to check out our &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. Also, feel free to join our &lt;a href="https://discord.com/invite/qf4rGCEphW" rel="noopener noreferrer"&gt;Discord community&lt;/a&gt; - we're happy to meet you there!&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>nlp</category>
      <category>datascience</category>
      <category>python</category>
    </item>
    <item>
      <title>Beautiful UIs with Figma and Tailwind</title>
      <dc:creator>Johannes Hötter</dc:creator>
      <pubDate>Mon, 25 Jul 2022 21:27:00 +0000</pubDate>
      <link>https://forem.com/meetkern/beautiful-uis-with-figma-and-tailwind-49dn</link>
      <guid>https://forem.com/meetkern/beautiful-uis-with-figma-and-tailwind-49dn</guid>
      <description>&lt;p&gt;In this post, we’re going to share how we used Figma and Tailwind to redesign our open-source tool refinery. The article will entirely focus on how to build beautiful UIs quickly. You don’t need any prior knowledge to understand this post.&lt;/p&gt;

&lt;p&gt;After this post, you’ll know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Why Figma and Tailwind are such a great combination to build a beautiful UI&lt;/li&gt;
&lt;li&gt;How you can quickly build a consistent design&lt;/li&gt;
&lt;li&gt;That mockups are worth the time! :-)&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;Our change in UI&lt;/h3&gt;

&lt;p&gt;To jump into this article, we first want to show you what our manual labeling screen used to look like. The sidebar was too dominant, and the color coding and the label colors were a bit gloomy. We knew our UI needed some improvements.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcw60s1mquzuhvc5vz7j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwcw60s1mquzuhvc5vz7j.png" alt="Old UI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Because of that, we built up some mockups using Figma and Tailwind, arguably one of the best design toolkit combinations. Figma is excellent for building high-quality mockups fast and collaboratively. In contrast, Tailwind offers a great set of predefined classes and components for your web app - and it comes with an amazingly well-designed Figma template kit. Within hours, we set up the following mockup:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sa66hprpwgfhvyhlkji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4sa66hprpwgfhvyhlkji.png" alt="Mockup UI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Showing this page to users, we saw that it not only offered a better-looking design but also provided a better workflow. From here, we decided to implement the mockup in Tailwind. The result looks almost the same as the mockup, as you can see here:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4ooommj26f829ec9oy9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx4ooommj26f829ec9oy9.png" alt="New UI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So with that in mind, we want to show you how we did that.&lt;/p&gt;




&lt;h3&gt;Don’t start from scratch!&lt;/h3&gt;

&lt;p&gt;First, it is worth knowing that there are template kits available in Figma. You can browse them in Figma's library or in third-party libraries. So you don’t need to rebuild everything on your own!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Our tip: use templates for which you also have the HTML code. This enables you to easily switch between Figma mockups and implemented UI elements without losing the design. But you can still build the HTML files from scratch.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In our case, we imported the Tailwind template into Figma and had access to the mockup elements: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ykxz3jpy4t84871ftas.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ykxz3jpy4t84871ftas.png" alt="Tailwind Template"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Those same elements are available in the Tailwind UI components library: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm7wax8rdyyozzanqqbu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frm7wax8rdyyozzanqqbu.png" alt="Tailwind UI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is not only true for high-level elements like the layout of your screen but also for detailed elements like badges:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkvlrf8ow5lhqbkw4jwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjkvlrf8ow5lhqbkw4jwj.png" alt="Tailwind Badges"&gt;&lt;/a&gt;&lt;/p&gt;
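To make the badge example concrete, here is a minimal sketch of what such an element looks like as Tailwind markup. The specific utility classes below are illustrative, mirroring the common Tailwind UI badge pattern, and are not necessarily the exact classes used in our application:

```html
<!-- A small status badge built from Tailwind utility classes.
     The classes below are an illustrative example, not the exact
     markup from our application. -->
<span class="inline-flex items-center rounded-full bg-green-100 px-2.5 py-0.5 text-xs font-medium text-green-800">
  Active
</span>
```

Because the Figma template and the component library share the same elements, a badge you drop into a mockup maps almost one-to-one onto a snippet like this.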

&lt;h3&gt;
  
  
  Implement your screens in Figma first
&lt;/h3&gt;

&lt;p&gt;From here, it becomes easy to drag and drop your elements into the respective layout you want. After 15 minutes of playing around with it, you should already feel comfortable building custom mockups in Tailwind, and it should take you little effort to turn these mockups into UI shells.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh8ogau88twcxoy0rmk6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnh8ogau88twcxoy0rmk6.png" alt="Figma editing"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We did this not just for one screen but for many; roughly 95% of what you can see in our application was first designed as a mockup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9z9gn3g2dtalbwdfz7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9z9gn3g2dtalbwdfz7r.png" alt="Figma screens"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reason we did so is that it helped us ensure consistency throughout the entire application. It made it easier for our team members to get early feedback on the application’s whole workflow and on how the UI would differ from our previous version. Ultimately, it also allowed us to implement the UI much faster: as soon as the mockup was finished (for that version, of course - we’re already improving on it in upcoming versions), we “just” had to implement the shell.&lt;/p&gt;

&lt;p&gt;We believe that Figma and Tailwind are the best combination for building beautiful UIs, but there are many great alternatives. Most important is that this approach helps you build what’s been discussed and agreed upon in a short time. We’ll continue building our application based on mockups, and we’ll discuss them with our community :)&lt;/p&gt;

&lt;p&gt;If you’re interested in seeing an application built this same way, check out our &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;open-source tool refinery&lt;/a&gt;. Also, feel free to join our community on &lt;a href="https://discord.com/invite/qf4rGCEphW" rel="noopener noreferrer"&gt;Discord&lt;/a&gt; if you have any questions.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>design</category>
      <category>beginners</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>We're open-source! The data-centric sibling of VS Code</title>
      <dc:creator>Johannes Hötter</dc:creator>
      <pubDate>Mon, 18 Jul 2022 09:53:00 +0000</pubDate>
      <link>https://forem.com/meetkern/were-open-source-the-data-centric-sibling-of-vs-code-3f8p</link>
      <guid>https://forem.com/meetkern/were-open-source-the-data-centric-sibling-of-vs-code-3f8p</guid>
      <description>&lt;h2&gt;
  
  
  Hello, open-source world!
&lt;/h2&gt;

&lt;p&gt;We have been working tirelessly toward this day for a long time. Finally, we can say it: Kern refinery is going &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;open-source&lt;/a&gt;, and we are celebrating with our version 1.0!&lt;/p&gt;

&lt;p&gt;There are three main reasons why we decided to take this step:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Community&lt;/strong&gt;&lt;br&gt;
We strive to create a community of like-minded devs who want to participate in the movement toward data-centric AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Innovation&lt;/strong&gt;&lt;br&gt;
We want to drive innovation through collaboration. Open-source software provides a faster response time to current market needs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transparency&lt;/strong&gt;&lt;br&gt;
It is important to us that all of our users can access and customize our software.&lt;/p&gt;

&lt;p&gt;So as of today, you can &lt;code&gt;pip install kern-refinery&lt;/code&gt; on your machine to download and run our application.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7a5tltcobvouj3aoz5d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fs7a5tltcobvouj3aoz5d.png" alt="pip install kern-refinery"&gt;&lt;/a&gt;&lt;/p&gt;
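If you prefer commands over screenshots, this is the gist of it. The `refinery start` and `refinery stop` subcommands are taken from the project’s README at the time of writing and require Docker; treat the exact subcommand names as an assumption if your version differs:

```shell
# Install the refinery CLI from PyPI
pip install kern-refinery

# From an empty project folder: pull the Docker images and start the app
# (assumes Docker is installed and running; command names per the README)
refinery start

# Shut the services down again when you are done
refinery stop
```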

&lt;p&gt;Now that our open-source go-live is complete, we look forward to working toward those three goals every day.&lt;/p&gt;

&lt;p&gt;Our work toward the release went - again - mainly into three areas:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;New UI and improved UX&lt;/strong&gt;&lt;br&gt;
You might have noticed that our app has a new look. We did several user tests, rebranded our app, and went for the following look:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhx8ptyl66eynr1u6iga.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffhx8ptyl66eynr1u6iga.png" alt="refinery UI"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also, we integrated some features that make it easier to play around with the data from a programmatic point of view, such as the record IDE:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgvj890psl7860tzt7vn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxgvj890psl7860tzt7vn.png" alt="Record IDE"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can check out those things in our guide.&lt;/p&gt;

&lt;p&gt;What do you think of the new UI? Let us know - we’re excited to hear your feedback!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extended documentation and use cases&lt;/strong&gt;&lt;br&gt;
We've put extra effort into everything related to your first impression of, and first successes with, Kern refinery. Documentation and use cases are key to that. You can now find more insights in our documentation, as well as hands-on examples on &lt;a href="https://www.youtube.com/channel/UCru-6X24b76TRsL6KWMFEFg" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt;, on our &lt;a href="https://github.com/code-kern-ai" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;, and in our community spaces (&lt;a href="https://github.com/code-kern-ai/refinery/discussions" rel="noopener noreferrer"&gt;discussion forum&lt;/a&gt; and &lt;a href="https://discord.com/invite/qf4rGCEphW" rel="noopener noreferrer"&gt;Discord&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architectural changes&lt;/strong&gt;&lt;br&gt;
Lastly, with the open-source release, we wanted to improve our architectural design. We've put a lot of effort into refactoring services and making sure we can iterate quickly on your product feedback and ideas. In total, we have now spent more than 18 months on this application, from initial design to the first MVP and now version 1.0 - but of course, we are still only getting started building the data-centric development environment designed to help data scientists build great AI models. Help us, and we'll make sure to continue building something people love!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxy8j1r8bvj2vsztdyy2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjxy8j1r8bvj2vsztdyy2.png" alt="refinery architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To stay up-to-date with everything, make sure to subscribe to our &lt;a href="https://www.kern.ai/resources/newsletter" rel="noopener noreferrer"&gt;newsletter&lt;/a&gt;, and don't forget to give us a star on &lt;a href="https://github.com/code-kern-ai/refinery" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;. We couldn't be more excited for the future!&lt;/p&gt;

</description>
      <category>opensource</category>
      <category>datascience</category>
      <category>nlp</category>
      <category>python</category>
    </item>
  </channel>
</rss>
