<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Jesper Dramsch</title>
    <description>The latest articles on Forem by Jesper Dramsch (@jesperdramsch).</description>
    <link>https://forem.com/jesperdramsch</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F435637%2F2350b268-4c36-4ab7-b03d-7cc2ef86513c.jpg</url>
      <title>Forem: Jesper Dramsch</title>
      <link>https://forem.com/jesperdramsch</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/jesperdramsch"/>
    <language>en</language>
    <item>
      <title>Deepmind built a competitive coding AI called AlphaCode here's the gist of it</title>
      <dc:creator>Jesper Dramsch</dc:creator>
      <pubDate>Sun, 06 Feb 2022 16:33:00 +0000</pubDate>
      <link>https://forem.com/jesperdramsch/deepmind-built-a-competitive-coding-ai-called-alphacode-heres-the-gist-of-it-2o5c</link>
      <guid>https://forem.com/jesperdramsch/deepmind-built-a-competitive-coding-ai-called-alphacode-heres-the-gist-of-it-2o5c</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--OnINt4gj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-deepmind-alphacode.thumbnail.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--OnINt4gj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-deepmind-alphacode.thumbnail.jpg" alt="" width="450" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I'm obsessed with AlphaCode.&lt;/p&gt;

&lt;p&gt;How do you go from a mediocre text generator to an AI that can hold its own in programming competitions?&lt;/p&gt;

&lt;p&gt;Deepmind attempts this with AlphaCode.&lt;/p&gt;

&lt;p&gt;Let's look at their research together.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;I can safely say the results of AlphaCode exceeded my expectations. I was sceptical because even in simple competitive problems it is often required not only to implement the algorithm, but also (and this is the most difficult part) to invent it. AlphaCode managed to perform at the level of a promising new competitor. I can't wait to see what lies ahead! ~ Mike Mirzayanov&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WEgHV0ed--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dramsch.net/images/alphacode.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WEgHV0ed--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dramsch.net/images/alphacode.gif" alt="Quick scroll though the Alphacode blog post" width="880" height="790"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  😼 The Problem
&lt;/h2&gt;

&lt;p&gt;Competitive programming is a whiteboard interview on steroids.&lt;/p&gt;

&lt;p&gt;You get:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text problem statements&lt;/li&gt;
&lt;li&gt;Description of inputs&lt;/li&gt;
&lt;li&gt;Specifics of outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You have to understand and implement a solution.&lt;/p&gt;

&lt;p&gt;Finally, your solution has to pass hidden tests.&lt;/p&gt;

&lt;p&gt;The Backspace Problem &lt;a href="https://ioi.te.lv/oi/pdf/v14_2020_133_142.pdf"&gt;(Mirzayanov, 2020)&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lrmiJ543--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/alphacode-backspace-problem.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lrmiJ543--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/alphacode-backspace-problem.jpg" alt="The Backspace Problem (Mirzayanov, 2020)" width="421" height="494"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  💾 The Data
&lt;/h2&gt;

&lt;p&gt;The best results were obtained by pre-training on a large snapshot of public GitHub code. Not just Python (the target language) but also C++, C#, Go, Java, Rust, and more.&lt;/p&gt;

&lt;p&gt;The model was then fine-tuned on the CodeContests dataset.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Teach AI code&lt;/li&gt;
&lt;li&gt;Get into competitions&lt;/li&gt;
&lt;li&gt;???&lt;/li&gt;
&lt;li&gt;Profit!&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  3️⃣ ???
&lt;/h2&gt;

&lt;p&gt;There is a third step:&lt;/p&gt;

&lt;p&gt;Reduce False Positives.&lt;/p&gt;

&lt;p&gt;Create extra hidden tests.&lt;/p&gt;

&lt;p&gt;A larger, more varied test set reduces false positives.&lt;/p&gt;

&lt;p&gt;Tracking "slow positives" weeds out inefficient code that passes the tests but would exceed the time limit.&lt;/p&gt;

&lt;p&gt;Going from 12 to 200+ tests per problem shows a significant improvement!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bqhXnOr0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/alphacode-codeforces-false-positives.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bqhXnOr0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/alphacode-codeforces-false-positives.jpg" alt="False positive rate on different datasets" width="637" height="259"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  ☝️ Intermission: Amazing tools
&lt;/h2&gt;

&lt;p&gt;"Hypothesis" can automatically generate more tests for you:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://hypothesis.works"&gt;https://hypothesis.works&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Advent of Code is competitive coding light, often with "slow positives":&lt;/p&gt;

&lt;p&gt;&lt;a href="https://adventofcode.com"&gt;adventofcode.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dramsch.net/posts/every-december-i-become-a-better-programmer-and-you-could-too/"&gt;My Short essay&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🔬 The Evaluation
&lt;/h2&gt;

&lt;p&gt;In competitive coding, you can't just submit solutions until one works.&lt;/p&gt;

&lt;p&gt;So to simulate that, Deepmind settled on an n@k metric.&lt;/p&gt;

&lt;p&gt;Generate k candidate solutions per problem, then submit at most n of them.&lt;/p&gt;

&lt;p&gt;They use n=10, so 10@k, to evaluate AlphaCode.&lt;/p&gt;

&lt;p&gt;(They also report pass@k, where any of the k samples may count, as an upper bound.)&lt;/p&gt;
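The pass@k upper bound can be estimated without bias using the standard combinatorial formula from the code-generation literature; a minimal sketch (the function name is mine):

```python
from math import comb

def pass_at_k(n, c, k):
    # n generated samples, c of which are correct: probability that a
    # random subset of k samples contains at least one correct solution.
    if k > n - c:
        return 1.0  # fewer wrong samples than draws, a hit is guaranteed
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 2 samples of which 1 is correct, pass@1 comes out to 0.5, as expected.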

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XkrmJHOw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/alphacode-nk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XkrmJHOw--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/alphacode-nk.jpg" alt="Performance of Deepmind AlphaCode on different problems" width="880" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🧮 The Algorithm
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Train a model on a bunch of code.&lt;/li&gt;
&lt;li&gt;Then focus on contests. &lt;/li&gt;
&lt;li&gt;Then generate a bunch of potential solutions.&lt;/li&gt;
&lt;li&gt;Filter out trivial solutions.&lt;/li&gt;
&lt;li&gt;Cluster similar solutions.&lt;/li&gt;
&lt;li&gt;Select candidates (at most 10).&lt;/li&gt;
&lt;li&gt;Submit and evaluate.&lt;/li&gt;
&lt;/ul&gt;
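The filter-and-cluster steps above can be sketched like this, where run is a hypothetical sandbox that executes one candidate program on one input. Everything here is illustrative, not Deepmind's implementation:

```python
def select_candidates(candidates, example_tests, extra_inputs, run, limit=10):
    # Filter: drop candidates that fail the problem's example tests.
    survivors = [c for c in candidates
                 if all(run(c, x) == y for x, y in example_tests)]
    # Cluster: candidates that produce identical outputs on generated
    # extra inputs are treated as behaviourally equivalent.
    clusters = {}
    for c in survivors:
        key = tuple(run(c, x) for x in extra_inputs)
        clusters.setdefault(key, []).append(c)
    # Submit one representative from each of the largest clusters.
    ranked = sorted(clusters.values(), key=len, reverse=True)
    return [group[0] for group in ranked[:limit]]
```

Picking one representative per cluster spends the 10 submissions on behaviourally distinct programs instead of near-duplicates.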

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2HaONTT7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/alphacode-algorithm.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2HaONTT7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/alphacode-algorithm.jpg" alt="The Deepmind AlphaCode algorithm" width="880" height="324"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🤖 It's a transformer!
&lt;/h2&gt;

&lt;p&gt;Specifically, an encoder-decoder architecture that takes the flattened problem statement plus tokenized metadata.&lt;/p&gt;

&lt;p&gt;Tokenizer: &lt;a href="https://arxiv.org/abs/1808.06226"&gt;SentencePiece (Kudo &amp;amp; Richardson, 2018)&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Internally, it uses &lt;a href="https://arxiv.org/abs/1911.02150"&gt;multi-query attention (Shazeer, 2019)&lt;/a&gt;, sharing key and value heads across query heads in each attention block.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://alphacode.deepmind.com/"&gt;Visualize the multi-query model here&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  🦾 Let's stop here. This is long.
&lt;/h2&gt;

&lt;p&gt;AlphaCode is an above-average competitive programmer.&lt;/p&gt;

&lt;p&gt;The paper has ablation studies on tricks in the transformer and compares the impact of different pre-training datasets and training strategies.&lt;/p&gt;

&lt;p&gt;It's a &lt;a href="https://storage.googleapis.com/deepmind-media/AlphaCode/competition_level_code_generation_with_alphacode.pdf"&gt;great read&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DIiAU-5f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/alphacode-performance.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DIiAU-5f--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/alphacode-performance.jpg" alt="Performance of AlphaCode on the Codeforces website." width="880" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Pre-train on huge GitHub data&lt;/li&gt;
&lt;li&gt;Fine-tune on CodeContest data with extra test cases&lt;/li&gt;
&lt;li&gt;Transformer with multi-query attention&lt;/li&gt;
&lt;li&gt;Generate many candidates, then filter out ~99% and cluster the rest&lt;/li&gt;
&lt;li&gt;Submit a max of 10&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Read the &lt;a href="https://www.deepmind.com/blog/article/Competitive-programming-with-AlphaCode"&gt;Deepmind blog post&lt;/a&gt;&lt;/p&gt;

</description>
      <category>deepmind</category>
      <category>machinelearning</category>
      <category>transformer</category>
    </item>
    <item>
      <title>Follow these 5 great ML creators on Twitter</title>
      <dc:creator>Jesper Dramsch</dc:creator>
      <pubDate>Thu, 03 Feb 2022 11:33:00 +0000</pubDate>
      <link>https://forem.com/jesperdramsch/follow-these-5-great-ml-creators-on-twitter-3p0a</link>
      <guid>https://forem.com/jesperdramsch/follow-these-5-great-ml-creators-on-twitter-3p0a</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LSivgxjR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-twitter-follows.thumbnail.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LSivgxjR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-twitter-follows.thumbnail.jpg" alt="" width="450" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;How is the Twitter website free?!&lt;/p&gt;

&lt;p&gt;If you're getting into machine learning and deep learning, you can get a whole education on Twitter.&lt;/p&gt;

&lt;p&gt;Check out these creators:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://twitter.com/svpino"&gt;@svpino&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/Jeande_d"&gt;@Jeande_d&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/gusthema"&gt;@gusthema&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/philipvollet"&gt;@philipvollet&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://twitter.com/abhi1thakur"&gt;@abhi1thakur&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let's look at their work!&lt;/p&gt;

&lt;h2&gt;
  
  
  🎨 In this thread Jean de Nyandwi goes into depth on ConvNets!
&lt;/h2&gt;

&lt;p&gt;The development from dense networks to convolutional layers, on to the secret enabler of deep learning: pooling layers.&lt;/p&gt;

&lt;p&gt;Sharing great courses and resources along the way!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The image you see below is a typical architecture of Convolutional Neural Networks, a.k.a ConvNets.  &lt;/p&gt;

&lt;p&gt;ConvNets is a neural network type architecture that is mostly used in image recognition tasks.  &lt;/p&gt;

&lt;p&gt;More about ConvNets 🧵🧵  &lt;/p&gt;

&lt;p&gt;Image credit: CS 231n &lt;a href="https://t.co/fWTMiOUP4r"&gt;pic.twitter.com/fWTMiOUP4r&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Jean de Nyandwi (@Jeande_d) &lt;a href="https://twitter.com/Jeande_d/status/1466415643876417540?ref_src=twsrc%5Etfw"&gt;December 2, 2021&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  🔎 But ConvNets have recently been dethroned by Transformers!
&lt;/h2&gt;

&lt;p&gt;In this thread Abhishek Thakur (4x @Kaggle Grandmaster, working on Huggingface AutoNLP) goes into detail on an implementation of transformers in @PyTorch!&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Attention is all you need" implementation from scratch in PyTorch. A Twitter thread:&lt;br&gt;&lt;br&gt;
1/&lt;/p&gt;

&lt;p&gt;— abhishek (@abhi1thakur) &lt;a href="https://twitter.com/abhi1thakur/status/1470406419786698761?ref_src=twsrc%5Etfw"&gt;December 13, 2021&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  🤝 Transformers and ConvNets are united by one player!
&lt;/h2&gt;

&lt;p&gt;The Softmax function (in most cases).&lt;/p&gt;

&lt;p&gt;Santiago Valdarrama writes some epic threads (you've likely seen them).&lt;/p&gt;

&lt;p&gt;More recently, he started sharing neat 30-second visuals on core ML concepts:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Softmax is one of the most popular activation functions.  &lt;/p&gt;

&lt;p&gt;Here is a 30-second introduction to it. &lt;a href="https://t.co/cgwWR1vQS9"&gt;pic.twitter.com/cgwWR1vQS9&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Santiago (@svpino) &lt;a href="https://twitter.com/svpino/status/1482427385823727616?ref_src=twsrc%5Etfw"&gt;January 15, 2022&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
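Since the softmax comes up here: a minimal, numerically stable version in plain Python. Subtracting the maximum logit before exponentiating prevents overflow for large inputs:

```python
from math import exp

def softmax(logits):
    # Shift by the maximum logit for numerical stability.
    m = max(logits)
    exps = [exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

The outputs are positive and sum to 1, so they can be read as probabilities.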

&lt;h2&gt;
  
  
  ✒️ How do you get a text into a Transformer or ConvNet though?!
&lt;/h2&gt;

&lt;p&gt;Images are easy, right? They're just matrices.&lt;/p&gt;

&lt;p&gt;For text you need something called a text embedding, which converts your words into numbers.&lt;/p&gt;

&lt;p&gt;So Luiz Gustavo made a thread about Embeddings:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Models like &lt;a href="https://twitter.com/hashtag/AlphaCode?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#AlphaCode&lt;/a&gt;, &lt;a href="https://twitter.com/hashtag/LaMDA?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#LaMDA&lt;/a&gt; &lt;a href="https://twitter.com/hashtag/Copilot?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#Copilot&lt;/a&gt;, &lt;a href="https://twitter.com/hashtag/GPT?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#GPT&lt;/a&gt;, &lt;a href="https://twitter.com/hashtag/CLIP?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#CLIP&lt;/a&gt;, &lt;a href="https://twitter.com/hashtag/DALL?src=hash&amp;amp;ref_src=twsrc%5Etfw"&gt;#DALL&lt;/a&gt;-e depends on a very important concept:   &lt;/p&gt;

&lt;p&gt;➡️Text Embeddings  &lt;/p&gt;

&lt;p&gt;But do you know what a Text Embedding is and how to create one?&lt;br&gt;&lt;br&gt;
Do you even need to create one?  &lt;/p&gt;

&lt;p&gt;Let's take a look...  &lt;/p&gt;

&lt;p&gt;[1.32 min]&lt;br&gt;&lt;br&gt;
[This is a good one!👀]  &lt;/p&gt;

&lt;p&gt;1/9🧵&lt;/p&gt;

&lt;p&gt;— Luiz GUStavo 💉💉💉🎉 (@gusthema) &lt;a href="https://twitter.com/gusthema/status/1489268532038799361?ref_src=twsrc%5Etfw"&gt;February 3, 2022&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
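To make "words into numbers" concrete, here is a toy lookup-table embedding. Real embeddings are learned during training rather than drawn at random, and the tiny vocabulary here is made up:

```python
import random

random.seed(0)
vocab = {"the": 0, "cat": 1, "sat": 2}  # toy vocabulary
dim = 4
# One vector per word; in a real model these weights are learned.
table = [[random.uniform(-1.0, 1.0) for _ in range(dim)] for _ in vocab]

def embed(sentence):
    # Look up each known word's vector and average them into a
    # single fixed-size sentence vector.
    vecs = [table[vocab[w]] for w in sentence.split() if w in vocab]
    return [sum(col) / len(vecs) for col in zip(*vecs)]
```

Whatever the sentence length, the result has a fixed dimension, which is what downstream models need.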

&lt;h2&gt;
  
  
  🛠️ Finally you need the right tools to train your models, right?
&lt;/h2&gt;

&lt;p&gt;Philip Vollet finds the libraries and apps before they're cool.&lt;/p&gt;

&lt;p&gt;He's accidentally DDoS'd a few websites by sharing them.&lt;/p&gt;

&lt;p&gt;Maybe the latest tuner, combined with automatic feature selection using SHAP?&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;shap-hypetune a python package for simultaneous Hyperparameters Tuning and Features Selection for Gradient Boosting Models!  &lt;/p&gt;

&lt;p&gt;pip install shap-hypetune  &lt;/p&gt;

&lt;p&gt;Don't forget to star the repository! &lt;a href="https://t.co/UYV4GA432t"&gt;https://t.co/UYV4GA432t&lt;/a&gt; &lt;a href="https://t.co/REybOPVX6f"&gt;pic.twitter.com/REybOPVX6f&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;— Philip Vollet (&lt;a class="mentioned-user" href="https://dev.to/philipvollet"&gt;@philipvollet&lt;/a&gt;) &lt;a href="https://twitter.com/philipvollet/status/1488076877436751874?ref_src=twsrc%5Etfw"&gt;January 31, 2022&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;In this article I shared 5 top ML creators on Twitter with content about:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ConvNets&lt;/li&gt;
&lt;li&gt;Transformers&lt;/li&gt;
&lt;li&gt;Softmax&lt;/li&gt;
&lt;li&gt;Text Embedding&lt;/li&gt;
&lt;li&gt;Great Tools (Hyperparameter tuning with shap-based feature selection!)&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>machinelearning</category>
    </item>
    <item>
      <title>4 Tools Kaggle Grandmasters use to win $100,000s in competitions</title>
      <dc:creator>Jesper Dramsch</dc:creator>
      <pubDate>Wed, 02 Feb 2022 16:33:00 +0000</pubDate>
      <link>https://forem.com/jesperdramsch/4-tools-kaggle-grandmasters-use-to-win-100000s-in-competitions-1628</link>
      <guid>https://forem.com/jesperdramsch/4-tools-kaggle-grandmasters-use-to-win-100000s-in-competitions-1628</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5kxNuMqf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-bullseye.thumbnail.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5kxNuMqf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-bullseye.thumbnail.jpg" alt="" width="450" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;4 Tools Kaggle Grandmasters use to win $100,000s in competitions&lt;/p&gt;

&lt;p&gt;Expertise is figuring out what works and what doesn't.&lt;/p&gt;

&lt;p&gt;Why not let the experts tell you, rather than experimenting from the ground up for a decade?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Pseudolabelling&lt;/li&gt;
&lt;li&gt;Negative Mining&lt;/li&gt;
&lt;li&gt;Augmentation Tricks&lt;/li&gt;
&lt;li&gt;Test-time augmentation&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  🎨 Pseudolabelling
&lt;/h2&gt;

&lt;p&gt;Some competitions don't have a lot of data.&lt;/p&gt;

&lt;p&gt;Pseudo labels are created by building a good model on the training data.&lt;/p&gt;

&lt;p&gt;Then predict on the public test data.&lt;/p&gt;

&lt;p&gt;Finally, use labels with high confidence as additional training data!&lt;/p&gt;
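The three steps can be sketched with a scikit-learn-style classifier (fit / predict_proba is the assumed interface, and the threshold is a typical choice, not a rule):

```python
def pseudo_label(model, X_train, y_train, X_test, threshold=0.9):
    # 1. Build a good model on the labelled training data.
    model.fit(X_train, y_train)
    # 2. Predict on the (public) test data.
    extra_X, extra_y = [], []
    for x, probs in zip(X_test, model.predict_proba(X_test)):
        best = max(range(len(probs)), key=probs.__getitem__)
        # 3. Keep only high-confidence predictions as pseudo labels.
        if probs[best] >= threshold:
            extra_X.append(x)
            extra_y.append(best)
    # Retrain on the enlarged dataset.
    model.fit(list(X_train) + extra_X, list(y_train) + extra_y)
    return model
```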

&lt;h2&gt;
  
  
  📉 Hard Negative Mining
&lt;/h2&gt;

&lt;p&gt;This works best on classifiers with a binary outcome.&lt;/p&gt;

&lt;p&gt;The core idea:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Take misclassified samples from your training data.&lt;/li&gt;
&lt;li&gt;Retrain the model on this data specifically.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sometimes this is specifically applied to retraining on false positives as negatives.&lt;/p&gt;
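As a sketch, again assuming a scikit-learn-style fit / predict interface (a real setup would usually retrain on a reweighted mix rather than the hard samples alone):

```python
def hard_negative_mining(model, X, y):
    # Find the training samples the current model misclassifies...
    model.fit(X, y)
    preds = model.predict(X)
    hard = [(x, t) for x, t, p in zip(X, y, preds) if p != t]
    # ...and retrain on those hard samples specifically.
    if hard:
        model.fit([x for x, _ in hard], [t for _, t in hard])
    return model
```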

&lt;h2&gt;
  
  
  🏁 Finish training unaugmented
&lt;/h2&gt;

&lt;p&gt;Data augmentation is a way to artificially create more data by slightly altering the existing data.&lt;/p&gt;

&lt;p&gt;This trains the ML model to recognize more variance in the data.&lt;/p&gt;

&lt;p&gt;Finishing the last training epochs unaugmented usually increases accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔃 Test-Time Augmentation (TTA)
&lt;/h2&gt;

&lt;p&gt;Augmentation during training? Classic.&lt;/p&gt;

&lt;p&gt;How about augmenting your data during testing though?&lt;/p&gt;

&lt;p&gt;You can create an ensemble of samples through augmentation.&lt;/p&gt;

&lt;p&gt;Predict on the ensemble and then use the average prediction from your model!&lt;/p&gt;
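A minimal sketch of TTA, where each augmentation is any function that produces a slightly altered copy of the sample (the averaging assumes numeric predictions):

```python
def tta_predict(model, x, augmentations):
    # Predict on several augmented copies of the same sample
    # and average the results into one ensemble prediction.
    preds = [model.predict(aug(x)) for aug in augmentations]
    return sum(preds) / len(preds)
```

Averaging smooths out prediction noise the same way a model ensemble would, but with a single model.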

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Kaggle can teach you some sweet tricks for your machine learning endeavours.&lt;/p&gt;

&lt;p&gt;This article was about these four:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create extra training data&lt;/li&gt;
&lt;li&gt;Train on bad samples&lt;/li&gt;
&lt;li&gt;Top off training with original data&lt;/li&gt;
&lt;li&gt;Test on an ensemble of your data&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>competition</category>
      <category>kaggle</category>
      <category>tips</category>
      <category>tricks</category>
    </item>
    <item>
      <title>Pandas borrows a core concept from SQL and these 3 emoji tell you exactly how to use it</title>
      <dc:creator>Jesper Dramsch</dc:creator>
      <pubDate>Mon, 31 Jan 2022 11:33:00 +0000</pubDate>
      <link>https://forem.com/jesperdramsch/pandas-borrows-a-core-concept-from-sql-and-these-3-emoji-tell-you-exactly-how-to-use-it-1p7k</link>
      <guid>https://forem.com/jesperdramsch/pandas-borrows-a-core-concept-from-sql-and-these-3-emoji-tell-you-exactly-how-to-use-it-1p7k</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ya9zFoGn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-pandas-joins.thumbnail.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ya9zFoGn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-pandas-joins.thumbnail.png" alt="" width="450" height="187"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Pandas borrows a core concept from SQL:&lt;/p&gt;

&lt;p&gt;the Join&lt;/p&gt;

&lt;p&gt;But there are so many different ways of joining two DataFrames.&lt;/p&gt;

&lt;p&gt;Let's make this easy and go through Left, Right, Inner and Outer joins.&lt;/p&gt;

&lt;p&gt;Knowing these is great in interviews, in addition to your usual DataFrame shenanigans.&lt;/p&gt;

&lt;p&gt;Joins always happen on two DataFrames.&lt;/p&gt;

&lt;p&gt;This explanation will use: 🟩🟪⬛&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🟩: The left DataFrame&lt;/li&gt;
&lt;li&gt;🟪: The right DataFrame&lt;/li&gt;
&lt;li&gt;⬛: The result&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Don't get the word "join" wrong though, you can actually end up with a smaller DataFrame ⬛ than either of 🟩🟪&lt;/p&gt;

&lt;h2&gt;
  
  
  ⬅️ The Left Join is selfish
&lt;/h2&gt;

&lt;p&gt;This one takes the complete left DataFrame 🟩 and only checks for overlaps from the right 🟪&lt;/p&gt;

&lt;p&gt;No 🟪 from outside of the bounds of 🟩 will make it into ⬛&lt;/p&gt;

&lt;h2&gt;
  
  
  ➡️ The Right Join is almost the same as Left
&lt;/h2&gt;

&lt;p&gt;It takes everything in 🟪 and only the overlapping parts of 🟩&lt;/p&gt;

&lt;h2&gt;
  
  
  ⤵️ The Inner Join
&lt;/h2&gt;

&lt;p&gt;This one is tricky.&lt;/p&gt;

&lt;p&gt;Almost always ⬛ will be smaller than 🟩&amp;amp;🟪.&lt;/p&gt;

&lt;p&gt;For the Inner join, you only look at the parts of 🟩 and 🟪 that overlap.&lt;/p&gt;

&lt;p&gt;Nothing is included in ⬛ that exists outside of this common area.&lt;/p&gt;

&lt;h2&gt;
  
  
  ↔️ The Outer Join
&lt;/h2&gt;

&lt;p&gt;It is possibly the simplest one.&lt;/p&gt;

&lt;p&gt;It is exactly what we would expect from a "join".&lt;/p&gt;

&lt;p&gt;Take all of 🟩 and all of 🟪 and combine them into ⬛.&lt;/p&gt;

&lt;p&gt;All of the data is in our result.&lt;/p&gt;
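In pandas, the four joins are the how argument of DataFrame.merge; with a shared key column the emoji picture translates directly:

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "l": [1, 2, 3]})   # 🟩
right = pd.DataFrame({"key": ["b", "c", "d"], "r": [4, 5, 6]})  # 🟪

inner = left.merge(right, on="key", how="inner")       # only b, c
outer = left.merge(right, on="key", how="outer")       # a, b, c, d
left_join = left.merge(right, on="key", how="left")    # a, b, c
right_join = left.merge(right, on="key", how="right")  # b, c, d
```

Rows with no partner in the other DataFrame get NaN in the missing columns.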

&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Pandas borrows from SQL using Joins&lt;/li&gt;
&lt;li&gt;Left and Right join maintain the original and whatever overlaps in the other&lt;/li&gt;
&lt;li&gt;Inner is only the common ground&lt;/li&gt;
&lt;li&gt;Outer uses all the data in both DataFrames&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>pandas</category>
      <category>python</category>
      <category>sql</category>
    </item>
    <item>
      <title>3 Ways you can use Jupyter Notebook as an excellent Communication Tool</title>
      <dc:creator>Jesper Dramsch</dc:creator>
      <pubDate>Sun, 30 Jan 2022 11:33:00 +0000</pubDate>
      <link>https://forem.com/jesperdramsch/3-ways-you-can-use-jupyter-notebook-as-an-excellent-communication-tool-1b1p</link>
      <guid>https://forem.com/jesperdramsch/3-ways-you-can-use-jupyter-notebook-as-an-excellent-communication-tool-1b1p</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sgArtKfZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-python.thumbnail.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sgArtKfZ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-python.thumbnail.jpg" alt="" width="450" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;People love to hate Jupyter Notebooks!&lt;/p&gt;

&lt;p&gt;They're missing out on a great communication tool.&lt;/p&gt;

&lt;p&gt;Here's how data scientists and machine learners can build great Jupyter Notebooks that people love to read.&lt;/p&gt;

&lt;h2&gt;
  
  
  🧮 Fix your dang code
&lt;/h2&gt;

&lt;p&gt;You can make a mess in notebooks when experimenting.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Move all imports into the first cell&lt;/li&gt;
&lt;li&gt;Extract some functions and classes&lt;/li&gt;
&lt;li&gt;Add comments&lt;/li&gt;
&lt;li&gt;Fix your variable names&lt;/li&gt;
&lt;li&gt;Fix the cell order and outputs&lt;/li&gt;
&lt;li&gt;Use &lt;a href="https://github.com/drillan/jupyter-black"&gt;black&lt;/a&gt; to standardize your code format&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  ✒️ Jupyter has Markdown Cells!
&lt;/h2&gt;

&lt;p&gt;Use them.&lt;/p&gt;

&lt;p&gt;If you're not using Markdown in your Notebooks, you should not be using Notebooks.&lt;/p&gt;

&lt;p&gt;Write some prose to explain your code.&lt;/p&gt;

&lt;p&gt;Whether analysis or model, do yourself a favour and document it.&lt;/p&gt;

&lt;p&gt;In case you need a &lt;a href="https://www.markdownguide.org/getting-started/"&gt;markdown guide&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔂 Final checks
&lt;/h2&gt;

&lt;p&gt;We all do it.&lt;/p&gt;

&lt;p&gt;We forget to run a cell and wonder for an embarrassing amount of time why we have an error in the next one.&lt;/p&gt;

&lt;p&gt;Before you share your notebook:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Restart Kernel to forget all variables&lt;/li&gt;
&lt;li&gt;Run All Cells&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Not by hand. Automatically.&lt;/p&gt;

&lt;p&gt;It'll show any mix-ups.&lt;/p&gt;

&lt;h2&gt;
  
  
  🏆 Extra Credit
&lt;/h2&gt;

&lt;p&gt;🦾 Use nbconvert to run all cells from the command line&lt;/p&gt;

&lt;p&gt;🧠 Write a utils file with reusable functions&lt;/p&gt;

&lt;p&gt;⌨ Learn keyboard shortcuts for efficiency&lt;/p&gt;

&lt;p&gt;🎨 Make it interactive with ipywidgets&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Care about code quality&lt;/li&gt;
&lt;li&gt;Use markdown cells (More than just headings)&lt;/li&gt;
&lt;li&gt;Make sure the notebook runs&lt;/li&gt;
&lt;li&gt;Use extra tools to make your life easier&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>codequality</category>
      <category>coding</category>
      <category>jupyter</category>
      <category>notebooks</category>
    </item>
    <item>
      <title>How to deal with data changing and machine learning models doing worse after training</title>
      <dc:creator>Jesper Dramsch</dc:creator>
      <pubDate>Fri, 28 Jan 2022 11:33:00 +0000</pubDate>
      <link>https://forem.com/jesperdramsch/how-to-deal-with-data-changing-and-machine-learning-models-doing-worse-after-training-2ai4</link>
      <guid>https://forem.com/jesperdramsch/how-to-deal-with-data-changing-and-machine-learning-models-doing-worse-after-training-2ai4</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--00PNHVQl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-ml-production-shift.thumbnail.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--00PNHVQl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-ml-production-shift.thumbnail.jpg" alt="" width="450" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Machine Learning in Production 101&lt;/p&gt;

&lt;p&gt;I just finished some writing for the UN (ITU) about machine learning models in production.&lt;/p&gt;

&lt;p&gt;Wondering how to deal with data changing and models doing worse once they're deployed?&lt;/p&gt;

&lt;p&gt;This is for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  📌 We're talking about Drift
&lt;/h2&gt;

&lt;p&gt;Our training data is static, but the real world it meets is non-stationary.&lt;/p&gt;

&lt;p&gt;This drift can happen in three ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The input data changes&lt;/li&gt;
&lt;li&gt;The labels for the data change&lt;/li&gt;
&lt;li&gt;The inherent relationship changes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  ⚗️ Input data changes!
&lt;/h2&gt;

&lt;p&gt;One way to monitor this is by comparing the distribution of the new data against the training data.&lt;/p&gt;

&lt;p&gt;We can use these tests:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuous: Kolmogorov-Smirnov test&lt;/li&gt;
&lt;li&gt;Categorical: Chi-squared test&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;p&gt;Retrain the models regularly.&lt;/p&gt;
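Both tests live in scipy.stats; here is a sketch (the function name is mine, and scipy is assumed to be installed) comparing a feature's training distribution against fresh production data:

```python
from scipy.stats import chi2_contingency, ks_2samp

def drift_pvalues(train_values, new_values, train_counts, new_counts):
    # Continuous feature: two-sample Kolmogorov-Smirnov test.
    ks_p = ks_2samp(train_values, new_values).pvalue
    # Categorical feature: chi-squared test on the category counts.
    chi2_p = chi2_contingency([train_counts, new_counts])[1]
    return ks_p, chi2_p
```

A small p-value flags that the new data has likely drifted away from the training distribution and a retrain is due.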

&lt;h2&gt;
  
  
  🧩 Target / Label changes
&lt;/h2&gt;

&lt;p&gt;These can be natural changes similar to the input changes.&lt;/p&gt;

&lt;p&gt;In that case, you can use the same approach.&lt;/p&gt;

&lt;p&gt;But sometimes our categories change because we make discoveries or management decisions.&lt;/p&gt;

&lt;p&gt;Solution: reflect label updates in automated retraining pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  🔀 Concept shift
&lt;/h2&gt;

&lt;p&gt;This one sucks.&lt;/p&gt;

&lt;p&gt;ML models learn the relationship between input and label (ideally).&lt;/p&gt;

&lt;p&gt;When that relationship changes our entire historic data set is obsolete.&lt;/p&gt;

&lt;p&gt;That's essentially what happened in early 2020.&lt;/p&gt;

&lt;p&gt;Solution: collect new data; setting up automatic alerts is essential.&lt;/p&gt;

&lt;h2&gt;
  
  
  📖 More info?
&lt;/h2&gt;

&lt;p&gt;I wrote an ebook about machine learning validation.&lt;/p&gt;

&lt;p&gt;I give it away to my newsletter subscribers.&lt;/p&gt;

&lt;p&gt;I have just made the biggest update to the ebook, including production models and machine learning drift.&lt;/p&gt;

&lt;p&gt;Subscribe to receive weekly insights from &lt;a href="https://dramsch.net/newsletter"&gt;Late to the Party&lt;/a&gt; on machine learning, data science, and Python.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;We hope training data represents real-world data in machine learning, but it doesn't always.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set up MLOps automation&lt;/li&gt;
&lt;li&gt;Retrain for input data changes&lt;/li&gt;
&lt;li&gt;Care for label changes&lt;/li&gt;
&lt;li&gt;Hope it's not concept drift, where the relationship between input and label changes&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>drift</category>
      <category>machinelearning</category>
      <category>production</category>
    </item>
    <item>
      <title>My first first-author paper broke 50 citations. I think I know why and you can do it too</title>
      <dc:creator>Jesper Dramsch</dc:creator>
      <pubDate>Thu, 27 Jan 2022 11:33:00 +0000</pubDate>
      <link>https://forem.com/jesperdramsch/my-first-first-author-paper-broke-50-citations-i-think-i-know-why-and-you-can-do-it-too-4i5m</link>
      <guid>https://forem.com/jesperdramsch/my-first-first-author-paper-broke-50-citations-i-think-i-know-why-and-you-can-do-it-too-4i5m</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_NTR00vb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-scientist-on-a-computer.thumbnail.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_NTR00vb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-scientist-on-a-computer.thumbnail.jpg" alt="" width="450" height="188"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My first first-author paper broke 50 citations. I think I know why and you can do it too.&lt;/p&gt;

&lt;p&gt;I shared my code and made it ridiculously easy to use.&lt;/p&gt;

&lt;p&gt;I'm lazy myself.&lt;/p&gt;

&lt;p&gt;If I find a paper that gives me all its ingredients ready to go,&lt;/p&gt;

&lt;p&gt;I'm happy.&lt;/p&gt;

&lt;p&gt;But how do you do this?&lt;/p&gt;

&lt;h2&gt;
  
  
  4️⃣ There are four essential files in the repo:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;License file - tell people how they're allowed to use it&lt;/li&gt;
&lt;li&gt;conda_env.yml - tell people how to install&lt;/li&gt;
&lt;li&gt;Readme - Document everything!&lt;/li&gt;
&lt;li&gt;Jupyter NB - The actual code (with inline documentation)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  💼©️™️ How do you choose a license for your files?
&lt;/h2&gt;

&lt;p&gt;I like &lt;a href="https://choosealicense.com"&gt;choosealicense.com&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Personally, I often go with the MIT license. It's simple, it limits your liability, and it gives people the right to use and modify your code.&lt;/p&gt;

&lt;p&gt;Modification means new research. That means citations!&lt;/p&gt;

&lt;h2&gt;
  
  
  🤖 Install Stuff.
&lt;/h2&gt;

&lt;p&gt;It's great to provide a requirements.txt for pip&lt;/p&gt;

&lt;p&gt;and/or&lt;/p&gt;

&lt;p&gt;a conda_env.yml for conda installation.&lt;/p&gt;

&lt;p&gt;I made it terribly hard for people and didn't provide package versions. Rookie mistake.&lt;/p&gt;

&lt;p&gt;Step it up a notch and use poetry for exact reproduction of environments!&lt;/p&gt;
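&lt;p&gt;For illustration, a minimal &lt;code&gt;conda_env.yml&lt;/code&gt; with pinned versions could look like this (package names and versions are just examples):&lt;/p&gt;

```yaml
name: my-paper-repo
channels:
  - conda-forge
dependencies:
  - python=3.9.7
  - numpy=1.21.2
  - scikit-learn=1.0.1
  - jupyter=1.0.0
```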

&lt;h2&gt;
  
  
  📝 Write that Readme.md
&lt;/h2&gt;

&lt;p&gt;The readme file magically appears on the front page of your GitHub repo.&lt;/p&gt;

&lt;p&gt;Do the minimum, tell folks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is this about&lt;/li&gt;
&lt;li&gt;How to install&lt;/li&gt;
&lt;li&gt;How to use (files, commands, etc)&lt;/li&gt;
&lt;li&gt;How to CITE! Include a BibTeX file for extra credit.&lt;/li&gt;
&lt;/ul&gt;
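&lt;p&gt;The citation section can be as simple as a template BibTeX entry people copy straight into their reference manager (every field here is a placeholder):&lt;/p&gt;

```bibtex
@article{your_citation_key,
  author  = {Your Name and Your Coauthor},
  title   = {Your Paper Title},
  journal = {Journal Name},
  year    = {2022},
  doi     = {10.xxxx/xxxxx}
}
```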

&lt;h2&gt;
  
  
  🤷 People love to hate Jupyter Notebooks
&lt;/h2&gt;

&lt;p&gt;But here's the thing.&lt;/p&gt;

&lt;p&gt;Jupyter Notebooks are great if you clean them up before publication.&lt;/p&gt;

&lt;p&gt;Put stuff into neat functions, organize sections, and make some nice plots.&lt;/p&gt;

&lt;p&gt;Above all, use Markdown cells for inline documentation!&lt;/p&gt;

&lt;h2&gt;
  
  
  🏆 For extra credit:
&lt;/h2&gt;

&lt;p&gt;🐳 Use docker to completely freeze the computational environment&lt;/p&gt;

&lt;p&gt;🧮 Register your code on zenodo to give it a DOI&lt;/p&gt;

&lt;p&gt;📌 Pin the repo to your github profile&lt;/p&gt;
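&lt;p&gt;A minimal &lt;code&gt;Dockerfile&lt;/code&gt; for a notebook repo could look something like this (the base image and commands are illustrative, assuming a &lt;code&gt;requirements.txt&lt;/code&gt; that includes jupyter):&lt;/p&gt;

```dockerfile
# Freeze the computational environment around the notebook
FROM python:3.9-slim
WORKDIR /repo
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["jupyter", "notebook", "--ip=0.0.0.0", "--no-browser"]
```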

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Choose the right license for your code&lt;/li&gt;
&lt;li&gt;Write amazing documentation&lt;/li&gt;
&lt;li&gt;Make it easy to install, read, and use&lt;/li&gt;
&lt;li&gt;Consider extras like Docker, a DOI, and a pinned repo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/JesperDramsch/seismic-transfer-learning"&gt;Here's my repo.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Not a stellar example, but I give Jesper from 5 years ago a pass.&lt;/p&gt;

</description>
      <category>coding</category>
      <category>python</category>
      <category>reproducibility</category>
      <category>research</category>
    </item>
    <item>
      <title>Scientists make great Data Scientists</title>
      <dc:creator>Jesper Dramsch</dc:creator>
      <pubDate>Sun, 10 Oct 2021 10:33:00 +0000</pubDate>
      <link>https://forem.com/jesperdramsch/scientists-make-great-data-scientists-2cbk</link>
      <guid>https://forem.com/jesperdramsch/scientists-make-great-data-scientists-2cbk</guid>
      <description>&lt;p&gt;I love it when an applied scientist asks me how to get into data science or machine learning.&lt;/p&gt;

&lt;p&gt;Sure, any scientist, but geologists, biologists, oceanographers and environmental scientists especially. Applied disciplines that are commonly looked down upon by fields like physics or math. Those are the gems in the rough.&lt;/p&gt;

&lt;p&gt;I hear the outcry already:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;But math! But statistics! Those are mud people!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But here's the thing.&lt;/p&gt;

&lt;p&gt;Few people have an intuitive understanding of complexity like applied scientists do. Few people can make a decision and gain insight in an environment where new data isn't available and uncertainty is high.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Applied scientists know data.&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;It's a marvel to watch an applied scientist dig through a data set. Depending on the scientist, it may also be good to keep the kids out of earshot, but that's another story.&lt;/p&gt;

&lt;p&gt;Data is messy.&lt;/p&gt;

&lt;p&gt;Nothing pokes a hole in well-laid-out plans like getting the real-world data set. Andrew Ng, after teaching the world machine learning, now teaches the world how important data is. And for good reason: your neural network and your random forest are just random numbers in a computer until they are conditioned on data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applied scientists have seen it before.
&lt;/h2&gt;

&lt;p&gt;An applied scientist has seen it before, even if the imposter syndrome is there because this knowledge is hard to quantify. Analyzing the genes from bone marrow that has been dead for thousands of years or understanding how a region formed over millions of years after looking at a weathered cliff are awe-inspiring feats.&lt;/p&gt;

&lt;p&gt;Those are hard skills. Those are skills that teach you to handle data. Those are skills that are hard to teach and hard to learn.&lt;/p&gt;

&lt;p&gt;Teaching basic statistical concepts is easy after that.&lt;/p&gt;

&lt;p&gt;That’s why I love when applied scientists are interested in data science.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--R5QaBssK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/atomic-essays/day1-scientists-make-great-data-scientists.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--R5QaBssK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/atomic-essays/day1-scientists-make-great-data-scientists.png" alt="Image of Atomic Essay Day 1 - Scientists make great Data Scientists"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;This atomic essay was part of the October 2021 &lt;a href="https://enroll.ship30for30.com/october-enrollment/6jztd"&gt;#Ship30for30&lt;/a&gt; cohort. A 30-day daily writing challenge by Dickie Bush and Nicolas Cole. &lt;a href="https://dramsch.net/r/ship30"&gt;Want to join the challenge?&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>datascience</category>
      <category>ship30for30</category>
    </item>
    <item>
      <title>I recreated Hey.com in Gmail</title>
      <dc:creator>Jesper Dramsch</dc:creator>
      <pubDate>Fri, 13 Aug 2021 10:33:00 +0000</pubDate>
      <link>https://forem.com/jesperdramsch/i-recreated-hey-com-in-gmail-2g06</link>
      <guid>https://forem.com/jesperdramsch/i-recreated-hey-com-in-gmail-2g06</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--jaxeIDDO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2blyquz24v6vejfsqcuf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--jaxeIDDO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/2blyquz24v6vejfsqcuf.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I had some problems with Gmail, particularly emails landing in spam that definitely weren't spam. I love the functionality of Gmail, albeit still mourning Inbox. When I tweeted about it, I got the suggestion to try a few different services, including &lt;a href="http://hey.com"&gt;Hey.com&lt;/a&gt; by Basecamp. I remembered people clamouring for invites to Hey.com back in the day, and I also vaguely remembered that something was up with Basecamp.&lt;/p&gt;

&lt;p&gt;I signed up for Hey, but after two important emails went straight to spam and I read up on what happened at Basecamp, I decided it was time to give Gmail another shot and make it work for me. The Hey email service promised to revolutionise email. Its opinionated approach is both a blessing and its downfall.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I liked about Hey.com
&lt;/h2&gt;

&lt;p&gt;The idea of auto-sorting emails after touching them once seemed intriguing. Hey sorts emails into three buckets: your inbox, a feed for newsletters, and a papertrail for receipts that you want searchable but that never need your attention. The revolutionary idea is that Hey does not use an algorithm or machine learning. Any email from a new sender does &lt;em&gt;not&lt;/em&gt; land in your inbox but on a sorting table. There you decide where this email goes in the future.&lt;/p&gt;

&lt;p&gt;Hey has a few other features that are neat, but that is the main feature I liked. Touch an email once and have all subsequent emails dealt with. Without learning algorithms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can you auto-sort email in Gmail?
&lt;/h2&gt;

&lt;p&gt;Gmail is strong on filters and labels. I love email filters.&lt;/p&gt;

&lt;p&gt;The problem here is that I will not go into the settings and update a filter every time I get a new email. Obviously, this is nothing like Hey, where it's one click.&lt;/p&gt;

&lt;p&gt;I went ahead and scoured the web for some good filters and found one for the Papertrail.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;subject:("Order Receipt"|"your order"|"order confirmaton"|"your payment"|"your etsy purchase"|"my account payment processed"|"your purchase"|"your gift purchase"|"thinkgeek order"|"Your threadless order"|"order from Woot"|"your amazon order"|"Return Confirmation"|"your amazon order"|"Your package"|"Order Has Been Updated"|"UPS My Choice"|"UPS Ship Notification"|"Netrition order"|"Your payment"|"Your contribution"|"order has shipped"|"Your download"|"Your Amazon.com order"|"Payment scheduled"|"bill is ready"|"order confirmation"|"thanks for becoming a backer"|"you have authorized a payment"|"changed your pledge"|"successfully funded"|"you sent a payment"|"amazon.com order"|"payment received"|"Purchase Confirmation"|"pledge receipt"|"Your TopatoCo Order"|"Humble Bundle order"|"your transaction"|"Package From TopatoCo"|"Your best offer has been submitted"|"Offer Retraction Notice"|"You've received a gift copy"|"Your Etsy Order"|"Thanks for contributing to"|"Shipping Confirmation"|"Purchase Receipt")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;That was a great start but not really what I liked about Hey, where you could make a decision based on sender. So let's try and do that!&lt;/p&gt;
&lt;h2&gt;
  
  
  Google Apps Scripts to the rescue!
&lt;/h2&gt;

&lt;p&gt;Google Scripts have a special place in the Google Cinematic Universe. It's easier to get access to some APIs than it may be with conventional tools. Additionally, you can have them run on a repeating basis.&lt;/p&gt;

&lt;p&gt;So I had the idea of auto-updating certain filters. This is where the meat starts.&lt;/p&gt;

&lt;p&gt;Google Apps scripts have access to the Gmail service, where you can easily retrieve filters using &lt;code&gt;Gmail.Users.Settings.Filters.list('me')&lt;/code&gt;. The &lt;code&gt;me&lt;/code&gt; keyword is a special access token to your personal Gmail, making it infinitely easier to work with (rather than setting up API keys).&lt;/p&gt;

&lt;p&gt;Technically, there's no update functionality for filters. I decided to use the &lt;code&gt;remove&lt;/code&gt; and &lt;code&gt;create&lt;/code&gt; functionality instead. So I just have to make sure I don't accidentally delete a filter without creating its replacement during the auto-update.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting up Gmail
&lt;/h2&gt;

&lt;p&gt;Gmail uses labels rather than folders, which is why I originally started using it.&lt;/p&gt;

&lt;p&gt;I set up the &lt;code&gt;Wanted&lt;/code&gt;, &lt;code&gt;Feed&lt;/code&gt;, and &lt;code&gt;Papertrail&lt;/code&gt; labels to sort emails into and configured the Multiple Inboxes setting around the main inbox. I'm still working on improving the system and have also added labels for social media, unwanted email, and a few others that fit into my system better.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Wanted&lt;/code&gt;, &lt;code&gt;Feed&lt;/code&gt;, and &lt;code&gt;Papertrail&lt;/code&gt; are the main labels though and the script I will show you can work with any label.&lt;/p&gt;

&lt;p&gt;I want this system to work with other labels that aren't just based on the email address of the sender. Therefore, I added another label &lt;code&gt;To Sort&lt;/code&gt; to explicitly set a flag for emails to be used by the auto-sorting script.&lt;/p&gt;

&lt;p&gt;Then, when I get a new email newsletter that is supposed to end up in &lt;code&gt;Feed&lt;/code&gt;, I assign the &lt;code&gt;Feed&lt;/code&gt; and &lt;code&gt;To Sort&lt;/code&gt; labels to that email. The script then picks up this email address and adds it to the filter.&lt;/p&gt;
&lt;h2&gt;
  
  
  Setting up Filters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WPXGFK6p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/luke-porter-iweAJreEerQ-unsplash.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WPXGFK6p--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/luke-porter-iweAJreEerQ-unsplash.jpg" alt="Filters in coffe."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A long time ago, I figured that you can label Gmail filters by just including a small "text label OR actual filter". I'd normally obscure that text label, so it's never accidentally true in an actual email. Then a filter would look like &lt;code&gt;{_Text_La_bel_} OR from:(test@example.com)&lt;/code&gt;. Then I never have to remember what a filter does and can simply look it up.&lt;/p&gt;

&lt;p&gt;I kept the filter above for orders. There are a lot of emails coming from the different places you order from online these days, and it would be redundant to add addresses already covered by other filters to the auto-update.&lt;/p&gt;

&lt;p&gt;This is one reason I decided to add the &lt;code&gt;To Sort&lt;/code&gt; label.&lt;/p&gt;

&lt;p&gt;The script is supposed to only pick up specific filters, hence I decided that the Google Apps Script below will strictly pick up filters that begin with &lt;code&gt;_Auto_&lt;/code&gt;. To make it easier to associate auto-sort filters with a label, I then add the label name to the text label.&lt;/p&gt;

&lt;p&gt;I wanted to make sure the filter never accidentally matches the plain word in an email, for example &lt;code&gt;_Auto_Feed&lt;/code&gt; (as if that would happen), so extra underscores are scattered into the label name, and the script strips all underscores to recover it, like so:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{_Auto_F_e_ed_} OR ...

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
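&lt;p&gt;The actual automation runs as a Google Apps Script, but the gist of the label-matching logic can be sketched in Python like this (the filter strings and addresses are made-up examples):&lt;/p&gt;

```python
import re

def extract_auto_label(filter_query):
    """Find the {_Auto_..._} text label in a Gmail filter query and
    recover the Gmail label name by stripping the underscores."""
    match = re.search(r"\{_Auto_([^}]*)\}", filter_query)
    if match is None:
        return None  # not an auto-updated filter, skip it
    return match.group(1).replace("_", "")

print(extract_auto_label("{_Auto_F_e_ed_} OR from:(news@example.com)"))  # → Feed
```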

&lt;h2&gt;
  
  
  The Script to Sort Them All
&lt;/h2&gt;

&lt;p&gt;The following script can be run. &lt;code&gt;Initialize()&lt;/code&gt; sets up the Gmail service. &lt;code&gt;Install()&lt;/code&gt; sets up the recurring trigger to run every 10 minutes. &lt;code&gt;updateFilter()&lt;/code&gt; is where the meat is.&lt;/p&gt;

&lt;p&gt;Let's walk through relevant parts in &lt;code&gt;updateFilter()&lt;/code&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get All Filters in Gmail&lt;/li&gt;
&lt;li&gt;Iterate through All Filters and if the filter contains &lt;code&gt;_Auto_&lt;/code&gt; process the following&lt;/li&gt;
&lt;li&gt;Extract Gmail label name from text label in filter called &lt;code&gt;autolabel&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Search Gmail for anything that has both &lt;code&gt;autolabel&lt;/code&gt; and &lt;code&gt;To Sort&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If we find no new emails, move on to the next label; otherwise, extract the sender's email address and add it to the filter if it isn't in there yet, keeping the filter small with unique addresses&lt;/li&gt;
&lt;li&gt;Remove &lt;code&gt;To Sort&lt;/code&gt; label from email conversation&lt;/li&gt;
&lt;li&gt;Gmail filters cannot be longer than 1500 characters. When the filter gets too long, split old addresses out into a backup filter, e.g. &lt;code&gt;_Backup_F_e_ed_&lt;/code&gt;, which is not updated afterwards. &lt;/li&gt;
&lt;li&gt;Update Filter with new email addresses&lt;/li&gt;
&lt;/ol&gt;
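&lt;p&gt;Steps 5 and 7 boil down to string handling. Here's a simplified Python sketch of that add-and-check logic (the real script is Apps Script, and the addresses are made up):&lt;/p&gt;

```python
MAX_FILTER_LENGTH = 1500  # Gmail rejects filter criteria longer than this

def add_address(filter_query, address):
    """Add a sender to a filter's from:(...) group.
    Returns the updated query and whether the address fit."""
    if address in filter_query:
        return filter_query, True  # already covered, nothing to do
    head, _, senders = filter_query.partition("from:(")
    updated = f"{head}from:({address} OR {senders}"
    if len(updated) > MAX_FILTER_LENGTH:
        # caller should split old addresses into a backup filter instead
        return filter_query, False
    return updated, True

query = "{_Auto_F_e_ed_} OR from:(old@example.com)"
new_query, fits = add_address(query, "new@example.com")
print(new_query)  # → {_Auto_F_e_ed_} OR from:(new@example.com OR old@example.com)
```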


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Google Apps Scripts are surprisingly useful in the pursuit of auto-sorting emails.&lt;/p&gt;

&lt;p&gt;I'm sure there are smarter ways to program the things above; however, for now it works well for me and has already saved me a lot of time processing email.&lt;/p&gt;

&lt;p&gt;Now I just have to figure out what to do with all that time I used to waste on email.&lt;/p&gt;

</description>
      <category>email</category>
      <category>gmail</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Polywork Is Made for Multi-passionate Folks like Me</title>
      <dc:creator>Jesper Dramsch</dc:creator>
      <pubDate>Sat, 24 Jul 2021 10:33:00 +0000</pubDate>
      <link>https://forem.com/jesperdramsch/polywork-is-made-for-multi-passionate-folks-like-me-5aa8</link>
      <guid>https://forem.com/jesperdramsch/polywork-is-made-for-multi-passionate-folks-like-me-5aa8</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NsnapiNR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-polywork.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NsnapiNR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-polywork.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's hardly a secret that I'm interested in way too many things. My PhD changed because this machine learning thing just seemed too interesting not to pursue.&lt;/p&gt;

&lt;p&gt;Was that enough to keep my full attention?&lt;/p&gt;

&lt;p&gt;Of course not. I also:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learnt Python&lt;/li&gt;
&lt;li&gt;Taught Python&lt;/li&gt;
&lt;li&gt;Went to and won hackathons&lt;/li&gt;
&lt;li&gt;Played around on kaggle&lt;/li&gt;
&lt;li&gt;Created Youtube videos&lt;/li&gt;
&lt;li&gt;Wrote blogposts&lt;/li&gt;
&lt;li&gt;...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;... you get the gist. I like creating things. I like doing things.&lt;/p&gt;

&lt;p&gt;When cassidoo mentioned Polywork in her newsletter, I was intrigued. What would a social network look like that focuses on the many facets that make us who we are?&lt;/p&gt;

&lt;h2&gt;
  
  
  What already exists with Linkedin?
&lt;/h2&gt;

&lt;p&gt;Linkedin focuses on titles and credentials first.&lt;/p&gt;

&lt;p&gt;This is also what Corporate America focuses on.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Oh, you graduated from Harvard? A job at McKinsey? Impressive. Here's my card.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Additionally, you try to gain a following by "showing thought leadership".&lt;/p&gt;

&lt;p&gt;Recruiters telling incredible stories of empathy that none of us workers has ever experienced. Like being given a chance despite showing up late. Or a chance despite being drenched from the rain.&lt;/p&gt;

&lt;p&gt;Double spaced creative writing assignments, followed by "Thoughts?" and "Agree?".&lt;/p&gt;

&lt;p&gt;I'm not the kind of person that would get a job at McKinsey or Accenture (despite people working at these companies telling me otherwise). I would not get into Harvard or MIT. I don't receive awards either, I'm afraid.&lt;/p&gt;

&lt;p&gt;I try to share interesting things on Linkedin that are relevant for people getting into machine learning, but I've never been great in the popularity contest.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Reality I See?
&lt;/h2&gt;

&lt;p&gt;The most interesting people I see are the creators and doers.&lt;/p&gt;

&lt;p&gt;Marie Forleo coined the term "multi-passionate". And I feel seen by that description.&lt;/p&gt;

&lt;p&gt;Yes I could talk about machine learning for hours. It is what I chose to talk about online.&lt;/p&gt;

&lt;p&gt;But we could just as well talk about climbing, weight-lifting, pen &amp;amp; paper, scuba diving, or many many other topics and I'd be happy as a fish in water.&lt;/p&gt;

&lt;p&gt;I love building things with others. Writing these things. Creating videos online. Speaking at events. Teaching, Learning, Communicating.&lt;/p&gt;

&lt;p&gt;Sure, life would be easier with Harvard credentials, an ex-Google in the Twitter bio, or any of these credentials. But only as long as they come with the other fun things.&lt;/p&gt;

&lt;p&gt;My favourite human trait is curiosity.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do I like Polywork so much?
&lt;/h2&gt;

&lt;p&gt;Polywork focuses on the things you do.&lt;/p&gt;

&lt;p&gt;You have two small lines for your jobs and all the space in the world for an update of what you've been up to.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9eaxFpNQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/polywork-jobs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9eaxFpNQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/polywork-jobs.png" alt="Limited space for jobs on polywork"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But let's start at the beginning.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Badges. All the Badges.
&lt;/h3&gt;

&lt;p&gt;When you set up your profile you can choose badges you'd like to add to your about-me.&lt;/p&gt;

&lt;p&gt;You can even create some yourself that you deem missing.&lt;/p&gt;

&lt;p&gt;When I scrolled through, even the badges I didn't identify with made me think "oh, interesting!".&lt;/p&gt;

&lt;p&gt;Check out this thread for inspiration.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;It's time for a 𝔪𝔢𝔤𝔞 𝔱𝔥𝔯𝔢𝔞𝔡🤘🧑‍🎤🎸  &lt;/p&gt;

&lt;p&gt;What are your favorite Badges you've created or discovered on Polywork that nail your personality? Drop 'em below and we'll share some of our favorites!&lt;/p&gt;

&lt;p&gt;— Polywork (@PolyworkHQ) &lt;a href="https://twitter.com/PolyworkHQ/status/1418280518328537090?ref_src=twsrc%5Etfw"&gt;July 22, 2021&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now I can proudly display that I play chess when I'm procrastinating on my Python code:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RJv4gviO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/polywork-profile.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RJv4gviO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/polywork-profile.png" alt="Jesper Dramsch' Polywork profile"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These badges are the link to posts you can make.&lt;/p&gt;

&lt;h3&gt;
  
  
  Activities and Posts
&lt;/h3&gt;

&lt;p&gt;Once you have your little profile set up, you can post activities.&lt;/p&gt;

&lt;p&gt;You can be as granular as you want.&lt;/p&gt;

&lt;p&gt;Some people post big life updates.&lt;/p&gt;

&lt;p&gt;Other people are way more granular:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Streamed on Twitch? Update.&lt;/li&gt;
&lt;li&gt;Wrote some Code? Update.&lt;/li&gt;
&lt;li&gt;Hit rank 1200 on chess.com? Update.&lt;/li&gt;
&lt;li&gt;Published a new newsletter issue? Update!&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's my approach. I love adding activities to the feed.&lt;/p&gt;

&lt;p&gt;It feels rewarding to get credit for the small things. Something that isn't a fully-fledged project.&lt;/p&gt;

&lt;p&gt;I get to talk about the things I'm up to and they don't need to be Linkedin-polished.&lt;/p&gt;

&lt;p&gt;This is my favourite bit about Polywork.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uXb3YdYv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/polywork-timeline.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uXb3YdYv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/polywork-timeline.png" alt="Polywork Timeline of Jesper Dramsch"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can tag each post with activities.&lt;/p&gt;

&lt;p&gt;These activities in turn are connected to the badges you saw before.&lt;/p&gt;

&lt;p&gt;This makes your posts discoverable under badges, so you can see what other people that enjoy the things you enjoy are up to.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Polywork Doesn't Do.
&lt;/h3&gt;

&lt;p&gt;Deliberately, Polywork has decided against likes.&lt;/p&gt;

&lt;p&gt;That's it.&lt;/p&gt;

&lt;p&gt;Your activity isn't a popularity contest.&lt;/p&gt;

&lt;p&gt;It just is and is allowed to stand for itself.&lt;/p&gt;

&lt;p&gt;There's also no algorithmic curation of your feed. However, that used to be the case on every platform until they ran out of funding, so we'll see how that develops.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Polywork Does Do.
&lt;/h3&gt;

&lt;p&gt;Polywork has community in mind.&lt;/p&gt;

&lt;p&gt;You decide beforehand which things you are open to being contacted about.&lt;/p&gt;

&lt;p&gt;Guest lecture? Yes. Angel Investing? No.&lt;/p&gt;

&lt;p&gt;This is a feature that has moderation in mind.&lt;/p&gt;

&lt;p&gt;Considering the types of harassment folks are subjected to on other social media platforms, where anything but the most egregious threats is "in line with the community guidelines", this is a refreshing approach.&lt;/p&gt;

&lt;p&gt;Polywork has recently called for moderators to apply, as user-generated badges will always be a target of abuse as well.&lt;/p&gt;

&lt;p&gt;Building a welcoming community? Yes, please.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's to come
&lt;/h3&gt;

&lt;p&gt;Polywork has just announced the Spacestation.&lt;/p&gt;

&lt;p&gt;A feature where you can find people that are open to help on the specific things you are looking for.&lt;/p&gt;

&lt;p&gt;Angel investing? I may not be your go-to, but the people open to investing are!&lt;/p&gt;

&lt;p&gt;Need a podcast guest? The Spacestation can provide your next guest!&lt;/p&gt;

&lt;h2&gt;
  
  
  How do I get in?
&lt;/h2&gt;

&lt;p&gt;I understand that Polywork may not be for everyone.&lt;/p&gt;

&lt;p&gt;But it scratches an itch personally, to build a community of passionate individuals that like to draw outside of the box.&lt;/p&gt;

&lt;p&gt;Currently, Polywork is invite-only.&lt;/p&gt;

&lt;p&gt;I talked to them to get a special code for my readers, so of course it's&lt;/p&gt;

&lt;p&gt;&lt;code&gt;late-tothe-party&lt;/code&gt; 🎉&lt;/p&gt;

&lt;p&gt;Because we're all &lt;a href="https://dramsch.net/newsletter"&gt;late to the party&lt;/a&gt; sometimes.&lt;/p&gt;

&lt;p&gt;Find my profile here: &lt;a href="https://www.polywork.com/jesper"&gt;polywork.com/jesper&lt;/a&gt;&lt;/p&gt;

</description>
      <category>career</category>
      <category>job</category>
      <category>polywork</category>
      <category>socialmedia</category>
    </item>
    <item>
      <title>What I consider in a Laptop for Machine Learning</title>
      <dc:creator>Jesper Dramsch</dc:creator>
      <pubDate>Thu, 01 Jul 2021 10:33:00 +0000</pubDate>
      <link>https://forem.com/jesperdramsch/what-i-consider-in-a-laptop-for-machine-learning-l72</link>
      <guid>https://forem.com/jesperdramsch/what-i-consider-in-a-laptop-for-machine-learning-l72</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RGZ8RfOi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-training-on-a-laptop.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RGZ8RfOi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/banner-training-on-a-laptop.jpg" alt="A laptop"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many researchers hear about the huge hardware clusters that companies use for machine learning and AI. When you get started and even start doing more extensive projects, do you really need to move away from your work laptop?&lt;/p&gt;

&lt;p&gt;Most people use laptops as their main hardware these days. Desktop computers are clunky pieces of hardware used mainly by gamers and people doing numerical or image processing work, so they've become very niche. Yet many people own laptops, either bought themselves or through work. I've had several friends ask me whether they can buy the new Macbook M1 or need a $3000 laptop to break into data science.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/LHiM0WnRaD8"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  So the new Apple, eh?
&lt;/h2&gt;

&lt;p&gt;Historically, Apple computers were marketed toward "creatives". But in the most recent iteration of the Macbooks, Apple has stepped up its game. Everyone working close to hardware was nervous about compatibility when Apple moved away from Intel to its own ARM-based chips. Still, with the M1 laptops, Apple right away published a compatible version of Google's popular deep learning framework TensorFlow!&lt;/p&gt;

&lt;p&gt;The crazy thing is I've never even owned an Apple product, and I'm impressed with the execution. Before, my Mac-owning friends and colleagues had to buy expensive external GPUs to run simulations, and now this! But this isn't just about Macs. Let's look at machine learning itself, because not all ML is deep neural networks. Your 2016 Macbook may still be adequate to run many machine learning applications. And the best part? If it's not enough anymore, you can still rent an instance in the cloud for a few dollars during training!&lt;/p&gt;

&lt;h2&gt;
  
  
  Hardware Considerations for Machine Learning
&lt;/h2&gt;

&lt;p&gt;Let's have a quick chat about computer hardware. Computers are made up of some essential parts to do the computery things. For the kind of numerical work we are talking about, we're primarily interested in three things. The CPU is the brain of the computer. It enables complex tasks and computation and runs every program you have. The CPU closely interacts with special memory called RAM, which is quite similar to short-term memory in humans. This is opposed to regular hard drives, which are much slower but perfect for longer-term storage of information.&lt;/p&gt;

&lt;p&gt;Then there's the graphical processing unit, or GPU. GPUs developed over time to make games more realistic, but because 3D graphics are essentially just a bunch of linear algebra and matrix calculations, people slowly figured out that you can do scientific calculations with them as well. These GPUs are really good at this one thing and one thing only: throwing a bunch of matrices at each other. GPUs often have dedicated memory, usually called VRAM, that is even faster and even smaller than your CPU's RAM.&lt;/p&gt;

&lt;p&gt;Many laptops, especially work-issued office laptops, only have a CPU and so-called integrated graphics. They can play Netflix and Youtube but will usually buckle when you want to start up any 3D game. It gets really specific real quick, but do you even need a GPU?&lt;/p&gt;

&lt;p&gt;No, probably not. Most of machine learning is done on the normal CPU every laptop has. So you should be able to train most simple models in scikit-learn, even on your phone.&lt;/p&gt;

&lt;p&gt;The CPU decides how fast your model trains, and usually, that doesn't matter much: training times for classic models are relatively short anyway. The limiting factor on your laptop is RAM. Only a few classic machine learning methods can be trained incrementally, fed the dataset piece by piece, and none of them are the ones I usually throw at a problem first, like support-vector machines or random forests. There are some tricks, of course, but basically, you need to fit the whole dataset AND your model into the laptop's memory (RAM). On smaller problems, this is negligible too. Still, some of the problems I'm working on right now require millions of data points; at that scale, I'm consuming hundreds of GB of data. In my experience, for classic machine learning, you should prioritize RAM first and then CPU or GPU.&lt;/p&gt;
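
&lt;p&gt;For the exceptions that can learn piece by piece, scikit-learn exposes partial_fit. A sketch of the idea, with synthetic chunks standing in for data streamed from disk, using SGDClassifier (one of the few estimators that supports it):&lt;/p&gt;

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # all classes must be declared up front for partial_fit

for _ in range(20):  # pretend each chunk is read from disk, one at a time
    X_chunk = rng.normal(size=(500, 10))
    y_chunk = (X_chunk[:, 0] + X_chunk[:, 1] > 0).astype(int)
    model.partial_fit(X_chunk, y_chunk, classes=classes)

# Only one chunk ever sits in RAM, yet the model saw all 10,000 samples.
X_test = rng.normal(size=(200, 10))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
print(f"Held-out accuracy: {model.score(X_test, y_test):.2f}")
```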

&lt;p&gt;Especially in office computers, the amount of RAM can be as low as 4 GB (at which point even opening Excel files becomes a dread), and such a laptop quickly reaches capacity on your problem. But essentially, you should be able to validate your machine learning code on any old machine with enough memory and often even train models in a reasonable time. Random forests, for example, are popular partly because they're very fast to train.&lt;/p&gt;

&lt;p&gt;You can move classical machine learning to the GPU with Nvidia's RAPIDS cuML, which is basically scikit-learn on a GPU. That speeds up a lot of processing but confronts you with similar problems regarding the memory available on the GPU card. However, let's be honest here: if you want to train a deep learning model, you definitely want an Nvidia GPU in your laptop or desktop PC.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reasons against running Deep Learning on your Beefy PC
&lt;/h2&gt;

&lt;p&gt;Is that all?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I'm no longer a fan of running deep learning on local hardware. Running a reasonably big deep learning problem on your main PC will usually lock up the entire machine; opening even small programs becomes a drag, as powerful as the machine may be. Deep learning models train for hours or even weeks, during which your machine can be barely usable while drawing significant power.&lt;/p&gt;

&lt;p&gt;And where there's power, there's heat! A machine training deep learning models for extended periods can become quite hot. If you're like me during the pandemic, a lot of work happens cosied up on the couch, despite having an amazing desk with screens and everything. Training might make the machine a bit more toasty than you'd prefer. Gaming laptops with dedicated Nvidia graphics cards are OK for this kind of training, and the M1 MacBooks will be too, but gaming laptops are notorious for being particularly uncomfortable on the legs when playing heavy games or training models for extended periods. Something to consider.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Recommendation
&lt;/h2&gt;

&lt;p&gt;You can totally train machine learning models on laptops or even your phone. Deep learning is a bit more tricky but is also becoming much more commonplace with Apple's adoption of the M1 chip.&lt;/p&gt;

&lt;p&gt;For machine learning for science, I personally recommend getting a reasonable work computer with enough RAM and a decent CPU, but leaving deep learning to dedicated machines or the cloud.&lt;/p&gt;

&lt;p&gt;Try it out: train a model on your laptop!&lt;/p&gt;

</description>
      <category>gpu</category>
      <category>laptop</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>Transfer Learning is the Most Important Tool You Need to Learn</title>
      <dc:creator>Jesper Dramsch</dc:creator>
      <pubDate>Fri, 25 Jun 2021 10:33:00 +0000</pubDate>
      <link>https://forem.com/jesperdramsch/transfer-learning-is-the-most-important-tool-you-need-to-learn-1d5g</link>
      <guid>https://forem.com/jesperdramsch/transfer-learning-is-the-most-important-tool-you-need-to-learn-1d5g</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D-KzyK8e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/transfer-learning.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D-KzyK8e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/banner/transfer-learning.jpg" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You almost immediately learn about transfer learning when you take the &lt;a href="https://fast.ai"&gt;fast.ai&lt;/a&gt; course. This is no mistake! Transfer learning gets you started in your deep learning journey quickly, building your first neural network-based classifier in minimal time.&lt;/p&gt;

&lt;p&gt;But transfer learning is so much more, and it's surprisingly important for machine learning in science.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Transfer Learning?
&lt;/h2&gt;

&lt;p&gt;From a technical standpoint, transfer learning means taking a trained machine learning model and using it on a new, different, but related task. Let's say you take a classifier trained on ImageNet, like a ResNet. It's effortless these days to download that ResNet with weights pre-trained on ImageNet! Usually, that training takes days to weeks and costs a significant chunk of a research grant.&lt;/p&gt;

&lt;p&gt;However, our data isn't ImageNet. Let's say you want to start with a cat/dog classifier. Transfer learning enables you to utilize the weights learned on ImageNet, which has many more classes and much more complexity captured in the network. Oftentimes, machine learning practitioners will retrain the network with a low learning rate, called fine-tuning, to maintain that complexity while adjusting the weights ever so slightly for their cat/dog classifier.&lt;/p&gt;
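
&lt;p&gt;In a deep learning framework you would normally load published pre-trained weights and retrain them with a low learning rate; as a framework-free toy sketch of the same idea (all data synthetic, not the actual ImageNet workflow), here the hidden layer of a scikit-learn network trained on a data-rich task A is frozen and reused as a feature extractor, with only a small classification head retrained for task B:&lt;/p&gt;

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

# Task A: data-rich source task (a synthetic stand-in for ImageNet).
X_a, y_a = make_classification(n_samples=2000, n_features=20,
                               n_informative=10, random_state=0)
base = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                     random_state=0).fit(X_a, y_a)

def hidden_features(model, X):
    """Forward pass through the frozen first layer (ReLU activation)."""
    return np.maximum(0, X @ model.coefs_[0] + model.intercepts_[0])

# Task B: a related task with little data; only a small "head" is retrained
# on top of the representation learned for task A.
X_b, y_b = make_classification(n_samples=100, n_features=20,
                               n_informative=10, random_state=1)
head = LogisticRegression(max_iter=1000).fit(hidden_features(base, X_b), y_b)
print(f"Task B accuracy: {head.score(hidden_features(base, X_b), y_b):.2f}")
```

The analogue of fine-tuning would be to also nudge the frozen weights with a small learning rate instead of keeping them fixed.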

&lt;h2&gt;
  
  
  Why for Science?
&lt;/h2&gt;

&lt;p&gt;Science doesn't really care about cat/dog classifiers. We want to scan brains, find black holes and explore the core of the Earth!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---VWwo3D---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/blackhole.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---VWwo3D---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dramsch.net/images/blackhole.png" alt="BLACK HOLE IMAGE"&gt;&lt;/a&gt;&lt;em&gt;Credits: Event Horizon Telescope collaboration et al.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is where the big sibling of transfer learning comes in: &lt;strong&gt;Domain adaptation&lt;/strong&gt;. Domain adaptation posits that an agent trained on a task &lt;strong&gt;A&lt;/strong&gt; can do task &lt;strong&gt;B&lt;/strong&gt; with minimal retraining if &lt;strong&gt;A&lt;/strong&gt; and &lt;strong&gt;B&lt;/strong&gt; are sufficiently similar. Sound familiar? Yes, it's a generalised version of transfer learning. The underlying idea is that the agent learns a distribution over task &lt;strong&gt;A&lt;/strong&gt;, just like a neural network learns a representation of, e.g., ImageNet. Then the agent, or our network, can be adapted to a new task whose distribution is similar enough for the learned one to be "shiftable" towards our goal task.&lt;/p&gt;

&lt;p&gt;What does that mean?&lt;/p&gt;

&lt;p&gt;How can this help me in science?&lt;/p&gt;

&lt;h2&gt;
  
  
  Use a pre-trained Network directly
&lt;/h2&gt;

&lt;p&gt;ImageNet weights are readily available and capture a lot of complexity from a classification task. Of course, there are larger datasets, but the availability here is vital. In case you're more interested in other tasks, like object detection, PASCAL VOC or COCO are good choices.&lt;/p&gt;

&lt;p&gt;Then you can use these pre-trained networks and see whether you can fine-tune them on your data. In fact, pre-trained ImageNet ResNets were used in the Kaggle competition to segment salt in seismic data! And yes, my &lt;a href="https://dramsch.net/files/SEG_expanded_abstract_2018___Deep_learning_seismic_facies_on_state_of_the_art_CNN_architectures.pdf"&gt;conference paper&lt;/a&gt; from 2018 was a direct precursor to &lt;a href="https://www.kaggle.com/jesperdramsch/intro-to-seismic-salt-and-how-to-geophysics"&gt;this work&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Be aware that some networks generalize better than others. VGG16 is known for its general convolutional filters, whereas a ResNet can be tricky, depending on how much data you have available.&lt;/p&gt;

&lt;p&gt;Remember that you can always use a pre-trained network in a larger architecture! In the competition above, people included the pre-trained ResNet within a U-Net to perform semantic segmentation instead of classic single image classification!&lt;/p&gt;

&lt;h2&gt;
  
  
  Train Once, Fine-Tune Next
&lt;/h2&gt;

&lt;p&gt;In science, especially in emerging applications, scientists will publish a proof of concept paper. Making a network work on a specific use case can be a valuable gateway to a general method.&lt;/p&gt;

&lt;p&gt;Great scientists will publish their trained model alongside the code to reproduce their results. Frequently, you can take that trained network and fine-tune it on a new site with a similar problem set. An example from the &lt;a href="https://www.itu.int/en/ITU-T/focusgroups/ai4ndm/Pages/default.aspx"&gt;Focus Group meeting for AI for Natural Disaster Management&lt;/a&gt; I attended today is to train a network on an avalanche-prone site where a lot of data is available and then transfer the model to a new site, reusing the complexity captured at the first one.&lt;/p&gt;

&lt;p&gt;This is not the generalization that most machine learning scientists strive for. However, it can be a valuable use case for applied scientists in specific areas. It's also great as a proof of concept to obtain grants for more data collection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Train the Big One
&lt;/h2&gt;

&lt;p&gt;The holy grail. The network to rule them all. The GPT-3 to my little corner of science.&lt;/p&gt;

&lt;p&gt;When you're in the lucky position to have massive amounts of diverse datasets, you can train a model on all of it. This model ideally captures general abstractions from the data. Then, you fine-tune it to specific areas when needed.&lt;/p&gt;

&lt;p&gt;This is, for example, done in active seismic. Many of the data brokers and service companies create large-scale models that can then be fine-tuned to the specific segmentation task a customer has at hand.&lt;/p&gt;

&lt;p&gt;These models are particularly exciting to examine in ablation and explainability studies to gain deeper insight into specific domains and how a neural network interprets it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Don't take My Word for it!
&lt;/h2&gt;

&lt;p&gt;You could start with the fast.ai course and learn about transfer learning, or maybe read some research papers on the &lt;a href="https://paperswithcode.com/task/domain-adaptation/latest"&gt;fantastic paperswithcode&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Or maybe you'd like to watch the &lt;a href="http://cvpr2021.thecvf.com/"&gt;CVPR 2021 workshop&lt;/a&gt; that in part inspired me to write this short look into applications of transfer learning: &lt;a href="https://vita-group.github.io/cvpr_2021_data_efficient_tutorial.html"&gt;Data- and Label-efficient Learning in an Imperfect World&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>deeplearning</category>
      <category>machinelearning</category>
      <category>transferlearning</category>
      <category>science</category>
    </item>
  </channel>
</rss>
