<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: AdamBarnhard</title>
    <description>The latest articles on Forem by AdamBarnhard (@adambarnhard).</description>
    <link>https://forem.com/adambarnhard</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F348368%2Fb250ef11-f89a-44f6-818f-ef25e2732082.jpeg</url>
      <title>Forem: AdamBarnhard</title>
      <link>https://forem.com/adambarnhard</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/adambarnhard"/>
    <language>en</language>
    <item>
      <title>Building a COVID-19 Project Recommendation System</title>
      <dc:creator>AdamBarnhard</dc:creator>
      <pubDate>Mon, 13 Apr 2020 15:38:36 +0000</pubDate>
      <link>https://forem.com/adambarnhard/building-a-covid-19-project-recommendation-system-1id9</link>
      <guid>https://forem.com/adambarnhard/building-a-covid-19-project-recommendation-system-1id9</guid>
      <description>&lt;h4&gt;
  
  
  How to create a GitHub open source repo recommendation system web app with MLflow, Sagemaker, and &lt;a href="https://booklet.ai"&gt;Booklet.ai&lt;/a&gt;.
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; We built a &lt;a href="https://app.booklet.ai/model/covid19-project-recommender"&gt;COVID-19 Project Recommendation System&lt;/a&gt; to help people discover open source projects in the community. Read on to see how we built it.&lt;/p&gt;

&lt;p&gt;I’ve been so inspired by how much the open-source community has contributed during the COVID-19 pandemic. GitHub recently &lt;a href="https://github.com/github/covid-19-repo-data/tree/master/data"&gt;published a dataset&lt;/a&gt; of over 36K (!!) open-sourced repos where folks have contributed their time and code towards the community.&lt;/p&gt;

&lt;p&gt;As I was browsing the list of projects available, I was overwhelmed by the passion that developers, data scientists, and other technical communities have poured into projects around the world. From an &lt;a href="https://github.com/feelgood/regc19"&gt;app to track symptoms&lt;/a&gt; to &lt;a href="https://github.com/klintkanopka/COVID-19"&gt;extensive analysis on existing datasets&lt;/a&gt;, the long list of projects is truly inspiring.&lt;/p&gt;

&lt;p&gt;I found myself (virtually) discussing with friends how we could contribute to these efforts. As we discussed building an analysis or tracking app, we realized it would be more impactful to contribute to COVID-19 open source efforts that are already underway. This led me to the idea of helping people find great projects relevant to their skill sets, as easily as possible.&lt;/p&gt;

&lt;p&gt;Given the sheer volume, finding a project to contribute to can prove challenging. There is such a wide range of projects, covering many different languages and topics. To make this easier, we built a recommendation system. Given a set of languages and keywords that you are interested in, you can find a few projects that may be relevant to your input. You can try out the &lt;a href="https://app.booklet.ai/model/covid19-project-recommender"&gt;COVID-19 Project Recommendation System&lt;/a&gt;, or read along to see how it was built! You can also access the &lt;a href="https://github.com/BookletAI/covid19-repo-recommender"&gt;full notebook for this project on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YY105nQf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/917/1%2APcKhSQQ2VRDl6ALbPfE5lw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YY105nQf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/917/1%2APcKhSQQ2VRDl6ALbPfE5lw.png" alt=""&gt;&lt;/a&gt;&lt;a href="https://app.booklet.ai/model/covid19-project-recommender"&gt;COVID-19 Project Recommendation System&lt;/a&gt; on &lt;a href="https://booklet.ai"&gt;Booklet.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You may find some similarities between this demo and an &lt;a href="https://dev.to/bookletai/a-true-end-to-end-ml-example-lead-scoring-3162"&gt;end-to-end lead scoring example&lt;/a&gt; that I posted. If you read that blog post as well, thank you! Feel free to skip the sections that you have already covered.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: I am one of the co-founders of&lt;/em&gt; &lt;a href="https://booklet.ai"&gt;&lt;em&gt;Booklet.ai&lt;/em&gt;&lt;/a&gt;&lt;em&gt;, which is a free tool used as part of this demo.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;This will be a technical tutorial that requires a bit of coding and data science understanding to get through. To get the most out of this, you should have at least a bit of exposure to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python (we will stay within Jupyter notebooks the whole time)&lt;/li&gt;
&lt;li&gt;Natural Language Processing (we will use a simple CountVectorizer)&lt;/li&gt;
&lt;li&gt;The command line (yes, it can be scary, but we just use a few simple commands)&lt;/li&gt;
&lt;li&gt;AWS (we will help you out on this one!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, you should have a few things installed to make sure you can move quickly through the tutorial:&lt;/p&gt;

&lt;p&gt;An AWS username with access through awscli (we will cover this below!)&lt;/p&gt;

&lt;p&gt;Python 3 of some kind with a few packages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pandas: &lt;code&gt;pip install pandas&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;MLflow: &lt;code&gt;pip install mlflow&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;scikit-learn: &lt;code&gt;pip install scikit-learn&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;NLTK: &lt;code&gt;pip install nltk&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Docker (not a Python package, but quick and easy to install &lt;a href="https://hub.docker.com/editions/community/docker-ce-desktop-mac/"&gt;here&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Before we begin…
&lt;/h3&gt;

&lt;p&gt;We’re going to touch on a variety of tools and ideas. Before we dive right in, it’s important to take a step back to understand what’s happening here. There are a few tools that we will be using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://jupyter.org/"&gt;Jupyter Notebook&lt;/a&gt;: A common way to code with Python for Data Scientists. Allows you to run python scripts in the form of a notebook and get results in-line.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://mlflow.org/"&gt;MLflow&lt;/a&gt;: An open source model management system.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/sagemaker/"&gt;Sagemaker&lt;/a&gt;: A full-stack machine learning platform from AWS.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://booklet.ai/"&gt;Booklet.ai&lt;/a&gt;: A model web app builder.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a diagram that outlines how these different tools are used:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2IAWy65r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AJGprZlxWM1YdmyN7z4dSKQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2IAWy65r--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AJGprZlxWM1YdmyN7z4dSKQ.png" alt=""&gt;&lt;/a&gt;Author’s Work&lt;/p&gt;

&lt;p&gt;We will utilize Python within a Jupyter notebook to clean the GitHub data and train a model. Next, we will log that model to MLflow to keep track of the model version. Then, we will send both a docker container and the model to AWS Sagemaker to deploy the model. Finally, we will use Booklet to present a demo for the model.&lt;/p&gt;

&lt;p&gt;Let’s dive in!&lt;/p&gt;

&lt;h3&gt;
  
  
  Python: Training the Model
&lt;/h3&gt;

&lt;h4&gt;
  
  
  About the Data
&lt;/h4&gt;

&lt;p&gt;We are utilizing the dataset from the &lt;a href="https://github.com/github/covid-19-repo-data"&gt;covid-19-repo-data&lt;/a&gt; repo on GitHub. It contains 36,000+ open source projects related to COVID-19, as well as metadata associated with each. To learn more about the dataset, check out the &lt;a href="https://github.com/github/covid-19-repo-data/blob/master/data/data_dictionary.md"&gt;data dictionary&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Importing and Cleaning the Data
&lt;/h4&gt;

&lt;p&gt;First, you’ll need to clone the GitHub repo to pull the data onto your machine:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/github/covid-19-repo-data.git
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;After pulling the most recent data file into your working directory, you’ll want to import the tsv (tab-separated values) file using pandas:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
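&lt;p&gt;A minimal, runnable sketch of the import step. An inline two-row sample stands in for the repo’s real TSV file here, and the column names are illustrative assumptions:&lt;/p&gt;

```python
import io

import pandas as pd

# The real file is the dated .tsv inside covid-19-repo-data/data/; this inline
# sample mimics two of its columns so the snippet runs anywhere.
sample = "repo_name\tprimary_language\nowner/tracker\tPython\nowner/dash\tJavaScript\n"
df = pd.read_csv(io.StringIO(sample), sep="\t")  # sep="\t" is the key bit for TSVs
```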



&lt;p&gt;There is a large variety of projects listed within the GitHub dataset, but we want to find projects that are most likely to benefit from having additional contributors. To do that, we are going to set up a few filters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It must have at least one merged PR&lt;/li&gt;
&lt;li&gt;It must have both a README and a description&lt;/li&gt;
&lt;li&gt;It must have a primary language name listed (repos that are mostly text, such as a list of articles, might not have a programming language listed)&lt;/li&gt;
&lt;li&gt;It must have at least 2 distinct contributors&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s how we can apply those filters with pandas:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
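&lt;p&gt;As a hedged sketch, the four filters can be applied as sequential boolean masks on a DataFrame (the column names here are assumptions, not the dataset’s exact schema):&lt;/p&gt;

```python
import pandas as pd

# Toy stand-in for the repo dataset; the real column names may differ.
df = pd.DataFrame({
    "merged_prs": [3, 0, 5],
    "description": ["symptom tracker", None, "case dashboard"],
    "readme_text": ["intro", "intro", None],
    "primary_language": ["Python", "Python", "R"],
    "num_contributors": [4, 1, 2],
})

# Apply each filter in sequence.
df = df[df["merged_prs"].ge(1)]           # at least one merged PR
df = df[df["description"].notna()]        # has a description
df = df[df["readme_text"].notna()]        # has a README
df = df[df["primary_language"].notna()]   # has a primary language listed
df = df[df["num_contributors"].ge(2)]     # at least 2 distinct contributors
```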


&lt;p&gt;We plan to use a text-based approach to recommending projects, utilizing a combination of the repo description, topics, and primary language as our core &lt;a href="https://en.wikipedia.org/wiki/Bag-of-words_model"&gt;bag-of-words&lt;/a&gt;. In order to simplify our text processing, we are going to remove punctuation and limit ourselves to English-language repo descriptions (more languages might be added later!). To do this, we will need to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, detect the language using the langdetect package and keep only English-detected descriptions.&lt;/li&gt;
&lt;li&gt;Next, check for descriptions that contain only Latin characters. Before we do this, we will need to turn emojis into strings using the emoji package (emoji characters aren’t Latin). This check excludes descriptions with non-Latin characters, as a backup to the langdetect method above.&lt;/li&gt;
&lt;li&gt;Finally, remove punctuation from the repo description, topics, and primary language, and then combine all of those into a space-separated string.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To do the tasks above, we will set up a few helper functions that will be applied to the text within the dataset:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
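&lt;p&gt;Here is a stdlib-only sketch of those helpers. The post relies on the langdetect and emoji packages; the Latin-1 check below is a simplified stand-in for that part of the pipeline:&lt;/p&gt;

```python
import string

def remove_punctuation(text):
    """Strip punctuation so the bag-of-words only sees plain tokens."""
    return text.translate(str.maketrans("", "", string.punctuation))

def is_latin(text):
    """True when every character fits in Latin-1: a rough Latin-only check."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

def make_bag_of_words(description, topics, language):
    """Combine the cleaned, lowercased fields into one space-separated string."""
    parts = [remove_punctuation(part).lower() for part in (description, topics, language)]
    return " ".join(parts)
```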


&lt;p&gt;Next, we will use these functions to work through the clean-up and filtering:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
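&lt;p&gt;Applying helpers like these with pandas might look as follows. This is a sketch: the simple Latin-1 check stands in for the langdetect plus emoji approach described above, and the column names are assumptions:&lt;/p&gt;

```python
import pandas as pd

def is_latin(text):
    """Rough Latin-only check standing in for langdetect plus the emoji step."""
    try:
        text.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

df = pd.DataFrame({
    "description": ["Symptom tracker", "Σύστημα ανίχνευσης"],
    "topics": ["health flask", "health"],
    "primary_language": ["Python", "Python"],
})

# Keep Latin-only descriptions, then build the combined bag-of-words column.
df = df[df["description"].apply(is_latin)]
df["bag_of_words"] = (df["description"] + " " + df["topics"] + " "
                      + df["primary_language"]).str.lower()
```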


&lt;p&gt;After running all of these filters, we now have about 1,400 projects to recommend. That’s much smaller than our original set, but hopefully still enough to help users discover interesting projects to contribute to.&lt;/p&gt;

&lt;h4&gt;
  
  
  Building and Running the Vectorizer
&lt;/h4&gt;

&lt;p&gt;As we discussed, we will recommend projects based on a bag-of-words consisting of the description, topics, and primary language. We need to take this large string of words and turn it into a form that can be used by a recommendation algorithm. To do this, we will create a vectorizer (also called a tokenizer) that turns our strings into sparse vectors indicating which words are present in each string.&lt;/p&gt;

&lt;p&gt;To keep things simple, we will use sklearn’s CountVectorizer. Before we run the vectorizer, there are a few important inputs to discuss:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Lemmatisation"&gt;Lemmatizing&lt;/a&gt; is the process of taking multiple forms of a single word and converting them into a single “lemma” of the word. For example, ‘case’ and ‘cases’ will both be converted to ‘case’. This is important to reduce the noise and sheer number of words considered when creating the vector. Here we create a LemmaTokenizer class using the WordNetLemmatizer() from the nltk package:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Stop_words"&gt;Stop words&lt;/a&gt; are a list of words that we will remove before creating the vector. These are words that aren’t directly adding value to our recommendation, such as “us”, “the”, etc. Sklearn has a nice pre-built list of stop words that we can use. Also, we’ll remove words that are very common among our use-case and won’t differentiate between projects, such as “covid19” or “pandemic”. Here we combine the pre-built list with our hand-curated list specific to this dataset:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
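&lt;p&gt;Combining the two lists can be as simple as the following; the exact hand-curated words are assumptions based on the examples in the text:&lt;/p&gt;

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Domain words that appear in nearly every repo here and carry no signal.
custom_stop_words = ["covid19", "covid", "coronavirus", "pandemic"]

stop_words = list(ENGLISH_STOP_WORDS) + custom_stop_words
```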


&lt;p&gt;Now we can build the vectorizer and run it on our list of projects. This creates a large, sparse matrix vectorizing each bag-of-words in our dataset:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
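&lt;p&gt;Building and fitting the vectorizer might look like this sketch; in the full pipeline you would also pass the lemma tokenizer and stop-word list discussed above:&lt;/p&gt;

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["symptom tracker python flask",
        "case dashboard javascript react",
        "dataset analysis python pandas"]

# binary=True records word presence rather than counts: an assumption here.
vectorizer = CountVectorizer(binary=True)
doc_term_matrix = vectorizer.fit_transform(docs)  # sparse docs-by-vocabulary matrix
```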


&lt;h4&gt;
  
  
  Building the Recommender
&lt;/h4&gt;

&lt;p&gt;To recommend projects, we will take in a set of languages and keywords, turn that into a bag-of-words, and find the most similar repositories using their bags of words. First, let’s check out the code, and then we can break it down:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
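&lt;p&gt;A self-contained sketch of that recommendation function, with toy data in place of the real repo table:&lt;/p&gt;

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

repos = pd.DataFrame({
    "repo_name": ["owner/tracker", "owner/dash", "owner/analysis"],
    "bag_of_words": ["symptom tracker python flask",
                     "case dashboard javascript react",
                     "dataset analysis python pandas"],
})

vectorizer = CountVectorizer(binary=True)
matrix = vectorizer.fit_transform(repos["bag_of_words"])

def recommend(languages, keywords, top_n=10):
    """Vectorize the input with the SAME fitted vectorizer, then rank repos
    by cosine similarity between the two bags of words."""
    query = " ".join(languages + keywords).lower()
    scores = cosine_similarity(vectorizer.transform([query]), matrix).ravel()
    top_idx = scores.argsort()[::-1][:top_n]
    return repos.iloc[top_idx].assign(similarity=scores[top_idx])
```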


&lt;ul&gt;
&lt;li&gt;First, we take the inputted text fields and turn them into a single bag-of-words.&lt;/li&gt;
&lt;li&gt;Next, we use the same vectorizer from before to transform that bag-of-words into a vector representing the text. It is important to use the same transformer as before, so that the vector represents the same vocabulary as the rest of the repositories.&lt;/li&gt;
&lt;li&gt;Then, we use cosine_similarity to map the vector from our inputted string to the entire list of repositories. This allows us to understand how “similar” each individual repo’s bag-of-words is to our inputted bag of words. &lt;a href="https://www.machinelearningplus.com/nlp/cosine-similarity/"&gt;Here is a great article&lt;/a&gt; if you want to read more about cosine similarity!&lt;/li&gt;
&lt;li&gt;Finally, we take the top 10 most similar repositories and return them as a DataFrame.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To turn this recommendation function into a form that can be deployed (we’ll cover this more in a minute), we are going to utilize MLflow’s pyfunc class. We need to create a class that has all of the necessary inputs for our prediction, and then runs the prediction and returns the result in the correct format:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
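&lt;p&gt;The wrapper’s shape is roughly as follows. In the real version the class subclasses mlflow.pyfunc.PythonModel so MLflow can package and serve it; that base class is omitted here so the sketch stands alone, and the input field names are assumptions:&lt;/p&gt;

```python
from sklearn.metrics.pairwise import cosine_similarity

class ProjectRecommender:
    """Pyfunc-style wrapper: holds the fitted vectorizer, the document-term
    matrix, and the repo table, and exposes a predict() method."""

    def __init__(self, vectorizer, matrix, repos):
        self.vectorizer = vectorizer
        self.matrix = matrix
        self.repos = repos

    def predict(self, model_input):
        # model_input: a one-row DataFrame with languages and keywords fields.
        query = " ".join(model_input.iloc[0].astype(str)).lower()
        scores = cosine_similarity(self.vectorizer.transform([query]),
                                   self.matrix).ravel()
        top_idx = scores.argsort()[::-1][:10]
        return self.repos.iloc[top_idx].to_dict()
```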


&lt;p&gt;We can now test out our new class. We can create an object based on the class, with all of our inputs, and make sure the output populates as expected:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;You should see a dictionary-formatted DataFrame — nice!&lt;/p&gt;

&lt;h3&gt;
  
  
  MLflow: Managing the Model
&lt;/h3&gt;

&lt;h4&gt;
  
  
  What is MLflow?
&lt;/h4&gt;

&lt;p&gt;Before we go setting this up, let’s have a quick chat about MLflow. Officially, MLflow is “An open source platform for the machine learning lifecycle.” Databricks developed this open source project to help machine learning builders more easily manage and deploy machine learning models. Let’s break that down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Managing models:&lt;/strong&gt; While building an ML model, you will likely go through multiple iterations and test a variety of model types. It’s important to keep track of metadata about those tests as well as the model objects themselves. What if you discover an awesome model on your 2nd of 100 tries and want to go back to use that? MLflow has you covered!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploying models:&lt;/strong&gt; In order to make a model accessible, you need to deploy the model. This means hosting your model as an API endpoint, so that it is easy to reference and score against your model in a standard way. There is a super long list of tools that deploy models for you. MLflow isn’t actually one of those tools. Instead, MLflow allows easy deployment of your managed model to a variety of different tools. It could be on your local machine, Microsoft Azure, or AWS Sagemaker. We will use Sagemaker in this tutorial.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setting up MLflow
&lt;/h4&gt;

&lt;p&gt;The MLflow tracking server is a UI and API that wraps around MLflow’s core model-management features. We will need to set this up before we can use MLflow to start managing and deploying models.&lt;/p&gt;

&lt;p&gt;Make sure you have the MLflow package installed (check out the Pre-reqs if not!). From there, run the following command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mlflow ui
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;After this, you should see the shiny new UI running at &lt;a href="http://localhost:5000/"&gt;http://localhost:5000/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you run into issues getting this set up, check out the MLflow tracking server docs &lt;a href="https://www.mlflow.org/docs/latest/tracking.html#mlflow-tracking-servers"&gt;here&lt;/a&gt;. Also, if you’d prefer not to run the tracking server on your own machine, Databricks offers a &lt;a href="https://www.mlflow.org/docs/latest/quickstart.html#logging-to-a-remote-tracking-server"&gt;free hosted version&lt;/a&gt; as well.&lt;/p&gt;
&lt;h4&gt;
  
  
  Logging the model to MLflow
&lt;/h4&gt;

&lt;p&gt;Since we already created the pyfunc class for our model, we can go ahead and push this model object to MLflow. To do that, we need to first point Python to our tracking server and then set up an experiment. An experiment is a collection of models inside the MLflow tracking server.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
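&lt;p&gt;In code, pointing at the tracking server and choosing an experiment is two calls. The experiment name below is an assumption, and the import is deferred inside the helper so the sketch loads even without mlflow installed:&lt;/p&gt;

```python
def init_mlflow(tracking_uri="http://localhost:5000",
                experiment_name="covid-repo-recommender"):
    """Point the MLflow client at the tracking server and pick an experiment
    (set_experiment creates the experiment if it does not exist yet)."""
    import mlflow  # deferred so the sketch imports without mlflow installed
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment(experiment_name)
```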



&lt;p&gt;Before we send the model object to MLflow, we need to set up the Anaconda environment that will be used when the model runs on Sagemaker. In this case, we need to utilize both the default Anaconda channel and conda-forge. The reason for using conda-forge is so that we can also download nltk_data alongside our nltk package. Normally, you can download the data within the script, but in this case, we need to make sure that data is stored alongside the package for use when we deploy the model. More details on nltk_data can be found &lt;a href="https://anaconda.org/conda-forge/nltk_data"&gt;here&lt;/a&gt;. Also, for more information about Anaconda, &lt;a href="https://towardsdatascience.com/get-your-computer-ready-for-machine-learning-how-what-and-why-you-should-use-anaconda-miniconda-d213444f36d6"&gt;here’s a detailed overview&lt;/a&gt;.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
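&lt;p&gt;The environment spec is a plain dictionary handed to MLflow when logging the model. The exact package pins below are assumptions; the key detail from the text is pulling nltk_data in via the conda-forge channel:&lt;/p&gt;

```python
# Conda environment used when the model is served on Sagemaker.
conda_env = {
    "name": "covid_repo_env",
    "channels": ["defaults", "conda-forge"],
    "dependencies": [
        "python=3.7",
        "pandas",
        "scikit-learn",
        "nltk",
        "nltk_data",      # from conda-forge: bundles the corpora with the env
        {"pip": ["mlflow"]},
    ],
}
```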


&lt;p&gt;Now, we can go ahead and log the model to MLflow. We use this command to send the class that we created, as well as all of the inputs to that class. This ensures everything needed to run the model is all packaged up together.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
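&lt;p&gt;Logging then comes down to one mlflow.pyfunc.log_model call. This sketch wraps it in a helper; the artifact path name is an assumption, and the import is deferred so the sketch loads without mlflow installed:&lt;/p&gt;

```python
def log_recommender(model, conda_env, artifact_path="covid-recommender"):
    """Log the pyfunc model so the class, its fitted inputs, and the conda
    environment are packaged together in one run."""
    import mlflow.pyfunc  # deferred so the sketch imports without mlflow
    mlflow.pyfunc.log_model(
        artifact_path=artifact_path,
        python_model=model,   # the pyfunc wrapper instance built earlier
        conda_env=conda_env,
    )
```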


&lt;h4&gt;
  
  
  Testing a Local Deployment
&lt;/h4&gt;

&lt;p&gt;Before we move on to deployment, we’ll want to test deploying the model locally. This allows us to catch any errors that might pop up before going through the process of actually creating the endpoint (which can take a while). Luckily, MLflow has a nice local emulator for Sagemaker: the &lt;code&gt;mlflow sagemaker run-local&lt;/code&gt; command. We set up a little helper function so you can copy and paste the run-local script directly into your terminal:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
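&lt;p&gt;A helper along those lines just assembles the CLI string. The mlruns artifact layout below assumes the default local tracking store, and the artifact path name is an assumption:&lt;/p&gt;

```python
def run_local_command(run_id, experiment_id="1", port=5001):
    """Build the command that serves the logged model locally inside the
    same docker image that Sagemaker will use."""
    model_uri = f"mlruns/{experiment_id}/{run_id}/artifacts/covid-recommender"
    return f"mlflow sagemaker run-local -m {model_uri} -p {port}"

print(run_local_command("abc123def456"))
```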


&lt;p&gt;Once you’ve run that in your terminal (it might take a minute), you can test the local endpoint to make sure it is operating as expected. First, we set up a function to easily call the endpoint, and then we pass in our DataFrame in a JSON orientation:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
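&lt;p&gt;A sketch of such a function using the stdlib (the post may use the requests package instead). The pandas-split content type follows MLflow’s scoring-server convention for DataFrame input:&lt;/p&gt;

```python
import json
import urllib.request

def query_endpoint(payload, port=5001):
    """POST a pandas-split-oriented JSON payload to the local scoring server."""
    req = urllib.request.Request(
        url=f"http://localhost:{port}/invocations",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; format=pandas-split"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Matches df.to_json(orient="split") for a one-row input DataFrame.
payload = {"columns": ["languages", "keywords"],
           "data": [["python", "symptom tracking"]]}
```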


&lt;p&gt;You should see the same result as you did when you tested the class earlier on! If you run into an error, head over to your terminal and you should see the stack trace so you can debug.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sagemaker: Deploying the Model
&lt;/h3&gt;

&lt;h4&gt;
  
  
  What is Sagemaker?
&lt;/h4&gt;

&lt;p&gt;Sagemaker is a suite of tools that Amazon Web Services (AWS) created to support Machine Learning development and deployment. There’s a ton of tools available within Sagemaker (too many to list here) and we will be using their model deployment tool specifically. There are some great Sagemaker examples in their GitHub repo &lt;a href="https://github.com/awslabs/amazon-sagemaker-examples"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setting up Sagemaker
&lt;/h4&gt;

&lt;p&gt;First things first, you need to get permissions worked out. AWS permissions are never simple, but we will try to keep this easy! You’ll need to set up two things: a user account for yourself and a role for Sagemaker.&lt;/p&gt;

&lt;p&gt;The first is a user account so that you can access AWS as you send the model to Sagemaker. To do this, you’ll need to head over to the Identity and Access Management (IAM) console and set up a user account with Administrator permissions. If your security team pushes back, “Sagemaker Full Access” should work too! At the end of the setup flow, you’ll be given an AWS Access Key ID and an AWS Secret Access Key. Make sure to save those! They are not accessible after that first time. Now, head to your terminal and type &lt;code&gt;aws configure&lt;/code&gt;. This will prompt you to enter the AWS keys that you just collected. Once you have that set up, you’ll have AWS access from both the terminal and from Python! &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/gs-account-user.html"&gt;Here&lt;/a&gt; are more details from AWS.&lt;/p&gt;

&lt;p&gt;The second is a role (which is essentially a user account for services within AWS) for Sagemaker. To set this up, head to the roles section of IAM. You’ll want to assign this role to Sagemaker and then pick the policy called “SagemakerFullAccess.” At the end of this process, you’ll get something called an ARN for this role! We’ll need this for deployment so keep this handy. More details from AWS &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/access-policy-aws-managed-policies.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, we need to push an MLflow docker container into AWS. Assuming you have the permissions set up correctly above and docker installed (see the prerequisites section for docker setup), you’ll want to run the following command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mlflow sagemaker build-and-push-container
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This will push a docker container into AWS, which will be used during deployment.&lt;/p&gt;
&lt;h4&gt;
  
  
  Deploying to Sagemaker
&lt;/h4&gt;

&lt;p&gt;Now that we have everything setup, it’s time to push our model to Sagemaker!&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
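&lt;p&gt;The push itself is a single deploy call through MLflow’s Sagemaker integration. This hedged sketch defers the import and leaves the region and names as placeholder assumptions:&lt;/p&gt;

```python
def deploy_model(app_name, model_uri, role_arn, region="us-east-1"):
    """Deploy the logged model as a Sagemaker endpoint via MLflow."""
    import mlflow.sagemaker as mfs  # deferred so the sketch imports anywhere
    mfs.deploy(
        app_name=app_name,            # becomes the Sagemaker endpoint name
        model_uri=model_uri,          # e.g. the mlruns artifact path logged earlier
        execution_role_arn=role_arn,  # the Sagemaker role ARN created above
        region_name=region,
    )
```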



&lt;p&gt;The deploy function usually takes 5 to 10 minutes to complete, and the status is checked periodically until completion. Once the deployment is complete, you’ll be able to find the model listed in the Sagemaker UI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Booklet.ai: Turn the Model into a Web App
&lt;/h3&gt;

&lt;p&gt;Woot woot, your model is deployed! Our next step is to make sure the recommendations can be accessed by anyone, easily. We want anyone to be able to select preferred languages and keywords and then see a few recommended repos. There are a few options on how to move forward from here:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You can share the code to invoke your model directly. This requires the user to spin up a Python environment and invoke the model in Sagemaker using properly configured credentials. &lt;a href="https://towardsdatascience.com/sharing-your-sagemaker-model-eaa6c5d9ecb5"&gt;Here is a tutorial&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;You can also create a custom web app that creates an interface to interact with your model, using something like Flask. &lt;a href="https://towardsdatascience.com/create-a-complete-machine-learning-web-application-using-react-and-flask-859340bddb33"&gt;Here is a tutorial&lt;/a&gt; for that method.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;At this point, we were anxious to get the model in the hands of users as fast as possible. We created&lt;/strong&gt; &lt;a href="https://booklet.ai"&gt;&lt;strong&gt;Booklet.ai&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;to make this last step of building a web app quick and easy.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What is &lt;a href="https://booklet.ai"&gt;Booklet.ai&lt;/a&gt;?
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://booklet.ai/"&gt;Booklet&lt;/a&gt; creates a web app for your ML model without any code changes or extra libraries to install. Here’s an overview of how Booklet works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Grant Booklet.ai &lt;em&gt;read-only&lt;/em&gt; access to a limited number of AWS Sagemaker actions.&lt;/li&gt;
&lt;li&gt;Choose the Sagemaker endpoints you’d like to integrate with Booklet in our UI.&lt;/li&gt;
&lt;li&gt;Booklet hosts a responsive web app that takes HTML form input, formats the data, invokes your Sagemaker endpoint, and displays the result within the web app. You can return the output as-is or use &lt;a href="https://mozilla.github.io/nunjucks/"&gt;Nunjucks&lt;/a&gt; to format your output in any way you’d like.&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Signup and Grant Access to Sagemaker
&lt;/h4&gt;

&lt;p&gt;Booklet is free to use for your first ML Web App. Head over to the &lt;a href="https://app.booklet.ai/signup"&gt;signup page&lt;/a&gt; to create an account.&lt;/p&gt;

&lt;p&gt;You’ll also need to grant &lt;em&gt;read-only&lt;/em&gt; access to a limited number of Sagemaker actions via an IAM role that is associated with our AWS account. Here are the steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a new role in the AWS IAM Console.&lt;/li&gt;
&lt;li&gt;Select “Another AWS account” for the Role Type.&lt;/li&gt;
&lt;li&gt;Enter “256039543343” in the Account ID field (this is the Booklet.ai AWS account id).&lt;/li&gt;
&lt;li&gt;Click the “Next: Permissions” button.&lt;/li&gt;
&lt;li&gt;Click the “Create Policy” button (opens a new window).&lt;/li&gt;
&lt;li&gt;Select the JSON tab. Copy and paste &lt;a href="https://gist.github.com/itsderek23/c65c22b524f00de150790b9128c8b8f6"&gt;this JSON&lt;/a&gt; into the text area.&lt;/li&gt;
&lt;li&gt;Click “Review policy”.&lt;/li&gt;
&lt;li&gt;Name the policy “BookletAWSIntegrationPolicy”.&lt;/li&gt;
&lt;li&gt;Click “Create policy” and close the window.&lt;/li&gt;
&lt;li&gt;Back in the “Create role” window, refresh the list of policies and select the policy you just created.&lt;/li&gt;
&lt;li&gt;Click “Next: Review”.&lt;/li&gt;
&lt;li&gt;Give the role a name such as “BookletAWSIntegrationRole”. Click “Create Role”.&lt;/li&gt;
&lt;li&gt;Copy the Role ARN. It looks something like “arn:aws:iam::123456789012:role/BookletIntegrationRole”.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;With the AWS Role created and the ARN on your clipboard, we’re almost there. In the Booklet.ai settings, paste in the AWS Role ARN and click the “Save” button:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QN4b47RG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/796/0%2A43tQ3NcWJYYhyoZM.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QN4b47RG--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/796/0%2A43tQ3NcWJYYhyoZM.png" alt=""&gt;&lt;/a&gt;Input IAM role in Booklet.ai&lt;/p&gt;

&lt;h4&gt;
  
  
  Create the Web App for your Model
&lt;/h4&gt;

&lt;p&gt;Click the “New Model” button within Booklet.ai and choose the Sagemaker endpoint you’d like to wrap in a responsive web app. On this screen you can also configure a friendly model name for display and a description. We need to set up two quick configurations to get the demo ready:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Feature Schema:&lt;/strong&gt; This configures the inputs that users will send to your ML model. In this case, we want to set up two different fields: one multi-select text field for programming languages, and one multi-select text field for keywords.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Post-Processing:&lt;/strong&gt; This configures how the output should be displayed to users. Booklet uses Nunjucks, a simple JavaScript templating language, for this section. In the editing panel, you can reference the structure of the model results (populated from the defaults in the feature schema) and also see your edited output update in real time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VrOv0sDA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/881/1%2AKJxPUeBjBiNsAht_90j2zA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VrOv0sDA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/881/1%2AKJxPUeBjBiNsAht_90j2zA.png" alt=""&gt;&lt;/a&gt;Editing Post-Processing in &lt;a href="https://booklet.ai"&gt;Booklet.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For this model, we want the results to look like a list of GitHub repos, so we’ve configured the Nunjucks template accordingly:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Once you’ve filled out both of these configurations, go ahead and click “Update Model.”&lt;/p&gt;

&lt;h4&gt;
  
  
  Try Out the Web App
&lt;/h4&gt;

&lt;p&gt;Your web app is ready for use!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vxiPJLsS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/897/1%2Ap8tNpfoM5cZ23xIoCttW5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vxiPJLsS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/897/1%2Ap8tNpfoM5cZ23xIoCttW5w.png" alt=""&gt;&lt;/a&gt;&lt;a href="https://app.booklet.ai/model/covid19-project-recommender"&gt;Completed Model Web App&lt;/a&gt; in &lt;a href="https://booklet.ai"&gt;Booklet.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Right now, your model is only accessible to people within your org in Booklet. You can also make your model easy to share by heading back into settings and setting it to public (alongside an easy-to-use URL) so that anyone can access it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Closing Thoughts
&lt;/h3&gt;

&lt;p&gt;In this tutorial, you’ve learned how to go from a raw dataset on GitHub all the way to a functioning web app. Thanks for following along! If you have any thoughts or questions, or run into issues as you follow along, please drop a comment below. As a reminder, all of the &lt;a href="https://github.com/BookletAI/covid19-repo-recommender"&gt;code for this project can be found on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note from the editors:&lt;/em&gt;&lt;/strong&gt; &lt;a href="http://towardsdatascience.com/"&gt;&lt;em&gt;Towards Data Science&lt;/em&gt;&lt;/a&gt; &lt;em&gt;is a Medium publication primarily based on the study of data science and machine learning. We are not health professionals or epidemiologists, and the opinions in this article should not be interpreted as professional advice. To learn more about the coronavirus pandemic, see the &lt;a href="https://www.who.int/emergencies/diseases/novel-coronavirus-2019/situation-reports"&gt;WHO situation reports&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DE05xtmN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AgrC4avhDy7Emw5JchB3M3w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DE05xtmN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AgrC4avhDy7Emw5JchB3M3w.png" alt=""&gt;&lt;/a&gt;&lt;a href="https://booklet.ai"&gt;Booklet.ai&lt;/a&gt;&lt;/p&gt;




</description>
      <category>aws</category>
      <category>opensource</category>
      <category>covid19</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>You may not need Airflow…. yet</title>
      <dc:creator>AdamBarnhard</dc:creator>
      <pubDate>Tue, 31 Mar 2020 17:09:25 +0000</pubDate>
      <link>https://forem.com/bookletai/you-may-not-need-airflow-yet-1gem</link>
      <guid>https://forem.com/bookletai/you-may-not-need-airflow-yet-1gem</guid>
      <description>&lt;p&gt;TL;DR: Airflow is robust and flexible, but complicated. If you are just starting to schedule data tasks, you may want to try more tailored solutions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Moving data into a warehouse: &lt;a href="https://www.stitchdata.com/"&gt;Stitch&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Transforming data within a warehouse: &lt;a href="https://www.getdbt.com/"&gt;DBT&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Scheduling Python scripts: &lt;a href="https://databricks.com/"&gt;Databricks&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Batch scoring ML models: &lt;a href="https://booklet.ai/"&gt;Booklet.ai&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;How could using 4 different services be easier than using just one?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://airflow.apache.org/"&gt;Apache Airflow&lt;/a&gt; is one of the most popular workflow management tools for data teams. It is used by hundreds of companies around the world to schedule jobs of any kind. It is a completely free, open source project, and offers amazing flexibility with its python-built infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Y_HlvN5a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AFJsMPN5kPMI7JuqhsaP7rA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Y_HlvN5a--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AFJsMPN5kPMI7JuqhsaP7rA.png" alt=""&gt;&lt;/a&gt;&lt;a href="https://airflow.apache.org/"&gt;Apache Airflow&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ve used (and sometimes set up) Airflow instances of all sizes: from Uber’s custom-built &lt;a href="https://towardsdatascience.com/managing-data-science-workflows-the-uber-way-3d265b4c1264"&gt;Airflow-based Piper&lt;/a&gt; to &lt;a href="https://dev.to/adambarnhard/secure-your-data-tool-with-a-password-4hff-temp-slug-9857818"&gt;small instances for side projects&lt;/a&gt;, and one theme is common to all of them: projects get complicated, fast! Airflow needs to be deployed in a stable, production-ready way, all tasks are &lt;a href="https://airflow.apache.org/docs/stable/concepts.html"&gt;custom-defined&lt;/a&gt; in Python, and there are many &lt;a href="https://medium.com/datareply/airflow-lesser-known-tips-tricks-and-best-practises-cf4d4a90f8f"&gt;watch-outs&lt;/a&gt; to keep in mind as you build tasks. For a less technical user, Airflow can be overwhelming just for scheduling some simple tasks.&lt;/p&gt;

&lt;p&gt;Although it may be tempting to use one tool for all of the different scheduling needs, that may not always be your best choice. You’ll end up building custom solutions every time a new use-case comes up. Instead, you should use the best tool for the job you are trying to accomplish. The time saved during the setup and maintenance for each use-case is well worth adding a few more tools to your data stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uIiEu5Nj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/651/0%2Abu_wJ-XsdUt6xgnl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uIiEu5Nj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/651/0%2Abu_wJ-XsdUt6xgnl.jpg" alt=""&gt;&lt;/a&gt;&lt;a href="https://imgflip.com/i/3upiyq"&gt;Imgflip&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this post, I’ll outline a few of the use cases for Airflow and alternatives for each.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: I am one of the founders of&lt;/em&gt; &lt;a href="https://booklet.ai"&gt;&lt;em&gt;Booklet.ai&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Extracting Raw Data from a Source to Data Warehouse
&lt;/h3&gt;

&lt;p&gt;Data teams need something to do their jobs… data! That data is often spread across many different internal and external systems. To work with all of it in one place, the team needs to extract the data from each source and load it into a single location. This is usually a data warehouse of some kind.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--DaUYzZpU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/342/1%2AqEo7Y4kOnlT43TL5vTHEmg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--DaUYzZpU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/342/1%2AqEo7Y4kOnlT43TL5vTHEmg.png" alt=""&gt;&lt;/a&gt;&lt;a href="https://www.stitchdata.com/"&gt;Stitch&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stitch&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For this step, there are many reliable tools in use around the globe. They extract data from a given set of systems on a regular cadence and send the results directly to a data warehouse, taking care of most error handling and keeping things running smoothly. Managing multiple complicated integrations yourself can prove to be a maintenance nightmare, so these tools save a lot of time. Luckily, there is a nightmare-saving option:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.stitchdata.com/"&gt;Stitch&lt;/a&gt; coins itself as “a cloud-first, open source platform for rapidly moving data.” You can quickly connect to databases and third party tools and send that data to multiple different data warehouses. The best part: the first 5 million rows are free! Stitch can also be extended with a few &lt;a href="https://www.stitchdata.com/platform/extensibility/"&gt;open source frameworks&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transforming Data within a Data Warehouse
&lt;/h3&gt;

&lt;p&gt;Once that data is loaded into a data warehouse, it’s usually a mess! Every source has a different structure, and each dataset is probably indexed differently, with its own set of identifiers. To make sense of this chaos, the team needs to transform and join all of this data into a clean form that is easier to use. Most of this logic runs directly within the data warehouse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mzeW3l00--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/686/1%2ABzuPc_5WqQMDRX0NvvdIkQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mzeW3l00--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/686/1%2ABzuPc_5WqQMDRX0NvvdIkQ.png" alt=""&gt;&lt;/a&gt;&lt;a href="https://www.getdbt.com/"&gt;DBT&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;DBT&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The process of combining all of these datasets into a form that a business can actually use can be tedious. The field has become so complex that the specific role of &lt;a href="https://blog.getdbt.com/what-is-an-analytics-engineer/"&gt;Analytics Engineer&lt;/a&gt; has emerged from it. These problems are common across the industry, and a tool has emerged to solve them specifically:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.getdbt.com/"&gt;DBT&lt;/a&gt; considers itself “your entire analytics engineering workflow” and I agree. With only knowing SQL, you can quickly build multiple, complex layers of data transformation jobs that will be fully managed. Version control, testing, documentation and so much else is all managed for you! The &lt;a href="https://www.getdbt.com/pricing/"&gt;cloud-hosted version&lt;/a&gt; is free to use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Transforming Data outside of the Data Warehouse
&lt;/h3&gt;

&lt;p&gt;Sometimes the team will also need to transform data outside of the data warehouse. What if the transformation logic can’t be expressed entirely in SQL? What if the team needs to train a machine learning model? These tasks might pull data from the data warehouse directly, but the actual work needs to happen in a different system, such as Python.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--lt2LqSIE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AlaYMsvzf7eBg_LDVMLw1sw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lt2LqSIE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AlaYMsvzf7eBg_LDVMLw1sw.png" alt=""&gt;&lt;/a&gt;&lt;a href="https://databricks.com/"&gt;Databricks&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Databricks&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Most custom Python-based scripts start as a Jupyter notebook. You import a few packages, import or extract data, run some functions, and finally push that data somewhere else. Sometimes more complicated, production-scale processes are needed, but that’s rare. If you just need a simple way to run and schedule a Python notebook, there’s a great option:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://databricks.com/"&gt;Databricks&lt;/a&gt; was created by the original founders of Spark. Its main claim-to-fame is spinning up spark clusters super easily, but it also has great notebook functionality. They offer easy to use Python notebooks, where you can collaborate within the notebook just like Google docs. Once you develop the script that works for you, you can schedule that notebook to run completely within the platform. It’s a great way to not worry about where the code is running and have an easy way to schedule those tasks. They have a &lt;a href="https://databricks.com/product/faq/community-edition"&gt;free community edition&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Batch Scoring Machine Learning Models
&lt;/h3&gt;

&lt;p&gt;If the team has built a machine learning model, the results of that model should be sent to a place where they can actually help the business. These tasks usually involve connecting to an existing machine learning model and then sending its results to another tool, such as a sales or marketing platform. Getting a system up and running that pushes the correct model results at the right time can be ridiculously tedious.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YvOzep3N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Aksz3KtLCS3frDpT1y7L2JQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YvOzep3N--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2Aksz3KtLCS3frDpT1y7L2JQ.png" alt=""&gt;&lt;/a&gt;&lt;a href="https://booklet.ai/"&gt;Booklet.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://booklet.ai"&gt;&lt;strong&gt;Booklet.ai&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Building a machine learning model is hard enough; it shouldn’t take another two months of custom coding to connect that model to a place where the business can find value in it. This work usually requires painful integrations with third-party systems, not to mention the production-level infrastructure work that is required! Thankfully, there is a solution that handles some of these tasks for you:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://booklet.ai/"&gt;Booklet.ai&lt;/a&gt; connects to existing ML Models and then allows you to quickly set up a few things: A demo to share the model with non technical users, an easy API endpoint connection, and a set of integrations to connect inputs and outputs. You can easily set up an input query from a data warehouse to score the model, and then send those results to a variety of tools that your business counterparts may use. &lt;strong&gt;You can check out a demo for a&lt;/strong&gt; &lt;a href="https://app.booklet.ai/model/lead-scoring"&gt;&lt;strong&gt;lead-scoring model that sends results to intercom&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;.&lt;/strong&gt; &lt;a href="https://booklet.ai"&gt;&lt;strong&gt;You can request access to the Booklet.ai beta&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;, where your first model will be free.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UV7dQQKc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2Ars6sCg3ZO3cNk7Io" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UV7dQQKc--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2Ars6sCg3ZO3cNk7Io" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@mbenna?utm_source=medium&amp;amp;utm_medium=referral"&gt;Mike Benna&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;




</description>
      <category>datascience</category>
      <category>python</category>
      <category>airflow</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>A True End-to-End ML Example: Lead Scoring</title>
      <dc:creator>AdamBarnhard</dc:creator>
      <pubDate>Mon, 09 Mar 2020 05:17:28 +0000</pubDate>
      <link>https://forem.com/bookletai/a-true-end-to-end-ml-example-lead-scoring-3162</link>
      <guid>https://forem.com/bookletai/a-true-end-to-end-ml-example-lead-scoring-3162</guid>
      <description>&lt;h4&gt;
  
  
  From machine learning idea to implemented solution with MLflow, AWS Sagemaker, and &lt;a href="https://booklet.ai/"&gt;Booklet.ai&lt;/a&gt;
&lt;/h4&gt;

&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;Selling something can be hard work. A business might have many potential leads, but most of those leads won’t turn into actual, paying customers in the end. A sales team has to sort through a long list of potential customers and figure out how to spend its time. That’s where lead scoring comes in: a system that analyzes attributes of each new lead in relation to the chances of that lead actually becoming a customer, and uses that analysis to score and rank all of the potential customers. With that ranking, the sales team can prioritize its time and only work the leads that are highly likely to become paying customers.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Cool, that sounds great! How do I do it?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Well, I’m glad you asked! In this post, we will walk through the full end-to-end implementation of a custom built lead-scoring model. This includes pulling the data, building the model, deploying that model, and finally pushing those results directly to where they matter most — the tools that a sales team uses.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--fj003Edk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A2QzvSSSHA6Ir_wydOBs1kg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--fj003Edk--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A2QzvSSSHA6Ir_wydOBs1kg.png" alt=""&gt;&lt;/a&gt;&lt;a href="https://app.booklet.ai/model/lead-scoring"&gt;Testing the model in Booklet.ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you want to test out this model without going through the full process, we have a &lt;a href="https://app.booklet.ai/model/lead-scoring"&gt;fully-functioning lead scoring model&lt;/a&gt; on &lt;a href="https://booklet.ai/"&gt;Booklet.ai&lt;/a&gt;.&lt;/strong&gt; We’ve posted all of the code, in the form of a &lt;a href="https://github.com/BookletAI/lead-scoring-demo"&gt;Jupyter Notebook on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;This will be a technical tutorial that requires a bit of coding and data science understanding to get through. To get the most out of this, you should have at least a bit of exposure to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python (we will stay within Jupyter notebooks the whole time)&lt;/li&gt;
&lt;li&gt;Machine Learning (we will use a Random Forest model)&lt;/li&gt;
&lt;li&gt;The command line (yes, it can be scary, but we just use a few simple commands)&lt;/li&gt;
&lt;li&gt;AWS (we can hold your hand through this one!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Also, you should have a few things installed to make sure you can move quickly through the tutorial:&lt;/p&gt;

&lt;p&gt;An AWS username with access through awscli (we will cover this below!)&lt;/p&gt;

&lt;p&gt;Python 3 of some kind with a few packages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pandas &lt;code&gt;pip install pandas&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;MLflow &lt;code&gt;pip install mlflow&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;SKlearn &lt;code&gt;pip install scikit-learn&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Docker (pretty quick and easy to install &lt;a href="https://hub.docker.com/editions/community/docker-ce-desktop-mac/"&gt;here&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Before we get started…
&lt;/h3&gt;

&lt;p&gt;We’re going to touch on a lot of tools and ideas in a short amount of time. Before we dive right in, it’s important to take a step back to understand what’s happening here. There are a few tools that we will be using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://jupyter.org/"&gt;Jupyter Notebook&lt;/a&gt;: A go-to for data scientists. Allows you to run python scripts in the form of a notebook and get results in-line.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://mlflow.org/"&gt;MLflow&lt;/a&gt;: An open source model management system.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://aws.amazon.com/sagemaker/"&gt;Sagemaker&lt;/a&gt;: A full-stack machine learning platform from AWS.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://booklet.ai/"&gt;Booklet.ai&lt;/a&gt;: A model testing and integration system.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.intercom.com/"&gt;Intercom&lt;/a&gt;: A customer messaging platform that is commonly used by customer service and sales teams to manage customer relationships.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a diagram that outlines how these different tools are used:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--NL7Fk9Bz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AjTEHKooNKLlx5BRv5INiqQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--NL7Fk9Bz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AjTEHKooNKLlx5BRv5INiqQ.png" alt=""&gt;&lt;/a&gt;Process Overview, author’s work. Utilizing images from &lt;a href="https://thenounproject.com/"&gt;Noun Project&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the highest level, we will use a Jupyter notebook to pull leads data and train a model. Next, we will send that model to MLflow to keep track of the model version. Then, we will send both a docker container and the model into AWS Sagemaker to deploy the model. Finally, we will use Booklet to put that model to use and start piping lead scores into Intercom.&lt;/p&gt;

&lt;p&gt;Now that we got that out of the way, let’s get started!&lt;/p&gt;

&lt;h3&gt;
  
  
  Training the Model
&lt;/h3&gt;

&lt;h3&gt;
  
  
  About the Data
&lt;/h3&gt;

&lt;p&gt;First, we need to access data about our leads. This data should have two types of information:&lt;/p&gt;

&lt;p&gt;(A) The response variable: Whether or not the lead converted into a paying customer&lt;/p&gt;

&lt;p&gt;(B) The features: Details about each lead that will help us predict the response variable&lt;/p&gt;

&lt;p&gt;For this exercise, we are going to use an example &lt;a href="https://www.kaggle.com/ashydv/lead-scoring-logistic-regression/output"&gt;leads dataset&lt;/a&gt; from Kaggle. This dataset provides a large list of simulated leads for a company called X Education, which sells online courses. We have a variety of features for each lead as well as whether or not that lead converted into a paying customer. Thanks &lt;a href="https://www.kaggle.com/ashydv"&gt;Ashish&lt;/a&gt; for providing this dataset and for the awesome analysis on Kaggle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Importing and Cleaning the Data
&lt;/h3&gt;

&lt;p&gt;To import this data, simply read the leads_cleaned dataset into Pandas. If you are reading this data from a database instead, replace this with &lt;code&gt;pd.read_sql_query&lt;/code&gt;.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
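&lt;p&gt;A minimal sketch of this import step (the filename is illustrative, and we first write a tiny stand-in file so the snippet runs on its own):&lt;/p&gt;

```python
import pandas as pd

# Tiny stand-in for the exported Kaggle dataset so this snippet is
# self-contained; in practice leads_cleaned.csv is the downloaded file
pd.DataFrame({"lead_source": ["Google", "Direct"],
              "total_visits": [5, 2],
              "converted": [1, 0]}).to_csv("leads_cleaned.csv", index=False)

leads = pd.read_csv("leads_cleaned.csv")

# Reading from a database would look like:
# leads = pd.read_sql_query("SELECT * FROM leads_cleaned", db_connection)
```

&lt;p&gt;Here &lt;code&gt;db_connection&lt;/code&gt; is a placeholder for whatever connection object your database driver provides.&lt;/p&gt;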


&lt;p&gt;Next, we want to pick out a few columns that matter to us. To do that, we will create lists of columns that fit into different categories: numeric, categorical, and the response variable. This will make the processing and cleaning steps easier.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
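&lt;p&gt;A sketch of what those lists might look like (the column names are illustrative; the Kaggle dataset has many more fields):&lt;/p&gt;

```python
# Group columns by how they will be processed downstream
numeric_cols = ["total_visits",
                "total_time_spent_on_website",
                "page_views_per_visit"]
categorical_cols = ["lead_origin", "lead_source", "last_activity"]
response_col = "converted"
```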


&lt;p&gt;From here, we can create our train/test datasets that will be used for training:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
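&lt;p&gt;A minimal sketch of the split, assuming scikit-learn’s &lt;code&gt;train_test_split&lt;/code&gt; (the stand-in dataframe and the 25% test size are illustrative):&lt;/p&gt;

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for the cleaned leads data
leads = pd.DataFrame({"total_visits": range(10),
                      "converted": [0, 1] * 5})

# Hold out 25% of the rows for testing; fix the seed for reproducibility
train, test = train_test_split(leads, test_size=0.25, random_state=42)
```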


&lt;p&gt;Now that we have a train/test split, let’s go ahead and create a scaler for our numeric variables. It is important to fit this only on the training dataset so that we don’t “leak” any information about the test set.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
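&lt;p&gt;A sketch of that fitting step, assuming scikit-learn’s &lt;code&gt;StandardScaler&lt;/code&gt; (the stand-in data is illustrative):&lt;/p&gt;

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in for the numeric part of the training split
train = pd.DataFrame({"total_visits": [1.0, 3.0, 5.0]})
numeric_cols = ["total_visits"]

# Fit on the training rows only, so nothing about the test set
# leaks into the centering and scaling parameters
scaler = StandardScaler()
scaler.fit(train[numeric_cols])
```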


&lt;p&gt;Now, we need to make some adjustments to the data to prepare for modeling. We’ve created a function to perform a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select the columns that we’ve defined as important&lt;/li&gt;
&lt;li&gt;Use the fitted scaler to center and scale the numeric columns&lt;/li&gt;
&lt;li&gt;Turn categorical variables into one-hot encoded variables&lt;/li&gt;
&lt;li&gt;Ensure that all columns from the training dataset are also in the outputted, processed dataset (This is important so that all levels of dummy variables are created, even if the dataset we import doesn’t have each individual level.)&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here’s how it looks when we put it all together and run both the training and test dataset through our preprocessing function:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
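&lt;p&gt;A self-contained sketch of a preprocessing function along those lines, followed by running both splits through it (the signature and column names are illustrative, not the exact code from the notebook):&lt;/p&gt;

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess(df, numeric_cols, categorical_cols, scaler, train_cols=None):
    # Keep only the columns we defined as important
    df = df[numeric_cols + categorical_cols].copy()
    # Center and scale numerics with the already-fitted scaler
    df[numeric_cols] = scaler.transform(df[numeric_cols])
    # One-hot encode the categorical variables
    df = pd.get_dummies(df, columns=categorical_cols)
    # Ensure every column seen during training exists, even if this
    # batch of data is missing some categorical levels
    if train_cols is not None:
        for col in train_cols:
            if col not in df.columns:
                df[col] = 0
        df = df[list(train_cols)]
    return df

# Tiny stand-ins for the train/test splits
train = pd.DataFrame({"total_visits": [1.0, 3.0],
                      "lead_source": ["Google", "Direct"]})
test = pd.DataFrame({"total_visits": [2.0],
                     "lead_source": ["Google"]})

scaler = StandardScaler().fit(train[["total_visits"]])
X_train = preprocess(train, ["total_visits"], ["lead_source"], scaler)
X_test = preprocess(test, ["total_visits"], ["lead_source"], scaler,
                    train_cols=X_train.columns)
```

&lt;p&gt;Note that the test set is processed with the training set’s column list, so any one-hot levels it is missing are filled with zeros.&lt;/p&gt;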


&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--c1W9Gvjp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/878/1%2AZy1FYy8VR4N0pSzYzr5hTw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--c1W9Gvjp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/878/1%2AZy1FYy8VR4N0pSzYzr5hTw.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Training the model
&lt;/h3&gt;

&lt;p&gt;This brings us to the exciting part! Let’s use our newly cleaned and split datasets to train a random forest model that predicts the chances of someone converting into a paying customer of X Education. First, let’s define a few standard hyperparameters and initialize the SKLearn model:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
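&lt;p&gt;A minimal sketch of that training step (the stand-in data and hyperparameter values are illustrative, not tuned):&lt;/p&gt;

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Stand-ins for the processed training data from the previous step
X_train = pd.DataFrame({"total_visits": [1, 8, 2, 9, 1, 7],
                        "time_on_site": [10, 90, 15, 80, 5, 95]})
y_train = [0, 1, 0, 1, 0, 1]

# A few standard hyperparameters, then fit
rf = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
rf.fit(X_train, y_train)
```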


&lt;p&gt;From here, we can quickly calculate a few accuracy metrics in our test set to see how the model did.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
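&lt;p&gt;A sketch of those checks with scikit-learn (the labels and scores below are made up for illustration):&lt;/p&gt;

```python
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical held-out labels, hard predictions, and predicted
# probabilities of converting
y_test = [0, 1, 1, 0, 1, 0]
y_pred = [0, 1, 1, 0, 0, 0]
y_prob = [0.2, 0.9, 0.8, 0.3, 0.25, 0.1]

accuracy = accuracy_score(y_test, y_pred)   # fraction of correct predictions
auc = roc_auc_score(y_test, y_prob)         # ranking quality of the scores
```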


&lt;p&gt;We have an accuracy of 82% and an AUC score of 80% in our held-out test set! Not too shabby. There is definitely room to improve, but for the sake of this tutorial, let’s move forward with this model.&lt;/p&gt;

&lt;h3&gt;
  
  
  MLflow: Managing the Model
&lt;/h3&gt;

&lt;h4&gt;
  
  
  What is MLflow?
&lt;/h4&gt;

&lt;p&gt;Before we go setting this up, let’s have a quick chat about MLflow. Officially, MLflow is “An open source platform for the machine learning lifecycle.” Databricks developed this open source project to help machine learning builders more easily manage and deploy machine learning models. Let’s break that down:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Managing models:&lt;/strong&gt; While building an ML model, you will likely go through multiple iterations and test a variety of model types. It’s important to keep track of metadata about those tests as well as the model objects themselves. What if you discover an awesome model on your 2nd of 100 tries and want to go back to use that? MLflow has you covered!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Deploying models:&lt;/strong&gt; In order to make a model accessible, you need to deploy the model. This means hosting your model as an API endpoint, so that it is easy to reference and score against your model in a standard way. There is a super long list of tools that deploy models for you. MLflow isn’t actually one of those tools. Instead, MLflow allows easy deployment of your managed model to a variety of different tools. It could be on your local machine, Microsoft Azure, or AWS Sagemaker. We will use Sagemaker in this tutorial.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setting up MLflow
&lt;/h4&gt;

&lt;p&gt;The MLflow tracking server is a nice UI and API that wraps around the important features. We will need to set this up before we can use MLflow to start managing and deploying models.&lt;/p&gt;

&lt;p&gt;Make sure you have the MLflow package installed (check out the Pre-reqs if not!). From there, run the following command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mlflow ui
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;After this, you should see the shiny new UI running at &lt;a href="http://localhost:5000/"&gt;http://localhost:5000/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you run into issues getting this set up, check out the MLflow tracking server docs &lt;a href="https://www.mlflow.org/docs/latest/tracking.html#mlflow-tracking-servers"&gt;here&lt;/a&gt;. Also, if you’d prefer not to run the tracking server on your own machine, Databricks offers a &lt;a href="https://www.mlflow.org/docs/latest/quickstart.html#logging-to-a-remote-tracking-server"&gt;free hosted version&lt;/a&gt; as well.&lt;/p&gt;

&lt;p&gt;Once you have the tracking server running, let’s point Python to it and set up an experiment. An experiment is a collection of models inside the MLflow tracking server.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;



&lt;h4&gt;
  
  
  Packaging our model with processing
&lt;/h4&gt;

&lt;p&gt;If you are working with a model that has no preprocessing associated with your data, logging the model is fairly simple. In our case, we actually need to package this preprocessing logic alongside the model itself. This will allow leads to be sent to our model as-is, and the model will handle the data prep. A quick visual to show this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---gHuxbqf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1001/0%2AI37XhBI3C9eKk2cx" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---gHuxbqf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1001/0%2AI37XhBI3C9eKk2cx" alt=""&gt;&lt;/a&gt;Data Processing Flow, author’s work. Utilizing images from &lt;a href="https://thenounproject.com/"&gt;Noun Project&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To do this, we will utilize MLflow’s pyfunc model type (more info &lt;a href="https://www.mlflow.org/docs/latest/models.html#python-function-python-function"&gt;here&lt;/a&gt;), which allows us to wrap up both a model and the preprocessing logic into one nice Python class. We will need to send two different kinds of inputs to this class: objects (e.g., the list of numeric columns or the random forest model itself) and logic (e.g., the preprocessing function we created). Both will be used inside the class.&lt;/p&gt;

&lt;p&gt;Now, let’s setup the class. First, check out the code and then we will talk through the different pieces:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The leadsModel class is based on MLflow’s &lt;code&gt;pyfunc&lt;/code&gt; class. This will allow us to push this model into MLflow and eventually Sagemaker.&lt;/p&gt;

&lt;p&gt;Next, we set up all of the objects that we need within &lt;strong&gt;__init__&lt;/strong&gt;. This covers both the objects and the logic function.&lt;/p&gt;

&lt;p&gt;Finally, we setup the predict function:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we take in the model_input (the dataframe that is sent to the model once it is deployed) and ensure that all of the column names are lowercase.&lt;/li&gt;
&lt;li&gt;Next, we send this dataframe into the preprocessing function that we had created and used earlier for model training. This time, we keep the response columns blank since we won’t need them for deployment!&lt;/li&gt;
&lt;li&gt;Then, we reference the original training dataset’s column names and fill in any missing columns with 0’s. This is important because some levels of the one-hot-encoded variables won’t appear in the datasets we send to the model after deployment.&lt;/li&gt;
&lt;li&gt;Finally, we send this nice, clean dataset to our Random Forest model for prediction.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now that we have all of our logic and objects ready to go within one class, we can log this model into MLflow!&lt;/p&gt;

&lt;h4&gt;
  
  
  Logging the model to MLflow
&lt;/h4&gt;

&lt;p&gt;Before we package everything up and log the model, we need to setup the Anaconda environment that will be used when the model runs on Sagemaker. For more information about Anaconda, &lt;a href="https://towardsdatascience.com/get-your-computer-ready-for-machine-learning-how-what-and-why-you-should-use-anaconda-miniconda-d213444f36d6"&gt;here’s a detailed overview&lt;/a&gt;.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
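&lt;p&gt;As an example, such an environment spec might look like the following (the environment name, package list, and versions are illustrative assumptions, not the exact ones used here):&lt;/p&gt;

```python
# Hypothetical Anaconda environment spec to pass to MLflow when logging
# the model; package names and versions are illustrative.
conda_env = {
    "name": "lead_scoring_env",
    "channels": ["defaults"],
    "dependencies": [
        "python=3.7",
        "scikit-learn",
        "pandas",
        {"pip": ["mlflow"]},
    ],
}
```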


&lt;p&gt;Now, we start a run within MLflow. Within that run, we log our hyperparameters, accuracy metrics, and finally the model itself!&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;If you head over to the MLflow UI that we checked out earlier, you’ll see the newly created model along with all of the metrics and parameters that we just defined. Woot woot!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oQ83mchH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AMeKhmLRDWXaW_fXL" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oQ83mchH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AMeKhmLRDWXaW_fXL" alt=""&gt;&lt;/a&gt;Logged model in MLflow&lt;/p&gt;

&lt;h3&gt;
  
  
  Sagemaker: Deploying the Model
&lt;/h3&gt;

&lt;h4&gt;
  
  
  What is Sagemaker?
&lt;/h4&gt;

&lt;p&gt;Sagemaker is a suite of tools that Amazon Web Services (AWS) created to support Machine Learning development and deployment. There’s a ton of tools available within Sagemaker (too many to list here) and we will be using their model deployment tool specifically. There are some great Sagemaker examples in their GitHub repo &lt;a href="https://github.com/awslabs/amazon-sagemaker-examples"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Setting up Sagemaker
&lt;/h4&gt;

&lt;p&gt;First things first, you need to get permissions worked out. AWS permissions are never simple, but we will try to keep this easy! You’ll need to set up two things: a user account for yourself and a role for Sagemaker.&lt;/p&gt;

&lt;p&gt;The first is a user account so that you can access AWS as you send the model to Sagemaker. To do this, you’ll need to head over to the Identity and Access Management (IAM) console and set up a user account with Administrator permissions. If your security team pushes back, “Sagemaker Full Access” should work too! At the end of the setup flow, you’ll be given an AWS Access Key ID and an AWS Secret Access Key. Make sure to save those! They are not accessible after that first time. Now, head to your terminal and type &lt;code&gt;aws configure&lt;/code&gt;. This will prompt you to enter the AWS keys that you just collected. Once you have that set up, you’ll have AWS access from both the terminal and from Python! &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/gs-account-user.html"&gt;Here&lt;/a&gt; are more details from AWS.&lt;/p&gt;

&lt;p&gt;The second is a role (which is essentially a user account for services within AWS) for Sagemaker. To set this up, head to the roles section of IAM. You’ll want to assign this role to Sagemaker and then pick the policy called “SagemakerFullAccess.” At the end of this process, you’ll get something called an ARN for this role! We’ll need this for deployment so keep this handy. More details from AWS &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/access-policy-aws-managed-policies.html"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, we need to push an MLflow Docker container into AWS. Assuming you have the permissions set up correctly above and Docker installed (see the prerequisites section for Docker setup), you’ll want to run the following command in your terminal:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;mlflow sagemaker build-and-push-container
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;This will push a docker container into AWS, which will be used during deployment.&lt;/p&gt;
&lt;h4&gt;
  
  
  Deploying to Sagemaker
&lt;/h4&gt;

&lt;p&gt;Now that we have everything set up, it’s time to push our model to Sagemaker!&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
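&lt;p&gt;As a sketch, the deployment call looks roughly like this (the app name, model URI path, and role ARN are placeholders you would fill in with your own values, and the call needs valid AWS credentials to actually run):&lt;/p&gt;

```python
def deploy_to_sagemaker(app_name, run_id, role_arn, region="us-east-1"):
    """Sketch of deploying the logged model with MLflow's Sagemaker tools.
    All argument values are placeholders."""
    import mlflow.sagemaker as mfs

    mfs.deploy(
        app_name=app_name,                    # becomes the Sagemaker endpoint name
        model_uri=f"runs:/{run_id}/leads_model",
        execution_role_arn=role_arn,          # the Sagemaker role's ARN from IAM
        region_name=region,
        mode="create",
    )
```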



&lt;p&gt;The deploy function usually takes 5 to 10 minutes to complete, and it polls the deployment status periodically until it finishes. Once the deployment is complete, you’ll be able to find a running model in the Sagemaker UI!&lt;/p&gt;
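&lt;p&gt;Once the endpoint is running, you can sanity-check it directly. Here’s a hypothetical example using &lt;code&gt;boto3&lt;/code&gt;; the endpoint name and region are placeholders, and we assume the MLflow-served endpoint accepts pandas-split JSON, as MLflow containers of this vintage do:&lt;/p&gt;

```python
import json

def score_leads(endpoint_name, leads_df, region="us-east-1"):
    """Hypothetical sanity check of the deployed endpoint. Requires AWS
    credentials; the endpoint name and region are placeholders."""
    import boto3

    client = boto3.client("sagemaker-runtime", region_name=region)
    response = client.invoke_endpoint(
        EndpointName=endpoint_name,
        Body=leads_df.to_json(orient="split"),
        ContentType="application/json; format=pandas-split",
    )
    return json.loads(response["Body"].read().decode("utf-8"))
```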

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FTnW1JD1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A_hJmrgrMJBgOUNIBYOk3RA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FTnW1JD1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2A_hJmrgrMJBgOUNIBYOk3RA.png" alt=""&gt;&lt;/a&gt;Deployed model in Sagemaker&lt;/p&gt;

&lt;h3&gt;
  
  
  Booklet: Integrating the Model
&lt;/h3&gt;

&lt;p&gt;Congrats, your model is now deployed! Our next goal is to make this model helpful to the sales team. To do that, we’ll want to use the deployed model to create lead scores for new sales leads and send those results to the tools that the sales team uses. We now need to create a system that regularly pulls in new sales leads, sends each lead’s info to our deployed model, and then sends those model results to Intercom, the sales team’s tool.&lt;/p&gt;

&lt;p&gt;There are a few custom-built ways to set this up:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can set up a custom Python script that regularly collects new Intercom user data in our data warehouse, sends that data to our deployed endpoint using the &lt;a href="https://sagemaker.readthedocs.io/en/stable/"&gt;Sagemaker Python SDK&lt;/a&gt;, and then sends the results back to Intercom with their &lt;a href="https://developers.intercom.com/intercom-api-reference/reference"&gt;API&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;We can utilize Sagemaker’s Batch Transform functionality (great example &lt;a href="https://github.com/awslabs/amazon-sagemaker-examples/blob/master/sagemaker_batch_transform/introduction_to_batch_transform/batch_transform_pca_dbscan_movie_clusters.ipynb"&gt;here&lt;/a&gt;) to score batches of Intercom users. All data starts and ends in S3 for batch transform, so we’ll need to pull data into S3 for scoring and then push data from S3 to Intercom to serve it up to the sales team.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;We knew there had to be a more efficient way to push the model results into the tools where they are most useful, so we built&lt;/strong&gt; &lt;a href="https://booklet.ai/"&gt;&lt;strong&gt;Booklet.ai&lt;/strong&gt;&lt;/a&gt; &lt;strong&gt;to make these steps easier.&lt;/strong&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  What is Booklet?
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://booklet.ai/"&gt;Booklet&lt;/a&gt; adds a web testing interface and data integrations to each of your Machine Learning endpoints, without requiring code changes. With Booklet, you can quickly try out model test cases to ensure results are performing as expected, as well as send these results to the tools that matter most. For a lead scoring model, we can send results back to our data warehouse (Redshift in this case) or the sales team’s tool (Intercom).&lt;/p&gt;

&lt;h4&gt;
  
  
  Testing the model
&lt;/h4&gt;

&lt;p&gt;Using Booklet, we quickly set up a &lt;a href="https://app.booklet.ai/model/lead-scoring"&gt;demo to test the lead scoring model&lt;/a&gt;. This is connected to the endpoint that we created in this tutorial so far. You can try out different inputs and see how the model classifies each theoretical lead. &lt;a href="https://booklet.ai/blog/web-app-for-ml-model/"&gt;Learn more about how to turn your ML model into a web app here.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7U7V8lKp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ALNB3dpLOMR13kcl_fZ57kA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7U7V8lKp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ALNB3dpLOMR13kcl_fZ57kA.png" alt=""&gt;&lt;/a&gt;&lt;a href="https://app.booklet.ai/model/lead-scoring"&gt;Testing the model in Booklet.ai&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Connecting the model
&lt;/h4&gt;

&lt;p&gt;Once you feel comfortable with the output of the model from testing, you can start sending those results to systems where that output is most useful. We’ve already set up our &lt;a href="https://app.booklet.ai/source/redshifts/2/edit"&gt;source&lt;/a&gt; in Redshift, which pulls data to feed into the model. We’ve also set up both a Redshift &lt;a href="https://app.booklet.ai/destinations/2/edit"&gt;destination&lt;/a&gt; and an Intercom &lt;a href="https://app.booklet.ai/destinations/3/edit"&gt;destination&lt;/a&gt;, where the results will be sent. To kick off an example dataflow, which pulls data from the source, scores that data with the model, and sends results to both destinations, you can try out a dataflow &lt;a href="https://app.booklet.ai/dataflow/configs/2/edit"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t7LFg5CU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AvU3OYn_1nF7N91fTFWTq-A.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t7LFg5CU--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AvU3OYn_1nF7N91fTFWTq-A.png" alt=""&gt;&lt;/a&gt;&lt;a href="https://app.booklet.ai/model/lead-scoring"&gt;Running a dataflow in Booklet.ai&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Making your model impactful
&lt;/h4&gt;

&lt;p&gt;Tada! We’ve now made our lead scoring model impactful by sending results directly into Intercom. To get a sense of how this might show up for a sales team member, here you can see each example lead now has a custom attribute listing whether or not they are likely to convert:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BdPzRfFB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ANowQTfCyEK6hNf-mEz5u_w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BdPzRfFB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ANowQTfCyEK6hNf-mEz5u_w.png" alt=""&gt;&lt;/a&gt;Example of lead score within &lt;a href="https://www.intercom.com/"&gt;Intercom&lt;/a&gt; Platform&lt;/p&gt;

&lt;p&gt;With these labels easily available for each potential lead, a sales team member can start to prioritize their time and pick who they will reach out to first. This will hopefully lead to better efficiency and more sales for your business! There are many ways to measure the success of these outcomes, but we’ll visit that at another time!&lt;/p&gt;

&lt;h3&gt;
  
  
  Closing Thoughts
&lt;/h3&gt;

&lt;p&gt;If you’ve made it this far, thank you! You’ve successfully navigated an entire end-to-end machine learning project, from idea inception to business impact and all of the steps in between. If you have any thoughts, questions, or run into issues as you follow along, please drop in a comment below.&lt;/p&gt;

&lt;p&gt;A big thank you to &lt;a href="https://www.kaggle.com/ashydv"&gt;Ashish&lt;/a&gt; for the dataset, &lt;a href="https://www.linkedin.com/in/liangbing/"&gt;Bing&lt;/a&gt; for a helpful review, and &lt;a href="https://towardsdatascience.com/@kylegallatin"&gt;Kyle&lt;/a&gt; for an &lt;a href="https://towardsdatascience.com/deploying-models-to-production-with-mlflow-and-amazon-sagemaker-d21f67909198"&gt;awesome blog&lt;/a&gt; to reference on MLflow and Sagemaker.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qNmb1dei--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AxVIGAovyJ23z5uaC" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qNmb1dei--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/0%2AxVIGAovyJ23z5uaC" alt=""&gt;&lt;/a&gt;Photo by &lt;a href="https://unsplash.com/@tsunamiholmes?utm_source=medium&amp;amp;utm_medium=referral"&gt;Chloe Leis&lt;/a&gt; on &lt;a href="https://unsplash.com?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;




</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>datascience</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
