<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tochukwu Nwoke</title>
    <description>The latest articles on Forem by Tochukwu Nwoke (@tobiee).</description>
    <link>https://forem.com/tobiee</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F926310%2F86c0d4f2-29bb-469d-98f8-75638636095b.jpeg</url>
      <title>Forem: Tochukwu Nwoke</title>
      <link>https://forem.com/tobiee</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/tobiee"/>
    <language>en</language>
    <item>
      <title>ML experiment tracking with DagsHub, MLFlow, and DVC</title>
      <dc:creator>Tochukwu Nwoke</dc:creator>
      <pubDate>Thu, 12 Jan 2023 10:43:27 +0000</pubDate>
      <link>https://forem.com/hackmamba/ml-experiment-tracking-with-dagshub-mlflow-and-dvc-7j4</link>
      <guid>https://forem.com/hackmamba/ml-experiment-tracking-with-dagshub-mlflow-and-dvc-7j4</guid>
      <description>&lt;p&gt;This blog is a two-part series introducing experiment tracking in machine learning (ML). Here, we cover the practical implementation of experiment tracking using a platform-based tool.&lt;/p&gt;

&lt;p&gt;In practice, having some infrastructural setup, which can be referred to as a “&lt;strong&gt;workbench&lt;/strong&gt;,” within the development pipeline is the way to go. It structures the workflow, but this is easier said than done. Although some cloud platforms have provided various out-of-the-box workbench platforms/services (like &lt;a href="https://cloud.google.com/vertex-ai-workbench" rel="noopener noreferrer"&gt;Vertex AI&lt;/a&gt;, &lt;a href="https://aws.amazon.com/sagemaker/" rel="noopener noreferrer"&gt;Sagemaker&lt;/a&gt;, &lt;a href="https://docs.microsoft.com/en-us/azure/machine-learning/overview-what-is-azure-machine-learning" rel="noopener noreferrer"&gt;AzureML&lt;/a&gt;) ready for use, it doesn’t always cover all the use cases. &lt;/p&gt;

&lt;p&gt;This often requires engineers to design and build an architecture for the workbench that best suits their workflow with external and in-house tools—which is outside the scope of this article. However, a typical workflow for this development stage will demonstrate how a specific problem might affect the workflow design and how the solution would play out with an out-of-the-box workbench.&lt;/p&gt;

&lt;p&gt;Here, we’ll implement the experimentation workflow using &lt;a href="https://dagshub.com/" rel="noopener noreferrer"&gt;DagsHub&lt;/a&gt;, &lt;a href="https://colab.research.google.com/?utm_source=scs-index" rel="noopener noreferrer"&gt;Google Colab&lt;/a&gt;, &lt;a href="https://mlflow.org/" rel="noopener noreferrer"&gt;MLflow&lt;/a&gt;, and &lt;a href="https://dvc.org/" rel="noopener noreferrer"&gt;Data Version Control (DVC)&lt;/a&gt;. We’ll focus on how to do this without diving deep into the technicalities of building or designing a workbench from scratch. Going that route would add complexity, especially if you are in the early stages of understanding ML workflows, working on a small project, or implementing a proof of concept. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;br&gt;
The following are required to follow along with this article comfortably:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A basic understanding of ML&lt;/li&gt;
&lt;li&gt;A DagsHub account&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You will need to connect your GitHub repository to DagsHub. If you don’t know how, this &lt;a href="https://dagshub.com/blog/introducing-dagshub-connect-the-complete-github-integration-is-here/" rel="noopener noreferrer"&gt;&lt;strong&gt;blog&lt;/strong&gt;&lt;/a&gt; will show you. This makes it possible to seamlessly version control your workflows with both GitHub and DagsHub, giving you the extra capabilities DagsHub provides that are necessary for ML workflows. &lt;/p&gt;

&lt;p&gt;You can find the code for this project in this &lt;a href="https://github.com/Tob-iee/experiment-tracking" rel="noopener noreferrer"&gt;&lt;strong&gt;repository&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In this article, you will:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use DagYard to set up remote environments on Google Colab&lt;/li&gt;
&lt;li&gt;Version data with DVC and store in DagsHub storage&lt;/li&gt;
&lt;li&gt;Train and track model experiments with MLflow on Google Colab&lt;/li&gt;
&lt;li&gt;Understand how use cases drive the ML development process&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  &lt;strong&gt;Case study&lt;/strong&gt;: Train a model to translate alphabet letters to sign language
&lt;/h2&gt;

&lt;p&gt;As a young research student, you decide to work on a simple side project that classifies all the letters of the sign language alphabet. You cannot train this model on your local machine because the computational cost of training is too high (the kernel or runtime keeps crashing). You must figure out an easy way to train the computer vision model without incurring financial costs or worrying about infrastructure setup.&lt;/p&gt;

&lt;p&gt;Unfortunately, the allocated grant won’t cover this project, so you can’t afford to use cloud services (like virtual machines (VMs), workbenches, etc.), and the experienced engineering colleagues who might help you brainstorm around this setup hiccup (the workbench) are too occupied with work.&lt;/p&gt;

&lt;h2&gt;
  Case study requirements
&lt;/h2&gt;

&lt;p&gt;You need a remote workbench setup with better computing power to enable faster and easier training. You should also be able to track your experiments as you optimize for the best model, and to persist the different versions of the data.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: There is a rich selection of tools to pick from for every step in this article. The individual functionality and integration of these tools/platforms make implementing this use case easy.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The workflow architecture for the out-of-the-box workbench looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2FAJNnLR5B722uOezsDNx5FTY30lyw16-Tz5CkOlFC3z28Wx23J4sTP7-kfb-tkv9wqF0SaCouNfBlQV6aHCJZtCKSbAi7vUht6py_BAzh_1aexaTZosocEfOmE2fmcPMjRPs_x84W3N7QGj7XE1Z1rPX2qYZ3r39Xl7eEJeLIN3xoEZ0B_guhtNPqeQ" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Flh4.googleusercontent.com%2FAJNnLR5B722uOezsDNx5FTY30lyw16-Tz5CkOlFC3z28Wx23J4sTP7-kfb-tkv9wqF0SaCouNfBlQV6aHCJZtCKSbAi7vUht6py_BAzh_1aexaTZosocEfOmE2fmcPMjRPs_x84W3N7QGj7XE1Z1rPX2qYZ3r39Xl7eEJeLIN3xoEZ0B_guhtNPqeQ" alt="Workbench architecture" width="800" height="464"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  What is DagYard?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://colab.research.google.com/drive/1XLP2Ouxk-k6y9yOxc4Grp-Aq6aGcbhuj?usp=sharing" rel="noopener noreferrer"&gt;Dag&lt;/a&gt;&lt;a href="https://colab.research.google.com/drive/1XLP2Ouxk-k6y9yOxc4Grp-Aq6aGcbhuj?usp=sharing" rel="noopener noreferrer"&gt;Y&lt;/a&gt;&lt;a href="https://colab.research.google.com/drive/1XLP2Ouxk-k6y9yOxc4Grp-Aq6aGcbhuj?usp=sharing" rel="noopener noreferrer"&gt;ard&lt;/a&gt; is a configuration tool in the form of a notebook for DagsHub. It brings the DagsHub environment runtime of any “hosted” project to Google Colab by pulling all the components remotely and having them run on the Google Colab runtime while having direct access to all DagHub features. &lt;/p&gt;

&lt;p&gt;This way, you don’t always have to worry about uploading or reloading data paths from Google Drive each time the kernel dies during training. You also won’t have to worry about the source code because you can now have everything in one place—plus you can now completely take advantage of Google Colab’s free computing power privileges.&lt;/p&gt;

&lt;p&gt;You must have all the correct parameters and environment variables set at the beginning of the notebook. The authentication and configuration process is as straightforward as checking a few boxes from the list of available options and filling out the credentials—based on what you want to configure directly from DagsHub. VOILA!!! It’s all set for action.&lt;/p&gt;

&lt;p&gt;DagYard is used to configure the environment for this demonstration. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgv02c3e3pnw3qqce08t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpgv02c3e3pnw3qqce08t.png" alt="Notebook Configuration" width="800" height="397"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3vxkrz0vtt21ljwc3t5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff3vxkrz0vtt21ljwc3t5.png" alt="DagsHub configuration" width="800" height="269"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxwjylksjv1v9hdh01jau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxwjylksjv1v9hdh01jau.png" alt="GitHub and DagsHub authentication" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After setting up the &lt;a href="https://colab.research.google.com/github/Tob-iee/experiment-tracking/blob/main/Notebooks/run_experiment_workflows.ipynb#scrollTo=_ult64024ro8" rel="noopener noreferrer"&gt;project&lt;/a&gt;, run the “help function” and “black magic” blocks to pull the project from DagsHub into the Colab directory.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dy0zq2nh708ubgfxphh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7dy0zq2nh708ubgfxphh.png" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  &lt;strong&gt;Version and store the data with DVC&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;About the Data&lt;/strong&gt;&lt;br&gt;
You will be working with image data. The file structure from the &lt;a href="https://public.roboflow.com/object-detection/american-sign-language-letters/1" rel="noopener noreferrer"&gt;&lt;strong&gt;data source&lt;/strong&gt;&lt;/a&gt; contains all the images and a CSV file with the features for each image (i.e., annotations, labels, etc.). The image folder structure has train, test, and validation splits.&lt;/p&gt;

&lt;p&gt;The data will be converted to the TFRecord format and recorded as a new data version. This preprocessing step reduces the data size and will make the training process easier since you are using TensorFlow. The &lt;a href="https://dagshub.com/Nwoke/data_model_experiment-tracking/src/main/src/data_preprocessor.py" rel="noopener noreferrer"&gt;&lt;strong&gt;src/data_preprocessor.py&lt;/strong&gt;&lt;/a&gt; script performs this conversion.&lt;/p&gt;

&lt;p&gt;Understand that you are working with the idea that data that comes from the data source can change at any time — which is what happens in practice. That is why you have chosen to version the data after preprocessing. Ideally, this represents the &lt;a href="http://databricks.com/glossary/extract-transform-load" rel="noopener noreferrer"&gt;extract, transform, load (ETL)&lt;/a&gt; process.&lt;/p&gt;

&lt;p&gt;DVC tracks data and uses a remote storage location to store it. It supports a variety of remote storage locations (like S3 buckets, Google Drive, Azure Blob Storage, Google Cloud Storage buckets, etc.). Although these are all excellent options, they might require a lot of configuration to set up or cost more, because they are better suited for more complex workflows or robust solutions that need scalability. If you’re thinking, “Google Drive is free and easy to configure,” you are correct, but it limits your abilities when tracking training. You will get to understand this better in a bit.&lt;/p&gt;
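&lt;p&gt;To make the versioning step concrete, the usual DVC command sequence for tracking a data directory and pushing it to a remote looks roughly like this. It is a sketch: the remote URL’s user and repository names are placeholders, not this project’s actual values.&lt;/p&gt;

```shell
# Track the processed data directory with DVC (creates a small .dvc pointer file)
dvc add data_store/data

# Commit the pointer with Git; the data itself stays out of Git
git add data_store/data.dvc data_store/.gitignore
git commit -m "Version processed data"

# Point DVC at a remote (placeholder URL) and upload the tracked data
dvc remote add -d origin https://dagshub.com/USER/REPO.dvc
dvc push
```

&lt;p&gt;Git then versions only the lightweight pointer file, while DVC moves the heavy data to and from the remote.&lt;/p&gt;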

&lt;p&gt;Fortunately, DagsHub provides every project (repository) with 10GB of free, DVC-supported remote storage by default. &lt;/p&gt;

&lt;p&gt;Thanks to DagYard, it has already been configured as the remote storage location — but the &lt;a href="https://dagshub.com/Nwoke/experiment-tracking/src/main/data_store/data" rel="noopener noreferrer"&gt;&lt;strong&gt;data_store/data&lt;/strong&gt;&lt;/a&gt; directory will serve as your feature store for the processed data.&lt;/p&gt;

&lt;h2&gt;
  Using MLflow for model experiments
&lt;/h2&gt;

&lt;p&gt;As established earlier, it’s not uncommon to run many experiments in ML. It is essential to log all information from each set of experiments so that the roadmap to arriving at a particular model becomes easier to trace.&lt;/p&gt;

&lt;p&gt;You will make use of MLflow to manage the training information. Usually, this would require initializing and configuring a server to run on your local or remote machine. MLflow’s integration with DagsHub allows you to use it just as you would a &lt;a href="https://www.mlflow.org/docs/latest/tracking.html#id32" rel="noopener noreferrer"&gt;proxied server&lt;/a&gt;. Select the MLflow option from the notebook configuration cell in DagYard to configure access to the MLflow tracking server (URI). This way, the tracking server automatically runs on DagsHub’s backend server.&lt;/p&gt;
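&lt;p&gt;Under the hood, pointing MLflow at a hosted tracking server comes down to a few standard environment variables. A hedged sketch: the username, repository name, and token below are placeholders you would replace with your own DagsHub values.&lt;/p&gt;

```shell
# MLflow reads these standard variables; the values are placeholders.
export MLFLOW_TRACKING_URI=https://dagshub.com/USER/REPO.mlflow
export MLFLOW_TRACKING_USERNAME=USER
export MLFLOW_TRACKING_PASSWORD=YOUR_DAGSHUB_TOKEN
```

&lt;p&gt;DagYard sets the equivalent of these for you when you tick the MLflow option, which is why no manual server setup is needed.&lt;/p&gt;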

&lt;p&gt;Although this setup might not give you complete control over MLflow, because you don’t have direct access to the management of the server or the backend store (database), you can always use the MLflow client API to communicate with the tracking server. These are the kinds of trade-offs you might come across when building your solution with different tools.&lt;/p&gt;

&lt;h2&gt;
  Run the experiment workflow
&lt;/h2&gt;

&lt;p&gt;Since you have the project runtime on Colab, you can efficiently run the training with the paths to the source code and data. The &lt;code&gt;train.py&lt;/code&gt; script has everything you need to set up, train, and log all the information required from an experiment.&lt;/p&gt;

&lt;p&gt;The teeny-tiny issue is that the free Google Colab plan doesn’t give you access to a terminal console. Not to worry: you can use the good old hack of running shell commands from your notebook to run your experiments. You will also commit and push your changes from within the configured notebook (DagYard).&lt;/p&gt;
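&lt;p&gt;For reference, the “hack” amounts to prefixing shell commands with &lt;code&gt;!&lt;/code&gt; inside a notebook cell, which hands the line to the shell without needing a terminal. A sketch (the script path and commit message are assumptions, not the project’s exact commands):&lt;/p&gt;

```shell
# Inside a Colab/Jupyter cell, "!" runs the rest of the line in the shell
!python src/train.py
!git add .
!git commit -m "Log new experiment run"
!git push
```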

&lt;p&gt;With the help of a pipeline, processes/workflows like this can be streamlined, seamless, and repeatable. A DVC pipeline executes the entire workflow—from data ingestion and preprocessing to model training.&lt;/p&gt;
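&lt;p&gt;A DVC pipeline is declared in a &lt;code&gt;dvc.yaml&lt;/code&gt; file, where each stage lists its command, dependencies, and outputs. The sketch below is illustrative only: the stage names, the raw-data path, and the exact dependencies are assumptions, not this project’s actual pipeline file.&lt;/p&gt;

```yaml
stages:
  preprocess:
    cmd: python src/data_preprocessor.py   # convert images + CSV to TFRecords
    deps:
      - src/data_preprocessor.py
      - data_store/raw                     # assumed raw-data location
    outs:
      - data_store/data                    # processed TFRecords, DVC-tracked
  train:
    cmd: python src/train.py               # trains the model and logs to MLflow
    deps:
      - src/train.py
      - data_store/data
```

&lt;p&gt;With this in place, &lt;code&gt;dvc repro&lt;/code&gt; re-runs only the stages whose dependencies changed, which is what makes the workflow repeatable.&lt;/p&gt;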

&lt;p&gt;&lt;a href="https://gist.github.com/Tob-iee/7b950a34de5d55a36db972cb91c8d984" rel="noopener noreferrer"&gt;https://gist.github.com/Tob-iee/7b950a34de5d55a36db972cb91c8d984&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Disclaimer:&lt;/strong&gt; Ideally, this is not good practice, but in this case, it gets the job done and demonstrates the workflow concept. Running &lt;a href="https://dagshub.com/Nwoke/experiment-tracking/src/main/Notebooks/run_experiment_workflows.ipynb" rel="noopener noreferrer"&gt;&lt;strong&gt;Notebooks/run_experiment_workflows.ipynb&lt;/strong&gt;&lt;/a&gt; will spin up the entire pipeline process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufivpntimd8h8dqwpcrp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fufivpntimd8h8dqwpcrp.png" alt="run_experiment_workflow.ipynb" width="800" height="278"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also find two other notebooks (&lt;code&gt;local_runtime.ipynb&lt;/code&gt; and &lt;code&gt;colab_runtime.ipynb&lt;/code&gt;) in the &lt;strong&gt;Notebooks&lt;/strong&gt; directory for local runtime training and Google Colab runtime training, respectively. Feel free to run them for context, or if you would like to understand the above approach better.&lt;/p&gt;

&lt;p&gt;After the training process is complete, push the changes. Go back to the DagsHub interface, and you can see the workflow DAGs from the pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqg85kcntax59x75rnmv1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqg85kcntax59x75rnmv1.png" alt="Workflow Dags" width="800" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;All MLflow experiments can then be viewed and compared in the MLflow UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60wqffxl22ah2mu3ulna.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F60wqffxl22ah2mu3ulna.png" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn80tgxcoj589t15v1074.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn80tgxcoj589t15v1074.png" alt="MLflow UI" width="800" height="381"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;br&gt;
Congratulations on making it this far! You have successfully learned the concept of experiment tracking and the thought process behind building and designing workflows to best suit your problem or use case.&lt;/p&gt;

&lt;p&gt;In practice, the approaches to providing ML solutions are often unique across businesses. Understanding this equips you and your teammates to implement this MLOps workflow better and offer more value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resources&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The ASL dataset used for the project, from &lt;a href="https://roboflow.com/" rel="noopener noreferrer"&gt;Roboflow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>watercooler</category>
    </item>
    <item>
      <title>How to Track and Analyze Experiments in Machine Learning: A Beginner's Guide</title>
      <dc:creator>Tochukwu Nwoke</dc:creator>
      <pubDate>Tue, 10 Jan 2023 14:51:35 +0000</pubDate>
      <link>https://forem.com/hackmamba/how-to-track-and-analyze-experiments-in-machine-learning-a-beginners-guide-5d0h</link>
      <guid>https://forem.com/hackmamba/how-to-track-and-analyze-experiments-in-machine-learning-a-beginners-guide-5d0h</guid>
      <description>&lt;p&gt;This article was originally posted on &lt;a href="https://hackmamba.io/blog/2022/12/how-to-track-and-analyze-experiments-in-machine-learning-a-beginner-s-guide/" rel="noopener noreferrer"&gt;Hackmamba&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As a machine learning (ML) practitioner, you must work simultaneously with code, data, and models. With the rapid evolution of these factors during development, keeping up with their interaction becomes even more demanding.&lt;/p&gt;

&lt;p&gt;This article enables you to understand the concept of experiment tracking.&lt;/p&gt;

&lt;p&gt;Practices in ML are becoming more streamlined, structured, and defined with the recent adoption of &lt;a href="https://arxiv.org/abs/2205.02302" rel="noopener noreferrer"&gt;machine learning operations (MLOps)&lt;/a&gt;. MLOps workflows have different phases, and this article focuses on data and model experiment tracking in the model management phase. MLOps provides better solutions for ML-driven development, but solutions always come with trade-offs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.kdnuggets.com/2022/07/machine-learning-model-management.html" rel="noopener noreferrer"&gt;Model management&lt;/a&gt; is a crucial part of the MLOps lifecycle because this is where many rapid changes happen. From running a series of experiments to making little tweaks in the programs in a “needle in a haystack” fashion to figure out what works best quickly.&lt;/p&gt;

&lt;h2&gt;
  Why experiment tracking?
&lt;/h2&gt;

&lt;p&gt;The early stages of developing ML solutions can feel like a series of experiments with no direction. The critical question then becomes, “How can you conduct the experimentation in a fast, reproducible, and scalable way while simultaneously providing business value?”&lt;/p&gt;

&lt;p&gt;The idea is to combine science (experimentation) and engineering workflows, since you will work primarily with code, data, and models. Code is usually static, which we consider the engineering aspect of the workflow, whereas data and models are dynamic and scientific. This engineering-versus-science gap is where the concept of experiment tracking comes into play.&lt;/p&gt;

&lt;h2&gt;
  Experiment tracking in research
&lt;/h2&gt;

&lt;p&gt;As with any research process, the scientist runs a series of experiments under certain conditions and variables before, hopefully, arriving at an optimal result for a particular problem. In the case of ML, the experiments would be the different times model training was carried out. The conditions and variables during an experiment would be the model’s hyperparameters, data features, etc.&lt;/p&gt;

&lt;p&gt;Also, let’s not forget that by the end of the research, there is always this little book where the scientist records the final observations, findings, and discoveries — which engineers and other researchers use in the future. This is precisely what happens at the model development stage of ML development.&lt;/p&gt;

&lt;p&gt;The focus is on recording all the reactions occurring during the training process by logging the training metrics, parameters, and artifacts for further analysis and understanding of the model’s performance for each experiment. With this, you can gain insights from experimentation. That knowledge equips you with the right tools to investigate the entire process and make better deductions about what step to take next.&lt;/p&gt;
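&lt;p&gt;To make this tangible, here is a minimal, dependency-free sketch of what a tracker records per run: parameters in, metrics out, one append-only log. Real tools like MLflow do far more (artifacts, a UI, remote servers), and the function names here are illustrative, not any library’s API.&lt;/p&gt;

```python
import json
import time
import uuid
from pathlib import Path

def log_run(log_dir, params, metrics):
    """Append one experiment run (hyperparameters + results) as a JSON line."""
    run = {
        "run_id": uuid.uuid4().hex,   # unique identifier for this run
        "timestamp": time.time(),     # when the run was recorded
        "params": params,             # e.g. learning rate, batch size
        "metrics": metrics,           # e.g. accuracy, loss
    }
    path = Path(log_dir) / "runs.jsonl"
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a") as f:
        f.write(json.dumps(run) + "\n")
    return run

def best_run(log_dir, metric):
    """Return the recorded run with the highest value for `metric`."""
    with (Path(log_dir) / "runs.jsonl").open() as f:
        runs = [json.loads(line) for line in f]
    return max(runs, key=lambda r: r["metrics"][metric])
```

&lt;p&gt;Comparing runs then becomes a simple query over the log — exactly the traceability that dedicated tracking platforms industrialize.&lt;/p&gt;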

&lt;h2&gt;
  The need to capture your data in ML experiments
&lt;/h2&gt;

&lt;p&gt;Data tracking is an extension of the research process in ML; it entails keeping a proper record of the data that is used for training. It captures data in the form of versions in a feature store or database store for further investigation.&lt;/p&gt;

&lt;p&gt;The data versions are driven mainly by the data preprocessing steps. These preprocessing steps are processes/tasks that primarily structure the data for the type of model you are building: different forms of data are transformed into a form the model can accept as input.&lt;/p&gt;

&lt;p&gt;It also entails feature engineering to a large extent in ML/data science because, as the famous data science saying goes, “the model is only as good as the data.” Versioning makes whatever preprocessing steps you have taken on the data easily traceable across preprocessing pipelines, just as code versions are tracked using Git.&lt;/p&gt;
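&lt;p&gt;The Git analogy can be sketched in a few lines: data versioning tools identify a dataset by a hash of its contents, so identical data always maps to the same version id and any preprocessing change yields a new, traceable one. A simplified illustration — real tools like DVC add pointer files, caching, and remote storage on top:&lt;/p&gt;

```python
import hashlib

def data_version(path):
    """Content-address a data file: identical bytes always produce the
    same version id, and any change produces a new one."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large datasets don't need to fit in memory
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()[:12]  # short id, enough to tell versions apart
```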

&lt;h2&gt;
  Who conducts experiment tracking in ML?
&lt;/h2&gt;

&lt;p&gt;The responsibility of keeping track of the experiments carried out when developing an ML model is the job of any individual directly training the model. Typically, this is a data scientist (DS) or an ML engineer (MLE).&lt;/p&gt;

&lt;p&gt;DSs/MLEs curate all the information gathered during model training to easily compare differences across experiments. They create a high-level summary of all the training information in a way that makes sense to them and other technical personnel. This represents a form of documentation that allows communication and understanding of what goes on in their development environment (usually a Jupyter notebook) on a higher level.&lt;/p&gt;

&lt;p&gt;By drawing hypotheses from previous experiments, DSs/MLEs can deduce the next promising strategy to apply to the data, model, or code when trying to arrive at an optimal ML solution. It also makes it easy to share experiments with other technical people or stakeholders in a way they can understand.&lt;/p&gt;

&lt;h2&gt;
  The tools for experiment tracking
&lt;/h2&gt;

&lt;p&gt;Many tools have been developed for tracking experiments, such as MLflow, Weights &amp;amp; Biases, Neptune.ai, etc. The core purpose of these tools is to enable a central point of reporting during the model training process. Managing model development workflows becomes easier because all the findings from the experimentation phase are populated at a single point, irrespective of where an experiment is run (i.e., locally or virtually, on different machines) or who ran it (i.e., other engineers running experiments in parallel).&lt;/p&gt;

&lt;p&gt;However, there are trade-offs between the numerous experiment tracking tools that exist. These trade-offs influence your decision when picking the right tool for your ML development stack and workflows, based on the components, features, and integrations each tool supports.&lt;/p&gt;

&lt;h2&gt;
  How should experiment tracking be carried out?
&lt;/h2&gt;

&lt;p&gt;In a nutshell, the processes involved in tracking, recording, and carrying out experiments depend on the type of data, the use case, and the recording method of the tool. Data varies from structured to unstructured, and use cases could be computer vision, natural language processing (NLP), reinforcement learning, etc. The methods that tracking tools use commonly fall into one of three categories: a spreadsheet (like MS Excel), a version control system (like Git), or a software package/platform (like MLflow).&lt;/p&gt;

&lt;p&gt;Nonetheless, manual methods like spreadsheets and version control tend to be more tedious, time-consuming, and less intuitive because you have to record your metrics, logs, hyperparameters, etc. by hand. On the other hand, software packages/platforms have a robust structure that enables seamless traceability and reproducibility when running experiments. They support and integrate with frameworks like TensorFlow, scikit-learn, and PyTorch; as such, they can intuitively record experiment runs — and even store models and artifacts in remote locations.&lt;/p&gt;

&lt;h2&gt;
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This article provides a good overview of what it entails to implement experiment tracking in ML. To learn more about experiment tracking, check out the following resources:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://neptune.ai/blog/best-ml-experiment-tracking-tools" rel="noopener noreferrer"&gt;15 Best Tools for ML Experiment Tracking&lt;/a&gt;&lt;br&gt;
&lt;a href="https://neptune.ai/blog/ml-experiment-tracking" rel="noopener noreferrer"&gt;ML Experiment Tracking: What It Is, Why It Matters, and How to Implement It&lt;/a&gt;&lt;/p&gt;

</description>
      <category>watercooler</category>
    </item>
  </channel>
</rss>
