<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Andrey Alekseev</title>
    <description>The latest articles on Forem by Andrey Alekseev (@ikintosh).</description>
    <link>https://forem.com/ikintosh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F403406%2F1df4ad26-2530-48ab-a0ca-04d061269728.jpeg</url>
      <title>Forem: Andrey Alekseev</title>
      <link>https://forem.com/ikintosh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/ikintosh"/>
    <language>en</language>
    <item>
      <title>Doing DE job as DS</title>
      <dc:creator>Andrey Alekseev</dc:creator>
      <pubDate>Fri, 14 Mar 2025 07:30:00 +0000</pubDate>
      <link>https://forem.com/ikintosh/doing-de-job-as-ds-4h93</link>
      <guid>https://forem.com/ikintosh/doing-de-job-as-ds-4h93</guid>
      <description>&lt;p&gt;As a Machine Learning Engineer I've seen how data pipelines become crucial for effective ML systems (even more than the models itself). So instead of refusing this job, I suggest to embrace it. In a daily DS job, you inevitably end up doing DE work. Creating data pipelines to move data between systems isn't just a one-time thing, and data rarely cooperates by fitting neatly in memory. After the model development phase, you're stuck in the pipeline maintenance period.&lt;/p&gt;

&lt;p&gt;So what are our options? I've tried a few approaches...&lt;/p&gt;




&lt;h2&gt;Python (Micro)service (FastAPI or CronJob in container)&lt;/h2&gt;

&lt;p&gt;Works nicely for all types of data. With the &lt;a href="https://github.com/dbader/schedule" rel="noopener noreferrer"&gt;schedule package&lt;/a&gt; it's quite easy to set up regular jobs, and with FastAPI you can even handle streaming data. Most pipelines run hourly or daily anyway. I would not recommend it for big(-ger) data, though. Sure, you can process in batches, but then you're writing that orchestration logic yourself. And if you ever need to speed things up, you have essentially no way to scale.&lt;/p&gt;
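&lt;p&gt;To make the pattern concrete, here is a minimal stdlib-only sketch of such an in-process scheduler (the schedule package wraps a loop much like this; the pipeline step itself is a hypothetical placeholder):&lt;/p&gt;

```python
import sched
import time

def extract_transform_load():
    # Hypothetical placeholder for the real pipeline step.
    print("pipeline run")

def run_hourly(scheduler, interval=3600):
    # Re-schedule first, so a failure in the pipeline step
    # does not silently stop all future runs.
    scheduler.enter(interval, 1, run_hourly, (scheduler, interval))
    extract_transform_load()

scheduler = sched.scheduler(time.time, time.sleep)
scheduler.enter(0, 1, run_hourly, (scheduler,))
# scheduler.run()  # blocks forever, running the job every hour
```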

&lt;h2&gt;Scheduled Jobs (Cloud Run, K8s Jobs and CronJobs)&lt;/h2&gt;

&lt;p&gt;This has become my go-to for medium-sized datasets (especially given Cloud Run's ease of setup). Being just a Docker container means I can use whatever packages and architecture I want. Startup times are quick, and it can be scaled manually.&lt;/p&gt;

&lt;p&gt;K8s Jobs and Cloud Run Jobs both let you parameterize executions. Batch numbers, date ranges, or partition keys can be passed to the job without rebuilding the Docker image. You only need to handle starting the job, which can be done via Airflow or any microservice that can send HTTP requests or execute kubectl commands. With CronJobs you can simply set up multiple cron jobs with different parameters, and K8s handles the startup itself. Another upside is that you pay only for uptime. There is an annoying part as well, though: you build everything by hand, including integration, batching, orchestration, and failure scenarios.&lt;/p&gt;
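&lt;p&gt;As a sketch of that parameterization, the job entrypoint can read its slice from CLI flags or from environment variables injected by the Job spec (the parameter names here are purely illustrative):&lt;/p&gt;

```python
import argparse
import os

def parse_job_args(argv=None):
    # The K8s Job / Cloud Run Job spec injects these per execution,
    # so the same image serves every partition (names are hypothetical).
    parser = argparse.ArgumentParser(description="Parameterized batch job")
    parser.add_argument("--partition-key",
                        default=os.environ.get("PARTITION_KEY", "0"))
    parser.add_argument("--date-from", default=os.environ.get("DATE_FROM"))
    parser.add_argument("--date-to", default=os.environ.get("DATE_TO"))
    return parser.parse_args(argv)

args = parse_job_args(["--partition-key", "7", "--date-from", "2025-03-01"])
print(args.partition_key, args.date_from)  # this execution's slice
```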

&lt;h2&gt;Specialized Services (Apache Beam with Dataflow, Spark, Flink)&lt;/h2&gt;

&lt;p&gt;Apache Beam lets you describe your workload once and execute it on different runners. If you have these runners available, you're in luck: Beam and its runners come with many pre-built connectors for moving data around, and they handle batching, autoscaling, and throttling out of the box. It works with streaming data too. But Apache Beam is more expensive and has a steeper learning curve, even with runners available. Then again, as a DS you will love tweaking the configs to find the perfect batch size and interval. The system works really well once you have spent the time setting it up and configuring it, and have all the necessary components nearby to create a new pipeline or tweak an existing one.&lt;/p&gt;

&lt;p&gt;Whichever of these options you lean toward, consistency trumps theoretical perfection. If your team has invested in Apache Beam, stick with it rather than introducing yet another pipeline approach.&lt;/p&gt;

&lt;p&gt;Anyway, these DE skills aren't optional extras anymore: they're becoming core to our toolkit. Data scientists are now expected to manage the full DS lifecycle.&lt;/p&gt;

&lt;p&gt;What approaches have worked for you?&lt;/p&gt;

</description>
      <category>dataengineering</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>programming</category>
    </item>
    <item>
      <title>🚀 Why Your ML Service Needs Rust + CatBoost: A Setup Guide That Actually Works</title>
      <dc:creator>Andrey Alekseev</dc:creator>
      <pubDate>Sun, 19 Jan 2025 19:05:37 +0000</pubDate>
      <link>https://forem.com/ikintosh/why-your-ml-service-needs-rust-catboost-a-setup-guide-that-actually-works-1k76</link>
      <guid>https://forem.com/ikintosh/why-your-ml-service-needs-rust-catboost-a-setup-guide-that-actually-works-1k76</guid>
      <description>&lt;p&gt;Let’s talk about a problem that’s been bothering ML teams for a while now. When we’re working with batch processing, Python does the job just fine. But real-time serving? That’s where things get interesting.&lt;/p&gt;




&lt;p&gt;I ran into this myself when building a service that needed to stay under 50ms latency. Even without strict latency requirements, everything might seem fine at first, but as the service grows, Python starts showing its limitations. Variables would randomly turn into None, and without type checks, tracking down these issues becomes a real headache.&lt;/p&gt;

&lt;p&gt;Right now, we don’t have a go-to solution for real-time model serving. Teams often turn to tools like KServe or BentoML, but this means dealing with more moving parts — more pods to watch, extra network calls slowing things down.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsaterobo15lwo315txmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsaterobo15lwo315txmy.png" alt="Possible ways of serving the model" width="800" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;What about other languages? C++ is fast and works with pretty much every ML library out there, but let’s be real — building and maintaining a C++ backend service is not something most teams want to take on.&lt;/p&gt;

&lt;p&gt;It would be great if we could build models in Python but serve them in a faster language. ONNX tries to solve this, and it works reasonably well for neural networks. But when I tried using it with CatBoost, handling categorical features turned into a challenge — the support just isn’t there yet.&lt;/p&gt;

&lt;p&gt;This brings us to Rust, which offers an interesting middle ground:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It’s just as fast as C++ but, to my liking, much easier to work with&lt;/li&gt;
&lt;li&gt;The type system keeps your business logic clean and predictable&lt;/li&gt;
&lt;li&gt;The compiler actually helps you write better code instead of just pointing out errors&lt;/li&gt;
&lt;li&gt;The ML ecosystem is growing, with support from big names like Microsoft&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;Working with Official CatBoost in Rust&lt;/h1&gt;

&lt;p&gt;Good news — there’s actually an official CatBoost crate for Rust! But before you get too excited, let me tell you about the quirks I discovered along the way.&lt;/p&gt;

&lt;p&gt;The tricky part isn’t the Rust code itself — it’s getting the underlying C++ libraries in place. You’ll need to compile CatBoost from source, and getting the environment right for this is the most difficult part.&lt;/p&gt;

&lt;p&gt;The CatBoost team provides their own Ubuntu-based image for building it from source, which sounds great. But what if you’re planning to run your service on Debian to keep things light? Then you’d better build CatBoost on the same version of Debian you’ll use for serving; otherwise you might run into compatibility issues.&lt;/p&gt;

&lt;p&gt;Let’s talk about why this matters in practice. The Ubuntu build image needs a hefty 4+ GB of memory to work with. But if you set up a custom Debian build correctly, you can bring that down to just 1 GB. And when you’re running lots of services in the cloud, that extra memory usage starts adding up in your monthly bill.&lt;/p&gt;

&lt;h1&gt;Setting Up Your Rust + CatBoost Build Environment&lt;/h1&gt;

&lt;p&gt;Let me walk you through setting up a Debian-based environment for Catboost. I’ll explain not just what to do, but why each step matters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5cme7velj1uv4c2hjsa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo5cme7velj1uv4c2hjsa.png" alt="Possible (and recommended) file structure" width="600" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Installing CatBoost&lt;/h2&gt;

&lt;p&gt;On the Rust side, installation is as simple as with any other crate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[package]
name = "MLApp"
version = "0.1.0"
edition = "2021"

[dependencies]
catboost = { git = "https://github.com/catboost/catboost", rev = "0bfdc35"}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, the CatBoost crate does not ship precompiled C/C++ bindings, so during installation (&lt;code&gt;cargo build&lt;/code&gt;) it will try to compile them from source specifically for your environment. So let’s set up that environment.&lt;/p&gt;

&lt;h2&gt;Starting with the Right Base Image&lt;/h2&gt;

&lt;p&gt;First, we’re going with &lt;code&gt;debian:bookworm-slim&lt;/code&gt; as our base image. Why? It comes with CMake 3.24+, which we need for our build process. The ‘slim’ variant keeps our image size down, which is always nice.&lt;/p&gt;
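&lt;p&gt;The very first line of the Dockerfile is then simply (a config fragment; the rest of the file is built up in the following sections):&lt;/p&gt;

```dockerfile
# bookworm's apt repositories ship CMake 3.24+, which the CatBoost build requires
FROM debian:bookworm-slim
```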

&lt;h2&gt;Setting Up the C++ Build Environment&lt;/h2&gt;

&lt;p&gt;We need a bunch of C++ packages, and while I’m using version 16 in our setup, you actually have flexibility here. Any version that supports &lt;code&gt;-mno-outline-atomics&lt;/code&gt; will work fine.&lt;/p&gt;

&lt;p&gt;Let’s break down our package installation into logical groups.&lt;/p&gt;

&lt;h3&gt;Setting Up Package Sources&lt;/h3&gt;

&lt;p&gt;First, we need to get our package sources in order. This part is crucial for getting the right LLVM tools:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RUN apt-get update &amp;amp;&amp;amp; DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
 # will use it to download packages
 wget \
 # cryptographic package to verify LLVM sources
 gnupg \
 # check Debian version to get correct LLVM package
 lsb-release \
 # package management helper
 software-properties-common
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then we need to add LLVM’s repository.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RUN wget -O - https://apt.llvm.org/llvm-snapshot.gpg.key | apt-key add - \
 &amp;amp;&amp;amp; echo "deb http://apt.llvm.org/$(lsb_release -sc)/ llvm-toolchain-$(lsb_release -sc)-16 main" \
 &amp;gt;&amp;gt; /etc/apt/sources.list.d/llvm.list
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Installing the Main Packages&lt;/h3&gt;

&lt;p&gt;We need quite a few packages, and I recommend organizing them by purpose — it makes maintenance so much easier:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RUN apt-get update &amp;amp;&amp;amp; DEBIAN_FRONTEND=noninteractive apt-get install -y --no-install-recommends \
 # Basic build essentials
 build-essential \
 pkg-config \

 # Core development packages
 libssl-dev \
 cmake \
 ninja-build \
 python3-pip \

 # LLVM toolchain - version 16 works great, but any version with 
 # -mno-outline-atomics support will do
 clang-16 \
 libc++-16-dev \
 libc++abi-16-dev \
 lld-16 \

 # Don't forget git!
 git
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The next step cost me some time to figure out. CatBoost expects to find clang at &lt;code&gt;/usr/bin/clang&lt;/code&gt;, but our installation puts it at &lt;code&gt;/usr/bin/clang-16&lt;/code&gt;. That’s why we have this bit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RUN ln -sf /usr/bin/clang-16 /usr/bin/clang &amp;amp;&amp;amp; \
 ln -sf /usr/bin/clang++-16 /usr/bin/clang++ &amp;amp;&amp;amp; \
 ln -sf /usr/bin/lld-16 /usr/bin/lld
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And don’t forget to set up the environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ENV CC=/usr/bin/clang
ENV CXX=/usr/bin/clang++
ENV LIBCLANG_PATH=/usr/lib/llvm-16/lib
ENV LLVM_CONFIG_PATH=/usr/bin/llvm-config-16
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;Managing Dependencies&lt;/h3&gt;

&lt;p&gt;We need Conan (version 2.4.1+) for handling C++ dependencies. A word of caution about the installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;RUN pip3 install --break-system-packages "conan==2.11.0"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That &lt;code&gt;--break-system-packages&lt;/code&gt; flag might look scary, but it’s actually the easiest way I found to install Python packages system-wide in newer Debian versions. Besides, we won’t be using Python much in our build image anyway.&lt;/p&gt;

&lt;h3&gt;Smart Build Strategy&lt;/h3&gt;

&lt;p&gt;Here’s a trick that’ll save you tons of build time during the active development stage. Split your build into two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;First, build just the dependencies:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COPY ./Cargo.* ./
RUN mkdir src &amp;amp;&amp;amp; \
 echo "fn main() {}" &amp;gt; src/main.rs &amp;amp;&amp;amp; \
 RUSTFLAGS="-C codegen-units=1" cargo build --release
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Important note here: you need that &lt;code&gt;RUSTFLAGS="-C codegen-units=1"&lt;/code&gt; flag to make sure the C++ and Rust build outputs link together cleanly.&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Then build your actual application:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;COPY ./src src
RUN cargo build --release
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This way, Docker caches the dependency build, and you only rebuild your app code when it changes. Much faster!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fulo22ogs67g6h1sriuw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fulo22ogs67g6h1sriuw2.png" alt="Build flow" width="800" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;A Critical Warning About Memory&lt;/h3&gt;

&lt;p&gt;This is important: during the C++ build steps, you’ll need a machine with 20+ GB of memory (I used 32 GB). And here’s the part that cost me almost a day of debugging — if you don’t have enough memory, you won’t get a clear error message (or any message at all, to be honest). Instead, your build will mysteriously time out, leaving you wondering what went wrong. I learned this one the hard way!&lt;/p&gt;

&lt;h1&gt;Wrapping It Up&lt;/h1&gt;

&lt;p&gt;Now we have a working Rust environment with Catboost that can handle all the good stuff: categories, text data, embeddings. Getting here wasn’t exactly easy.&lt;/p&gt;

&lt;p&gt;Next time we’ll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Building an Axum web service&lt;/li&gt;
&lt;li&gt;Smart model loading patterns&lt;/li&gt;
&lt;li&gt;Real-world performance tricks I learned along the way&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So we’ll turn this foundation into something that can actually serve models in production! I ran into some interesting problems while figuring this out, like accidentally loading the model again on every handler call.&lt;/p&gt;
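&lt;p&gt;As a teaser for that model-loading pitfall, here is a minimal Rust sketch of the load-once pattern using only the standard library (the &lt;code&gt;Model&lt;/code&gt; struct is a hypothetical stand-in for the real CatBoost model handle):&lt;/p&gt;

```rust
use std::sync::OnceLock;

// Hypothetical stand-in for the real CatBoost model handle.
struct Model {
    path: String,
}

impl Model {
    fn load(path: &str) -> Model {
        // Expensive in real life: reads and parses the .cbm file.
        Model { path: path.to_string() }
    }
}

static MODEL: OnceLock<Model> = OnceLock::new();

fn model() -> &'static Model {
    // Loads exactly once, no matter how many handler calls hit this.
    MODEL.get_or_init(|| Model::load("model.cbm"))
}
```

Every handler then calls `model()` and shares the same instance instead of re-reading the file per request.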

&lt;p&gt;And if you’ve tried this out and hit any weird issues, let me know. It’s always interesting to hear what problems other people run into.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://gist.github.com/iKintosh/66c10f51597a68ba13a37cf1b7886da9" rel="noopener noreferrer"&gt;Full Dockerfile&lt;/a&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>rust</category>
      <category>ai</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
