<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Tun</title>
    <description>The latest articles on Forem by Tun (@stereosky).</description>
    <link>https://forem.com/stereosky</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F12945%2F6f1c8aee-370f-4ae2-876d-a1f8893cd174.jpeg</url>
      <title>Forem: Tun</title>
      <link>https://forem.com/stereosky</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/stereosky"/>
    <language>en</language>
    <item>
      <title>Real-Time Infrastructure for Data Scientists</title>
      <dc:creator>Tun</dc:creator>
      <pubDate>Tue, 18 Jul 2023 10:19:19 +0000</pubDate>
      <link>https://forem.com/stereosky/real-time-infrastructure-for-data-scientists-28n1</link>
      <guid>https://forem.com/stereosky/real-time-infrastructure-for-data-scientists-28n1</guid>
      <description>&lt;p&gt;In our ongoing series on friction in feature engineering, we talked generally about the &lt;a href="https://quix.io/blog/bridging-the-impedance-gap/"&gt;impedance gap between data scientists and engineers&lt;/a&gt; and did a deep dive into the &lt;a href="https://quix.io/blog/feature-engineering-language-problem/"&gt;hassle of translating Python into Java&lt;/a&gt;. Here, I want to do another deep dive, but this time on the subject of infrastructure—another source of friction when getting real-time feature transformations into production.&lt;/p&gt;

&lt;p&gt;Here are the key takeaways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Developers and data scientists both need to deal with infrastructure—but they need different tools.&lt;/li&gt;
&lt;li&gt;Love it or hate it, the demand for “T-shaped” data scientists isn’t going away.&lt;/li&gt;
&lt;li&gt;When it comes to data pipelines, data scientists are used to the “plan and run” approach.&lt;/li&gt;
&lt;li&gt;When moving from batch to real-time, it helps to think in a service-oriented way.&lt;/li&gt;
&lt;li&gt;There are ML and real-time pipeline tools that support both approaches: Metaflow, Bytewax, Confluent Stream Designer and Quix.&lt;/li&gt;
&lt;li&gt;Each tool is strong in different areas such as usability, configurability, scalability and performance.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article could be considered a follow-up to Chip Huyen’s hotly debated post, &lt;em&gt;&lt;a href="https://huyenchip.com/2021/09/13/data-science-infrastructure.html"&gt;Why data scientists shouldn’t need to know Kubernetes&lt;/a&gt;&lt;/em&gt;. The discussion around the post is almost as educational as the post itself.&lt;/p&gt;

&lt;p&gt;On &lt;a href="https://twitter.com/chipro/status/1437604700115935233"&gt;Twitter&lt;/a&gt; and &lt;a href="https://news.ycombinator.com/item?id=28649508"&gt;Hacker News&lt;/a&gt;, debates emerged about how much of the stack you need to learn. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There were data scientists who felt validated, shunning the expectation that they should be “full stack”.&lt;/li&gt;
&lt;li&gt;There were even developers who admitted that they don’t like using Kubernetes either, preferring to focus on their application logic. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;On the other hand, some commentators seemed less sympathetic and felt that data scientists should just “suck it up”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Xn7pphQo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/msb772uwf5403syvwy81.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Xn7pphQo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/msb772uwf5403syvwy81.png" alt="Image description" width="789" height="254"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;One of the more &lt;a href="https://news.ycombinator.com/item?id=28653098"&gt;passionate comments&lt;/a&gt; on HN.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Of course, it’s not as black and white as the discussion would lead you to believe.&lt;/p&gt;
&lt;h2&gt;The spectrum between “no autonomy” and “complete autonomy”&lt;/h2&gt;

&lt;p&gt;There is a range of skills that put a data scientist somewhere between “no autonomy” (getting an engineer to configure and deploy everything for you) and “complete autonomy” (writing your own Helm charts and Terraform modules). It’s unrealistic to expect a data scientist to be on the “complete autonomy” end of the spectrum, but it’s fair to expect them to move as far as possible towards it, especially if you’re in a small team. &lt;/p&gt;

&lt;p&gt;This expectation is common in the startup world; roles are usually more fluid and employees are expected to take on a wider array of responsibilities. This is why data scientists at startups tend to perform more data engineering tasks and spend less time on “pure” data science. &lt;/p&gt;

&lt;p&gt;This loosely mirrors the DevOps trend, where developers became responsible for deploying their own code and relied less on SREs or infrastructure specialists. Yet the comparison isn’t entirely fair—doing data science isn’t the same as developing applications. For one, developers learn early on how to replicate the production environment on their machines, and they frequently test their code in staging environments that are exact replicas of production. This pattern hasn’t been as easy for data scientists to copy.&lt;/p&gt;
&lt;h2&gt;Data scientists and developers have different ways of working&lt;/h2&gt;

&lt;p&gt;To understand the difference between these two roles, it helps to look at another scenario (just like I did in my &lt;a href="https://quix.io/blog/feature-engineering-language-problem/"&gt;language problem article&lt;/a&gt;). &lt;/p&gt;

&lt;p&gt;Suppose I have two services that both take stock trading data as input. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One is a typical service that plots the price movement of a stock and sends its app users alerts when the price has crossed a certain threshold in a specific direction.&lt;/li&gt;
&lt;li&gt;The second is an online feature transformation service that provides fresh features to an ML model.&lt;/li&gt;
&lt;/ul&gt;
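&lt;p&gt;The heart of the first service, detecting when the price crosses a threshold in a given direction, fits in a few lines of Python (a toy sketch, not tied to any framework; the function names are mine):&lt;/p&gt;

```python
def crossed_above(prev_price, price, threshold):
    """True when the price moves from below the threshold to at or above it."""
    return prev_price < threshold <= price

def alerts(prices, threshold):
    """Yield an alert message for each upward crossing in a price sequence."""
    prev = None
    for price in prices:
        if prev is not None and crossed_above(prev, price, threshold):
            yield f"price crossed above {threshold}: {price}"
        prev = price
```

&lt;p&gt;For example, &lt;code&gt;list(alerts([99.5, 100.7, 101.2, 99.9], threshold=100))&lt;/code&gt; fires a single alert, on the 100.7 tick; a downward-crossing check is the mirror image.&lt;/p&gt;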

&lt;p&gt;The ML model is trained on patterns in historical trading activity coupled with real-time data. It looks at the trading behaviour in short time windows such as the last hour or last minute and gives users continually updated projections on a stock's movement within the current day as well as the long term (this is the same scenario I used in my last article). This is why it relies on the online feature transformation service.&lt;/p&gt;
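&lt;p&gt;A framework-free sketch of the kind of short-window feature the transformation service would compute (illustrative names; a production version would run over a live stream rather than a list):&lt;/p&gt;

```python
from collections import deque

def window_features(trades, window_seconds=60):
    """Yield (timestamp, mean_price, trade_count) over a trailing time window.

    trades: iterable of (timestamp_seconds, price) pairs in time order.
    """
    window = deque()
    for ts, price in trades:
        window.append((ts, price))
        # Evict trades that have fallen out of the trailing window
        while window and window[0][0] <= ts - window_seconds:
            window.popleft()
        prices = [p for _, p in window]
        yield ts, sum(prices) / len(prices), len(prices)
```

&lt;p&gt;Each tick yields a fresh feature row for the trailing minute; the hour-long window is the same code with &lt;code&gt;window_seconds=3600&lt;/code&gt;.&lt;/p&gt;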

&lt;p&gt;You could deploy both services using a similar pattern, but there will need to be key differences in the workflow. To understand what I mean, let’s look at how you would deploy a typical service.&lt;/p&gt;
&lt;h2&gt;How a developer might deploy a typical service&lt;/h2&gt;

&lt;p&gt;Let’s say we’re using &lt;a href="https://aws.amazon.com/fargate/"&gt;AWS Fargate&lt;/a&gt; to deploy it. AWS Fargate is a serverless compute engine for containers that simplifies container deployment and management by automatically handling infrastructure provisioning, scaling and maintenance. This is what makes it so popular with developers.&lt;/p&gt;

&lt;p&gt;Having said that, the steps to deploy a service are still fairly complex. For example, look at the steps involved in the tutorials, &lt;em&gt;&lt;a href="https://medium.com/@thehouseofcards/how-to-deploy-a-python-microservice-on-fargate-part-2-93fcdb483372"&gt;How to deploy a Python Microservice on Fargate&lt;/a&gt;&lt;/em&gt; (CLI-based) or &lt;em&gt;&lt;a href="https://dev.to/aws-builders/deploy-microservices-on-aws-ecs-fargate-serverless-16el"&gt;Deploy Microservices on AWS ECS with Fargate&lt;/a&gt;&lt;/em&gt; (UI-based).&lt;/p&gt;

&lt;p&gt;Here’s a very rough summary of what you would need to do when starting a new project. &lt;/p&gt;
&lt;h4&gt;Preparation&lt;/h4&gt;

&lt;p&gt;First, you would need to have a full development environment set up with Docker installed as well as the AWS and ECS CLIs.&lt;/p&gt;
&lt;h4&gt;Dockerise the application&lt;/h4&gt;

&lt;p&gt;Then, after you’ve finished coding the logic for your service, you would need to Dockerise it. This means writing a Dockerfile that defines the entry point, software dependencies, relevant ports to open and so on. Then you run a build script to actually build a Docker image.&lt;/p&gt;
&lt;h4&gt;Push the image to a container registry&lt;/h4&gt;

&lt;p&gt;Then you’d push the image to Amazon Elastic Container Registry (ECR). In this step you need to make sure that you have the correct IAM roles and that your image is tagged correctly.&lt;/p&gt;
&lt;h4&gt;Deploy the image to your cluster as a service&lt;/h4&gt;

&lt;p&gt;This is perhaps the trickiest part of the process. Here you need to write task and service definitions. This involves configuring execution roles, auto-scaling behaviour, load balancing, network settings and inter-service communication. Then you would rinse and repeat for the other services involved in your project.&lt;/p&gt;
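&lt;p&gt;To give a flavour of what this involves, here’s a stripped-down Fargate task definition (illustrative only: the family name, image URI and role ARN are placeholder values, and a real definition would carry many more settings):&lt;/p&gt;

```json
{
  "family": "price-alert-service",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "256",
  "memory": "512",
  "executionRoleArn": "arn:aws:iam::123456789012:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "price-alert",
      "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/price-alert:latest",
      "portMappings": [{ "containerPort": 8080, "protocol": "tcp" }],
      "essential": true
    }
  ]
}
```

&lt;p&gt;The service definition then references this task definition and adds the scaling, load balancing and networking configuration on top.&lt;/p&gt;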

&lt;p&gt;All of this requires some fairly in-depth knowledge of how infrastructure works in AWS and many development teams have a dedicated DevOps specialist to help them navigate the intricacies of networking and access management.&lt;/p&gt;

&lt;p&gt;However, once it is all set up, it’s usually automated with Infrastructure as Code (IaC) tools such as Terraform. This enables an automated CI/CD process to take care of deploying new versions of a service whenever developers push major changes to the underlying code. That said, configuring a tool like Terraform is also &lt;a href="https://engineering.finleap.com/posts/2020-02-20-ecs-fargate-terraform/"&gt;not a trivial task&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Anyway, that’s a sketch of the process for a typical back-end developer. Let’s turn to the other type of task, one that is typically the concern of data scientists.&lt;/p&gt;
&lt;h2&gt;How an ML engineer might deploy an online feature transformation service&lt;/h2&gt;

&lt;p&gt;Assuming you’re using the same infrastructure as the typical microservice (AWS Fargate), the deployment steps probably wouldn’t change much. However, you would need to update your network settings to ensure that the container has access to the wider internet (for example, to access the Coinbase WebSocket feed and read fresh trading data).&lt;/p&gt;

&lt;p&gt;Secondly, you would need to connect the container to an online, in-memory data store such as Redis. This provides the feature transformation service with a place to write the calculated features. &lt;/p&gt;

&lt;p&gt;Thus, when the ML model needs to make a prediction, it fetches the most recent features from Redis. This allows the prediction service to access feature data at very high speed, which can be crucial for latency-sensitive applications.&lt;/p&gt;
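&lt;p&gt;As a toy sketch of this write/read split (a plain Python dict stands in for Redis here, and the key scheme is my own; with a real Redis client you’d swap the dict operations for hash commands such as &lt;code&gt;hset&lt;/code&gt; and &lt;code&gt;hgetall&lt;/code&gt;):&lt;/p&gt;

```python
feature_store = {}  # stands in for Redis: key -> hash of feature values

def write_features(symbol, features):
    """Called by the feature transformation service whenever a window closes."""
    feature_store[f"features:{symbol}"] = features

def read_features(symbol):
    """Called by the prediction service just before scoring a request."""
    return feature_store.get(f"features:{symbol}", {})

write_features("BTC-USD", {"mean_price_1m": 30125.4, "trade_count_1m": 87})
```

&lt;p&gt;The prediction service then calls &lt;code&gt;read_features("BTC-USD")&lt;/code&gt; to fetch the freshest values at scoring time.&lt;/p&gt;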

&lt;p&gt;This of course raises the question—do we really expect data scientists to know all this too? Probably not. Most big companies have ML engineers to do it for them. And in companies that have batch-only machine learning processes, i.e. most companies, data scientists don’t have to think about application infrastructure at all. &lt;/p&gt;

&lt;p&gt;However, many data scientists do have to provision infrastructure for model training and run data pipelines internally. Internal pipelines also have their fair share of infrastructural wrangling, but these processes are generally “plan and run” workflows with a start and an end. This reduces the complexity to a small degree, but it’s still typically handled by a data engineer or a data scientist with a T-shaped skill set, i.e. one who has acquired some data engineering skills.&lt;/p&gt;

&lt;p&gt;It takes another set of skills to deploy a service that runs online and to automate and test that deployment. This is why there’s usually a handover process where data scientists pass their work to software or data/ML engineers to deploy online. However, this handover can be a bottleneck as we covered in our earlier article on the &lt;a href="https://quix.io/blog/bridging-the-impedance-gap/"&gt;impedance gap&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;The demand for T-shaped data scientists&lt;/h2&gt;

&lt;p&gt;Let’s revisit that notion of the T-shaped skill set for a moment. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--zWvLbw5L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l1um3brj35wvxcss317b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--zWvLbw5L--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/l1um3brj35wvxcss317b.png" alt="Image description" width="488" height="323"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The batch processing world is now full of them. Just search on LinkedIn for “data scientist” and you’ll often see “/ data engineer” tacked on to the end. Why is that? &lt;/p&gt;

&lt;p&gt;An article from early 2021 might give us a clue. In his article, &lt;em&gt;&lt;a href="https://www.mihaileric.com/posts/we-need-data-engineers-not-data-scientists/"&gt;We Don't Need Data Scientists, We Need Data Engineers&lt;/a&gt;&lt;/em&gt;, engineer Mihail Eric claimed that there were roughly 70% more open roles for data engineers than for data scientists (among the Y Combinator portfolio companies he studied). His takeaway was that the industry should place more emphasis on engineering skills when training data professionals. &lt;/p&gt;

&lt;p&gt;Also, given the lack of data engineers at the time he was writing, many lone data scientists acquired some engineering skills out of necessity. Initially, many companies only hired one data scientist, usually their first data hire, so they had to learn how to provision their own infrastructure to some degree (I have data scientist friends who have been in this position and conveyed their pain). Later, they were also aided by the emergence of new tools that simplified some infrastructural aspects of running an ML data pipeline.&lt;/p&gt;

&lt;p&gt;In the following section, I’ll take a look at one of these tools and identify some paradigms that are being carried over to tools for real-time data pipelines. The goal is to show how productivity gains from the batch world can also be applied to the real-time world.&lt;/p&gt;
&lt;h2&gt;Rise of the “Plan and Run” approach&lt;/h2&gt;

&lt;p&gt;In the batch world, a pipeline is bounded—there is a clear start and end. You run the pipeline and at some point it is done (until it is triggered again). In contrast, an unbounded real-time pipeline is never really done; it runs continuously unless you stop it or it encounters a serious error.&lt;/p&gt;
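&lt;p&gt;The contrast can be made concrete with a toy sketch (framework-free; the function names are mine):&lt;/p&gt;

```python
def batch_pipeline(records):
    """Bounded: consumes a finite input and returns when it is exhausted."""
    return [r * 2 for r in records]

def streaming_pipeline(source):
    """Unbounded: handles items as they arrive and never finishes on its own."""
    for r in source:  # with a live stream, this loop has no natural end
        yield r * 2
```

&lt;p&gt;Calling &lt;code&gt;batch_pipeline&lt;/code&gt; on a list completes and returns; &lt;code&gt;streaming_pipeline&lt;/code&gt; only stops when its source does, which for a live feed is never.&lt;/p&gt;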

&lt;p&gt;The bounded nature of batch processing lends itself to being orchestrated as a “plan and run” approach. This is the approach used by workflow tools such as Dagster and Airflow. Both of these tools are designed to connect to a data integration platform such as Airbyte and run some data processing steps in sequence. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--0MuTEXNH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zf3wf0gxqq6v2ld9xfzf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--0MuTEXNH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/zf3wf0gxqq6v2ld9xfzf.png" alt="Image description" width="800" height="508"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Screenshot from the &lt;a href="https://github.com/dagster-io/quickstart-etl"&gt;Dagster quick start&lt;/a&gt; for a job that fetches data from Hacker News’ APIs, transforms the collected data using Pandas and creates a word cloud based on trending stories (to visualise popular topics).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This processing workflow is written locally in a Python workflow file and then submitted to the workflow tool to run in a specific compute environment (hence “plan and run”). However, the disadvantage of these tools is that you still have to provision the infrastructure where the processing code actually executes. This problem is especially acute when provisioning the resources to train memory-hungry machine learning models.&lt;/p&gt;

&lt;p&gt;To help solve this problem, Netflix open-sourced &lt;a href="https://metaflow.org/"&gt;Metaflow&lt;/a&gt;—originally an internal tool that abstracts away much of the infrastructure configuration and allows data scientists to use the same code in both development and production environments. The project was spun out into a separate startup called Outerbounds, which now manages its development. They also offer a &lt;a href="https://outerbounds.com/blog/announcing-outerbounds-platform/"&gt;managed platform&lt;/a&gt; which runs Metaflow so you don’t have to worry about setting it up.&lt;/p&gt;

&lt;p&gt;In any case, it has some great concepts that can be applied to real-time data pipelines too, so let's take a closer look.&lt;/p&gt;
&lt;h2&gt;Metaflow: infrastructure for ML, AI and data science&lt;/h2&gt;

&lt;p&gt;You can run Metaflow locally, deploy it to external compute clusters, or use it in the Outerbounds managed platform. Once you have it set up, you can define and run your workflows in a cloud IDE. You define a workflow in a Python file as a series of steps marked with “@step” decorators. You can also visualise the workflow as a DAG that illustrates how it will be executed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4Wh3sZr9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hm99l2tw9fkn3an4f1gp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4Wh3sZr9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/hm99l2tw9fkn3an4f1gp.png" alt="Image description" width="800" height="697"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The real kicker though, is the ability to provision infrastructure for each workflow step. Here’s a very simple example, from the Metaflow sandbox.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;kubernetes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;memory&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;catch&lt;/span&gt;
&lt;span class="o"&gt;@&lt;/span&gt;&lt;span class="n"&gt;step&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;memory_hog&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Requesting a lot of memory"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bytes_used&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"x"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1_000_000_000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"Success!"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;join&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example shows how you can provision a container with a specific amount of memory that goes beyond the defaults. Unlike Airflow, Metaflow allows you to easily provision different kinds of containers for different steps so that each step has the resources that it needs. For more information on how the Kubernetes configuration works, see Metaflow’s documentation of the &lt;a href="https://docs.metaflow.org/api/step-decorators/kubernetes"&gt;“@kubernetes” decorator&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Although Metaflow is great for training and retraining models, it’s not ideal for a real-time pipeline that runs continuous feature computations in production. &lt;/p&gt;

&lt;p&gt;Yet, there are other tools that borrow the same “plan and run” approach for real-time processing. Some key examples are Bytewax and Confluent’s Stream Designer. Let’s look at Bytewax first.&lt;/p&gt;

&lt;h2&gt;Bytewax: Timely Dataflows&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://bytewax.io/"&gt;Bytewax&lt;/a&gt; is a Python native binding to the Rust implementation of the Timely Dataflow library. Timely Dataflow was first introduced as a concept in the 2013 Microsoft Research paper “Naiad: A Timely Dataflow System”. The original Rust implementation is described as a distributed “data-parallel compute engine” that allows you to develop and run your code locally and then easily scale that code to multiple workers or processes without changes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--g5C3iL0Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4w0dq013mrt3bbryuflw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--g5C3iL0Z--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4w0dq013mrt3bbryuflw.png" alt="Image description" width="556" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Bytewax thus inherits all of these parallel processing features while providing a more accessible syntax and simplified programming interface, as well as a powerful CLI that lets you deploy a dataflow to an instance in the cloud.&lt;/p&gt;

&lt;p&gt;In its most basic form, a dataflow looks like this, where each step calls a different function, starting with an input and ending with an output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="p"&gt;...&lt;/span&gt;
&lt;span class="n"&gt;flow&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Dataflow&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"inp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;FileInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"wordcount.txt"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="c1"&gt;# Take a line from the file
&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Lowercase all characters in the line
&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;flat_map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Split the line into words
&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initial_count&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Count the occurrence of each word in the file
&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;reduce_window&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"sum"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;clock_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;window_config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# Use a tumbling window emit the accumulated value every 5 seconds
&lt;/span&gt;&lt;span class="n"&gt;flow&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"out"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StdOutput&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="c1"&gt;# Output to console
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;Example taken from the &lt;a href="https://bytewax.io/docs/getting-started/simple-example"&gt;Bytewax documentation&lt;/a&gt; (functions omitted for brevity).&lt;/em&gt;&lt;/p&gt;
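&lt;p&gt;For completeness, the omitted helper functions are plain Python along these lines (adapted from the Bytewax word-count example, so treat the exact tokenising regex as illustrative):&lt;/p&gt;

```python
import re

def lower(line):
    """Lowercase all characters in the line."""
    return line.lower()

def tokenize(line):
    """Split the line into words, dropping common punctuation."""
    return re.findall(r"[^\s!?.,;:()]+", line)

def initial_count(word):
    """Pair each word with an initial count of one."""
    return word, 1

def add(count1, count2):
    """Combine two counts inside the reduce window."""
    return count1 + count2
```

&lt;p&gt;Nothing here is Bytewax-specific; the flow simply wires these ordinary functions into its steps.&lt;/p&gt;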

&lt;p&gt;Dataflows can also contain subflows nested within a function and it’s possible to run steps concurrently and pass different parameters to various steps. This allows you to run the same flow with different variables (for example, a certain price threshold or field name).&lt;/p&gt;
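&lt;p&gt;Stripped of the Bytewax specifics, the parameterised pattern boils down to a flow factory (a toy sketch with illustrative names; a real &lt;code&gt;get_flow&lt;/code&gt; would build and return a Bytewax &lt;code&gt;Dataflow&lt;/code&gt;):&lt;/p&gt;

```python
def get_flow(threshold):
    """Hypothetical flow factory: the filter step is parameterised by the
    threshold passed in from the command line."""
    def above_threshold(value):
        return value > threshold

    # In Bytewax this would configure a step on a Dataflow object;
    # a plain filter stands in here so the shape stays visible.
    def flow(values):
        return [v for v in values if above_threshold(v)]

    return flow
```

&lt;p&gt;Running &lt;code&gt;get_flow(10)&lt;/code&gt; and &lt;code&gt;get_flow(100)&lt;/code&gt; gives two flows with identical steps but different thresholds.&lt;/p&gt;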

&lt;p&gt;Here’s how you would pass a threshold value of 10 to a flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; bytewax.run &lt;span class="s2"&gt;"parametric:get_flow(10)"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sadly, you can’t define your infrastructure requirements in the code of the flow itself (like you can with Metaflow). However, you can (to a degree) define it at the command level.&lt;/p&gt;

&lt;p&gt;For example, suppose that you want to run individual processes on different machines in the same network. Specifically, you want to run 2 processes, with 3 workers each on two different machines. The machines are known in the network as &lt;em&gt;cluster_one&lt;/em&gt; and &lt;em&gt;cluster_two&lt;/em&gt;. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You would run the first process on cluster_one as follows:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; bytewax.run simple:flow &lt;span class="nt"&gt;---workers-per-process&lt;/span&gt; 3 &lt;span class="nt"&gt;--process-id&lt;/span&gt; 0 &lt;span class="nt"&gt;---addresses&lt;/span&gt; &lt;span class="s2"&gt;"cluster_one:2101;cluster_two:2101"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;And the second process like this:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; bytewax.run simple:flow &lt;span class="nt"&gt;---workers-per-process&lt;/span&gt; 3 &lt;span class="nt"&gt;---process-id&lt;/span&gt; 1 &lt;span class="nt"&gt;---addresses&lt;/span&gt; &lt;span class="s2"&gt;"cluster_one:2101;cluster_two:2101"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To deploy a dataflow to a Kubernetes cluster you would use the &lt;a href="https://bytewax.io/docs/deployment/waxctl"&gt;Waxctl&lt;/a&gt; tool, which is Bytewax's equivalent of Kubernetes' &lt;a href="https://kubernetes.io/docs/reference/kubectl/"&gt;kubectl&lt;/a&gt; client.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;waxctl dataflow deploy my-script.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cluster &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--processes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;5 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--workers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;However, this introduces a little more complexity as you’re required to configure communication between the different workers yourself. So for example, if you have a cluster that runs &lt;a href="https://redpanda.com/"&gt;Redpanda&lt;/a&gt; or &lt;a href="https://kafka.apache.org/"&gt;Apache Kafka&lt;/a&gt; to store the output of intermediate steps, you’ll have to configure your workers to connect to Redpanda or Kafka.&lt;/p&gt;

&lt;p&gt;When you use the Waxctl CLI to deploy a Bytewax dataflow to a Kubernetes cluster, it will create the following components within the cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QEOSQPUy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nf3e6y3lkve5l2z8q29u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QEOSQPUy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nf3e6y3lkve5l2z8q29u.png" alt="Image description" width="530" height="545"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These components are explained in more detail in the &lt;a href="https://bytewax.io/docs/deployment/k8s-ecosystem"&gt;Bytewax documentation&lt;/a&gt;, but what's important to point out here is that Bytewax will provision multiple replicas for your dataflow (indicated by my-dataflow-0, my-dataflow-1, etc.) based on your configuration settings. However, it’s not entirely clear how to allocate more resources to an especially memory-hungry step while letting the other steps run in low-resource containers. My guess is that you need to &lt;a href="https://github.com/bytewax/helm-charts"&gt;customise the Bytewax Helm chart&lt;/a&gt; first.&lt;/p&gt;

&lt;p&gt;As it stands, Bytewax doesn't have the same level of configurability as Metaflow and there’s no managed version yet—you have to set up your own cluster on AWS or GCP. However, as I’ve pointed out before, they are currently working on a &lt;a href="https://bytewax.io/platform"&gt;managed platform&lt;/a&gt; which I hope will follow in the footsteps of Metaflow and abstract away more of the infrastructure headache.&lt;/p&gt;

&lt;p&gt;Thus, from the perspective of a data scientist, Bytewax is fantastic for defining and orchestrating workflows but there's still the requirement of provisioning the accompanying infrastructure.&lt;/p&gt;

&lt;h2&gt;Confluent Stream Designer: real-time streaming pipelines with ksqlDB&lt;/h2&gt;

&lt;p&gt;Confluent’s &lt;a href="https://docs.confluent.io/cloud/current/stream-designer/overview.html"&gt;Stream Designer&lt;/a&gt; is another tool with a “plan and run” approach to real-time pipelines. It uses ksqlDB as its processing engine and offers both a visual pipeline designer and a simple IDE for defining the pipeline in KSQL.&lt;/p&gt;

&lt;p&gt;Unlike Metaflow and Bytewax, Stream Designer does not allow you to provision different resources for different steps in the pipeline. This is by design. The entire pipeline is designed to run on a single ksqlDB cluster which handles all the data processing and state management. The processing load is divided amongst the available nodes in the cluster and if a node is added or removed, ksqlDB will automatically rebalance the processing workloads.&lt;/p&gt;

&lt;p&gt;Let’s look at an example of a pipeline which performs the following tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use the sample Datagen source connector to get basic page view data. This is how the data will look:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"viewtime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1702311&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"userid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User_5"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"pageid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Page_39"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"viewtime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1702411&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"userid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User_6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"pageid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Page_66"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"viewtime"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1702541&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"userid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User_6"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"pageid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Page_89"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Write that data to a stream in a topic called “pageviews_topic”.&lt;/li&gt;
&lt;li&gt;Filter the stream for a specific user ID.&lt;/li&gt;
&lt;li&gt;Write the filtered stream back to a downstream topic.&lt;/li&gt;
&lt;/ul&gt;
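&lt;p&gt;To make the pipeline's job concrete, here is the filtering logic sketched in plain Python (Stream Designer itself expresses this in KSQL; the sample events mirror the Datagen output above and the target user ID is just an example):&lt;/p&gt;

```python
import json

# Sample page view events as they would arrive from the Datagen connector
raw_events = [
    '{ "viewtime": 1702311, "userid": "User_5", "pageid": "Page_39" }',
    '{ "viewtime": 1702411, "userid": "User_6", "pageid": "Page_66" }',
    '{ "viewtime": 1702541, "userid": "User_6", "pageid": "Page_89" }',
]

def filter_pageviews(events, user_id):
    """Keep only the page views belonging to a specific user ID."""
    for line in events:
        event = json.loads(line)
        if event["userid"] == user_id:
            yield event

filtered = list(filter_pageviews(raw_events, "User_6"))
print([e["pageid"] for e in filtered])  # → ['Page_66', 'Page_89']
```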

&lt;p&gt;Here’s a screenshot of the finished pipeline that has been wired together in the Stream Designer UI.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e3aOAFgB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pa9b5kbx926kfrzui6hm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e3aOAFgB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/pa9b5kbx926kfrzui6hm.png" alt="Image description" width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Although the UI is designed to abstract away the process of creating KSQL by hand, some parts of the configuration require you to know a bit of KSQL—such as defining column names for a stream (sadly none of the fields have autocomplete, or defaults that might help a data scientist configure the steps faster). This is perhaps unavoidable because a pipeline is designed offline, before some attributes of the online data are known.&lt;/p&gt;

&lt;p&gt;The following screenshot shows how you define columns for a filter stream:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3ahLGEAP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qgtvlncgeqogjhpxsznk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3ahLGEAP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qgtvlncgeqogjhpxsznk.png" alt="Image description" width="798" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you know a bit of KSQL, it is probably faster to write the steps by hand, which you can do by clicking “View pipeline graph and source”. This ability to switch easily back and forth between the code and the visual graph is a nice touch.&lt;/p&gt;
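&lt;p&gt;As a rough illustration (the stream and topic names here are invented for the example, not taken from the actual pipeline), the hand-written KSQL for a filter pipeline like this might look something like the following:&lt;/p&gt;

```sql
-- Register the raw page views as a stream over the source topic
CREATE STREAM pageviews (viewtime BIGINT, userid VARCHAR, pageid VARCHAR)
  WITH (KAFKA_TOPIC = 'pageviews_topic', VALUE_FORMAT = 'JSON');

-- Filter for a specific user and write the result to a downstream topic
CREATE STREAM pageviews_filtered
  WITH (KAFKA_TOPIC = 'pageviews_filtered_topic') AS
  SELECT * FROM pageviews
  WHERE userid = 'User_6'
  EMIT CHANGES;
```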

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--CuiLvpTn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wymzdt0bwsvqihequ73s.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--CuiLvpTn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/wymzdt0bwsvqihequ73s.png" alt="Image description" width="800" height="228"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once you have written your pipeline, you click “activate” to deploy the whole pipeline at once. This makes it very much in line with the “plan and run” approach adopted by Metaflow and Bytewax. However, you can also add new components to an already running pipeline. If you want to change existing components, you need to deactivate the pipeline first and reactivate it when you’re done. &lt;/p&gt;

&lt;p&gt;One obvious weakness here, though, is that workflows are based on KSQL rather than Python. We’ve already covered the &lt;a href="https://quix.io/blog/drawbacks-ksqldb-ml-workflows/"&gt;limitations of ksqlDB&lt;/a&gt; elsewhere, especially when it comes to machine learning. The gist of that article is that if you’re doing real-time machine learning and need to compute fresh features with complex transformations, ksqlDB might not be the best choice.&lt;/p&gt;

&lt;p&gt;If you're not doing machine learning and your pipeline consists of fairly standard processing steps (i.e. filtering, joining, aggregating), Stream Designer is a great option. For data scientists, the infrastructure problem is taken care of because everything runs on a managed cluster, which has likely already been set up by your infrastructure or Confluent support team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quix: serverless data pipelines
&lt;/h2&gt;

&lt;p&gt;Last but not least, we have our own offering, which uses many of the paradigms from the tools above. One difference, though, is that it works in a serverless manner. You don’t have to follow the “plan and run” approach (unless you really want to).&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://quix.io/signup"&gt;Quix platform&lt;/a&gt; has a visual graph which is very similar to Confluent Stream Designer. However, the graph itself can’t be edited in the same way: you can’t wire nodes together, because the graph is designed to represent services that have already been deployed. There is also currently no offline view like you would get in Confluent’s Stream Designer. You can, however, interact with the graph by clicking different nodes and opening their relevant contextual menus.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--suJ39XuM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ejhj07dhxak8eq8ilc7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--suJ39XuM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ejhj07dhxak8eq8ilc7w.png" alt="Image description" width="800" height="255"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each individual node represents a deployed project running a specific step in the pipeline. When you click on a node you can access the deployment settings that are used to run its process. You can also view the code—stored in your Git repo—that is running on the node (under the hood, this is handled by one or more Docker containers running in Kubernetes).&lt;/p&gt;

&lt;p&gt;In Quix, the infrastructure and workflow are decoupled from the processing logic. This means that the processing code is stored separately from the workflow definition (rather than in one big flow.py as it would be in Metaflow), and both reside in the same Git repository.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The infrastructure settings and workflow steps are written to a YAML file.&lt;/li&gt;
&lt;li&gt;The processing logic is stored in Python files and committed to the Git repository that you’ve defined for your workspace.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s an example of how the infrastructure and workflow logic is defined in YAML.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Quix Project Descriptor&lt;/span&gt;
&lt;span class="c1"&gt;# This file describes the data pipeline and configuration of resources of a Quix Project.&lt;/span&gt;

&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1.0&lt;/span&gt;

&lt;span class="c1"&gt;# This section describes the Deployments of the data pipeline&lt;/span&gt;
&lt;span class="na"&gt;deployments&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;ADS Sim&lt;/span&gt;
    &lt;span class="na"&gt;application&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AdsSim&lt;/span&gt;
    &lt;span class="na"&gt;deploymentType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Job&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;v1.2&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;4000&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;8600&lt;/span&gt;
      &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output&lt;/span&gt;
        &lt;span class="na"&gt;inputType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OutputTopic&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;This is the output topic for demo sine wave data&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;f1-data&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;stream_id&lt;/span&gt;
        &lt;span class="na"&gt;inputType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FreeText&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;car1&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FeaturesEng&lt;/span&gt;
    &lt;span class="na"&gt;application&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;FeaturesEng&lt;/span&gt;
    &lt;span class="na"&gt;deploymentType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Service&lt;/span&gt;
    &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;features-v1.0&lt;/span&gt;
    &lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;cpu&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1000&lt;/span&gt;
      &lt;span class="na"&gt;memory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;500&lt;/span&gt;
      &lt;span class="na"&gt;replicas&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;desiredStatus&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Stopped&lt;/span&gt;
    &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;input&lt;/span&gt;
        &lt;span class="na"&gt;inputType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;InputTopic&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name of the input topic to listen to.&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;f1-data&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;output&lt;/span&gt;
        &lt;span class="na"&gt;inputType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;OutputTopic&lt;/span&gt;
        &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Name of the output topic to write to.&lt;/span&gt;
        &lt;span class="na"&gt;required&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;f1-data-features&lt;/span&gt;

&lt;span class="c1"&gt;# This section describes the Topics of the data pipeline&lt;/span&gt;
&lt;span class="na"&gt;topics&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;f1-data-features&lt;/span&gt;
    &lt;span class="na"&gt;persisted&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;configuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;replicationFactor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;retentionInMinutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-1&lt;/span&gt;
      &lt;span class="na"&gt;retentionInBytes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;262144000&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;f1-data&lt;/span&gt;
    &lt;span class="na"&gt;persisted&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
    &lt;span class="na"&gt;configuration&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;partitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;replicationFactor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;2&lt;/span&gt;
      &lt;span class="na"&gt;retentionInMinutes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;-1&lt;/span&gt;
      &lt;span class="na"&gt;retentionInBytes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;262144000&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This YAML file is stored at the top level of your project. Like Stream Designer, the editing process is bidirectional. You can edit the YAML file directly (either locally or in the cloud IDE) or you can use the web UI to define the steps, which will be synced to the YAML file. Each step is called a “deployment” and is referenced by a unique name.&lt;/p&gt;

&lt;p&gt;Here’s an example of a deployment which contains the processing logic for a specific step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--i9p9p2Ea--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nek64hv2wnbk1z4b4g9c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--i9p9p2Ea--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/nek64hv2wnbk1z4b4g9c.png" alt="Image description" width="800" height="461"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Generally, the Quix processing model is quite opinionated about how you store the data as it flows through your pipeline. Although not visible in the pipeline visualisation, each step typically outputs the data to a Kafka topic. &lt;/p&gt;

&lt;p&gt;Subsequent steps read from the Kafka topic and in turn produce their outputs to other Kafka topics as illustrated in the following diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--qd7z1py2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3122ywtc3mdriczkc9cs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--qd7z1py2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/3122ywtc3mdriczkc9cs.png" alt="Image description" width="800" height="263"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means that the steps don’t need to be defined in any specific order in the YAML file. The order of steps can be determined from the topics that each step uses as input and output. Thus topics are like the links in a chain of steps.&lt;/p&gt;
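&lt;p&gt;A minimal sketch of how that ordering can be derived (the dict structure below is a simplification of the YAML descriptor for illustration, not the actual Quix implementation): walk from each step back through the producer of its input topic, so producers always come first.&lt;/p&gt;

```python
# Hypothetical deployments mirroring the YAML above: each step names its
# input and output topics, and the pipeline order falls out of those links.
deployments = [
    {"name": "FeaturesEng", "input": "f1-data", "output": "f1-data-features"},
    {"name": "ADS Sim", "input": None, "output": "f1-data"},
]

def pipeline_order(deployments):
    """Order steps so that each step's input topic is produced upstream."""
    produced = {d["output"]: d for d in deployments}
    ordered, seen = [], set()

    def visit(d):
        if d["name"] in seen:
            return
        seen.add(d["name"])
        upstream = produced.get(d["input"])
        if upstream:
            visit(upstream)  # make sure the producer comes first
        ordered.append(d["name"])

    for d in deployments:
        visit(d)
    return ordered

print(pipeline_order(deployments))  # → ['ADS Sim', 'FeaturesEng']
```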

&lt;h2&gt;
  
  
  Summary: comparing approaches
&lt;/h2&gt;

&lt;p&gt;Now that I’ve talked you through a few tools, let's compare their approaches. &lt;/p&gt;

&lt;h3&gt;
  
  
  Plan and Run approach
&lt;/h3&gt;

&lt;p&gt;The plan and run approach is fundamentally a paradigm from the batch processing world where a series of jobs are defined and executed all at once. This approach pushes you towards running the whole pipeline on one machine or container with a global set of resources. But this doesn’t work for machine learning workflows where some processing steps need far more resources than others. Thus, tools like Metaflow evolved to fill this gap. &lt;/p&gt;

&lt;p&gt;Similarly, Bytewax and Stream Designer both guide you to run your entire flow in one compute environment. In Stream Designer it’s simply not possible to do it any other way, and in Bytewax the infrastructure learning curve is prohibitively complex for a data scientist to manage alone (although it does provide an easy way to scale horizontally).&lt;/p&gt;

&lt;p&gt;Thus, when it comes to provisioning infrastructure, Metaflow strikes the best balance between simplicity and configurability. Unfortunately, it’s not suitable for real-time pipelines because it doesn't have built-in functionality to handle streaming data or manage the lower latencies typically required in real-time systems.&lt;/p&gt;

&lt;p&gt;Bytewax looks more promising and shows a lot of potential. For now, the best way to work with it is probably to have a software engineer set up a staging environment for the data scientists and give them simple command line tools to deploy their dataflows with the required resources. When the code is ready, an engineer can deploy it to the production environment. Even with this setup, Bytewax adds a lot of value, helping to solve the &lt;a href="https://quix.io/blog/feature-engineering-language-problem/"&gt;code translation problem&lt;/a&gt; (thus cutting down time to production).&lt;/p&gt;

&lt;p&gt;Confluent’s Stream Designer looks like the simplest real-time solution for data scientists, yet it’s not an easy option for those who use ML models in their real-time pipelines. However, if Confluent were to release a version of Stream Designer for their new Apache Flink integration, that would certainly change the game (as long as they allowed you to write in Python as well as SQL).&lt;/p&gt;

&lt;h3&gt;
  
  
  Decoupled, service-oriented approach
&lt;/h3&gt;

&lt;p&gt;As I hinted at the start of the article, a decoupled, service-oriented approach is the default paradigm for any kind of application back-end (which is always inherently real-time in nature). That’s why I used AWS Fargate to describe how it works in typical software development. &lt;/p&gt;

&lt;p&gt;Online ML models and real-time feature computations are increasingly making their way into standard back-end architectures and are often deployed using the same approach. However, this leaves the data scientist with the difficult choice of trying to learn Docker and AWS Fargate or relying entirely on an ML engineer to get their code deployed. Most large companies go for the latter option, but in lean startups, the data scientist has to build up some infrastructure chops. In a sense, their job security depends on them becoming more T-shaped. As one Hacker News commenter &lt;a href="https://news.ycombinator.com/item?id=28656740"&gt;put it&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;“The unfortunate truth the school of hard knocks has shown me is that someone without the "roll your sleeves up" attitude to learn Docker is generally speaking just not going to be that effective when push comes to shove. Now if you're using tools to abstract the time of data scientists who are CAPABLE of learning Docker, that is a different story.”&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Indeed, this is one of the use cases that Quix was designed to address: empowering those data scientists who are capable of learning Docker but whose value lies in focusing on the processing logic (and algorithms). The Quix platform is intended to abstract away much of the infrastructure complexity associated with Docker and Kubernetes. This allows data scientists to develop, deploy and test their own code in staging environments, within containerised services that interact with Kafka to produce and consume data. It also encourages them to write better code, because performance is often a trade-off based on the shape of the data: they can make better optimisation decisions when they can actually see how system resources cope with production-level data volumes.&lt;/p&gt;

&lt;p&gt;In summary, tools like Bytewax and Stream Designer are a good fit if you’re already familiar with the plan and run approach and are making the first steps in moving from batch to real-time stream processing. The way you configure workflows is fairly similar to how you would do it in tools like Dagster and Airflow.&lt;/p&gt;

&lt;p&gt;If you have more complex real-time processing requirements where each step in your pipeline needs to be scaled separately, then Quix would be a superior choice to something like Fargate. If you have an army of machine learning engineers to handle the infrastructure, then by all means go for Fargate, or even build something in-house. But if you want to balance simplicity and configurability in the same manner as Metaflow, definitely give &lt;a href="https://quix.io/signup"&gt;Quix a try&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Feature Engineering Has a Language Problem</title>
      <dc:creator>Tun</dc:creator>
      <pubDate>Tue, 04 Jul 2023 14:12:12 +0000</pubDate>
      <link>https://forem.com/stereosky/feature-engineering-has-a-language-problem-47bi</link>
      <guid>https://forem.com/stereosky/feature-engineering-has-a-language-problem-47bi</guid>
      <description>&lt;p&gt;Feature engineering is a crucial part of any machine learning (ML) workflow because it enables more complex models to be created than with raw data alone, but it's also one of the most difficult to manage. It's afflicted by a language barrier—a difference in the languages used to encode processing logic. To put it simply, data scientists define their feature computations in one language (e.g. Python or SQL) and data engineers often need to rewrite this logic in another language (e.g. Scala or Java).&lt;/p&gt;

&lt;p&gt;My colleague Mike touched on the reasons for this in a previous article "&lt;a href="https://quix.io/blog/bridging-the-impedance-gap/"&gt;Bridging the gap between data scientists and engineers in machine learning workflows&lt;/a&gt;", but I want to zoom in on what exactly this process entails and explore some ideas on how to remove some of the friction.&lt;/p&gt;

&lt;h2&gt;
  
  
  When do teams encounter language friction?
&lt;/h2&gt;

&lt;p&gt;This problem starts to crop up as companies mature in their level of data sophistication. In-house ML isn't even worth considering until a company has a reliable data pipeline in place to supply models with training data.&lt;/p&gt;

&lt;p&gt;However, as data availability and data quality gradually improve, data teams start to create more sophisticated batch processing pipelines that incorporate machine learning. Models are trained offline, and their outputs can begin as artifacts such as CSV files that are assessed by humans, before progressing to other output types such as class labels in the case of classification models.&lt;/p&gt;

&lt;p&gt;Feature transformations as well as training and inference pipelines written by data scientists usually aren't optimised for speed, so ML engineers often rewrite them to run faster. Rewriting the logic for feature engineering is the first place to look for performance gains.&lt;/p&gt;

&lt;p&gt;Once an offline ML pipeline has reached a stable state, many companies will look to leverage that data to enhance their product more directly. This often leads to ML models being integrated into application architectures so that, for example, web applications can adapt to customer requirements in real time.&lt;/p&gt;

&lt;p&gt;Thus, machine learning as a discipline needs to morph from being an experimental, sporadic, offline process into a repeatable software delivery process. Model files need to be deployed online and produce results in a timely manner. Likewise, the feature computation code from data scientists needs to be adapted for a production environment so the computations can run online. This enables the models to make predictions based on fresh features.&lt;/p&gt;

&lt;p&gt;It's at this stage when the impact of language friction starts to become a wider problem:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7z1sNZFK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o4ti17q45bbocxafmwla.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7z1sNZFK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/o4ti17q45bbocxafmwla.png" alt="Image description" width="707" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Rather than explain the theory and best practices behind feature engineering, I'd like to illustrate the language barrier with an example scenario.&lt;/p&gt;

&lt;h2&gt;
  
  
  An example scenario: feature engineering for AI-powered market predictions
&lt;/h2&gt;

&lt;p&gt;One of the most studied yet mysterious applications of machine learning is using it to predict the movement of certain financial markets. Since the predictions can influence the movement of the price, most organisations keep their prediction models under wraps. However, some trading apps are experimenting with some form of AI-powered prediction. This is especially prevalent in cryptocurrency trading where all trading data is publicly visible on the blockchain.&lt;/p&gt;

&lt;p&gt;For example, the SwissBorg trading app features an ML-powered "CyBorg Predictor" that forecasts price movements for certain assets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9CxcJ1mp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aksm77dr2g9hy9jbdq6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9CxcJ1mp--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/aksm77dr2g9hy9jbdq6k.png" alt="Image description" width="800" height="288"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="//swissborg.com"&gt;swissborg.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is a canonical example where real-time ML predictions can bring tangible business value (assuming the predictions are accurate!) so it lends itself nicely to an analysis of online feature engineering.&lt;/p&gt;

&lt;p&gt;So, let's say that you work for an up-and-coming crypto trading app that wants to introduce similar functionality.&lt;/p&gt;

&lt;p&gt;The key features that you need to train a machine learning model are the OHLC data points: the open, high, low and closing prices for a given time window. This data is typically visualised in the form of a candlestick chart which traders use for technical analysis.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7FIO94YL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7g9d3zp4edrzuzxsj3rx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7FIO94YL--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/7g9d3zp4edrzuzxsj3rx.png" alt="Image description" width="750" height="411"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: Sabrina Jiang © &lt;a href="https://www.investopedia.com/terms/o/ohlcchart.asp"&gt;Investopedia&lt;/a&gt; 2020.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;There are obviously services that provide precomputed OHLC data, but for argument's sake let's say you want to train a model on features that you've computed yourself. I want to walk through the process of taking this feature from an offline exploratory scenario to a real-time production scenario.&lt;/p&gt;

&lt;p&gt;This scenario therefore has two parts: prototype and production. Note that this is an oversimplification: in reality, there are more phases involved (I highly recommend Chip Huyen's piece &lt;a href="https://huyenchip.com/2022/01/02/real-time-machine-learning-challenges-and-solutions.html"&gt;Real-time machine learning: challenges and solutions&lt;/a&gt; for more details). However, for the purposes of explaining the "language barrier", I want to keep things simple.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prototyping Offline with Python
&lt;/h2&gt;

&lt;p&gt;In the first iteration of your ML model, you might focus on one or two currencies such as ETH or Bitcoin. When prototyping, you might train the model offline on historical trading data and backtest it for prediction accuracy.&lt;/p&gt;

&lt;p&gt;Let's say your data scientist has a JSON file with some sample historical ticker data (it is ideally in the same JSON structure as data that will come from the live trading feed).&lt;/p&gt;
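&lt;p&gt;The exact shape of that JSON file isn't specified here, but judging from the fields the script below references (product_id, time, price, last_size), each record presumably looks something like the following. The values and the file-writing step are purely illustrative:&lt;/p&gt;

```python
import json

# Hypothetical sample ticker records, modelled on a typical exchange
# ticker feed. Field names match those the pandas script references:
# product_id, price, last_size and an ISO-8601 timestamp.
sample_ticks = [
    {"product_id": "ETH-USD", "price": "1846.55",
     "last_size": "0.52", "time": "2023-06-09T12:26:51.360251Z"},
    {"product_id": "ETH-USD", "price": "1846.01",
     "last_size": "1.10", "time": "2023-06-09T12:45:03.118400Z"},
    {"product_id": "BTC-USD", "price": "26641.20",
     "last_size": "0.03", "time": "2023-06-09T12:45:04.002910Z"},
]

# Write the sample so the prototyping script can read it back.
with open("ticker_data.json", "w") as file:
    json.dump(sample_ticks, file)
```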

&lt;p&gt;Assuming they're using Python for prototyping, they might first calculate ETH's 1-hour OHLC data like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Load raw ticker data from the JSON file
&lt;/span&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nb"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'ticker_data.json'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'r'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;ticker_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Convert ticker data to a pandas DataFrame
&lt;/span&gt;&lt;span class="n"&gt;ticker_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ticker_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Only keep rows with "product\_id" equals "ETH-USD"
&lt;/span&gt;&lt;span class="n"&gt;eth_usd_ticker_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ticker_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ticker_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;"product_id"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s"&gt;"ETH-USD"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Convert the time column to pandas datetime
&lt;/span&gt;&lt;span class="n"&gt;eth_usd_ticker_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'time'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;to_datetime&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;eth_usd_ticker_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'time'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Set the time column as the DataFrame index
&lt;/span&gt;&lt;span class="n"&gt;eth_usd_ticker_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eth_usd_ticker_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'time'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate the OHLC data based on a 1-minute interval
&lt;/span&gt;&lt;span class="n"&gt;ohlc_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eth_usd_ticker_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'price'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'1H'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;origin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'start'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;agg&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="s"&gt;"open"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"first"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"high"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"max"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"low"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"min"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="s"&gt;"close"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"last"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Calculate the volume data based on a 1-minute interval
&lt;/span&gt;&lt;span class="n"&gt;volume_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;eth_usd_ticker_df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'last_size'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;astype&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;resample&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'1H'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;origin&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;'start'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nb"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Combine OHLC and volume data
&lt;/span&gt;&lt;span class="n"&gt;ohlc_volume_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;concat&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;ohlc_df&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;volume_df&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;axis&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ohlc_volume_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This script partitions the trading data into fixed 1-hour intervals, producing a result resembling the following.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;date&lt;/th&gt;
&lt;th&gt;time&lt;/th&gt;
&lt;th&gt;open&lt;/th&gt;
&lt;th&gt;high&lt;/th&gt;
&lt;th&gt;low&lt;/th&gt;
&lt;th&gt;close&lt;/th&gt;
&lt;th&gt;last_size&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;09/06/2023&lt;/td&gt;
&lt;td&gt;12:26:51.360251&lt;/td&gt;
&lt;td&gt;1846.55&lt;/td&gt;
&lt;td&gt;1846.56&lt;/td&gt;
&lt;td&gt;1846.01&lt;/td&gt;
&lt;td&gt;1846.55&lt;/td&gt;
&lt;td&gt;13.27384&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09/06/2023&lt;/td&gt;
&lt;td&gt;13:26:51.360251&lt;/td&gt;
&lt;td&gt;1846.53&lt;/td&gt;
&lt;td&gt;1846.53&lt;/td&gt;
&lt;td&gt;1846.22&lt;/td&gt;
&lt;td&gt;1846.22&lt;/td&gt;
&lt;td&gt;2.141272&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;09/06/2023&lt;/td&gt;
&lt;td&gt;14:26:51.360251&lt;/td&gt;
&lt;td&gt;1864.99&lt;/td&gt;
&lt;td&gt;1864.99&lt;/td&gt;
&lt;td&gt;1864.68&lt;/td&gt;
&lt;td&gt;1864.68&lt;/td&gt;
&lt;td&gt;2.16268&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This data is OK for prototyping the ML model or providing batched long-term predictions, but not great for fine-grained real-time predictions. Prices can fluctuate wildly even within a 1-hour period so you'll want to catch those as they happen. This means putting the ML model online and combining the historical data with a stream of real-time trading data.&lt;/p&gt;
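&lt;p&gt;It helps to pin down what "online" actually changes: instead of resampling a static DataFrame after the fact, every incoming tick must update the current window's aggregates the moment it arrives. A minimal pure-Python sketch of that incremental update (ignoring windowing, shared state and fault tolerance, which are exactly the parts a stream processor adds):&lt;/p&gt;

```python
class OhlcAccumulator:
    """Incrementally maintains open/high/low/close/volume for one window."""

    def __init__(self):
        self.open = self.high = self.low = self.close = None
        self.volume = 0.0

    def update(self, price: float, size: float) -> None:
        # The first tick of the window sets the open; any tick may move
        # the high/low, and the latest tick is always the close.
        if self.open is None:
            self.open = self.high = self.low = price
        self.high = max(self.high, price)
        self.low = min(self.low, price)
        self.close = price
        self.volume += size


acc = OhlcAccumulator()
for price, size in [(1846.55, 0.5), (1846.56, 0.2), (1846.01, 1.0)]:
    acc.update(price, size)

print(acc.open, acc.high, acc.low, acc.close, round(acc.volume, 2))
# 1846.55 1846.56 1846.01 1846.01 1.7
```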

&lt;h2&gt;
  
  
  Calculating Features Online with Java
&lt;/h2&gt;

&lt;p&gt;Now suppose that you have adapted the model to use features that are a combination of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1-hour intervals for the last 30 days.&lt;/li&gt;
&lt;li&gt;1-minute intervals for the current day.&lt;/li&gt;
&lt;li&gt;A sliding window of the last 60 seconds updating every second.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You want to put this model online so that it provides a stream of predictions that update in real time. These predictions might be used to populate a real-time dashboard or power automated trading bots.&lt;/p&gt;
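&lt;p&gt;Of those three resolutions, the sliding window is the most demanding: old ticks have to be evicted continuously as new ones arrive. A rough stdlib sketch of just the eviction-and-aggregate step (a real engine such as Flink also deals with out-of-order events, keyed state and parallelism):&lt;/p&gt;

```python
from collections import deque

def sliding_ohlc(window, now, max_age=60.0):
    """Evict ticks older than max_age seconds, then aggregate the rest.

    `window` is a deque of (timestamp, price, size) tuples, oldest first.
    """
    while window and now - window[0][0] > max_age:
        window.popleft()
    prices = [price for _, price, _ in window]
    return {
        "open": prices[0],
        "high": max(prices),
        "low": min(prices),
        "close": prices[-1],
        "volume": sum(size for _, _, size in window),
    }

# Three ticks; by t=85 the first one has aged out of the 60-second window.
ticks = deque([(0.0, 1846.55, 0.5), (30.0, 1846.20, 0.2), (70.0, 1847.10, 1.0)])
window_features = sliding_ohlc(ticks, now=85.0)
print(window_features)
```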

&lt;p&gt;This requires the OHLC calculations to be refactored considerably. This refactoring is influenced by a number of factors that contribute to the so-called language barrier that slows down ML workflows.&lt;/p&gt;

&lt;p&gt;These factors are as follows:&lt;/p&gt;

&lt;h4&gt;
  
  
  Latency and throughput
&lt;/h4&gt;

&lt;p&gt;The query now needs to run on a continuous unbounded stream of data rather than a table. It also needs to maintain a specific rate of throughput to stop the predictions from getting stale. This requires a purpose-built stream-processing engine that can maintain throughput on high volumes of trading data.&lt;/p&gt;

&lt;p&gt;Apache Flink is one of the most popular choices for such use cases, and although it supports SQL, many developers choose to write processing logic using Flink's lower-level APIs. Calculations run faster when these APIs are accessed directly (rather than through an abstraction layer such as PyFlink or SQL).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="nd"&gt;@Override&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;Tuple5&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;merge&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Tuple5&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Tuple5&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Integer&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Tuple5&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&amp;gt;(&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f0&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;                                   &lt;span class="c1"&gt;// Open (min)&lt;/span&gt;
        &lt;span class="nc"&gt;Math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f1&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f1&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;                   &lt;span class="c1"&gt;// High&lt;/span&gt;
        &lt;span class="nc"&gt;Math&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;min&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f2&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f2&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt;                   &lt;span class="c1"&gt;// Low&lt;/span&gt;
        &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f3&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt;                                   &lt;span class="c1"&gt;// Close (latest value)&lt;/span&gt;
        &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f4&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;b&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;f4&lt;/span&gt;                             &lt;span class="c1"&gt;// Volume&lt;/span&gt;
    &lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;An excerpt of the math operations after refactoring in Flink.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Different dependencies
&lt;/h4&gt;

&lt;p&gt;If you're going to translate from SQL or Python into Java for Flink, then you'll also need to import different dependencies, which must be accessible in the execution environment. If you've created a custom function in the form of a UDF, you need to ensure that it is also packaged with the job and deployed to the Flink cluster.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.streaming.api.windowing.time.Time&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.api.common.functions.AggregateFunction&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.streaming.api.datastream.*&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.streaming.api.environment.StreamExecutionEnvironment&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.api.java.tuple.*&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.streaming.api.windowing.windows.TimeWindow&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.api.common.serialization.SimpleStringSchema&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="nn"&gt;java.util.Properties&lt;/span&gt;&lt;span class="o"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;An excerpt of all the extra dependencies required after refactoring code into Java.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Real-time data sources and sinks
&lt;/h4&gt;

&lt;p&gt;To calculate OHLC data on a sliding window, the query now needs to use a different data source. Instead of connecting to a database and querying a table, the process needs to operate on some kind of message queue, which is typically a Kafka topic.&lt;/p&gt;

&lt;p&gt;Thus a lot of "connector code" needs to be added so that the process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Connects to a Kafka message broker.&lt;/li&gt;
&lt;li&gt;Reads raw data from one topic and writes results to a second topic.&lt;/li&gt;
&lt;li&gt;Efficiently serialises and deserialises the data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There is also more connector code required to write the feature values themselves to an online feature store such as Redis.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Create Kafka consumer properties&lt;/span&gt;
&lt;span class="nc"&gt;Properties&lt;/span&gt; &lt;span class="n"&gt;consumerProps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Properties&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;consumerProps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"bootstrap.servers"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"myserver:9092"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;span class="n"&gt;consumerProps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"group.id"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"flink-ohlc-group"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Create Kafka producer properties&lt;/span&gt;
&lt;span class="nc"&gt;Properties&lt;/span&gt; &lt;span class="n"&gt;producerProps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Properties&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
&lt;span class="n"&gt;producerProps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;setProperty&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"bootstrap.servers"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"myserver:9092"&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;A small excerpt of the extensive Kafka configuration required for Flink.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Windowed aggregations and state management
&lt;/h4&gt;

&lt;p&gt;In the prototyping phase, you might already start testing sliding window calculations, but you'd probably use an in-memory dictionary to store the state. This works fine on one computer. Moving to production, however, you would need a processing engine that maintains shared state in a fault-tolerant manner. This is again why many companies choose Apache Flink, which is famous for its reliable stateful processing in a distributed computing environment.&lt;/p&gt;

&lt;p&gt;If a replica of a process somehow terminates when it's in the middle of calculating OHLC data for a sliding window, another replica can come and pick up where the previous process left off because the calculation steps are continuously written to a shared storage location.&lt;br&gt;
&lt;/p&gt;
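&lt;p&gt;Flink achieves this with periodic checkpoints of operator state. Reduced to a toy, single-process illustration (this is not Flink's actual distributed snapshot protocol), the idea is that aggregate state is persisted every few events, so a replacement worker resumes from the last snapshot instead of starting over:&lt;/p&gt;

```python
import json
import os
import tempfile

def process(ticks, checkpoint_path, every=2):
    """Fold ticks into a running aggregate, snapshotting state periodically."""
    if os.path.exists(checkpoint_path):
        # A replacement worker resumes from the last snapshot.
        with open(checkpoint_path) as f:
            state = json.load(f)
    else:
        state = {"count": 0, "high": float("-inf")}

    # Ticks folded in before the last snapshot are skipped; anything
    # after it is replayed (at-least-once semantics in this toy).
    for i, price in enumerate(ticks[state["count"]:], start=state["count"]):
        state["high"] = max(state["high"], price)
        state["count"] = i + 1
        if state["count"] % every == 0:  # periodic snapshot
            with open(checkpoint_path, "w") as f:
                json.dump(state, f)
    return state

path = os.path.join(tempfile.gettempdir(), "ohlc_checkpoint.json")
if os.path.exists(path):
    os.remove(path)

state = process([1846.5, 1847.0, 1846.2], path)
print(state)  # {'count': 3, 'high': 1847.0}

# Simulate a crashed worker being replaced: the new one reloads the last
# snapshot (taken at count=2) and replays only the tail of the feed.
state = process([1846.5, 1847.0, 1846.2, 1848.0], path)
print(state)  # {'count': 4, 'high': 1848.0}
```

&lt;p&gt;Flink's real mechanism coordinates snapshots across many parallel workers with pluggable state backends; the sketch only conveys the resume-from-snapshot idea.&lt;/p&gt;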

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Calculate the OHLC data for each ticker over a 30-second sliding window&lt;/span&gt;
&lt;span class="nc"&gt;DataStream&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Tuple5&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;Double&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ohlcStream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tickStream&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;keyBy&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tick&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;tick&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;ticker&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;// Group by ticker&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;timeWindow&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="o"&gt;),&lt;/span&gt; &lt;span class="nc"&gt;Time&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;seconds&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="o"&gt;))&lt;/span&gt;  &lt;span class="c1"&gt;// Sliding window of 30 seconds with 1 second slide&lt;/span&gt;
    &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;aggregate&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OhlcAggregator&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;em&gt;An excerpt of a sliding window calculation using Flink's DataStream API in Java.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;As you can see, that's a lot of refactoring. And I haven't even touched on other process changes such as adding the feature to a feature catalog, interacting with an online feature store, testing, deploying and monitoring the online feature calculation.&lt;/p&gt;

&lt;p&gt;But the top-to-bottom rewrite alone is enough to slow down a feature's journey from prototype to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Solutions to the language barrier
&lt;/h2&gt;

&lt;p&gt;If this problem is so ubiquitous, how do the big players solve it? It turns out that Netflix, Uber and DoorDash have all built their own sophisticated feature platforms that handle feature management as well as stream and batch processing. They still have the feature translation issue, but they're able to automate the translation process for common calculations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Unified Feature Platforms
&lt;/h3&gt;

&lt;p&gt;The following table comes from another of Chip Huyen's brilliant pieces, this time "&lt;a href="https://huyenchip.com/2023/01/08/self-serve-feature-platforms.html"&gt;Self-serve feature platforms: architectures and APIs&lt;/a&gt;". It illustrates just how many proprietary, custom-built feature platforms are already out there in the wild. Note that features are typically still defined in multiple languages.&lt;/p&gt;

&lt;h4&gt;
  
  
  Comparison of feature platforms
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Feature store&lt;/th&gt;
&lt;th&gt;Feature API &lt;em&gt;(transformation logic &amp;gt; feature logic)&lt;/em&gt;
&lt;/th&gt;
&lt;th&gt;Stream compute engine&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LinkedIn&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/linkedin/venice"&gt;Venice&lt;/a&gt;, &lt;a href="https://www.youtube.com/watch?v=vksWF8UgWXc"&gt;Fedex&lt;/a&gt;
&lt;/td&gt;
&lt;td&gt;Python &amp;gt; Python&lt;/td&gt;
&lt;td&gt;Samza, Flink&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Airbnb&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;HBase-based&lt;/td&gt;
&lt;td&gt;Python &amp;gt; Python&lt;/td&gt;
&lt;td&gt;Spark Streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Instacart&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Scylla, Redis&lt;/td&gt;
&lt;td&gt;? &amp;gt; YAML&lt;/td&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;DoorDash&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Redis, CockroachDB&lt;/td&gt;
&lt;td&gt;SQL &amp;gt; YAML&lt;/td&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Snap&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://github.com/Snapchat/KeyDB"&gt;KeyDB&lt;/a&gt; (multithreaded fork of Redis)&lt;/td&gt;
&lt;td&gt;SQL &amp;gt; YAML&lt;/td&gt;
&lt;td&gt;Spark Streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Stripe&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In-house, Redis&lt;/td&gt;
&lt;td&gt;Scala &amp;gt; ?&lt;/td&gt;
&lt;td&gt;Spark Streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Meta (FB)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Scala-like &amp;gt; ?&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://www.youtube.com/watch?v=DNI54vc1ALQ"&gt;XStream&lt;/a&gt;, &lt;a href="https://github.com/facebookincubator/velox"&gt;Velox&lt;/a&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Spotify&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bigtable&lt;/td&gt;
&lt;td&gt;Flink SQL &amp;gt; ?&lt;/td&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Uber&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Cassandra, DynamoDB&lt;/td&gt;
&lt;td&gt;DSL &amp;gt; ?&lt;/td&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Lyft&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Redis, DynamoDB&lt;/td&gt;
&lt;td&gt;SQL &amp;gt; YAML&lt;/td&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Pinterest&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;In-house, memcached&lt;/td&gt;
&lt;td&gt;R&lt;/td&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Criteo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Couchbase&lt;/td&gt;
&lt;td&gt;SQL &amp;gt; JSON&lt;/td&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Binance&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;Flink SQL &amp;gt; Python&lt;/td&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Twitter&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;a href="https://blog.twitter.com/engineering/en_us/a/2014/manhattan-our-real-time-multi-tenant-distributed-database-for-twitter-scale"&gt;Manhattan&lt;/a&gt;, CockroachDB&lt;/td&gt;
&lt;td&gt;Scala&lt;/td&gt;
&lt;td&gt;&lt;a href="https://github.com/apache/incubator-heron"&gt;Heron&lt;/a&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gojek&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;DynamoDB&lt;/td&gt;
&lt;td&gt;SQL &amp;gt; JSON&lt;/td&gt;
&lt;td&gt;Flink&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Etsy&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Bigtable&lt;/td&gt;
&lt;td&gt;Scala &amp;gt; ?&lt;/td&gt;
&lt;td&gt;Dataflow&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Source: “&lt;a href="https://huyenchip.com/2023/01/08/self-serve-feature-platforms.html"&gt;Self-serve feature platforms: architectures and APIs&lt;/a&gt;" by Chip Huyen.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Yet not every company has the time or resources to build their own in-house feature platform. Now that more companies are moving into the later stages of the ML maturity model, there is increasing demand for simpler end-to-end solutions that help ease the language barrier while eliminating infrastructural complexity.&lt;/p&gt;

&lt;p&gt;There are now general feature platforms such as &lt;a href="https://www.tecton.ai/blog/what-is-a-feature-platform/"&gt;Tecton&lt;/a&gt; (proprietary) and &lt;a href="https://github.com/feathr-ai/feathr"&gt;Feathr&lt;/a&gt; (open source), which aim to keep the batch and streaming code tightly synchronised while handling the actual processing itself. This in itself is enough to cut down the time to production. When LinkedIn &lt;a href="https://engineering.linkedin.com/blog/2022/open-sourcing-feathr---linkedin-s-feature-store-for-productive-m"&gt;announced that they were open sourcing Feathr&lt;/a&gt; in April 2022, they revealed that it had "&lt;em&gt;reduced engineering time required for adding and experimenting with new features from weeks to days&lt;/em&gt;".&lt;/p&gt;

&lt;p&gt;Tecton goes further and removes the headache of having to provision extra infrastructure (assuming that you have Databricks, Amazon EMR, or Snowflake set up as an offline feature store). They provide an end-to-end platform for managing, storing and computing online and offline features.&lt;/p&gt;

&lt;p&gt;The following screenshot from Tecton should give you a rough idea of how these feature platforms work.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gZ_awQgJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kts2hfzesn1egkyfzm9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gZ_awQgJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kts2hfzesn1egkyfzm9e.png" alt="Image description" width="749" height="511"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: &lt;a href="https://www.tecton.ai/#feature-logic"&gt;tecton.ai&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;You essentially store variants of the same feature transformation in one "entry" along with some configuration variables that affect the scope of the transformation. Connections to external sources such as Kafka are defined elsewhere in Tecton's configuration, so there is a clean separation of concerns between the transformation code and the streaming transport code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Caveats
&lt;/h4&gt;

&lt;p&gt;Such systems are still intended for companies that are fairly advanced in their ML maturity. They are, in some ways, designed to prevent large enterprises from repeatedly building their own custom feature platforms (although many still do). For this reason, these platforms are still fairly complex, probably because they need to address the highly specific requirements of many enterprises with mature MLOps teams. If you're starting off with a limited feature set, there is a risk that the additional complexity could offset the time savings you gain from a more structured feature management pipeline.&lt;/p&gt;

&lt;p&gt;The other issue is that they still use Spark or Flink under the hood for stream processing, which means that code is still being translated or 'transpiled' at some level. Tecton, for example, uses Spark Structured Streaming. Spark's native API is written in Scala, so, as with Flink, the Python API is a wrapper around the native API, and using it can introduce extra latency. Additionally, Spark Structured Streaming uses a micro-batch processing model, which generally has higher latency than event-driven streaming systems such as Apache Flink or Kafka Streams. It also lacks the built-in complex event processing (CEP) features that frameworks like Apache Flink offer.&lt;/p&gt;
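&lt;p&gt;To see why the micro-batch model adds latency, here is a toy pure-Python simulation (not Spark code; the function names and one-second interval are illustrative assumptions). In a micro-batch system a record is only emitted when its batch interval closes, so a record arriving just after a batch opens waits almost a full interval; an event-driven processor handles each record on arrival.&lt;/p&gt;

```python
# Toy model of micro-batch vs per-event latency (illustrative only).

def micro_batch_latencies(arrival_times, batch_interval):
    """Each record is only emitted when its batch interval closes."""
    latencies = []
    for t in arrival_times:
        batch_end = (t // batch_interval + 1) * batch_interval
        latencies.append(batch_end - t)  # time spent waiting for the batch
    return latencies

def per_event_latencies(arrival_times, processing_cost=0.001):
    """An event-driven processor handles each record as it arrives."""
    return [processing_cost for _ in arrival_times]

# A record arriving at t=0.1s into a 1s batch waits ~0.9s before it is
# processed; one arriving at t=1.9s waits only ~0.1s.
waits = micro_batch_latencies([0.1, 0.4, 1.05, 1.9], batch_interval=1.0)
```

&lt;p&gt;Real-world numbers depend on the trigger interval and per-batch processing time, but the shape of the trade-off is the same: average micro-batch latency sits around half the batch interval, which is why sub-second use cases tend to favour event-driven engines.&lt;/p&gt;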

&lt;p&gt;However, not every application requires CEP or very low-latency processing (sub-second or milliseconds), so in most cases the stream processors built into these feature platforms will do the job.&lt;/p&gt;

&lt;p&gt;But what if you want a simpler solution that gives you more direct control over the stream processing logic without requiring data scientists to grapple with Java or Scala? That's where the other type of solution comes into play: pure Python stream processing frameworks.&lt;/p&gt;

&lt;h3&gt;Pure Python stream processing frameworks&lt;/h3&gt;

&lt;p&gt;A pure Python stream processing framework can enable data scientists to prototype with streaming data very early in the process. Such frameworks make it easy to connect to Kafka and run the typical operations you would perform on an unbounded stream (e.g. sliding window aggregations). A data scientist might still build their logic on a batch dataset first, but it becomes very simple to adapt that same logic for streaming data. This reduces the language barrier, because the same prototype code can be used in production with minimal refactoring.&lt;/p&gt;
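&lt;p&gt;As an illustration of the kind of operation involved, here is a minimal plain-Python sketch (my own, not tied to any particular framework) of a time-based sliding window mean. The same class could be driven by a Kafka consumer loop in production or replayed over a batch dataset while prototyping, which is exactly the portability these frameworks aim for.&lt;/p&gt;

```python
from collections import deque

class SlidingWindowMean:
    """Keep the last `window_seconds` of values and report their mean."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.items = deque()   # (timestamp, value) pairs, oldest first
        self.total = 0.0

    def update(self, ts, value):
        """Add one record and return the current windowed mean."""
        self.items.append((ts, value))
        self.total += value
        # Evict values that have fallen out of the trailing window.
        while self.items and self.items[0][0] <= ts - self.window:
            _, old_value = self.items.popleft()
            self.total -= old_value
        return self.total / len(self.items)
```

&lt;p&gt;Keeping a running total alongside the deque makes each update O(1) amortised, which matters when the stream is unbounded.&lt;/p&gt;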

&lt;p&gt;In an ideal scenario, the data scientists can also use Python to define the processing workflows. Many features need to be calculated in multiple steps, so it helps to give data scientists more autonomy in defining workflows as well as the transformation logic itself.&lt;/p&gt;

&lt;p&gt;For example, Faust and Bytewax are both pure Python stream processing frameworks that can be used in complex processing pipelines.&lt;/p&gt;

&lt;h4&gt;Faust&lt;/h4&gt;

&lt;p&gt;Faust was open sourced by Robinhood in 2018 and has since been taken over by the open source community.&lt;/p&gt;

&lt;p&gt;When it was &lt;a href="https://robinhood.engineering/faust-stream-processing-for-python-a66d3a51212d"&gt;first released&lt;/a&gt;, Faust looked very promising. For example, Robinhood's engineering team published a compelling blog post on how they used Faust in combination with Apache Airflow to &lt;a href="https://robinhood.engineering/how-we-built-a-better-news-system-8c30b77067bc"&gt;build a better news system&lt;/a&gt;. They used Faust commands via Airflow to continuously pull data from various sources (such as RSS feeds and aggregators) while using Kafka to store the results of every processing step. Faust also supports scalable stateful processing with so-called "stateful tables" and can be configured for exactly-once processing via the "&lt;a href="https://faust.readthedocs.io/en/latest/userguide/settings.html#processing-guarantee"&gt;processing_guarantee&lt;/a&gt;" setting.&lt;/p&gt;

&lt;p&gt;However, it appears that Robinhood has abandoned Faust. It's not clear why exactly, but there was &lt;a href="https://www.reddit.com/r/dataengineering/comments/ubzvnc/why_did_robinhood_abandon_faust/"&gt;plenty of speculation on Reddit&lt;/a&gt;. There is now a fork of Robinhood's original Faust repo which is more actively maintained by the open source community. However, it still has a lot of open bugs which are show-stoppers for some teams (see this &lt;a href="https://kapernikov.com/a-comparison-of-stream-processing-frameworks/"&gt;review of stream processing frameworks&lt;/a&gt; for more details on those bugs).&lt;/p&gt;

&lt;h4&gt;Bytewax&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://bytewax.io/"&gt;Bytewax&lt;/a&gt; is a lot newer, launched in early 2021 and open-sourced in February 2022, but is quickly gaining traction due to it being open source and very user-friendly for data scientists. Unlike Faust, Bytewax aims to be a complete stream processing platform and includes functionality to enable data scientists to build their own dataflows—in other words, processing pipelines that include multiple steps that can be represented as nodes in a graph.&lt;/p&gt;

&lt;p&gt;In fact, the example OHLC scenario I provided earlier was inspired by a tutorial that uses a simple Bytewax dataflow to read data from a Coinbase WebSocket and write the OHLC feature values to a feature store (Hopsworks).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BHTCo5od--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68l9q6jnubb0ppxdn6ou.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BHTCo5od--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/68l9q6jnubb0ppxdn6ou.png" alt="Image description" width="800" height="554"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Source: "&lt;a href="https://paulabartabajo.substack.com/p/real-world-ml-019-deploy-a-real-time"&gt;Real-World ML #019: Deploy a real-time feature pipeline to AWS&lt;/a&gt;" by Pau Labarta Bajo.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;Caveats&lt;/h4&gt;

&lt;p&gt;Given that the official repo seems to be abandoned, the caveats with Faust should be clear. Although the fork is more active, it's still uncertain when some of the more serious bugs will be fixed. It's worth noting that we also ran into these bugs when benchmarking our own Python library against Faust.&lt;/p&gt;

&lt;p&gt;Bytewax is still fairly new, so it will take a while for reports about how it fares in production to trickle through the ecosystem. When it comes to deployment, however, you'll still have to deal with some infrastructural complexity, at least for now (they have a managed platform in the works). Looking at their &lt;a href="https://www.bytewax.io/docs/deployment/container"&gt;deployment documentation&lt;/a&gt;, it's clear that they expect readers to have some knowledge of the infrastructure that will host the stream processing logic. You can run dataflows in local Docker containers, in Kubernetes, on AWS EC2 instances, or on GCP VM instances. All of these require setup and configuration work that is probably uninteresting to a data scientist and better handled by a friendly (ML) engineer. Much of this complexity should go away once their platform becomes generally available.&lt;/p&gt;

&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;By now it should be clear that the data and ML industry is well aware of the language barrier affecting feature engineering in real-time ML workflows. It has always existed, but it was historically addressed with in-house solutions hidden from the public. Real-time inference on real-time features was practised by a chosen few with highly specific requirements, so it made sense for them to build their own solutions. Now, with all the increased attention on AI, we're seeing a democratisation of many aspects of MLOps workflows, and there are more standardised approaches to tackling the language barrier, such as all-in-one feature platforms and pure Python stream processing frameworks.&lt;/p&gt;

&lt;p&gt;Although I've focused on Faust and Bytewax, it would be remiss of me not to mention our own platform &lt;a href="https://quix.io/signup"&gt;Quix&lt;/a&gt;, which runs &lt;a href="https://github.com/quixio/quix-streams"&gt;Quix Streams&lt;/a&gt;, our open source stream processing library. The processing model is not unlike that of Bytewax, but instead of defining data pipelines in Python, you use the Quix Portal UI to piece together your transformation steps (for a peek at how it works in production, see this &lt;a href="https://quix.io/blog/real-time-batch-analytics-connected-bikes-quix-aws/"&gt;telemetry case study&lt;/a&gt;). The Quix platform is also a fully hosted and managed solution that uses Kafka and Kubernetes under the hood, which makes it highly scalable. We aim to solve the language barrier in the same way as Faust and Bytewax, but we also want to remove the infrastructure headache. However, infrastructure is a whole other subject, which I hope to tackle in a follow-up post. For now, I hope that my simple example scenario has helped you understand the language barrier in more detail and inspired you to plan for it when you're ready to dive into real-time feature processing.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
