<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kedro</title>
    <description>The latest articles on Forem by Kedro (@kedro).</description>
    <link>https://forem.com/kedro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F6802%2Fb8bd874f-1d94-48f7-a5a5-fdbd660bbf62.png</url>
      <title>Forem: Kedro</title>
      <link>https://forem.com/kedro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kedro"/>
    <language>en</language>
    <item>
      <title>How to integrate Kedro and Databricks Connect</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Thu, 21 Sep 2023 14:41:03 +0000</pubDate>
      <link>https://forem.com/kedro/how-to-integrate-kedro-and-databricks-connect-3ep7</link>
      <guid>https://forem.com/kedro/how-to-integrate-kedro-and-databricks-connect-3ep7</guid>
      <description>&lt;p&gt;In recent months we've updated Kedro documentation to illustrate &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/index.html" rel="noopener noreferrer"&gt;three different ways of integrating Kedro with Databricks&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You can choose a &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/databricks_deployment_workflow.html" rel="noopener noreferrer"&gt;workflow based on Databricks jobs&lt;/a&gt; to deploy a project that has finished development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For faster iteration on changes, the workflow documented in &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/databricks_notebooks_development_workflow.html" rel="noopener noreferrer"&gt;"Use a Databricks workspace to develop a Kedro project"&lt;/a&gt; suits those who prefer to develop and test their projects directly within Databricks notebooks, avoiding the overhead of setting up and syncing a local development environment with Databricks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alternatively, you can work locally in an IDE as described by the workflow documented in &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/databricks_ide_development_workflow.html" rel="noopener noreferrer"&gt;"Use an IDE, dbx and Databricks Repos to develop a Kedro project"&lt;/a&gt;. You can use your IDE’s capabilities for faster, error-free development, while testing on Databricks. This is ideal if you’re in the early stages of learning Kedro, or if your project requires constant testing and adjustments. However, the experience is still not perfect: you must sync your work inside Databricks with dbx and run the pipeline inside a notebook. Debugging has a lengthy setup for each change and there is less flexibility than inside an IDE.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this blog post, Diego Lira, a Specialist Data Scientist and a client-facing member of &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/how-we-help-clients" rel="noopener noreferrer"&gt;QuantumBlack, AI by McKinsey&lt;/a&gt;, explains how to use Databricks Connect with Kedro for a development experience that works completely inside an IDE. He recommends this solution when the data-heavy parts of your pipelines are in PySpark. If part of your workflow runs in plain Python (e.g. pandas) rather than in Spark, Databricks Connect will download your dataframe to your local environment to continue running the workflow. This can cause performance issues and introduce compliance risks, because the data has left the Databricks workspace.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Databricks Connect?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/dev-tools/databricks-connect-ref.html" rel="noopener noreferrer"&gt;Databricks Connect&lt;/a&gt; is Databricks' official method of interacting with a remote Databricks instance while using a local environment.&lt;/p&gt;

&lt;p&gt;To configure Databricks Connect for use with Kedro, follow the official setup guide to create a &lt;code&gt;.databrickscfg&lt;/code&gt; file containing your access token. Install the client with &lt;code&gt;pip install databricks-connect&lt;/code&gt;; it substitutes your local SparkSession:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;databricks.connect&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DatabricksSession&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DatabricksSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
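For reference, this is the general shape of a minimal `.databrickscfg` profile the client reads. All values below are placeholders; the fields follow the official Databricks CLI configuration format, and the `cluster_id` entry is used by the hook shown later in this post:

```ini
[DEFAULT]
host       = https://adb-1234567890123456.7.azuredatabricks.net/
token      = dapi1234567890abcdef
cluster_id = 0123-456789-abcdefgh
```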



&lt;p&gt;Spark commands are sent and executed on the cluster, and results are returned to the local environment as needed. In the context of Kedro, this has an amazing effect: as long as you don’t explicitly ask for the data to be collected in your local environment, operations will be executed only when saving the outputs of your node. If you use datasets saved to a Databricks path, there will be no performance hit for transferring data between environments.&lt;/p&gt;

&lt;p&gt;This tool was recently made available as a thin client for &lt;a href="https://spark.apache.org/docs/latest/spark-connect-overview.html" rel="noopener noreferrer"&gt;Spark Connect&lt;/a&gt;, one of the highlights of Spark 3.4, and configuration is simpler than in earlier versions. If your cluster doesn’t support the current Databricks Connect, please refer to the &lt;a href="https://docs.databricks.com/en/dev-tools/databricks-connect-legacy.html" rel="noopener noreferrer"&gt;legacy documentation&lt;/a&gt;, as previous versions had different limitations.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/p9IRFSjuLBE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  How can I use a Databricks Connect workflow with Kedro?
&lt;/h2&gt;

&lt;p&gt;Databricks Connect (and Spark Connect) enables us to have a completely local development flow, while all artifacts can be remote objects. Using Delta tables for all our datasets and MLflow for model objects and tracking, nothing needs to be saved locally. Developers can take full advantage of the Databricks stack while maintaining their full IDE usage.    &lt;/p&gt;

&lt;h2&gt;
  
  
  How to use Databricks as your PySpark engine
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.kedro.org/en/stable/integrations/pyspark_integration.html" rel="noopener noreferrer"&gt;Kedro supports integration with PySpark&lt;/a&gt; through the use of Hooks. To configure and enable your Databricks session through Spark Connect, simply set up your &lt;code&gt;SPARK_REMOTE&lt;/code&gt; environment variable with your Databricks configuration. Here is an example implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro.framework.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hook_impl&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SparkHooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@hook_impl&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;after_context_created&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Initialises a SparkSession using the config
        from Databricks.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="nf"&gt;set_databricks_creds&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;_spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Builder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_databricks_creds&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Pass databricks credentials as OS variables if using the local machine.
    If you set DATABRICKS_PROFILE env variable, it will choose the desired profile on .databrickscfg,
    otherwise it will use the DEFAULT profile in databrickscfg.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;DEFAULT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DATABRICKS_PROFILE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SPARK_HOME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/databricks/spark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConfigParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;home&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.databrickscfg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# remove "https://" and final "/" from path
&lt;/span&gt;        &lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cluster_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SPARK_REMOTE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sc://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:443/;token=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;;x-databricks-cluster-id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example populates &lt;code&gt;SPARK_REMOTE&lt;/code&gt; from your local &lt;code&gt;.databrickscfg&lt;/code&gt; file. The remote connection is not set up if the project is run from inside Databricks (that is, if &lt;code&gt;SPARK_HOME&lt;/code&gt; points to Databricks), so you're still able to run it in the usual &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/databricks_ide_development_workflow.html" rel="noopener noreferrer"&gt;hybrid development flow&lt;/a&gt;. Notice that you don’t need to set up a &lt;code&gt;spark.yml&lt;/code&gt; file, as is common in other PySpark templates: you’re not passing any configuration, just using the cluster that is already in Databricks. You also don’t need to load any extra Spark files (e.g. JARs), because you are using a thin Spark Connect client.&lt;/p&gt;
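The connection string the hook assembles can be exercised in isolation. Here is a minimal, locally runnable sketch of the same logic; the host, token, and cluster id are made-up placeholders:

```python
def strip_host(url: str) -> str:
    # Mirror the hook: drop the "https://" scheme and any trailing "/".
    return url.split("//", 1)[1].strip().rstrip("/")


def build_spark_remote(host: str, token: str, cluster_id: str) -> str:
    # Spark Connect URL format expected by Databricks Connect in SPARK_REMOTE.
    return f"sc://{host}:443/;token={token};x-databricks-cluster-id={cluster_id}"


host = strip_host("https://adb-1234567890123456.7.azuredatabricks.net/")
print(build_spark_remote(host, "dapi1234567890abcdef", "0123-456789-abcdefgh"))
```

Setting the resulting string as the `SPARK_REMOTE` environment variable before the SparkSession is created is all the hook needs to do.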

&lt;p&gt;Now all the Spark calls in your pipelines will automatically use the remote cluster; there's no need to change anything in your code. However, notebooks might also be part of your project. To use the remote cluster there without relying on environment variables, you can use &lt;code&gt;DatabricksSession&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When using the remote cluster, it's best to avoid data transfers between environments, so all catalog entries should reference remote locations. Using &lt;code&gt;kedro_datasets.databricks.ManagedTableDataSet&lt;/code&gt; as your dataset type in the catalog also allows you to use Delta table features.&lt;/p&gt;
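As a sketch, a catalog entry along these lines keeps the data in Databricks end to end. The entry, table, database, and catalog names are hypothetical; check the kedro-datasets API reference for the full argument list:

```yaml
model_input_table:
  type: databricks.ManagedTableDataSet
  catalog: my_catalog        # Unity Catalog name (optional on some setups)
  database: my_schema        # schema the table is registered in
  table: model_input_table
  write_mode: overwrite
```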

&lt;h2&gt;
  
  
  How to enable MLflow on Databricks
&lt;/h2&gt;

&lt;p&gt;Using &lt;a href="https://mlflow.org/" rel="noopener noreferrer"&gt;MLflow&lt;/a&gt; to save all your artifacts directly to Databricks leads to a powerful workflow. For this you can use &lt;a href="https://github.com/Galileo-Galilei/kedro-mlflow" rel="noopener noreferrer"&gt;kedro-mlflow&lt;/a&gt;. Note that &lt;code&gt;kedro-mlflow&lt;/code&gt; is built on top of the mlflow library and although the databricks config cannot be found in its documentation, you can read more about it in the &lt;a href="https://mlflow.org/docs/latest/index.html" rel="noopener noreferrer"&gt;documentation from mlflow directly&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After doing the &lt;a href="https://kedro-mlflow.readthedocs.io/en/stable/source/02_installation/02_setup.html#activate-kedro-mlflow-in-your-kedro-project" rel="noopener noreferrer"&gt;basic setup of the library&lt;/a&gt; in your project, you should see a &lt;code&gt;mlflow.yml&lt;/code&gt; configuration file. In this file, change the following to set up your URI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;server:
    mlflow_tracking_uri: databricks # if null, will use mlflow.get_tracking_uri() as a default
    mlflow_registry_uri: databricks # if null, mlflow_tracking_uri will be used as mlflow default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up your experiment name (it should be a valid Databricks path):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;experiment:
    name: /Shared/your_experiment_name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, all your parameters will be logged, and objects such as models and metrics can be saved as MLflow objects referenced in the catalog.&lt;/p&gt;
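For instance, catalog entries along these lines persist a model and metrics as MLflow objects. This is a sketch assuming the dataset classes provided by kedro-mlflow; the entry names are hypothetical, and class names may differ between kedro-mlflow versions, so verify them against its documentation:

```yaml
regressor:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.sklearn

training_metrics:
  type: kedro_mlflow.io.metrics.MlflowMetricsDataSet
```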

&lt;h2&gt;
  
  
   Limitations of this workflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/dev-tools/databricks-connect-ref.html" rel="noopener noreferrer"&gt;Databricks Connect&lt;/a&gt;, built on top of Spark Connect, supports only recent versions of Spark. I recommend looking at the detailed limitations in the official documentation for specific guidance, such as the upload limit of only 128MB for dataframes.&lt;/p&gt;

&lt;p&gt;Users also need to be aware that &lt;code&gt;.toPandas()&lt;/code&gt; will move the data into your local pandas environment. Saving results back as MLflow objects is the preferred way to avoid local objects; the &lt;a href="https://kedro-mlflow.readthedocs.io/en/stable/source/04_experimentation_tracking/index.html" rel="noopener noreferrer"&gt;kedro-mlflow documentation&lt;/a&gt; shows examples for all supported object types.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>In the pipeline: September 2023</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Wed, 06 Sep 2023 08:49:16 +0000</pubDate>
      <link>https://forem.com/kedro/in-the-pipeline-september-2023-14ek</link>
      <guid>https://forem.com/kedro/in-the-pipeline-september-2023-14ek</guid>
      <description>&lt;p&gt;This month: a roundup of the summer’s Kedro news, some release updates, and our top picks from recent articles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kedro team news
&lt;/h2&gt;

&lt;p&gt;Over the last few months, we’ve been happy to welcome some new team members to the Kedro and Kedro-Viz teams, who have also joined our &lt;a href="https://docs.kedro.org/en/stable/contribution/technical_steering_committee.html" rel="noopener noreferrer"&gt;Technical Steering Committee&lt;/a&gt;. Welcome &lt;a href="https://github.com/DimedS" rel="noopener noreferrer"&gt;Dmitry Sorokin&lt;/a&gt;, &lt;a href="https://github.com/jitu5" rel="noopener noreferrer"&gt;Jitendra Gundaniya&lt;/a&gt;, &lt;a href="https://github.com/lrcouto" rel="noopener noreferrer"&gt;Laura Couto&lt;/a&gt;, &lt;a href="https://github.com/ravi-kumar-pilla" rel="noopener noreferrer"&gt;Ravi Kumar Pilla&lt;/a&gt;, and &lt;a href="https://github.com/vladimir-mck" rel="noopener noreferrer"&gt;Vladimir Nikolic&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;We are also pleased to announce a Kedro baby, delivered safely by one of the team, at the end of July!&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributors news
&lt;/h2&gt;

&lt;p&gt;We reworked the Kedro contributors guide in August, and moved it to the &lt;a href="https://github.com/kedro-org/kedro/wiki" rel="noopener noreferrer"&gt;Kedro wiki&lt;/a&gt;. There are loads of different ways to contribute to Kedro and if you want to get involved, we encourage you to look at the &lt;a href="https://github.com/kedro-org/kedro/wiki/Contribute-to-Kedro#how-to-contribute" rel="noopener noreferrer"&gt;table that introduces the Kedro contributor guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwstew1ucbg0zamtlgukn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwstew1ucbg0zamtlgukn.png" alt="These are some of the ways to contribute to Kedro" width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you spot an article, podcast or video that discusses Kedro, you can also contribute by adding it to the “&lt;a href="https://github.com/kedro-org/awesome-kedro" rel="noopener noreferrer"&gt;Awesome Kedro&lt;/a&gt;” repository, or letting us know on &lt;a href="https://slack.kedro.org" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There have been some amazing contributions in recent weeks, including the &lt;a href="https://pypi.org/project/vineyard-kedro/" rel="noopener noreferrer"&gt;kedro-vineyard plugin&lt;/a&gt; for efficient intermediate sharing in Kedro pipelines, &lt;a href="https://pypi.org/project/kedro-graphql/#data" rel="noopener noreferrer"&gt;kedro-graphql&lt;/a&gt; for serving Kedro projects as a GraphQL API, and &lt;a href="https://pypi.org/project/kedro-pandera/" rel="noopener noreferrer"&gt;kedro-pandera&lt;/a&gt; to bring data validation to your Kedro projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release news
&lt;/h2&gt;

&lt;p&gt;August 2023 saw a set of &lt;a href="https://linen-slack.kedro.org/t/15611709/hi-channel-we-are-excited-to-announce-several-new-releases-m#5fa69a60-84b7-4b82-adca-a16f87fac6b1" rel="noopener noreferrer"&gt;releases to introduce Python 3.11&lt;/a&gt; support across Kedro, Kedro-Viz and Kedro datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zzibh5l4xenccdt4zt8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zzibh5l4xenccdt4zt8.jpg" alt="All the Kedro things support Python 3.11" width="667" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kedro-org/kedro/releases/tag/0.18.13" rel="noopener noreferrer"&gt;Kedro version 0.18.13&lt;/a&gt; included these major features and improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Added support for Python 3.11.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Added new &lt;code&gt;OmegaConfigLoader&lt;/code&gt; features: registering of custom resolvers through &lt;code&gt;CONFIG_LOADER_ARGS&lt;/code&gt; and support for global variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Added &lt;code&gt;kedro catalog resolve&lt;/code&gt; CLI command that resolves dataset factories in the catalog with any explicit entries in the project pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simplified the &lt;code&gt;conf&lt;/code&gt; folder structure for modular pipelines and updated &lt;code&gt;kedro pipeline create&lt;/code&gt; and &lt;code&gt;kedro catalog create&lt;/code&gt; accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Made various updates to the Kedro project template and Kedro starters: use of OmegaConfigLoader, transition from &lt;code&gt;setup.py&lt;/code&gt; to &lt;code&gt;pyproject.toml&lt;/code&gt;, and updated for the simplified &lt;code&gt;conf&lt;/code&gt; structure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/kedro-org/kedro-viz/releases/tag/v6.5.0" rel="noopener noreferrer"&gt;Kedro Viz version 6.5&lt;/a&gt; added support for Python 3.11, while &lt;a href="https://github.com/kedro-org/kedro-viz/releases/tag/v6.4.0" rel="noopener noreferrer"&gt;Kedro Viz version 6.4&lt;/a&gt; added two new features: feature hint cards to highlight key features of Kedro Viz and support for displaying dataset statistics in the metadata panel for further investigation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kedro-org/kedro-plugins/releases/tag/kedro-datasets-1.7.0" rel="noopener noreferrer"&gt;Kedro Datasets version 1.7.0&lt;/a&gt; added &lt;code&gt;polars.GenericDataSet&lt;/code&gt;, a dataset backed by &lt;a href="https://www.pola.rs/" rel="noopener noreferrer"&gt;polars&lt;/a&gt;, a lightning fast dataframe package built entirely using Rust. &lt;a href="https://github.com/kedro-org/kedro-plugins/releases/tag/kedro-datasets-1.6.0" rel="noopener noreferrer"&gt;Kedro Datasets version 1.6.0&lt;/a&gt; added support for Python 3.11.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recently on the Kedro blog
&lt;/h2&gt;

&lt;p&gt;In the last few weeks we’ve published the following on the Kedro blog:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/how-to-integrate-kedro-and-databricks-connect" rel="noopener noreferrer"&gt;How to integrate Kedro and Databricks Connect&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/managed-delta-tables-kedro-dataset" rel="noopener noreferrer"&gt;How to use Databricks managed Delta tables in a Kedro project&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/kedro-dataset-for-spark-structured-streaming" rel="noopener noreferrer"&gt;A new Kedro dataset for Spark Structured Streaming&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/collaborative-experiment-tracking-in-kedro-viz" rel="noopener noreferrer"&gt;Collaborative experiment tracking in Kedro-Viz&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/build-a-custom-kedro-runner" rel="noopener noreferrer"&gt;Get up to speed: How to build a custom Kedro runner&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re always looking for collaborators to write about their experiences using Kedro, particularly if you’re working with Kedro datasets or converting an existing project to use Kedro. Get in touch with us on our &lt;a href="https://slack.kedro.org" rel="noopener noreferrer"&gt;Slack workspace&lt;/a&gt; to tell us your story.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nnuhluwy5em5oqbh0gb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nnuhluwy5em5oqbh0gb.png" alt="Powered by Kedro badge" width="526" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What we’ve learned
&lt;/h2&gt;

&lt;p&gt;We really enjoyed reading more on Medium about the &lt;a href="https://medium.com/cncf-vineyard/efficient-data-sharing-in-data-science-pipelines-on-kubernetes-bb42d36c739" rel="noopener noreferrer"&gt;Kedro Vineyard plugin&lt;/a&gt;, which uses Vineyard, a cloud-native in-memory data manager, for efficient data sharing in data science pipelines on Kubernetes.&lt;/p&gt;

&lt;p&gt;Quix published an interesting article called “&lt;a href="https://www.notion.so/In-the-pipeline-September-2023-39eeb4c7219442b3b0dfc7df9d854b4d?pvs=21" rel="noopener noreferrer"&gt;Bridging the gap between data scientists and engineers in machine learning workflows&lt;/a&gt;” which is something we regularly discuss within the team.&lt;/p&gt;

&lt;p&gt;We found a &lt;a href="https://github.com/madziejm/project-fontr" rel="noopener noreferrer"&gt;super-interesting project about font recognition&lt;/a&gt; that uses Kedro.&lt;/p&gt;

&lt;p&gt;And finally, we enjoyed reading more about &lt;a href="https://medium.com/quantumblack/kedro-goes-streaming-34e1094c354c" rel="noopener noreferrer"&gt;data streaming with Kedro&lt;/a&gt; over on the QuantumBlack Medium channel.&lt;/p&gt;

&lt;p&gt;That’s it for this edition!&lt;/p&gt;

</description>
      <category>kedro</category>
      <category>python</category>
      <category>datascience</category>
      <category>news</category>
    </item>
    <item>
      <title>How to use Databricks managed Delta tables in a Kedro project</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Thu, 17 Aug 2023 08:55:07 +0000</pubDate>
      <link>https://forem.com/kedro/how-to-use-databricks-managed-delta-tables-in-a-kedro-project-jj</link>
      <guid>https://forem.com/kedro/how-to-use-databricks-managed-delta-tables-in-a-kedro-project-jj</guid>
      <description>&lt;p&gt;In this blog post, we'll guide you through the specifics of building a Kedro project that uses managed Delta tables in Databricks using the newly-released &lt;a href="https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets/databricks" rel="noopener noreferrer"&gt;ManagedTableDataSet&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kedro?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kedro.org" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt; is a toolbox for production-ready data science. It's an open-source Python framework that enables the development of clean data science code, borrowing concepts from software engineering and applying them to machine-learning projects. A Kedro project provides scaffolding for complex data and machine-learning pipelines. It enables developers to spend less time on tedious "plumbing" and focus on solving new problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Databricks?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.databricks.com/" rel="noopener noreferrer"&gt;Databricks&lt;/a&gt; is a unified data analytics platform designed for simplifying big data processing and free-form data exploration at any scale. Based on &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;, an open-source distributed computing system, Databricks provides a collaborative cloud-based environment where users can process large amounts of data.&lt;/p&gt;

&lt;p&gt;The platform provides collaborative workspaces (notebooks) and computational resources (clusters) to run code with. Clusters are groups of nodes that run Apache Spark. Notebooks are collaborative web-based interfaces where users can write and execute code on an attached cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Kedro on Databricks?
&lt;/h2&gt;

&lt;p&gt;As we've described, Kedro offers a framework for building modular and scalable data pipelines, while Databricks provides a platform for running Spark jobs and managing data. You can combine Kedro and Databricks to build and deploy data pipelines and get the best of both worlds. Kedro's open-source framework will help you to build well-organised and maintainable pipelines, while Databricks' platform will provide you with the scalability you need to run your pipeline in production. Check out the &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/index.html" rel="noopener noreferrer"&gt;recently-updated Kedro documentation&lt;/a&gt; for a set of workflow options for integrating Kedro projects and Databricks. (Additionally, the third-party &lt;a href="https://github.com/Galileo-Galilei/kedro-mlflow" rel="noopener noreferrer"&gt;kedro-mlflow&lt;/a&gt; plugin integrates &lt;a href="https://mlflow.org/docs/latest/index.html" rel="noopener noreferrer"&gt;mlflow&lt;/a&gt; capabilities inside Kedro projects to enhance reproducibility for machine learning experimentation).&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Kedro datasets?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.kedro.org/en/stable/data/data_catalog.html" rel="noopener noreferrer"&gt;Kedro datasets&lt;/a&gt; are abstractions for reading and loading data, designed to decouple these operations from your business logic. These datasets manage reading and writing data from a variety of sources, while also ensuring consistency, tracking, and versioning. They allow users to maintain focus on core data processing, leaving data I/O tasks to Kedro.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is managed data in Databricks?
&lt;/h2&gt;

&lt;p&gt;To understand the concept of managed data in Databricks, it is first necessary to outline how Databricks organises data. At the highest level, Databricks uses metastores to store the metadata associated with data objects. Databricks Unity Catalog is one such metastore. It provides data governance and management across multiple Databricks workspaces. The metastore organises tables (where your data is stored) in a hierarchical structure.&lt;/p&gt;

&lt;p&gt;The highest level of organisation in this hierarchy is the catalog: a collection of databases (also referred to as schemas in Databricks' terminology). A database is the second level of organisation and is itself a collection of tables, which form the third level of the hierarchy.&lt;/p&gt;

&lt;p&gt;A table is structured data, stored as a directory of files on cloud object storage. By default, Databricks creates tables as Delta tables, which store data using the &lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; format. &lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; is an open-source storage format that offers ACID transactions, time travel and audit history.&lt;/p&gt;

&lt;p&gt;Databricks tables belong to one of two categories: managed and unmanaged (external) tables. Databricks manages both the data and associated metadata of managed tables. If you drop a managed table, you will delete the underlying data. The data of a managed table resides in the location of the database to which it is registered.&lt;/p&gt;

&lt;p&gt;On the other hand, for unmanaged tables, Databricks only manages the metadata. If you drop an unmanaged table, you will not delete the underlying data. These tables require a specified location during creation.&lt;/p&gt;
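The distinction shows up in the SQL used to create each kind of table. The snippet below is an illustrative sketch (the table names and the storage path are made-up placeholders): omitting `location` yields a managed table, while specifying one yields an unmanaged (external) table.

```sql
%sql
-- Managed table: Databricks controls both the data and the metadata;
-- dropping it deletes the underlying files.
create table if not exists managed_example (id int, reading int);

-- Unmanaged (external) table: Databricks manages only the metadata;
-- the path below is a hypothetical placeholder for your own storage location.
create table if not exists external_example (id int, reading int)
location 'abfss://my-container@my-account.dfs.core.windows.net/external_example';
```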

&lt;h2&gt;
  
  
  How to work with managed Delta tables using Kedro
&lt;/h2&gt;

&lt;p&gt;Let's demonstrate how to use the &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.databricks.ManagedTableDataSet.html" rel="noopener noreferrer"&gt;ManagedTableDataSet&lt;/a&gt; with a simple example on Databricks. You'll need to open a new Databricks notebook and attach it to a cluster to follow along with the rest of this example, which runs on a workspace using a Hive metastore. We'll create a dataset containing weather readings, save it to a managed Delta table on Databricks, append some data, and access a specific table version to showcase Delta Lake's time travel capabilities.&lt;/p&gt;

&lt;p&gt;Run each code snippet in this section in a separate notebook cell.&lt;/p&gt;

&lt;p&gt;The first steps are to set up your workspace by creating a &lt;code&gt;weather&lt;/code&gt; database in your metastore and installing Kedro. Run the following SQL code to create the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%sql
create database if not exists weather;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To install Kedro and the &lt;code&gt;ManagedTableDataSet&lt;/code&gt;, use the &lt;code&gt;%pip&lt;/code&gt; magic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install kedro kedro-datasets[databricks.ManagedTableDataSet]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first part of our program will create some weather data. We'll create a Spark DataFrame with four columns: &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;location&lt;/code&gt;, &lt;code&gt;temperature&lt;/code&gt;, and &lt;code&gt;humidity&lt;/code&gt; to store our weather data. Then, we'll use a new instance of &lt;code&gt;ManagedTableDataSet&lt;/code&gt; to save our DataFrame to a Delta table called &lt;code&gt;2023_06_22&lt;/code&gt; (the day of the readings) in the &lt;code&gt;weather&lt;/code&gt; database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro_datasets.databricks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ManagedTableDataSet&lt;/span&gt;

&lt;span class="n"&gt;spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Define schema
&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;humidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Create DataFrame
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Warsaw&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bucharest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;spark_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a ManagedTableDataSet instance using a new table named '2023_06_22'
&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManagedTableDataSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023_06_22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save the DataFrame to the table
&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To load our data back into a DataFrame, we use the &lt;code&gt;load&lt;/code&gt; method on &lt;code&gt;ManagedTableDataSet&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load the table data into a DataFrame
&lt;/span&gt;&lt;span class="n"&gt;reloaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Print the first 3 rows of the DataFrame
&lt;/span&gt;&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reloaded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code loads the data from the &lt;code&gt;2023_06_22&lt;/code&gt; table in the &lt;code&gt;weather&lt;/code&gt; database back into a Spark DataFrame and shows the first three rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;|   date   | location | temperature | humidity |
|:--------:|:--------:|:-----------:|:--------:|
|2023-06-22|Bucharest |     32      |   38     |
|2023-06-22|  London  |     27      |   39     |
|2023-06-22|  Warsaw  |     28      |   40     |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's say we take some more weather readings later in the day and want to add them to our Delta table. To do this, we can write to it using a new instance of &lt;code&gt;ManagedTableDataSet&lt;/code&gt; initialised with &lt;code&gt;"append"&lt;/code&gt; passed in as an argument to &lt;code&gt;write_mode&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Append new rows to the data
&lt;/span&gt;&lt;span class="n"&gt;new_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Cairo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Lisbon&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;spark_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManagedTableDataSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023_06_22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;write_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above adds new rows for Cairo and Lisbon to our Delta table, which creates a new version of the table.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ManagedTableDataSet&lt;/code&gt; class allows for saving data with three different write modes: &lt;code&gt;overwrite&lt;/code&gt;, &lt;code&gt;append&lt;/code&gt;, and &lt;code&gt;upsert&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;overwrite&lt;/code&gt; mode will completely replace the current data in the table with the new data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;append&lt;/code&gt; mode will add new data to the existing table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;upsert&lt;/code&gt; mode updates existing rows and inserts new rows, based on a specified primary key. Notably, if the table doesn't exist at save time, &lt;code&gt;upsert&lt;/code&gt; behaves like &lt;code&gt;append&lt;/code&gt;, inserting the data into a new table.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
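In a full Kedro project you would normally configure these write modes declaratively in `catalog.yml` instead of instantiating the dataset in code. The entry below is a hedged sketch (the entry name and the `primary_key` value are illustrative):

```yaml
# Illustrative catalog.yml entry; the dataset name and primary_key are examples.
weather_readings:
  type: databricks.ManagedTableDataSet
  database: weather
  table: 2023_06_22
  write_mode: upsert
  primary_key: location  # required for upsert mode
```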

&lt;p&gt;Suppose we later want to access our data as it appeared earlier in the day when we had only taken three readings. The &lt;code&gt;ManagedTableDataSet&lt;/code&gt; class supports accessing different versions of the Delta table. We can access a specific version by defining a Kedro &lt;code&gt;Version&lt;/code&gt; and passing it into a new instance of &lt;code&gt;ManagedTableDataSet&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro.io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Version&lt;/span&gt;

&lt;span class="c1"&gt;# Load version 0 of the table
&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManagedTableDataSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023_06_22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reloaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reloaded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load version 1 of the table
&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManagedTableDataSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023_06_22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reloaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reloaded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see two rendered tables as the output of running this code. The first corresponds to version 0 of the &lt;code&gt;2023_06_22&lt;/code&gt; table, while the second corresponds to version 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;|   date   | location | temperature | humidity |
|:--------:|:--------:|:-----------:|:--------:|
|2023-06-22|Bucharest |     32      |   38     |
|2023-06-22|  London  |     27      |   39     |
|2023-06-22|  Warsaw  |     28      |   40     |

|   date   | location | temperature | humidity |
|:--------:|:--------:|:-----------:|:--------:|
|2023-06-22|Bucharest |     32      |   38     |
|2023-06-22|  London  |     27      |   39     |
|2023-06-22|  Warsaw  |     28      |   40     |
|2023-06-22|  Lisbon  |     28      |   44     |
|2023-06-22|  Cairo   |     35      |   25     |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it! We've put together a simple program to show some of the usual tasks that &lt;code&gt;ManagedTableDataSet&lt;/code&gt; facilitates, making it easy to save, load, and manage versions of your data in Delta tables on Databricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Databricks is a fast-growing deployment vector for Kedro projects. This blog post has demonstrated how to combine the power of both Kedro and Databricks with an open-source &lt;code&gt;ManagedTableDataSet&lt;/code&gt; that enables streamlined data I/O operations when deploying a Kedro project on Databricks. &lt;code&gt;ManagedTableDataSet&lt;/code&gt; empowers you to spend more time implementing the business logic of your data pipeline or machine learning workflow and less time manually handling data.&lt;/p&gt;

</description>
      <category>kedro</category>
      <category>python</category>
      <category>databricks</category>
      <category>deltalake</category>
    </item>
    <item>
      <title>A new Kedro dataset for Spark Structured Streaming</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Wed, 12 Jul 2023 07:36:25 +0000</pubDate>
      <link>https://forem.com/kedro/a-new-kedro-dataset-for-spark-structured-streaming-n39</link>
      <guid>https://forem.com/kedro/a-new-kedro-dataset-for-spark-structured-streaming-n39</guid>
      <description>&lt;p&gt;This article guides data practitioners on how to set up a Kedro project to use the new &lt;code&gt;SparkStreaming&lt;/code&gt; Kedro dataset, with example use cases, and a deep-dive on some design considerations. It's meant for data practitioners familiar with Kedro so we'll not be covering the basics of a project, but you can familiarise yourself with them in the &lt;a href="https://docs.kedro.org/en/stable/get_started/install.html" rel="noopener noreferrer"&gt;Kedro documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kedro?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kedro.org" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt; is an open-source Python toolbox that applies software engineering principles to data science code. It makes it easier for a team to apply software engineering principles to data science code, which reduces the time spent rewriting data science experiments so that they are fit for production.&lt;/p&gt;

&lt;p&gt;Kedro was born at QuantumBlack to solve the challenges faced regularly in data science projects and promote teamwork through standardised team workflows. It is now hosted by the &lt;a href="https://lfaidata.foundation/" rel="noopener noreferrer"&gt;LF AI &amp;amp; Data Foundation&lt;/a&gt; as an incubating project.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are Kedro datasets?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.kedro.org/en/stable/data/data_catalog.html" rel="noopener noreferrer"&gt;Kedro datasets&lt;/a&gt; are abstractions for reading and loading data, designed to decouple these operations from your business logic. These datasets manage reading and writing data from a variety of sources, while also ensuring consistency, tracking, and versioning. They allow users to maintain focus on core data processing, leaving data I/O tasks to Kedro.&lt;/p&gt;

&lt;h2&gt;
  
  
   What is Spark Structured Streaming?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" rel="noopener noreferrer"&gt;Spark Structured Streaming&lt;/a&gt; is built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data, and the Spark SQL engine will run it incrementally and continuously and update the final result as streaming data continues to arrive.&lt;/p&gt;

&lt;h2&gt;
  
  
   Integrating Kedro and Spark Structured Streaming
&lt;/h2&gt;

&lt;p&gt;Kedro is easily extensible for your own workflows and this article explains one of the ways to add new functionality. To enable Kedro to work with Spark Structured Streaming, a team inside &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/labs" rel="noopener noreferrer"&gt;QuantumBlack Labs&lt;/a&gt; developed a new &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.spark.SparkStreamingDataSet.html" rel="noopener noreferrer"&gt;Spark Streaming Dataset&lt;/a&gt;, as the existing Kedro Spark dataset was not compatible with streaming use cases. The new dataset accepts a checkpoint location to avoid data duplication, and it calls &lt;code&gt;.start()&lt;/code&gt; at the end of its &lt;code&gt;_save&lt;/code&gt; method to initiate the stream.&lt;/p&gt;

&lt;h2&gt;
  
  
   Set up a project to integrate Kedro with Spark Structured streaming
&lt;/h2&gt;

&lt;p&gt;The project uses a Kedro dataset to build a structured data pipeline that can read and write data streams with Spark Structured Streaming and process them in real time. You need to add two separate Hooks to the Kedro project to enable it to function as a streaming application.&lt;/p&gt;

&lt;p&gt;Integration involves the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a Kedro project.&lt;/li&gt;
&lt;li&gt;Register the necessary PySpark and streaming related Hooks. &lt;/li&gt;
&lt;li&gt;Configure the custom dataset in the &lt;code&gt;catalog.yml&lt;/code&gt; file, defining the streaming sources and sinks. &lt;/li&gt;
&lt;li&gt;Use Kedro’s new &lt;a href="https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets/spark" rel="noopener noreferrer"&gt;dataset for Spark Structured Streaming&lt;/a&gt; to store intermediate dataframes generated during the Spark streaming process.&lt;/li&gt;
&lt;/ol&gt;
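As a sketch of step 3, catalog entries for a streaming source and sink might look like the following. This is a hedged example: the dataset names and file paths are illustrative, and streaming-specific options such as `checkpoint` and `output_mode` go through `save_args`.

```yaml
# Illustrative catalog.yml entries; dataset names and paths are examples.
raw_events:
  type: spark.SparkStreamingDataSet
  filepath: data/01_raw/stream/
  file_format: json

processed_events:
  type: spark.SparkStreamingDataSet
  filepath: data/02_intermediate/stream/
  file_format: json
  save_args:
    output_mode: append
    checkpoint: data/04_checkpoint/processed_events
```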

&lt;h3&gt;
  
  
  Create a Kedro project
&lt;/h3&gt;

&lt;p&gt;Ensure you have installed Kedro version 0.18.9 or newer and &lt;code&gt;kedro-datasets&lt;/code&gt; version 1.4.0 or newer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install kedro~=0.18.0 kedro-datasets~=1.4.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a new Kedro project using the Kedro &lt;code&gt;pyspark&lt;/code&gt; starter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kedro new --starter=pyspark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Register the necessary PySpark and streaming related Hooks
&lt;/h3&gt;

&lt;p&gt;To work with multiple streaming nodes, two Hooks are required. The first integrates PySpark: see &lt;a href="https://docs.kedro.org/en/stable/integrations/pyspark_integration.html" rel="noopener noreferrer"&gt;Build a Kedro pipeline with PySpark&lt;/a&gt; for details. The second keeps streaming queries running until they terminate or an exception occurs.&lt;/p&gt;

&lt;p&gt;Add the following code to &lt;code&gt;src/$your_kedro_project_name/hooks.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkConf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro.framework.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hook_impl&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SparkHooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@hook_impl&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;after_context_created&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Initialises a SparkSession using the config
        defined in project&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s conf folder.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="c1"&gt;# Load the spark configuration in spark.yaml using the config loader
&lt;/span&gt;        &lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config_loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark*/**&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;spark_conf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkConf&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;setAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# Initialise the spark session
&lt;/span&gt;        &lt;span class="n"&gt;spark_session_conf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_package_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enableHiveSupport&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;spark_conf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;_spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session_conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;_spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sparkContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLogLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SparkStreamsHook&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@hook_impl&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;after_pipeline_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Starts a spark streaming await session
        once the pipeline reaches the last node.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;awaitAnyTermination&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Register the Hooks in &lt;code&gt;src/$your_kedro_project_name/settings.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Project settings. There is no need to edit this file unless you want to change values
from the Kedro defaults. For further information, including these default values, see
https://kedro.readthedocs.io/en/stable/kedro_project_setup/settings.html.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkHooks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparkStreamsHook&lt;/span&gt;

&lt;span class="n"&gt;HOOKS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SparkHooks&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;SparkStreamsHook&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiated project hooks.
# from streaming.hooks import ProjectHooks
# HOOKS = (ProjectHooks(),)
&lt;/span&gt;
&lt;span class="c1"&gt;# Installed plugins for which to disable hook auto-registration.
# DISABLE_HOOKS_FOR_PLUGINS = ("kedro-viz",)
&lt;/span&gt;
&lt;span class="c1"&gt;# Class that manages storing KedroSession data.
# from kedro.framework.session.shelvestore import ShelveStore
# SESSION_STORE_CLASS = ShelveStore
# Keyword arguments to pass to the `SESSION_STORE_CLASS` constructor.
# SESSION_STORE_ARGS = {
#     "path": "./sessions"
# }
&lt;/span&gt;
&lt;span class="c1"&gt;# Class that manages Kedro's library components.
# from kedro.framework.context import KedroContext
# CONTEXT_CLASS = KedroContext
&lt;/span&gt;
&lt;span class="c1"&gt;# Directory that holds configuration.
# CONF_SOURCE = "conf"
&lt;/span&gt;
&lt;span class="c1"&gt;# Class that manages how configuration is loaded.
# CONFIG_LOADER_CLASS = ConfigLoader
# Keyword arguments to pass to the `CONFIG_LOADER_CLASS` constructor.
# CONFIG_LOADER_ARGS = {
#       "config_patterns": {
#           "spark" : ["spark*/"],
#           "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
#       }
# }
&lt;/span&gt;
&lt;span class="c1"&gt;# Class that manages the Data Catalog.
# from kedro.io import DataCatalog
# DATA_CATALOG_CLASS = DataCatalog
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
   How to set up your Kedro project to read data from streaming sources
&lt;/h2&gt;

&lt;p&gt;Once you have set up your project, you can use the new &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.spark.SparkStreamingDataSet.html" rel="noopener noreferrer"&gt;Kedro Spark streaming dataset&lt;/a&gt;. Configure the data catalog in &lt;code&gt;conf/base/catalog.yml&lt;/code&gt; as follows to read from a streaming JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;raw_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spark.SparkStreamingDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/stream/inventory/&lt;/span&gt;
  &lt;span class="na"&gt;file_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additional options can be configured via the &lt;code&gt;load_args&lt;/code&gt; key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;int.new_inventory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spark.SparkStreamingDataSet&lt;/span&gt;
   &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/02_intermediate/inventory/&lt;/span&gt;
   &lt;span class="na"&gt;file_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;csv&lt;/span&gt;
   &lt;span class="na"&gt;load_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;header&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
   How to set up your Kedro project to write data to streaming sinks
&lt;/h2&gt;

&lt;p&gt;All the additional arguments can be kept under the &lt;code&gt;save_args&lt;/code&gt; key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;processed.sensor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spark.SparkStreamingDataSet&lt;/span&gt;
   &lt;span class="na"&gt;file_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;csv&lt;/span&gt;
   &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/03_primary/processed_sensor/&lt;/span&gt;
   &lt;span class="na"&gt;save_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;output_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;append&lt;/span&gt;
     &lt;span class="na"&gt;checkpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/04_checkpoint/processed_sensor&lt;/span&gt;
     &lt;span class="na"&gt;header&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that when you use the Kafka format, the corresponding packages must be added to the &lt;code&gt;spark.yml&lt;/code&gt; configuration as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.jars.packages: org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
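&lt;p&gt;With the JAR in place, a Kafka source could then be declared in the data catalog. The entry below is a hypothetical sketch rather than an example from a real project: the dataset name, broker address, and topic are placeholders, and it assumes the standard Spark Kafka reader options (&lt;code&gt;kafka.bootstrap.servers&lt;/code&gt;, &lt;code&gt;subscribe&lt;/code&gt;) are passed straight through via &lt;code&gt;load_args&lt;/code&gt;.&lt;/p&gt;

```yaml
# Hypothetical catalog entry for a Kafka streaming source; the broker
# address and topic name are placeholders.
raw_events:
  type: spark.SparkStreamingDataSet
  file_format: kafka
  load_args:
    kafka.bootstrap.servers: localhost:9092
    subscribe: inventory-topic
```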



&lt;h2&gt;
  
  
   Design considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pipeline design
&lt;/h3&gt;

&lt;p&gt;To benefit from Spark's internal query optimisation, we recommend storing any interim datasets as memory datasets.&lt;/p&gt;

&lt;p&gt;All streams start at the same time, so any nodes that have a dependency on another node that writes to a file sink (i.e. the input to that node is the output of another node) will fail on the first run. This is because there are no files in the file sink for the stream to process when it starts.&lt;/p&gt;

&lt;p&gt;We recommend that you either keep intermediate datasets in memory, or split the processing into two pipelines and trigger the first pipeline on its own to build up some initial history before running the second.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature creation
&lt;/h3&gt;

&lt;p&gt;Be aware that windowing operations only allow windowing on time columns.&lt;/p&gt;

&lt;p&gt;Watermarks must be defined for joins. Only certain types of joins are allowed, and these depend on the file types (stream-stream, stream-static) which makes joining of multiple tables a little complex at times. For further information or advice about join types and watermarking, take a look at the &lt;a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#join-operations" rel="noopener noreferrer"&gt;PySpark documentation&lt;/a&gt; or reach out on the &lt;a href="https://slack.kedro.org" rel="noopener noreferrer"&gt;Kedro Slack workspace&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
   Logging
&lt;/h2&gt;

&lt;p&gt;When it first runs, the Kedro pipeline will download the JAR files required for the Spark Kafka integration. On subsequent runs it won't download them again, but will retrieve them from the location where they were previously stored.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttg7xtyy9c59x6zlnn74.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttg7xtyy9c59x6zlnn74.png" alt="Spark logging" width="800" height="754"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For each node, the logs for the following will be shown: Loading data, Running nodes, Saving data, Completed x out of y tasks.&lt;/p&gt;

&lt;p&gt;The "Completed" log message doesn't mean that the stream processing in that node has stopped. It means that the Spark plan has been created and, if the output dataset is being saved to a sink, that the stream has started.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsakexf7ctormpsi1ifq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsakexf7ctormpsi1ifq2.png" alt="Spark logging" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once Kedro has run through all the nodes and the full Spark execution plan has been created, you'll see &lt;code&gt;INFO Pipeline execution completed successfully&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This doesn't mean the stream processing has stopped, because the post-run Hook keeps the Spark session alive. As new data comes in, new Spark logs will be shown, even after the "Pipeline execution completed" log.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiocc9hj7jh8o05xe3ji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiocc9hj7jh8o05xe3ji.png" alt="Spark logging" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If there is an error in the input data, the Spark error logs will come through and Kedro will shut down the SparkContext and all the streams within it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwultqq2h553hem0dbfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwultqq2h553hem0dbfa.png" alt="Spark logging" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
   In summary
&lt;/h2&gt;

&lt;p&gt;In this article, we explained how to take advantage of one of the ways to extend Kedro by building a new dataset to create streaming pipelines. We created a new Kedro project using the Kedro &lt;code&gt;pyspark&lt;/code&gt; starter and illustrated how to work with Hooks, adding them to the Kedro project so that it can function as a streaming application. The new dataset was then straightforward to configure through the Kedro data catalog, which defines the streaming sources and sinks.&lt;/p&gt;

&lt;p&gt;There are currently some limitations: the dataset is not yet ready to use with a service broker such as Kafka out of the box, because an additional JAR package is required.&lt;/p&gt;

&lt;p&gt;If you want to find out more about the ways to extend Kedro, take a look at the &lt;a href="https://docs.kedro.org/en/stable/extend_kedro/index.html" rel="noopener noreferrer"&gt;advanced Kedro documentation&lt;/a&gt; for more about Kedro plugins, datasets, and Hooks.&lt;/p&gt;

&lt;h2&gt;
  
  
   Contributors
&lt;/h2&gt;

&lt;p&gt;This post was created by &lt;a href="https://www.linkedin.com/in/tingting-w-93b32516a/" rel="noopener noreferrer"&gt;Tingting Wan&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/chivo369/" rel="noopener noreferrer"&gt;Tom Kurian&lt;/a&gt;, and &lt;a href="https://uk.linkedin.com/in/harismichailidis" rel="noopener noreferrer"&gt;Haris Michailidis&lt;/a&gt;, who are all Data Engineers in the London office of &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/how-we-help-clients" rel="noopener noreferrer"&gt;QuantumBlack, AI by McKinsey&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>kedro</category>
      <category>spark</category>
      <category>streaming</category>
    </item>
    <item>
      <title>Get up to speed: how to build a custom Kedro runner</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Thu, 22 Jun 2023 09:46:25 +0000</pubDate>
      <link>https://forem.com/kedro/get-up-to-speed-how-to-build-a-custom-kedro-runner-2dj3</link>
      <guid>https://forem.com/kedro/get-up-to-speed-how-to-build-a-custom-kedro-runner-2dj3</guid>
      <description>&lt;p&gt;In Kedro, &lt;a href="https://docs.kedro.org/en/stable/nodes_and_pipelines/run_a_pipeline.html" rel="noopener noreferrer"&gt;runners are the execution mechanism&lt;/a&gt; for data science and machine learning pipelines. The default behaviour of all of Kedro’s built-in runners is to halt pipeline execution if an error occurs that is significant enough to cause any of the nodes to fail, as shown in the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzzcpeuxofk4qta7mzeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzzcpeuxofk4qta7mzeg.png" alt="Sequential runner when a node fails" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram, the entire run aborts when it encounters a node that it cannot run, terminating all other sections or branches of the pipeline, even those that it could have run.&lt;/p&gt;

&lt;p&gt;The custom runner described in this article was specifically developed for a top player in the mining industry that uses Kedro to construct data pipelines for BI dashboards essential for operational excellence.&lt;/p&gt;

&lt;p&gt;The client’s pipeline is designed to be resilient towards node failures. Certain nodes operate independently of each other, and especially during the development and exploration stages, the failure of a single node does not necessitate the termination of the entire Kedro run. The desired behaviour is as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytrrj37ihr5o3u8f45uv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytrrj37ihr5o3u8f45uv.png" alt="Custom runner that does not halt all nodes when a failure is encountered" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram, the runner meets a node that cannot run but finds other sections or branches that it can execute.&lt;/p&gt;

&lt;p&gt;The client relies on Kedro to execute a substantial pipeline that retrieves data from various sources. Some of the input datasets are manually created, which introduces the possibility of errors if entries are mistyped or omitted. By allowing the pipeline to continue and bypass nodes as they encounter failures, it becomes possible to compile a comprehensive list of data issues during a single run and address them collectively.&lt;/p&gt;

&lt;p&gt;In comparison, the default Kedro approach is considerably more time-consuming as it pauses the pipeline upon the failure of a single node, leading to a repetitive cycle of fixing one issue, rerunning the pipeline to encounter the next issue, fixing that, and so on.&lt;/p&gt;

&lt;p&gt;Executing all feasible nodes within the pipeline provides an additional advantage. In cases where no data issues arise, completing the pipeline allows the available metrics to be displayed on a BI dashboard, ensuring service continuity. For instance, if only one data source is corrupted, the BI metrics that depend on that specific data need to be withheld, but all others can be showcased. In contrast, the default Kedro behaviour would render all metrics unavailable until the single dataset issue is resolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: a customised Kedro runner
&lt;/h2&gt;

&lt;p&gt;As an open-source project, Kedro enables you to define a custom runner for your project. The team took the open-source code for Kedro’s sequential runner and extended it, since their use case didn’t need any parallelisation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“One of the reasons we selected Kedro is that it is open source and highly extensible. We knew from the outset that we could make our own customisations”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The team created a soft-fail runner to transform errors into warnings, allowing the pipeline to continue executing to the best of its ability while providing a report of any nodes that failed, so that data issues can be addressed. At that point, the pipeline run can be finalised by executing only those missing nodes separately, using appropriate Kedro syntax.&lt;/p&gt;

&lt;p&gt;The resulting &lt;code&gt;SoftFailRunner&lt;/code&gt; is an implementation of &lt;a href="https://docs.kedro.org/en/stable/kedro.runner.AbstractRunner.html" rel="noopener noreferrer"&gt;&lt;code&gt;AbstractRunner&lt;/code&gt;&lt;/a&gt; that runs a pipeline sequentially using a topological sort of provided nodes. Unlike the built-in &lt;a href="https://docs.kedro.org/en/stable/kedro.runner.SequentialRunner.html" rel="noopener noreferrer"&gt;&lt;code&gt;SequentialRunner&lt;/code&gt;&lt;/a&gt;, this runner does not terminate the pipeline but runs any remaining nodes as long as their dependencies are fulfilled. The &lt;code&gt;SoftFailRunner&lt;/code&gt; implementation adds two arguments: &lt;code&gt;--from-nodes&lt;/code&gt; and &lt;code&gt;--runner&lt;/code&gt;. The essential code for the &lt;code&gt;SoftFailRunner&lt;/code&gt; is shown below and the full code &lt;a href="https://github.com/kedro-org/kedro/blob/feat/softfail-runner/kedro/runner/softfail_runner.py" rel="noopener noreferrer"&gt;can be found on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxn3477gb3px8sk16fhf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxn3477gb3px8sk16fhf.png" alt="Code for the soft-fail runner" width="800" height="979"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The logic behind the runner is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A new &lt;code&gt;skip_nodes&lt;/code&gt; variable keeps track of which nodes should be skipped.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every time a node is about to run, the &lt;code&gt;skip_nodes&lt;/code&gt; list is checked.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When a node fails, all of its descendants are added to &lt;code&gt;skip_nodes&lt;/code&gt; using &lt;a href="https://en.wikipedia.org/wiki/Breadth-first_search" rel="noopener noreferrer"&gt;breadth-first search (BFS)&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
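&lt;p&gt;The descendant collection in step 3 can be sketched in plain Python. This is an illustrative stand-in, not the actual &lt;code&gt;SoftFailRunner&lt;/code&gt; code: the toy graph and the &lt;code&gt;downstream_map&lt;/code&gt; name are made up for the example.&lt;/p&gt;

```python
from collections import deque

def collect_skip_nodes(failed_node, downstream_map):
    """Return every descendant of failed_node via breadth-first search.

    downstream_map maps each node name to the list of nodes that consume
    its outputs, i.e. its direct downstream dependencies.
    """
    skip_nodes = set()
    queue = deque([failed_node])
    while queue:
        node = queue.popleft()
        for child in downstream_map.get(node, []):
            if child not in skip_nodes:
                skip_nodes.add(child)
                queue.append(child)
    return skip_nodes

# Toy graph: "clean" feeds "features", which feeds "train" and "report".
downstream_map = {
    "clean": ["features"],
    "features": ["train", "report"],
    "train": [],
    "report": [],
}

# If "clean" fails, everything downstream of it is skipped.
print(sorted(collect_skip_nodes("clean", downstream_map)))  # ['features', 'report', 'train']
```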

&lt;h2&gt;
  
  
   In summary
&lt;/h2&gt;

&lt;p&gt;The customised Kedro runner was straightforward to create and a satisfactory solution to enable maximum efficiency when handling this particular pipeline and dataset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“These results could certainly be achieved with an orchestrator, but using an open-source project with customisation is a quick win for delivering business value”.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>kedro</category>
      <category>python</category>
      <category>datascience</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Collaborative experiment tracking in Kedro-Viz</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Fri, 02 Jun 2023 14:09:58 +0000</pubDate>
      <link>https://forem.com/kedro/collaborative-experiment-tracking-in-kedro-viz-3697</link>
      <guid>https://forem.com/kedro/collaborative-experiment-tracking-in-kedro-viz-3697</guid>
      <description>&lt;p&gt;When training a model in machine learning, the goal is to determine the optimal configuration of attributes such as hyper-parameters, metrics, and training data. The process of identifying the best combinations requires running a lot of experiments and comparing them. As I mentioned in my &lt;a href="https://kedro.org/blog/experiment-tracking-with-kedro" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, experiment tracking is a way to record all the metadata you need to compare machine-learning experiments and recreate them for your project.&lt;/p&gt;

&lt;h2&gt;
  
  
   What is Kedro-Viz?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kedro-org/kedro-viz" rel="noopener noreferrer"&gt;Kedro-Viz&lt;/a&gt; is an interactive development tool for building and visualising data science pipelines with &lt;a href="https://github.com/kedro-org/kedro" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt;. It enables you to monitor the status of your ML project, present it to stakeholders, and easily bring new team members up to speed. You can try it out using our &lt;a href="https://demo.kedro.org/" rel="noopener noreferrer"&gt;hosted demo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;“&lt;em&gt;There's no better method to give an overview of a pipeline's structure in such an engaging, interactive, and thorough way. Our asset's pipelines are very complex, but are structured with modular pipelines, so being able to show the overall structure at the modular pipeline level, before jumping into each individual pipeline helps prevent the audience from getting overwhelmed by the number of nodes and datasets shown&lt;/em&gt;”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Senior Data Scientist at Consultancy&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is experiment tracking in Kedro-Viz?
&lt;/h2&gt;

&lt;p&gt;Experiment tracking on Kedro-Viz enables users to select, plot, and compare how multiple metrics change over time, and identify the best-performing ML experiment, with no additional dependencies to manage or infrastructure needed.&lt;/p&gt;

&lt;p&gt;The video below demonstrates experiment tracking on Kedro-Viz:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/odXhTEa50PU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;During a project with multiple team members, you could end up with a scenario where the results of your experiments are spread across many machines because people are iterating on their individual computers. This makes the tracking process difficult to manage at a team level, as suggested by this feedback from our users.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You might train one model locally on your computer. You might train another one in the cloud. Joe might run another pipeline or another experiment. Having all of those experiments in one place as a single source of truth is really powerful.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"If we could write our metrics files to an S3 bucket and then run experiment tracking pointing at that S3 bucket, that simplifies our workflow in many different ways and would be really helpful. And it would make Kedro experiment tracking just as easy, if not easier, than MLFlow for us."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Can you use an existing database so that we can keep track of runs happening in different places?&lt;/em&gt;"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We have found a way to address this pain point and enable you to collaborate more easily. We are excited to announce that we've &lt;a href="https://www.linen.dev/s/kedro/t/12096327/kedro-kedro-kedro-kedro-kedro-kedro-viz-6-2-0-is-out-kedro-k#ba733439-8aac-46f5-9c37-d015287835cc" rel="noopener noreferrer"&gt;launched collaborative experiment tracking&lt;/a&gt; in Kedro-Viz 6.2.0. The new feature enables a team of users to log their experiments to a shared cloud storage service, and to view and compare each other's experiments in their own experiment tracking view. This simplifies their workflow, provides a single ‘source of truth’, and encourages multi-user collaboration.&lt;/p&gt;

&lt;p&gt;We are releasing this feature in stages across different versions, and the first phase is &lt;a href="https://github.com/kedro-org/kedro-viz/releases" rel="noopener noreferrer"&gt;Kedro-Viz 6.2.0&lt;/a&gt;. This version enables users to read experiments of other users that are stored on Amazon S3 or similar storage solutions on other cloud providers, as long as they are supported by &lt;a href="https://filesystem-spec.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;fsspec&lt;/a&gt;. Future versions of collaborative experiment tracking aim to improve the user experience through automatic reloading and optimisation by caching.&lt;/p&gt;

&lt;h2&gt;
  
  
   Get started with collaborative experiment tracking
&lt;/h2&gt;

&lt;p&gt;Follow these steps to set up collaborative experiment tracking in Kedro-Viz:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Update Kedro-Viz
&lt;/h3&gt;

&lt;p&gt;Ensure you have the latest version of Kedro-Viz (6.2.0 or later).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;kedro-viz &lt;span class="nt"&gt;--upgrade&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Set up cloud storage
&lt;/h3&gt;

&lt;p&gt;Kedro-Viz uses &lt;a href="https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=s3#other-known-implementations" rel="noopener noreferrer"&gt;fsspec&lt;/a&gt; to save and read &lt;code&gt;session_store&lt;/code&gt; files from a variety of data stores, including local file systems, network file systems, cloud object stores (e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage), and HDFS.&lt;/p&gt;

&lt;p&gt;Set up a central cloud storage repository, such as an Amazon S3 bucket, to store all your team's experiments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Configure your Kedro project
&lt;/h3&gt;

&lt;p&gt;Locate the &lt;code&gt;settings.py&lt;/code&gt; file in your Kedro project directory and add the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro_viz.integrations.kedro.sqlite_store&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SQLiteStore&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;SESSION_STORE_CLASS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLiteStore&lt;/span&gt;
&lt;span class="n"&gt;SESSION_STORE_ARGS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remote_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket-name/path/to/experiments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
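The `parents[2]` arithmetic in the snippet above assumes the default Kedro project layout, where `settings.py` lives at `src/<package_name>/settings.py`; climbing two directory levels from the file reaches the project root, so the session store lands in `<project>/data`. A quick illustration with a hypothetical path:

```python
# Hypothetical project layout, to show what Path(__file__).parents[2] points at.
from pathlib import Path

settings_file = Path("my-project/src/my_package/settings.py")

# parents[0] -> my-project/src/my_package
# parents[1] -> my-project/src
# parents[2] -> my-project  (the project root)
store_path = settings_file.parents[2] / "data"
assert store_path == Path("my-project/data")
```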



&lt;h3&gt;
  
  
  Step 4: Set up a unique username
&lt;/h3&gt;

&lt;p&gt;Kedro-Viz saves your experiments as SQLite database files on the central cloud storage. To ensure that all users have unique filenames, set the &lt;code&gt;KEDRO_SQLITE_STORE_USERNAME&lt;/code&gt; environment variable. If it is not specified, Kedro-Viz defaults to your computer username.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;KEDRO_SQLITE_STORE_USERNAME &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_unique__username"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
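The fallback behaviour described above can be pictured as reading the variable with a default. This is a sketch, not Kedro-Viz's actual code; the function name is made up for illustration:

```python
import getpass
import os

def resolve_store_username():
    # Hypothetical helper: the environment variable wins; otherwise fall back
    # to the OS username, mirroring the default behaviour described above.
    return os.environ.get("KEDRO_SQLITE_STORE_USERNAME", getpass.getuser())

os.environ["KEDRO_SQLITE_STORE_USERNAME"] = "rashida"
assert resolve_store_username() == "rashida"
```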



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Configure cloud storage credentials&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;From Kedro-Viz version 6.2, the only way to set up credentials for accessing your cloud storage is through environment variables, as shown below for &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html" rel="noopener noreferrer"&gt;Amazon S3 cloud storage&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_access_key_id"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_secret_access_key"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_aws_region"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the screenshots below we show an example of the session store and Kedro-Viz output for three team members (Huong, Tynan, and Rashida):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvt9sgfuto8gzhbf11h6v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvt9sgfuto8gzhbf11h6v.png" alt="Session store showing the 3 objects for Huong, Tynan, and Rashida" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Session store showing the 3 objects for Huong, Tynan, and Rashida.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjuydxwja7hlf4r61lwe0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjuydxwja7hlf4r61lwe0.png" alt="Three separate Kedro-Viz runs by Huong, Tynan, and Rashida" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three separate Kedro-Viz runs by Huong, Tynan, and Rashida.&lt;/p&gt;

&lt;p&gt;This tutorial offers a very swift run-through of the configuration process. For further information, check out the &lt;a href="https://docs.kedro.org/en/stable/experiment_tracking/index.html" rel="noopener noreferrer"&gt;documentation on the experiment tracking feature&lt;/a&gt; and keep up-to-date with the latest news about Kedro and Kedro-Viz on our &lt;a href="https://slack.kedro.org" rel="noopener noreferrer"&gt;Slack workspace&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Many thanks to the Kedro-Viz team especially &lt;a href="https://github.com/rashidakanchwala" rel="noopener noreferrer"&gt;@Rashida Kanchwala&lt;/a&gt; for contributing to this post.&lt;/p&gt;

</description>
      <category>kedro</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>A Polars exploration into Kedro</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Wed, 17 May 2023 14:50:58 +0000</pubDate>
      <link>https://forem.com/kedro/a-polars-exploration-into-kedro-3cab</link>
      <guid>https://forem.com/kedro/a-polars-exploration-into-kedro-3cab</guid>
      <description>&lt;p&gt;One year ago I travelled to Lithuania for the first time to present at PyCon/PyData Lithuania, and I had a great time there. The topic of my talk was an evaluation of some alternative dataframe libraries, including Polars, the one that I ended up enjoying the most. &lt;/p&gt;

&lt;p&gt;I enjoyed it so much that this week I’m in Vilnius again, and I’ll be delivering a workshop at PyCon Lithuania 2023 called &lt;a href="https://pycon.lt/2023/activities/talks/KAJGPU" rel="noopener noreferrer"&gt;“Analyze your data at the speed of light with Polars and Kedro”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this blog post you will learn how using &lt;a href="https://www.pola.rs/" rel="noopener noreferrer"&gt;Polars&lt;/a&gt; in Kedro can make your data pipelines much faster, what the current status of Polars in Kedro is, and what to expect in the near future. If this is the first time you’ve heard about Polars, I have included a short introduction at the beginning.&lt;/p&gt;

&lt;p&gt;Let’s dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Polars library?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.pola.rs/" rel="noopener noreferrer"&gt;Polars&lt;/a&gt; is an open-source library for Python, Rust, and NodeJS that provides in-memory dataframes, out-of-core processing capabilities, and more. It is based on the Rust implementation of the &lt;a href="https://arrow.apache.org/" rel="noopener noreferrer"&gt;Apache Arrow&lt;/a&gt; columnar data format (you can read more about Arrow on my earlier blog post &lt;a href="https://dev.to/astrojuanlu/demystifying-apache-arrow-5b0a/"&gt;“Demystifying Apache Arrow”&lt;/a&gt;), and it is optimised to be blazing fast.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9e8a9ozp51rfaqed0puj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9e8a9ozp51rfaqed0puj.png" alt="Snippet of Polars code" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, like &lt;a href="https://www.dask.org/" rel="noopener noreferrer"&gt;Dask&lt;/a&gt;, &lt;a href="https://rapids.ai/" rel="noopener noreferrer"&gt;cuDF&lt;/a&gt;, or &lt;a href="https://modin.readthedocs.io/" rel="noopener noreferrer"&gt;Modin&lt;/a&gt;, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.&lt;/p&gt;

&lt;p&gt;I started experimenting with Polars one year ago, and it has now become my go-to data manipulation library. I gave several talks about it, for example &lt;a href="https://youtu.be/LGAHTp4DYZY" rel="noopener noreferrer"&gt;at PyData NYC&lt;/a&gt;, and the room was full.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do Polars and Kedro get used together?
&lt;/h2&gt;

&lt;p&gt;If you want to learn more about Kedro, you can watch a video introduction on &lt;a href="https://www.youtube.com/@kedro-python" rel="noopener noreferrer"&gt;our YouTube channel&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/qClSGY6B0r0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Traditionally Kedro has favoured &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;pandas&lt;/a&gt; as a dataframe library because of its ubiquity and popularity. This means that, for example, to read a CSV file, you would add a corresponding entry to &lt;a href="https://docs.kedro.org/en/stable/data/data_catalog.html" rel="noopener noreferrer"&gt;the catalog&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openrepair-0_3-categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pandas.CSVDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_Product_Categories.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then, you would use that dataset as input for &lt;a href="https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html" rel="noopener noreferrer"&gt;your node functions&lt;/a&gt;, which would, in turn, receive pandas &lt;code&gt;DataFrame&lt;/code&gt; objects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;join_events_categories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(This is just one of the formats supported by Kedro datasets of course! You can also load Parquet, GeoJSON, images… have a look at &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.html" rel="noopener noreferrer"&gt;the &lt;code&gt;kedro-datasets&lt;/code&gt; reference&lt;/a&gt; for a list of datasets maintained by the core team, or &lt;a href="https://github.com/topics/kedro-plugin" rel="noopener noreferrer"&gt;the &lt;code&gt;#kedro-plugin&lt;/code&gt; topic on GitHub&lt;/a&gt; for some contributed by the community!)&lt;/p&gt;

&lt;p&gt;The idea of this blog post is to teach you how you can use Polars instead of pandas for your catalog entries, which in turn allows you to write all your data transformation pipelines using Polars dataframes. For that, I crafted some examples that use &lt;a href="https://openrepair.org/open-data/downloads/" rel="noopener noreferrer"&gt;the Open Repair Alliance dataset&lt;/a&gt;, containing more than 80 000 records of repair events across Europe.&lt;/p&gt;

&lt;p&gt;And if you’re ready to start, let’s go!&lt;/p&gt;

&lt;h2&gt;
  
  
  Get started with Polars for Kedro
&lt;/h2&gt;

&lt;p&gt;First of all, you will need to add &lt;code&gt;kedro-datasets[polars.CSVDataSet]&lt;/code&gt; to your requirements. At the time of writing (May 2023), the code below requires development versions of both &lt;code&gt;kedro&lt;/code&gt; and &lt;code&gt;kedro-datasets&lt;/code&gt;, which you can declare on your &lt;code&gt;requirements.txt&lt;/code&gt; or &lt;code&gt;pyproject.toml&lt;/code&gt; as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# requirements.txt

kedro @ git+https://github.com/kedro-org/kedro@3ea7231
kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# pyproject.toml&lt;/span&gt;

&lt;span class="nn"&gt;[project]&lt;/span&gt;
&lt;span class="py"&gt;dependencies&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"kedro @ git+https://github.com/kedro-org/kedro@3ea7231"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are using the legacy &lt;code&gt;setup.py&lt;/code&gt; files, the syntax is very similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;requires&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kedro @ git+https://github.com/kedro-org/kedro@3ea7231&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you install these dependencies, you can start using the &lt;code&gt;polars.CSVDataSet&lt;/code&gt; by using the appropriate &lt;code&gt;type&lt;/code&gt; in your catalog entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openrepair-0_3-categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;polars.CSVDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_Product_Categories.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and that’s it!&lt;/p&gt;
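The node functions change symmetrically: they keep the same shape as the pandas version shown earlier, but now receive and return Polars frames. A sketch (string annotations are used here so the snippet runs even without `polars` installed):

```python
# Same node as before, but typed against Polars instead of pandas.
# Annotations are strings so this sketch has no hard dependency on polars.
def join_events_categories(
    events: "pl.DataFrame",
    categories: "pl.DataFrame",
) -> "pl.DataFrame":
    ...
```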

&lt;h2&gt;
  
  
  Reading real world CSV files with &lt;code&gt;polars.CSVDataSet&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;It turns out that reading CSV files is not always that easy. The good news is that you can use the &lt;code&gt;load_args&lt;/code&gt; parameter of the catalog entry to pass extra options to the &lt;code&gt;polars.CSVDataSet&lt;/code&gt;, which mirror the function arguments of &lt;code&gt;polars.read_csv&lt;/code&gt;. For example, if you want to attempt parsing the date columns in the CSV, you can set the &lt;code&gt;try_parse_dates&lt;/code&gt; option to &lt;code&gt;true&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openrepair-0_3-categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;polars.CSVDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_Product_Categories.csv&lt;/span&gt;
  &lt;span class="na"&gt;load_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Doesn't make much sense in this case,&lt;/span&gt;
    &lt;span class="c1"&gt;# but serves for demonstration purposes&lt;/span&gt;
    &lt;span class="na"&gt;try_parse_dates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some of these parameters are required to be Python objects: for example, &lt;code&gt;polars.read_csv&lt;/code&gt; takes an optional &lt;code&gt;dtypes&lt;/code&gt; parameter that can be used to specify the dtypes of the columns, as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtypes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;group_identifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Utf8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kedro catalog files only support primitive types. But fear not! You can use more sophisticated configuration loaders in Kedro that allow you to tweak how such files are parsed and loaded.&lt;/p&gt;

&lt;p&gt;To pass the appropriate &lt;code&gt;dtypes&lt;/code&gt; to read this CSV file, you can use the &lt;code&gt;TemplatedConfigLoader&lt;/code&gt;, or alternatively &lt;a href="https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader" rel="noopener noreferrer"&gt;the shiny new &lt;code&gt;OmegaConfigLoader&lt;/code&gt;&lt;/a&gt; with a custom &lt;code&gt;omegaconf&lt;/code&gt; resolver. This resolver takes care of parsing the strings in the YAML catalog and transforming them into the objects Polars needs. Place this code in your &lt;code&gt;settings.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# settings.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;omegaconf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OmegaConf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OmegaConfigLoader&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;OmegaConf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has_resolver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;OmegaConf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_new_resolver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;CONFIG_LOADER_CLASS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OmegaConfigLoader&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now you can use the special OmegaConf syntax in the catalog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openrepair-0_3-events-raw&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;polars.CSVDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv&lt;/span&gt;
  &lt;span class="na"&gt;load_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dtypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Notice the OmegaConf resolver syntax!&lt;/span&gt;
      &lt;span class="na"&gt;product_age&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${polars:Float64}&lt;/span&gt;
      &lt;span class="na"&gt;group_identifier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${polars:Utf8}&lt;/span&gt;
    &lt;span class="na"&gt;try_parse_dates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can access Polars data types with ease from the catalog!&lt;/p&gt;
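Under the hood, the resolver registered in `settings.py` is nothing Polars-specific: it simply looks up an attribute on a module by name. A dependency-free sketch of the same mechanism, with the standard library's `math` module standing in for `polars`:

```python
# The "${polars:Float64}" syntax resolves by calling getattr(pl, "Float64").
# Here math stands in for polars so the sketch runs without extra dependencies.
import math

def attribute_resolver(module):
    # Returns a resolver mapping a name like "Float64" to module.Float64,
    # analogous to the lambda registered with OmegaConf above.
    return lambda attr: getattr(module, attr)

resolve = attribute_resolver(math)
assert resolve("pi") == math.pi
```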

&lt;h2&gt;
  
  
  Future plans for Polars integration in Kedro
&lt;/h2&gt;

&lt;p&gt;This all looks very promising, but it’s only the tip of the iceberg. First of all, these changes need to land in stable versions of &lt;code&gt;kedro&lt;/code&gt; and &lt;code&gt;kedro-datasets&lt;/code&gt;. More importantly, we are working on &lt;a href="https://github.com/kedro-org/kedro-plugins/pull/170" rel="noopener noreferrer"&gt;a generic Polars dataset&lt;/a&gt; that will be able to read other file formats, for example Parquet, which is faster, more compact, and easier to use.&lt;/p&gt;

&lt;p&gt;Polars makes me so excited about the future of data manipulation in Python, and I hope that all Kedro users are able to leverage this amazing project on their data pipelines very soon!&lt;/p&gt;

</description>
      <category>kedro</category>
      <category>python</category>
      <category>polars</category>
      <category>datascience</category>
    </item>
    <item>
      <title>In the Pipeline: May 2023</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Mon, 08 May 2023 15:43:50 +0000</pubDate>
      <link>https://forem.com/kedro/in-the-pipeline-may-2023-32cb</link>
      <guid>https://forem.com/kedro/in-the-pipeline-may-2023-32cb</guid>
      <description>&lt;p&gt;We're launching a new monthly blog post that'll keep you updated on all the exciting things happening in the Kedro community. From the latest Kedro news to upcoming events and interesting topics, “In the Pipeline” has got you covered.&lt;/p&gt;

&lt;p&gt;This month: a new pair of releases, Technical Steering Committee news, upcoming events, and our top picks from recent articles and podcasts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The latest releases of Kedro and Kedro-Viz are here
&lt;/h2&gt;

&lt;p&gt;Earlier this week, &lt;a href="https://kedro-org.slack.com/archives/C03RKAQ0MGQ/p1683045212017599" rel="noopener noreferrer"&gt;Merel announced on Slack&lt;/a&gt; that &lt;strong&gt;Kedro&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;0.18.8&lt;/code&gt;&lt;/strong&gt; has been released.&lt;/p&gt;

&lt;p&gt;Here are the headlines. You can see the &lt;a href="https://github.com/kedro-org/kedro/releases/tag/0.18.8" rel="noopener noreferrer"&gt;full set of release notes on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🚀 Major features and changes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Added the &lt;code&gt;KEDRO_LOGGING_CONFIG&lt;/code&gt; environment variable, which can be used to configure logging from the start of the Kedro process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Removed the logs folder from the Kedro new project template. File-based logging remains, but only at level &lt;code&gt;INFO&lt;/code&gt; and above, and logs now go to the project root instead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A set of bug fixes and other changes 🪲&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✍️ &lt;strong&gt;Documentation changes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Improvements to Sphinx toolchain including incrementing to use a newer version.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improvements to documentation on visualising Kedro projects on Databricks, and additional documentation about the development workflow for Kedro projects on Databricks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improvements to documentation about configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Updated table of contents for documentation to reduce scrolling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And more!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that using &lt;code&gt;kedro.extras.datasets&lt;/code&gt; has been officially deprecated, and will be removed from Kedro in 0.19. Installing &lt;code&gt;kedro_datasets&lt;/code&gt; is now the &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.html" rel="noopener noreferrer"&gt;preferred approach&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We would like to thank our community contributors &lt;a href="https://github.com/MaximeSteinmetz" rel="noopener noreferrer"&gt;Maxime Steinmetz&lt;/a&gt;&lt;strong&gt;,&lt;/strong&gt; &lt;a href="https://github.com/BrianCechmanek" rel="noopener noreferrer"&gt;Brian Cechmanek&lt;/a&gt;, and &lt;a href="https://github.com/MattRossetti" rel="noopener noreferrer"&gt;Matt Rossetti&lt;/a&gt; for their input to this release.&lt;/p&gt;




&lt;p&gt;In the last week of April, &lt;a href="https://kedro-org.slack.com/archives/C03RKAQ0MGQ/p1682700107674719" rel="noopener noreferrer"&gt;Nero announced&lt;/a&gt; the release of &lt;strong&gt;Kedro-Viz&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;6.1.0&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Kedro-Viz is an interactive development tool for building and visualising data science pipelines with &lt;a href="https://github.com/kedro-org/kedro" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt;. It enables you to monitor the status of your ML project, present it to stakeholders, and smoothly bring new team members onboard. It also offers &lt;a href="https://kedro.org/blog/experiment-tracking-with-kedro" rel="noopener noreferrer"&gt;experiment tracking&lt;/a&gt;, and the ability to preview code and datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I get Kedro-Viz?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Python: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install kedro-viz==6.1.0&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;React: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;npm install @quantumblack/kedro-viz@latest&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🚀 What can you expect in this release?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Experiment tracking updates allowing users to filter (show/hide) metrics in the time series &amp;amp; parallel coordinates metrics plots.📈&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A set of bug fixes and other changes 🪲&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can see the full &lt;a href="https://github.com/kedro-org/kedro-viz/releases/tag/v6.1.0" rel="noopener noreferrer"&gt;Kedro-Viz release notes on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;🔮 &lt;strong&gt;What's coming next?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Collaboration features within Kedro-Viz.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create your own reports.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technical Steering Committee news
&lt;/h2&gt;

&lt;p&gt;We’ve recently welcomed &lt;a href="https://www.linkedin.com/in/marrrcin/" rel="noopener noreferrer"&gt;@marrrcin&lt;/a&gt; to the Kedro Technical Steering Committee! You can read more &lt;a href="https://kedro.org/blog/news-from-the-kedro-technical-steering-committee" rel="noopener noreferrer"&gt;about this fantastic news on our blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We’d also like to share some numbers that we collected recently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GitHub Stars on &lt;a href="https://github.com/kedro-org/kedro" rel="noopener noreferrer"&gt;https://github.com/kedro-org/kedro&lt;/a&gt;: 8.3K&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monthly Downloads: 467,000&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Upcoming events
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4th May 2023
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://kedro.org/" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt; team is organising a 2-hour virtual training session on Thursday, May 4th, 2023 that is open to everyone. The session introduces you to Kedro and explains how to turn a Jupyter notebooks into reusable Python libraries. You’ll learn the benefits of Kedro pipelines and how to visualise them using Kedro-Viz in an interactive session with plenty of Q&amp;amp;A.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://events.quantumblack.com/kedro-intro-23-05" rel="noopener noreferrer"&gt;Register now&lt;/a&gt; to reserve your slot on 4th May 2023 at 4:00pm–6:00pm CEST (which is 10:00am–12:00pm EDT).&lt;/p&gt;

&lt;h3&gt;
  
  
  18th May 2023
&lt;/h3&gt;

&lt;p&gt;Juan Luis, Kedro’s Developer Advocate, is giving a talk on &lt;a href="https://pycon.lt/2023/activities/talks/KAJGPU" rel="noopener noreferrer"&gt;18th May at PyCon Lithuania&lt;/a&gt;. His talk is titled “Analyze your data at the speed of light with Polars and Kedro” and presents how to combine Kedro with Polars, a new dataframe library backed by Arrow and Rust, for lightning fast data manipulation and exploratory data analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  In the pipeline: top picks from the Kedro team
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Towards Data Science recently published a pair of nice posts by João Pedro about &lt;a href="https://towardsdatascience.com/data-pipeline-with-airflow-and-aws-tools-s3-lambda-glue-18585d269761" rel="noopener noreferrer"&gt;writing a data pipeline with Airflow and AWS Tools (S3, Lambda &amp;amp; Glue)&lt;/a&gt; and &lt;a href="https://towardsdatascience.com/automatically-managing-data-pipeline-infrastructures-with-terraform-323fd1808a47" rel="noopener noreferrer"&gt;automatically managing data pipeline infrastructures with Terraform&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GetInData | Part of Xebia publishes a weekly newsletter on LinkedIn called Data Pill. It hit its 50th edition this week, which it celebrated with a &lt;a href="https://www.linkedin.com/newsletters/data-pill-6944719603960840192/" rel="noopener noreferrer"&gt;compilation of the most popular case studies&lt;/a&gt; from previous editions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://engineering.atspotify.com/podcasts/nerdout-at-spotify/" rel="noopener noreferrer"&gt;NerdOut@Spotify&lt;/a&gt; podcast is always a must-listen. It’s produced by the nerds at Spotify, and made for the nerds inside all of us. You get to hear from Spotify engineers about challenging tech problems and get a firsthand look into what they’re doing. The most recent episode is a fascinating look into building at scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Speaking of podcasts and Spotify, the R&amp;amp;D Engineering team recently blogged about the &lt;a href="https://engineering.atspotify.com/2023/04/large-scale-generation-of-ml-podcast-previews-at-spotify-with-google-dataflow/" rel="noopener noreferrer"&gt;generation of podcast previews using Google Dataflow&lt;/a&gt;. The result: a neat way of providing users with audio teasers so they can make listening decisions that aren’t based just on static content, such as cover art and descriptions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In last month’s virtual Kedro update meeting, we walked the community through the new OmegaConfigLoader, described user research and ongoing collaboration with Databricks, and discussed experiment tracking in Kedro-Viz. If you missed the session, you can catch up with a recording on the &lt;a href="https://www.youtube.com/watch?v=ACwLKx8TEXc" rel="noopener noreferrer"&gt;Kedro YouTube channel&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  That’s it for May 2023
&lt;/h2&gt;

&lt;p&gt;And that’s a wrap for this month. But if you can’t wait for next month’s &lt;em&gt;&lt;strong&gt;In the Pipeline&lt;/strong&gt;&lt;/em&gt; news, we also toot out regular updates onto Mastodon (&lt;a href="https://social.lfx.dev/@kedro" rel="noopener noreferrer"&gt;https://social.lfx.dev/@kedro&lt;/a&gt;) and across the popular channels of the &lt;a href="https://slack.kedro.org/" rel="noopener noreferrer"&gt;Slack community&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Spoiler alert! Next month, we might unveil a fresh new look. But shh, let's keep it between us for now. Make sure to bookmark this blog or &lt;a href="https://kedro.org/blog/rss" rel="noopener noreferrer"&gt;add our RSS feed to your favorite reader&lt;/a&gt; to stay in the loop and join us in the first week of June for another update from the Kedro team.&lt;/p&gt;




</description>
      <category>kedro</category>
      <category>python</category>
      <category>datascience</category>
      <category>news</category>
    </item>
    <item>
      <title>Introducing your new team lead…Kedro</title>
      <dc:creator>Jo Stichbury</dc:creator>
      <pubDate>Wed, 19 Apr 2023 12:59:06 +0000</pubDate>
      <link>https://forem.com/kedro/introducing-your-new-team-leadkedro-nhl</link>
      <guid>https://forem.com/kedro/introducing-your-new-team-leadkedro-nhl</guid>
      <description>&lt;p&gt;This post explains how Kedro can guide an analytics team to follow best practices and avoid technical debt.&lt;/p&gt;

&lt;p&gt;In a recent article, I explained that &lt;a href="https://towardsdatascience.com/five-software-engineering-principles-for-collaborative-data-science-ab26667a311"&gt;following software principles can help you create a well-ordered analytics project&lt;/a&gt; to share, extend and reuse in the future. In this post we'll review how you can benefit from using Kedro as a toolbox to apply best practices to data science code.&lt;/p&gt;

&lt;h2&gt;
  
  
  How data science projects fail
&lt;/h2&gt;

&lt;p&gt;As data scientists, we aspire to unlock valuable insights by building&lt;br&gt;
well-engineered prototypes that we can take forward into production.&lt;br&gt;
Instead, there is a tendency for us to make poor engineering decisions&lt;br&gt;
in the face of tight deadlines or write code of dubious quality through&lt;br&gt;
a lack of expertise. &lt;/p&gt;

&lt;p&gt;The result is &lt;a href="https://www.splunk.com/en_us/data-insider/what-is-tech-debt.html"&gt;technical debt&lt;/a&gt; and prototype code that is difficult to understand,&lt;br&gt;
maintain, extend, and fix. Projects that once looked promising fail to transition past the experimental stage into production.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A cycle of quick and exciting research leads to high expectations of&lt;br&gt;
great improvement, followed by a long series of delays and&lt;br&gt;
disappointments where frustrating integration work fails to recreate&lt;br&gt;
those elusive improvements, made all the worse by the feeling of sunk&lt;br&gt;
costs and a need to justify the time spent."&lt;/p&gt;

&lt;p&gt;Joe Plattenburg, Data Scientist at Root Insurance&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  How to write well-engineered data science code
&lt;/h2&gt;

&lt;p&gt;When you start to cut code on a prototype, you may not prioritize&lt;br&gt;
maintainability and consistency. Adopting a team culture and way of&lt;br&gt;
working to minimize technical debt can make the difference between&lt;br&gt;
success and failure.&lt;/p&gt;

&lt;p&gt;Some of the most valuable techniques a data scientist can pick up are&lt;br&gt;
those that generations of software engineers already use, such as the&lt;br&gt;
following guidelines:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use a standard and logical project structure&lt;/strong&gt;: It is easier to&lt;br&gt;
understand a project, and share it with others, if you follow a standard&lt;br&gt;
structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't use hardcoded values&lt;/strong&gt;: instead, use precisely named constants&lt;br&gt;
and put them all into a single configuration file so you can find and&lt;br&gt;
update them easily.&lt;/p&gt;
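&lt;p&gt;To illustrate the guideline (in plain Python, not Kedro's own configuration mechanism; the file name and keys below are invented for the example), the constants can live in one file and be read wherever they are needed:&lt;/p&gt;

```python
import json
from pathlib import Path

# All tunable values live in one place instead of being scattered
# through the code as magic numbers.
CONFIG_FILE = Path("parameters.json")
CONFIG_FILE.write_text(json.dumps({"test_size": 0.2, "random_state": 42}))

def load_parameters(path: Path = CONFIG_FILE) -> dict:
    """Read every named constant from the single configuration file."""
    return json.loads(path.read_text())

params = load_parameters()
print(params["test_size"])  # the value comes from config, not a literal
```

&lt;p&gt;Updating a value now means editing one file, with no hunting through source code.&lt;/p&gt;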

&lt;p&gt;&lt;strong&gt;Refactor your code&lt;/strong&gt;: In data science terms, it often makes sense to&lt;br&gt;
use a Jupyter notebook for experimentation. But once your experiment is&lt;br&gt;
done, it's time to clean up the code to remove elements that make it&lt;br&gt;
unmaintainable, and to remove accidental complexity. Refactor the code&lt;br&gt;
into Python functions and packages to form a pipeline that can be&lt;br&gt;
routinely tested to ensure repeatable behaviour.&lt;/p&gt;
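&lt;p&gt;A minimal sketch of this refactoring in plain Python (Kedro formalises the same idea with nodes and pipelines; the functions below are invented for illustration): each function does one task, and the pipeline is simply their composition, so every step can be tested in isolation.&lt;/p&gt;

```python
def clean(raw: list[dict]) -> list[dict]:
    """Drop records with missing values - one task, one function."""
    return [row for row in raw if all(v is not None for v in row.values())]

def add_total(rows: list[dict]) -> list[dict]:
    """Derive a feature from the cleaned data."""
    return [{**row, "total": row["price"] * row["quantity"]} for row in rows]

def pipeline(raw: list[dict]) -> list[dict]:
    """Compose the small steps into a repeatable, testable pipeline."""
    return add_total(clean(raw))

data = [{"price": 2.0, "quantity": 3}, {"price": None, "quantity": 1}]
print(pipeline(data))
```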

&lt;blockquote&gt;
&lt;p&gt;"Testing after each change means that when I make a mistake, I only&lt;br&gt;
have a small change to consider in order to spot the error, which&lt;br&gt;
makes it far easier to find and fix."&lt;/p&gt;

&lt;p&gt;Martin Fowler, Author of Refactoring: Improving the Design of Existing&lt;br&gt;
Code&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Make code reusable by making it readable&lt;/strong&gt;: Write your pipelines as a&lt;br&gt;
series of small functions that do just one task, with single return&lt;br&gt;
paths and a limited number of arguments.&lt;/p&gt;

&lt;p&gt;Many data scientists say they've learned from their colleagues through&lt;br&gt;
pair programming, code reviews and in-house mentoring that enables them&lt;br&gt;
to build expertise suitable to their roles and requirements.&lt;/p&gt;

&lt;p&gt;We see Kedro as the always-available team lead that steers the direction&lt;br&gt;
of the analytics project from the outset and encourages use of a&lt;br&gt;
well-organized folder structure, software design that supports regular&lt;br&gt;
testing, and a culture of writing readable, clean code.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Kedro?
&lt;/h2&gt;

&lt;p&gt;Kedro is an open-source toolbox for production-ready data science. The&lt;br&gt;
framework was born at QuantumBlack to solve the challenges faced&lt;br&gt;
regularly in data science projects and promote teamwork through&lt;br&gt;
standardised team workflows. It is now hosted by the &lt;a href="https://lfaidata.foundation/"&gt;LF AI &amp;amp; Data&lt;br&gt;
Foundation&lt;/a&gt; as an incubating project.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/yEQqf3XUvzk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Kedro = Consistent project structure
&lt;/h3&gt;

&lt;p&gt;Kedro is built on the learnings of &lt;a href="https://drivendata.github.io/cookiecutter-data-science/"&gt;Cookiecutter Data Science&lt;/a&gt;. It helps you to standardise how configuration, source&lt;br&gt;
code, tests, documentation, and notebooks are organised with an&lt;br&gt;
adaptable project template. If your team needs to build with multiple&lt;br&gt;
projects that have similar structure, you can also create your own&lt;br&gt;
Cookiecutter project templates with Kedro starters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kedro = Maintainable code
&lt;/h3&gt;

&lt;p&gt;Kedro helps you refactor your business logic and data processing into&lt;br&gt;
Python modules and packages to form pipelines, so you can keep your&lt;br&gt;
notebooks clean and tidy.&lt;br&gt;
&lt;a href="https://demo.kedro.org"&gt;Kedro-Viz&lt;/a&gt; then visualises the pipelines to help you navigate .&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"People started from scratch each time, the same pitfalls were&lt;br&gt;
experienced independently, reproducibility was time consuming and only&lt;br&gt;
members of the original project team really understood each&lt;br&gt;
codebase...&lt;/p&gt;

&lt;p&gt;We needed to enforce consistency and software engineering best&lt;br&gt;
practices across our own work. Kedro gave us the super-power to move&lt;br&gt;
people from project to project and it was game-changing. After working&lt;br&gt;
with Kedro once, you can land in another project and know how the&lt;br&gt;
codebase is structured, where everything is and most importantly how&lt;br&gt;
you can help".&lt;/p&gt;

&lt;p&gt;Joel Schwarzmann, Principal Product Manager, QuantumBlack Labs, &lt;a href="https://medium.com/towards-data-science/five-software-engineering-principles-for-collaborative-data-science-ab26667a311"&gt;blog&lt;br&gt;
post&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Kedro = Code quality
&lt;/h3&gt;

&lt;p&gt;Kedro makes it easy to avoid common code smells such as hard-coded&lt;br&gt;
constants and magic numbers. The configuration library enables your code&lt;br&gt;
to be reusable through data, model, and logging configuration. An&lt;br&gt;
ever-expanding data catalog supports multiple formats of data access.&lt;/p&gt;

&lt;p&gt;Kedro also makes it easy to keep your code quality up to standard, through&lt;br&gt;
support for black, isort, and flake8 for code linting and formatting,&lt;br&gt;
pytest for testing, and Sphinx for documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kedro = Standardisation
&lt;/h3&gt;

&lt;p&gt;Kedro integrates with standard data science tools, such as TensorFlow,&lt;br&gt;
scikit-learn, or Jupyter notebooks for experimentation, and commonly&lt;br&gt;
used routes to deployment such as Databricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Kedro is an open-source Python toolbox that makes it easier for a&lt;br&gt;
team to apply software engineering principles to data science code,&lt;br&gt;
which reduces the time spent rewriting data science experiments so&lt;br&gt;
that they are fit for production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When you follow established best practice, you have a better chance of&lt;br&gt;
success.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Software engineering principles only work if the entire team follows&lt;br&gt;
them. A tool like Kedro can guide you just like an experienced technical&lt;br&gt;
lead, making it second nature to use established best practices, and&lt;br&gt;
supporting a culture and set of processes based upon software&lt;br&gt;
engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Look forward to greater collaboration and productivity with Kedro in&lt;br&gt;
your team!&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Find out more about Kedro
&lt;/h2&gt;

&lt;p&gt;There are many ways to learn more about Kedro:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Join our &lt;a href="https://slack.kedro.org/"&gt;Slack organisation&lt;/a&gt; to reach out to us directly if you have a question or want to stay up to date with our news. There's an &lt;a href="https://www.linen.dev/s/kedro"&gt;archive of past conversations on Slack&lt;/a&gt; too.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.kedro.org/"&gt;Read our docs&lt;/a&gt; or look at the &lt;a href="https://github.com/kedro-org/kedro"&gt;Kedro source code on GitHub&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check out our "&lt;a href="https://www.youtube.com/watch?v=NU7LmDZGb6E"&gt;Crash course in Kedro&lt;/a&gt;" video on YouTube.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Look out for an upcoming training session tailored to help your team get&lt;br&gt;
on-board with Kedro.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>kedro</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How do data scientists combine Kedro and Databricks?</title>
      <dc:creator>Jo Stichbury</dc:creator>
      <pubDate>Wed, 19 Apr 2023 12:14:37 +0000</pubDate>
      <link>https://forem.com/kedro/how-do-data-scientists-combine-kedro-and-databricks-4pjd</link>
      <guid>https://forem.com/kedro/how-do-data-scientists-combine-kedro-and-databricks-4pjd</guid>
      <description>&lt;p&gt;In recent research, we found that Databricks is the dominant&lt;br&gt;
machine-learning platform used by Kedro users.&lt;/p&gt;

&lt;p&gt;The purpose of the research was to identify any barriers to using Kedro with Databricks; we are collaborating with the Databricks team to create a prioritised list of opportunities to facilitate integration. &lt;/p&gt;

&lt;p&gt;For example, Kedro is best used with an IDE, but IDE support on Databricks is still evolving, so we are keen to understand the pain points that Kedro users face when combining it with Databricks.&lt;/p&gt;

&lt;p&gt;Our research took qualitative data from 16 interviews, and quantitative data from a poll (140 participants) and a survey (46 participants) across the McKinsey and open-source Kedro user bases. We analysed two user journeys.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to ensure a Kedro pipeline is available in a Databricks workspace
&lt;/h2&gt;

&lt;p&gt;The first user journey we considered is how a user ensures the latest version of their pipeline codebase is available within the Databricks workspace. The most common workflow is to use Git, but almost a third of the users in our research set said there were a lot of steps to follow.&lt;/p&gt;

&lt;p&gt;The alternative workflow, which is to use dbx sync to Databricks repos, was used by less than 10% of the users we researched, indicating that awareness of this option is low.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ENvLTCgi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f0r395hhea9dlgh3enl8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ENvLTCgi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f0r395hhea9dlgh3enl8.png" alt="Slide from presentation about Kedro and Databricks research" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to run Kedro pipelines using a Databricks cluster
&lt;/h2&gt;

&lt;p&gt;The second user journey is how users run Kedro pipelines using a Databricks cluster. The most popular method, used by over 80% of participants in our research, is to use a Databricks notebook, which serves as an entry point to run Kedro pipelines. &lt;/p&gt;

&lt;p&gt;We discovered that many users were unaware of the IPython extension that significantly reduces the amount of code required to run Kedro pipelines in Databricks notebooks.&lt;/p&gt;
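&lt;p&gt;For reference, the extension is loaded with a couple of lines at the top of a notebook (the project path below is a placeholder):&lt;/p&gt;

```text
%load_ext kedro.ipython
%reload_kedro path/to/your/kedro/project
```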

&lt;p&gt;We also found that some users run their Kedro pipelines by packaging them and running the resulting Python package on Databricks. However, Kedro did not support the packaging of configurations until version 0.18.5, which has caused problems. &lt;/p&gt;

&lt;p&gt;The final option some users select is to use Databricks Connect, but this is not recommended since it is soon&lt;br&gt;
to be sunsetted by Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uMr4-AR---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lrz30w22fdk36o1whzwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uMr4-AR---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lrz30w22fdk36o1whzwn.png" alt="Slide from presentation about Kedro and Databricks research" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The output of our research
&lt;/h2&gt;

&lt;p&gt;To make it easier to pair Kedro and Databricks, we are updating Kedro's documentation to cover the latest Databricks features and tools, particularly the development and deployment workflows for Kedro on Databricks with dbx. The goal is to help Kedro users take advantage of the benefits of working locally in an IDE and still deploy to Databricks&lt;br&gt;
with ease.&lt;/p&gt;

&lt;p&gt;You can expect this new documentation to be released in the next one to two weeks.&lt;/p&gt;

&lt;p&gt;We will also be creating a Kedro Databricks plugin or starter project template to automate the recommended steps in the documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coming soon...
&lt;/h2&gt;

&lt;p&gt;We have a managed Delta table dataset in our Kedro datasets repo, which will be available for public consumption soon. We are also planning to support managed MLflow on Databricks.&lt;/p&gt;

&lt;p&gt;We have set up a &lt;a href="https://github.com/kedro-org/kedro/milestone/17"&gt;milestone on GitHub&lt;/a&gt; so you can check in on our progress and contribute if you want to. To suggest features to us, report bugs, or just see what we're working on right now, visit the Kedro projects on &lt;a href="https://github.com/kedro-org"&gt;GitHub&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;We welcome every contribution, large or small.&lt;/p&gt;

&lt;h2&gt;
  
  
  Find out more about Kedro
&lt;/h2&gt;

&lt;p&gt;There are many ways to learn more about Kedro:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Join our &lt;a href="https://slack.kedro.org/"&gt;Slack organisation&lt;/a&gt; to reach out to us directly if you have a question or want to stay up to date with our news. There's an &lt;a href="https://www.linen.dev/s/kedro"&gt;archive of past conversations on Slack&lt;/a&gt; too.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.kedro.org/"&gt;Read our docs&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check out our "&lt;a href="https://www.youtube.com/watch?v=NU7LmDZGb6E"&gt;Crash course in Kedro&lt;/a&gt;" video on YouTube.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Look out for an upcoming training session tailored to help your team get on-board with Kedro.&lt;/p&gt;

</description>
      <category>python</category>
      <category>kedro</category>
      <category>databricks</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A new home for the Kedro blog and some recent releases</title>
      <dc:creator>Jo Stichbury</dc:creator>
      <pubDate>Tue, 04 Apr 2023 14:51:47 +0000</pubDate>
      <link>https://forem.com/kedro/a-new-home-for-the-kedro-blog-and-some-recent-releases-mg6</link>
      <guid>https://forem.com/kedro/a-new-home-for-the-kedro-blog-and-some-recent-releases-mg6</guid>
      <description>&lt;p&gt;In this post, we describe recent releases to Kedro, Kedro-Viz and some new Kedro datasets. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aYupLtp7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ofli6panzt85bk3zs6qs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aYupLtp7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ofli6panzt85bk3zs6qs.png" alt="Image description" width="592" height="592"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Kedro has a new blog over at &lt;a href="https://kedro.org/blog"&gt;kedro.org/blog&lt;/a&gt;! &lt;/p&gt;

&lt;p&gt;We’ve previously published on &lt;a href="https://medium.com/quantumblack" rel="noreferrer"&gt;QuantumBlack’s Medium channel&lt;/a&gt;, but recent updates and improvements here on the Kedro website mean that we’re now able to bring you a dedicated blog for, and about, the open-source Kedro community. &lt;/p&gt;
&lt;p&gt;We plan to publish a range of articles by contributors from within the team and beyond. If you’re a Kedroid with an idea for a post, please reach out to us using one of the channels on the &lt;a href="https://slack.kedro.org/" rel="noreferrer"&gt;Slack organisation&lt;/a&gt;, or &lt;a href="https://github.com/kedro-org/kedro-devrel/issues?q=is%3Aissue+is%3Aopen+label%3A%22blog+post%22" rel="noreferrer"&gt;raise an issue on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="kedro-releases"&gt;Kedro releases&lt;/h2&gt;
&lt;p&gt;We last gave an update on Kedro in late 2022, when &lt;a href="https://medium.com/quantumblack/the-latest-kedro-developments-9a4d15a7ceb5" rel="noreferrer"&gt;we described the features in Kedro version 0.18.4&lt;/a&gt;. Since then, we’ve released three additional non-breaking versions of Kedro in the 0.18.x series, with the goal of a regular release cadence at the end of most two-week development sprints.&lt;/p&gt;
&lt;p&gt;Some of the highlights of our releases are described below along with links to the full release notes. For each of these releases there’s a straightforward upgrade path with pip or conda. For example, to upgrade to Kedro version 0.18.7 from version 0.18.4:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;pip install kedro==0.18.7&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;p&gt;&lt;code&gt;conda install -c conda-forge kedro==0.18.7&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;We received many contributions to these new versions from our open-source community and want to thank every contributor for taking the time to extend and improve Kedro.&lt;/b&gt;&lt;/p&gt;
&lt;h3&gt;Kedro version 0.18.7 &lt;/h3&gt;
&lt;p&gt;These are the headline changes (You can find all the &lt;a href="https://github.com/kedro-org/kedro/releases/tag/0.18.7" rel="noreferrer"&gt;details about the Kedro 0.18.7 release&lt;/a&gt; on GitHub):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We added new Kedro CLI command &lt;code&gt;kedro jupyter setup&lt;/code&gt; to set up a Jupyter Kernel for Kedro that automatically loads the Kedro extension for ease of use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;kedro package&lt;/code&gt; command now includes the project configuration in a compressed &lt;code&gt;tar.gz&lt;/code&gt; file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’ve added functionality to package and read your configuration as a compressed file. You can now use &lt;code&gt;OmegaConfigLoader&lt;/code&gt; to load configuration from compressed files of zip or tar format. (This feature requires &lt;code&gt;fsspec&amp;gt;=2023.1.0&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In documentation news, we moved seamlessly from &lt;code&gt;kedro.readthedocs.io&lt;/code&gt; to &lt;a href="https://docs.kedro.org/" rel="noreferrer"&gt;docs.kedro.org&lt;/a&gt; in this release. We also made some significant improvements to on-boarding documentation that covers setup for new Kedro users and major changes to the spaceflights tutorial to make it faster to work through. We think it’s a better read. &lt;a href="https://github.com/kedro-org/kedro/issues/new/choose" rel="noreferrer"&gt;Tell us if it’s not&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
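&lt;p&gt;The compressed-configuration mechanism can be pictured with the standard library alone (an illustrative sketch, not Kedro's actual implementation; file names are invented): configuration is bundled into a &lt;code&gt;tar.gz&lt;/code&gt; archive and read back without unpacking it to disk.&lt;/p&gt;

```python
import json
import tarfile
from pathlib import Path

# Write a config file, then bundle it into a compressed archive.
Path("conf").mkdir(exist_ok=True)
Path("conf/parameters.json").write_text(json.dumps({"learning_rate": 0.01}))

with tarfile.open("conf.tar.gz", "w:gz") as archive:
    archive.add("conf/parameters.json")

# Read the configuration straight out of the compressed archive,
# without extracting anything to disk.
with tarfile.open("conf.tar.gz", "r:gz") as archive:
    params = json.load(archive.extractfile("conf/parameters.json"))

print(params["learning_rate"])
```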
&lt;h3&gt;Kedro version 0.18.6&lt;/h3&gt;
&lt;p&gt;This was a small release to fix a bug introduced in Kedro 0.18.5 that was causing experiment tracking in Kedro-Viz to fail. You can find all the &lt;a href="https://github.com/kedro-org/kedro/releases/tag/0.18.6" rel="noreferrer"&gt;details about the release of Kedro version 0.18.6&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;h3&gt;Kedro version 0.18.5&lt;/h3&gt;
&lt;p&gt;In February 2023, we released Kedro version 0.18.5, which introduced a brand new config loader powered by &lt;a href="https://omegaconf.readthedocs.io/en/2.3_branch/" rel="noreferrer"&gt;OmegaConf&lt;/a&gt;. You can now use the &lt;code&gt;omegaconf&lt;/code&gt; syntax with &lt;code&gt;kedro run --params&lt;/code&gt;. &lt;/p&gt;
&lt;p&gt;We also added the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Some improvements to the &lt;code&gt;kedro run&lt;/code&gt; command used in the CLI. One change makes it more consistent: the flags &lt;code&gt;--node&lt;/code&gt;, &lt;code&gt;--tag&lt;/code&gt;, and &lt;code&gt;--load-version&lt;/code&gt; are deprecated in favour of their plural equivalents (&lt;code&gt;--nodes&lt;/code&gt;, &lt;code&gt;--tags&lt;/code&gt;, and &lt;code&gt;--load-versions&lt;/code&gt;) and will be removed in Kedro 0.19.0. Another change means that you can filter and run nodes by node namespace using the &lt;code&gt;--namespace&lt;/code&gt; flag with &lt;code&gt;kedro run&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is now support for using generator functions as nodes, i.e. using &lt;code&gt;yield&lt;/code&gt; instead of &lt;code&gt;return&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We added a new &lt;code&gt;node&lt;/code&gt; argument to all four dataset hooks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
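&lt;p&gt;The difference between the two node styles is easiest to see in plain Python (a sketch of the concept, not a full Kedro node): a generator yields results one at a time, so downstream steps can consume them as they are produced instead of waiting for the whole list.&lt;/p&gt;

```python
def batch_scores(values):
    """Generator style: yield one processed item at a time."""
    for v in values:
        yield v * 2  # each result is available before the rest are computed

def all_scores(values):
    """Conventional style: build the full result, then return it."""
    return [v * 2 for v in values]

stream = batch_scores([1, 2, 3])
print(next(stream))           # first result, produced lazily
print(all_scores([1, 2, 3]))  # entire result, produced eagerly
```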
&lt;p&gt;You can find all the &lt;a href="https://github.com/kedro-org/kedro/releases/tag/0.18.5" rel="noreferrer"&gt;details about the Kedro version 0.18.5 release&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;h2 id="kedro-datasets-releases-"&gt;Kedro datasets releases &lt;/h2&gt;
&lt;p&gt;Kedro provides numerous built-in datasets for various file types and file systems, to save you from having to write the logic for reading or writing data, including Pandas, Spark, Dask, NetworkX, Pickle, and more.&lt;/p&gt;
&lt;p&gt;There have been several datasets contributed by community members over the past months which include the addition of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;snowflake.SnowparkTableDataSet&lt;/code&gt; by &lt;a href="https://github.com/Vladimir-Filimonov" rel="noreferrer"&gt;Vladimir Filimonov&lt;/a&gt; and &lt;a href="https://github.com/heber-urdaneta" rel="noreferrer"&gt;Heber Urdaneta&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;polars.CSVDataSet&lt;/code&gt; by &lt;a href="https://github.com/wmoreiraa" rel="noreferrer"&gt;Walber Moreira&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
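&lt;p&gt;All of these datasets share the same shape: a class that hides I/O details behind load and save methods. A toy version in plain Python (purely illustrative; real Kedro datasets also handle versioning, credentials, and many more formats) looks like this:&lt;/p&gt;

```python
import csv
from pathlib import Path

class SimpleCSVDataSet:
    """Toy dataset with the same load/save shape as Kedro's built-ins."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def save(self, rows: list[dict]) -> None:
        # Persist the data; callers never touch file handles directly.
        with self._filepath.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)

    def load(self) -> list[dict]:
        # Read the data back as a list of dicts.
        with self._filepath.open(newline="") as f:
            return list(csv.DictReader(f))

dataset = SimpleCSVDataSet("scores.csv")
dataset.save([{"name": "a", "score": "1"}])
print(dataset.load())
```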
&lt;p&gt;As we mentioned in “&lt;a href="https://medium.com/quantumblack/keeping-up-with-kedro-the-latest-developments-in-our-development-workflow-framework-cbcc415eea9c" rel="noreferrer"&gt;Keeping up with Kedro&lt;/a&gt;”, Kedro version 0.19.0 will move Kedro’s datasets from the main framework project into a separate package called Kedro-Datasets.&lt;/p&gt;
&lt;h2 id="kedro-viz-releases"&gt;Kedro-Viz releases&lt;/h2&gt;
&lt;p&gt;If you've not yet used it, Kedro-Viz is the interactive development tool for building data science pipelines with Kedro. It comes with an experiment tracking feature enabling you to view and compare different runs of your Kedro project. Check out the &lt;a href="http://demo.kedro.org/" rel="noreferrer"&gt;Kedro-Viz demo at demo.kedro.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We’ve made three releases of Kedro-Viz this year, plus a patch release. You can find further &lt;a href="https://github.com/kedro-org/kedro-viz/releases" rel="noreferrer"&gt;details of the Kedro-Viz releases on GitHub&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;To get the latest release of Kedro-Viz, you can use pip: &lt;/p&gt;
&lt;p&gt;&lt;code&gt;pip install kedro-viz==6.0.0&lt;/code&gt; &lt;/p&gt;
&lt;p&gt;or npm &lt;/p&gt;
&lt;p&gt;&lt;code&gt;npm install @quantumblack/kedro-viz@latest&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Here’s a summary of what we’ve been working on:&lt;/p&gt;
&lt;h3&gt;Kedro-Viz version 6.0.0&lt;/h3&gt;
&lt;p&gt;In this release we bumped the major version to 6.0.0 because of a change in the frontend React code (we bumped the minimum version of React from 16.8.6 to 17.0.2). Additional changes include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We added a change so you can now see a preview of your data in the metadata panel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can remove metrics plots from the metadata panel and add links to the plots in experiment tracking. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can also link plot and JSON dataset names from experiment tracking to the flowchart.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kedro-Viz no longer depends on pandas or Plotly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Kedro-Viz versions 5.3.0 and 5.2.0&lt;/h3&gt;
&lt;p&gt;We introduced a raft of updates to experiment tracking, the largest being the addition of time series &amp;amp; parallel coordinates metrics plots and delta values.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We’ve enabled the display of json objects with &lt;code&gt;react-json-viewer&lt;/code&gt; in experiment tracking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We added a feature to show/hide modular pipelines on the pipeline flowchart.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It’s now possible to retrieve and share URL parameters for each element/section in the flowchart.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We've recently published a &lt;a href="https://kedro.org/blog/experiment-tracking-with-kedro" rel="noreferrer"&gt;blog post about experiment tracking&lt;/a&gt; to highlight the latest features and discuss what is coming next.&lt;/p&gt;
&lt;h2 id="whats-next-for-the-kedro-projects"&gt;What's next for the Kedro projects?&lt;/h2&gt;
&lt;p&gt;We have a broad range of &lt;a href="https://github.com/kedro-org/kedro/milestones" rel="noreferrer"&gt;milestones for the Kedro framework&lt;/a&gt; that cover areas such as integration with Databricks, enhancements for Jupyter Notebook users and ongoing changes such as the &lt;a href="https://medium.com/quantumblack/the-latest-kedro-developments-9a4d15a7ceb5" rel="noreferrer"&gt;transition of datasets into their own package&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;On the to-do list for Kedro-Viz, we’ve included enhanced navigation between flowchart and experiment tracking and collaboration features within Kedro-Viz.&lt;/p&gt;
&lt;p&gt;Stand by for a pair of virtual Kedro showcases on 5th April 2023 (&lt;a href="https://www.meetup.com/meetup-group-zltyafrj/events/292529526/" rel="noreferrer"&gt;9am BST&lt;/a&gt; and &lt;a href="https://www.meetup.com/meetup-group-zltyafrj/events/292533186/" rel="noreferrer"&gt;4pm BST&lt;/a&gt;) to demonstrate some of the features added in the recent releases to the global community. &lt;/p&gt;
&lt;p&gt;To suggest features to us, report bugs, or just see what we’re working on right now, visit the Kedro projects on &lt;a href="https://github.com/kedro-org" rel="noreferrer"&gt;GitHub&lt;/a&gt;. We welcome every contribution, large or small.&lt;/p&gt;

</description>
      <category>python</category>
      <category>kedro</category>
      <category>datascience</category>
      <category>news</category>
    </item>
  </channel>
</rss>
