<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Forem: Kedro</title>
    <description>The latest articles on Forem by Kedro (@kedro).</description>
    <link>https://forem.com/kedro</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F6802%2Fb8bd874f-1d94-48f7-a5a5-fdbd660bbf62.png</url>
      <title>Forem: Kedro</title>
      <link>https://forem.com/kedro</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://forem.com/feed/kedro"/>
    <language>en</language>
    <item>
      <title>How to integrate Kedro and Databricks Connect</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Thu, 21 Sep 2023 14:41:03 +0000</pubDate>
      <link>https://forem.com/kedro/how-to-integrate-kedro-and-databricks-connect-3ep7</link>
      <guid>https://forem.com/kedro/how-to-integrate-kedro-and-databricks-connect-3ep7</guid>
      <description>&lt;p&gt;In recent months we've updated Kedro documentation to illustrate &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/index.html" rel="noopener noreferrer"&gt;three different ways of integrating Kedro with Databricks&lt;/a&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;You can choose a &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/databricks_deployment_workflow.html" rel="noopener noreferrer"&gt;workflow based on Databricks jobs&lt;/a&gt; to deploy a project that has finished development.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;For faster iteration on changes, the workflow documented in &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/databricks_notebooks_development_workflow.html" rel="noopener noreferrer"&gt;"Use a Databricks workspace to develop a Kedro project"&lt;/a&gt; suits those who prefer to develop and test their projects directly within Databricks notebooks, avoiding the overhead of setting up and syncing a local development environment with Databricks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Alternatively, you can work locally in an IDE as described by the workflow documented in &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/databricks_ide_development_workflow.html" rel="noopener noreferrer"&gt;"Use an IDE, dbx and Databricks Repos to develop a Kedro project"&lt;/a&gt;. You can use your IDE’s capabilities for faster, error-free development, while testing on Databricks. This is ideal if you’re in the early stages of learning Kedro, or if your project requires constant testing and adjustments. However, the experience is still not perfect: you must sync your work inside Databricks with dbx and run the pipeline inside a notebook. Debugging has a lengthy setup for each change and there is less flexibility than inside an IDE.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this blog post, Diego Lira, a Specialist Data Scientist and a client-facing member of &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/how-we-help-clients" rel="noopener noreferrer"&gt;QuantumBlack, AI by McKinsey&lt;/a&gt;, explains how to use Databricks Connect with Kedro for a development experience that works completely inside an IDE. He recommends this solution when the data-heavy parts of your pipelines are in PySpark. If part of your workflow runs in plain Python (e.g. pandas) rather than in Spark, Databricks Connect will download your dataframe to your local environment to continue running the workflow. This can cause performance issues and introduce compliance risks, because the data has left the Databricks workspace.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Databricks Connect?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/dev-tools/databricks-connect-ref.html" rel="noopener noreferrer"&gt;Databricks Connect&lt;/a&gt; is Databricks' official method of interacting with a remote Databricks instance while using a local environment.&lt;/p&gt;

&lt;p&gt;To configure Databricks Connect for use with Kedro, follow the official setup guide to create a &lt;code&gt;.databrickscfg&lt;/code&gt; file containing your access token. Install the client with &lt;code&gt;pip install databricks-connect&lt;/code&gt;; it substitutes your local SparkSession:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;databricks.connect&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;DatabricksSession&lt;/span&gt;
&lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;DatabricksSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
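For reference, this is the general shape of a minimal `.databrickscfg` profile the client reads. All values below are placeholders; the fields follow the official Databricks CLI configuration format, and the `cluster_id` entry is used by the hook shown later in this post:

```ini
[DEFAULT]
host       = https://adb-1234567890123456.7.azuredatabricks.net/
token      = dapi1234567890abcdef
cluster_id = 0123-456789-abcdefgh
```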



&lt;p&gt;Spark commands are sent and executed on the cluster, and results are returned to the local environment as needed. In the context of Kedro, this has an amazing effect: as long as you don’t explicitly ask for the data to be collected in your local environment, operations will be executed only when saving the outputs of your node. If you use datasets saved to a Databricks path, there will be no performance hit for transferring data between environments.&lt;/p&gt;

&lt;p&gt;This tool was recently made available as a thin client for &lt;a href="https://spark.apache.org/docs/latest/spark-connect-overview.html" rel="noopener noreferrer"&gt;Spark Connect&lt;/a&gt;, one of the highlights of Spark 3.4, and configuration is simpler than in earlier versions. If your cluster doesn’t support the current Databricks Connect, please refer to the &lt;a href="https://docs.databricks.com/en/dev-tools/databricks-connect-legacy.html" rel="noopener noreferrer"&gt;legacy documentation&lt;/a&gt;, as previous versions had different limitations.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/p9IRFSjuLBE"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h2&gt;
  
  
  How can I use a Databricks Connect workflow with Kedro?
&lt;/h2&gt;

&lt;p&gt;Databricks Connect (and Spark Connect) enables us to have a completely local development flow, while all artifacts can be remote objects. Using Delta tables for all our datasets and MLflow for model objects and tracking, nothing needs to be saved locally. Developers can take full advantage of the Databricks stack while maintaining their full IDE usage.    &lt;/p&gt;

&lt;h2&gt;
  
  
  How to use Databricks as your PySpark engine
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.kedro.org/en/stable/integrations/pyspark_integration.html" rel="noopener noreferrer"&gt;Kedro supports integration with PySpark&lt;/a&gt; through the use of Hooks. To configure and enable your Databricks session through Spark Connect, simply set up your &lt;code&gt;SPARK_REMOTE&lt;/code&gt; environment variable with your Databricks configuration. Here is an example implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro.framework.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hook_impl&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SparkHooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@hook_impl&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;after_context_created&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Initialises a SparkSession using the config
        from Databricks.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="nf"&gt;set_databricks_creds&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;_spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Builder&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;set_databricks_creds&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Pass databricks credentials as OS variables if using the local machine.
    If you set DATABRICKS_PROFILE env variable, it will choose the desired profile on .databrickscfg,
    otherwise it will use the DEFAULT profile in databrickscfg.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;DEFAULT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DATABRICKS_PROFILE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEFAULT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SPARK_HOME&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/databricks/spark&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;configparser&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ConfigParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;home&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.databrickscfg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;host&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;host&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;//&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()[:&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# remove "https://" and final "/" from path
&lt;/span&gt;        &lt;span class="n"&gt;cluster_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cluster_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;DEFAULT&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

        &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SPARK_REMOTE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sc://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;host&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:443/;token=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;;x-databricks-cluster-id=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;cluster_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This example populates &lt;code&gt;SPARK_REMOTE&lt;/code&gt; from your local &lt;code&gt;.databrickscfg&lt;/code&gt; file. The remote connection is not set up if the project is run from inside Databricks (that is, if &lt;code&gt;SPARK_HOME&lt;/code&gt; points to Databricks), so you're still able to run it in the usual &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/databricks_ide_development_workflow.html" rel="noopener noreferrer"&gt;hybrid development flow&lt;/a&gt;. Notice that you don’t need to set up a &lt;code&gt;spark.yml&lt;/code&gt; file, as is common in other PySpark templates: you’re not passing any configuration, just using the cluster that is already in Databricks. You also don’t need to load any extra Spark files (e.g. JARs), because you are using a thin Spark Connect client.&lt;/p&gt;
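The connection string the hook assembles can be exercised in isolation. Here is a minimal, locally runnable sketch of the same logic; the host, token, and cluster id are made-up placeholders:

```python
def strip_host(url: str) -> str:
    # Mirror the hook: drop the "https://" scheme and any trailing "/".
    return url.split("//", 1)[1].strip().rstrip("/")


def build_spark_remote(host: str, token: str, cluster_id: str) -> str:
    # Spark Connect URL format expected by Databricks Connect in SPARK_REMOTE.
    return f"sc://{host}:443/;token={token};x-databricks-cluster-id={cluster_id}"


host = strip_host("https://adb-1234567890123456.7.azuredatabricks.net/")
print(build_spark_remote(host, "dapi1234567890abcdef", "0123-456789-abcdefgh"))
```

Setting the resulting string as the `SPARK_REMOTE` environment variable before the SparkSession is created is all the hook needs to do.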

&lt;p&gt;Now all the Spark calls in your pipelines will automatically use the remote cluster; there's no need to change anything in your code. However, notebooks might also be part of your project. To use the remote cluster there without relying on environment variables, you can use &lt;code&gt;DatabricksSession&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from databricks.connect import DatabricksSession
spark = DatabricksSession.builder.getOrCreate()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When using the remote cluster, it's best to avoid data transfers between environments, so all catalog entries should reference remote locations. Using &lt;code&gt;kedro_datasets.databricks.ManagedTableDataSet&lt;/code&gt; as your dataset type in the catalog also allows you to use Delta table features.&lt;/p&gt;
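As a sketch, a catalog entry along these lines keeps the data in Databricks end to end. The entry, table, database, and catalog names are hypothetical; check the kedro-datasets API reference for the full argument list:

```yaml
model_input_table:
  type: databricks.ManagedTableDataSet
  catalog: my_catalog        # Unity Catalog name (optional on some setups)
  database: my_schema        # schema the table is registered in
  table: model_input_table
  write_mode: overwrite
```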

&lt;h2&gt;
  
  
  How to enable MLflow on Databricks
&lt;/h2&gt;

&lt;p&gt;Using &lt;a href="https://mlflow.org/" rel="noopener noreferrer"&gt;MLflow&lt;/a&gt; to save all your artifacts directly to Databricks leads to a powerful workflow. For this you can use &lt;a href="https://github.com/Galileo-Galilei/kedro-mlflow" rel="noopener noreferrer"&gt;kedro-mlflow&lt;/a&gt;. Note that &lt;code&gt;kedro-mlflow&lt;/code&gt; is built on top of the mlflow library and although the databricks config cannot be found in its documentation, you can read more about it in the &lt;a href="https://mlflow.org/docs/latest/index.html" rel="noopener noreferrer"&gt;documentation from mlflow directly&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;After doing the &lt;a href="https://kedro-mlflow.readthedocs.io/en/stable/source/02_installation/02_setup.html#activate-kedro-mlflow-in-your-kedro-project" rel="noopener noreferrer"&gt;basic setup of the library&lt;/a&gt; in your project, you should see a &lt;code&gt;mlflow.yml&lt;/code&gt; configuration file. In this file, change the following to set up your URI:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;server:
    mlflow_tracking_uri: databricks # if null, will use mlflow.get_tracking_uri() as a default
    mlflow_registry_uri: databricks # if null, mlflow_tracking_uri will be used as mlflow default
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Set up your experiment name (it should be a valid Databricks path):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;experiment:
    name: /Shared/your_experiment_name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, all your parameters will be logged, and objects such as models and metrics can be saved as MLflow objects referenced in the catalog.&lt;/p&gt;
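For instance, catalog entries along these lines persist a model and metrics as MLflow objects. This is a sketch assuming the dataset classes provided by kedro-mlflow; the entry names are hypothetical, and class names may differ between kedro-mlflow versions, so verify them against its documentation:

```yaml
regressor:
  type: kedro_mlflow.io.models.MlflowModelLoggerDataSet
  flavor: mlflow.sklearn

training_metrics:
  type: kedro_mlflow.io.metrics.MlflowMetricsDataSet
```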

&lt;h2&gt;
  
  
   Limitations of this workflow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.databricks.com/dev-tools/databricks-connect-ref.html" rel="noopener noreferrer"&gt;Databricks Connect&lt;/a&gt;, built on top of Spark Connect, supports only recent versions of Spark. I recommend looking at the detailed limitations in the official documentation for specific guidance, such as the upload limit of only 128MB for dataframes.&lt;/p&gt;

&lt;p&gt;Users also need to be aware that &lt;code&gt;.toPandas()&lt;/code&gt; will move the data into your local pandas environment. Saving results back as MLflow objects is the preferred way to avoid local objects; the &lt;a href="https://kedro-mlflow.readthedocs.io/en/stable/source/04_experimentation_tracking/index.html" rel="noopener noreferrer"&gt;kedro-mlflow documentation&lt;/a&gt; shows examples for all supported object types.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>In the pipeline: September 2023</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Wed, 06 Sep 2023 08:49:16 +0000</pubDate>
      <link>https://forem.com/kedro/in-the-pipeline-september-2023-14ek</link>
      <guid>https://forem.com/kedro/in-the-pipeline-september-2023-14ek</guid>
      <description>&lt;p&gt;This month: a roundup of the summer’s Kedro news, some release updates, and our top picks from recent articles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kedro team news
&lt;/h2&gt;

&lt;p&gt;Over the last few months, we’ve been happy to welcome some new team members to the Kedro and Kedro-Viz teams, who have also joined our &lt;a href="https://docs.kedro.org/en/stable/contribution/technical_steering_committee.html" rel="noopener noreferrer"&gt;Technical Steering Committee&lt;/a&gt;. Welcome &lt;a href="https://github.com/DimedS" rel="noopener noreferrer"&gt;Dmitry Sorokin&lt;/a&gt;, &lt;a href="https://github.com/jitu5" rel="noopener noreferrer"&gt;Jitendra Gundaniya&lt;/a&gt;, &lt;a href="https://github.com/lrcouto" rel="noopener noreferrer"&gt;Laura Couto&lt;/a&gt;, &lt;a href="https://github.com/ravi-kumar-pilla" rel="noopener noreferrer"&gt;Ravi Kumar Pilla&lt;/a&gt;, and &lt;a href="https://github.com/vladimir-mck" rel="noopener noreferrer"&gt;Vladimir Nikolic&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;We are also pleased to announce a Kedro baby, delivered safely by one of the team, at the end of July!&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributors news
&lt;/h2&gt;

&lt;p&gt;We reworked the Kedro contributors guide in August, and moved it to the &lt;a href="https://github.com/kedro-org/kedro/wiki" rel="noopener noreferrer"&gt;Kedro wiki&lt;/a&gt;. There are loads of different ways to contribute to Kedro and if you want to get involved, we encourage you to look at the &lt;a href="https://github.com/kedro-org/kedro/wiki/Contribute-to-Kedro#how-to-contribute" rel="noopener noreferrer"&gt;table that introduces the Kedro contributor guide&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwstew1ucbg0zamtlgukn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwstew1ucbg0zamtlgukn.png" alt="These are some of the ways to contribute to Kedro" width="800" height="587"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you spot an article, podcast or video that discusses Kedro, you can also contribute by adding it to the “&lt;a href="https://github.com/kedro-org/awesome-kedro" rel="noopener noreferrer"&gt;Awesome Kedro&lt;/a&gt;” repository, or letting us know on &lt;a href="https://slack.kedro.org" rel="noopener noreferrer"&gt;Slack&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;There have been some amazing contributions in recent weeks, including the &lt;a href="https://pypi.org/project/vineyard-kedro/" rel="noopener noreferrer"&gt;kedro-vineyard plugin&lt;/a&gt; for efficient intermediate sharing in Kedro pipelines, &lt;a href="https://pypi.org/project/kedro-graphql/#data" rel="noopener noreferrer"&gt;kedro-graphql&lt;/a&gt; for serving Kedro projects as a GraphQL API, and &lt;a href="https://pypi.org/project/kedro-pandera/" rel="noopener noreferrer"&gt;kedro-pandera&lt;/a&gt; to bring data validation to your Kedro projects.&lt;/p&gt;

&lt;h2&gt;
  
  
  Release news
&lt;/h2&gt;

&lt;p&gt;August 2023 saw a set of &lt;a href="https://linen-slack.kedro.org/t/15611709/hi-channel-we-are-excited-to-announce-several-new-releases-m#5fa69a60-84b7-4b82-adca-a16f87fac6b1" rel="noopener noreferrer"&gt;releases to introduce Python 3.11&lt;/a&gt; support across Kedro, Kedro-Viz and Kedro datasets.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zzibh5l4xenccdt4zt8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zzibh5l4xenccdt4zt8.jpg" alt="All the Kedro things support Python 3.11" width="667" height="500"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kedro-org/kedro/releases/tag/0.18.13" rel="noopener noreferrer"&gt;Kedro version 0.18.13&lt;/a&gt; included these major features and improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Added support for Python 3.11.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Added new &lt;code&gt;OmegaConfigLoader&lt;/code&gt; features: registering of custom resolvers through &lt;code&gt;CONFIG_LOADER_ARGS&lt;/code&gt; and support for global variables.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Added &lt;code&gt;kedro catalog resolve&lt;/code&gt; CLI command that resolves dataset factories in the catalog with any explicit entries in the project pipeline.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Simplified the &lt;code&gt;conf&lt;/code&gt; folder structure for modular pipelines and updated &lt;code&gt;kedro pipeline create&lt;/code&gt; and &lt;code&gt;kedro catalog create&lt;/code&gt; accordingly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Made various updates to the Kedro project template and Kedro starters: use of OmegaConfigLoader, transition from &lt;code&gt;setup.py&lt;/code&gt; to &lt;code&gt;pyproject.toml&lt;/code&gt;, and updated for the simplified &lt;code&gt;conf&lt;/code&gt; structure.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/kedro-org/kedro-viz/releases/tag/v6.5.0" rel="noopener noreferrer"&gt;Kedro Viz version 6.5&lt;/a&gt; added support for Python 3.11, while &lt;a href="https://github.com/kedro-org/kedro-viz/releases/tag/v6.4.0" rel="noopener noreferrer"&gt;Kedro Viz version 6.4&lt;/a&gt; added two new features: feature hint cards to highlight key features of Kedro Viz and support for displaying dataset statistics in the metadata panel for further investigation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/kedro-org/kedro-plugins/releases/tag/kedro-datasets-1.7.0" rel="noopener noreferrer"&gt;Kedro Datasets version 1.7.0&lt;/a&gt; added &lt;code&gt;polars.GenericDataSet&lt;/code&gt;, a dataset backed by &lt;a href="https://www.pola.rs/" rel="noopener noreferrer"&gt;polars&lt;/a&gt;, a lightning fast dataframe package built entirely using Rust. &lt;a href="https://github.com/kedro-org/kedro-plugins/releases/tag/kedro-datasets-1.6.0" rel="noopener noreferrer"&gt;Kedro Datasets version 1.6.0&lt;/a&gt; added support for Python 3.11.&lt;/p&gt;

&lt;h2&gt;
  
  
  Recently on the Kedro blog
&lt;/h2&gt;

&lt;p&gt;In the last few weeks we’ve published the following on the Kedro blog:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/how-to-integrate-kedro-and-databricks-connect" rel="noopener noreferrer"&gt;How to integrate Kedro and Databricks Connect&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/managed-delta-tables-kedro-dataset" rel="noopener noreferrer"&gt;How to use Databricks managed Delta tables in a Kedro project&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/kedro-dataset-for-spark-structured-streaming" rel="noopener noreferrer"&gt;A new Kedro dataset for Spark Structured Streaming&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/collaborative-experiment-tracking-in-kedro-viz" rel="noopener noreferrer"&gt;Collaborative experiment tracking in Kedro-Viz&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://kedro.org/blog/build-a-custom-kedro-runner" rel="noopener noreferrer"&gt;Get up to speed: How to build a custom Kedro runner&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We’re always looking for collaborators to write about their experiences using Kedro, particularly if you’re working with Kedro datasets or converting an existing project to use Kedro. Get in touch with us on our &lt;a href="https://slack.kedro.org" rel="noopener noreferrer"&gt;Slack workspace&lt;/a&gt; to tell us your story.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nnuhluwy5em5oqbh0gb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nnuhluwy5em5oqbh0gb.png" alt="Powered by Kedro badge" width="526" height="138"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What we’ve learned
&lt;/h2&gt;

&lt;p&gt;We really enjoyed reading more on Medium about the &lt;a href="https://medium.com/cncf-vineyard/efficient-data-sharing-in-data-science-pipelines-on-kubernetes-bb42d36c739" rel="noopener noreferrer"&gt;Kedro Vineyard plugin&lt;/a&gt;, which uses Vineyard, a cloud-native in-memory data manager, for efficient data sharing in data science pipelines on Kubernetes.&lt;/p&gt;

&lt;p&gt;Quix published an interesting article called “&lt;a href="https://www.notion.so/In-the-pipeline-September-2023-39eeb4c7219442b3b0dfc7df9d854b4d?pvs=21" rel="noopener noreferrer"&gt;Bridging the gap between data scientists and engineers in machine learning workflows&lt;/a&gt;” which is something we regularly discuss within the team.&lt;/p&gt;

&lt;p&gt;We found a &lt;a href="https://github.com/madziejm/project-fontr" rel="noopener noreferrer"&gt;super-interesting project about font recognition&lt;/a&gt; that uses Kedro.&lt;/p&gt;

&lt;p&gt;And finally, we enjoyed reading more about &lt;a href="https://medium.com/quantumblack/kedro-goes-streaming-34e1094c354c" rel="noopener noreferrer"&gt;data streaming with Kedro&lt;/a&gt; over on the QuantumBlack Medium channel.&lt;/p&gt;

&lt;p&gt;That’s it for this edition!&lt;/p&gt;

</description>
      <category>kedro</category>
      <category>python</category>
      <category>datascience</category>
      <category>news</category>
    </item>
    <item>
      <title>How to use Databricks managed Delta tables in a Kedro project</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Thu, 17 Aug 2023 08:55:07 +0000</pubDate>
      <link>https://forem.com/kedro/how-to-use-databricks-managed-delta-tables-in-a-kedro-project-jj</link>
      <guid>https://forem.com/kedro/how-to-use-databricks-managed-delta-tables-in-a-kedro-project-jj</guid>
      <description>&lt;p&gt;In this blog post, we'll guide you through the specifics of building a Kedro project that uses managed Delta tables in Databricks using the newly-released &lt;a href="https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets/databricks" rel="noopener noreferrer"&gt;ManagedTableDataSet&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kedro?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kedro.org" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt; is a toolbox for production-ready data science. It's an open-source Python framework that enables the development of clean data science code, borrowing concepts from software engineering and applying them to machine-learning projects. A Kedro project provides scaffolding for complex data and machine-learning pipelines. It enables developers to spend less time on tedious "plumbing" and focus on solving new problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Databricks?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.databricks.com/" rel="noopener noreferrer"&gt;Databricks&lt;/a&gt; is a unified data analytics platform designed for simplifying big data processing and free-form data exploration at any scale. Based on &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;, an open-source distributed computing system, Databricks provides a collaborative cloud-based environment where users can process large amounts of data.&lt;/p&gt;

&lt;p&gt;The platform provides collaborative workspaces (notebooks) and computational resources (clusters) to run code with. Clusters are groups of nodes that run Apache Spark. Notebooks are collaborative web-based interfaces where users can write and execute code on an attached cluster.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why use Kedro on Databricks?
&lt;/h2&gt;

&lt;p&gt;As we've described, Kedro offers a framework for building modular and scalable data pipelines, while Databricks provides a platform for running Spark jobs and managing data. You can combine Kedro and Databricks to build and deploy data pipelines and get the best of both worlds. Kedro's open-source framework will help you to build well-organised and maintainable pipelines, while Databricks' platform will provide you with the scalability you need to run your pipeline in production. Check out the &lt;a href="https://docs.kedro.org/en/stable/deployment/databricks/index.html" rel="noopener noreferrer"&gt;recently-updated Kedro documentation&lt;/a&gt; for a set of workflow options for integrating Kedro projects and Databricks. (Additionally, the third-party &lt;a href="https://github.com/Galileo-Galilei/kedro-mlflow" rel="noopener noreferrer"&gt;kedro-mlflow&lt;/a&gt; plugin integrates &lt;a href="https://mlflow.org/docs/latest/index.html" rel="noopener noreferrer"&gt;mlflow&lt;/a&gt; capabilities inside Kedro projects to enhance reproducibility for machine learning experimentation).&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Kedro datasets?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://docs.kedro.org/en/stable/data/data_catalog.html" rel="noopener noreferrer"&gt;Kedro datasets&lt;/a&gt; are abstractions for reading and loading data, designed to decouple these operations from your business logic. These datasets manage reading and writing data from a variety of sources, while also ensuring consistency, tracking, and versioning. They allow users to maintain focus on core data processing, leaving data I/O tasks to Kedro.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is managed data in Databricks?
&lt;/h2&gt;

&lt;p&gt;To understand the concept of managed data in Databricks, it is first necessary to outline how Databricks organises data. At the highest level, Databricks uses metastores to store the metadata associated with data objects. Databricks Unity Catalog is one such metastore. It provides data governance and management across multiple Databricks workspaces. The metastore organises tables (where your data is stored) in a hierarchical structure.&lt;/p&gt;

&lt;p&gt;The highest level of organisation in this hierarchy is the catalog: a collection of databases (also referred to as schemas in Databricks' terminology). A database is the second level of organisation and is itself a collection of tables, which form the third level of the hierarchy.&lt;/p&gt;

&lt;p&gt;A table is structured data, stored as a directory of files on cloud object storage. By default, Databricks creates tables as Delta tables, which store data using the &lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; format. &lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt; is an open-source storage format that offers ACID transactions, time travel and audit history.&lt;/p&gt;

&lt;p&gt;Databricks tables belong to one of two categories: managed and unmanaged (external) tables. Databricks manages both the data and associated metadata of managed tables. If you drop a managed table, you will delete the underlying data. The data of a managed table resides in the location of the database to which it is registered.&lt;/p&gt;

&lt;p&gt;On the other hand, for unmanaged tables, Databricks only manages the metadata. If you drop an unmanaged table, you will not delete the underlying data. These tables require a specified location during creation.&lt;/p&gt;
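The distinction shows up in the SQL used to create each kind of table. The snippet below is an illustrative sketch (the table names and the storage path are made-up placeholders): omitting `location` yields a managed table, while specifying one yields an unmanaged (external) table.

```sql
%sql
-- Managed table: Databricks controls both the data and the metadata;
-- dropping it deletes the underlying files.
create table if not exists managed_example (id int, reading int);

-- Unmanaged (external) table: Databricks manages only the metadata;
-- the path below is a hypothetical placeholder for your own storage location.
create table if not exists external_example (id int, reading int)
location 'abfss://my-container@my-account.dfs.core.windows.net/external_example';
```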

&lt;h2&gt;
  
  
  How to work with managed Delta tables using Kedro
&lt;/h2&gt;

&lt;p&gt;Let's demonstrate how to use the &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.databricks.ManagedTableDataSet.html" rel="noopener noreferrer"&gt;ManagedTableDataSet&lt;/a&gt; with a simple example on Databricks. You'll need to open a new Databricks notebook and attach it to a cluster to follow along with the rest of this example, which runs on a workspace using a Hive metastore. We'll create a dataset containing weather readings, save it to a managed Delta table on Databricks, append some data, and access a specific table version to showcase Delta Lake's time travel capabilities.&lt;/p&gt;

&lt;p&gt;Run each code snippet in this section in a separate notebook cell.&lt;/p&gt;

&lt;p&gt;The first steps are to set up your workspace by creating a &lt;code&gt;weather&lt;/code&gt; database in your metastore and installing Kedro. Run the following SQL code to create the database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%sql
create database if not exists weather;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To install Kedro and the &lt;code&gt;ManagedTableDataSet&lt;/code&gt;, use the &lt;code&gt;%pip&lt;/code&gt; magic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;%pip install kedro kedro-datasets[databricks.ManagedTableDataSet]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The first part of our program will create some weather data. We'll create a Spark DataFrame with four columns: &lt;code&gt;date&lt;/code&gt;, &lt;code&gt;location&lt;/code&gt;, &lt;code&gt;temperature&lt;/code&gt;, and &lt;code&gt;humidity&lt;/code&gt; to store our weather data. Then, we'll use a new instance of &lt;code&gt;ManagedTableDataSet&lt;/code&gt; to save our DataFrame to a Delta table called &lt;code&gt;2023_06_22&lt;/code&gt; (the day of the readings) in the &lt;code&gt;weather&lt;/code&gt; database.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql.types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro_datasets.databricks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ManagedTableDataSet&lt;/span&gt;

&lt;span class="n"&gt;spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Define schema
&lt;/span&gt;&lt;span class="n"&gt;schema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StructType&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;location&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;StringType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;StructField&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;humidity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;IntegerType&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="c1"&gt;# Create DataFrame
&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;London&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;27&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;39&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Warsaw&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;40&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bucharest&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;38&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;spark_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Create a ManagedTableDataSet instance using a new table named '2023_06_22'
&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManagedTableDataSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023_06_22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Save the DataFrame to the table
&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To load our data back into a DataFrame, we use the &lt;code&gt;load&lt;/code&gt; method on &lt;code&gt;ManagedTableDataSet&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Load the table data into a DataFrame
&lt;/span&gt;&lt;span class="n"&gt;reloaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Print the first 3 rows of the DataFrame
&lt;/span&gt;&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reloaded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;take&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code loads the data from the &lt;code&gt;2023_06_22&lt;/code&gt; table in the &lt;code&gt;weather&lt;/code&gt; database back into a Spark DataFrame and shows the first three rows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;|   date   | location | temperature | humidity |
|:--------:|:--------:|:-----------:|:--------:|
|2023-06-22|Bucharest |     32      |   38     |
|2023-06-22|  London  |     27      |   39     |
|2023-06-22|  Warsaw  |     28      |   40     |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's say we take some more weather readings later in the day and want to add them to our Delta table. To do this, we can write to it using a new instance of &lt;code&gt;ManagedTableDataSet&lt;/code&gt; initialised with &lt;code&gt;"append"&lt;/code&gt; passed in as an argument to &lt;code&gt;write_mode&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Append new rows to the data
&lt;/span&gt;&lt;span class="n"&gt;new_rows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Cairo&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;25&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;2023-06-22&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Lisbon&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;28&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;44&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;spark_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createDataFrame&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_rows&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;schema&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManagedTableDataSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023_06_22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;write_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;append&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;spark_df&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The code above adds new rows for Cairo and Lisbon to our Delta table, which creates a new version of the table.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;ManagedTableDataSet&lt;/code&gt; class allows for saving data with three different write modes: &lt;code&gt;overwrite&lt;/code&gt;, &lt;code&gt;append&lt;/code&gt;, and &lt;code&gt;upsert&lt;/code&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;overwrite&lt;/code&gt; mode will completely replace the current data in the table with the new data.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;append&lt;/code&gt; mode will add new data to the existing table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;upsert&lt;/code&gt; mode updates existing rows and inserts new rows, based on a specified primary key. Notably, if the table doesn't exist at save time, &lt;code&gt;upsert&lt;/code&gt; behaves like &lt;code&gt;append&lt;/code&gt;, inserting the data into a new table.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
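In a full Kedro project you would normally configure these write modes declaratively in `catalog.yml` instead of instantiating the dataset in code. The entry below is a hedged sketch (the entry name and the `primary_key` value are illustrative):

```yaml
# Illustrative catalog.yml entry; the dataset name and primary_key are examples.
weather_readings:
  type: databricks.ManagedTableDataSet
  database: weather
  table: 2023_06_22
  write_mode: upsert
  primary_key: location  # required for upsert mode
```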

&lt;p&gt;Suppose we later want to access our data as it appeared earlier in the day when we had only taken three readings. The &lt;code&gt;ManagedTableDataSet&lt;/code&gt; class supports accessing different versions of the Delta table. We can access a specific version by defining a Kedro &lt;code&gt;Version&lt;/code&gt; and passing it into a new instance of &lt;code&gt;ManagedTableDataSet&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro.io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Version&lt;/span&gt;

&lt;span class="c1"&gt;# Load version 0 of the table
&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManagedTableDataSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023_06_22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reloaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reloaded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Load version 1 of the table
&lt;/span&gt;&lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ManagedTableDataSet&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;database&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;weather&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;table&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2023_06_22&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;version&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Version&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;load&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;save&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;reloaded&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weather&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;display&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reloaded&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You will see two rendered tables as the output of running this code. The first corresponds to version 0 of the &lt;code&gt;2023_06_22&lt;/code&gt; table, while the second corresponds to version 1:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;|   date   | location | temperature | humidity |
|:--------:|:--------:|:-----------:|:--------:|
|2023-06-22|Bucharest |     32      |   38     |
|2023-06-22|  London  |     27      |   39     |
|2023-06-22|  Warsaw  |     28      |   40     |

|   date   | location | temperature | humidity |
|:--------:|:--------:|:-----------:|:--------:|
|2023-06-22|Bucharest |     32      |   38     |
|2023-06-22|  London  |     27      |   39     |
|2023-06-22|  Warsaw  |     28      |   40     |
|2023-06-22|  Lisbon  |     28      |   44     |
|2023-06-22|  Cairo   |     35      |   25     |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And that's it! We've put together a simple program to show some of the usual tasks that &lt;code&gt;ManagedTableDataSet&lt;/code&gt; facilitates, making it easy to save, load, and manage versions of your data in Delta tables on Databricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Databricks is a fast-growing deployment vector for Kedro projects. This blog post has demonstrated how to combine the power of both Kedro and Databricks with an open-source &lt;code&gt;ManagedTableDataSet&lt;/code&gt; that enables streamlined data I/O operations when deploying a Kedro project on Databricks. &lt;code&gt;ManagedTableDataSet&lt;/code&gt; empowers you to spend more time implementing the business logic of your data pipeline or machine learning workflow and less time manually handling data.&lt;/p&gt;

</description>
      <category>kedro</category>
      <category>python</category>
      <category>databricks</category>
      <category>deltalake</category>
    </item>
    <item>
      <title>A new Kedro dataset for Spark Structured Streaming</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Wed, 12 Jul 2023 07:36:25 +0000</pubDate>
      <link>https://forem.com/kedro/a-new-kedro-dataset-for-spark-structured-streaming-n39</link>
      <guid>https://forem.com/kedro/a-new-kedro-dataset-for-spark-structured-streaming-n39</guid>
      <description>&lt;p&gt;This article guides data practitioners on how to set up a Kedro project to use the new &lt;code&gt;SparkStreaming&lt;/code&gt; Kedro dataset, with example use cases, and a deep-dive on some design considerations. It's meant for data practitioners familiar with Kedro so we'll not be covering the basics of a project, but you can familiarise yourself with them in the &lt;a href="https://docs.kedro.org/en/stable/get_started/install.html" rel="noopener noreferrer"&gt;Kedro documentation&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Kedro?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://kedro.org" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt; is an open-source Python toolbox that applies software engineering principles to data science code. It makes it easier for a team to apply software engineering principles to data science code, which reduces the time spent rewriting data science experiments so that they are fit for production.&lt;/p&gt;

&lt;p&gt;Kedro was born at QuantumBlack to solve the challenges faced regularly in data science projects and promote teamwork through standardised team workflows. It is now hosted by the &lt;a href="https://lfaidata.foundation/" rel="noopener noreferrer"&gt;LF AI &amp;amp; Data Foundation&lt;/a&gt; as an incubating project.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are Kedro datasets?
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://docs.kedro.org/en/stable/data/data_catalog.html" rel="noopener noreferrer"&gt;Kedro datasets&lt;/a&gt; are abstractions for reading and loading data, designed to decouple these operations from your business logic. These datasets manage reading and writing data from a variety of sources, while also ensuring consistency, tracking, and versioning. They allow users to maintain focus on core data processing, leaving data I/O tasks to Kedro.&lt;/p&gt;

&lt;h2&gt;
  
  
   What is Spark Structured Streaming?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html" rel="noopener noreferrer"&gt;Spark Structured Streaming&lt;/a&gt; is built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data, and the Spark SQL engine will run it incrementally and continuously and update the final result as streaming data continues to arrive.&lt;/p&gt;

&lt;h2&gt;
  
  
   Integrating Kedro and Spark Structured Streaming
&lt;/h2&gt;

&lt;p&gt;Kedro is easily extensible for your own workflows and this article explains one of the ways to add new functionality. To enable Kedro to work with Spark Structured Streaming, a team inside &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/labs" rel="noopener noreferrer"&gt;QuantumBlack Labs&lt;/a&gt; developed a new &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.spark.SparkStreamingDataSet.html" rel="noopener noreferrer"&gt;Spark Streaming Dataset&lt;/a&gt;, as the existing Kedro Spark dataset was not compatible with streaming use cases. The new dataset accepts a checkpoint location to avoid data duplication, and it calls &lt;code&gt;.start()&lt;/code&gt; at the end of its &lt;code&gt;_save&lt;/code&gt; method to initiate the stream.&lt;/p&gt;

&lt;h2&gt;
  
  
   Set up a project to integrate Kedro with Spark Structured streaming
&lt;/h2&gt;

&lt;p&gt;The project uses a Kedro dataset to build a structured data pipeline that can read and write data streams with Spark Structured Streaming and process them in real time. You need to add two separate Hooks to the Kedro project to enable it to function as a streaming application.&lt;/p&gt;

&lt;p&gt;Integration involves the following steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create a Kedro project.&lt;/li&gt;
&lt;li&gt;Register the necessary PySpark and streaming related Hooks. &lt;/li&gt;
&lt;li&gt;Configure the custom dataset in the &lt;code&gt;catalog.yml&lt;/code&gt; file, defining the streaming sources and sinks. &lt;/li&gt;
&lt;li&gt;Use Kedro’s new &lt;a href="https://github.com/kedro-org/kedro-plugins/tree/main/kedro-datasets/kedro_datasets/spark" rel="noopener noreferrer"&gt;dataset for Spark Structured Streaming&lt;/a&gt; to store intermediate dataframes generated during the Spark streaming process.&lt;/li&gt;
&lt;/ol&gt;
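As a sketch of step 3, catalog entries for a streaming source and sink might look like the following. This is a hedged example: the dataset names and file paths are illustrative, and streaming-specific options such as `checkpoint` and `output_mode` go through `save_args`.

```yaml
# Illustrative catalog.yml entries; dataset names and paths are examples.
raw_events:
  type: spark.SparkStreamingDataSet
  filepath: data/01_raw/stream/
  file_format: json

processed_events:
  type: spark.SparkStreamingDataSet
  filepath: data/02_intermediate/stream/
  file_format: json
  save_args:
    output_mode: append
    checkpoint: data/04_checkpoint/processed_events
```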

&lt;h3&gt;
  
  
  Create a Kedro project
&lt;/h3&gt;

&lt;p&gt;Ensure you have installed Kedro version 0.18.9 or newer and &lt;code&gt;kedro-datasets&lt;/code&gt; version 1.4.0 or newer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install kedro~=0.18.0 kedro-datasets~=1.4.0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a new Kedro project using the Kedro &lt;code&gt;pyspark&lt;/code&gt; starter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;kedro new --starter=pyspark
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Register the necessary PySpark and streaming related Hooks
&lt;/h3&gt;

&lt;p&gt;To work with multiple streaming nodes, two Hooks are required. The first integrates PySpark: see &lt;a href="https://docs.kedro.org/en/stable/integrations/pyspark_integration.html" rel="noopener noreferrer"&gt;Build a Kedro pipeline with PySpark&lt;/a&gt; for details. The second keeps streaming queries running until they terminate or an exception occurs.&lt;/p&gt;

&lt;p&gt;Add the following code to &lt;code&gt;src/$your_kedro_project_name/hooks.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkConf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pyspark.sql&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro.framework.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hook_impl&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SparkHooks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@hook_impl&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;after_context_created&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Initialises a SparkSession using the config
        defined in project&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s conf folder.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="c1"&gt;# Load the spark configuration in spark.yaml using the config loader
&lt;/span&gt;        &lt;span class="n"&gt;parameters&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config_loader&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark*&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spark*/**&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;spark_conf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SparkConf&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;setAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parameters&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

        &lt;span class="c1"&gt;# Initialise the spark session
&lt;/span&gt;        &lt;span class="n"&gt;spark_session_conf&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;appName&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;_package_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enableHiveSupport&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;config&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;spark_conf&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;_spark_session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;spark_session_conf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;_spark_session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;sparkContext&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLogLevel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SparkStreamsHook&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@hook_impl&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;after_pipeline_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Starts a spark streaming await session
        once the pipeline reaches the last node.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="n"&gt;spark&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SparkSession&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getOrCreate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;spark&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;streams&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;awaitAnyTermination&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Register the Hooks in &lt;code&gt;src/$your_kedro_project_name/settings.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Project settings. There is no need to edit this file unless you want to change values
from the Kedro defaults. For further information, including these default values, see
https://kedro.readthedocs.io/en/stable/kedro_project_setup/settings.html.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.hooks&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SparkHooks&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SparkStreamsHook&lt;/span&gt;

&lt;span class="n"&gt;HOOKS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;SparkHooks&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;SparkStreamsHook&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;span class="c1"&gt;# Instantiated project hooks.
# from streaming.hooks import ProjectHooks
# HOOKS = (ProjectHooks(),)
&lt;/span&gt;
&lt;span class="c1"&gt;# Installed plugins for which to disable hook auto-registration.
# DISABLE_HOOKS_FOR_PLUGINS = ("kedro-viz",)
&lt;/span&gt;
&lt;span class="c1"&gt;# Class that manages storing KedroSession data.
# from kedro.framework.session.shelvestore import ShelveStore
# SESSION_STORE_CLASS = ShelveStore
# Keyword arguments to pass to the `SESSION_STORE_CLASS` constructor.
# SESSION_STORE_ARGS = {
#     "path": "./sessions"
# }
&lt;/span&gt;
&lt;span class="c1"&gt;# Class that manages Kedro's library components.
# from kedro.framework.context import KedroContext
# CONTEXT_CLASS = KedroContext
&lt;/span&gt;
&lt;span class="c1"&gt;# Directory that holds configuration.
# CONF_SOURCE = "conf"
&lt;/span&gt;
&lt;span class="c1"&gt;# Class that manages how configuration is loaded.
# CONFIG_LOADER_CLASS = ConfigLoader
# Keyword arguments to pass to the `CONFIG_LOADER_CLASS` constructor.
# CONFIG_LOADER_ARGS = {
#       "config_patterns": {
#           "spark" : ["spark*/"],
#           "parameters": ["parameters*", "parameters*/**", "**/parameters*"],
#       }
# }
&lt;/span&gt;
&lt;span class="c1"&gt;# Class that manages the Data Catalog.
# from kedro.io import DataCatalog
# DATA_CATALOG_CLASS = DataCatalog
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
   How to set up your Kedro project to read data from streaming sources
&lt;/h2&gt;

&lt;p&gt;Once you have set up your project, you can use the new &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.spark.SparkStreamingDataSet.html" rel="noopener noreferrer"&gt;Kedro Spark streaming dataset&lt;/a&gt;. Configure the data catalog in &lt;code&gt;conf/base/catalog.yml&lt;/code&gt; as follows to read from a streaming JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;raw_json&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spark.SparkStreamingDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/stream/inventory/&lt;/span&gt;
  &lt;span class="na"&gt;file_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;json&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Additional options can be configured via the &lt;code&gt;load_args&lt;/code&gt; key.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;int.new_inventory&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spark.SparkStreamingDataSet&lt;/span&gt;
   &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/02_intermediate/inventory/&lt;/span&gt;
   &lt;span class="na"&gt;file_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;csv&lt;/span&gt;
   &lt;span class="na"&gt;load_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;header&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
   How to set up your Kedro project to write data to streaming sinks
&lt;/h2&gt;

&lt;p&gt;All the additional arguments can be kept under the &lt;code&gt;save_args&lt;/code&gt; key:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;processed.sensor&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
   &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spark.SparkStreamingDataSet&lt;/span&gt;
   &lt;span class="na"&gt;file_format&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;csv&lt;/span&gt;
   &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/03_primary/processed_sensor/&lt;/span&gt;
   &lt;span class="na"&gt;save_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
     &lt;span class="na"&gt;output_mode&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;append&lt;/span&gt;
     &lt;span class="na"&gt;checkpoint&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/04_checkpoint/processed_sensor&lt;/span&gt;
     &lt;span class="na"&gt;header&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note that when you use the Kafka format, the corresponding packages must be added to the &lt;code&gt;spark.yml&lt;/code&gt; configuration as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;spark.jars.packages: org.apache.spark:spark-sql-kafka-0-10_2.12:3.3.1 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
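&lt;p&gt;With the JAR in place, a Kafka source could then be declared in the data catalog. The entry below is a hypothetical sketch rather than an example from a real project: the dataset name, broker address, and topic are placeholders, and it assumes the standard Spark Kafka reader options (&lt;code&gt;kafka.bootstrap.servers&lt;/code&gt;, &lt;code&gt;subscribe&lt;/code&gt;) are passed straight through via &lt;code&gt;load_args&lt;/code&gt;.&lt;/p&gt;

```yaml
# Hypothetical catalog entry for a Kafka streaming source; the broker
# address and topic name are placeholders.
raw_events:
  type: spark.SparkStreamingDataSet
  file_format: kafka
  load_args:
    kafka.bootstrap.servers: localhost:9092
    subscribe: inventory-topic
```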



&lt;h2&gt;
  
  
   Design considerations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Pipeline design
&lt;/h3&gt;

&lt;p&gt;To benefit from Spark's internal query optimisation, we recommend storing any interim datasets as memory datasets.&lt;/p&gt;

&lt;p&gt;All streams start at the same time, so any nodes that have a dependency on another node that writes to a file sink (i.e. the input to that node is the output of another node) will fail on the first run. This is because there are no files in the file sink for the stream to process when it starts.&lt;/p&gt;

&lt;p&gt;We recommend that you either keep intermediate datasets in memory, or split the processing into two pipelines and trigger the first pipeline on its own to build up some initial history before running the second.&lt;/p&gt;

&lt;h3&gt;
  
  
  Feature creation
&lt;/h3&gt;

&lt;p&gt;Be aware that windowing operations only allow windowing on time columns.&lt;/p&gt;

&lt;p&gt;Watermarks must be defined for joins. Only certain types of joins are allowed, and these depend on the file types (stream-stream, stream-static) which makes joining of multiple tables a little complex at times. For further information or advice about join types and watermarking, take a look at the &lt;a href="https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#join-operations" rel="noopener noreferrer"&gt;PySpark documentation&lt;/a&gt; or reach out on the &lt;a href="https://slack.kedro.org" rel="noopener noreferrer"&gt;Kedro Slack workspace&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
   Logging
&lt;/h2&gt;

&lt;p&gt;When it first runs, the Kedro pipeline will download the JAR files required for the Spark Kafka integration. On subsequent runs it won't download them again, but will retrieve them from the location where they were previously stored.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttg7xtyy9c59x6zlnn74.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fttg7xtyy9c59x6zlnn74.png" alt="Spark logging" width="800" height="754"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For each node, the logs for the following will be shown: Loading data, Running nodes, Saving data, Completed x out of y tasks.&lt;/p&gt;

&lt;p&gt;The "Completed" log message doesn't mean that the stream processing in that node has stopped. It means that the Spark plan has been created and, if the output dataset is being saved to a sink, that the stream has started.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsakexf7ctormpsi1ifq2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsakexf7ctormpsi1ifq2.png" alt="Spark logging" width="800" height="250"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once Kedro has run through all the nodes and the full Spark execution plan has been created, you'll see &lt;code&gt;INFO Pipeline execution completed successfully&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;This doesn't mean the stream processing has stopped, because the post-run Hook keeps the Spark session alive. As new data comes in, new Spark logs will be shown, even after the "Pipeline execution completed" log.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiocc9hj7jh8o05xe3ji.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgiocc9hj7jh8o05xe3ji.png" alt="Spark logging" width="800" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If there is an error in the input data, the Spark error logs will come through and Kedro will shut down the SparkContext and all the streams within it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwultqq2h553hem0dbfa.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzwultqq2h553hem0dbfa.png" alt="Spark logging" width="800" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
   In summary
&lt;/h2&gt;

&lt;p&gt;In this article, we explained how to take advantage of one of the ways to extend Kedro by building a new dataset to create streaming pipelines. We created a new Kedro project using the Kedro &lt;code&gt;pyspark&lt;/code&gt; starter and illustrated how to work with Hooks, adding them to the Kedro project so that it can function as a streaming application. The new dataset was then straightforward to configure through the Kedro data catalog, which defines the streaming sources and sinks.&lt;/p&gt;

&lt;p&gt;There are currently some limitations: the dataset is not yet ready to use with a service broker such as Kafka out of the box, because an additional JAR package is required.&lt;/p&gt;

&lt;p&gt;If you want to find out more about the ways to extend Kedro, take a look at the &lt;a href="https://docs.kedro.org/en/stable/extend_kedro/index.html" rel="noopener noreferrer"&gt;advanced Kedro documentation&lt;/a&gt; for more about Kedro plugins, datasets, and Hooks.&lt;/p&gt;

&lt;h2&gt;
  
  
   Contributors
&lt;/h2&gt;

&lt;p&gt;This post was created by &lt;a href="https://www.linkedin.com/in/tingting-w-93b32516a/" rel="noopener noreferrer"&gt;Tingting Wan&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/chivo369/" rel="noopener noreferrer"&gt;Tom Kurian&lt;/a&gt;, and &lt;a href="https://uk.linkedin.com/in/harismichailidis" rel="noopener noreferrer"&gt;Haris Michailidis&lt;/a&gt;, who are all Data Engineers in the London office of &lt;a href="https://www.mckinsey.com/capabilities/quantumblack/how-we-help-clients" rel="noopener noreferrer"&gt;QuantumBlack, AI by McKinsey&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>kedro</category>
      <category>spark</category>
      <category>streaming</category>
    </item>
    <item>
      <title>Get up to speed: how to build a custom Kedro runner</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Thu, 22 Jun 2023 09:46:25 +0000</pubDate>
      <link>https://forem.com/kedro/get-up-to-speed-how-to-build-a-custom-kedro-runner-2dj3</link>
      <guid>https://forem.com/kedro/get-up-to-speed-how-to-build-a-custom-kedro-runner-2dj3</guid>
      <description>&lt;p&gt;In Kedro, &lt;a href="https://docs.kedro.org/en/stable/nodes_and_pipelines/run_a_pipeline.html" rel="noopener noreferrer"&gt;runners are the execution mechanism&lt;/a&gt; for data science and machine learning pipelines. The default behaviour of all of Kedro’s built-in runners is to halt pipeline execution if an error occurs that is significant enough to cause any of the nodes to fail, as shown in the following diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzzcpeuxofk4qta7mzeg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxzzcpeuxofk4qta7mzeg.png" alt="Sequential runner when a node fails" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram, the entire run aborts when it encounters a node that it cannot run, terminating all other sections or branches of the pipeline, even those that it could have run.&lt;/p&gt;

&lt;p&gt;The custom runner described in this article was specifically developed for a top player in the mining industry that uses Kedro to construct data pipelines for BI dashboards essential for operational excellence.&lt;/p&gt;

&lt;p&gt;The client’s pipeline is designed to be resilient towards node failures. Certain nodes operate independently of each other, and especially during the development and exploration stages, the failure of a single node does not necessitate the termination of the entire Kedro run. The desired behaviour is as shown below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytrrj37ihr5o3u8f45uv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fytrrj37ihr5o3u8f45uv.png" alt="Custom runner that does not halt all nodes when a failure is encountered" width="800" height="467"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the diagram, the runner meets a node that cannot run but finds other sections or branches that it can execute.&lt;/p&gt;

&lt;p&gt;The client relies on Kedro to execute a substantial pipeline that retrieves data from various sources. Some of the input datasets are manually created, which introduces the possibility of errors if entries are mistyped or omitted. By allowing the pipeline to continue and bypass nodes as they encounter failures, it becomes possible to compile a comprehensive list of data issues during a single run and address them collectively.&lt;/p&gt;

&lt;p&gt;In comparison, the default Kedro approach is considerably more time-consuming as it pauses the pipeline upon the failure of a single node, leading to a repetitive cycle of fixing one issue, rerunning the pipeline to encounter the next issue, fixing that, and so on.&lt;/p&gt;

&lt;p&gt;Executing all feasible nodes within the pipeline provides an additional advantage. In cases where no data issues arise, completing the pipeline allows the available metrics to be displayed on a BI dashboard, ensuring service continuity. For instance, if only one data source is corrupted, the BI metrics that depend on that specific data need to be withheld, but all others can be showcased. In contrast, the default Kedro behaviour would render all metrics unavailable until the single dataset issue is resolved.&lt;/p&gt;

&lt;h2&gt;
  
  
  The solution: a customised Kedro runner
&lt;/h2&gt;

&lt;p&gt;As an open-source project, Kedro enables you to define a custom runner for your project. The team took the open-source code for Kedro’s sequential runner and extended it, since their use case didn’t need any parallelisation.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“One of the reasons we selected Kedro is that it is open source and highly extensible. We knew from the outset that we could make our own customisations”.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The team created a soft-fail runner to transform errors into warnings, allowing the pipeline to continue executing to the best of its ability while providing a report of any nodes that failed, so that data issues can be addressed. At that point, the pipeline run can be finalised by executing only those missing nodes separately, using appropriate Kedro syntax.&lt;/p&gt;

&lt;p&gt;The resulting &lt;code&gt;SoftFailRunner&lt;/code&gt; is an implementation of &lt;a href="https://docs.kedro.org/en/stable/kedro.runner.AbstractRunner.html" rel="noopener noreferrer"&gt;&lt;code&gt;AbstractRunner&lt;/code&gt;&lt;/a&gt; that runs a pipeline sequentially using a topological sort of provided nodes. Unlike the built-in &lt;a href="https://docs.kedro.org/en/stable/kedro.runner.SequentialRunner.html" rel="noopener noreferrer"&gt;&lt;code&gt;SequentialRunner&lt;/code&gt;&lt;/a&gt;, this runner does not terminate the pipeline but runs any remaining nodes as long as their dependencies are fulfilled. The &lt;code&gt;SoftFailRunner&lt;/code&gt; implementation adds two arguments: &lt;code&gt;--from-nodes&lt;/code&gt; and &lt;code&gt;--runner&lt;/code&gt;. The essential code for the &lt;code&gt;SoftFailRunner&lt;/code&gt; is shown below and the full code &lt;a href="https://github.com/kedro-org/kedro/blob/feat/softfail-runner/kedro/runner/softfail_runner.py" rel="noopener noreferrer"&gt;can be found on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxn3477gb3px8sk16fhf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftxn3477gb3px8sk16fhf.png" alt="Code for the soft-fail runner" width="800" height="979"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The logic behind the runner is as follows:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;A new &lt;code&gt;skip_nodes&lt;/code&gt; variable keeps track of which nodes should be skipped.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Every time a node is about to run, the &lt;code&gt;skip_nodes&lt;/code&gt; list is checked.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;When a node fails, all of its descendants are added to &lt;code&gt;skip_nodes&lt;/code&gt; using &lt;a href="https://en.wikipedia.org/wiki/Breadth-first_search" rel="noopener noreferrer"&gt;breadth-first search (BFS)&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
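&lt;p&gt;The descendant collection in step 3 can be sketched in plain Python. This is an illustrative stand-in, not the actual &lt;code&gt;SoftFailRunner&lt;/code&gt; code: the toy graph and the &lt;code&gt;downstream_map&lt;/code&gt; name are made up for the example.&lt;/p&gt;

```python
from collections import deque

def collect_skip_nodes(failed_node, downstream_map):
    """Return every descendant of failed_node via breadth-first search.

    downstream_map maps each node name to the list of nodes that consume
    its outputs, i.e. its direct downstream dependencies.
    """
    skip_nodes = set()
    queue = deque([failed_node])
    while queue:
        node = queue.popleft()
        for child in downstream_map.get(node, []):
            if child not in skip_nodes:
                skip_nodes.add(child)
                queue.append(child)
    return skip_nodes

# Toy graph: "clean" feeds "features", which feeds "train" and "report".
downstream_map = {
    "clean": ["features"],
    "features": ["train", "report"],
    "train": [],
    "report": [],
}

# If "clean" fails, everything downstream of it is skipped.
print(sorted(collect_skip_nodes("clean", downstream_map)))  # ['features', 'report', 'train']
```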

&lt;h2&gt;
  
  
   In summary
&lt;/h2&gt;

&lt;p&gt;The customised Kedro runner was straightforward to create and a satisfactory solution to enable maximum efficiency when handling this particular pipeline and dataset.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“These results could certainly be achieved with an orchestrator, but using an open-source project with customisation is a quick win for delivering business value”.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>kedro</category>
      <category>python</category>
      <category>datascience</category>
      <category>mlops</category>
    </item>
    <item>
      <title>Collaborative experiment tracking in Kedro-Viz</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Fri, 02 Jun 2023 14:09:58 +0000</pubDate>
      <link>https://forem.com/kedro/collaborative-experiment-tracking-in-kedro-viz-3697</link>
      <guid>https://forem.com/kedro/collaborative-experiment-tracking-in-kedro-viz-3697</guid>
      <description>&lt;p&gt;When training a model in machine learning, the goal is to determine the optimal configuration of attributes such as hyper-parameters, metrics, and training data. The process of identifying the best combinations requires running a lot of experiments and comparing them. As I mentioned in my &lt;a href="https://kedro.org/blog/experiment-tracking-with-kedro" rel="noopener noreferrer"&gt;previous article&lt;/a&gt;, experiment tracking is a way to record all the metadata you need to compare machine-learning experiments and recreate them for your project.&lt;/p&gt;

&lt;h2&gt;
  
  
   What is Kedro-Viz?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://github.com/kedro-org/kedro-viz" rel="noopener noreferrer"&gt;Kedro-Viz&lt;/a&gt; is an interactive development tool for building and visualising data science pipelines with &lt;a href="https://github.com/kedro-org/kedro" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt;. It enables you to monitor the status of your ML project, present it to stakeholders, and easily bring new team members up to speed. You can try it out using our &lt;a href="https://demo.kedro.org/" rel="noopener noreferrer"&gt;hosted demo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;“&lt;em&gt;There's no better method to give an overview of a pipeline's structure in such an engaging, interactive, and thorough way. Our asset's pipelines are very complex, but are structured with modular pipelines, so being able to show the overall structure at the modular pipeline level, before jumping into each individual pipeline helps prevent the audience from getting overwhelmed by the number of nodes and datasets shown&lt;/em&gt;”.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Senior Data Scientist at Consultancy&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What is experiment tracking in Kedro-Viz?
&lt;/h2&gt;

&lt;p&gt;Experiment tracking on Kedro-Viz enables users to select, plot, and compare how multiple metrics change over time, and identify the best-performing ML experiment, with no additional dependencies to manage or infrastructure needed.&lt;/p&gt;

&lt;p&gt;The video below demonstrates experiment tracking on Kedro-Viz:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/odXhTEa50PU"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;During a project with multiple team members, you could end up with a scenario where the results of your experiments are spread across many machines because people are iterating on their individual computers. This makes the tracking process difficult to manage at a team level, as suggested by this feedback from our users.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"You might train one model locally on your computer. You might train another one in the cloud. Joe might run another pipeline or another experiment. Having all of those experiments in one place as a single source of truth is really powerful.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"If we could write our metrics files to an S3 bucket and then run experiment tracking pointing at that S3 bucket, that simplifies our workflow in many different ways and would be really helpful. And it would make Kedro experiment tracking just as easy, if not easier, than MLFlow for us."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"Can you use an existing database so that we can keep track of runs happening in different places?&lt;/em&gt;"&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We have found a way to address this pain point and enable you to collaborate more easily. We are excited to announce that we've &lt;a href="https://www.linen.dev/s/kedro/t/12096327/kedro-kedro-kedro-kedro-kedro-kedro-viz-6-2-0-is-out-kedro-k#ba733439-8aac-46f5-9c37-d015287835cc" rel="noopener noreferrer"&gt;launched collaborative experiment tracking&lt;/a&gt; in Kedro-Viz 6.2.0. The new feature enables a team of users to log their experiments to a shared cloud storage service, and to view and compare each other's experiments in their own experiment tracking view. This simplifies their workflow, provides a single ‘source of truth’, and encourages multi-user collaboration.&lt;/p&gt;

&lt;p&gt;We are releasing this feature in stages across different versions, and the first phase is &lt;a href="https://github.com/kedro-org/kedro-viz/releases" rel="noopener noreferrer"&gt;Kedro-Viz 6.2.0&lt;/a&gt;. This version enables users to read experiments of other users that are stored on Amazon S3 or similar storage solutions on other cloud providers, as long as they are supported by &lt;a href="https://filesystem-spec.readthedocs.io/en/latest/" rel="noopener noreferrer"&gt;fsspec&lt;/a&gt;. Future versions of collaborative experiment tracking aim to improve the user experience through automatic reloading and optimisation by caching.&lt;/p&gt;

&lt;h2&gt;
  
  
   Get started with collaborative experiment tracking
&lt;/h2&gt;

&lt;p&gt;Follow these steps to set up collaborative experiment tracking in Kedro-Viz:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Update Kedro-Viz
&lt;/h3&gt;

&lt;p&gt;Ensure you have the latest version of Kedro-Viz (6.2.0 or later).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;kedro-viz &lt;span class="nt"&gt;--upgrade&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Set up cloud storage
&lt;/h3&gt;

&lt;p&gt;Kedro-Viz uses &lt;a href="https://filesystem-spec.readthedocs.io/en/latest/api.html?highlight=s3#other-known-implementations" rel="noopener noreferrer"&gt;fsspec&lt;/a&gt; to save and read &lt;code&gt;session_store&lt;/code&gt; files from a variety of data stores, including local file systems, network file systems, cloud object stores (e.g., Amazon S3, Azure Blob Storage, Google Cloud Storage), and HDFS.&lt;/p&gt;

&lt;p&gt;Set up a central cloud storage repository, such as an Amazon S3 bucket, to store all your team's experiments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: Configure your Kedro project
&lt;/h3&gt;

&lt;p&gt;Locate the &lt;code&gt;settings.py&lt;/code&gt; file in your Kedro project directory and add the following:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro_viz.integrations.kedro.sqlite_store&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SQLiteStore&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;SESSION_STORE_CLASS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SQLiteStore&lt;/span&gt;
&lt;span class="n"&gt;SESSION_STORE_ARGS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remote_path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://my-bucket-name/path/to/experiments&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
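The `parents[2]` arithmetic in the snippet above assumes the default Kedro project layout, where `settings.py` lives at `src/<package_name>/settings.py`; climbing two directory levels from the file reaches the project root, so the session store lands in `<project>/data`. A quick illustration with a hypothetical path:

```python
# Hypothetical project layout, to show what Path(__file__).parents[2] points at.
from pathlib import Path

settings_file = Path("my-project/src/my_package/settings.py")

# parents[0] -> my-project/src/my_package
# parents[1] -> my-project/src
# parents[2] -> my-project  (the project root)
store_path = settings_file.parents[2] / "data"
assert store_path == Path("my-project/data")
```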



&lt;h3&gt;
  
  
  Step 4: Set up a unique username
&lt;/h3&gt;

&lt;p&gt;Kedro-Viz saves your experiments as SQLite database files on the central cloud storage. To ensure that all users have unique filenames, set the &lt;code&gt;KEDRO_SQLITE_STORE_USERNAME&lt;/code&gt; environment variable. If it is not specified, Kedro-Viz defaults to your computer username.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;KEDRO_SQLITE_STORE_USERNAME &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_unique__username"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
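The fallback behaviour described above can be pictured as reading the variable with a default. This is a sketch, not Kedro-Viz's actual code; the function name is made up for illustration:

```python
import getpass
import os

def resolve_store_username():
    # Hypothetical helper: the environment variable wins; otherwise fall back
    # to the OS username, mirroring the default behaviour described above.
    return os.environ.get("KEDRO_SQLITE_STORE_USERNAME", getpass.getuser())

os.environ["KEDRO_SQLITE_STORE_USERNAME"] = "rashida"
assert resolve_store_username() == "rashida"
```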



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 5: Configure cloud storage credentials&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;From Kedro-Viz version 6.2, the only way to set up credentials for accessing your cloud storage is through environment variables, as shown below for &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-envvars.html" rel="noopener noreferrer"&gt;Amazon S3 cloud storage&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_ACCESS_KEY_ID&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_access_key_id"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_SECRET_ACCESS_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_secret_access_key"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;AWS_REGION&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"your_aws_region"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In the screenshots below we show an example of the session store and Kedro-Viz output for three team members (Huong, Tynan, and Rashida):&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvt9sgfuto8gzhbf11h6v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvt9sgfuto8gzhbf11h6v.png" alt="Session store showing the 3 objects for Huong, Tynan, and Rashida" width="800" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Session store showing the 3 objects for Huong, Tynan, and Rashida.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjuydxwja7hlf4r61lwe0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjuydxwja7hlf4r61lwe0.png" alt="Three separate Kedro-Viz runs by Huong, Tynan, and Rashida" width="800" height="404"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Three separate Kedro-Viz runs by Huong, Tynan, and Rashida.&lt;/p&gt;

&lt;p&gt;This tutorial offers a very swift run-through of the configuration process. For further information, check out the &lt;a href="https://docs.kedro.org/en/stable/experiment_tracking/index.html" rel="noopener noreferrer"&gt;documentation on the experiment tracking feature&lt;/a&gt; and keep up-to-date with the latest news about Kedro and Kedro-Viz on our &lt;a href="https://slack.kedro.org" rel="noopener noreferrer"&gt;Slack workspace&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Many thanks to the Kedro-Viz team especially &lt;a href="https://github.com/rashidakanchwala" rel="noopener noreferrer"&gt;@Rashida Kanchwala&lt;/a&gt; for contributing to this post.&lt;/p&gt;

</description>
      <category>kedro</category>
      <category>machinelearning</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>A Polars exploration into Kedro</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Wed, 17 May 2023 14:50:58 +0000</pubDate>
      <link>https://forem.com/kedro/a-polars-exploration-into-kedro-3cab</link>
      <guid>https://forem.com/kedro/a-polars-exploration-into-kedro-3cab</guid>
      <description>&lt;p&gt;One year ago I travelled to Lithuania for the first time to present at PyCon/PyData Lithuania, and I had a great time there. The topic of my talk was an evaluation of some alternative dataframe libraries, including Polars, the one that I ended up enjoying the most. &lt;/p&gt;

&lt;p&gt;I enjoyed it so much that this week I’m in Vilnius again, and I’ll be delivering a workshop at PyCon Lithuania 2023 called &lt;a href="https://pycon.lt/2023/activities/talks/KAJGPU" rel="noopener noreferrer"&gt;“Analyze your data at the speed of light with Polars and Kedro”&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this blog post you will learn how using &lt;a href="https://www.pola.rs/" rel="noopener noreferrer"&gt;Polars&lt;/a&gt; in Kedro can make your data pipelines much faster, what the current status of Polars in Kedro is, and what to expect in the near future. If this is the first time you’ve heard about Polars, I have included a short introduction at the beginning.&lt;/p&gt;

&lt;p&gt;Let’s dive in!&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the Polars library?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.pola.rs/" rel="noopener noreferrer"&gt;Polars&lt;/a&gt; is an open-source library for Python, Rust, and NodeJS that provides in-memory dataframes, out-of-core processing capabilities, and more. It is based on the Rust implementation of the &lt;a href="https://arrow.apache.org/" rel="noopener noreferrer"&gt;Apache Arrow&lt;/a&gt; columnar data format (you can read more about Arrow on my earlier blog post &lt;a href="https://dev.to/astrojuanlu/demystifying-apache-arrow-5b0a/"&gt;“Demystifying Apache Arrow”&lt;/a&gt;), and it is optimised to be blazing fast.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9e8a9ozp51rfaqed0puj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9e8a9ozp51rfaqed0puj.png" alt="Snippet of Polars code" width="800" height="403"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The interesting thing about Polars is that it does not try to be a drop-in replacement for pandas, like &lt;a href="https://www.dask.org/" rel="noopener noreferrer"&gt;Dask&lt;/a&gt;, &lt;a href="https://rapids.ai/" rel="noopener noreferrer"&gt;cuDF&lt;/a&gt;, or &lt;a href="https://modin.readthedocs.io/" rel="noopener noreferrer"&gt;Modin&lt;/a&gt;, and instead has its own expressive API. Despite being a young project, it quickly got popular thanks to its easy installation process and its “lightning fast” performance.&lt;/p&gt;

&lt;p&gt;I started experimenting with Polars one year ago, and it has now become my go-to data manipulation library. I gave several talks about it, for example &lt;a href="https://youtu.be/LGAHTp4DYZY" rel="noopener noreferrer"&gt;at PyData NYC&lt;/a&gt;, and the room was full.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do Polars and Kedro get used together?
&lt;/h2&gt;

&lt;p&gt;If you want to learn more about Kedro, you can watch a video introduction on &lt;a href="https://www.youtube.com/@kedro-python" rel="noopener noreferrer"&gt;our YouTube channel&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/qClSGY6B0r0"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;Traditionally Kedro has favoured &lt;a href="https://pandas.pydata.org/" rel="noopener noreferrer"&gt;pandas&lt;/a&gt; as a dataframe library because of its ubiquity and popularity. This means that, for example, to read a CSV file, you would add a corresponding entry to &lt;a href="https://docs.kedro.org/en/stable/data/data_catalog.html" rel="noopener noreferrer"&gt;the catalog&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openrepair-0_3-categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pandas.CSVDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_Product_Categories.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And then, you would use that dataset as input for &lt;a href="https://docs.kedro.org/en/stable/nodes_and_pipelines/nodes.html" rel="noopener noreferrer"&gt;your node functions&lt;/a&gt;, which would, in turn, receive pandas &lt;code&gt;DataFrame&lt;/code&gt; objects:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;join_events_categories&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;(This is just one of the formats supported by Kedro datasets of course! You can also load Parquet, GeoJSON, images… have a look at &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.html" rel="noopener noreferrer"&gt;the &lt;code&gt;kedro-datasets&lt;/code&gt; reference&lt;/a&gt; for a list of datasets maintained by the core team, or &lt;a href="https://github.com/topics/kedro-plugin" rel="noopener noreferrer"&gt;the &lt;code&gt;#kedro-plugin&lt;/code&gt; topic on GitHub&lt;/a&gt; for some contributed by the community!)&lt;/p&gt;

&lt;p&gt;The idea of this blog post is to teach you how you can use Polars instead of pandas for your catalog entries, which in turn allows you to write all your data transformation pipelines using Polars dataframes. For that, I crafted some examples that use &lt;a href="https://openrepair.org/open-data/downloads/" rel="noopener noreferrer"&gt;the Open Repair Alliance dataset&lt;/a&gt;, containing more than 80 000 records of repair events across Europe.&lt;/p&gt;

&lt;p&gt;And if you’re ready to start, let’s go!&lt;/p&gt;

&lt;h2&gt;
  
  
  Get started with Polars for Kedro
&lt;/h2&gt;

&lt;p&gt;First of all, you will need to add &lt;code&gt;kedro-datasets[polars.CSVDataSet]&lt;/code&gt; to your requirements. At the time of writing (May 2023), the code below requires development versions of both &lt;code&gt;kedro&lt;/code&gt; and &lt;code&gt;kedro-datasets&lt;/code&gt;, which you can declare on your &lt;code&gt;requirements.txt&lt;/code&gt; or &lt;code&gt;pyproject.toml&lt;/code&gt; as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# requirements.txt

kedro @ git+https://github.com/kedro-org/kedro@3ea7231
kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight toml"&gt;&lt;code&gt;&lt;span class="c"&gt;# pyproject.toml&lt;/span&gt;

&lt;span class="nn"&gt;[project]&lt;/span&gt;
&lt;span class="py"&gt;dependencies&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="s"&gt;"kedro @ git+https://github.com/kedro-org/kedro@3ea7231"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="s"&gt;"kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you are using the legacy &lt;code&gt;setup.py&lt;/code&gt; files, the syntax is very similar:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;setup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;requires&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kedro @ git+https://github.com/kedro-org/kedro@3ea7231&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kedro-datasets[pandas.CSVDataSet,polars.CSVDataSet] @ git+https://github.com/kedro-org/kedro-plugins@3b42fae#subdirectory=kedro-datasets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After you install these dependencies, you can start using the &lt;code&gt;polars.CSVDataSet&lt;/code&gt; by using the appropriate &lt;code&gt;type&lt;/code&gt; in your catalog entries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openrepair-0_3-categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;polars.CSVDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_Product_Categories.csv&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;and that’s it!&lt;/p&gt;
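The node functions change symmetrically: they keep the same shape as the pandas version shown earlier, but now receive and return Polars frames. A sketch (string annotations are used here so the snippet runs even without `polars` installed):

```python
# Same node as before, but typed against Polars instead of pandas.
# Annotations are strings so this sketch has no hard dependency on polars.
def join_events_categories(
    events: "pl.DataFrame",
    categories: "pl.DataFrame",
) -> "pl.DataFrame":
    ...
```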

&lt;h2&gt;
  
  
  Reading real world CSV files with &lt;code&gt;polars.CSVDataSet&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;It turns out that reading CSV files is not always that easy. The good news is that you can use the &lt;code&gt;load_args&lt;/code&gt; parameter of the catalog entry to pass extra options to the &lt;code&gt;polars.CSVDataSet&lt;/code&gt;, which mirror the function arguments of &lt;code&gt;polars.read_csv&lt;/code&gt;. For example, if you want to attempt parsing the date columns in the CSV, you can set the &lt;code&gt;try_parse_dates&lt;/code&gt; option to &lt;code&gt;true&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openrepair-0_3-categories&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;polars.CSVDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_Product_Categories.csv&lt;/span&gt;
  &lt;span class="na"&gt;load_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Doesn't make much sense in this case,&lt;/span&gt;
    &lt;span class="c1"&gt;# but serves for demonstration purposes&lt;/span&gt;
    &lt;span class="na"&gt;try_parse_dates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some of these parameters are required to be Python objects: for example, &lt;code&gt;polars.read_csv&lt;/code&gt; takes an optional &lt;code&gt;dtypes&lt;/code&gt; parameter that can be used to specify the dtypes of the columns, as follows:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;dtypes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;product_age&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Float64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;group_identifier&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Utf8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Kedro catalog files only support primitive types. But fear not! You can use more sophisticated configuration loaders in Kedro that allow you to tweak how such files are parsed and loaded.&lt;/p&gt;

&lt;p&gt;To pass the appropriate &lt;code&gt;dtypes&lt;/code&gt; to read this CSV file, you can use the &lt;code&gt;TemplatedConfigLoader&lt;/code&gt;, or alternatively &lt;a href="https://docs.kedro.org/en/stable/configuration/advanced_configuration.html#omegaconfigloader" rel="noopener noreferrer"&gt;the shiny new &lt;code&gt;OmegaConfigLoader&lt;/code&gt;&lt;/a&gt; with a custom &lt;code&gt;omegaconf&lt;/code&gt; resolver. This resolver takes care of parsing the strings in the YAML catalog and transforming them into the objects Polars needs. Place this code in your &lt;code&gt;settings.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# settings.py
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;polars&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pl&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;omegaconf&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OmegaConf&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;kedro.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OmegaConfigLoader&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;OmegaConf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;has_resolver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;OmegaConf&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;register_new_resolver&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;polars&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pl&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;attr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;span class="n"&gt;CONFIG_LOADER_CLASS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OmegaConfigLoader&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now you can use the special OmegaConf syntax in the catalog:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;openrepair-0_3-events-raw&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;polars.CSVDataSet&lt;/span&gt;
  &lt;span class="na"&gt;filepath&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;data/01_raw/OpenRepairData_v0.3_aggregate_202210.csv&lt;/span&gt;
  &lt;span class="na"&gt;load_args&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;dtypes&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Notice the OmegaConf resolver syntax!&lt;/span&gt;
      &lt;span class="na"&gt;product_age&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${polars:Float64}&lt;/span&gt;
      &lt;span class="na"&gt;group_identifier&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${polars:Utf8}&lt;/span&gt;
    &lt;span class="na"&gt;try_parse_dates&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can access Polars data types with ease from the catalog!&lt;/p&gt;
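Under the hood, the resolver registered in `settings.py` is nothing Polars-specific: it simply looks up an attribute on a module by name. A dependency-free sketch of the same mechanism, with the standard library's `math` module standing in for `polars`:

```python
# The "${polars:Float64}" syntax resolves by calling getattr(pl, "Float64").
# Here math stands in for polars so the sketch runs without extra dependencies.
import math

def attribute_resolver(module):
    # Returns a resolver mapping a name like "Float64" to module.Float64,
    # analogous to the lambda registered with OmegaConf above.
    return lambda attr: getattr(module, attr)

resolve = attribute_resolver(math)
assert resolve("pi") == math.pi
```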

&lt;h2&gt;
  
  
  Future plans for Polars integration in Kedro
&lt;/h2&gt;

&lt;p&gt;This all looks very promising, but it’s only the tip of the iceberg. First of all, these changes need to land in stable versions of &lt;code&gt;kedro&lt;/code&gt; and &lt;code&gt;kedro-datasets&lt;/code&gt;. More importantly, we are working on &lt;a href="https://github.com/kedro-org/kedro-plugins/pull/170" rel="noopener noreferrer"&gt;a generic Polars dataset&lt;/a&gt; that will be able to read other file formats, for example Parquet, which is faster, more compact, and easier to use.&lt;/p&gt;

&lt;p&gt;Polars makes me so excited about the future of data manipulation in Python, and I hope that all Kedro users are able to leverage this amazing project on their data pipelines very soon!&lt;/p&gt;

</description>
      <category>kedro</category>
      <category>python</category>
      <category>polars</category>
      <category>datascience</category>
    </item>
    <item>
      <title>In the Pipeline: May 2023</title>
      <dc:creator>Juan Luis Cano Rodríguez</dc:creator>
      <pubDate>Mon, 08 May 2023 15:43:50 +0000</pubDate>
      <link>https://forem.com/kedro/in-the-pipeline-may-2023-32cb</link>
      <guid>https://forem.com/kedro/in-the-pipeline-may-2023-32cb</guid>
      <description>&lt;p&gt;We're launching a new monthly blog post that'll keep you updated on all the exciting things happening in the Kedro community. From the latest Kedro news to upcoming events and interesting topics, “In the Pipeline” has got you covered.&lt;/p&gt;

&lt;p&gt;This month: a new pair of releases, Technical Steering Committee news, upcoming events, and our top picks from recent articles and podcasts.&lt;/p&gt;

&lt;h2&gt;
  
  
  The latest releases of Kedro and Kedro-Viz are here
&lt;/h2&gt;

&lt;p&gt;Earlier this week, &lt;a href="https://kedro-org.slack.com/archives/C03RKAQ0MGQ/p1683045212017599" rel="noopener noreferrer"&gt;Merel announced on Slack&lt;/a&gt; that &lt;strong&gt;Kedro&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;0.18.8&lt;/code&gt;&lt;/strong&gt; has been released.&lt;/p&gt;

&lt;p&gt;Here are the headlines. You can see the &lt;a href="https://github.com/kedro-org/kedro/releases/tag/0.18.8" rel="noopener noreferrer"&gt;full set of release notes on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;🚀 Major features and changes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Added the &lt;code&gt;KEDRO_LOGGING_CONFIG&lt;/code&gt; environment variable, which can be used to configure logging from the start of the Kedro process.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Removed the logs folder from the Kedro new project template. File-based logging remains, but only at level &lt;code&gt;INFO&lt;/code&gt; and above, and logs now go to the project root instead.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A set of bug fixes and other changes 🪲&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;✍️ &lt;strong&gt;Documentation changes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Improvements to Sphinx toolchain including incrementing to use a newer version.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improvements to documentation on visualising Kedro projects on Databricks, and additional documentation about the development workflow for Kedro projects on Databricks.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Improvements to documentation about configuration.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Updated table of contents for documentation to reduce scrolling.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;And more!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Note that using &lt;code&gt;kedro.extras.datasets&lt;/code&gt; has been officially deprecated, and will be removed from Kedro in 0.19. Installing &lt;code&gt;kedro_datasets&lt;/code&gt; is now the &lt;a href="https://docs.kedro.org/en/stable/kedro_datasets.html" rel="noopener noreferrer"&gt;preferred approach&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We would like to thank our community contributors &lt;a href="https://github.com/MaximeSteinmetz" rel="noopener noreferrer"&gt;Maxime Steinmetz&lt;/a&gt;&lt;strong&gt;,&lt;/strong&gt; &lt;a href="https://github.com/BrianCechmanek" rel="noopener noreferrer"&gt;Brian Cechmanek&lt;/a&gt;, and &lt;a href="https://github.com/MattRossetti" rel="noopener noreferrer"&gt;Matt Rossetti&lt;/a&gt; for their input to this release.&lt;/p&gt;




&lt;p&gt;In the last week of April, &lt;a href="https://kedro-org.slack.com/archives/C03RKAQ0MGQ/p1682700107674719" rel="noopener noreferrer"&gt;Nero announced&lt;/a&gt; the release of &lt;strong&gt;Kedro-Viz&lt;/strong&gt; &lt;strong&gt;&lt;code&gt;6.1.0&lt;/code&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Kedro-Viz is an interactive development tool for building and visualising data science pipelines with &lt;a href="https://github.com/kedro-org/kedro" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt;. It enables you to monitor the status of your ML project, present it to stakeholders, and smoothly bring new team members onboard. It also offers &lt;a href="https://kedro.org/blog/experiment-tracking-with-kedro" rel="noopener noreferrer"&gt;experiment tracking&lt;/a&gt;, and the ability to preview code and datasets.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How do I get Kedro-Viz?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Python: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install kedro-viz==6.1.0&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;React: &lt;/p&gt;

&lt;p&gt;&lt;code&gt;npm install @quantumblack/kedro-viz@latest&lt;/code&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;🚀 What can you expect in this release?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Experiment tracking updates allowing users to filter (show/hide) metrics in the time series &amp;amp; parallel coordinates metrics plots.📈&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A set of bug fixes and other changes 🪲&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can see the full &lt;a href="https://github.com/kedro-org/kedro-viz/releases/tag/v6.1.0" rel="noopener noreferrer"&gt;Kedro-Viz release notes on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;🔮 &lt;strong&gt;What's coming next?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Collaboration features within Kedro-Viz.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create your own reports.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Technical Steering Committee news
&lt;/h2&gt;

&lt;p&gt;We’ve recently welcomed &lt;a href="https://www.linkedin.com/in/marrrcin/" rel="noopener noreferrer"&gt;@marrrcin&lt;/a&gt; to the Kedro Technical Steering Committee! You can read more &lt;a href="https://kedro.org/blog/news-from-the-kedro-technical-steering-committee" rel="noopener noreferrer"&gt;about this fantastic news on our blog&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;We’d also like to share some numbers that we collected recently:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;GitHub Stars on &lt;a href="https://github.com/kedro-org/kedro" rel="noopener noreferrer"&gt;https://github.com/kedro-org/kedro&lt;/a&gt;: 8.3K&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Monthly Downloads: 467,000&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Upcoming events
&lt;/h2&gt;

&lt;h3&gt;
  
  
  4th May 2023
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://kedro.org/" rel="noopener noreferrer"&gt;Kedro&lt;/a&gt; team is organising a 2-hour virtual training session on Thursday, May 4th, 2023 that is open to everyone. The session introduces you to Kedro and explains how to turn a Jupyter notebooks into reusable Python libraries. You’ll learn the benefits of Kedro pipelines and how to visualise them using Kedro-Viz in an interactive session with plenty of Q&amp;amp;A.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://events.quantumblack.com/kedro-intro-23-05" rel="noopener noreferrer"&gt;Register now&lt;/a&gt; to reserve your slot on 4th May 2023 at 4:00pm–6:00pm CEST (which is 10:00am–12:00pm EDT).&lt;/p&gt;

&lt;h3&gt;
  
  
  18th May 2023
&lt;/h3&gt;

&lt;p&gt;Juan Luis, Kedro’s Developer Advocate, is giving a talk on &lt;a href="https://pycon.lt/2023/activities/talks/KAJGPU" rel="noopener noreferrer"&gt;18th May at PyCon Lithuania&lt;/a&gt;. His talk is titled “Analyze your data at the speed of light with Polars and Kedro” and presents how to combine Kedro with Polars, a new dataframe library backed by Arrow and Rust, for lightning fast data manipulation and exploratory data analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  In the pipeline: top picks from the Kedro team
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Towards Data Science recently published a pair of nice posts by João Pedro about &lt;a href="https://towardsdatascience.com/data-pipeline-with-airflow-and-aws-tools-s3-lambda-glue-18585d269761" rel="noopener noreferrer"&gt;writing a data pipeline with Airflow and AWS Tools (S3, Lambda &amp;amp; Glue)&lt;/a&gt; and &lt;a href="https://towardsdatascience.com/automatically-managing-data-pipeline-infrastructures-with-terraform-323fd1808a47" rel="noopener noreferrer"&gt;automatically managing data pipeline infrastructures with Terraform&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;GetInData | Part of Xebia publishes a weekly newsletter on LinkedIn called Data Pill. It hit its 50th edition this week, which it celebrated with a &lt;a href="https://www.linkedin.com/newsletters/data-pill-6944719603960840192/" rel="noopener noreferrer"&gt;compilation of the most popular case studies&lt;/a&gt; from previous editions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;a href="https://engineering.atspotify.com/podcasts/nerdout-at-spotify/" rel="noopener noreferrer"&gt;NerdOut@Spotify&lt;/a&gt; podcast is always a must-listen. It’s produced by the nerds at Spotify, and made for the nerds inside all of us. You get to hear from Spotify engineers about challenging tech problems and get a firsthand look into what they’re doing. The most recent episode is a fascinating look into building at scale.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Speaking of podcasts and Spotify, the R&amp;amp;D Engineering team recently blogged about the &lt;a href="https://engineering.atspotify.com/2023/04/large-scale-generation-of-ml-podcast-previews-at-spotify-with-google-dataflow/" rel="noopener noreferrer"&gt;generation of podcast previews using Google Dataflow&lt;/a&gt;. The result: a neat way of providing users with audio teasers so they can make listening decisions that aren’t based just on static content, such as cover art and descriptions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In last month’s virtual Kedro update meeting, we walked the community through the new OmegaConfigLoader, described user research and ongoing collaboration with Databricks, and discussed experiment tracking in Kedro-Viz. If you missed the session, you can catch up with a recording on the &lt;a href="https://www.youtube.com/watch?v=ACwLKx8TEXc" rel="noopener noreferrer"&gt;Kedro YouTube channel&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  That’s it for May 2023
&lt;/h2&gt;

&lt;p&gt;And that’s a wrap for this month. But if you can’t wait for next month’s &lt;em&gt;&lt;strong&gt;In the Pipeline&lt;/strong&gt;&lt;/em&gt; news, we also toot out regular updates onto Mastodon (&lt;a href="https://social.lfx.dev/@kedro" rel="noopener noreferrer"&gt;https://social.lfx.dev/@kedro&lt;/a&gt;) and across the popular channels of the &lt;a href="https://slack.kedro.org/" rel="noopener noreferrer"&gt;Slack community&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;Spoiler alert! Next month, we might unveil a fresh new look. But shh, let's keep it between us for now. Make sure to bookmark this blog or &lt;a href="https://kedro.org/blog/rss" rel="noopener noreferrer"&gt;add our RSS feed to your favorite reader&lt;/a&gt; to stay in the loop and join us in the first week of June for another update from the Kedro team.&lt;/p&gt;




</description>
      <category>kedro</category>
      <category>python</category>
      <category>datascience</category>
      <category>news</category>
    </item>
    <item>
      <title>Introducing your new team lead…Kedro</title>
      <dc:creator>Jo Stichbury</dc:creator>
      <pubDate>Wed, 19 Apr 2023 12:59:06 +0000</pubDate>
      <link>https://forem.com/kedro/introducing-your-new-team-leadkedro-nhl</link>
      <guid>https://forem.com/kedro/introducing-your-new-team-leadkedro-nhl</guid>
      <description>&lt;p&gt;This post explains how Kedro can guide an analytics team to follow best practices and avoid technical debt.&lt;/p&gt;

&lt;p&gt;In a recent article, I explained that &lt;a href="https://towardsdatascience.com/five-software-engineering-principles-for-collaborative-data-science-ab26667a311"&gt;following software principles can help you create a well-ordered analytics project&lt;/a&gt; to share, extend and reuse in the future. In this post we'll review how you can benefit from using Kedro as a toolbox to apply best practices to data science code.&lt;/p&gt;

&lt;h2&gt;
  
  
  How data science projects fail
&lt;/h2&gt;

&lt;p&gt;As data scientists, we aspire to unlock valuable insights by building&lt;br&gt;
well-engineered prototypes that we can take forward into production.&lt;br&gt;
Instead, there is a tendency for us to make poor engineering decisions&lt;br&gt;
in the face of tight deadlines or write code of dubious quality through&lt;br&gt;
a lack of expertise. &lt;/p&gt;

&lt;p&gt;The result is &lt;a href="https://www.splunk.com/en_us/data-insider/what-is-tech-debt.html"&gt;technical debt&lt;/a&gt; and prototype code that is difficult to understand,&lt;br&gt;
maintain, extend, and fix. Projects that once looked promising fail to transition past the experimental stage into production.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"A cycle of quick and exciting research leads to high expectations of&lt;br&gt;
great improvement, followed by a long series of delays and&lt;br&gt;
disappointments where frustrating integration work fails to recreate&lt;br&gt;
those elusive improvements, made all the worse by the feeling of sunk&lt;br&gt;
costs and a need to justify the time spent."&lt;/p&gt;

&lt;p&gt;Joe Plattenburg, Data Scientist at Root Insurance&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h2&gt;
  
  
  How to write well-engineered data science code
&lt;/h2&gt;

&lt;p&gt;When you start to cut code on a prototype, you may not prioritize&lt;br&gt;
maintainability and consistency. Adopting a team culture and way of&lt;br&gt;
working to minimize technical debt can make the difference between&lt;br&gt;
success and failure.&lt;/p&gt;

&lt;p&gt;Some of the most valuable techniques a data scientist can pick up are&lt;br&gt;
those that generations of software engineers already use, such as the&lt;br&gt;
following guidelines:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use a standard and logical project structure&lt;/strong&gt;: It is easier to&lt;br&gt;
understand a project, and share it with others, if you follow a standard&lt;br&gt;
structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don't use hardcoded values&lt;/strong&gt;: instead, use precisely named constants&lt;br&gt;
and put them all into a single configuration file so you can find and&lt;br&gt;
update them easily.&lt;/p&gt;
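&lt;p&gt;To illustrate the guideline (in plain Python, not Kedro's own configuration mechanism; the file name and keys below are invented for the example), the constants can live in one file and be read wherever they are needed:&lt;/p&gt;

```python
import json
from pathlib import Path

# All tunable values live in one place instead of being scattered
# through the code as magic numbers.
CONFIG_FILE = Path("parameters.json")
CONFIG_FILE.write_text(json.dumps({"test_size": 0.2, "random_state": 42}))

def load_parameters(path: Path = CONFIG_FILE) -> dict:
    """Read every named constant from the single configuration file."""
    return json.loads(path.read_text())

params = load_parameters()
print(params["test_size"])  # the value comes from config, not a literal
```

&lt;p&gt;Updating a value now means editing one file, with no hunting through source code.&lt;/p&gt;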

&lt;p&gt;&lt;strong&gt;Refactor your code&lt;/strong&gt;: In data science terms, it often makes sense to&lt;br&gt;
use a Jupyter notebook for experimentation. But once your experiment is&lt;br&gt;
done, it's time to clean up the code to remove elements that make it&lt;br&gt;
unmaintainable, and to remove accidental complexity. Refactor the code&lt;br&gt;
into Python functions and packages to form a pipeline that can be&lt;br&gt;
routinely tested to ensure repeatable behaviour.&lt;/p&gt;
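&lt;p&gt;A minimal sketch of this refactoring in plain Python (Kedro formalises the same idea with nodes and pipelines; the functions below are invented for illustration): each function does one task, and the pipeline is simply their composition, so every step can be tested in isolation.&lt;/p&gt;

```python
def clean(raw: list[dict]) -> list[dict]:
    """Drop records with missing values - one task, one function."""
    return [row for row in raw if all(v is not None for v in row.values())]

def add_total(rows: list[dict]) -> list[dict]:
    """Derive a feature from the cleaned data."""
    return [{**row, "total": row["price"] * row["quantity"]} for row in rows]

def pipeline(raw: list[dict]) -> list[dict]:
    """Compose the small steps into a repeatable, testable pipeline."""
    return add_total(clean(raw))

data = [{"price": 2.0, "quantity": 3}, {"price": None, "quantity": 1}]
print(pipeline(data))
```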

&lt;blockquote&gt;
&lt;p&gt;"Testing after each change means that when I make a mistake, I only&lt;br&gt;
have a small change to consider in order to spot the error, which&lt;br&gt;
makes it far easier to find and fix."&lt;/p&gt;

&lt;p&gt;Martin Fowler, Author of Refactoring: Improving the Design of Existing&lt;br&gt;
Code&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Make code reusable by making it readable&lt;/strong&gt;: Write your pipelines as a&lt;br&gt;
series of small functions that do just one task, with single return&lt;br&gt;
paths and a limited number of arguments.&lt;/p&gt;

&lt;p&gt;Many data scientists say they've learned from their colleagues through&lt;br&gt;
pair programming, code reviews and in-house mentoring that enables them&lt;br&gt;
to build expertise suitable to their roles and requirements.&lt;/p&gt;

&lt;p&gt;We see Kedro as the always-available team lead that steers the direction&lt;br&gt;
of the analytics project from the outset and encourages use of a&lt;br&gt;
well-organized folder structure, software design that supports regular&lt;br&gt;
testing, and a culture of writing readable, clean code.&lt;/p&gt;
&lt;h2&gt;
  
  
  What is Kedro?
&lt;/h2&gt;

&lt;p&gt;Kedro is an open-source toolbox for production-ready data science. The&lt;br&gt;
framework was born at QuantumBlack to solve the challenges faced&lt;br&gt;
regularly in data science projects and promote teamwork through&lt;br&gt;
standardised team workflows. It is now hosted by the &lt;a href="https://lfaidata.foundation/"&gt;LF AI &amp;amp; Data&lt;br&gt;
Foundation&lt;/a&gt; as an incubating project.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/yEQqf3XUvzk"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;h3&gt;
  
  
  Kedro = Consistent project structure
&lt;/h3&gt;

&lt;p&gt;Kedro is built on the learnings of &lt;a href="https://drivendata.github.io/cookiecutter-data-science/"&gt;Cookiecutter Data Science&lt;/a&gt;. It helps you to standardise how configuration, source&lt;br&gt;
code, tests, documentation, and notebooks are organised with an&lt;br&gt;
adaptable project template. If your team needs to build with multiple&lt;br&gt;
projects that have similar structure, you can also create your own&lt;br&gt;
Cookiecutter project templates with Kedro starters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kedro = Maintainable code
&lt;/h3&gt;

&lt;p&gt;Kedro helps you refactor your business logic and data processing into&lt;br&gt;
Python modules and packages to form pipelines, so you can keep your&lt;br&gt;
notebooks clean and tidy.&lt;br&gt;
&lt;a href="https://demo.kedro.org"&gt;Kedro-Viz&lt;/a&gt; then visualises the pipelines to help you navigate .&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"People started from scratch each time, the same pitfalls were&lt;br&gt;
experienced independently, reproducibility was time consuming and only&lt;br&gt;
members of the original project team really understood each&lt;br&gt;
codebase...&lt;/p&gt;

&lt;p&gt;We needed to enforce consistency and software engineering best&lt;br&gt;
practices across our own work. Kedro gave us the super-power to move&lt;br&gt;
people from project to project and it was game-changing. After working&lt;br&gt;
with Kedro once, you can land in another project and know how the&lt;br&gt;
codebase is structured, where everything is and most importantly how&lt;br&gt;
you can help".&lt;/p&gt;

&lt;p&gt;Joel Schwarzmann, Principal Product Manager, QuantumBlack Labs, &lt;a href="https://medium.com/towards-data-science/five-software-engineering-principles-for-collaborative-data-science-ab26667a311"&gt;blog&lt;br&gt;
post&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Kedro = Code quality
&lt;/h3&gt;

&lt;p&gt;Kedro makes it easy to avoid common code smells such as hard-coded&lt;br&gt;
constants and magic numbers. The configuration library enables your code&lt;br&gt;
to be reusable through data, model, and logging configuration. An&lt;br&gt;
ever-expanding data catalog supports multiple formats of data access.&lt;/p&gt;

&lt;p&gt;Kedro also makes it easy to keep your code quality up to standard, through&lt;br&gt;
support for black, isort, and flake8 for code linting and formatting,&lt;br&gt;
pytest for testing, and Sphinx for documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kedro = Standardisation
&lt;/h3&gt;

&lt;p&gt;Kedro integrates with standard data science tools, such as TensorFlow,&lt;br&gt;
scikit-learn, or Jupyter notebooks for experimentation, and commonly&lt;br&gt;
used routes to deployment such as Databricks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;Kedro is an open-source Python toolbox that makes it easier for a&lt;br&gt;
team to apply software engineering principles to data science code,&lt;br&gt;
which reduces the time spent rewriting data science experiments so&lt;br&gt;
that they are fit for production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When you follow established best practice, you have a better chance of&lt;br&gt;
success.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Software engineering principles only work if the entire team follows&lt;br&gt;
them. A tool like Kedro can guide you just like an experienced technical&lt;br&gt;
lead, making it second nature to use established best practices, and&lt;br&gt;
supporting a culture and set of processes based upon software&lt;br&gt;
engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Look forward to greater collaboration and productivity with Kedro in&lt;br&gt;
your team!&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Find out more about Kedro
&lt;/h2&gt;

&lt;p&gt;There are many ways to learn more about Kedro:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Join our &lt;a href="https://slack.kedro.org/"&gt;Slack organisation&lt;/a&gt; to reach out to us directly if you have a question or want to stay up to date with our news. There's an &lt;a href="https://www.linen.dev/s/kedro"&gt;archive of past conversations on Slack&lt;/a&gt; too.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.kedro.org/"&gt;Read our docs&lt;/a&gt; or look at the &lt;a href="https://github.com/kedro-org/kedro"&gt;Kedro source code on GitHub&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check out our "&lt;a href="https://www.youtube.com/watch?v=NU7LmDZGb6E"&gt;Crash course in Kedro&lt;/a&gt;" video on YouTube.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Look out for an upcoming training session tailored to help your team get&lt;br&gt;
on-board with Kedro.&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>kedro</category>
      <category>opensource</category>
    </item>
    <item>
      <title>How do data scientists combine Kedro and Databricks?</title>
      <dc:creator>Jo Stichbury</dc:creator>
      <pubDate>Wed, 19 Apr 2023 12:14:37 +0000</pubDate>
      <link>https://forem.com/kedro/how-do-data-scientists-combine-kedro-and-databricks-4pjd</link>
      <guid>https://forem.com/kedro/how-do-data-scientists-combine-kedro-and-databricks-4pjd</guid>
      <description>&lt;p&gt;In recent research, we found that Databricks is the dominant&lt;br&gt;
machine-learning platform used by Kedro users.&lt;/p&gt;

&lt;p&gt;The purpose of the research was to identify any barriers to using Kedro with Databricks; we are collaborating with the Databricks team to create a prioritised list of opportunities to facilitate integration. &lt;/p&gt;

&lt;p&gt;For example, Kedro is best used with an IDE, but IDE support on Databricks is still evolving, so we are keen to understand the pain points that Kedro users face when combining it with Databricks.&lt;/p&gt;

&lt;p&gt;Our research took qualitative data from 16 interviews, and quantitative data from a poll (140 participants) and a survey (46 participants) across the McKinsey and open-source Kedro user bases. We analysed two user journeys.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to ensure a Kedro pipeline is available in a Databricks workspace
&lt;/h2&gt;

&lt;p&gt;The first user journey we considered is how a user ensures the latest version of their pipeline codebase is available within the Databricks workspace. The most common workflow is to use Git, but almost a third of the users in our research set said there were a lot of steps to follow.&lt;/p&gt;

&lt;p&gt;The alternative workflow, which is to use dbx sync to Databricks repos, was used by less than 10% of the users we researched, indicating that awareness of this option is low.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ENvLTCgi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f0r395hhea9dlgh3enl8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ENvLTCgi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f0r395hhea9dlgh3enl8.png" alt="Slide from presentation about Kedro and Databricks research" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How to run Kedro pipelines using a Databricks cluster
&lt;/h2&gt;

&lt;p&gt;The second user journey is how users run Kedro pipelines using a Databricks cluster. The most popular method, used by over 80% of participants in our research, is to use a Databricks notebook, which serves as an entry point to run Kedro pipelines. &lt;/p&gt;

&lt;p&gt;We discovered that many users were unaware of the IPython extension that significantly reduces the amount of code required to run Kedro pipelines in Databricks notebooks.&lt;/p&gt;
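&lt;p&gt;For reference, the extension is loaded with a couple of lines at the top of a notebook (the project path below is a placeholder):&lt;/p&gt;

```text
%load_ext kedro.ipython
%reload_kedro path/to/your/kedro/project
```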

&lt;p&gt;We also found that some users run their Kedro pipelines by packaging them and running the resulting Python package on Databricks. However, Kedro did not support the packaging of configurations until version 0.18.5, which has caused problems. &lt;/p&gt;

&lt;p&gt;The final option some users select is to use Databricks Connect, but this is not recommended since it is soon&lt;br&gt;
to be sunsetted by Databricks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--uMr4-AR---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lrz30w22fdk36o1whzwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uMr4-AR---/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lrz30w22fdk36o1whzwn.png" alt="Slide from presentation about Kedro and Databricks research" width="800" height="349"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The output of our research
&lt;/h2&gt;

&lt;p&gt;To make it easier to pair Kedro and Databricks, we are updating Kedro's documentation to cover the latest Databricks features and tools, particularly the development and deployment workflows for Kedro on Databricks with dbx. The goal is to help Kedro users take advantage of the benefits of working locally in an IDE and still deploy to Databricks&lt;br&gt;
with ease.&lt;/p&gt;

&lt;p&gt;You can expect this new documentation to be released in the next one to two weeks.&lt;/p&gt;

&lt;p&gt;We will also be creating a Kedro Databricks plugin or starter project template to automate the recommended steps in the documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Coming soon...
&lt;/h2&gt;

&lt;p&gt;We have a managed Delta table dataset in our Kedro datasets repo, which will be available for public consumption soon. We are also planning to support managed MLflow on Databricks.&lt;/p&gt;

&lt;p&gt;We have set up a &lt;a href="https://github.com/kedro-org/kedro/milestone/17"&gt;milestone on GitHub&lt;/a&gt; so you can check in on our progress and contribute if you want to. To suggest features to us, report bugs, or just see what we're working on right now, visit the Kedro projects on &lt;a href="https://github.com/kedro-org"&gt;GitHub&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;We welcome every contribution, large or small.&lt;/p&gt;

&lt;h2&gt;
  
  
  Find out more about Kedro
&lt;/h2&gt;

&lt;p&gt;There are many ways to learn more about Kedro:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Join our &lt;a href="https://slack.kedro.org/"&gt;Slack organisation&lt;/a&gt; to reach out to us directly if you have a question or want to stay up to date with our news. There's an &lt;a href="https://www.linen.dev/s/kedro"&gt;archive of past conversations on Slack&lt;/a&gt; too.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://docs.kedro.org/"&gt;Read our docs&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Check out our "&lt;a href="https://www.youtube.com/watch?v=NU7LmDZGb6E"&gt;Crash course in Kedro&lt;/a&gt;" video on YouTube.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Look out for an upcoming training session tailored to help your team get on-board with Kedro.&lt;/p&gt;

</description>
      <category>python</category>
      <category>kedro</category>
      <category>databricks</category>
      <category>datascience</category>
    </item>
    <item>
      <title>A new home for the Kedro blog and some recent releases</title>
      <dc:creator>Jo Stichbury</dc:creator>
      <pubDate>Tue, 04 Apr 2023 14:51:47 +0000</pubDate>
      <link>https://forem.com/kedro/a-new-home-for-the-kedro-blog-and-some-recent-releases-mg6</link>
      <guid>https://forem.com/kedro/a-new-home-for-the-kedro-blog-and-some-recent-releases-mg6</guid>
      <description>&lt;p&gt;In this post, we describe recent releases to Kedro, Kedro-Viz and some new Kedro datasets. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--aYupLtp7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ofli6panzt85bk3zs6qs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--aYupLtp7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ofli6panzt85bk3zs6qs.png" alt="Image description" width="592" height="592"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Kedro has a new blog over at &lt;a href="https://kedro.org/blog"&gt;kedro.org/blog&lt;/a&gt;! &lt;/p&gt;

&lt;p&gt;We’ve previously published on &lt;a href="https://medium.com/quantumblack" rel="noreferrer"&gt;QuantumBlack’s Medium channel&lt;/a&gt;, but recent updates and improvements here on the Kedro website mean that we’re now able to bring you a dedicated blog for, and about, the open-source Kedro community. &lt;/p&gt;
&lt;p&gt;We plan to publish a range of articles by contributors from within the team and beyond. If you’re a Kedroid with an idea for a post, please reach out to us using one of the channels on the &lt;a href="https://slack.kedro.org/" rel="noreferrer"&gt;Slack organisation&lt;/a&gt;, or &lt;a href="https://github.com/kedro-org/kedro-devrel/issues?q=is%3Aissue+is%3Aopen+label%3A%22blog+post%22" rel="noreferrer"&gt;raise an issue on GitHub&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id="kedro-releases"&gt;Kedro releases&lt;/h2&gt;
&lt;p&gt;We last gave an update on Kedro in late 2022, when &lt;a href="https://medium.com/quantumblack/the-latest-kedro-developments-9a4d15a7ceb5" rel="noreferrer"&gt;we described the features in Kedro version 0.18.4&lt;/a&gt;. Since then, we’ve released three additional non-breaking versions of Kedro in the 0.18.x series, with the goal of a regular release cadence at the end of most two-week development sprints.&lt;/p&gt;
&lt;p&gt;Some of the highlights of our releases are described below along with links to the full release notes. For each of these releases there’s a straightforward upgrade path with pip or conda. For example, to upgrade to Kedro version 0.18.7 from version 0.18.4:&lt;/p&gt;
&lt;p&gt;&lt;code&gt;pip install kedro==0.18.7&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;or&lt;/p&gt;
&lt;p&gt;&lt;code&gt;conda install -c conda-forge kedro==0.18.7&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;&lt;b&gt;We received many contributions to these new versions from our open-source community and want to thank every contributor for taking the time to extend and improve Kedro.&lt;/b&gt;&lt;/p&gt;
&lt;h3&gt;Kedro version 0.18.7 &lt;/h3&gt;
&lt;p&gt;These are the headline changes (You can find all the &lt;a href="https://github.com/kedro-org/kedro/releases/tag/0.18.7" rel="noreferrer"&gt;details about the Kedro 0.18.7 release&lt;/a&gt; on GitHub):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We added new Kedro CLI command &lt;code&gt;kedro jupyter setup&lt;/code&gt; to set up a Jupyter Kernel for Kedro that automatically loads the Kedro extension for ease of use.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;code&gt;kedro package&lt;/code&gt; command now includes the project configuration in a compressed &lt;code&gt;tar.gz&lt;/code&gt; file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We’ve added functionality to package and read your configuration as a compressed file. You can now use &lt;code&gt;OmegaConfigLoader&lt;/code&gt; to load configuration from compressed files of zip or tar format. (This feature requires &lt;code&gt;fsspec&amp;gt;=2023.1.0&lt;/code&gt;).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In documentation news, we moved seamlessly from &lt;code&gt;kedro.readthedocs.io&lt;/code&gt; to &lt;a href="https://docs.kedro.org/" rel="noreferrer"&gt;docs.kedro.org&lt;/a&gt; in this release. We also made some significant improvements to on-boarding documentation that covers setup for new Kedro users and major changes to the spaceflights tutorial to make it faster to work through. We think it’s a better read. &lt;a href="https://github.com/kedro-org/kedro/issues/new/choose" rel="noreferrer"&gt;Tell us if it’s not&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
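&lt;p&gt;The compressed-configuration mechanism can be pictured with the standard library alone (an illustrative sketch, not Kedro's actual implementation; file names are invented): configuration is bundled into a &lt;code&gt;tar.gz&lt;/code&gt; archive and read back without unpacking it to disk.&lt;/p&gt;

```python
import json
import tarfile
from pathlib import Path

# Write a config file, then bundle it into a compressed archive.
Path("conf").mkdir(exist_ok=True)
Path("conf/parameters.json").write_text(json.dumps({"learning_rate": 0.01}))

with tarfile.open("conf.tar.gz", "w:gz") as archive:
    archive.add("conf/parameters.json")

# Read the configuration straight out of the compressed archive,
# without extracting anything to disk.
with tarfile.open("conf.tar.gz", "r:gz") as archive:
    params = json.load(archive.extractfile("conf/parameters.json"))

print(params["learning_rate"])
```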
&lt;h3&gt;Kedro version 0.18.6&lt;/h3&gt;
&lt;p&gt;This was a small release to fix a bug introduced in Kedro 0.18.5 that was causing experiment tracking in Kedro-Viz to fail. You can find all the &lt;a href="https://github.com/kedro-org/kedro/releases/tag/0.18.6" rel="noreferrer"&gt;details about the release of Kedro version 0.18.6&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;h3&gt;Kedro version 0.18.5&lt;/h3&gt;
&lt;p&gt;In February 2023, we released Kedro version 0.18.5, which introduced a brand new config loader powered by &lt;a href="https://omegaconf.readthedocs.io/en/2.3_branch/" rel="noreferrer"&gt;OmegaConf&lt;/a&gt;. You can now use the &lt;code&gt;omegaconf&lt;/code&gt; syntax with &lt;code&gt;kedro run --params&lt;/code&gt;. &lt;/p&gt;
&lt;p&gt;We also added the following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Some improvements to the &lt;code&gt;kedro run&lt;/code&gt; command used in the CLI. One change makes it more consistent: the flags &lt;code&gt;--node&lt;/code&gt;, &lt;code&gt;--tag&lt;/code&gt;, and &lt;code&gt;--load-version&lt;/code&gt; are deprecated in favour of their plural equivalents (&lt;code&gt;--nodes&lt;/code&gt;, &lt;code&gt;--tags&lt;/code&gt;, and &lt;code&gt;--load-versions&lt;/code&gt;) and will be removed in Kedro 0.19.0. Another change means that you can filter and run nodes by node namespace using the &lt;code&gt;--namespace&lt;/code&gt; flag with &lt;code&gt;kedro run&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;There is now support for using generator functions as nodes, i.e. using &lt;code&gt;yield&lt;/code&gt; instead of &lt;code&gt;return&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We added a new &lt;code&gt;node&lt;/code&gt; argument to all four dataset hooks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
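&lt;p&gt;The difference between the two node styles is easiest to see in plain Python (a sketch of the concept, not a full Kedro node): a generator yields results one at a time, so downstream steps can consume them as they are produced instead of waiting for the whole list.&lt;/p&gt;

```python
def batch_scores(values):
    """Generator style: yield one processed item at a time."""
    for v in values:
        yield v * 2  # each result is available before the rest are computed

def all_scores(values):
    """Conventional style: build the full result, then return it."""
    return [v * 2 for v in values]

stream = batch_scores([1, 2, 3])
print(next(stream))           # first result, produced lazily
print(all_scores([1, 2, 3]))  # entire result, produced eagerly
```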
&lt;p&gt;You can find all the &lt;a href="https://github.com/kedro-org/kedro/releases/tag/0.18.5" rel="noreferrer"&gt;details about the Kedro version 0.18.5 release&lt;/a&gt; on GitHub.&lt;/p&gt;
&lt;h2 id="kedro-datasets-releases-"&gt;Kedro datasets releases &lt;/h2&gt;
&lt;p&gt;Kedro provides numerous built-in datasets for various file types and file systems, to save you from having to write the logic for reading or writing data, including Pandas, Spark, Dask, NetworkX, Pickle, and more.&lt;/p&gt;
&lt;p&gt;There have been several datasets contributed by community members over the past months which include the addition of:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;snowflake.SnowparkTableDataSet&lt;/code&gt; by &lt;a href="https://github.com/Vladimir-Filimonov" rel="noreferrer"&gt;Vladimir Filimonov&lt;/a&gt; and &lt;a href="https://github.com/heber-urdaneta" rel="noreferrer"&gt;Heber Urdaneta&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;polars.CSVDataSet&lt;/code&gt; by &lt;a href="https://github.com/wmoreiraa" rel="noreferrer"&gt;Walber Moreira&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
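&lt;p&gt;All of these datasets share the same shape: a class that hides I/O details behind load and save methods. A toy version in plain Python (purely illustrative; real Kedro datasets also handle versioning, credentials, and many more formats) looks like this:&lt;/p&gt;

```python
import csv
from pathlib import Path

class SimpleCSVDataSet:
    """Toy dataset with the same load/save shape as Kedro's built-ins."""

    def __init__(self, filepath: str):
        self._filepath = Path(filepath)

    def save(self, rows: list[dict]) -> None:
        # Persist the data; callers never touch file handles directly.
        with self._filepath.open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=rows[0].keys())
            writer.writeheader()
            writer.writerows(rows)

    def load(self) -> list[dict]:
        # Read the data back as a list of dicts.
        with self._filepath.open(newline="") as f:
            return list(csv.DictReader(f))

dataset = SimpleCSVDataSet("scores.csv")
dataset.save([{"name": "a", "score": "1"}])
print(dataset.load())
```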
&lt;p&gt;As we mentioned in “&lt;a href="https://medium.com/quantumblack/keeping-up-with-kedro-the-latest-developments-in-our-development-workflow-framework-cbcc415eea9c" rel="noreferrer"&gt;Keeping up with Kedro&lt;/a&gt;”, Kedro version 0.19.0 will move Kedro’s datasets from the main framework project into a separate package called Kedro-Datasets.&lt;/p&gt;
&lt;h2 id="kedro-viz-releases"&gt;Kedro-Viz releases&lt;/h2&gt;
&lt;p&gt;If you've not yet used it, Kedro-Viz is the interactive development tool for building data science pipelines with Kedro. It comes with an experiment tracking feature enabling you to view and compare different runs of your Kedro project. Check out the &lt;a href="http://demo.kedro.org/" rel="noreferrer"&gt;Kedro-Viz demo at demo.kedro.org&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;We’ve made three releases of Kedro-Viz this year, plus a patch release. You can find further &lt;a href="https://github.com/kedro-org/kedro-viz/releases" rel="noreferrer"&gt;details of the Kedro-Viz releases on GitHub&lt;/a&gt;. &lt;/p&gt;
&lt;p&gt;To get the latest release of Kedro-Viz, you can use pip: &lt;/p&gt;
&lt;p&gt;&lt;code&gt;pip install kedro-viz==6.0.0&lt;/code&gt; &lt;/p&gt;
&lt;p&gt;or npm &lt;/p&gt;
&lt;p&gt;&lt;code&gt;npm install @quantumblack/kedro-viz@latest&lt;/code&gt;&lt;/p&gt;
&lt;p&gt;Here’s a summary of what we’ve been working on:&lt;/p&gt;
&lt;h3&gt;Kedro-Viz version 6.0.0&lt;/h3&gt;
&lt;p&gt;In this release we bumped the major version to 6.0.0 because of a change in the frontend React code (we bumped the minimum version of React from 16.8.6 to 17.0.2). Additional changes include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We added a change so you can now see a preview of your data in the metadata panel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can remove metrics plots from the metadata panel and add links to the plots in experiment tracking. &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;You can also link plot and JSON dataset names from experiment tracking to the flowchart.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Kedro-Viz no longer depends on pandas or Plotly.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Kedro-Viz versions 5.3.0 and 5.2.0&lt;/h3&gt;
&lt;p&gt;We introduced a raft of updates to experiment tracking, the largest being the addition of time series &amp;amp; parallel coordinates metrics plots and delta values.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We’ve enabled the display of json objects with &lt;code&gt;react-json-viewer&lt;/code&gt; in experiment tracking.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We added a feature to show/hide modular pipelines on the pipeline flowchart.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;It’s now possible to retrieve and share URL parameters for each element/section in the flowchart.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We've recently published a &lt;a href="https://kedro.org/blog/experiment-tracking-with-kedro" rel="noreferrer"&gt;blog post about experiment tracking&lt;/a&gt; to highlight the latest features and discuss what is coming next.&lt;/p&gt;
&lt;h2 id="whats-next-for-the-kedro-projects"&gt;What's next for the Kedro projects?&lt;/h2&gt;
&lt;p&gt;We have a broad range of &lt;a href="https://github.com/kedro-org/kedro/milestones" rel="noreferrer"&gt;milestones for the Kedro framework&lt;/a&gt; that cover areas such as integration with Databricks, enhancements for Jupyter Notebook users and ongoing changes such as the &lt;a href="https://medium.com/quantumblack/the-latest-kedro-developments-9a4d15a7ceb5" rel="noreferrer"&gt;transition of datasets into their own package&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;On the to-do list for Kedro-Viz, we’ve included enhanced navigation between flowchart and experiment tracking and collaboration features within Kedro-Viz.&lt;/p&gt;
&lt;p&gt;Stand by for a pair of virtual Kedro showcases on 5th April 2023 (&lt;a href="https://www.meetup.com/meetup-group-zltyafrj/events/292529526/" rel="noreferrer"&gt;9am BST&lt;/a&gt; and &lt;a href="https://www.meetup.com/meetup-group-zltyafrj/events/292533186/" rel="noreferrer"&gt;4pm BST&lt;/a&gt;) to demonstrate some of the features added in the recent releases to the global community. &lt;/p&gt;
&lt;p&gt;To suggest features to us, report bugs, or just see what we’re working on right now, visit the Kedro projects on &lt;a href="https://github.com/kedro-org" rel="noreferrer"&gt;GitHub&lt;/a&gt;. We welcome every contribution, large or small.&lt;/p&gt;

</description>
      <category>python</category>
      <category>kedro</category>
      <category>datascience</category>
      <category>news</category>
    </item>
  </channel>
</rss>
